bio.rodeo


© 2026 bio.rodeo. All rights reserved.
Single-cell

scLDM.CD4

Chan Zuckerberg Initiative

A fine-tuned scLDM variant trained on 14.5 million CD4+ T cells for counterfactual prediction of single-gene perturbation effects in immune cells.

Released: 2025

Overview

scLDM.CD4 is a specialized generative model for simulating the transcriptomic effects of single-gene perturbations in CD4+ T cells, developed at the Chan Zuckerberg Initiative as part of the Virtual Cells Platform. It represents a domain-specific fine-tune of scLDM (Palla et al., arXiv:2511.02986), the base scalable latent diffusion model for single-cell gene expression generation. While scLDM provides a general-purpose generative framework, scLDM.CD4 concentrates that capacity on one of the most immunologically important and therapeutically relevant cell types in the human body, enabling high-precision simulation of genetic interventions in naïve CD4+ T cells.

CD4+ helper T cells are central regulators of both humoral and cellular immune responses. Their differentiation into distinct effector and regulatory subsets — including Th1, Th2, Th17, and Treg lineages — is governed by complex gene regulatory networks that are frequent targets of therapeutic intervention in autoimmunity, chronic infection, and cancer immunotherapy. Understanding how disrupting individual genes affects CD4+ T cell fate is a critical step in identifying drug targets, but systematic Perturb-seq screens across thousands of genetic interventions are expensive and time-consuming. scLDM.CD4 addresses this bottleneck by allowing researchers to predict the transcriptomic consequences of knockdowns in silico, at a scale and speed that complements experimental screens.

The model was released as version 0.1 on CZI's Virtual Cells Platform alongside the base scLDM model, as part of a broader ecosystem intended to provide researchers with accessible, specialized generative tools for distinct biological contexts.

Key Features

  • Massive Perturb-seq training set: scLDM.CD4 was trained on approximately 14.5 million CD4+ T cells from a large-scale single-gene knockdown Perturb-seq dataset. This training corpus covers hundreds of genetic perturbations at high cell-number coverage per condition, providing the model with detailed information about how individual gene losses propagate through the CD4+ transcriptional network.
  • Counterfactual perturbation generation: The model can generate synthetic CD4+ T cell profiles conditioned on the identity of a knocked-down gene, enabling in silico prediction of how depleting a specific gene will shift the cell's transcriptional state. Researchers can then use these predictions to rank perturbations by their predicted effect on target gene expression programs.
  • Two-component architecture: Like the base scLDM, scLDM.CD4 consists of a transformer-based variational autoencoder (using Multi-head Cross-Attention Blocks for permutation-invariant gene encoding) and a conditional Diffusion Transformer that generates latent cell profiles using flow matching. The CD4-specific fine-tuning adapts both components to the cell-type-specific expression landscape and perturbation response patterns of CD4+ T cells.
  • In silico perturbation ranking: The model supports computational prioritization of candidate genetic interventions toward a desired transcriptomic phenotype in naïve CD4+ T cells, enabling data-driven selection of perturbation targets before costly experimental validation.
  • Transferable representations: Beyond direct generation, embeddings from the scLDM.CD4 encoder serve as learned representations of CD4+ T cell states that can be applied to downstream tasks including cell-state classification, data augmentation for predictive models, and evaluation of experimental perturbation datasets.
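The permutation-invariant encoding described above can be illustrated with a toy sketch. This is not the scLDM implementation; it is a minimal numpy example of cross-attention pooling, where learned query vectors attend over a set of gene tokens, so the pooled latent is the same regardless of gene order. All dimensions and weight matrices here are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(tokens, queries, Wk, Wv):
    """Pool a variable-size set of gene tokens into len(queries) latent slots.

    tokens:  (n_genes, d) gene-expression token embeddings
    queries: (n_latent, d) learned latent queries
    """
    K = tokens @ Wk  # (n_genes, d) keys
    V = tokens @ Wv  # (n_genes, d) values
    attn = softmax(queries @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (n_latent, n_genes)
    return attn @ V  # (n_latent, d); summing over genes makes order irrelevant

rng = np.random.default_rng(0)
d, n_genes, n_latent = 16, 100, 4
tokens = rng.normal(size=(n_genes, d))
queries = rng.normal(size=(n_latent, d))
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

pooled = cross_attention_pool(tokens, queries, Wk, Wv)
perm = rng.permutation(n_genes)
pooled_perm = cross_attention_pool(tokens[perm], queries, Wk, Wv)
assert np.allclose(pooled, pooled_perm)  # same latent under any gene ordering
```

The invariance check at the end is the property that matters for single-cell data, where genes have no canonical ordering.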

Technical Details

scLDM.CD4 shares its architectural design with the base scLDM: a two-stage generative model combining a permutation-invariant variational autoencoder with a Diffusion Transformer trained via flow matching. The VAE encoder compresses single-cell RNA-seq count profiles into fixed-size latent representations using Multi-head Cross-Attention Blocks (MCAB), which perform permutation-invariant pooling over gene-expression token pairs. The VAE decoder applies permutation-equivariant unpooling to reconstruct expression values. The Diffusion Transformer then learns to sample from the latent distribution, trained with a flow-matching objective over linear interpolants, with multi-conditional classifier-free guidance enabling conditioning on cell-context attributes and perturbation identity.
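The flow-matching recipe with linear interpolants can be reduced to a few lines. The sketch below is a generic illustration of the training objective and sampling loop, not scLDM code: it forms the interpolant x_t = (1 - t)·x0 + t·x1 between noise and a data latent, whose regression target is the constant velocity x1 - x0, and then integrates that velocity back from noise with Euler steps. An oracle velocity stands in for the trained network.

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_pair(x0, x1, t):
    """Build one flow-matching training example with a linear interpolant.

    The model's velocity prediction v(x_t, t) is regressed onto x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

x0 = rng.normal(size=8)  # noise sample
x1 = rng.normal(size=8)  # "data" latent (stand-in for a VAE latent)
x_t, v_target = flow_matching_pair(x0, x1, 0.3)

# Sampling inverts the process: start at noise and integrate dx/dt = v.
# With the exact (oracle) velocity the trajectory is a straight line to x1.
x = x0.copy()
n_steps = 10
for _ in range(n_steps):
    x = x + (1.0 / n_steps) * (x1 - x0)  # oracle velocity replaces the network
assert np.allclose(x, x1)
```

In the real model the velocity is predicted by the conditional Diffusion Transformer rather than known in closed form, but the straight-line interpolant target is the same.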

The training data for scLDM.CD4 consists of a pre-processed CD4+ T cell Perturb-seq dataset comprising approximately 14.5 million cells subjected to individual single-gene knockdowns. This scale substantially exceeds typical Perturb-seq datasets and provides dense coverage of the CD4+ perturbation landscape across hundreds of genes of immunological interest. The conditioning mechanism encodes perturbation identity as a categorical covariate passed to the Diffusion Transformer via the classifier-free guidance framework, allowing the model to generate cells from both the perturbed and unperturbed distributions and to compute the predicted transcriptomic shift induced by each knockdown. No benchmark figures for scLDM.CD4 have been reported separately from the base scLDM evaluation at the time of this writing, as the model was released at version 0.1 alongside the base model preprint.
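Two of the mechanics just described, classifier-free guidance and the perturbed-minus-control shift, are simple enough to sketch directly. The snippet below is an illustration under assumed interfaces, not the platform's API: `v_cond` and `v_uncond` stand in for the Diffusion Transformer evaluated with and without the perturbation label, and the "cells" are synthetic placeholder arrays.

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, guidance_scale):
    """Classifier-free guidance: interpolate/extrapolate between the
    unconditional and condition-specific velocity estimates."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def predicted_shift(perturbed_cells, control_cells):
    """Predicted transcriptomic shift of a knockdown: difference between
    the mean expression of generated perturbed and control cells."""
    return perturbed_cells.mean(axis=0) - control_cells.mean(axis=0)

rng = np.random.default_rng(2)
v_u = rng.normal(size=8)
v_c = rng.normal(size=8)
assert np.allclose(guided_velocity(v_u, v_c, 0.0), v_u)  # scale 0: unconditional
assert np.allclose(guided_velocity(v_u, v_c, 1.0), v_c)  # scale 1: conditional

control = rng.normal(loc=0.0, size=(500, 8))    # placeholder generated cells
perturbed = rng.normal(loc=0.5, size=(500, 8))  # placeholder perturbed cells
shift = predicted_shift(perturbed, control)
assert shift.shape == (8,)
```

Guidance scales above 1 sharpen the conditional signal at some cost in sample diversity, a standard trade-off in classifier-free guidance.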

Applications

scLDM.CD4 is targeted at researchers working in T cell immunology, autoimmunity, and cancer immunotherapy who need to evaluate the potential transcriptomic consequences of genetic interventions in CD4+ T cells. In drug target discovery, the model enables computational ranking of hundreds or thousands of candidate gene knockdowns by their predicted effect on transcriptional programs associated with desired immune outcomes — such as suppressing inflammatory cytokine production or promoting regulatory T cell differentiation — before experimental validation of the top candidates. The model also supports data augmentation for downstream predictive tasks where labeled CD4+ T cell perturbation data is limited. Researchers can use scLDM.CD4-generated profiles to enrich training datasets for classifiers predicting perturbation outcomes, cell-state transitions, or drug response phenotypes. The model is accessible through the CZI Virtual Cells Platform, making it usable without requiring specialized infrastructure.
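The ranking workflow described above can be sketched as a simple similarity search: score each candidate knockdown's predicted expression shift against a desired transcriptomic signature and sort. The gene names, signature, and shift vectors below are entirely made up for illustration; a real workflow would use shifts generated by the model and a signature derived from the immune outcome of interest.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_perturbations(shifts, desired_signature):
    """Rank knockdowns by cosine similarity of their predicted expression
    shift to a desired signature.

    shifts: dict mapping gene name -> predicted shift vector
    """
    scored = {gene: cosine(s, desired_signature) for gene, s in shifts.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical goal: strongly suppress the program captured by the first axis.
desired = np.array([-1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
shifts = {
    "GENE_A": np.array([-0.9, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),  # matches goal
    "GENE_B": np.array([0.0, 0.8, -0.3, 0.0, 0.0, 0.0, 0.0, 0.0]),   # unrelated effect
    "GENE_C": np.array([0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]),    # opposite effect
}
ranking = rank_perturbations(shifts, desired)
assert ranking == ["GENE_A", "GENE_B", "GENE_C"]
```

Top-ranked candidates from such a screen would then go forward to experimental validation, which is the division of labor the section describes.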

Impact

scLDM.CD4 represents an early example of a domain-specific generative virtual cell model: a general-purpose foundation (scLDM) adapted to a specific, therapeutically important cell type with a specialized large-scale training corpus. This specialization matters because the biology of CD4+ T cells — including their epigenetic state, lineage-specific transcription factor networks, and cytokine response programs — differs substantially from bulk or pan-cell-type training data, and domain-specific fine-tuning captures nuances that a general model may average away. The training set of 14.5 million perturbed cells is one of the largest single-cell perturbation corpora used for generative model training, providing coverage of the CD4+ perturbation landscape that complements experimental Perturb-seq resources. As the Virtual Cells Platform matures, scLDM.CD4 is positioned as one component of a growing library of cell-type-specific and context-specific generative models designed to make large-scale in silico perturbation studies tractable for the broader research community. Current limitations include the restriction to single-gene knockdown perturbations in naïve CD4+ T cells, and the model's early-stage v0.1 status, which means validation across diverse experimental contexts is ongoing.

Tags

perturbation prediction, single-cell generation, drug target prioritization, diffusion, transformer, autoencoder, generative, transfer learning, transcriptomics, immunology, cell biology

Resources

Official Website