Chan Zuckerberg Initiative
A scalable latent diffusion model for generating realistic single-cell gene expression profiles, using a permutation-invariant VAE and flow-matching diffusion transformer.
scLDM (single-cell Latent Diffusion Model) is a generative model for single-cell RNA sequencing data developed by Giovanni Palla, Sudarshan Babu, Payam Dibaeinia, James D. Pearce, Donghui Li, Aly A. Khan, Theofanis Karaletsos, Jakub M. Tomczak, and collaborators at CZ Biohub and the Chan Zuckerberg Initiative. The model was released as a preprint in November 2025 (arXiv:2511.02986) and is hosted on CZI's Virtual Cells Platform. scLDM addresses one of the persistent challenges in single-cell computational biology: generating synthetic single-cell gene expression profiles that are both statistically realistic and biologically interpretable.
The fundamental obstacle to high-fidelity single-cell generation is the nature of the data itself. Single-cell RNA-seq profiles are high-dimensional count vectors — typically tens of thousands of gene measurements per cell — with strong sparsity, overdispersion, and complex gene-gene dependencies. Prior generative models, including variational autoencoders (scVI, scVAE) and conditional diffusion models (scDiffusion), have made significant progress but often impose artificial gene orderings, rely on shallow architectures, or fail to capture the exchangeability structure of gene expression: unlike pixels in an image, genes have no intrinsic spatial order, and a cell's identity is determined by which genes are expressed, not by the position those genes happen to occupy in a matrix row.
scLDM resolves this by combining two purpose-built components: a permutation-invariant variational autoencoder that compresses gene expression profiles into compact latent representations while respecting the orderless nature of gene data, and a latent diffusion model based on Diffusion Transformers and flow matching that generates diverse, biologically coherent latent codes conditioned on metadata such as tissue type, cell type, and experimental perturbation. The result is a model that can simulate both observational single-cell transcriptomics and counterfactual perturbation responses with substantially improved fidelity over previous approaches.
scLDM is a two-stage generative model. In the first stage, a variational autoencoder compresses high-dimensional single-cell count matrices into fixed-size continuous latent representations. The encoder applies a multi-head cross-attention block (MCAB) that attends over (gene, expression) token pairs without assuming any positional structure, pooling them into a permutation-invariant latent code. The decoder applies the inverse operation (permutation-equivariant unpooling) to reconstruct per-gene expression values. The VAE loss combines a count-based reconstruction term suited to sparse, overdispersed RNA-seq counts with a KL-divergence regularizer.
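The key property of cross-attention pooling is that the output does not depend on the order in which genes are presented. The following minimal NumPy sketch illustrates this mechanism with a single attention head and a toy (gene, expression) embedding; all names and shapes here are illustrative assumptions, not the actual scLDM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(queries, tokens):
    """Pool a variable-size set of token embeddings into a fixed number of
    latent slots: learned queries attend over all tokens, and the weighted
    sum over tokens makes the result independent of token order.

    queries: (m, d) learned latent queries; tokens: (n, d) per-gene tokens.
    Returns (m, d).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)  # (m, n)
    return attn @ tokens                                       # (m, d)

# Toy setup: 5 "genes", each token combining a gene embedding with its
# (hypothetical) expression count -- a stand-in for a real embedding layer.
n_genes, d, m_latents = 5, 8, 2
gene_emb = rng.normal(size=(n_genes, d))
expr = rng.poisson(3.0, size=(n_genes, 1)).astype(float)
tokens = gene_emb + expr
queries = rng.normal(size=(m_latents, d))  # learned latent queries

pooled = cross_attention_pool(queries, tokens)

# Shuffling the gene order leaves the pooled latent unchanged.
perm = rng.permutation(n_genes)
pooled_perm = cross_attention_pool(queries, tokens[perm])
print(np.allclose(pooled, pooled_perm))  # True
```

Because attention weights and the value sum are both computed over the full token set, permuting the rows of `tokens` only permutes the summation order, leaving each pooled slot identical.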
In the second stage, a Diffusion Transformer (DiT) learns to generate the latent codes produced by the VAE encoder, using flow matching with linear interpolants as the training objective. Flow matching frames generation as learning a velocity field that maps a simple noise distribution to the data distribution along straight-line paths, providing a more efficient training signal than noise-prediction-based diffusion objectives. Conditional generation is achieved through classifier-free guidance, where the DiT is trained both with and without covariate conditioning and the conditioning signal is amplified at inference time.
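The flow-matching objective and classifier-free guidance described above can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions: the linear interpolant and guidance formula follow the standard definitions, but the velocity "model" here is a hand-written stand-in, not a trained Diffusion Transformer.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# --- Training signal: linear-interpolant flow matching --------------------
# Interpolate noise x0 ~ N(0, I) toward a data latent x1 along a straight
# line; the regression target for the velocity network is simply x1 - x0.
x0 = rng.normal(size=d)            # noise sample
x1 = rng.normal(size=d) + 2.0      # stand-in for a VAE latent code
t = rng.uniform()
x_t = (1.0 - t) * x0 + t * x1      # point on the straight-line path
v_target = x1 - x0                 # constant velocity along that path

# A real model minimizes ||v_theta(x_t, t, cond) - v_target||^2; here we
# just evaluate that loss for a trivial zero predictor.
loss = np.mean((np.zeros(d) - v_target) ** 2)

# --- Sampling: classifier-free guidance + Euler integration ---------------
def v_theta(x, t, cond):
    """Toy stand-in velocity field that drifts toward a conditioning-
    dependent target (scLDM would use its DiT here)."""
    target = np.full_like(x, 2.0 if cond is not None else 0.0)
    return target - x

def sample(cond, guidance=2.0, steps=100):
    x = rng.normal(size=d)          # start from noise
    for i in range(steps):
        t_i = i / steps
        v_c = v_theta(x, t_i, cond)      # conditional velocity
        v_u = v_theta(x, t_i, None)      # unconditional velocity
        v = v_u + guidance * (v_c - v_u) # amplify the conditioning signal
        x = x + v / steps                # Euler step along the learned flow
    return x

x_gen = sample(cond="cell_type_A")  # hypothetical covariate label
```

With `guidance > 1`, the sampler overweights the conditional velocity relative to the unconditional one, which is what "amplifying the conditioning signal at inference time" means in practice; `guidance = 1` recovers purely conditional sampling.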
Benchmark evaluations span both reconstruction and generation tasks. For cell reconstruction, scLDM achieves Pearson correlations with ground truth that are up to four times higher than baseline models, with the Human Lung Cell Atlas showing particularly strong improvements due to its high cell-type diversity. For unconditional and conditional generation, scLDM surpasses all publicly available single-cell generative models across multiple metrics, producing profiles with more realistic marginal distributions, more faithful gene-gene covariance structure, and better performance when used as training data for downstream cell-type classifiers. Validation datasets include Parse 1M (over 1.2 million immune cells exposed to 90 cytokines) and the Replogle CRISPR screen dataset (genome-scale knockouts across multiple cell lines).
scLDM serves researchers in single-cell genomics who need to generate synthetic training data, augment small experimental datasets, simulate counterfactual perturbation responses, or explore the gene expression space of a biological system in silico. In drug discovery and functional genomics, perturbation-conditioned generation allows researchers to predict the transcriptomic consequences of genetic knockouts or chemical treatments before running expensive experiments, supporting computational prioritization of candidate targets. In machine learning contexts, scLDM-generated synthetic cells can augment real datasets to improve the robustness of downstream classifiers — particularly in disease settings where labeled samples from rare cell populations are limited. The model's demonstrated performance on COVID-infected and liver cancer cell classification tasks establishes its practical utility beyond simple data augmentation, approaching the performance of specialized supervised models.
scLDM establishes a new state of the art for generative modeling of single-cell transcriptomics, achieving substantial improvements in reconstruction fidelity and generation quality over prior VAE-based and diffusion-based approaches. Its permutation-invariant design principle is a principled architectural contribution that other single-cell deep learning models could adopt to better respect the biological structure of gene expression data. The integration of flow matching into the latent diffusion framework provides an efficient training objective that is more computationally tractable than standard denoising diffusion for this data modality. As the base model underlying the specialized scLDM.CD4 fine-tune (which targets counterfactual perturbation in CD4+ T cells), scLDM represents a foundational component of CZI's generative virtual cell infrastructure. Ongoing limitations include the challenge of generalizing perturbation predictions to held-out genetic interventions not seen during training and the computational requirements of training the two-stage architecture on large cell atlases.