scDiffusion

Diffusion model for synthesizing single-cell RNA-seq data, with guided generation of specific cell types, rare cells, and developmental trajectories.

Released: January 2024

scDiffusion is a conditional generative model developed at Tsinghua University that synthesizes realistic single-cell RNA sequencing (scRNA-seq) data using a diffusion-based framework. It addresses a persistent bottleneck in single-cell genomics: the scarcity of high-quality training data for rare cell types and undersampled conditions. By combining a pre-trained foundation model encoder with a guided diffusion process, scDiffusion produces synthetic cells that accurately reflect the statistical and biological properties of real transcriptomes.

The model supports multi-condition generation, allowing researchers to specify biological constraints — such as organ type or cell type — simultaneously during sampling. A key innovation is the Gradient Interpolation strategy, which blends classifier guidance signals from two distinct conditions to generate intermediate cell states. This enables reconstruction of continuous developmental trajectories without requiring experimental data for those intermediate states.

The paper was published in Bioinformatics in 2024. It reports a Spearman Correlation Coefficient (SCC) of 0.984 and a Maximum Mean Discrepancy (MMD) of 0.018 between generated cells and held-out real data. Cell type classification of synthetic outputs using CellTypist reached 93% accuracy, which supports the biological plausibility of the generated transcriptomes.
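To make the two fidelity metrics concrete, the following is a minimal NumPy sketch of how an SCC and an MMD between a real and a generated expression matrix could be computed. It is not the paper's evaluation code. The choice of an RBF kernel, the bandwidth `gamma`, the biased MMD estimator, and computing the correlation over per-gene mean expression are all assumptions made for illustration, and the data are random placeholders.

```python
import numpy as np

def rankdata(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_values = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_values[j + 1] == sorted_values[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average rank of the tie group
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])

def rbf_mmd(x, y, gamma=0.1):
    """Biased MMD estimate with kernel exp(-gamma * ||a - b||^2) (assumed kernel)."""
    def kernel(a, b):
        sq_dist = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * np.maximum(sq_dist, 0.0))
    mmd_squared = kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
    return float(np.sqrt(max(mmd_squared, 0.0)))

# Placeholder matrices (cells x genes); real use would pass held-out and generated cells.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
generated_shifted = real + 2.0

scc = spearman(real.mean(axis=0), generated_shifted.mean(axis=0))
mmd_same = rbf_mmd(real, real)
mmd_shifted = rbf_mmd(real, generated_shifted)
```

A higher SCC and a lower MMD both indicate closer agreement with real data, which is the direction in which the reported 0.984 and 0.018 should be read.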

Key Features

Conditional multi-factor generation: Separate classifier networks guide the diffusion process simultaneously across multiple biological variables — organ type, cell type, and other conditions — without interfering with the core denoising network.
Gradient Interpolation for trajectory synthesis: Blends gradient signals from two condition classifiers using adjustable weights to generate smooth, continuous developmental transitions between known cell states, including reprogramming and differentiation events.
Foundation model integration: Encodes raw gene expression profiles into 128-dimensional latent embeddings using SCimilarity, a foundation model pre-trained on 22.7 million cells, providing biologically grounded representations across diverse tissues.
Rare cell type synthesis: Generates underrepresented cell populations that are statistically difficult to sample experimentally, expanding the effective coverage of synthetic datasets.
Out-of-distribution generalization: Produces cells under condition combinations absent from training data, enabling exploration of hypothetical biological states prior to experimental validation.
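The Gradient Interpolation feature listed above can be illustrated with a toy sketch. This is not the scDiffusion implementation: the quadratic "classifier" score, the two latent centres, the step size, and the convention that the two weights sum to one are all hypothetical choices made so the example stays self-contained. It shows only the core idea, that a weighted blend of two guidance gradients pulls a sample toward a state between the two conditions.

```python
import numpy as np

def toy_classifier_grad(x, center):
    """Gradient of the toy log-score -0.5 * ||x - center||^2 with respect to x.

    Hypothetical stand-in for the gradient of log p(condition | x)
    that a trained condition classifier would provide.
    """
    return center - x

def interpolated_grad(x, center_a, center_b, weight_a):
    """Blend two guidance gradients with weights weight_a and 1 - weight_a."""
    grad_a = toy_classifier_grad(x, center_a)
    grad_b = toy_classifier_grad(x, center_b)
    return weight_a * grad_a + (1.0 - weight_a) * grad_b

def follow_guidance(x0, center_a, center_b, weight_a, step=0.1, n_steps=200):
    """Repeatedly nudge x along the blended gradient (diffusion noise omitted)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * interpolated_grad(x, center_a, center_b, weight_a)
    return x

state_a = np.array([0.0, 0.0])  # hypothetical latent centre of condition A
state_b = np.array([4.0, 2.0])  # hypothetical latent centre of condition B

# Sweeping the weight from 1 to 0 traces intermediate states from A to B.
trajectory = [follow_guidance(np.zeros(2), state_a, state_b, w)
              for w in (1.0, 0.75, 0.5, 0.25, 0.0)]
```

In this toy setting each point of `trajectory` settles at the weighted average of the two centres, so sweeping the weight yields an evenly spaced path. With real classifiers the intermediate states would depend on the learned decision surfaces rather than on a straight line.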

Technical Details

scDiffusion is composed of three jointly operating components. First, an autoencoder built on the SCimilarity foundation model compresses high-dimensional gene expression vectors (tens of thousands of genes) into 128-dimensional latent representations. The decoder reconstructs full expression profiles from these embeddings and partially corrects for the zero-inflated, sparse distributions characteristic of raw scRNA-seq data.

Second, a denoising network based on a skip-connected multilayer perceptron (MLP) learns the reverse diffusion process over 1,000 timesteps. The MLP architecture is chosen deliberately over transformers to accommodate the sparse, unordered nature of gene expression vectors; skip connections preserve biological signal during iterative denoising. Third, condition controllers — independently trained classifiers — inject gradient signals at each timestep to steer generation toward specified biological conditions. This decoupled design allows new conditions to be added without retraining the diffusion backbone.
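As a rough picture of what a skip-connected MLP denoiser looks like, the sketch below implements a forward pass in NumPy. Only the 128-dimensional latent size and the 1,000-timestep range come from the description above. The hidden width, the number of blocks, the sinusoidal timestep embedding, the residual-style wiring, and the random untrained weights are assumptions, and the published architecture may differ in all of them.

```python
import numpy as np

LATENT_DIM = 128   # latent size stated in the text
HIDDEN_DIM = 256   # assumed hidden width
N_BLOCKS = 3       # assumed number of skip-connected layers

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, shape (len(t), dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def init_params(rng, scale=0.02):
    """Random (untrained) weights, only to make the forward pass runnable."""
    return {
        "w_in": rng.normal(0.0, scale, (LATENT_DIM, HIDDEN_DIM)),
        "blocks": [rng.normal(0.0, scale, (HIDDEN_DIM, HIDDEN_DIM))
                   for _ in range(N_BLOCKS)],
        "w_out": rng.normal(0.0, scale, (HIDDEN_DIM, LATENT_DIM)),
    }

def denoiser(params, x_t, t):
    """Predict the noise in x_t at timesteps t; output has the same shape as x_t."""
    h = x_t @ params["w_in"] + timestep_embedding(t, HIDDEN_DIM)
    for w in params["blocks"]:
        h = h + np.maximum(h @ w, 0.0)  # skip connection around each ReLU layer
    return h @ params["w_out"]

rng = np.random.default_rng(0)
params = init_params(rng)
noisy_latents = rng.standard_normal((4, LATENT_DIM))
timesteps = np.array([0, 10, 500, 999])
predicted_noise = denoiser(params, noisy_latents, timesteps)
```

Because the condition classifiers sit outside this network, nothing in the denoiser takes a condition label as input, which is what allows new conditions to be added without retraining the backbone.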

Training data draws on large-scale publicly available scRNA-seq atlases, with the SCimilarity encoder having been pre-trained on 22.7 million cells spanning diverse tissues and species. Inference starts from Gaussian noise, which is iteratively denoised under classifier guidance, and runs on standard GPU hardware.
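The inference procedure described above, starting from Gaussian noise and denoising step by step under classifier guidance, can be sketched as a standard DDPM ancestral sampler with a guidance term added to each step's mean. This is a generic illustration, not the scDiffusion sampler. The linear noise schedule, the guidance scale, the choice of step variance, and both toy models are assumptions. The noise predictor is the exact one for data drawn from a standard normal, and the classifier gradient is a hypothetical quadratic score.

```python
import numpy as np

T = 1000                                 # number of diffusion timesteps (from the text)
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_eps_model(x_t, t):
    """Exact noise predictor when the data distribution is N(0, I) (toy case)."""
    return np.sqrt(1.0 - alpha_bars[t]) * x_t

def toy_classifier_grad(x_t, target):
    """Hypothetical stand-in for the gradient of log p(condition | x_t)."""
    return target - x_t

def sample(n, dim, target=None, scale=1.0, seed=0):
    """Draw n samples; if target is given, apply classifier guidance at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))    # start from pure Gaussian noise
    for t in range(T - 1, -1, -1):
        eps = toy_eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if target is not None:
            # Guidance: shift the mean by scale * step variance * classifier gradient.
            mean = mean + scale * betas[t] * toy_classifier_grad(x, target)
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise on last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

unguided = sample(256, 8)
guided = sample(256, 8, target=np.full(8, 3.0))
```

In this toy setup the unguided samples stay centred on zero while the guided samples drift toward the target, which mirrors how the condition controllers steer generation without changing the denoiser itself.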

Applications

scDiffusion serves researchers who need to augment limited experimental datasets for downstream machine learning tasks, particularly in cases where rare cell types are insufficiently represented. Developmental biologists can use Gradient Interpolation to reconstruct transcriptional transitions — such as embryonic reprogramming or lineage commitment — by generating synthetic cells along a continuous trajectory between two experimentally characterized states. The model also supports experimental design by enabling in-silico exploration of uncharacterized condition combinations, helping prioritize which experiments to run. More speculatively, synthetic cells representing drug-treated states could complement real perturbation data in pharmacogenomics workflows.

Impact

scDiffusion demonstrates that combining pre-trained biological foundation models with diffusion-based generative frameworks can produce synthetic single-cell data with sufficient fidelity for downstream biological analysis. The 93% CellTypist classification accuracy of generated cells and the SCC of 0.984 relative to real data are strong results that distinguish it from earlier variational autoencoder-based approaches such as scVI. The Gradient Interpolation strategy is a practical contribution that could be adapted to other conditional generative tasks in biology. Two limitations stand out. First, the model depends on SCimilarity's training distribution, so performance may degrade for cell types poorly represented in that foundation model's pre-training corpus. Second, it does not natively handle multi-omics data or spatial transcriptomics contexts.

Citation

scDiffusion: conditional generation of high-quality single-cell data using diffusion model

Luo, E., Hao, M., Wei, L., & Zhang, X. (2024). scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics, 40(9), btae518.

DOI: 10.1093/bioinformatics/btae518

Recent citations

Papers that recently cited this model.

Islands and bridges: Momentum contrastive coupling unifies discrete and continuous structure in single-cell omics
Zeyu Fu, Chunlin Chen, Keyang Zhang
Biomedical Signal Processing and Control · 2026
Citations: 0

Tokenizing single-cell transcriptomes as a native language for large language models
Chuxi Xiao, Yuang Ding, Haiyang Bian, et al.
bioRxiv · Jul 2026
Citations: 0

scJET: Full-gene Space Single-cell Expression Generation with Patch-based Transformer Modeling
Qiantong Liang, Q. R. Lyu
bioRxiv · Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Cell2Sentence: Teaching Large Language Models the Language of Biology
Daniel Levine, S. Rizvi, Sacha Lévy, et al.
bioRxiv · Oct 2024
Citations: 75

Diffusion Generative Modeling for Spatially Resolved Gene Expression Inference from Histology Images
Sichen Zhu, Yuchen Zhu, Molei Tao, et al.
International Conference on Learning Representations · Jan 2025
Citations: 31

From Classical Machine Learning to Emerging Foundation Models: Review on Multimodal Data Integration for Cancer Research
A. Muneer, M. Waqas, Maliazurina B. Saad, et al.
Artificial Intelligence Review · Jul 2025
Citations: 19

Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces
Kevin Rojas, Yuchen Zhu, Sichen Zhu, et al.
International Conference on Machine Learning · Jun 2025
Citations: 18

Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen
A. Palma, Till Richter, Hanyi Zhang, et al.
International Conference on Learning Representations · Jul 2024
Citations: 17 · Influential

Citations

Total Citations: 64

Influential: 3

References: 71

GitHub

Stars: 94

Forks: 15

Open Issues: 8

Contributors: 2

Last Push: 6mo ago

Language: Jupyter Notebook

License: MIT

Fields of citing research

Computer Science: 95%
Biology: 86%
Medicine: 44%
Mathematics: 8%
Physics: 3%
Environmental Science: 3%
Engineering: 3%

Share of papers citing this model.

Openness

bio.rodeo openness (Fully open · usable and reproducible): 60, Partial

Usability (can I run it?): 64

Reproducibility (can I retrain it?): 57

Model Openness Framework: Unclassified (missing required components)


