bio.rodeo
Single-cell

scDiffusion

Tsinghua University

Generative diffusion model for single-cell RNA-seq data synthesis, enabling controlled generation of specific cell types, rare cells, and developmental trajectories.

Released: 2024

Overview

scDiffusion is a conditional generative model developed at Tsinghua University that synthesizes realistic single-cell RNA sequencing (scRNA-seq) data using a diffusion-based framework. It addresses a persistent bottleneck in single-cell genomics: the scarcity of high-quality training data for rare cell types and undersampled conditions. By combining a pre-trained foundation model encoder with a guided diffusion process, scDiffusion produces synthetic cells that accurately reflect the statistical and biological properties of real transcriptomes.

The model supports multi-condition generation: researchers can specify several biological constraints at once, such as organ type and cell type, when sampling. A key innovation is the Gradient Interpolation strategy, which blends classifier guidance signals from two distinct conditions to generate intermediate cell states. This enables reconstruction of continuous developmental trajectories without requiring experimental data for those intermediate states.
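
The blending idea behind Gradient Interpolation can be illustrated with a toy sketch. This is not the authors' implementation: the real model applies classifier gradients inside the reverse diffusion process, whereas here each condition is replaced by a hypothetical Gaussian centered on a made-up latent point, and all function and variable names are assumptions for illustration only.

```python
import numpy as np

def toy_condition_grad(x, center):
    """Gradient of log N(x; center, I) w.r.t. x: a toy stand-in for a classifier gradient."""
    return center - x

def interpolated_guidance(x, center_a, center_b, weight_a):
    """Blend guidance from two conditions. weight_a=1 is pure A, weight_a=0 is pure B."""
    grad_a = toy_condition_grad(x, center_a)
    grad_b = toy_condition_grad(x, center_b)
    return weight_a * grad_a + (1.0 - weight_a) * grad_b

def follow_guidance(start, center_a, center_b, weight_a, step_size=0.1, n_steps=200):
    """Repeatedly step along the blended gradient until the point settles."""
    x = np.array(start, dtype=float)
    for _ in range(n_steps):
        x = x + step_size * interpolated_guidance(x, center_a, center_b, weight_a)
    return x

state_a = np.array([0.0, 0.0])   # hypothetical latent center of the starting cell state
state_b = np.array([4.0, 2.0])   # hypothetical latent center of the end cell state

# Sweep the weight from pure A to pure B to trace intermediate states.
trajectory = [follow_guidance([1.0, -1.0], state_a, state_b, w)
              for w in np.linspace(1.0, 0.0, 5)]
```

In this toy setting the blended gradient pulls each point toward the weighted average of the two centers, so sweeping the weight yields evenly spaced intermediate states between them.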

In the accompanying paper, published in Bioinformatics in 2024, scDiffusion achieved a Spearman correlation coefficient (SCC) of 0.984 and a maximum mean discrepancy (MMD) of 0.018 relative to held-out real data. Cell type classification of the synthetic cells with CellTypist reached 93% accuracy, which supports the biological plausibility of the generated transcriptomes.
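
For readers unfamiliar with the two fidelity metrics, the sketch below shows one common way to compute them. It rests on assumptions the text does not confirm: SCC is taken between two vectors (for example, per-gene mean expression in real and synthetic cells), MMD is the biased squared estimate with a Gaussian kernel, and the bandwidth is arbitrary. The paper's exact protocol may differ.

```python
import numpy as np

def spearman_correlation(a, b):
    """Spearman correlation as the Pearson correlation of ranks (ties are not averaged)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def rbf_mmd2(x, y, bandwidth=2.0):
    """Biased estimate of squared MMD between sample sets x and y (rows are cells)."""
    def kernel(p, q):
        sq_dists = (np.sum(p**2, axis=1)[:, None]
                    + np.sum(q**2, axis=1)[None, :]
                    - 2.0 * p @ q.T)
        return np.exp(-sq_dists / (2.0 * bandwidth**2))
    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))        # stand-in for real cells in some feature space
synthetic = rng.normal(size=(200, 5))   # stand-in for well-matched synthetic cells
shifted = real + 3.0                    # stand-in for poorly matched synthetic cells

mmd_good = rbf_mmd2(real, synthetic)
mmd_bad = rbf_mmd2(real, shifted)
```

A lower MMD indicates that the two sample sets are harder to tell apart, so the well-matched set should score far lower than the shifted one.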

Key Features

  • Conditional multi-factor generation: Separate classifier networks guide the diffusion process simultaneously across multiple biological variables — organ type, cell type, and other conditions — without interfering with the core denoising network.
  • Gradient Interpolation for trajectory synthesis: Blends gradient signals from two condition classifiers using adjustable weights to generate smooth, continuous developmental transitions between known cell states, including reprogramming and differentiation events.
  • Foundation model integration: Encodes raw gene expression profiles into 128-dimensional latent embeddings using SCimilarity, a foundation model pre-trained on 22.7 million cells, providing biologically grounded representations across diverse tissues.
  • Rare cell type synthesis: Generates underrepresented cell populations that are statistically difficult to sample experimentally, expanding the effective coverage of synthetic datasets.
  • Out-of-distribution generalization: Produces cells under condition combinations absent from training data, enabling exploration of hypothetical biological states prior to experimental validation.
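
As a rough illustration of how separate classifiers can steer generation together, the sketch below sums the input gradients of two independent classifiers. It assumes simple linear softmax classifiers with random weights; the model's actual condition classifiers are neural networks, and the names `organ_clf`, `cell_type_clf`, and `combined_guidance` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

class LinearClassifier:
    """Toy softmax classifier over a latent vector x."""
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)  # (n_classes, latent_dim)
        self.bias = np.asarray(bias, dtype=float)        # (n_classes,)

    def log_prob(self, x, label):
        return log_softmax(self.weights @ x + self.bias)[label]

    def log_prob_grad(self, x, label):
        """Analytic gradient of log p(label | x) with respect to x."""
        probs = np.exp(log_softmax(self.weights @ x + self.bias))
        return self.weights[label] - probs @ self.weights

def combined_guidance(x, classifiers, labels, scales):
    """Sum the scaled guidance gradients, one term per condition."""
    return sum(s * clf.log_prob_grad(x, y)
               for clf, y, s in zip(classifiers, labels, scales))

rng = np.random.default_rng(0)
latent_dim = 4
organ_clf = LinearClassifier(rng.normal(size=(3, latent_dim)), np.zeros(3))
cell_type_clf = LinearClassifier(rng.normal(size=(5, latent_dim)), np.zeros(5))

x = rng.normal(size=latent_dim)
guidance = combined_guidance(x, [organ_clf, cell_type_clf], labels=[1, 4], scales=[1.0, 1.0])
```

Because each classifier contributes its own gradient term, a new condition can be added by training one more classifier while leaving the denoiser untouched.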

Technical Details

scDiffusion has three components that work together at generation time. First, an autoencoder built on the SCimilarity foundation model compresses high-dimensional gene expression vectors (tens of thousands of genes) into 128-dimensional latent representations. The decoder reconstructs full expression profiles from these embeddings and partially corrects for the zero-inflated, sparse distributions characteristic of raw scRNA-seq data.

Second, a denoising network based on a skip-connected multilayer perceptron (MLP) learns the reverse diffusion process over 1,000 timesteps. The MLP architecture is chosen deliberately over transformers to accommodate the sparse, unordered nature of gene expression vectors; skip connections preserve biological signal during iterative denoising.

Third, condition controllers (independently trained classifiers) inject gradient signals at each timestep to steer generation toward specified biological conditions. This decoupled design allows new conditions to be added without retraining the diffusion backbone.
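
A forward pass through a skip-connected MLP denoiser might look like the following sketch. The weights are random and untrained, the hidden width, block count, sinusoidal timestep embedding, and residual form of the skip connection are all assumptions, and the class name `SkipMLPDenoiser` is hypothetical. It only shows how a noisy latent and a timestep map to a noise prediction of the same shape.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, shape (batch, dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

class SkipMLPDenoiser:
    """Untrained toy denoiser: linear input layer, residual ReLU blocks, linear output."""
    def __init__(self, latent_dim=128, hidden_dim=256, n_blocks=3, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.05
        self.hidden_dim = hidden_dim
        self.w_in = rng.normal(0.0, scale, (latent_dim, hidden_dim))
        self.blocks = [rng.normal(0.0, scale, (hidden_dim, hidden_dim))
                       for _ in range(n_blocks)]
        self.w_out = rng.normal(0.0, scale, (hidden_dim, latent_dim))

    def __call__(self, x_t, t):
        # Inject the timestep so the network knows how noisy its input is.
        h = x_t @ self.w_in + timestep_embedding(t, self.hidden_dim)
        for w in self.blocks:
            h = h + np.maximum(h @ w, 0.0)   # skip connection around each ReLU layer
        return h @ self.w_out                # predicted noise, same shape as x_t

model = SkipMLPDenoiser()
noisy_latents = np.zeros((4, 128))
timesteps = np.array([0, 10, 500, 999])      # indices within the 1,000-step schedule
predicted_noise = model(noisy_latents, timesteps)
```

Because an MLP treats its input as an unordered vector, it imposes no spatial or sequential structure on the latent dimensions.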

Training data come from large, publicly available scRNA-seq atlases; the SCimilarity encoder was pre-trained on 22.7 million cells spanning diverse tissues and species. Inference starts from Gaussian noise, which is iteratively denoised under classifier guidance, and runs on standard GPU hardware.
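
The inference procedure described above can be sketched as a standard DDPM-style sampling loop with classifier guidance. This is a minimal illustration under several assumptions: a linear noise schedule, a toy noise predictor that is exact only when the data are standard normal, and a toy guidance gradient pointing toward a hypothetical latent `center`. None of these are taken from the scDiffusion code.

```python
import numpy as np

def sample(noise_predictor, guidance_grad, shape, n_steps=1000, guidance_scale=1.0, seed=0):
    """Ancestral sampling from Gaussian noise with a classifier-guidance mean shift."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)      # linear schedule (assumption)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                    # start from pure Gaussian noise
    for t in range(n_steps - 1, -1, -1):
        eps_hat = noise_predictor(x, t, alpha_bars[t])
        # Posterior mean of the denoising step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Classifier guidance: shift the mean along the gradient of log p(condition | x).
        mean = mean + guidance_scale * betas[t] * guidance_grad(x)
        noise = rng.normal(size=shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def toy_noise_predictor(x, t, alpha_bar):
    """Exact noise estimate when the data distribution is N(0, I); a stand-in for a trained MLP."""
    return np.sqrt(1.0 - alpha_bar) * x

center = np.full(8, 3.0)                          # hypothetical latent center of a target condition

guided = sample(toy_noise_predictor, lambda x: center - x, shape=(64, 8))
unguided = sample(toy_noise_predictor, lambda x: np.zeros_like(x), shape=(64, 8))
```

In the full model, the sampled latents would then be passed through the decoder to obtain gene expression profiles.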

Applications

scDiffusion serves researchers who need to augment limited experimental datasets for downstream machine learning tasks, particularly in cases where rare cell types are insufficiently represented. Developmental biologists can use Gradient Interpolation to reconstruct transcriptional transitions — such as embryonic reprogramming or lineage commitment — by generating synthetic cells along a continuous trajectory between two experimentally characterized states. The model also supports experimental design by enabling in-silico exploration of uncharacterized condition combinations, helping prioritize which experiments to run. More speculatively, synthetic cells representing drug-treated states could complement real perturbation data in pharmacogenomics workflows.
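
For the augmentation use case, one simple planning step is to work out how many synthetic cells each cell type would need to reach a balanced dataset. The helper below is a hypothetical illustration, not part of scDiffusion; the function name and the balance-to-the-largest-class rule are assumptions.

```python
from collections import Counter

def synthetic_cells_needed(cell_type_labels, target_per_type=None):
    """Return how many synthetic cells to generate per type to reach the target count."""
    counts = Counter(cell_type_labels)
    # Default target: match the most abundant cell type.
    target = target_per_type if target_per_type is not None else max(counts.values())
    return {cell_type: max(target - n, 0) for cell_type, n in counts.items()}

labels = ["T cell"] * 5 + ["B cell"] * 3 + ["rare"] * 1
plan = synthetic_cells_needed(labels)
```

The resulting counts could then be passed, one cell type at a time, to a conditional generator.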

Impact

scDiffusion demonstrates that combining pre-trained biological foundation models with diffusion-based generative frameworks can produce synthetic single-cell data with sufficient fidelity for downstream biological analysis. The 93% CellTypist classification accuracy of generated cells and the SCC of 0.984 relative to real data are competitive results that distinguish it from earlier variational autoencoder-based approaches such as scVI. The Gradient Interpolation strategy is a practical contribution that could be adapted to other conditional generative tasks in biology. Current limitations include dependence on SCimilarity's training distribution: performance may degrade for cell types poorly represented in that foundation model's pre-training corpus. The model also does not natively handle multi-omics data or spatial transcriptomics contexts.

Citation

Luo, E., Hao, M., Wei, L., & Zhang, X. (2024). scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics, 40(9), btae518.

DOI: 10.1093/bioinformatics/btae518

Metrics

GitHub

Stars: 89
Forks: 15
Open Issues: 8
Contributors: 2
Last Push: 3 months ago
Language: Jupyter Notebook
License: MIT

Citations

Total Citations: 46
Influential: 3
References: 59

Tags

diffusion, foundation model, generative

Resources

GitHub Repository, Research Paper, Dataset