Generative diffusion model for single-cell RNA-seq data synthesis, enabling controlled generation of specific cell types, rare cells, and developmental trajectories.
scDiffusion is a conditional generative model developed at Tsinghua University that synthesizes realistic single-cell RNA sequencing (scRNA-seq) data using a diffusion-based framework. It addresses a persistent bottleneck in single-cell genomics: the scarcity of high-quality training data for rare cell types and undersampled conditions. By combining a pre-trained foundation model encoder with a guided diffusion process, scDiffusion produces synthetic cells that accurately reflect the statistical and biological properties of real transcriptomes.
The model supports multi-condition generation: researchers can impose several biological constraints at once during sampling, such as organ type together with cell type. A key innovation is the Gradient Interpolation strategy, which blends classifier guidance signals from two distinct conditions to generate intermediate cell states. This enables reconstruction of continuous developmental trajectories without requiring experimental data for those intermediate states.
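The idea of blending two guidance signals can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear-softmax classifier, the function names, and the simple convex weighting `(1 - weight) * g_a + weight * g_b` are assumptions chosen for illustration, and the exact interpolation rule used in scDiffusion may differ.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def log_prob_grad(W, b, z, label):
    """Gradient of log p(label | z) with respect to z.

    Assumes a hypothetical linear-softmax classifier with weights
    W of shape (n_classes, dim) and bias b of shape (n_classes,).
    """
    probs = softmax(z @ W.T + b)      # (n_cells, n_classes)
    return W[label] - probs @ W       # (n_cells, dim)

def interpolated_guidance(W, b, z, label_a, label_b, weight):
    """Blend guidance toward two conditions (assumed convex weighting).

    weight = 0 steers fully toward label_a, weight = 1 toward label_b.
    """
    g_a = log_prob_grad(W, b, z, label_a)
    g_b = log_prob_grad(W, b, z, label_b)
    return (1.0 - weight) * g_a + weight * g_b
```

Sweeping `weight` from 0 to 1 across separate sampling runs would, under these assumptions, yield cells that move gradually from the first condition to the second.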
scDiffusion was published in Bioinformatics in 2024. The reported results include a Spearman correlation coefficient (SCC) of 0.984 and a maximum mean discrepancy (MMD) of 0.018 between generated cells and held-out real data. Cell type classification of synthetic outputs using CellTypist reached 93% accuracy, which supports the biological plausibility of the generated transcriptomes.
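For readers unfamiliar with these two metrics, the following NumPy sketch shows one common way to compute them. It is an illustration under stated assumptions, not the paper's evaluation code: the RBF kernel, the `gamma` bandwidth, the biased squared-MMD estimator, and the choice of what vectors to correlate (for example, per-gene mean expression of real versus generated cells) are all assumptions.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return (sums / counts)[inverse]

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_rbf(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y (rows = cells)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

An SCC near 1 and an MMD near 0 both indicate that the generated distribution is close to the real one, though the absolute MMD value depends on the kernel and bandwidth.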
scDiffusion is composed of three components that work together. First, an autoencoder built on the SCimilarity foundation model compresses high-dimensional gene expression vectors (tens of thousands of genes) into 128-dimensional latent representations. The decoder reconstructs full expression profiles from these embeddings and partially corrects for the zero-inflated, sparse distributions characteristic of raw scRNA-seq data.
Second, a denoising network based on a skip-connected multilayer perceptron (MLP) learns the reverse diffusion process over 1,000 timesteps. The MLP architecture is chosen deliberately over transformers to accommodate the sparse, unordered nature of gene expression vectors; skip connections preserve biological signal during iterative denoising. Third, condition controllers (classifiers trained independently of the diffusion backbone) inject gradient signals at each timestep to steer generation toward specified biological conditions. This decoupled design allows new conditions to be added without retraining the diffusion backbone.
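To make the denoiser description concrete, here is a minimal NumPy forward pass for an MLP with skip connections operating on 128-dimensional latents. Everything beyond the latent size is an assumption for illustration: the hidden width, the number of blocks, the residual (additive) form of the skip connections, the ReLU activation, and the sinusoidal timestep embedding are not taken from the paper, and the weights here are random rather than trained.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer timestep (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def init_params(latent_dim=128, hidden=256, n_blocks=3, seed=0):
    """Random weights for a hypothetical skip-connected MLP denoiser."""
    rng = np.random.default_rng(seed)

    def dense(n_in, n_out):
        return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out)

    return {
        "inp": dense(latent_dim, hidden),
        "time": dense(hidden, hidden),
        "blocks": [dense(hidden, hidden) for _ in range(n_blocks)],
        "out": dense(hidden, latent_dim),
    }

def denoiser_forward(params, z_t, t):
    """Predict noise for noisy latents z_t of shape (n_cells, latent_dim)."""
    W, b = params["inp"]
    h = z_t @ W + b
    Wt, bt = params["time"]
    h = h + timestep_embedding(t, Wt.shape[0]) @ Wt + bt  # inject timestep
    for W, b in params["blocks"]:
        h = h + np.maximum(h @ W + b, 0.0)                # skip connection
    W, b = params["out"]
    return h @ W + b
```

Because each block adds its output to its input, information in the latent can pass through unchanged when a block contributes little, which is the intuition behind using skip connections during iterative denoising.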
Training draws on large-scale, publicly available scRNA-seq atlases; the SCimilarity encoder was pre-trained on 22.7 million cells spanning diverse tissues and species. Inference starts from Gaussian noise, which is iteratively denoised under classifier guidance, and runs on standard GPU hardware.
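The inference procedure can be sketched as a DDPM-style ancestral sampling loop in which a classifier gradient shifts the mean of each reverse step. This is a generic sketch of classifier-guided diffusion, not scDiffusion's code: the linear noise schedule, the guidance `scale`, and the `eps_model` and `cond_grad` callables are hypothetical placeholders for the trained denoiser and condition controller.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and derived quantities."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def guided_sample(eps_model, cond_grad, dim=128, n=4, T=1000, scale=1.0, seed=0):
    """Sample n latent vectors by guided reverse diffusion.

    eps_model(z, t) -> predicted noise, same shape as z (hypothetical).
    cond_grad(z, t) -> gradient of log p(condition | z) w.r.t. z (hypothetical).
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    z = rng.standard_normal((n, dim))            # start from Gaussian noise
    for t in range(T - 1, -1, -1):
        eps = eps_model(z, t)
        # Posterior mean of the unguided reverse step.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        var = betas[t]
        # Classifier guidance: shift the mean along the condition gradient.
        mean = mean + scale * var * cond_grad(z, t)
        noise = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(var) * noise
    return z
```

In a full pipeline the returned latents would then be passed through the autoencoder's decoder to obtain gene expression profiles; that step is omitted here.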
scDiffusion serves researchers who need to augment limited experimental datasets for downstream machine learning tasks, particularly when rare cell types are underrepresented. Developmental biologists can use Gradient Interpolation to reconstruct transcriptional transitions, such as embryonic reprogramming or lineage commitment, by generating synthetic cells along a continuous trajectory between two experimentally characterized states. The model also supports experimental design by enabling in-silico exploration of uncharacterized condition combinations, helping prioritize which experiments to run. More speculatively, synthetic cells representing drug-treated states could complement real perturbation data in pharmacogenomics workflows.
scDiffusion demonstrates that combining pre-trained biological foundation models with diffusion-based generative frameworks can produce synthetic single-cell data with sufficient fidelity for downstream biological analysis. The 93% CellTypist classification accuracy of generated cells and the SCC of 0.984 relative to real data are competitive results that distinguish it from earlier variational autoencoder-based approaches such as scVI. The Gradient Interpolation strategy is a practical contribution that could be adapted to other conditional generative tasks in biology. One current limitation is the dependence on SCimilarity's training distribution: performance may degrade for cell types poorly represented in that foundation model's pre-training corpus. In addition, the model does not natively handle multi-omics data or spatial transcriptomics contexts.
Luo, E., Hao, M., Wei, L., & Zhang, X. (2024). scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics, 40(9), btae518.
DOI: 10.1093/bioinformatics/btae518