scDiVa

Single-cell foundation model built on masked discrete diffusion, jointly generating gene identities and expression values from 59 million cells.

Released: February 2026

Foundation models for single-cell RNA sequencing increasingly borrow the autoregressive recipe from large language models, generating a cell's gene tokens one after another. But cells are not sentences: there is no natural left-to-right order over genes, so imposing one introduces an artificial ordering bias, and autoregressive generation can suffer from error accumulation as each prediction conditions on previous ones. This mismatch motivates generation strategies that treat a cell's genes as an unordered set.

scDiVa, introduced in a February 2026 arXiv preprint by Mingxuan Wang and colleagues (senior author Yanbiao Ma; the authors are at Chinese institutions, including Renmin University of China), is a single-cell foundation model built on masked discrete diffusion. Rather than ordering genes, it uses a continuous-time forward masking process and a bidirectional denoiser to jointly model two coupled aspects of a cell: discrete gene identities and continuous expression values. This lets the model fill in a cell's profile without the ordering bias and error accumulation of autoregressive approaches.

Pre-trained on 59 million cells, scDiVa is positioned by the authors as a biologically coherent alternative to autoregression, with strong performance across core single-cell tasks: batch integration, cell-type annotation, and perturbation-response prediction.

Key Features

Masked discrete diffusion: A continuous-time forward masking process in token space replaces autoregressive ordering, avoiding artificial gene ordering bias and error accumulation.
Joint identity and expression modeling: A bidirectional denoiser handles both discrete gene identities and continuous expression values together, capturing the two coupled facets of a cell.
Information-preserving serialization: Entropy-normalized serialization with a latent anchor token helps preserve information when representing a cell for the diffusion process.
Robust training objectives: Depth-invariant time sampling and dual denoising objectives stabilize learning across cells of varying sequencing depth.
Large-scale pre-training: Trained on 59 million cells, giving broad coverage of cell states for downstream transfer.
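
To make the masked-diffusion idea concrete, the following is a minimal sketch of a continuous-time forward masking step. It is not the authors' implementation: it assumes a linear schedule in which each position is masked independently with probability equal to the diffusion time t, a common choice in masked discrete diffusion that this page does not confirm for scDiVa. The names forward_mask and MASK, and the toy gene list, are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token; scDiVa's actual vocabulary is not described here


def forward_mask(genes, values, t, rng):
    """Mask each (gene, value) position independently with probability t.

    Assumption: a linear schedule where the masking probability equals the
    diffusion time t in [0, 1]. This is illustrative, not a confirmed
    detail of scDiVa.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    noisy_genes, noisy_values, is_masked = [], [], []
    for gene, value in zip(genes, values):
        hide = rng.random() < t  # rng.random() is in [0, 1)
        noisy_genes.append(MASK if hide else gene)
        noisy_values.append(None if hide else value)  # hide the value with its gene
        is_masked.append(hide)
    return noisy_genes, noisy_values, is_masked


rng = random.Random(0)
genes = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]  # toy cell, no implied order
values = [2.1, 0.0, 3.4, 1.2, 0.7]
t = rng.random()  # continuous diffusion time drawn uniformly from [0, 1)
noisy_genes, noisy_values, is_masked = forward_mask(genes, values, t, rng)
print(t, noisy_genes, noisy_values)
```

At t = 0 the cell is left intact and at t = 1 every position is masked, so a denoiser trained across all t sees every level of corruption. Because masking is applied per position rather than left to right, no gene ordering is imposed.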

Technical Details

scDiVa is a masked discrete-diffusion foundation model for single-cell data. Its forward process progressively masks tokens in continuous time, and a bidirectional denoiser reconstructs both discrete gene identities and continuous expression values. Cells are represented through entropy-normalized serialization with a latent anchor token, which is intended to preserve information. Training employs depth-invariant time sampling and dual denoising objectives, and the model was pre-trained on 59 million cells. The authors report strong results on batch integration, cell-type annotation, and perturbation-response prediction, framing the approach as an effective alternative to autoregressive single-cell models. They state that the pre-training corpus is large and proprietary, and that its composition cannot be disclosed for privacy and commercial reasons. As of the February 2026 preprint, code and model weights have not been released, and the parameter count is not stated.
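
One way to picture a dual denoising objective is as two losses computed on the masked positions: one for the discrete gene identity and one for the continuous expression value. The sketch below assumes cross-entropy for identities and squared error for values, averaged over masked positions and summed with equal weight. These loss forms and the weighting are assumptions for illustration; the page does not specify scDiVa's actual objectives, and dual_denoising_loss is a hypothetical name.

```python
import math


def dual_denoising_loss(gene_probs, true_gene_ids, pred_values, true_values, is_masked):
    """Return (identity_loss, expression_loss) averaged over masked positions.

    gene_probs[i] is a predicted probability distribution over the gene
    vocabulary at position i. Cross-entropy and squared error are
    assumed loss forms, not confirmed details of scDiVa.
    """
    masked = [i for i, m in enumerate(is_masked) if m]
    if not masked:
        return 0.0, 0.0  # nothing was masked, so there is nothing to reconstruct
    eps = 1e-12  # guard against log(0)
    identity = -sum(
        math.log(max(gene_probs[i][true_gene_ids[i]], eps)) for i in masked
    ) / len(masked)
    expression = sum(
        (pred_values[i] - true_values[i]) ** 2 for i in masked
    ) / len(masked)
    return identity, expression


# Toy example: three positions, a three-gene vocabulary, two positions masked.
gene_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
true_gene_ids = [0, 1, 2]
pred_values = [1.0, 2.5, 0.0]
true_values = [1.0, 2.0, 1.0]
is_masked = [False, True, True]

identity_loss, expression_loss = dual_denoising_loss(
    gene_probs, true_gene_ids, pred_values, true_values, is_masked
)
total_loss = identity_loss + expression_loss  # equal weighting is an assumption
print(identity_loss, expression_loss, total_loss)
```

Restricting both terms to masked positions means the denoiser is scored only on what it had to reconstruct, while unmasked genes and values serve purely as bidirectional context.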

Applications

scDiVa is aimed at single-cell researchers who need a general-purpose model for transcriptomic analysis. Its demonstrated tasks (integrating cells across batches, annotating cell types, and predicting perturbation responses) cover much of the routine single-cell workflow, and its generative formulation also supports imputation and in-silico exploration of cell states. Researchers studying genetic perturbations or assembling atlases across datasets could use it to harmonize and interpret heterogeneous single-cell data.

Impact

scDiVa adds to evidence that discrete-diffusion generation, rather than autoregression, may be a better fit for the unordered, set-like structure of single-cell gene expression. By reporting competitive results across batch integration, annotation, and perturbation prediction from a single pre-trained model, it argues for masked diffusion as a viable backbone for single-cell foundation models. Its results still await peer review, and independent reproduction will be difficult because the pre-training corpus is proprietary and undisclosed and no code or weights have been released.

Citation

ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression

Preprint

Wang, M., et al. (2026) ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression. arXiv.org.

DOI: 10.48550/arXiv.2602.03477

Recent citations

Papers that recently cited this model.

Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction
S. Bhattacharya, Christian Gensbigler, Shaamil Karim, et al.
bioRxiv · Jun 2026
0

Citations

Total citations: 1
Influential citations: 0
References: 41

Fields of citing research

Share of papers citing this model.

Biology: 100%
Computer Science: 100%

Openness

bio.rodeo openness: 6, Closed (low usability and reproducibility)

Usability (can I run it?): 7
Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified (restrictive license on core components)

