xVERSE

Transcriptomics-native single-cell foundation model that learns batch-invariant cell representations and probabilistically generates virtual cells.

Released: April 2026

xVERSE is a transcriptomics-native single-cell foundation model that unifies two tasks usually pursued separately: learning batch-invariant cell representations and probabilistically generating realistic expression profiles. Introduced in an April 2026 bioRxiv preprint by Jiang and Xie at Duke University, the model targets a persistent problem in single-cell analysis—batch effects that confound the embeddings produced by many existing foundation models—while adding a generative capability that synthesizes "virtual cells" closely matching real measurements.

Rather than borrowing architectures and tokenization schemes designed for natural language and adapting them to gene expression, xVERSE is built around the structure of transcriptomic data itself. This transcriptomics-native design is intended to preserve genuine biological heterogeneity (cell types, states, and gradients) while removing technical variation introduced by sequencing platform, lab, or experiment. The result is a representation space in which biologically similar cells cluster together regardless of which batch they came from.
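One way to check whether an embedding space mixes batches as described is a nearest-neighbor batch-entropy score. The sketch below is illustrative only and is not taken from the preprint: the function name `knn_batch_entropy`, the brute-force Euclidean neighbor search, and the toy coordinates are all assumptions, and the authors' actual benchmark metrics may differ.

```python
import math
from collections import Counter


def knn_batch_entropy(embeddings, batches, k=2):
    """Mean normalized entropy of batch labels among each cell's k nearest neighbors.

    Values near 1.0 mean neighborhoods draw evenly from all batches;
    values near 0.0 mean each neighborhood is dominated by one batch.
    (Hypothetical helper for illustration, not part of xVERSE.)
    """
    n_batches = len(set(batches))
    if n_batches < 2:
        raise ValueError("need at least two batches")
    scores = []
    for i, x in enumerate(embeddings):
        # Brute-force Euclidean distances to every other cell.
        dists = sorted(
            (math.dist(x, y), j) for j, y in enumerate(embeddings) if j != i
        )
        neighbor_batches = Counter(batches[j] for _, j in dists[:k])
        entropy = -sum(
            (c / k) * math.log(c / k) for c in neighbor_batches.values()
        )
        # Divide by the maximum possible entropy so the score lies in [0, 1].
        scores.append(entropy / math.log(n_batches))
    return sum(scores) / len(scores)


# Toy 2-D embeddings: batches "A" and "B" occupy distant regions (poorly mixed).
separated = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
separated_batches = ["A", "A", "A", "B", "B", "B"]

# Toy 2-D embeddings: both batches share the same neighborhood (well mixed).
mixed = [(0, 0), (1, 0), (0, 1), (1, 1)]
mixed_batches = ["A", "A", "B", "B"]

print(knn_batch_entropy(separated, separated_batches))
print(knn_batch_entropy(mixed, mixed_batches))
```

A score like this only measures batch mixing. It should be read alongside a biological-conservation measure (for example, cell-type clustering quality), since an embedding that collapses all cells together would also appear well mixed.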

xVERSE arrives amid active debate over whether single-cell foundation models truly deliver universal embeddings or merely encode batch structure. By coupling representation learning with a calibrated generative model, it positions itself both as an embedding model and as a data-augmentation engine for settings where real cells are scarce. As of the preprint, the authors report no public release of weights or code.

Key Features

Batch-invariant representations: Learns cell embeddings that suppress technical batch variation while preserving biological signal, reportedly improving over leading foundation models by 17.9% and over dedicated batch-correction methods by 11.4% on representation-learning benchmarks.
Probabilistic virtual-cell generation: Synthesizes expression profiles that are statistically indistinguishable from biological cells, with a real-versus-synthetic classifier achieving AUROC near 0.5 (chance level).
Data augmentation for tiny datasets: Uses generated virtual cells to enable accurate clustering and marker detection in very small datasets, resolving rare cell types from as few as four observed cells.
Strong spatial imputation: Reports a 34.3% improvement over the second-best method on spatial transcriptomic imputation, extending utility beyond dissociated single-cell data.
Unified representation and generation: Combines an embedding model and a generative model in one framework, rather than treating these as separate pipelines.

Technical Details

xVERSE is a transcriptomics-native foundation model that pairs an encoder for batch-invariant representation learning with a probabilistic generative component for expression synthesis. The generative module is calibrated such that synthesized profiles cannot be reliably distinguished from real cells (AUROC ≈ 0.5), which the authors leverage for augmentation. Reported benchmark gains include a 17.9% improvement over leading single-cell foundation models and 11.4% over batch-effect correction baselines for representation learning, plus a 34.3% improvement over the next-best method on spatial imputation. Detailed architecture specifics, parameter count, and the full composition of the pretraining corpus are described in the preprint; public weights and code had not been released at the time of posting.
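The AUROC criterion above can be made concrete with a small calculation. The following is a minimal sketch, assuming a separately trained real-versus-synthetic classifier has already assigned each cell a "looks real" score. The function name `auroc` and the example scores are hypothetical, and this is not the authors' evaluation code.

```python
def auroc(scores_real, scores_synth):
    """AUROC as the probability that a randomly chosen real cell receives a
    higher classifier score than a randomly chosen synthetic cell.
    Ties count as half a win (Mann-Whitney formulation)."""
    if not scores_real or not scores_synth:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for r in scores_real:
        for s in scores_synth:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(scores_real) * len(scores_synth))


# Hypothetical scores: the classifier separates the two groups easily.
easy_real = [0.9, 0.8, 0.95]
easy_synth = [0.1, 0.2, 0.05]

# Hypothetical scores: the two groups receive the same score distribution.
hard_real = [0.2, 0.8]
hard_synth = [0.2, 0.8]

print(auroc(easy_real, easy_synth))
print(auroc(hard_real, hard_synth))
```

Under this formulation, values far from 0.5 in either direction indicate that the classifier can tell the groups apart, so the quantity of interest is the distance from 0.5 rather than the raw value.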

Applications

xVERSE is aimed at computational biologists and single-cell practitioners who need clean, batch-corrected embeddings for integrating heterogeneous datasets, as well as those working with small or rare-population samples where conventional clustering and marker detection fail. Its virtual-cell generation supports data augmentation for underpowered experiments, in silico expansion of rare cell types, and imputation of spatial transcriptomic measurements, making it relevant to atlas building, rare-disease and tumor-microenvironment studies, and benchmarking pipelines.

Impact

By directly tackling the batch-effect limitations that have drawn scrutiny to single-cell foundation models, xVERSE contributes to an active line of work on what "universal" cell embeddings should mean and how to validate them. Its tight coupling of representation learning with high-fidelity generation, together with the demonstration that synthetic cells can rescue analysis of extremely small populations, offers a concrete direction for data augmentation in single-cell biology. Because xVERSE is a recent preprint without released weights or code, its real-world adoption and independent validation remain to be established.

Citation

A transcriptomics-native foundation model for universal cell representation and virtual cell synthesis

Jiang, X. & Xie, J. (2026) A transcriptomics-native foundation model for universal cell representation and virtual cell synthesis. bioRxiv.

DOI: 10.64898/2026.04.12.718016

Citations

Total Citations: 29

Influential: 3

References: 0

Openness

bio.rodeo openness: Closed · low usability and reproducibility

10 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 10

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

Research Paper
