Single-cell foundation model pre-trained on 50 million cells for gene network inference, denoising, and cell type prediction.
scPRINT is a large-scale single-cell foundation model developed by Laura Cantini's group at the Institut Pasteur, trained on over 50 million cells drawn from the cellxgene database. Published in Nature Communications in April 2025, the model was designed to address a central challenge in single-cell transcriptomics: inferring the gene regulatory networks that drive cellular identity and disease, while simultaneously handling the pervasive technical noise and batch effects that make single-cell RNA-seq data difficult to interpret across studies.
The model's defining capability is cell-type-specific, genome-wide gene network inference at scale — a task where scPRINT substantially outperforms previous approaches including GENIE3, scGPT, Geneformer v2, and DeepSEM. Rather than requiring fine-tuning for each new dataset, scPRINT performs many of its core tasks in zero-shot mode, applying learned representations directly to novel cell populations and tissue contexts. This makes the model broadly useful across the long tail of biological questions where annotated training data is scarce.
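Network inference of this kind is typically derived from the transformer's gene-gene attention maps. As a toy illustration only (the names, head aggregation, and thresholding here are simplifications, not scPRINT's actual post-processing), averaged attention can be collapsed into a sparse gene adjacency matrix:

```python
import numpy as np

def attention_to_grn(attn, threshold=0.01):
    """Collapse per-head gene-gene attention into a binary adjacency matrix.

    attn: array of shape (heads, genes, genes) holding attention weights.
    Toy sketch of attention-based network extraction; scPRINT's real
    pipeline selects and combines heads more carefully.
    """
    mean_attn = attn.mean(axis=0)       # average attention over heads
    np.fill_diagonal(mean_attn, 0.0)    # drop self-edges
    return (mean_attn > threshold).astype(int)

# Two heads, three genes, one strong regulatory edge gene0 -> gene1
attn = np.zeros((2, 3, 3))
attn[:, 0, 1] = 0.5
adjacency = attention_to_grn(attn, threshold=0.1)
```

Thresholding is the simplest possible edge-calling rule; in practice one would calibrate it per cell type against a ground-truth benchmark such as BenGRN.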
Alongside the model, the authors release two benchmarking resources: BenGRN for gene regulatory network evaluation and GrnnData for packaging gene network datasets, both designed to standardize how the field compares network inference methods. These infrastructure contributions reflect a broader commitment to reproducibility and community benchmarking that distinguishes the scPRINT release from prior single-cell foundation models.
scPRINT is a bidirectional multi-head transformer encoder with a maximum context window of 2,200 genes, a value chosen to cover approximately 80% of cells in the cellxgene database without truncation. Model scales range from 2 million parameters (small) to 100 million parameters (large). The encoder is coupled to two decoder heads: a zero-inflated negative binomial (ZiNB) decoder for expression reconstruction and a hierarchical classification decoder predicting cell type (~400 leaf labels), disease, sequencer, ethnicity, sex, and organism. Training used FlashAttention2 throughout for efficient attention computation.
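The ZiNB decoder models each gene's count as a mixture of a structural-zero (dropout) component and a negative binomial. A minimal numerical sketch of the ZiNB log-likelihood (parameter names `mu`, `theta`, `pi` are generic, not scPRINT's internal variables):

```python
import numpy as np
from scipy.special import gammaln

def zinb_log_likelihood(x, mu, theta, pi):
    """Log-likelihood of counts x under a zero-inflated negative binomial.

    mu: NB mean, theta: inverse dispersion, pi: zero-inflation probability.
    Conceptual sketch of the decoder's output distribution, not the
    authors' implementation.
    """
    eps = 1e-8
    # Negative binomial log-probability, mean/dispersion parameterization
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # A zero can come from dropout (prob pi) or from the NB itself
    log_zero = np.logaddexp(np.log(pi + eps),
                            np.log(1 - pi + eps)
                            + theta * np.log(theta / (theta + mu)))
    log_nonzero = np.log(1 - pi + eps) + log_nb
    return np.where(x == 0, log_zero, log_nonzero)
```

Training minimizes the negative of this quantity summed over the genes in each cell's context window.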
Pre-training across 54,084,961 cells and approximately 80 billion tokens involved three concurrent tasks: a denoising objective that recovers true expression from 60%-downsampled profiles using a zero-inflated Poisson formulation; a bottleneck learning task that compresses cell embeddings to enable reconstruction without expression values; and the hierarchical classification task described above. Weighted random sampling with a factor of 50 for rare cell types mitigated class imbalance. Zero-shot cell type classification on the pancreas multi-batch benchmark achieved 62% accuracy with macro-F1 scores comparable to supervised state-of-the-art methods.
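The input corruption behind the denoising objective can be pictured as binomial thinning: each transcript is kept independently with probability 1 minus the downsampling rate, mimicking a shallower sequencing of the same cell. An illustrative sketch (not scPRINT's exact corruption code):

```python
import numpy as np

def downsample_counts(counts, rate=0.6, rng=None):
    """Binomially thin a count matrix, discarding ~`rate` of total counts.

    Each transcript survives independently with probability 1 - rate,
    so the expected total depth drops by `rate`. Illustrative of the
    pre-training corruption, not the authors' implementation.
    """
    rng = rng or np.random.default_rng(0)
    return rng.binomial(counts, 1.0 - rate)

# The model learns to reconstruct `profile` from the thinned `noisy` input
profile = np.array([[10, 0, 5, 120, 3]])
noisy = downsample_counts(profile, rate=0.6)
```

Because thinning never creates counts that were not observed, the noisy profile is elementwise bounded by the original, which makes the reconstruction target well defined.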
scPRINT is particularly well-suited for studies requiring interpretable gene regulatory network reconstruction across many cell types simultaneously — for example, comparing healthy and diseased regulatory programs in patient biopsies or atlases. The model's zero-shot capabilities make it applicable to new tissues and species without retraining. In a demonstrated use case on benign prostatic hyperplasia, scPRINT identified rare B-cell populations with early tumor microenvironment markers and revealed a PAGE4 hub gene linking fibroblast senescence to chronic inflammation via metal and ion exchange pathways. Researchers working in functional genomics, network biology, and disease mechanism discovery will find the batch correction and denoising outputs useful as preprocessing steps before downstream statistical analysis.
scPRINT advances the single-cell foundation model field in two key respects: its emphasis on interpretable, mechanistically grounded gene network inference (rather than only cell embeddings), and its commitment to standardized benchmarking through BenGRN and GrnnData. By releasing both the 100M-parameter model weights and the benchmarking infrastructure, the Cantini lab provides the community with tools to objectively evaluate future methods, a contribution that may have lasting influence beyond the model itself. A successor model, scPRINT-2, pre-trained on 350 million cells across 16 organisms, has already been reported as a preprint, suggesting active development. One current limitation is the fixed 2,200-gene context window: cells expressing more genes than the window covers have their profiles truncated, and gene network predictions remain most reliable for cell types that are well represented in the training corpus.
Kalfon, J., et al. (2025) scPRINT: pre-training on 50 million cells allows robust gene network predictions. Nature Communications.
DOI: 10.1038/s41467-025-58699-1