Single-cell foundation model pre-trained on 50 million cells for gene network inference, denoising, and cell type prediction.
scPRINT is a large-scale single-cell foundation model developed by Laura Cantini's group at the Institut Pasteur, trained on over 50 million cells drawn from the cellxgene database. Published in Nature Communications in April 2025, the model was designed to address a central challenge in single-cell transcriptomics: inferring the gene regulatory networks that drive cellular identity and disease, while simultaneously handling the pervasive technical noise and batch effects that make single-cell RNA-seq data difficult to interpret across studies.
The model's defining capability is cell-type-specific, genome-wide gene network inference at scale — a task where scPRINT substantially outperforms previous approaches including GENIE3, scGPT, Geneformer v2, and DeepSEM. Rather than requiring fine-tuning for each new dataset, scPRINT performs many of its core tasks in zero-shot mode, applying learned representations directly to novel cell populations and tissue contexts. This makes the model broadly useful across the long tail of biological questions where annotated training data is scarce.
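Network inference of this kind is typically derived from the transformer's gene-gene attention maps. As a toy illustration only (the names, head aggregation, and thresholding here are simplifications, not scPRINT's actual post-processing), averaged attention can be collapsed into a sparse gene adjacency matrix:

```python
import numpy as np

def attention_to_grn(attn, threshold=0.01):
    """Collapse per-head gene-gene attention into a binary adjacency matrix.

    attn: array of shape (heads, genes, genes) holding attention weights.
    Toy sketch of attention-based network extraction; scPRINT's real
    pipeline selects and combines heads more carefully.
    """
    mean_attn = attn.mean(axis=0)       # average attention over heads
    np.fill_diagonal(mean_attn, 0.0)    # drop self-edges
    return (mean_attn > threshold).astype(int)

# Two heads, three genes, one strong regulatory edge gene0 -> gene1
attn = np.zeros((2, 3, 3))
attn[:, 0, 1] = 0.5
adjacency = attention_to_grn(attn, threshold=0.1)
```

Thresholding is the simplest possible edge-calling rule; in practice one would calibrate it per cell type against a ground-truth benchmark such as BenGRN.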
Alongside the model, the authors release two benchmarking resources: BenGRN for gene regulatory network evaluation and GrnnData for packaging gene network datasets, both designed to standardize how the field compares network inference methods. These infrastructure contributions reflect a broader commitment to reproducibility and community benchmarking that distinguishes the scPRINT release from prior single-cell foundation models.
scPRINT is a bidirectional multi-head transformer encoder with a maximum context window of 2,200 genes, a value chosen to cover approximately 80% of cells in the cellxgene database without truncation. Model scales range from 2 million parameters (small) to 100 million parameters (large). The encoder is coupled to two decoder heads: a zero-inflated negative binomial (ZiNB) decoder for expression reconstruction and a hierarchical classification decoder predicting cell type (~400 leaf labels), disease, sequencer, ethnicity, sex, and organism. Training used FlashAttention2 throughout for efficient attention computation.
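The ZiNB decoder models each gene's count as a mixture of a structural-zero (dropout) component and a negative binomial. A minimal numerical sketch of the ZiNB log-likelihood (parameter names `mu`, `theta`, `pi` are generic, not scPRINT's internal variables):

```python
import numpy as np
from scipy.special import gammaln

def zinb_log_likelihood(x, mu, theta, pi):
    """Log-likelihood of counts x under a zero-inflated negative binomial.

    mu: NB mean, theta: inverse dispersion, pi: zero-inflation probability.
    Conceptual sketch of the decoder's output distribution, not the
    authors' implementation.
    """
    eps = 1e-8
    # Negative binomial log-probability, mean/dispersion parameterization
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # A zero can come from dropout (prob pi) or from the NB itself
    log_zero = np.logaddexp(np.log(pi + eps),
                            np.log(1 - pi + eps)
                            + theta * np.log(theta / (theta + mu)))
    log_nonzero = np.log(1 - pi + eps) + log_nb
    return np.where(x == 0, log_zero, log_nonzero)
```

Training minimizes the negative of this quantity summed over the genes in each cell's context window.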
Pre-training across 54,084,961 cells and approximately 80 billion tokens involved three concurrent tasks: a denoising objective that recovers true expression from 60%-downsampled profiles using a zero-inflated Poisson formulation; a bottleneck learning task that compresses cell embeddings to enable reconstruction without expression values; and the hierarchical classification task described above. Weighted random sampling with a factor of 50 for rare cell types mitigated class imbalance. Zero-shot cell type classification on the pancreas multi-batch benchmark achieved 62% accuracy with macro-F1 scores comparable to supervised state-of-the-art methods.
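The input corruption behind the denoising objective can be pictured as binomial thinning: each transcript is kept independently with probability 1 minus the downsampling rate, mimicking a shallower sequencing of the same cell. An illustrative sketch (not scPRINT's exact corruption code):

```python
import numpy as np

def downsample_counts(counts, rate=0.6, rng=None):
    """Binomially thin a count matrix, discarding ~`rate` of total counts.

    Each transcript survives independently with probability 1 - rate,
    so the expected total depth drops by `rate`. Illustrative of the
    pre-training corruption, not the authors' implementation.
    """
    rng = rng or np.random.default_rng(0)
    return rng.binomial(counts, 1.0 - rate)

# The model learns to reconstruct `profile` from the thinned `noisy` input
profile = np.array([[10, 0, 5, 120, 3]])
noisy = downsample_counts(profile, rate=0.6)
```

Because thinning never creates counts that were not observed, the noisy profile is elementwise bounded by the original, which makes the reconstruction target well defined.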
scPRINT is particularly well-suited for studies requiring interpretable gene regulatory network reconstruction across many cell types simultaneously — for example, comparing healthy and diseased regulatory programs in patient biopsies or atlases. The model's zero-shot capabilities make it applicable to new tissues and species without retraining. In a demonstrated use case on benign prostatic hyperplasia, scPRINT identified rare B-cell populations with early tumor microenvironment markers and revealed a PAGE4 hub gene linking fibroblast senescence to chronic inflammation via metal and ion exchange pathways. Researchers working in functional genomics, network biology, and disease mechanism discovery will find the batch correction and denoising outputs useful as preprocessing steps before downstream statistical analysis.
scPRINT advances the single-cell foundation model field in two key respects: its emphasis on interpretable, mechanistically grounded gene network inference (rather than only cell embeddings), and its commitment to standardized benchmarking through BenGRN and GrnnData. By releasing both the 100M-parameter model weights and the benchmarking infrastructure, the Cantini lab provides the community with tools to objectively evaluate future methods, a contribution that may have lasting influence beyond the model itself. A successor model, scPRINT-2, pre-trained on 350 million cells across 16 organisms, has already been reported as a preprint, suggesting active development. One current limitation is the fixed 2,200-gene context window: cells expressing more genes than the window covers have their profiles truncated, and gene network predictions remain most reliable for cell types that are well represented in the training corpus.
Kalfon, J., et al. (2025) scPRINT: pre-training on 50 million cells allows robust gene network predictions. Nature Communications.
DOI: 10.1038/s41467-025-58699-1