bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 bio.rodeo. All rights reserved.
Single-cell

scPRINT

Institut Pasteur / CNRS

Foundation model pre-trained on 50 million single cells for robust gene network inference, with zero-shot denoising, batch correction, and cell type prediction.

Released: 2025

Overview

scPRINT (Single-cell PRe-trained INference with Transformers) is a large-scale foundation model for single-cell RNA sequencing analysis, developed by Jérémie Kalfon, Jules Samaran, Gabriel Peyré, and Laura Cantini at the Institut Pasteur and CNRS in Paris. Published in Nature Communications in April 2025, it was pre-trained on over 50 million cells drawn from the CellxGene database, representing approximately 80 billion tokens across 548 datasets spanning human and mouse primary tissues, diverse diseases, sequencing platforms, and demographic groups.

The model's central innovation is its ability to infer cell-specific, genome-wide gene regulatory networks by leveraging the attention matrices of a bidirectional transformer — an approach that is both interpretable and computationally tractable at atlas scale. Where traditional gene regulatory network (GRN) inference methods are computationally prohibitive for modern million-cell datasets, scPRINT can generate genome-wide networks for up to 10,000 cells in minutes on commodity hardware.
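The core idea of reading a network out of attention weights can be illustrated with a toy sketch: average the attention matrices of a few selected heads into a single gene-by-gene score matrix, then keep only the strongest entries as edges. (This is a minimal illustration of the concept; scPRINT's actual head selection, aggregation, and filtering procedure is more involved.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: attention weights from H heads over G genes (illustrative only;
# scPRINT's actual head-selection and aggregation procedure is more involved).
G, H = 6, 4
attn = rng.random((H, G, G))
attn /= attn.sum(axis=-1, keepdims=True)  # each row behaves like a softmax distribution

# Aggregate a subset of heads into a single gene-gene score matrix.
selected_heads = [0, 2]
scores = attn[selected_heads].mean(axis=0)

# Keep only the strongest entries as inferred edges (top 20% of scores).
threshold = np.quantile(scores, 0.8)
grn = (scores >= threshold).astype(int)
np.fill_diagonal(grn, 0)  # drop self-loops

print(grn.sum(), "edges retained")
```

Because the aggregation is a pass over already-computed attention weights rather than a separate optimization, this style of extraction scales with a forward pass, which is what makes atlas-scale inference tractable.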

scPRINT also demonstrates strong zero-shot performance across several ancillary tasks — including expression denoising, batch effect correction, and cell type prediction — without any task-specific fine-tuning, reflecting the breadth of biological information captured during pre-training.

Key Features

  • Atlas-scale gene network inference: Extracts cell-type-specific and cell-specific gene regulatory networks by aggregating transformer attention matrices, recovering 67% more connections than GENIE3 on benchmark datasets and outperforming scGPT, Geneformer v2, and DeepSEM on gene network benchmarks.

  • Multi-objective pre-training: Trained jointly on three objectives — expression denoising (with 60% simulated transcript dropout), bottleneck cell embedding reconstruction, and hierarchical label prediction — enabling the model to learn complementary representations of cell state, identity, and regulatory structure.

  • Multi-modal expression encoder: Each gene is represented by three fused embeddings: ESM2 protein language model features (providing evolutionary and functional context), genomic positional encodings (capturing chromosomal organization), and expression-level tokenization via a two-layer MLP.

  • Disentangled cell embeddings: The classification decoder produces separate embeddings for cell type, disease, organism, sequencing platform, ethnicity, and sex, enabling interpretable downstream analysis and zero-shot transfer across experimental conditions.

  • Competitive zero-shot capabilities: Achieves 62% zero-shot accuracy on a multi-batch pancreas cell type classification task, matches MAGIC and KNNsmoothing2 on expression denoising, and ranks as the top unsupervised method for batch correction without requiring batch label information.

  • Scalable model family: Available from 2 million to 100 million parameters, with the medium-scale model trainable on a single A40 GPU in approximately 48 hours, lowering the barrier for laboratory-scale deployment.
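The denoising objective above can be made concrete with a toy simulation: transcripts are randomly dropped from a cell's count vector, and the model is trained to map the downsampled input back to the original profile. (A sketch of the idea only; scPRINT's exact downsampling scheme may differ, and the rescaling "denoiser" here is just a trivial baseline, not the model.)

```python
import numpy as np

rng = np.random.default_rng(1)

# "Clean" expression profile for one cell over 10 genes.
true_counts = rng.poisson(lam=5.0, size=10)

# Simulated dropout: each transcript is kept independently with probability 0.4,
# i.e. roughly 60% of transcripts are dropped.
keep_prob = 0.4
noisy_counts = rng.binomial(true_counts, keep_prob)

# The training pair is (input=noisy_counts, target=true_counts); a trivial
# baseline "denoiser" just rescales by the expected keep probability.
denoised = noisy_counts / keep_prob

print("dropped fraction:", 1 - noisy_counts.sum() / true_counts.sum())
```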

Technical Details

scPRINT is built on a bidirectional multi-head transformer backbone accelerated with FlashAttention2, operating over a context window of 2,200 expressed genes, which is sufficient to cover the full expression profile of more than 80% of cells in CellxGene. The expression encoder integrates three per-gene embeddings: ESM2 representations from a 650-million parameter protein language model, genomic positional encodings reflecting chromosomal coordinates, and count-normalized expression values processed through a two-layer MLP. The decoder produces parameters for a zero-inflated negative binomial distribution, appropriately modeling the sparse and overdispersed nature of scRNA-seq count data. A separate classification decoder generates the six disentangled cell-level embeddings.
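The zero-inflated negative binomial likelihood the decoder parameterizes can be written out directly. A scalar sketch of the standard ZINB parameterization for scRNA-seq counts (mean `mu`, inverse-dispersion `theta`, zero-inflation weight `pi`; this is the textbook form, not scPRINT's decoder code):

```python
import math

def zinb_log_prob(y: int, mu: float, theta: float, pi: float) -> float:
    """Log-probability of count y under a zero-inflated negative binomial.

    mu: NB mean, theta: NB inverse-dispersion, pi: zero-inflation weight.
    Sketch of the standard ZINB parameterization used for scRNA-seq counts.
    """
    # NB log-pmf in the (mu, theta) parameterization.
    log_nb = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    if y == 0:
        # A zero can come from the dropout component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(log_nb))
    return math.log(1.0 - pi) + log_nb

print(zinb_log_prob(0, mu=2.0, theta=1.0, pi=0.3))
print(zinb_log_prob(3, mu=2.0, theta=1.0, pi=0.3))
```

The extra zero-inflation component is what lets the model treat an observed zero as either a true absence of expression or a technical dropout, which is the key mismatch between raw Gaussian or Poisson losses and real scRNA-seq data.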

Gene regulatory networks are extracted post-hoc by aggregating attention head weights across selected heads of the pre-trained transformer, then filtering for transcription factor-to-gene connections. Across independent benchmarks including Omnipath literature-curated networks, cell-type-specific ground truth from embryonic stem cell perturbation data, and genome-wide Perturb-seq experiments, scPRINT consistently outperforms competing methods. The model was pre-trained on 54,084,961 cells from 548 CellxGene datasets and is available open-source together with BenGRN, a dedicated benchmarking suite for gene regulatory network inference released with the model.
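The kind of evaluation such benchmarks perform reduces to comparing a predicted edge set against a curated ground-truth edge set. A toy sketch (the edge names here are made up; BenGRN's actual metrics and reference networks are far more extensive):

```python
# Compare predicted TF->gene edges against a ground-truth set and report
# precision/recall, the basic ingredients of GRN benchmark scores.
predicted = {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC")}
ground_truth = {("TF1", "geneA"), ("TF2", "geneC"), ("TF3", "geneD")}

true_positives = predicted & ground_truth
precision = len(true_positives) / len(predicted)
recall = len(true_positives) / len(ground_truth)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```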

Applications

scPRINT is designed for researchers working with large single-cell transcriptomic datasets who need to move beyond cell clustering toward mechanistic understanding of gene regulation. Primary use cases include inferring cell-type-specific gene regulatory networks from atlas-scale data, denoising sparse count matrices to improve downstream analysis, integrating data across batches and experimental conditions without requiring batch labels, and annotating cells across diverse tissues and organisms in zero-shot settings. The model is particularly valuable for studies of transcription factor activity, cell fate decisions, and disease-associated regulatory rewiring, where interpretable network structure is as important as predictive accuracy.
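Zero-shot annotation from fixed cell embeddings can be as simple as nearest-centroid matching against a labeled reference. A toy NumPy sketch (the embeddings, labels, and cluster structure here are entirely synthetic, and scPRINT's own annotation head is a trained hierarchical classifier, not this heuristic):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic reference: embeddings for two cell types clustered around
# different centers (stand-ins for foundation-model cell embeddings).
ref_embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),   # "T cell"-like cluster
    rng.normal(loc=1.0, scale=0.1, size=(20, 8)),   # "B cell"-like cluster
])
ref_labels = np.array(["T cell"] * 20 + ["B cell"] * 20)

# Per-type centroids from the reference.
types = np.unique(ref_labels)
centroids = np.stack([ref_embeddings[ref_labels == t].mean(axis=0) for t in types])

# Annotate a query cell by its nearest centroid.
query = rng.normal(loc=1.0, scale=0.1, size=8)      # looks "B cell"-like
distances = np.linalg.norm(centroids - query, axis=1)
predicted_type = types[distances.argmin()]

print(predicted_type)
```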

Impact

scPRINT addresses a long-standing bottleneck in single-cell genomics: scaling gene regulatory network inference to the size and diversity of modern cell atlases. Its publication in Nature Communications and the concurrent release of the BenGRN benchmarking suite have provided the field with both a high-performing model and a rigorous evaluation framework for comparing GRN inference methods. The approach of extracting networks from transformer attention matrices — rather than treating GRN inference as a separate downstream task — offers a conceptually new direction for interpretable foundation models in genomics. A successor model, scPRINT-2, has already been reported as a preprint (December 2025), pre-trained on 350 million cells across 16 organisms, suggesting the framework is actively being extended. A key current limitation is that attention-based network extraction captures correlational gene co-regulation patterns and does not guarantee causal directionality, requiring orthogonal experimental validation for mechanistic claims.

Citation

scPRINT: pre-training on 50 million cells allows robust gene network predictions

Kalfon, J., Samaran, J., Peyré, G., & Cantini, L. (2025). scPRINT: pre-training on 50 million cells allows robust gene network predictions. Nature Communications, 16(1), 3607.

DOI: 10.1038/s41467-025-58699-1

Metrics

GitHub

Stars: 146
Forks: 22
Open Issues: 0
Contributors: 4
Last Push: 2mo ago
Language: Jupyter Notebook
License: GPL-3.0

Citations

Total Citations: 37
Influential: 2
References: 143

Tags

gene network, foundation model

Resources

GitHub Repository
Research Paper