bio.rodeo
The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Single-cell

scFoundation

Biomap Research

A 100M-parameter foundation model trained on 50M+ human single-cell transcriptomic profiles, achieving state-of-the-art performance across diverse downstream tasks.

Released: 2024
Parameters: 100,000,000

Overview

scFoundation (also referred to as xTrimoscFoundation-alpha) is a pretrained foundation model for single-cell transcriptomics, developed by Biomap Research and published in Nature Methods in June 2024. The model was designed to address a persistent challenge in single-cell biology: no single model had been trained at a scale large enough to capture the full diversity of human cell types and gene co-expression programs across tissues and experimental conditions.

With 100 million parameters trained on more than 50 million human single-cell RNA sequencing (scRNA-seq) profiles covering approximately 19,264 genes, scFoundation establishes a new scale benchmark for this domain. Rather than learning task-specific representations, it learns general-purpose gene and cell embeddings through a self-supervised pretraining objective that generalizes across downstream analyses without requiring fine-tuning.

The model builds on the xTrimoGene architecture, which was developed internally at Biomap Research as an efficient backbone for high-dimensional transcriptomics data. Its release was accompanied by open-source code, pretrained weights, and an online inference API, making it accessible to researchers without access to large compute resources.

Key Features

  • Scale: 100 million trainable parameters trained on over 50 million human single-cell transcriptomic profiles, covering 19,264 genes — among the largest models of its kind at time of publication.
  • Asymmetric transformer design: The underlying xTrimoGene architecture uses an asymmetric transformer structure optimized for the high-dimensional, sparse nature of scRNA-seq count matrices, enabling efficient scaling without sacrificing representational capacity.
  • Read-depth-aware pretraining: A specialized pretraining objective links cells with different sequencing depths, directly modeling technical variation that commonly confounds cross-dataset analyses.
  • Dual-level embeddings: The model produces both cell-level and gene-level representations in a single forward pass, supporting a broad range of downstream tasks that require either cellular or molecular-level understanding.
  • Zero-shot transfer: Strong performance on multiple benchmarks is achievable without task-specific fine-tuning, demonstrating that the learned representations generalize across cell types, tissues, and experimental platforms.
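The dual-level output described above can be pictured with a toy sketch (random stand-in values, not real model outputs; the 512-dimensional size follows the technical details below, and mean-pooling is just one common way to derive a cell vector from per-gene vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, dim = 6, 512  # toy gene count; embedding dimension per the model card

# Per-gene contextual embeddings, as a transformer encoder would emit them.
gene_embeddings = rng.normal(size=(n_genes, dim))

# A cell-level embedding can be derived by pooling the gene-level vectors,
# so one forward pass yields both molecular- and cellular-level views.
cell_embedding = gene_embeddings.mean(axis=0)

assert gene_embeddings.shape == (n_genes, dim)
assert cell_embedding.shape == (dim,)
```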

Technical Details

scFoundation is built on the xTrimoGene asymmetric transformer architecture, which processes gene expression profiles as sparse token sequences in which each expressed gene is treated as a token. The asymmetric encoder-decoder design keeps full self-attention only over the nonzero (expressed) gene tokens in the encoder, while a lightweight decoder restores values across the full space of approximately 20,000 genes, covering the whole gene vocabulary without prohibitive memory requirements.
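The sparse-token idea can be illustrated with a minimal sketch (a toy preprocessing step for illustration only, not the model's actual pipeline; the gene indices here are hypothetical):

```python
import numpy as np

def to_sparse_tokens(expr: np.ndarray):
    """Reduce a dense expression vector to (gene_index, value) tokens,
    keeping only the genes that are actually expressed (nonzero)."""
    idx = np.flatnonzero(expr)   # indices of expressed genes
    return idx, expr[idx]        # token ids and their expression values

# A toy profile over 8 genes; real profiles span ~19,264 genes,
# most of which are zero in any given cell.
profile = np.array([0.0, 3.0, 0.0, 0.0, 1.5, 0.0, 7.0, 0.0])
gene_ids, values = to_sparse_tokens(profile)

assert gene_ids.tolist() == [1, 4, 6]
assert values.tolist() == [3.0, 1.5, 7.0]
```

Because scRNA-seq matrices are overwhelmingly sparse, attending only over the expressed tokens is what makes full gene-space coverage computationally tractable.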

Pretraining used a read-depth-aware masked prediction objective: a cell's raw expression profile is paired with a computationally downsampled version of itself, and the model is trained to recover masked expression values at the higher target depth, teaching the network to disentangle biological signal from technical depth confounders. Training data comprised more than 50 million human scRNA-seq profiles spanning multiple tissues, cell types, and experimental platforms, sourced from public repositories. The model produces 512-dimensional embeddings for both cells and individual genes.
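One standard way to construct such depth-linked pairs is binomial thinning of raw counts; the sketch below shows the general idea of pairing a shallow and a deep view of the same cell, not the paper's exact procedure:

```python
import numpy as np

def downsample_counts(counts: np.ndarray, rate: float, rng=None) -> np.ndarray:
    """Binomially thin a raw count vector: each sequenced read is kept
    with probability `rate`, mimicking a shallower sequencing run."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.binomial(counts, rate)

raw = np.array([10, 0, 4, 25, 1])          # toy raw counts for 5 genes
shallow = downsample_counts(raw, rate=0.3)  # same cell, lower "depth"

# The (shallow, raw) pair links two sequencing depths of one cell:
# a depth-aware objective asks the model to recover `raw` from `shallow`.
assert np.all(shallow <= raw)
assert shallow.sum() <= raw.sum()
```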

Benchmark evaluations reported in the Nature Methods paper cover seven downstream tasks. scFoundation achieved state-of-the-art results in gene expression enhancement, drug response prediction (using the DeepCDR framework), single-cell drug sensitivity prediction (via SCAD), perturbation response prediction (with GEARS), cell type annotation, gene module inference, and cross-dataset cell mapping. Performance gains were consistent across tasks, suggesting the representations are broadly useful rather than specialized to a narrow set of problems.

Applications

scFoundation is designed as a general-purpose embedding engine for single-cell transcriptomics workflows:

  • Computational biologists can use cell-level embeddings for clustering, trajectory inference, and dataset integration without retraining.
  • Cancer researchers can plug scFoundation embeddings into drug sensitivity prediction frameworks to forecast IC50 values at bulk or single-cell resolution.
  • Functional genomics groups can use perturbation prediction pipelines to prioritize genetic or chemical interventions before running pooled screens.
  • Cell atlas projects can use the cross-dataset cell mapping capability to harmonize profiles across multiple studies.

Because the model exposes both cell and gene embeddings, it is also suited for gene module discovery and regulatory network inference.
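As a sketch of cross-dataset cell mapping on top of precomputed cell embeddings (the embeddings below are random stand-ins; obtaining real ones requires the released model or its inference API), query cells can be matched to their nearest reference cells by cosine similarity:

```python
import numpy as np

def map_cells(query: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """For each query embedding, return the index of the most
    cosine-similar reference embedding."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (q @ r.T).argmax(axis=1)

rng = np.random.default_rng(42)
reference = rng.normal(size=(100, 512))  # 100 reference cells, 512-dim embeddings
# Three query cells that are slightly perturbed copies of reference cells:
query = reference[[3, 17, 56]] + 0.01 * rng.normal(size=(3, 512))

assert map_cells(query, reference).tolist() == [3, 17, 56]
```

In practice the reference labels (cell types, study of origin) would be transferred along the matched indices to harmonize annotations across datasets.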

Impact

scFoundation's publication in Nature Methods in 2024 helped establish the viability of large-scale pretraining for single-cell transcriptomics, a domain where prior foundation models had been limited by smaller training corpora and narrower gene coverage. The model's strong zero-shot performance across seven distinct task categories demonstrated that scale and pretraining objective design could produce representations that rival or exceed task-specific supervised models. One notable limitation is that scFoundation is trained exclusively on human data, limiting direct application to non-human organisms without retraining or adaptation. The model also does not currently incorporate spatial transcriptomics or multi-omic modalities, which are active frontiers in the field. Nonetheless, its open release and demonstrated generalizability have made it a reference point for subsequent work on single-cell foundation models.

Citation

Large-scale foundation model on single-cell transcriptomics

Hao, M., Gong, J., Zeng, X., Liu, C., Guo, Y., Cheng, X., Wang, T., Ma, J., Zhang, X., & Song, L. (2024). Large-scale foundation model on single-cell transcriptomics. Nature Methods, 21(8), 1481-1491.

DOI: 10.1038/s41592-024-02305-7

Metrics

GitHub

Stars: 405
Forks: 71
Open Issues: 31
Contributors: 3
Last Push: 5mo ago
Language: Jupyter Notebook
License: Apache-2.0

Citations

Total Citations: 465
Influential: 21
References: 100

Tags

foundation model, transcriptomics

Resources

GitHub Repository
Research Paper
Official Website