Transformer-based contrastive pretraining framework that learns technology-agnostic single-cell representations by contrasting cell views instead of reconstructing gene expression.
scConcept (Single-cell Contrastive Cell Pre-training) is a transformer-based foundation model for single-cell transcriptomics developed by the Theis Lab at Helmholtz Munich and the Technical University of Munich. Introduced in a 2025 bioRxiv preprint, it is designed to produce robust, technology-agnostic representations of individual cells that generalize across the diverse count distributions and gene panels generated by different sequencing assays and platforms.
The work targets a specific weakness in most existing single-cell foundation models. Models such as scGPT and Geneformer borrow the masked-language-modeling and gene-level reconstruction objectives popularized in natural language processing, asking the network to predict masked or perturbed gene expression values. The scConcept authors argue that this reconstruction objective is poorly aligned with the actual downstream goal of single-cell pretraining, which is to learn high-quality cell-level embeddings rather than to recover gene counts. In their view, optimizing for accurate reconstruction can spend model capacity on assay-specific noise and count statistics that do not transfer across technologies.
scConcept replaces reconstruction with a cell-level identification task drawn from contrastive learning. The model generates multiple augmented views of the same cell and learns to recognize which views originate from the same underlying cell while distinguishing them from other cells in the batch. This directly optimizes the geometry of the embedding space, encouraging representations that capture cell identity while remaining invariant to the technical variation introduced by different protocols and gene panels.
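The identification task described above can be illustrated with a generic symmetric InfoNCE loss over two views of each cell. This is a minimal sketch, not the authors' implementation: the function name `info_nce`, the temperature of 0.1, and the noise-based toy views are all assumptions made for illustration, and the source does not specify how scConcept builds its views or which contrastive loss it uses.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(views_a, views_b, temperature=0.1):
    """Symmetric InfoNCE: row i of views_a and row i of views_b are the same cell.

    Hypothetical helper for illustration; the temperature value is an assumption.
    """
    a = [l2_normalize(v) for v in views_a]
    b = [l2_normalize(v) for v in views_b]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between view A of cell i and view B of cell j
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        # The correct "class" for row i is column i (the matching view).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the A-to-B and B-to-A identification losses.
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


# Toy data: 8 "cells" with 16-dimensional embeddings; views are small perturbations.
rng = random.Random(0)
cells = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
view_a = [[x + 0.01 * rng.gauss(0, 1) for x in cell] for cell in cells]
view_b = [[x + 0.01 * rng.gauss(0, 1) for x in cell] for cell in cells]

aligned_loss = info_nce(view_a, view_b)         # views paired with the right cell
shuffled_loss = info_nce(view_a, view_b[::-1])  # every view paired with a wrong cell
print(aligned_loss, shuffled_loss)
```

The loss is low when matching views sit closer to each other than to any other cell in the batch and high when the pairing is scrambled, which is the sense in which such an objective shapes the embedding geometry directly.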
scConcept is a transformer encoder trained with a self-supervised contrastive identification objective. Two pretrained checkpoints are released. The flagship corpus360M[multi-species]-model170M has 170M parameters across 16 transformer layers, with a hidden dimension of 1024, 16 attention heads, and a maximum input length of 20,000 tokens; it is trained on roughly 360 million cells drawn from CellxGene (2026) and scBaseCount (2025), spanning 16 species, and targets cross-species applications. The smaller corpus40M-model30M has 30M parameters across 8 layers, with a hidden dimension of 512, 8 attention heads, and a maximum input length of 1,000 tokens; it is trained on roughly 40 million human cells from CellxGene (2023) and is recommended as the default for embedding extraction and lightweight adaptation. The implementation requires Python 3.12+ and optionally supports Flash Attention for accelerated training and inference.
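For side-by-side comparison, the two checkpoints can be recorded as plain data. The `CheckpointSpec` class and the `CHECKPOINTS` dictionary below are hypothetical names used only for this sketch and are not part of the scConcept package; the numeric values are the ones stated above, and the per-head dimension is derived from them by simple division.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointSpec:
    """Hypothetical record of the published hyperparameters (not a scConcept API)."""
    name: str
    n_params_millions: int
    n_layers: int
    hidden_dim: int
    n_heads: int
    max_tokens: int

    @property
    def head_dim(self) -> int:
        # Per-head width in standard multi-head attention: hidden_dim / n_heads.
        return self.hidden_dim // self.n_heads


CHECKPOINTS = {
    spec.name: spec
    for spec in (
        CheckpointSpec("corpus360M[multi-species]-model170M", 170, 16, 1024, 16, 20_000),
        CheckpointSpec("corpus40M-model30M", 30, 8, 512, 8, 1_000),
    )
}

for spec in CHECKPOINTS.values():
    print(f"{spec.name}: {spec.n_layers} layers, head_dim={spec.head_dim}, "
          f"max_tokens={spec.max_tokens}")
```

Both configurations work out to the same per-head width (1024 / 16 = 512 / 8 = 64), so the two models differ in depth, width, and context length rather than in head size.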
scConcept is intended for extracting embeddings from scRNA-seq data, for fine-tuning and adaptation to specialized tasks, and for use as a backbone in downstream single-cell analyses such as cell-type annotation, clustering, and dataset integration. Because its representations are designed to be technology-agnostic, it is well suited to building or querying cell atlases assembled from heterogeneous sources, where datasets differ in sequencing platform, gene panel, and count depth. The multi-species checkpoint additionally supports cross-species analysis and label transfer, while the smaller checkpoint serves researchers who need fast embeddings under modest compute budgets.
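One common way to use cell embeddings for annotation or label transfer is a nearest-neighbor vote in embedding space. The sketch below assumes the embeddings have already been computed by some model; the two-dimensional vectors, the labels, and the helper `knn_label_transfer` are made up for illustration and do not come from scConcept or its API.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def knn_label_transfer(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Assign each query cell the majority label of its k most similar reference cells."""
    predictions = []
    for query in query_embeddings:
        # Rank reference cells from most to least similar to this query cell.
        ranked = sorted(
            range(len(ref_embeddings)),
            key=lambda i: cosine_similarity(query, ref_embeddings[i]),
            reverse=True,
        )
        votes = Counter(ref_labels[i] for i in ranked[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy reference atlas: placeholder 2-D embeddings standing in for model output.
ref_embeddings = [
    [1.0, 0.1], [0.9, 0.0], [0.8, 0.2],   # annotated as T cells
    [0.1, 1.0], [0.0, 0.9], [0.2, 0.8],   # annotated as B cells
]
ref_labels = ["T cell"] * 3 + ["B cell"] * 3
query_embeddings = [[0.95, 0.05], [0.05, 0.95]]

predicted = knn_label_transfer(ref_embeddings, ref_labels, query_embeddings, k=3)
print(predicted)
```

A scheme like this works only to the extent that cells of the same type land near each other regardless of assay, which is why technology-agnostic embeddings matter when the reference and query come from different platforms.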
scConcept contributes to an ongoing reassessment of which pretraining objectives are appropriate for single-cell foundation models. By showing that a contrastive cell-identification task can replace the dominant reconstruction objective and yield representations that generalize across technologies, it challenges the assumption that masked-language-modeling recipes from NLP transfer cleanly to single-cell data. Because the work is a recent preprint, its empirical standing relative to established models such as scGPT, Geneformer, and scVI is still being evaluated by the community, and its conclusions await peer review. The release of open weights for both a large multi-species model and a compact human model lowers the barrier to adoption and independent benchmarking.
Bahrami, M., et al. (2025) scConcept: Contrastive pretraining for technology-agnostic single-cell representations beyond reconstruction. bioRxiv.
DOI: 10.1101/2025.10.14.682419