scConcept

A single-cell foundation model that learns technology-agnostic cell embeddings by contrasting views of the same cell rather than reconstructing gene expression counts.

Released: October 2025

Parameters: 170 Million

scConcept (Single-cell Contrastive Cell Pre-training) is a transformer-based foundation model for single-cell transcriptomics developed by the Theis Lab at Helmholtz Munich and the Technical University of Munich. Introduced in a 2025 bioRxiv preprint, it is designed to produce robust, technology-agnostic representations of individual cells that generalize across the diverse count distributions and gene panels generated by different sequencing assays and platforms.

The work targets a specific weakness in most existing single-cell foundation models. Models such as scGPT and Geneformer borrow the masked-language-modeling objective popularized in natural language processing, asking the network to recover masked gene tokens or expression values. The scConcept authors argue that this reconstruction objective is poorly aligned with the actual downstream goal of single-cell pretraining, which is to learn high-quality cell-level embeddings rather than to recover gene counts. In their view, optimizing for accurate reconstruction can spend model capacity on assay-specific noise and count statistics that do not transfer across technologies.

scConcept replaces reconstruction with a cell-level identification task drawn from contrastive learning. The model generates multiple augmented views of the same cell and learns to recognize which views originate from the same underlying cell while distinguishing them from other cells in the batch. This directly optimizes the geometry of the embedding space, encouraging representations that capture cell identity while remaining invariant to the technical variation introduced by different protocols and gene panels.
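The identification task described above can be illustrated with a symmetric InfoNCE loss, a common formulation in contrastive learning. This is a minimal NumPy sketch under stated assumptions: the source does not say which contrastive loss, temperature, or batch construction scConcept uses, and the function name `info_nce_loss` and the toy embeddings are hypothetical.

```python
import numpy as np


def info_nce_loss(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss for two batches of embeddings of shape (n_cells, dim).

    Assumption: row i of view_a and row i of view_b embed two views of the same cell;
    every other row in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature  # (n_cells, n_cells) similarity matrix

    def diagonal_cross_entropy(z: np.ndarray) -> float:
        # Cross-entropy where the correct "class" for row i is column i.
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the a->b and b->a identification directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


rng = np.random.default_rng(0)
cells = rng.normal(size=(8, 16))  # stand-in embeddings, not model output

# Matching views: the second view is a slightly perturbed copy of the first.
aligned = info_nce_loss(cells, cells + 0.01 * rng.normal(size=cells.shape))
# Mismatched views: each row is paired with a different cell.
mismatched = info_nce_loss(cells, np.roll(cells, 1, axis=0))
print(f"aligned={aligned:.3f} mismatched={mismatched:.3f}")
```

The loss is low when each cell's two views are closest to each other and high when they are not, which is the sense in which a contrastive objective shapes the geometry of the embedding space directly.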

Key Features

Contrastive cell-level objective: Instead of reconstructing gene counts, scConcept contrasts multiple views of each cell, directly shaping a cell embedding space rather than optimizing a proxy reconstruction loss.
Technology-agnostic representations: Training across varied count distributions and gene panels yields embeddings that transfer between assays and platforms without per-dataset retuning.
Multi-species cross-assay model: The flagship checkpoint is pretrained on 16 species, supporting cross-species transfer in addition to human-only analysis.
Two released checkpoints: A 170M-parameter multi-species model and a lighter 30M-parameter human model give users a trade-off between coverage and compute footprint.
Open weights and code: Pretrained models are distributed on HuggingFace and source code is available on GitHub for embedding extraction, fine-tuning, and large-scale pretraining.
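To make the idea of "views" concrete, one plausible way to simulate differing gene panels and count depths is to subsample genes and binomially thin the counts. This is only a sketch: the source does not describe scConcept's actual augmentations, and `make_view`, `keep_gene_fraction`, and `depth_fraction` are hypothetical names chosen for illustration.

```python
import numpy as np


def make_view(counts: np.ndarray, rng: np.random.Generator,
              keep_gene_fraction: float = 0.5, depth_fraction: float = 0.5):
    """Return (gene_indices, thinned_counts) for one augmented view of a cell.

    Assumed augmentations (not taken from the source):
      * random gene-panel subsetting, mimicking assays that measure fewer genes
      * binomial thinning, mimicking a shallower sequencing depth
    """
    n_genes = counts.shape[0]
    n_keep = max(1, int(round(keep_gene_fraction * n_genes)))
    genes = np.sort(rng.choice(n_genes, size=n_keep, replace=False))
    # Each observed molecule is kept independently with probability depth_fraction.
    thinned = rng.binomial(counts[genes], depth_fraction)
    return genes, thinned


rng = np.random.default_rng(0)
cell = rng.poisson(2.0, size=200)  # toy count vector for a single cell

genes_a, counts_a = make_view(cell, rng)
genes_b, counts_b = make_view(cell, rng)
print(len(genes_a), counts_a.sum(), len(genes_b), counts_b.sum())
```

Two such views share the same underlying cell but differ in panel and depth, so a model trained to match them has an incentive to ignore exactly that kind of technical variation.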

Technical Details

scConcept is a transformer encoder trained with a self-supervised contrastive identification objective. Two pretrained checkpoints are released:

corpus360M[multi-species]-model170M (flagship): 170M parameters, 16 transformer layers, hidden dimension 1024, 16 attention heads, maximum of 20,000 tokens. Trained on roughly 360 million cells drawn from CellxGene (2026) and scBaseCount (2025), spanning 16 species for cross-species applications.
corpus40M-model30M: 30M parameters, 8 transformer layers, hidden dimension 512, 8 attention heads, maximum of 1,000 tokens. Trained on roughly 40 million human cells from CellxGene (2023); recommended as the default for embedding extraction and lightweight adaptation.

The implementation requires Python 3.12+ and optionally supports Flash Attention for accelerated training and inference.
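The two checkpoint specifications can be written down as plain data, which makes a few derived quantities easy to check (for example, both models work out to 64 dimensions per attention head). The sketch below is illustrative only: `CheckpointSpec` and `pick_checkpoint` are hypothetical helpers, not part of the scConcept package, and the selection rule assumes one token per gene fed to the model, which the source does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointSpec:
    """Figures copied from the model card; the class itself is hypothetical."""
    name: str
    n_params_millions: int
    n_layers: int
    hidden_dim: int
    n_heads: int
    max_tokens: int
    multi_species: bool

    @property
    def head_dim(self) -> int:
        # Per-head width implied by the hidden dimension and head count.
        return self.hidden_dim // self.n_heads


CHECKPOINTS = [
    CheckpointSpec("corpus360M[multi-species]-model170M", 170, 16, 1024, 16, 20_000, True),
    CheckpointSpec("corpus40M-model30M", 30, 8, 512, 8, 1_000, False),
]


def pick_checkpoint(needs_non_human: bool, genes_per_cell: int) -> CheckpointSpec:
    """Return the smallest listed checkpoint that covers the species and token needs.

    Assumption: one token per input gene, so genes_per_cell must fit in max_tokens.
    """
    candidates = [
        spec for spec in CHECKPOINTS
        if (spec.multi_species or not needs_non_human) and spec.max_tokens >= genes_per_cell
    ]
    if not candidates:
        raise ValueError("no listed checkpoint satisfies these requirements")
    return min(candidates, key=lambda spec: spec.n_params_millions)


for spec in CHECKPOINTS:
    print(spec.name, "head_dim =", spec.head_dim)
print(pick_checkpoint(needs_non_human=False, genes_per_cell=800).name)
```

Under these assumptions, a human dataset with a modest number of genes per cell resolves to the 30M checkpoint, while non-human data or longer inputs resolve to the 170M checkpoint.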

Applications

scConcept is intended for embedding extraction from scRNA-seq data, fine-tuning and model adaptation for specialized tasks, and as a backbone for downstream single-cell analyses such as cell-type annotation, clustering, and dataset integration. Because its representations are designed to be technology-agnostic, it is well suited to building or querying cell atlases assembled from heterogeneous sources, where datasets differ in sequencing platform, gene panel, and count depth. The multi-species checkpoint additionally supports cross-species analysis and label transfer, while the smaller checkpoint serves researchers who need fast embeddings under modest compute budgets.
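As an illustration of a downstream use of cell embeddings, the sketch below transfers cell-type labels from a reference set to query cells with a cosine k-nearest-neighbour vote. It does not call any scConcept API: the embeddings are synthetic stand-ins, and `knn_label_transfer` is a hypothetical helper written for this example.

```python
from collections import Counter

import numpy as np


def knn_label_transfer(ref_emb: np.ndarray, ref_labels: list[str],
                       query_emb: np.ndarray, k: int = 5) -> list[str]:
    """Assign each query cell the majority label of its k most similar reference cells."""
    # Cosine similarity via L2-normalised dot products.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    qry = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = qry @ ref.T                               # (n_query, n_ref)
    nearest = np.argsort(-sims, axis=1)[:, :k]       # indices of the top-k neighbours
    return [Counter(ref_labels[i] for i in row).most_common(1)[0][0] for row in nearest]


rng = np.random.default_rng(0)
# Synthetic "embeddings": two well-separated clusters standing in for model output.
centers = {"T cell": np.array([1.0, 0.0, 0.0]), "B cell": np.array([0.0, 1.0, 0.0])}
ref_labels = ["T cell"] * 20 + ["B cell"] * 20
ref_emb = np.vstack([centers[label] + 0.05 * rng.normal(size=3) for label in ref_labels])
query_emb = np.vstack([
    centers["T cell"] + 0.05 * rng.normal(size=3),
    centers["B cell"] + 0.05 * rng.normal(size=3),
])

predicted = knn_label_transfer(ref_emb, ref_labels, query_emb)
print(predicted)
```

In practice the reference and query matrices would come from the same embedding model applied to datasets from different platforms; how well the vote works then depends on how technology-agnostic the embeddings actually are.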

Impact

scConcept contributes to an ongoing reassessment of which pretraining objectives are appropriate for single-cell foundation models. By showing that a contrastive cell-identification task can replace the dominant reconstruction objective and yield representations that generalize across technologies, it challenges the assumption that masked-language-modeling recipes from NLP transfer cleanly to single-cell data. As a recent preprint, its empirical standing relative to established models such as scGPT, Geneformer, and scVI is still being evaluated by the community, and its conclusions await peer review. The release of open weights for both a large multi-species model and a compact human model lowers the barrier for adoption and independent benchmarking across the single-cell genomics community.

Citation

Bahrami, M., et al. (2025). scConcept: Contrastive pretraining for technology-agnostic single-cell representations beyond reconstruction. bioRxiv (preprint).

DOI: 10.1101/2025.10.14.682419

Recent citations

Papers that recently cited this model.

Task-adapted biological foundation models uncover perturbation-centric representations
Elena Pareja-Lorente, Patrick Aloy
bioRxiv · Jul 2026 · 0 citations

Benchmarking gene expression reconstruction from single-cell latent representations
Xiaotong Fu, Dominik Klein, E. Antipov, et al.
bioRxiv · Jun 2026 · 0 citations

Cellpin enables reference-based imputation and denoising of spatial transcriptomes
Philipp Putze, Daniele Lucarelli, Deelaka Wellappili, et al.
bioRxiv · Jun 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

From modality-specific to compositional foundation models for cell biology
Mojtaba Bahrami, Till Richter, Niklas A. Schmacke, et al.
Cell Systems · Feb 2026 · 3 citations

Representation learning of single-cell RNA-seq data
Constantin Ahlmann-Eltze, Florian Barkmann, Jan Lause, et al.
RNA: A publication of the RNA Society · Jan 2026 · 1 citation

multiVIB: A unified probabilistic contrastive learning framework for atlas-scale integration of single-cell multi-omics data
Yang Xu, Stephen J. Fleming, Brice Wang, et al.
bioRxiv · Dec 2025 · 0 citations

Cellpin enables reference-based imputation and denoising of spatial transcriptomes
Philipp Putze, Daniele Lucarelli, Deelaka Wellappili, et al.
bioRxiv · Jun 2026 · 0 citations

Multilayer network approaches to omics data integration in digital twins for cancer research
Hugo Chenel, Malvina Marku, Tim James, et al.
Frontiers in Systems Biology · Jun 2026 · 0 citations

Citations

Total Citations: 10

Influential: 0

References: 101

GitHub

Stars: 37

Forks: 7

Open Issues: 3

Contributors: 1

Last Push: 1mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 103

Likes: 4

Last Modified: 1mo ago

Pipeline: feature-extraction

Fields of citing research

Computer Science: 100%
Biology: 88%
Medicine: 50%
Mathematics: 13%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 64 (Partial)

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 40

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
