SCimilarity

Single-cell foundation model trained by metric learning to embed scRNA-seq profiles for cell type annotation and similarity search in cell atlases.

Released: November 2024

The Human Cell Atlas and related single-cell initiatives have collectively profiled more than 100 million human cells across diseases, tissues, developmental stages, and perturbations. This accumulation of data is an extraordinary resource, but only if researchers can query it efficiently: to find cells that resemble a state of interest, to annotate new datasets against it as a reference, and to identify biologically similar cell states across tissues and conditions. SCimilarity is built for this task. It is a foundation model trained by metric learning on a 22.7-million-cell corpus, and it produces cell embeddings in which transcriptional similarity corresponds to geometric proximity, enabling rapid search across tens of millions of profiles.

SCimilarity was developed by Graham Heimberg, Tony Kuo, Daryle DePianto, and colleagues including Aviv Regev at Genentech, with contributions spanning Genentech's AI development, immunology, and regenerative medicine departments. The preprint was posted in July 2023 and the work was published in Nature in November 2024. The model's central contribution is a unified, searchable representation of single-cell gene expression that works across datasets, cell types, tissues, and sequencing platforms without requiring dataset-specific preprocessing such as batch correction or highly variable gene selection.

The practical impact of SCimilarity is best illustrated by the experimental validation described in the original paper. The authors queried a 22.7-million-cell corpus for cells transcriptionally similar to a macrophage subset originally identified in interstitial lung disease. SCimilarity returned highly similar macrophage profiles from other fibrotic diseases, from unexpected tissue contexts, and from a three-dimensional hydrogel cell culture system. This last finding was then used experimentally: the hydrogel system was repurposed to generate the interstitial-lung-disease-like macrophage state in vitro, validating that the computationally identified similarity was biologically real and practically actionable.

Key Features

Metric learning with triplet loss: SCimilarity is trained using a triplet loss function applied to cell expression triplets — anchor, positive (same cell type), and semi-hard negative (different cell type) — driving the model to learn an embedding space where cell type identity is encoded as geometric proximity.
Scalable cross-atlas querying: Trained cell embeddings can be indexed using approximate nearest-neighbor search, enabling rapid retrieval of the most transcriptionally similar cells from corpora of tens of millions of profiles without per-query retraining.
Zero-shot cell type annotation: Given a query cell, SCimilarity retrieves its nearest neighbors from the reference atlas and propagates their annotated cell type labels, producing automated annotation without requiring a custom classifier or fine-tuning on the target dataset.
Platform generalization: Despite being trained primarily on 10x Genomics Chromium data, SCimilarity accurately embeds and annotates cell profiles from multiple sequencing platforms including other droplet-based and plate-based scRNA-seq technologies.
No dataset-specific preprocessing required: Beyond the log-normalized expression input the model expects, querying with SCimilarity does not require batch correction, highly variable gene selection, or other dataset harmonization steps. The embedding model handles these sources of variation implicitly.
Biologically interpretable embedding space: The embedding space is structured such that clusters correspond to known cell types, and the distances between cell type centroids reflect biological relationships, enabling comparison of cell states across diseases and tissues.
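The only input preparation this page mentions is log-normalization of expression values over a fixed gene set. A minimal sketch of what that could look like for a single cell follows, in plain Python. The helper names `align_to_gene_order` and `log_normalize`, the scaling target of 10,000 counts per cell, and the toy gene list are assumptions for illustration; they are not taken from the SCimilarity package.

```python
import math

def align_to_gene_order(cell_counts, gene_order):
    """Reorder one cell's counts to a fixed gene order, filling missing genes with 0."""
    return [cell_counts.get(gene, 0.0) for gene in gene_order]

def log_normalize(counts, target_sum=10_000.0):
    """Scale counts to a common per-cell total (assumed 10,000), then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]

# Toy example: a cell measured on three genes, mapped onto a four-gene model vocabulary.
gene_order = ["ACTB", "CD68", "COL1A1", "PTPRC"]  # hypothetical model gene order
cell = {"CD68": 30, "PTPRC": 10, "ACTB": 60}

aligned = align_to_gene_order(cell, gene_order)
profile = log_normalize(aligned)
```

Nothing in this sketch depends on other cells or datasets, which is consistent with the claim that no batch correction or variable-gene selection is needed before embedding.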

Technical Details

SCimilarity uses a deep autoencoder as the embedding backbone, mapping log-normalized gene expression profiles (spanning approximately 19,000 genes) to a low-dimensional embedding space. Alongside the autoencoder's reconstruction objective, the model is trained with a triplet loss that encourages profiles of the same cell type to be embedded closer together than profiles of different cell types. Critically, training uses semi-hard negative mining: for each anchor cell, the negative example is the most similar cell of a different cell type that is still farther from the anchor than the positive. This strategy keeps gradients informative throughout training and helps prevent the model from collapsing to trivial solutions.
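The semi-hard selection rule can be illustrated with a small, self-contained sketch. It uses the common textbook formulation: a negative is semi-hard when it is farther from the anchor than the positive but still inside the margin, and the closest such negative is chosen. Euclidean distance, the margin value, and the function names are assumptions made for illustration, not details confirmed by this page.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_semi_hard_negative(anchor, positive, candidates, margin):
    """Return the closest candidate with d(a,p) < d(a,n) < d(a,p) + margin, or None."""
    d_ap = euclidean(anchor, positive)
    semi_hard = []
    for neg in candidates:
        d_an = euclidean(anchor, neg)
        if d_ap < d_an < d_ap + margin:
            semi_hard.append((d_an, neg))
    if not semi_hard:
        return None
    return min(semi_hard, key=lambda pair: pair[0])[1]

def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss: max(0, d(a,p) - d(a,n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings: the positive shares the anchor's cell type, candidates do not.
anchor = (0.0, 0.0)
positive = (1.0, 0.0)
candidates = [(0.5, 0.0), (1.2, 0.0), (1.4, 0.0), (3.0, 0.0)]

negative = pick_semi_hard_negative(anchor, positive, candidates, margin=0.5)
loss = triplet_loss(anchor, positive, negative, margin=0.5)
```

In this toy case the candidate at distance 0.5 is skipped because it is already closer than the positive, and the one at distance 3.0 is skipped because it lies beyond the margin and would contribute zero loss.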

The training corpus consisted of 22.7 million cells assembled from 399 published scRNA-seq studies, manually curated for quality and annotated with harmonized cell type labels. Cell type labels were propagated and verified using author-provided annotations and cross-study concordance checks. The trained model produces fixed-length cell embedding vectors that can be indexed using approximate nearest-neighbor algorithms (such as FAISS) for efficient querying at scale. Published benchmarks demonstrated accurate cell type annotation across diverse tissue types and disease contexts, with the model correctly integrating and annotating cells from datasets not present in the training corpus. Experimental validation of a queried macrophage state in a 3D hydrogel system confirmed that SCimilarity's similarity measure reflects biologically meaningful cell identity relationships, not merely technical batch structure.
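To make the querying step concrete, here is a minimal sketch of similarity search over fixed-length embeddings. It performs an exact brute-force scan using cosine similarity, which is only a stand-in: at atlas scale an approximate nearest-neighbor index would replace the scan. The choice of cosine similarity, the function names, and the toy vectors are assumptions for illustration.

```python
import heapq
import math

def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_top_k(query, reference, k):
    """Return [(index, similarity)] for the k reference embeddings most similar to query."""
    q = l2_normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, l2_normalize(ref))), idx)
        for idx, ref in enumerate(reference)
    )
    best = heapq.nlargest(k, scored)
    return [(idx, sim) for sim, idx in best]

# Toy reference "atlas" of four 2-D embeddings and one query embedding.
reference = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hits = cosine_top_k([1.0, 0.0], reference, k=2)
```

Because the embeddings are fixed once the model is trained, the reference side can be indexed a single time and reused for every query, which is what makes search without per-query retraining possible.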

Applications

SCimilarity is designed for researchers working with large-scale single-cell atlases and new datasets that need to be contextualized against existing knowledge. The primary use case is automated cell type annotation: users provide their new scRNA-seq dataset, embed it with SCimilarity, and retrieve annotations from the reference atlas by nearest-neighbor lookup. This removes the need to train dataset-specific classifiers or manually curate marker gene lists for every new experiment.

A second major application is cross-dataset discovery. Given a cell state of interest, such as a disease-associated macrophage subset or a rare progenitor population, SCimilarity can search the entire reference atlas to find where and under what conditions similar cells appear, generating hypotheses about shared biology across tissues and diseases. This is particularly useful for rare cell types that appear in only a few datasets and whose broader tissue distribution is unknown.
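The annotation workflow described here, embedding a cell and then borrowing labels from its nearest reference neighbors, can be sketched as a simple k-nearest-neighbor vote. The function name `knn_annotate`, the use of Euclidean distance, the unweighted majority vote, and the toy embeddings and labels are all assumptions for illustration rather than the package's actual procedure.

```python
import math
from collections import Counter

def knn_annotate(query, ref_embeddings, ref_labels, k=3):
    """Label a query embedding by majority vote among its k nearest reference cells.

    Returns (label, fraction_of_neighbors_agreeing).
    """
    order = sorted(
        range(len(ref_embeddings)),
        key=lambda i: math.dist(query, ref_embeddings[i]),
    )
    neighbors = order[:k]
    votes = Counter(ref_labels[i] for i in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)

# Toy reference atlas: five embedded cells with curated labels.
ref_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
ref_labels = ["macrophage", "macrophage", "monocyte", "fibroblast", "fibroblast"]

label, agreement = knn_annotate((0.05, 0.05), ref_embeddings, ref_labels, k=3)
```

The agreement fraction gives a rough confidence signal: a low value suggests the query sits between reference populations and may deserve manual review.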

Impact

SCimilarity's publication in Nature in 2024 established metric learning as a productive paradigm for building unified representations of single-cell gene expression at scale. The work demonstrated that a foundation model trained on a diverse cell corpus can generalize to annotate cells from datasets it has not seen and can drive experimental discovery, with the in vitro validation of the ILD macrophage state as a particularly compelling example. It thereby showed that single-cell foundation models are not merely interesting computationally but can actively guide biological research. SCimilarity is part of a growing ecosystem of single-cell foundation models (including scGPT, Geneformer, and Universal Cell Embeddings) but is distinctive in its focus on scalable querying and cross-dataset search rather than generative modeling or sequence prediction. The open-source release through Genentech's GitHub, combined with detailed documentation, has enabled broad adoption across academic and industry research groups working with the Human Cell Atlas and related resources.

Citation

A cell atlas foundation model for scalable search of similar human cells

Heimberg, G., et al. (2024) A cell atlas foundation model for scalable search of similar human cells. Nature.

DOI: 10.1038/s41586-024-08411-y

Recent citations

Papers that recently cited this model.

Islands and bridges: Momentum contrastive coupling unifies discrete and continuous structure in single-cell omics
Zeyu Fu, Chunlin Chen, Keyang Zhang
Biomedical Signal Processing and Control · 2026 · 0 citations

scDifformer: diffusion-based post-training for virtual cell modeling across large-scale single-cell data
Zhan Xiao, Wuke Wang, Xin Long, et al.
Nucleic Acids Research · Jul 2026 · 0 citations

GPNMB-directed CAR T cell therapy against MiT/TFE-family fusion-driven solid tumors
F. Zemp, Z. Breckenridge, Hyojin Song, et al.
Nature Cancer · Jul 2026 · 1 citation

Top citations

The most-cited papers that cite this model.

Nicheformer: a foundation model for single-cell and spatial omics
A. Schaar, Alejandro Tejada-Lapuerta, G. Palla, et al.
bioRxiv · Oct 2024 · 124 citations

The Human Cell Atlas from a cell census to a unified foundation model
Jennifer E. Rood, Samantha Wynne, L. Robson, et al.
Nature · Nov 2024 · 112 citations · Influential

Evaluating the Utilities of Foundation Models in Single-Cell Data Analysis
Tianyu Liu, Kexing Li, Yuge Wang, et al.
bioRxiv · Feb 2024 · 53 citations

Considerations for building and using integrated single-cell atlases
Karin Hrovatin, L. Sikkema, Vladimir A. Shitov, et al.
Nature Methods · Dec 2024 · 37 citations

Large language models in biomedicine and healthcare
Juexiao Zhou, Haoyang Li, Siyuan Chen, et al.
npj Artificial Intelligence · Dec 2025 · 32 citations

Citations

Total Citations: 125
Influential: 7
References: 80

GitHub

Stars: 258
Forks: 27
Open Issues: 22
Contributors: 4
Last Push: 4mo ago
Language: Python

Fields of citing research

Biology: 84%
Computer Science: 77%
Medicine: 75%
Engineering: 3%
Environmental Science: 2%
Linguistics: 1%
Mathematics: 1%
Chemistry: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
78 · Open
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 63
Model Openness Framework: Class III · Open Model
