bio.rodeo

Single-cell

SCimilarity

Genentech

Metric learning foundation model that embeds single-cell RNA-seq profiles into a unified space for scalable cell type annotation and cross-atlas similarity search across tens of millions of cells.

Released: 2024

Overview

The Human Cell Atlas and related single-cell initiatives have collectively profiled more than 100 million human cells across diseases, tissues, developmental stages, and perturbations. This accumulation of data is an extraordinary resource, but only if researchers can query it efficiently: to find cells that resemble a state of interest, to annotate new datasets against it as a reference, and to identify biologically similar cell states across tissues and conditions. SCimilarity is built for this purpose. It is a foundation model trained by metric learning on a 22.7-million-cell corpus, and it produces cell embeddings in which transcriptional similarity corresponds to geometric proximity, enabling rapid search across tens of millions of profiles.

SCimilarity was developed by Graham Heimberg, Tony Kuo, Daryle DePianto, and colleagues including Aviv Regev at Genentech, with contributions spanning Genentech's AI development, immunology, and regenerative medicine departments. The preprint was posted in July 2023 and the work was published in Nature in November 2024. The model's central contribution is a unified, searchable representation of single-cell gene expression that works across datasets, cell types, tissues, and sequencing platforms without requiring dataset-specific preprocessing such as batch correction or highly variable gene selection.

The practical impact of SCimilarity is best illustrated by the experimental validation described in the original paper. The authors queried a 22.7-million-cell corpus for cells transcriptionally similar to a macrophage subset originally identified in interstitial lung disease. SCimilarity returned highly similar macrophage profiles from other fibrotic diseases, from unexpected tissue contexts, and from a three-dimensional hydrogel cell culture system. This last finding was then used experimentally: the hydrogel system was repurposed to generate the interstitial-lung-disease-like macrophage state in vitro, validating that the computationally identified similarity was biologically real and practically actionable.

Key Features

  • Metric learning with triplet loss: SCimilarity is trained using a triplet loss function applied to cell expression triplets — anchor, positive (same cell type), and semi-hard negative (different cell type) — driving the model to learn an embedding space where cell type identity is encoded as geometric proximity.
  • Scalable cross-atlas querying: Embeddings produced by the trained model can be indexed for approximate nearest-neighbor search, so the most transcriptionally similar cells can be retrieved rapidly from corpora of tens of millions of profiles without per-query retraining.
  • Zero-shot cell type annotation: Given a query cell, SCimilarity retrieves its nearest neighbors from the reference atlas and propagates their annotated cell type labels, producing automated annotation without requiring a custom classifier or fine-tuning on the target dataset.
  • Platform generalization: Despite being trained primarily on 10x Genomics Chromium data, SCimilarity accurately embeds and annotates cell profiles from multiple sequencing platforms including other droplet-based and plate-based scRNA-seq technologies.
  • No dataset-specific harmonization required: Querying with SCimilarity does not require batch correction, highly variable gene selection, or other dataset harmonization steps beyond log-normalizing the input profiles; the embedding model handles these sources of variation implicitly.
  • Biologically interpretable embedding space: The embedding space is structured such that clusters correspond to known cell types, and the distances between cell type centroids reflect biological relationships, enabling comparison of cell states across diseases and tissues.
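
The last point, comparing cell types by the distance between their centroids, can be sketched in a few lines of NumPy. This is a minimal illustration and not SCimilarity's own code: the function name `centroid_distances` is hypothetical, the embeddings are assumed to be already computed, and Euclidean distance is an assumed choice of metric.

```python
import numpy as np

def centroid_distances(emb, labels):
    """Return sorted cell type names and the matrix of distances between their centroids.

    emb:    (n_cells, n_dims) array of cell embeddings (assumed precomputed).
    labels: length n_cells sequence of cell type names.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    # One centroid per cell type: the mean embedding of its cells.
    centroids = np.stack([emb[labels == name].mean(axis=0) for name in names])
    # Pairwise Euclidean distance between centroids (assumed metric).
    diffs = centroids[:, None, :] - centroids[None, :, :]
    return names, np.sqrt(np.sum(diffs * diffs, axis=2))
```

Calling `centroid_distances(emb, labels)` returns the cell type names in sorted order together with a symmetric matrix whose entry (i, j) is the distance between the centroids of types i and j.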

Technical Details

SCimilarity uses a deep autoencoder as the embedding backbone, mapping log-normalized gene expression profiles (spanning approximately 19,000 genes) to a low-dimensional embedding space. The model is trained jointly with a triplet loss that encourages same-cell-type profiles to be embedded closer together than profiles from different cell types. Critically, the training uses semi-hard negative triplet mining: for each anchor and positive pair, the negative is a cell of a different type that is still farther from the anchor than the positive, and among such candidates the one closest to the anchor is preferred. In the standard formulation of semi-hard mining, the negative also lies within the loss margin of the positive distance, so the triplet still contributes a nonzero loss. This strategy provides informative gradients throughout training and avoids the collapse to trivial solutions that mining only the hardest negatives can cause.
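
The semi-hard mining rule can be made concrete with a small NumPy sketch. It covers only the triplet term and operates on embeddings that are assumed to be given, so the autoencoder itself is left out. The function names, the squared Euclidean distance, and the `margin` value are illustrative assumptions, not details taken from the SCimilarity code.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distance between every pair of rows in x."""
    sq = np.sum(x * x, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off

def semi_hard_triplet_loss(emb, labels, margin=0.05):
    """Mean triplet loss over (anchor, positive) pairs with semi-hard negatives.

    A negative n is semi-hard for (a, p) when d(a, p) < d(a, n) < d(a, p) + margin.
    Among the semi-hard candidates, the one closest to the anchor is used.
    Pairs without any semi-hard negative are skipped.
    """
    labels = np.asarray(labels)
    d = pairwise_sq_dists(np.asarray(emb, dtype=float))
    n = len(labels)
    losses = []
    for a in range(n):
        neg_mask = labels != labels[a]
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            semi_hard = neg_mask & (d[a] > d[a, p]) & (d[a] < d[a, p] + margin)
            if not semi_hard.any():
                continue
            d_an = d[a][semi_hard].min()
            # Positive by construction, because d_an < d(a, p) + margin.
            losses.append(d[a, p] - d_an + margin)
    return float(np.mean(losses)) if losses else 0.0
```

The double loop keeps the rule readable; a training implementation would vectorize the mining over a mini-batch and add this term to the autoencoder's reconstruction objective.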

The training corpus consisted of 22.7 million cells assembled from 399 published scRNA-seq studies, manually curated for quality and annotated with harmonized cell type labels. Cell type labels were propagated and verified using author-provided annotations and cross-study concordance checks. The trained model produces fixed-length cell embedding vectors that can be indexed with an approximate nearest-neighbor library (such as FAISS) for efficient querying at scale. Published benchmarks demonstrated accurate cell type annotation across diverse tissue types and disease contexts, with the model correctly integrating and annotating cells from datasets not present in the training corpus. Experimental validation of a queried macrophage state in a 3D hydrogel system confirmed that SCimilarity's similarity measure reflects biologically meaningful cell identity relationships, not merely technical batch structure.
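
To show what a similarity query over fixed-length embeddings looks like, here is a brute-force NumPy sketch. It is an exact search that stands in for an approximate nearest-neighbor index and would not scale to tens of millions of cells. The function names and the use of cosine similarity are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def top_k_similar(reference, query, k=3):
    """Return (indices, similarities) of the k most similar reference rows per query row."""
    ref = l2_normalize(reference)
    q = l2_normalize(query)
    sims = q @ ref.T                           # shape: (n_query, n_reference)
    k = min(k, ref.shape[0])
    idx = np.argsort(-sims, axis=1)[:, :k]     # most similar first
    return idx, np.take_along_axis(sims, idx, axis=1)
```

An approximate index replaces the full `q @ ref.T` comparison with a search structure built once over the reference embeddings, trading a small loss in recall for much faster queries.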

Applications

SCimilarity is designed for researchers working with large-scale single-cell atlases and new datasets that need to be contextualized against existing knowledge. The primary use case is automated cell type annotation: users provide their new scRNA-seq dataset, embed it with SCimilarity, and retrieve annotations from the reference atlas by nearest-neighbor lookup. This reduces the need to train dataset-specific classifiers or manually curate marker gene lists for every new experiment. A second major application is cross-dataset discovery. Given a cell state of interest, such as a disease-associated macrophage subset or a rare progenitor population, SCimilarity can search the entire reference atlas to find where and under what conditions similar cells appear, generating hypotheses about shared biology across tissues and diseases. This is particularly useful for rare cell types that appear in only a few datasets and whose broader tissue distribution is unknown.
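
The annotation use case described above reduces to a nearest-neighbor vote once embeddings exist. The sketch below assumes the reference and query cells have already been embedded, and it uses an exact search with a simple majority vote. The function name `annotate_by_neighbors`, the value of `k`, and the vote-fraction score are hypothetical choices, not the SCimilarity API.

```python
from collections import Counter

import numpy as np

def annotate_by_neighbors(ref_emb, ref_labels, query_emb, k=5):
    """Label each query cell by majority vote among its k nearest reference cells.

    Returns one (label, vote_fraction) tuple per query cell.
    """
    ref_emb = np.asarray(ref_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    # Squared Euclidean distance from every query cell to every reference cell.
    diffs = query_emb[:, None, :] - ref_emb[None, :, :]
    dists = np.sum(diffs * diffs, axis=2)
    k = min(k, len(ref_labels))
    nearest = np.argsort(dists, axis=1)[:, :k]
    results = []
    for row in nearest:
        votes = Counter(ref_labels[i] for i in row)
        label, count = votes.most_common(1)[0]
        results.append((label, count / k))
    return results
```

The vote fraction gives a rough confidence signal: a query cell whose neighbors disagree may belong to a cell state that is poorly represented in the reference.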

Impact

SCimilarity's publication in Nature in 2024 established metric learning as a productive paradigm for building unified representations of single-cell gene expression at scale. The work demonstrated that a foundation model trained on a diverse cell corpus can generalize to annotate cells from unseen datasets and can drive experimental discovery, with the in vitro validation of the ILD macrophage state as a particularly compelling example. It showed that single-cell foundation models are not merely interesting computationally but can actively guide biological research. SCimilarity is part of a growing ecosystem of single-cell foundation models (including scGPT, Geneformer, and Universal Cell Embeddings) but is distinctive in its focus on scalable querying and cross-dataset search rather than generative modeling or sequence prediction. The open-source release through Genentech's GitHub, together with detailed documentation, makes the model available to academic and industry research groups working with the Human Cell Atlas and related resources.

Sources:

  • GitHub - Genentech/scimilarity
  • A cell atlas foundation model for scalable search of similar human cells | Nature
  • Scalable querying of human cell atlases via a foundational model | bioRxiv

Tags

cell type annotation, cross-atlas querying, autoencoder, foundation model, contrastive learning, representation learning, single-cell transcriptomics

Resources

GitHub Repository, Research Paper, HuggingFace Model, Documentation