Genentech
Metric learning foundation model that embeds single-cell RNA-seq profiles into a unified space for scalable cell type annotation and cross-atlas similarity search across tens of millions of cells.
The Human Cell Atlas and related single-cell initiatives have collectively profiled more than 100 million human cells across diseases, tissues, developmental stages, and perturbations. This data is only useful if researchers can query it efficiently: to find cells that resemble a state of interest, to annotate new datasets against it as a reference, and to identify biologically similar cell states across tissues and conditions. SCimilarity is built for this task. It is a foundation model trained by metric learning on a 22.7-million-cell corpus, and it produces cell embeddings in which transcriptional similarity corresponds to geometric proximity, so that tens of millions of profiles can be searched quickly by nearest-neighbor lookup.
SCimilarity was developed by Graham Heimberg, Tony Kuo, Daryle DePianto, and colleagues including Aviv Regev at Genentech, with contributions spanning Genentech's AI development, immunology, and regenerative medicine departments. The preprint was posted in July 2023 and the work was published in Nature in November 2024. The model's central contribution is a unified, searchable representation of single-cell gene expression that works across datasets, cell types, tissues, and sequencing platforms without requiring dataset-specific preprocessing such as batch correction or highly variable gene selection.
The practical impact of SCimilarity is best illustrated by the experimental validation described in the original paper. The authors queried the 22.7-million-cell corpus for cells transcriptionally similar to a macrophage subset originally identified in interstitial lung disease (ILD). SCimilarity returned highly similar macrophage profiles from other fibrotic diseases, from unexpected tissue contexts, and from a three-dimensional hydrogel cell culture system. The last of these findings was then tested experimentally: the hydrogel system was repurposed to generate the ILD-like macrophage state in vitro, confirming that the computationally identified similarity was biologically real and practically actionable.
SCimilarity uses a deep autoencoder as the embedding backbone, mapping log-normalized gene expression profiles (spanning approximately 19,000 genes) to a low-dimensional embedding space. The model combines the autoencoder's reconstruction objective with a triplet loss that encourages profiles of the same cell type to be embedded closer together than profiles of different cell types. Training uses semi-hard negative triplet mining: for each anchor cell and its same-type positive, the negative is the most similar cell of a different type that is still farther from the anchor than the positive is. Such triplets are neither trivially easy nor already violated, so they provide informative gradients throughout training and help keep the model from converging to trivial solutions.
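The semi-hard mining rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not SCimilarity's implementation: the function names, the use of squared Euclidean distance, and the margin value are all assumptions, and the code uses the common formulation in which a semi-hard negative satisfies d(a, p) < d(a, n) < d(a, p) + margin.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distance between every pair of rows in x."""
    sq = np.sum(x * x, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def semi_hard_triplet_loss(embeddings, labels, margin=0.05):
    """Mean triplet loss over anchor-positive pairs with a semi-hard negative.

    Hypothetical helper for illustration. For each (anchor, positive) pair,
    the negative is the closest cell of another type that is still farther
    from the anchor than the positive, but inside the margin. Pairs without
    such a negative are skipped.
    """
    labels = np.asarray(labels)
    d = pairwise_sq_dists(np.asarray(embeddings, dtype=float))
    losses = []
    for a in range(len(labels)):
        is_negative = labels != labels[a]
        for p in range(len(labels)):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = d[a, p]
            semi_hard = is_negative & (d[a] > d_ap) & (d[a] < d_ap + margin)
            if not semi_hard.any():
                continue
            d_an = d[a][semi_hard].min()  # closest semi-hard negative
            losses.append(d_ap - d_an + margin)
    return float(np.mean(losses)) if losses else 0.0

# Toy one-dimensional embeddings: two T cells and two B cells.
toy_embeddings = np.array([[0.0], [1.0], [1.2], [5.0]])
toy_labels = ["T cell", "T cell", "B cell", "B cell"]
toy_loss = semi_hard_triplet_loss(toy_embeddings, toy_labels, margin=1.0)
```

In the toy data only the first T cell has a semi-hard negative (the B cell at 1.2), so a single triplet contributes to the loss. In practice this computation would run per mini-batch inside an automatic-differentiation framework, alongside the reconstruction term.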
The training corpus consisted of 22.7 million cells assembled from 399 published scRNA-seq studies, manually curated for quality and annotated with harmonized cell type labels. Cell type labels were propagated and verified using author-provided annotations and cross-study concordance checks. The trained model produces fixed-length cell embedding vectors that can be indexed with approximate nearest-neighbor libraries (such as FAISS) for efficient querying at scale. Published benchmarks demonstrated accurate cell type annotation across diverse tissue types and disease contexts, with the model correctly integrating and annotating cells from datasets not present in the training corpus. The hydrogel validation of the queried macrophage state further indicates that SCimilarity's similarity measure reflects biologically meaningful cell identity relationships, not merely technical batch structure.
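To make the querying step concrete, the sketch below runs an exact, brute-force similarity search over a small set of embeddings. It stands in for an approximate nearest-neighbor index, which would replace the full matrix product at atlas scale. The function names and the choice of cosine similarity are assumptions made for illustration, not details taken from SCimilarity.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def top_k_similar(query, reference, k=3):
    """Return (indices, similarities) of the k most similar reference rows.

    Hypothetical helper: exact search by cosine similarity. An approximate
    index trades a little accuracy for much lower query time on large atlases.
    """
    sims = l2_normalize(query) @ l2_normalize(reference).T
    idx = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    return idx, np.take_along_axis(sims, idx, axis=1)

# Four toy reference embeddings and one query embedding.
reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([[2.0, 0.1]])
neighbor_idx, neighbor_sims = top_k_similar(query, reference, k=2)
```

The query points almost exactly along the first reference vector, so that row ranks first and the diagonal vector ranks second.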
SCimilarity is designed for researchers working with large-scale single-cell atlases and with new datasets that need to be contextualized against existing knowledge. The primary use case is automated cell type annotation: users provide a new scRNA-seq dataset, embed it with SCimilarity, and retrieve annotations from the reference atlas by nearest-neighbor lookup. This reduces the need to train dataset-specific classifiers or manually curate marker gene lists for every new experiment. A second major application is cross-dataset discovery. Given a cell state of interest, such as a disease-associated macrophage subset or a rare progenitor population, SCimilarity can search the entire reference atlas to find where and under what conditions similar cells appear, generating hypotheses about shared biology across tissues and diseases. This is particularly useful for rare cell types that appear in only a few datasets and whose broader tissue distribution is unknown.
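Annotation by nearest-neighbor lookup can be illustrated with a simple majority vote over the reference cells closest to each query cell. The sketch below is a minimal example under stated assumptions: the function name, the Euclidean distance, the value of k, and the vote-fraction confidence score are all hypothetical and are not taken from the SCimilarity package.

```python
from collections import Counter

import numpy as np

def annotate_by_neighbors(query_emb, ref_emb, ref_labels, k=5):
    """Label each query cell by majority vote among its k nearest reference cells.

    Hypothetical helper. Returns one (label, vote_fraction) tuple per query
    cell; a low vote fraction flags cells whose neighborhood is mixed.
    """
    query_emb = np.asarray(query_emb, dtype=float)
    ref_emb = np.asarray(ref_emb, dtype=float)
    # Squared Euclidean distance from every query cell to every reference cell.
    diffs = query_emb[:, None, :] - ref_emb[None, :, :]
    dists = np.sum(diffs * diffs, axis=2)
    k = min(k, len(ref_labels))
    nearest = np.argsort(dists, axis=1)[:, :k]
    results = []
    for row in nearest:
        votes = Counter(ref_labels[i] for i in row)
        label, count = votes.most_common(1)[0]
        results.append((label, count / k))
    return results

# Toy reference atlas: three macrophages near the origin, two fibroblasts far away.
ref_emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
ref_labels = ["macrophage", "macrophage", "macrophage", "fibroblast", "fibroblast"]
query_emb = np.array([[0.05, 0.05], [5.0, 5.1]])
annotations = annotate_by_neighbors(query_emb, ref_emb, ref_labels, k=3)
```

The second query cell has only two fibroblast neighbors available, so its third neighbor is a distant macrophage and its vote fraction drops to two thirds. A score like this is one simple way to flag query cells that may not be well represented in the reference.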
SCimilarity's publication in Nature in 2024 established metric learning as a productive paradigm for building unified representations of single-cell gene expression at scale. The work demonstrated that a foundation model trained on a diverse cell corpus can generalize to annotate cells from unseen datasets and can drive experimental discovery, with the in vitro validation of the ILD macrophage state as a particularly compelling example. In doing so, it showed that single-cell foundation models are not merely interesting computationally but can actively guide biological research. SCimilarity is part of a growing ecosystem of single-cell foundation models (including scGPT, Geneformer, and Universal Cell Embeddings) but is distinctive in its focus on scalable querying and cross-dataset search rather than generative modeling or sequence prediction. The open-source release through Genentech's GitHub, combined with detailed documentation, has enabled broad adoption across academic and industry research groups working with the Human Cell Atlas and related resources.