Zero-shot foundation model for single-cell gene expression that generates species-agnostic cell embeddings using protein language model representations of gene products.
UCE (Universal Cell Embeddings) is a 650-million parameter transformer foundation model for single-cell biology, developed by the Snap Lab (Jure Leskovec group) at Stanford University in collaboration with the Tabula Sapiens Consortium and Stephen Quake's lab. Released in late 2023, it addresses a fundamental challenge in single-cell genomics: how to compare and integrate cells across datasets, tissues, and species when gene nomenclature, sequencing protocols, and biological contexts all vary.
The core insight behind UCE is to represent genes not as learned vocabulary tokens tied to a particular organism, but through frozen ESM-2 protein language model embeddings of their protein products. Because ESM-2 encodes evolutionary information that is conserved across species, this design makes UCE inherently species-agnostic. A cell from a zebrafish and a cell from a human can be embedded into the same latent space without any organism-specific retraining.
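This design can be made concrete with a minimal sketch. The embedding table below is a hypothetical stand-in (random vectors rather than real ESM-2 outputs, and placeholder protein names), but it shows the key point: a gene's token is looked up by its protein product, through one table shared by all species, so no organism-specific vocabulary is learned.

```python
import numpy as np

# Hypothetical stand-in for the frozen ESM-2 protein embedding table.
# In UCE these are real ESM-2 vectors (dimension 5,120); random vectors
# and placeholder protein names are used here purely to show the flow.
rng = np.random.default_rng(0)
ESM2_DIM = 5120
protein_emb = {
    "HBA1_protein": rng.standard_normal(ESM2_DIM),   # a human gene's protein
    "hbaa1_protein": rng.standard_normal(ESM2_DIM),  # a zebrafish gene's protein
}

def gene_token(protein_id):
    """A gene's input representation is the frozen embedding of its
    protein product -- the same lookup regardless of species."""
    return protein_emb[protein_id]

human_tok = gene_token("HBA1_protein")
fish_tok = gene_token("hbaa1_protein")
```

Because both tokens live in the same ESM-2 space, downstream layers never need to know which organism a gene came from.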
UCE was trained in a fully self-supervised manner on the Integrated Mega-scale Atlas — 36 million cells drawn from more than 300 datasets spanning eight species. No cell type labels or data annotations were used during training. Despite this, the resulting embedding space exhibits emergent biological structure: developmental lineages, tissue hierarchies, and cell type relationships emerge without being explicitly taught.
UCE is a 33-layer transformer with an embedding dimension of 1,280. At inference time, 1,024 genes are sampled per cell with replacement, weighted by expression level, and arranged by chromosomal position. Frozen ESM-2 protein embeddings (dimension 5,120) serve as fixed gene representations, providing an evolutionarily informed starting point that the transformer then contextualizes within the cell's expression profile. The transformer produces a single cell-level embedding from this ordered gene sequence.
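The sampling-and-ordering step described above can be sketched as follows. This is a simplified illustration under the stated description (expression-weighted sampling with replacement, then arrangement by genomic coordinate); the function name and toy data are assumptions, not the repository's actual code.

```python
import numpy as np

def sample_gene_tokens(expr, chrom_pos, n_tokens=1024, seed=0):
    """Sketch of UCE-style input construction: sample genes with
    replacement, weighted by expression level, then arrange the
    sample by chromosomal position before it enters the transformer."""
    rng = np.random.default_rng(seed)
    probs = expr / expr.sum()                     # expression-weighted sampling
    idx = rng.choice(expr.size, size=n_tokens, replace=True, p=probs)
    # order the sampled genes along the genome (stable sort keeps ties fixed)
    return idx[np.argsort(chrom_pos[idx], kind="stable")]

# Toy cell: 2,000 genes with random counts and genomic coordinates.
rng = np.random.default_rng(1)
expr = rng.poisson(1.0, size=2000).astype(float)
chrom_pos = rng.permutation(2000)
tokens = sample_gene_tokens(expr, chrom_pos)
```

Note that unexpressed genes have zero sampling probability, so the 1,024 tokens are drawn only from the cell's expressed genes.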
Training used a self-supervised masked gene prediction objective: 20% of expressed genes are masked, and the model is trained to predict their expression status via binary cross-entropy loss that combines the cell embedding with the protein representation of the masked gene. The 650M model was trained for 40 days on 24 NVIDIA A100 80GB GPUs.
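A toy version of this objective is sketched below. The dot-product decoder, the negative-sampling scheme, and all dimensions are assumptions made for illustration; the published model's actual head differs in detail, but the shape of the loss is the same: score a masked gene's frozen protein embedding against the cell embedding and apply binary cross-entropy on its expression status.

```python
import numpy as np

def masked_gene_bce(cell_emb, gene_embs, expressed, mask_frac=0.2, seed=0):
    """Sketch of the masked-gene objective (hypothetical decoder):
    hide 20% of expressed genes, add an equal number of unexpressed
    genes as negatives, and score each candidate's protein embedding
    against the cell embedding with binary cross-entropy."""
    rng = np.random.default_rng(seed)
    expressed_idx = np.flatnonzero(expressed)
    n_mask = max(1, int(mask_frac * expressed_idx.size))
    masked = rng.choice(expressed_idx, size=n_mask, replace=False)
    negs = rng.choice(np.flatnonzero(~expressed), size=n_mask, replace=False)
    cand = np.concatenate([masked, negs])
    labels = expressed[cand].astype(float)    # 1 = expressed, 0 = not
    logits = gene_embs[cand] @ cell_emb       # dot-product decoder (assumed)
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Toy data: 500 genes with 64-d embeddings, about half expressed.
rng = np.random.default_rng(2)
gene_embs = rng.standard_normal((500, 64))
cell_emb = rng.standard_normal(64)
expressed = rng.random(500) < 0.5
loss = masked_gene_bce(cell_emb, gene_embs, expressed)
```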
On the Tabula Sapiens v2 benchmark, UCE outperformed Geneformer by 9.0% overall, 10.6% on biological conservation, and 7.4% on batch correction in a zero-shot setting. On the Single-Cell Integration Benchmark, it exceeded the next-best method by 13.9% overall, with gains of 16.2% on biological conservation and 10.1% on batch correction.
UCE is well suited for tasks where labeled reference data is scarce or where cross-dataset comparability is required. Cell type annotation workflows benefit from the model's ability to map newly sequenced cells to known types without requiring a labeled training set. Cross-species studies can embed cells from multiple organisms into a shared space to identify conserved or divergent cell states. Large-scale atlas integration projects can merge heterogeneous datasets from different labs, protocols, and species without retraining for each new dataset. The structured embedding space also supports novel cell type discovery by identifying unknown populations through proximity to annotated neighbors, and hypothesis generation around developmental relationships or tissue hierarchies.
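The annotation-by-proximity workflow mentioned above can be sketched as a k-nearest-neighbor label transfer in the embedding space. The function and toy data are hypothetical (2-d vectors standing in for 1,280-d UCE embeddings); the paper does not prescribe this exact procedure, but majority-vote k-NN is a common way to use such embeddings.

```python
import numpy as np

def transfer_labels(query_emb, ref_emb, ref_labels, k=5):
    """Hypothetical annotation workflow: assign each query cell the
    majority cell type among its k nearest reference neighbors
    (cosine similarity) in a shared embedding space."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    nn = np.argsort(-(q @ r.T), axis=1)[:, :k]   # top-k neighbors per query
    calls = []
    for row in nn:
        labs, counts = np.unique(ref_labels[row], return_counts=True)
        calls.append(labs[np.argmax(counts)])
    return np.array(calls)

# Toy reference: two well-separated cell-type clusters.
rng = np.random.default_rng(0)
ref_emb = np.vstack([rng.normal([5, 0], 0.1, size=(20, 2)),
                     rng.normal([0, 5], 0.1, size=(20, 2))])
ref_labels = np.array(["T cell"] * 20 + ["B cell"] * 20)
query_emb = rng.normal([5, 0], 0.1, size=(5, 2))
calls = transfer_labels(query_emb, ref_emb, ref_labels)
```

The same mechanics support novel-type discovery: a query cell whose nearest neighbors are all distant, or split across many labels, is a candidate unknown population.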
UCE was released alongside the Chan Zuckerberg Initiative's Virtual Cells Platform, which provides a managed interface for embedding new datasets without local infrastructure. Model weights for both the 33-layer and 4-layer variants are publicly available via Figshare and are downloaded automatically by scripts in the official GitHub repository. The model represents a significant step toward a truly universal representation of cell state, one that is not bound to a single organism or experimental protocol. Key limitations include reliance on protein annotations (genes without annotated protein products, such as many non-coding RNA genes, cannot be directly represented) and restriction to scRNA-seq data, with no native support for ATAC-seq, spatial transcriptomics, or other modalities. The full 650M model also requires substantial GPU memory (80 GB recommended), though the 4-layer variant addresses resource-constrained settings.
Rosen, Y., et al. (2024). Universal Cell Embeddings: A Foundation Model for Cell Biology. bioRxiv. DOI: 10.1101/2023.11.28.568918