Zero-shot foundation model for single-cell gene expression that generates species-agnostic cell embeddings using protein language model representations of gene products.
UCE (Universal Cell Embeddings) is a 650-million parameter transformer foundation model for single-cell biology, developed by the Snap Lab (Jure Leskovec group) at Stanford University in collaboration with the Tabula Sapiens Consortium and Stephen Quake's lab. Released in late 2023, it addresses a fundamental challenge in single-cell genomics: how to compare and integrate cells across datasets, tissues, and species when gene nomenclature, sequencing protocols, and biological contexts all vary.
The core insight behind UCE is to represent genes not as learned vocabulary tokens tied to a particular organism, but through frozen ESM-2 protein language model embeddings of their protein products. Because ESM-2 encodes evolutionary information that is conserved across species, this design makes UCE inherently species-agnostic. A cell from a zebrafish and a cell from a human can be embedded into the same latent space without any organism-specific retraining.
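This design can be made concrete with a minimal sketch. The embedding table below is a hypothetical stand-in (random vectors rather than real ESM-2 outputs, and placeholder protein names), but it shows the key point: a gene's token is looked up by its protein product, through one table shared by all species, so no organism-specific vocabulary is learned.

```python
import numpy as np

# Hypothetical stand-in for the frozen ESM-2 protein embedding table.
# In UCE these are real ESM-2 vectors (dimension 5,120); random vectors
# and placeholder protein names are used here purely to show the flow.
rng = np.random.default_rng(0)
ESM2_DIM = 5120
protein_emb = {
    "HBA1_protein": rng.standard_normal(ESM2_DIM),   # a human gene's protein
    "hbaa1_protein": rng.standard_normal(ESM2_DIM),  # a zebrafish gene's protein
}

def gene_token(protein_id):
    """A gene's input representation is the frozen embedding of its
    protein product -- the same lookup regardless of species."""
    return protein_emb[protein_id]

human_tok = gene_token("HBA1_protein")
fish_tok = gene_token("hbaa1_protein")
```

Because both tokens live in the same ESM-2 space, downstream layers never need to know which organism a gene came from.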
UCE was trained in a fully self-supervised manner on the Integrated Mega-scale Atlas — 36 million cells drawn from more than 300 datasets spanning eight species. No cell type labels or data annotations were used during training. Despite this, the resulting embedding space exhibits emergent biological structure: developmental lineages, tissue hierarchies, and cell type relationships emerge without being explicitly taught.
UCE is a 33-layer transformer with an embedding dimension of 1,280. At inference time, 1,024 genes are sampled per cell with replacement, weighted by expression level, and arranged by chromosomal position. Frozen ESM-2 protein embeddings (dimension 5,120) serve as fixed gene representations, providing an evolutionarily informed starting point that the transformer then contextualizes within the cell's expression profile. The transformer produces a single cell-level embedding from this ordered gene sequence.
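The sampling-and-ordering step described above can be sketched as follows. This is a simplified illustration under the stated description (expression-weighted sampling with replacement, then arrangement by genomic coordinate); the function name and toy data are assumptions, not the repository's actual code.

```python
import numpy as np

def sample_gene_tokens(expr, chrom_pos, n_tokens=1024, seed=0):
    """Sketch of UCE-style input construction: sample genes with
    replacement, weighted by expression level, then arrange the
    sample by chromosomal position before it enters the transformer."""
    rng = np.random.default_rng(seed)
    probs = expr / expr.sum()                     # expression-weighted sampling
    idx = rng.choice(expr.size, size=n_tokens, replace=True, p=probs)
    # order the sampled genes along the genome (stable sort keeps ties fixed)
    return idx[np.argsort(chrom_pos[idx], kind="stable")]

# Toy cell: 2,000 genes with random counts and genomic coordinates.
rng = np.random.default_rng(1)
expr = rng.poisson(1.0, size=2000).astype(float)
chrom_pos = rng.permutation(2000)
tokens = sample_gene_tokens(expr, chrom_pos)
```

Note that unexpressed genes have zero sampling probability, so the 1,024 tokens are drawn only from the cell's expressed genes.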
Training used a self-supervised masked gene prediction objective: 20% of expressed genes are masked, and the model is trained to predict their expression status via binary cross-entropy loss that combines the cell embedding with the protein representation of the masked gene. The 650M model was trained for 40 days on 24 NVIDIA A100 80GB GPUs.
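A toy version of this objective is sketched below. The dot-product decoder, the negative-sampling scheme, and all dimensions are assumptions made for illustration; the published model's actual head differs in detail, but the shape of the loss is the same: score a masked gene's frozen protein embedding against the cell embedding and apply binary cross-entropy on its expression status.

```python
import numpy as np

def masked_gene_bce(cell_emb, gene_embs, expressed, mask_frac=0.2, seed=0):
    """Sketch of the masked-gene objective (hypothetical decoder):
    hide 20% of expressed genes, add an equal number of unexpressed
    genes as negatives, and score each candidate's protein embedding
    against the cell embedding with binary cross-entropy."""
    rng = np.random.default_rng(seed)
    expressed_idx = np.flatnonzero(expressed)
    n_mask = max(1, int(mask_frac * expressed_idx.size))
    masked = rng.choice(expressed_idx, size=n_mask, replace=False)
    negs = rng.choice(np.flatnonzero(~expressed), size=n_mask, replace=False)
    cand = np.concatenate([masked, negs])
    labels = expressed[cand].astype(float)    # 1 = expressed, 0 = not
    logits = gene_embs[cand] @ cell_emb       # dot-product decoder (assumed)
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Toy data: 500 genes with 64-d embeddings, about half expressed.
rng = np.random.default_rng(2)
gene_embs = rng.standard_normal((500, 64))
cell_emb = rng.standard_normal(64)
expressed = rng.random(500) < 0.5
loss = masked_gene_bce(cell_emb, gene_embs, expressed)
```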
On the Tabula Sapiens v2 benchmark, UCE outperformed Geneformer by 9.0% overall, 10.6% on biological conservation, and 7.4% on batch correction in a zero-shot setting. On the Single-Cell Integration Benchmark, it exceeded the next-best method by 13.9% overall, with gains of 16.2% on biological conservation and 10.1% on batch correction.
UCE is well suited for tasks where labeled reference data is scarce or where cross-dataset comparability is required. Cell type annotation workflows benefit from the model's ability to map newly sequenced cells to known types without requiring a labeled training set. Cross-species studies can embed cells from multiple organisms into a shared space to identify conserved or divergent cell states. Large-scale atlas integration projects can merge heterogeneous datasets from different labs, protocols, and species without retraining for each new dataset. The structured embedding space also supports novel cell type discovery by identifying unknown populations through proximity to annotated neighbors, and hypothesis generation around developmental relationships or tissue hierarchies.
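The annotation-by-proximity workflow mentioned above can be sketched as a k-nearest-neighbor label transfer in the embedding space. The function and toy data are hypothetical (2-d vectors standing in for 1,280-d UCE embeddings); the paper does not prescribe this exact procedure, but majority-vote k-NN is a common way to use such embeddings.

```python
import numpy as np

def transfer_labels(query_emb, ref_emb, ref_labels, k=5):
    """Hypothetical annotation workflow: assign each query cell the
    majority cell type among its k nearest reference neighbors
    (cosine similarity) in a shared embedding space."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    nn = np.argsort(-(q @ r.T), axis=1)[:, :k]   # top-k neighbors per query
    calls = []
    for row in nn:
        labs, counts = np.unique(ref_labels[row], return_counts=True)
        calls.append(labs[np.argmax(counts)])
    return np.array(calls)

# Toy reference: two well-separated cell-type clusters.
rng = np.random.default_rng(0)
ref_emb = np.vstack([rng.normal([5, 0], 0.1, size=(20, 2)),
                     rng.normal([0, 5], 0.1, size=(20, 2))])
ref_labels = np.array(["T cell"] * 20 + ["B cell"] * 20)
query_emb = rng.normal([5, 0], 0.1, size=(5, 2))
calls = transfer_labels(query_emb, ref_emb, ref_labels)
```

The same mechanics support novel-type discovery: a query cell whose nearest neighbors are all distant, or split across many labels, is a candidate unknown population.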
UCE was released alongside the Chan Zuckerberg Initiative's Virtual Cells Platform, which provides a managed interface for embedding new datasets without local infrastructure. Model weights for both the 33-layer and 4-layer variants are publicly available via Figshare and are downloaded automatically by scripts in the official GitHub repository. The model represents a significant step toward a truly universal representation of cell state, one that is not bound to a single organism or experimental protocol. Key limitations include reliance on protein annotations (genes without annotated protein products, such as many non-coding RNA genes, cannot be directly represented) and restriction to scRNA-seq data, with no native support for ATAC-seq, spatial transcriptomics, or other modalities. The full 650M model also requires substantial GPU memory (80 GB recommended), though the 4-layer variant addresses resource-constrained settings.
Rosen, Y., et al. (2024). Universal Cell Embeddings: A Foundation Model for Cell Biology. bioRxiv. DOI: 10.1101/2023.11.28.568918