A generative cross-species foundation model for single-cell transcriptomics, trained on 112 million cells from 12 species spanning 1.5 billion years of evolution.
TranscriptFormer is a family of generative foundation models developed by the Chan Zuckerberg Initiative (CZI) for single-cell transcriptomics. Unlike prior single-cell foundation models that learn embeddings through masked gene prediction, TranscriptFormer takes a fundamentally generative approach: it autoregressively models the joint probability of gene identities and their expression counts within each cell. Trained on up to 112 million cells from 12 species spanning 1.53 billion years of evolution, TranscriptFormer represents the most evolutionarily diverse single-cell model published to date, covering organisms from the malaria parasite and yeast through sponges, insects, and vertebrates including humans.
The model's core innovation is its expression-aware attention mechanism, which incorporates transcript count information directly into the self-attention computation rather than treating expression levels as discrete tokens. By jointly modeling which genes are expressed and at what levels, TranscriptFormer captures the quantitative nature of transcriptomic data more faithfully than rank-based or masked approaches. This generative formulation also enables the model to function as a virtual instrument for biology, allowing researchers to prompt it to predict cell type-specific transcription factors, infer gene-gene interactions, and generate plausible expression profiles.
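To make the idea concrete, the minimal PyTorch sketch below shows one way transcript counts could enter self-attention, by adding a learned, log-count-derived bias to the attention logits. This is an illustration under stated assumptions, not the paper's exact formulation; all class, parameter, and argument names here are hypothetical.

```python
# Sketch: count-aware self-attention where log-transformed transcript counts
# bias the attention logits, so attention depends on expression level as well
# as gene identity. Not the published formulation; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionAwareAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned per-head scale for the count-derived bias (an assumption).
        self.count_scale = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x, counts, causal_mask):
        # x: (batch, seq, d_model) gene token embeddings
        # counts: (batch, seq) raw transcript counts per gene token
        # causal_mask: bool tensor, True where attention is disallowed
        B, S, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, S, self.n_heads, self.d_head).transpose(1, 2)

        logits = q @ k.transpose(-2, -1) / self.d_head**0.5        # (B, H, S, S)
        count_bias = torch.log1p(counts.float())[:, None, None, :]  # bias keys by expression
        logits = logits + self.count_scale[None, :, None, None] * count_bias
        logits = logits.masked_fill(causal_mask, float("-inf"))

        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, S, -1)
        return self.out(out)
```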
TranscriptFormer is available in three variants: TF-Metazoa (12 species, 112M cells), TF-Exemplar (5 species, 110M cells), and TF-Sapiens (human only, 57M cells). In zero-shot evaluations, these models achieve state-of-the-art performance on cell type classification, cross-species annotation transfer, and disease state identification, outperforming established models such as UCE, scGPT, and Geneformer on multiple benchmarks.
TranscriptFormer uses a decoder-only transformer architecture with 12 layers, 16 attention heads, and a model dimension of 2048. Gene embeddings are derived from frozen ESM-2 protein language model embeddings, providing a species-agnostic representation grounded in protein sequence similarity. An assay token is prepended to each cell sequence to encode the sequencing technology (10x Chromium, Smart-seq2, etc.), and no positional encodings are used, since single-cell data has no inherent gene ordering. Gene order is randomly shuffled in each training batch to enforce permutation invariance.
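As an illustration of this input pipeline, the sketch below builds a cell's token sequence from frozen ESM-2 gene embeddings, prepends an assay token, and shuffles gene order within each batch. The class name, projection layer, and tensor shapes are assumptions, not the released implementation.

```python
# Sketch of input construction: frozen ESM-2 gene embeddings projected to the
# model dimension, a prepended assay token, per-batch gene shuffling, and no
# positional encodings. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class CellTokenizer(nn.Module):
    def __init__(self, esm2_gene_embeddings: torch.Tensor, n_assays: int, d_model: int = 2048):
        super().__init__()
        # One frozen protein-language-model embedding per gene in the vocabulary.
        self.register_buffer("gene_emb", esm2_gene_embeddings)    # (vocab, d_esm), not trained
        self.gene_proj = nn.Linear(esm2_gene_embeddings.shape[1], d_model)
        self.assay_emb = nn.Embedding(n_assays, d_model)

    def forward(self, gene_ids, counts, assay_id):
        # gene_ids: (batch, seq) indices of expressed genes; counts: (batch, seq)
        # Shuffle gene order within each cell to enforce permutation invariance.
        perm = torch.argsort(torch.rand_like(counts.float()), dim=-1)
        gene_ids = torch.gather(gene_ids, 1, perm)
        counts = torch.gather(counts, 1, perm)

        tokens = self.gene_proj(self.gene_emb[gene_ids])           # (batch, seq, d_model)
        assay = self.assay_emb(assay_id)[:, None, :]               # (batch, 1, d_model)
        # Prepend the assay token; no positional encodings are added.
        return torch.cat([assay, tokens], dim=1), counts
```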
The model employs two coupled output heads: a gene decoder predicting a categorical distribution over the gene vocabulary, and a count decoder predicting a zero-truncated Poisson distribution over transcript counts conditioned on the predicted gene. Training data for TF-Metazoa comprised 112 million cells across 465 tissues, 1,865 cell types, and 129 disease states, sourced from CZ CELLxGENE, Tabula Sapiens, ZebraHub, and related repositories. Approximately 3.5 trillion tokens were processed over roughly 15 training epochs.
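The two heads and their joint loss can be sketched as follows. The conditioning scheme for the count decoder and the layer shapes are assumptions; the zero-truncated Poisson log-likelihood itself is the standard form.

```python
# Sketch of the coupled output heads: a categorical gene decoder and a count
# decoder parameterizing a zero-truncated Poisson conditioned on the target
# gene. Layer names and the conditioning scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, d_gene: int):
        super().__init__()
        self.gene_head = nn.Linear(d_model, vocab_size)
        # Count rate depends on the hidden state and the embedded target gene.
        self.count_head = nn.Sequential(
            nn.Linear(d_model + d_gene, d_model), nn.GELU(), nn.Linear(d_model, 1)
        )

    def loss(self, hidden, target_gene, target_gene_emb, target_count):
        # Categorical negative log-likelihood over the gene vocabulary.
        gene_nll = F.cross_entropy(self.gene_head(hidden), target_gene)

        # Zero-truncated Poisson NLL for counts k >= 1:
        #   log P(k) = k*log(lam) - lam - lgamma(k + 1) - log(1 - exp(-lam))
        lam = F.softplus(self.count_head(torch.cat([hidden, target_gene_emb], dim=-1))).squeeze(-1)
        k = target_count.float()
        log_p = k * torch.log(lam) - lam - torch.lgamma(k + 1) - torch.log1p(-torch.exp(-lam))
        return gene_nll - log_p.mean()
```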
On out-of-distribution species classification across 5 unseen species, TF-Metazoa achieves average F1 of 0.778 versus 0.701 for UCE. On cross-species spermatogenesis transfer across 9 species, TF-Exemplar achieves average F1 of 0.480 versus 0.377 for UCE and 0.246 for ESM2-CE. In-distribution human cell type classification on Tabula Sapiens 2.0 reaches Macro F1 of 0.910, comparable to scGPT and Geneformer.
TranscriptFormer serves researchers conducting large-scale cell type annotation across tissues and species, reducing the manual annotation burden in single-cell atlasing projects. Its cross-species embedding space enables transfer of cell type labels between organisms, benefiting comparative biology studies and allowing well-annotated reference species to guide annotation of less-studied organisms. The zero-shot disease state detection capability is particularly valuable for early exploratory work, where disease-specific training data may not yet exist. Generative prompting applications include predicting cell type-specific transcription factors and inferring gene regulatory relationships, providing computational hypotheses for follow-up wet-lab validation.
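A typical zero-shot label-transfer workflow of this kind can be sketched with a k-nearest-neighbor vote over model embeddings: embed an annotated reference and an unannotated query (for example, from another species) with the same model, then assign each query cell the majority label of its nearest reference neighbors. The embedding extraction step is assumed to come from the released tooling and is not shown here.

```python
# Sketch of zero-shot cell type label transfer via k-nearest neighbors in the
# shared cross-species embedding space. The embedding arrays are assumed to be
# produced beforehand by the model's own embedding routine.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def transfer_labels(ref_embeddings: np.ndarray, ref_labels: np.ndarray,
                    query_embeddings: np.ndarray, k: int = 15) -> np.ndarray:
    """Assign cell type labels to query cells by k-nearest-neighbor vote."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(ref_embeddings, ref_labels)
    return knn.predict(query_embeddings)
```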
TranscriptFormer marks a significant methodological shift in single-cell foundation modeling, moving from discriminative masked prediction to a fully generative, count-aware framework. Its evolutionary scope substantially exceeds that of prior cross-species models and demonstrates that representations grounded in protein sequence homology can effectively bridge species boundaries in transcriptomics. As of early 2026, the primary paper remains a bioRxiv preprint and has not undergone formal peer review. Practical limitations include GPU requirements (an NVIDIA A100 40GB is recommended), count clipping at 30 during training (which may limit resolution for highly expressed genes), the absence of spatial transcriptomics integration, and fixed species vocabularies that restrict applicability to organisms outside the training set. Nonetheless, its release through the CZI Virtual Cells Platform alongside MIT-licensed code and weights positions TranscriptFormer as an accessible and extensible resource for the single-cell community.
Pearce, J. D., Simmonds, S. E., Mahmoudabadi, G., Krishnan, L., Palla, G., Istrate, A.-M., Tarashansky, A., Nelson, B., Valenzuela, O., Li, D., Quake, S. R., & Karaletsos, T. (2025). A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model. bioRxiv. https://doi.org/10.1101/2025.04.25.650731