Perturbation-trained single-cell foundation models (up to 3B parameters) that jointly model genes, cells, and compounds for precision oncology tasks.
Tahoe-x1 (Tx1) is a family of single-cell foundation models developed by Tahoe Therapeutics and released as a bioRxiv preprint in October 2025. Foundation models have reshaped natural language processing and computer vision, but their potential in single-cell biology, particularly for complex diseases such as cancer, remains comparatively underexplored. Tx1 targets this gap by scaling perturbation-trained single-cell models up to 3 billion parameters and orienting their training and evaluation toward cancer-relevant tasks.
The central idea behind Tx1 is to learn not just from baseline transcriptional states but from how cells respond to perturbation. The models are pretrained on large-scale single-cell transcriptomic data, including the Tahoe-100M perturbation compendium, and then fine-tuned for disease-relevant downstream applications. Rather than treating gene expression in isolation, Tx1 jointly learns representations of genes, cells, and compounds, allowing a single backbone to reason about both cellular identity and the effect of chemical interventions.
Tx1 sits alongside earlier single-cell foundation models such as Geneformer, scGPT, scFoundation, and AIDO.Cell. It differs from them in three ways: a perturbation-centric training signal, explicit modeling of compounds via a drug token, and a focus on precision oncology benchmarks. Tahoe Therapeutics released the model as an unusually open package (pretrained checkpoints, training code, and evaluation workflows) to accelerate community work on perturbation-trained single-cell models.
Tx1 is a transformer-based single-cell foundation model trained with a masked-expression generative objective. The key architectural addition is a drug token that is incorporated alongside gene and cell representations, allowing the model to condition expression predictions on chemical perturbations and to jointly learn gene, cell, and compound embeddings. Pretraining draws on roughly 266 million single-cell transcriptomic profiles, including the Tahoe-100M perturbation compendium, before task-specific fine-tuning. Architectural optimizations, data-loader refinements, and efficient training strategies together yield a reported 3-30x improvement in compute efficiency relative to prior cell-state model implementations. The released family spans three sizes: approximately 70M, 1.3B, and 3B parameters. The authors report state-of-the-art performance on all four evaluated benchmarks: gene essentiality (overall and context-specific), hallmarks-of-cancer gene identification, cell-type classification, and perturbation-response prediction in held-out cellular contexts.
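To make the masked-expression objective and the drug token more concrete, the following is a minimal illustrative sketch in plain Python. It is not the Tx1 implementation: the function names (`build_masked_example`, `masked_mse`), the `MASK_VALUE` sentinel, the 15% default mask fraction, the squared-error loss, and the toy gene and drug identifiers are all assumptions chosen for illustration. It shows only the general idea of hiding some expression values, prepending a compound identifier to the input, and scoring predictions on the hidden positions.

```python
import random

MASK_VALUE = -1.0  # hypothetical sentinel marking a hidden expression value


def build_masked_example(gene_ids, expr_values, drug_id, mask_frac=0.15, seed=None):
    """Assemble one toy example: a drug token followed by (gene, expression)
    pairs, with a random subset of expression values hidden."""
    rng = random.Random(seed)
    n = len(gene_ids)
    n_mask = max(1, round(mask_frac * n))           # always hide at least one value
    masked_positions = sorted(rng.sample(range(n), n_mask))
    hidden = set(masked_positions)
    expr_in = [MASK_VALUE if i in hidden else v for i, v in enumerate(expr_values)]
    # The drug token leads the sequence so predictions can be conditioned on it.
    tokens = [("drug", drug_id)] + list(zip(gene_ids, expr_in))
    return {
        "tokens": tokens,
        "target": list(expr_values),
        "masked_positions": masked_positions,
    }


def masked_mse(pred, target, masked_positions):
    """Mean squared error computed only over the masked positions."""
    errors = [(pred[i] - target[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)


# Toy usage with placeholder identifiers (not real genes or compounds).
example = build_masked_example(
    gene_ids=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    expr_values=[0.0, 2.0, 5.0, 1.0],
    drug_id="DRUG_X",
    mask_frac=0.5,
    seed=0,
)
perfect_loss = masked_mse(example["target"], example["target"], example["masked_positions"])
```

In a real model, the predictions passed to the loss would come from the transformer's output at the masked positions; here the target is reused simply to show that a perfect reconstruction gives zero loss.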
Tx1 is aimed at precision oncology and broader perturbation biology. By predicting context-specific gene essentiality, it can help prioritize candidate therapeutic targets in particular cancer backgrounds, while its hallmarks-of-cancer gene identification supports mechanistic interpretation of tumor biology. The model's ability to predict perturbation responses in held-out cellular contexts is directly useful for in silico screening — anticipating how cells will respond to genetic or chemical interventions before committing to expensive wet-lab experiments. Cell-type classification rounds out a toolkit relevant to computational biologists, cancer researchers, and drug discovery teams analyzing single-cell and perturbation datasets.
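As a rough illustration of how context-specific essentiality predictions might feed target prioritization, the sketch below ranks genes by how much more essential they are predicted to be in one context than in the others. Everything here is an assumption for illustration: the function name `context_specific_ranking`, the score format (a dictionary keyed by gene and context, higher meaning more essential), the "score minus mean of other contexts" heuristic, and the toy values, which are invented placeholders rather than model outputs.

```python
from collections import defaultdict
from statistics import mean


def context_specific_ranking(scores, context, top_k=3):
    """Rank genes by predicted essentiality in `context` relative to their
    average predicted essentiality in all other contexts.

    scores: dict mapping (gene, context) -> predicted essentiality score.
    Returns a list of (gene, specificity) pairs, most context-specific first.
    """
    by_gene = defaultdict(dict)
    for (gene, ctx), score in scores.items():
        by_gene[gene][ctx] = score

    ranked = []
    for gene, per_ctx in by_gene.items():
        if context not in per_ctx:
            continue  # no prediction for this gene in the requested context
        others = [s for ctx, s in per_ctx.items() if ctx != context]
        baseline = mean(others) if others else 0.0
        ranked.append((gene, per_ctx[context] - baseline))

    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy placeholder scores; gene and cell-line names are hypothetical.
toy_scores = {
    ("GENE_A", "line_1"): 0.9, ("GENE_A", "line_2"): 0.9,  # essential everywhere
    ("GENE_B", "line_1"): 0.8, ("GENE_B", "line_2"): 0.1,  # essential mainly in line_1
    ("GENE_C", "line_1"): 0.2, ("GENE_C", "line_2"): 0.3,  # not essential
}
ranking = context_specific_ranking(toy_scores, "line_1")
top_genes = [gene for gene, _ in ranking]
```

With these toy values, a gene that is predicted essential only in the chosen context ranks above one that is essential in every context, which is the distinction between context-specific and overall essentiality that the benchmarks draw.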
Tahoe-x1 demonstrates that single-cell foundation models can be scaled to billions of parameters and trained on perturbation data while remaining compute-efficient and competitive on cancer-relevant benchmarks. Its joint modeling of genes, cells, and compounds via a drug token is a notable design choice that extends single-cell models toward chemical-perturbation reasoning. By releasing pretrained checkpoints, training code, and evaluation workflows under permissive licenses, Tahoe Therapeutics lowers the barrier for other groups to build on perturbation-trained models for precision oncology. As a 2025 preprint, its independent validation and downstream adoption are still developing, and the reported state-of-the-art results await broader external benchmarking.
Gandhi, S., et al. (2025) Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters. bioRxiv.
DOI: 10.1101/2025.10.23.683759