Tahoe-x1

Perturbation-trained single-cell foundation models (up to 3B parameters) that jointly model genes, cells, and compounds for precision oncology tasks.

Released: October 2025

Parameters: 3 Billion

Tahoe-x1 (Tx1) is a family of single-cell foundation models developed by Tahoe Therapeutics and released as a bioRxiv preprint in October 2025. While foundation models have reshaped natural language processing and computer vision, their potential in single-cell biology — and particularly in complex diseases such as cancer — has remained comparatively underexplored. Tx1 targets this gap directly, scaling perturbation-trained single-cell models up to 3 billion parameters and orienting their training and evaluation toward cancer-relevant tasks.

The central idea behind Tx1 is to learn not just from baseline transcriptional states but from how cells respond to perturbation. The models are pretrained on large-scale single-cell transcriptomic data, including the Tahoe-100M perturbation compendium, and then fine-tuned for disease-relevant downstream applications. Rather than treating gene expression in isolation, Tx1 jointly learns representations of genes, cells, and compounds, allowing a single backbone to reason about both cellular identity and the effect of chemical interventions.

Tx1 sits alongside earlier single-cell foundation models such as Geneformer, scGPT, scFoundation, and AIDO.Cell, but differentiates itself through its perturbation-centric training signal, its explicit modeling of compounds via a drug token, and its focus on precision oncology benchmarks. Tahoe Therapeutics released the model as an unusually open package — pretrained checkpoints, training code, and evaluation workflows — to accelerate community work on perturbation-trained single-cell models.

Key Features

Perturbation-trained pretraining: Tx1 is pretrained on large-scale single-cell transcriptomic data including the Tahoe-100M perturbation compendium, so the models learn from how cells respond to interventions rather than from baseline states alone.
Joint gene, cell, and compound modeling: A masked-expression generative objective augmented with a drug token lets a single model jointly represent genes, cells, and chemical compounds, enabling flexible adaptation across downstream tasks.
Scaling to 3 billion parameters: Three checkpoints are released — Tx1-70M (~70M), Tx1-1B (~1.3B), and Tx1-3B (~3B parameters) — spanning a range of compute-performance trade-offs.
High compute efficiency: Through architectural optimizations, data-loader refinements, and efficient training strategies, Tx1 reaches 3-30x higher compute efficiency than prior implementations of cell-state models.
Cancer-focused benchmarking: The models are evaluated on four disease-relevant tasks — gene essentiality prediction, hallmarks-of-cancer gene identification, cell-type classification, and perturbation-response prediction in held-out contexts.
Open release: Pretrained checkpoints and code are released under Apache-2.0, with an interactive HuggingFace Space demo for hands-on exploration.
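
To make the compute-performance trade-off between the three checkpoints more concrete, the following back-of-the-envelope sketch estimates the memory needed just to hold each model's weights. The parameter counts are the approximate figures quoted above; the numeric formats (bf16, fp32) and the helper name `weight_memory_gb` are illustrative assumptions, not details taken from the release, and the estimate ignores activations, optimizer state, and framework overhead.

```python
# Approximate parameter counts as stated in the feature list above.
CHECKPOINT_PARAMS = {
    "Tx1-70M": 70e6,
    "Tx1-1B": 1.3e9,
    "Tx1-3B": 3e9,
}

# Bytes per parameter for two common numeric formats. Assumption: the
# formats actually used by the released checkpoints are not stated here.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Return the memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, n_params in CHECKPOINT_PARAMS.items():
        bf16 = weight_memory_gb(n_params, "bf16")
        fp32 = weight_memory_gb(n_params, "fp32")
        print(f"{name}: ~{bf16:.2f} GB in bf16, ~{fp32:.2f} GB in fp32")
```

Under these assumptions the largest checkpoint needs roughly 6 GB for weights in bf16, while the smallest fits in well under 1 GB, which is the practical meaning of the trade-off for anyone choosing a checkpoint for limited hardware.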

Technical Details

Tx1 is a transformer-based single-cell foundation model trained with a masked-expression generative objective. The key architectural addition is a drug token that is incorporated alongside gene and cell representations, allowing the model to condition expression predictions on chemical perturbations and to jointly learn gene, cell, and compound embeddings. Pretraining draws on roughly 266 million single-cell transcriptomic profiles, including the Tahoe-100M perturbation compendium, before task-specific fine-tuning. Architectural optimizations, data-loader refinements, and efficient training strategies together yield a reported 3-30x improvement in compute efficiency relative to prior cell-state model implementations. The released family spans three sizes — approximately 70M, 1.3B, and 3B parameters — and the authors report state-of-the-art performance across all four evaluated benchmarks: overall and context-specific gene essentiality, hallmarks-of-cancer gene identification, cell-type classification, and perturbation-response prediction in held-out cellular contexts.
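
The idea of a masked-expression objective conditioned on a drug token can be illustrated with a small, framework-free sketch. This is not the Tx1 implementation: the token layout, the `MASK_VALUE` sentinel, the mask fraction, the use of mean squared error, and the function names are all hypothetical choices made for illustration. It shows only the general pattern: prepend a compound token to the gene sequence, hide some expression values, and score predictions on the hidden positions alone.

```python
import random

MASK_VALUE = -1.0  # hypothetical sentinel marking a hidden expression value


def build_masked_example(gene_ids, expression, drug_id, mask_fraction=0.5, seed=0):
    """Build one training example: a drug token plus partially masked genes.

    Returns (tokens, inputs, targets, masked_positions), where the positions
    index into the gene part of the sequence.
    """
    rng = random.Random(seed)
    n_genes = len(gene_ids)
    n_masked = max(1, round(mask_fraction * n_genes))
    masked_positions = sorted(rng.sample(range(n_genes), n_masked))
    masked = set(masked_positions)

    # The drug token is prepended so every gene position can condition on it.
    tokens = [("DRUG", drug_id)] + [("GENE", g) for g in gene_ids]
    # Hidden values are replaced by the sentinel; visible values pass through.
    inputs = [MASK_VALUE if i in masked else x for i, x in enumerate(expression)]
    # The model is asked to reconstruct only the hidden values.
    targets = [expression[i] for i in masked_positions]
    return tokens, inputs, targets, masked_positions


def masked_mse(predictions, targets):
    """Mean squared error computed only over the masked positions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


tokens, inputs, targets, positions = build_masked_example(
    gene_ids=["TP53", "MYC", "EGFR", "KRAS"],
    expression=[2.0, 0.0, 5.0, 1.0],
    drug_id="compound_A",  # hypothetical compound identifier
)
# A trivial stand-in "model" that predicts zero everywhere, to exercise the loss.
baseline_loss = masked_mse([0.0] * len(targets), targets)
```

In a real model the predictions would come from a transformer that attends over the drug token and the visible genes; the sketch replaces that with a constant predictor so the data flow stays visible.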

Applications

Tx1 is aimed at precision oncology and broader perturbation biology. By predicting context-specific gene essentiality, it can help prioritize candidate therapeutic targets in particular cancer backgrounds, while its hallmarks-of-cancer gene identification supports mechanistic interpretation of tumor biology. The model's ability to predict perturbation responses in held-out cellular contexts is directly useful for in silico screening — anticipating how cells will respond to genetic or chemical interventions before committing to expensive wet-lab experiments. Cell-type classification rounds out a toolkit relevant to computational biologists, cancer researchers, and drug discovery teams analyzing single-cell and perturbation datasets.
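
Evaluating perturbation-response prediction in held-out cellular contexts depends on a split in which the test contexts never appear during training. The sketch below shows one simple way such a split could be built. The record fields (`context`, `compound`), the example values, and the function name are hypothetical, and this is not the evaluation workflow shipped with Tx1.

```python
def split_by_context(records, held_out_contexts):
    """Split records so that held-out contexts never appear in training."""
    held_out = set(held_out_contexts)
    train = [r for r in records if r["context"] not in held_out]
    test = [r for r in records if r["context"] in held_out]
    return train, test


# Toy perturbation records; each pairs a cellular context with a compound.
records = [
    {"context": "cell_line_A", "compound": "drug_1"},
    {"context": "cell_line_A", "compound": "drug_2"},
    {"context": "cell_line_B", "compound": "drug_1"},
    {"context": "cell_line_C", "compound": "drug_2"},
]

train, test = split_by_context(records, held_out_contexts=["cell_line_C"])

# Sanity check: no context is shared between the two partitions.
train_contexts = {r["context"] for r in train}
test_contexts = {r["context"] for r in test}
assert train_contexts.isdisjoint(test_contexts)
```

Splitting by context rather than by individual cell is what makes the task a test of generalization: a random per-cell split would let the model see every context during training and would overstate its usefulness for in silico screening.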

Impact

Tahoe-x1 demonstrates that single-cell foundation models can be scaled to billions of parameters and trained on perturbation data while remaining compute-efficient and competitive on cancer-relevant benchmarks. Its joint modeling of genes, cells, and compounds via a drug token is a notable design choice that extends single-cell models toward chemical-perturbation reasoning. By releasing pretrained checkpoints, training code, and evaluation workflows under permissive licenses, Tahoe Therapeutics lowers the barrier for other groups to build on perturbation-trained models for precision oncology. As a 2025 preprint, its independent validation and downstream adoption are still developing, and the reported state-of-the-art results await broader external benchmarking.

Citation

Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters

Preprint

Gandhi, S., et al. (2025) Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters. bioRxiv.

DOI: 10.1101/2025.10.23.683759

Recent citations

Papers that recently cited this model.

Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction
S. Bhattacharya, Christian Gensbigler, Shaamil Karim, et al.
bioRxiv · Jun 2026 · 0 citations

PertDiffBench: Benchmarking Diffusion Models for Single-Cell Perturbation Response Prediction
Zijun Song, Yujia Xiang, Zhi-yi Song, et al.
bioRxiv · Jun 2026 · 0 citations

Identifying fate-determining transcription factors with single-cell omics.
Xi Xi, Chen Li, Lei Wei, et al.
Trends in Genetics · Jun 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Virtual Cells Need Context, Not Just Scale
Payam Dibaeinia, Sudarshan Babu, Mei Knudson, et al.
bioRxiv · Feb 2026 · 3 citations

Discrete Diffusion for Single-Cell Gene Expression Modeling
S. Bhattacharya, Christian Gensbigler, Shaamil Karim, et al.
bioRxiv · Feb 2026 · 2 citations

Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction
S. Bhattacharya, Christian Gensbigler, Shaamil Karim, et al.
bioRxiv · Jun 2026 · 0 citations

Identifying fate-determining transcription factors with single-cell omics.
Xi Xi, Chen Li, Lei Wei, et al.
Trends in Genetics · Jun 2026 · 0 citations

Effective Biological Representation Learning by Masking Gene Expression
Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, et al.
May 2026 · 0 citations

Citations

Total Citations: 13

Influential: 2

References: 28

GitHub

Stars: 158

Forks: 25

Open Issues: 3

Contributors: 9

Last Push: 21d ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 54

Likes: 73

Last Modified: 8mo ago

Fields of citing research

Computer Science: 92%
Biology: 85%
Medicine: 31%
Engineering: 8%
Materials Science: 8%
Chemistry: 8%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall score: 95 (Open)

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 92

Model Openness Framework: Unclassified (no formal model card / data card)

Resources

GitHub Repository · Research Paper · Official Website · HuggingFace Model · Demo · Dataset
