Arc Institute
Transformer model for predicting cellular responses to perturbations across diverse cell contexts, trained on over 267 million human single-cell profiles.
STATE (State Transition and Embedding) is a transformer-based virtual cell model developed at Arc Institute by Yusuf H. Roohani, Patrick D. Hsu, Luke A. Gilbert, Silvana Konermann, and colleagues. Released in June 2025, STATE addresses a central challenge in computational biology: predicting how individual cells respond to genetic, chemical, or signaling perturbations, and generalizing those predictions to cellular contexts never seen during training. Unlike prior perturbation models that treat each experiment in isolation, STATE explicitly accounts for cellular heterogeneity both within a single experiment and across cell types, tissues, and experimental conditions.
The model is built around two interlocking modules that separate the problem of representing what a cell is from the problem of predicting how a perturbation changes it. The State Embedding (SE) model learns a continuous manifold of cellular identity from large-scale observational single-cell RNA-seq data, encoding each cell as a dense vector that captures gene expression variation while remaining robust to technical noise. The State Transition (ST) model then operates on these cell embeddings to predict where a cell will move on that manifold following a given perturbation. The SE model is trained on 167 million unperturbed human cells, and the ST model learns perturbation effects from over 100 million perturbed cells across 70 distinct cellular contexts. This combination lets STATE generalize to cell types for which no perturbation data exist.
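The division of labor between the two modules can be illustrated with a toy sketch. Everything in it is a stand-in chosen for illustration and is not STATE's code or API: `embed` is a fixed random linear projection used in place of the learned SE transformer, and `fit_shift` and `apply_shift` model a perturbation as a single mean displacement in embedding space in place of the ST transformer. The data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, EMB_DIM = 50, 8

# Hypothetical stand-in for the SE model: a fixed random linear projection.
PROJECTION = rng.normal(size=(N_GENES, EMB_DIM)) / np.sqrt(N_GENES)

def embed(expression):
    """Map a (cells x genes) matrix to (cells x EMB_DIM) embeddings."""
    return expression @ PROJECTION

def fit_shift(control_emb, perturbed_emb):
    """Hypothetical stand-in for the ST model: mean displacement in embedding space."""
    return perturbed_emb.mean(axis=0) - control_emb.mean(axis=0)

def apply_shift(control_emb, shift):
    """Predict perturbed embeddings for unperturbed cells from another context."""
    return control_emb + shift

# Simulated data: context A has control and perturbed cells,
# context B has control cells only.
effect = rng.normal(size=N_GENES)
ctrl_a = rng.normal(size=(200, N_GENES))
pert_a = rng.normal(size=(200, N_GENES)) + effect
ctrl_b = rng.normal(loc=0.5, size=(150, N_GENES))

shift = fit_shift(embed(ctrl_a), embed(pert_a))
pred_b = apply_shift(embed(ctrl_b), shift)
print(pred_b.shape)  # (150, 8)
```

A single mean shift ignores the cell-to-cell heterogeneity that STATE is designed to capture. The sketch only shows where each module sits in the data flow: embed first, then predict movement in embedding space.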
STATE was evaluated against prior models on large-scale perturbation datasets including Tahoe-100M, the largest publicly available chemical perturbation dataset at the time of publication. On that benchmark, STATE improved discrimination of perturbation effects by more than 50% and doubled the accuracy of identifying true differentially expressed genes compared to existing approaches. To support rigorous comparison across methods, the team simultaneously introduced Cell-Eval, a comprehensive benchmarking framework that assesses whether models can detect biologically meaningful, cell-type-specific responses including cell survival effects.
The State Transition model is built on a bidirectional transformer with a LLaMA-style backbone. It operates on sets of cell embeddings rather than raw transcriptomes, using self-attention to model dependencies among cells within a perturbation context. The State Embedding model is a dense bidirectional transformer encoder trained to predict log-normalized gene expression from masked input; its decoder is a smaller multi-layer perceptron (MLP) that reconstructs gene expression from a combination of learned cell embeddings and target gene embeddings. This encoder-decoder design yields a smooth, noise-robust latent manifold of cellular states from which the ST model learns perturbation trajectories. All training data is standardized to 19,790 human protein-coding Ensembl genes, and each cell's counts are normalized to a total UMI depth of 10,000 before being passed to either model component.
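The depth normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not STATE's preprocessing code: the function name `normalize_counts` is hypothetical, and the use of `log1p` for the log step is an assumption based on common single-cell practice, since the text says only that expression is log-normalized.

```python
import numpy as np

def normalize_counts(counts, target_sum=10_000.0):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p.

    `counts` is a (cells x genes) array of raw UMI counts.
    """
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for empty cells; their rows stay all zero.
    totals[totals == 0] = 1.0
    scaled = counts / totals * target_sum
    # Assumption: log(1 + x) is the log transform being used.
    return np.log1p(scaled)

raw = np.array([
    [1, 3, 6],     # shallow cell, 10 UMIs in total
    [0, 50, 50],   # deeper cell, 100 UMIs in total
    [0, 0, 0],     # empty cell
])
normalized = normalize_counts(raw)
```

After this step, cells sequenced at different depths are on a comparable scale, so differences between rows reflect relative expression rather than how many UMIs were captured.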
On the Tahoe-100M chemical perturbation benchmark, STATE improved perturbation discrimination by more than 50% over the best prior methods and more than doubled the accuracy of recovering true differentially expressed genes. On the Cell-Eval framework, which probes whether models capture biologically meaningful, cell-type-specific effects such as survival responses, STATE outperformed competing approaches across genetic, signaling, and chemical perturbation classes. The model also transferred knowledge from contexts with abundant perturbation data to novel cellular environments where only observational data is available, supporting the generalization claims of the dual-component design.
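The text does not define how perturbation discrimination is computed. One plausible rank-based formulation is sketched below, under loudly stated assumptions: the function name `discrimination_scores`, the L1 distance, and the rank normalization are all illustrative choices and should not be read as Cell-Eval's implementation. The idea is that a predicted mean profile should lie closer to the observed profile of its own perturbation than to the observed profiles of other perturbations.

```python
import numpy as np

def discrimination_scores(predicted, observed):
    """Per-perturbation score in [0, 1]; 1 means the predicted profile is
    closest to its own observed profile.

    Both inputs are (perturbations x genes) arrays of mean expression
    profiles with rows aligned by perturbation (at least two rows).
    Illustrative metric only, not Cell-Eval's definition.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    observed = np.asarray(observed, dtype=np.float64)
    n = predicted.shape[0]
    # L1 distance between every predicted and every observed profile.
    dists = np.abs(predicted[:, None, :] - observed[None, :, :]).sum(axis=2)
    # Rank of the true match: how many observed profiles are strictly closer.
    true_dist = np.diag(dists)
    ranks = (dists < true_dist[:, None]).sum(axis=1)
    return 1.0 - ranks / (n - 1)

observed = np.eye(3)
good = observed + 0.1            # small error, still closest to the true match
swapped = observed[[1, 0, 2]]    # first two perturbations confused

print(discrimination_scores(good, observed))     # [1. 1. 1.]
print(discrimination_scores(swapped, observed))  # [0.5 0.5 1. ]
```

A score like this rewards models that distinguish perturbations from one another, which a model predicting the same average response for every perturbation cannot do.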
STATE is designed for researchers who need to predict gene expression responses to perturbations at scale, particularly when experimental resources limit the number of cell types or conditions that can be profiled directly. Drug discovery teams can use STATE to screen compounds or genetic targets in silico before committing to large CRISPR screens or chemical perturbation experiments. Computational biologists studying disease mechanisms can apply STATE to model how disease-relevant cell types would respond to therapeutic interventions. The Cell-Eval framework bundled with STATE enables fair benchmarking of perturbation prediction methods, making it a practical resource for methods developers comparing new approaches against the current state of the art. STATE's generalization to unobserved contexts is particularly valuable in rare disease research, where relevant cell types are often too scarce or difficult to culture for exhaustive perturbation profiling.
STATE is Arc Institute's first public virtual cell model. It demonstrates that perturbation generalization (predicting effects in unseen cellular contexts) is achievable at scale when cell identity and perturbation modeling are separated. The model's release coincided with the Virtual Cell Challenge, an Arc Institute-led initiative that used Cell-Eval as the evaluation framework to spur community competition toward building accurate virtual cell systems. By training on hundreds of millions of cells and achieving quantifiable improvements over prior art on standardized benchmarks, STATE sets a new empirical baseline for perturbation models and provides the community with both model weights and evaluation infrastructure to build on. The primary limitation of the current version is its focus on human transcriptomic data; extension to other species, protein-level measurements, or spatially resolved transcriptomics remains an open direction for future work.