STATE

Virtual cell transformer that predicts how cells respond to genetic, chemical, or signaling perturbations, generalizing to unseen cellular contexts.

Released: June 2025

STATE (State Transition and Embedding) is a transformer-based virtual cell model developed at Arc Institute by Yusuf H. Roohani, Patrick D. Hsu, Luke Gilbert, Silvana Konermann, and colleagues. Released in June 2025, STATE addresses a central challenge in computational biology: predicting how individual cells respond to genetic, chemical, or signaling perturbations, and generalizing those predictions to cellular contexts never seen during training. Unlike prior perturbation models that treat each experiment in isolation, STATE explicitly accounts for cellular heterogeneity both within a single experiment and across the broader diversity of cell types, tissues, and experimental conditions.

The model is architected around two interlocking modules that separate the problem of understanding what a cell is from predicting how a perturbation changes it. The State Embedding (SE) model learns a continuous manifold of cellular identity from large-scale observational single-cell RNA-seq data, encoding each cell as a dense vector that captures gene expression variation while remaining robust to technical noise. The State Transition (ST) model then operates on these cell embeddings to predict where a cell will move on that manifold following a given perturbation. By pre-computing cell states from 167 million unperturbed human cells and learning perturbation effects from over 100 million perturbed cells across 70 distinct cellular contexts, STATE achieves generalization to novel cell types where no perturbation data exists.

STATE was evaluated against prior models on large-scale perturbation datasets including Tahoe-100M, the largest publicly available chemical perturbation dataset at the time of publication. On that benchmark, STATE improved discrimination of perturbation effects by more than 50% and doubled the accuracy of identifying true differentially expressed genes compared to existing approaches. To support rigorous comparison across methods, the team simultaneously introduced Cell-Eval, a comprehensive benchmarking framework that assesses whether models can detect biologically meaningful, cell-type-specific responses including cell survival effects.
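To make "accuracy of identifying true differentially expressed genes" concrete, the sketch below scores a predicted DE gene set against a reference set using precision and recall. This is an illustrative assumption about how such recovery could be measured, not the metric definition used in the paper or in Cell-Eval; the function name `de_overlap` and the gene lists are hypothetical.

```python
def de_overlap(predicted_de, true_de):
    """Return (precision, recall) of predicted DE genes against a reference set.

    Hypothetical helper: the actual DE-recovery metric may differ.
    """
    predicted_de, true_de = set(predicted_de), set(true_de)
    hits = predicted_de & true_de  # genes called DE by both model and experiment
    precision = len(hits) / len(predicted_de) if predicted_de else 0.0
    recall = len(hits) / len(true_de) if true_de else 0.0
    return precision, recall


# Toy example with made-up gene sets for one perturbation.
true_genes = {"TP53", "MYC", "JUN", "FOS"}
predicted_genes = {"TP53", "MYC", "EGR1"}
precision, recall = de_overlap(predicted_genes, true_genes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers matters: a model that calls every gene differentially expressed reaches perfect recall with poor precision.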

Key Features

Dual-component architecture: STATE separates cellular identity modeling (State Embedding) from perturbation effect modeling (State Transition), enabling each component to be trained on the most appropriate data and combined flexibly at inference time.
Set-based perturbation prediction: The State Transition model uses self-attention over sets of cells rather than individual transcriptomes, capturing biological and technical heterogeneity within a perturbation experiment without relying on distributional assumptions about the cell population.
Massive training scale: The SE model is pretrained on 167 million unperturbed human cells spanning diverse tissues and datasets, while the ST model learns from over 100 million perturbed cells across 70 cellular contexts, providing broad coverage of human biology.
Zero-shot generalization to novel contexts: By leveraging cell embeddings trained on observational data, STATE can predict perturbation effects in completely new cell types where no perturbation experiments were conducted during training.
Cell-Eval benchmarking framework: STATE is released alongside Cell-Eval, a standardized evaluation suite that measures perturbation prediction quality across multiple criteria including discrimination accuracy, differential gene expression recovery, and cell-type-specific response detection.
Standardized gene vocabulary: All datasets are harmonized to the 19,790 human protein-coding Ensembl genes and normalized to a total UMI depth of 10,000, enabling consistent cross-dataset training and evaluation.
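The depth normalization described in the last item can be sketched as follows. This is a minimal pure-Python illustration, assuming the common convention of rescaling each cell's counts to the target total and then applying log1p; the log1p step and the function name `normalize_cell` are assumptions, not details confirmed by the text above.

```python
import math

TARGET_DEPTH = 10_000  # total UMI depth stated in the text


def normalize_cell(counts, target_depth=TARGET_DEPTH):
    """Rescale one cell's raw UMI counts to sum to target_depth, then log1p.

    The log1p transform is an assumed convention, not a confirmed detail.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to rescale; return zeros.
        return [0.0] * len(counts)
    scale = target_depth / total
    return [math.log1p(c * scale) for c in counts]


# Toy cell with 4 genes (the real vocabulary has 19,790 genes).
raw = [5, 0, 15, 80]
normalized = normalize_cell(raw)

# Undo the log transform to confirm the rescaled depth.
depth = sum(math.expm1(v) for v in normalized)
print(round(depth))  # 10000
```

Because every cell ends up at the same total depth, expression values from datasets sequenced at different depths become comparable.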

Technical Details

The State Transition model is built on a bidirectional transformer with a LLaMA-style backbone. It operates on sets of cell embeddings rather than raw transcriptomes, using self-attention to model dependencies among cells within a perturbation context. The State Embedding model is a dense bidirectional transformer encoder whose training objective is to predict log-normalized gene expression from masked input; the SE decoder is a smaller multi-layer perceptron (MLP) that reconstructs gene expression from a combination of learned cell embeddings and target gene embeddings. This encoder-decoder design creates a smooth, noise-robust latent manifold of cellular states from which the ST model learns perturbation trajectories. All training data is standardized to 19,790 human protein-coding Ensembl genes, and each cell is normalized to a total UMI depth of 10,000 before being passed to either model component.
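The idea of self-attention over a set of cells can be illustrated with a deliberately simplified sketch. It assumes a single attention head with identity query, key, and value projections and no positional encoding; the real ST model uses learned projections and a much larger backbone, so this shows only the mechanism, not the architecture. The name `set_self_attention` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def set_self_attention(cells):
    """Single-head, unmasked self-attention over a set of cell embeddings.

    Assumption for illustration: queries, keys, and values are the embeddings
    themselves. Every cell attends to every other cell (no causal mask), and
    no positional encoding is used, so the listing order carries no information.
    """
    dim = len(cells[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in cells:
        # Scaled dot-product similarity between this cell and every cell.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in cells]
        weights = softmax(scores)
        # Each output is a weighted average of all cell embeddings.
        out.append([sum(w * v[j] for w, v in zip(weights, cells)) for j in range(dim)])
    return out


# Three toy cells with 2-dimensional embeddings.
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = set_self_attention(cells)
# Reversing the input order reverses the output rows and changes nothing else.
mixed_reversed = set_self_attention(cells[::-1])[::-1]
```

Because each output row is a weighted average over the whole set, every cell's representation reflects the population it was measured with, which is how a set-based model can capture within-experiment heterogeneity.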

On the Tahoe-100M chemical perturbation benchmark, STATE demonstrated over 50% improvement in perturbation discrimination accuracy compared to the best prior methods, and achieved more than twice the accuracy in recovering true differentially expressed genes. On the Cell-Eval framework, which specifically probes whether models capture biologically meaningful, cell-type-specific effects such as survival responses, STATE showed substantially stronger performance than competing approaches across genetic, signaling, and chemical perturbation classes. The model also transferred knowledge from contexts with abundant perturbation data to novel cellular environments where only observational data is available, supporting the generalization claims of the dual-component design.
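One plausible reading of "perturbation discrimination" is a rank-based score: a prediction is good if it lies closer to the observed profile of its own perturbation than to the observed profiles of other perturbations. The sketch below implements that idea with L1 distance on mean expression profiles. It is an assumption for illustration and not necessarily Cell-Eval's definition; `discrimination_score` and the toy profiles are hypothetical.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two expression profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def discrimination_score(predicted, observed):
    """Mean normalized rank score in [0, 1]; 1.0 means perfect discrimination.

    predicted, observed: dicts mapping perturbation name -> mean profile.
    Hypothetical metric sketch, not the benchmark's exact formula.
    """
    names = list(observed)
    if len(names) < 2:
        raise ValueError("need at least two perturbations to discriminate")
    scores = []
    for name in names:
        d_true = l1(predicted[name], observed[name])
        # Count other perturbations whose observed profile is strictly closer
        # to this prediction than the matching observed profile is.
        closer = sum(
            1
            for other in names
            if other != name and l1(predicted[name], observed[other]) < d_true
        )
        scores.append(1.0 - closer / (len(names) - 1))
    return sum(scores) / len(scores)


# Toy data: three perturbations, three genes each (made-up values).
observed = {"drugA": [1.0, 0.0, 0.0], "drugB": [0.0, 1.0, 0.0], "drugC": [0.0, 0.0, 1.0]}
good = {"drugA": [0.9, 0.1, 0.0], "drugB": [0.1, 0.8, 0.1], "drugC": [0.0, 0.2, 0.8]}
# A model that confuses drugA with drugB scores lower.
swapped = {"drugA": observed["drugB"], "drugB": observed["drugA"], "drugC": observed["drugC"]}

print(discrimination_score(good, observed))  # 1.0
print(discrimination_score(swapped, observed))
```

A rank-based score of this kind rewards predictions that are specific to each perturbation, so a model that outputs the same average response for every perturbation cannot score well.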

Applications

STATE is designed for researchers who need to predict gene expression responses to perturbations at scale, particularly when experimental resources limit the number of cell types or conditions that can be profiled directly. Drug discovery teams can use STATE to screen compounds or genetic targets in silico before committing to large CRISPR screens or chemical perturbation experiments. Computational biologists studying disease mechanisms can apply STATE to model how disease-relevant cell types would respond to therapeutic interventions. The Cell-Eval framework bundled with STATE enables fair benchmarking of perturbation prediction methods, making it a practical resource for methods developers comparing new approaches against the current state of the art. STATE's generalization to unobserved contexts is particularly valuable in rare disease research, where relevant cell types are often too scarce or difficult to culture for exhaustive perturbation profiling.

Impact

STATE is Arc Institute's first public virtual cell model. It demonstrates that perturbation generalization, meaning the prediction of effects in unseen cellular contexts, is achievable at scale when cell identity and perturbation modeling are separated. The model's release coincided with the Virtual Cell Challenge, an Arc Institute-led initiative that used Cell-Eval as the evaluation framework to spur community competition toward building accurate virtual cell systems. By training on hundreds of millions of cells and achieving quantifiable improvements over prior art on standardized benchmarks, STATE sets a new empirical baseline for what perturbation models can achieve and provides the community with both model weights and evaluation infrastructure to build on. The primary limitation of the current version is its focus on human transcriptomic data; extension to other species, protein-level measurements, or spatially resolved transcriptomics remains an open direction for future work.

Citation

Predicting cellular responses to perturbation across diverse contexts with State

Preprint

Adduri, A., et al. (2025) Predicting cellular responses to perturbation across diverse contexts with State. bioRxiv.

DOI: 10.1101/2025.06.26.661135

Recent citations

Papers that recently cited this model.

Score Distributions, Not Cells: Evaluating Single-Cell Perturbations Under Class Overlap
Youssef Marrakchi, Davide D'Ascenzo, S. Montesano
Jul 2026 · 0 citations

Unbalanced Perturbation Dynamics For Cell Fate Design
Qiangwei Peng, Yuchuan Wang, Jianzhen Li, et al.
bioRxiv · Jul 2026 · 0 citations

Tabular Foundation Models Are Competitive Cellular Perturbation Predictors Across Biological Scales
G. Palla, Alexander Hillsley, Yang-Joon Kim, et al.
bioRxiv · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Sparse Autoencoders Reveal Interpretable Features in Single-Cell Foundation Models
Flavia Pedrocchi, Florian Barkmann, A. Joudaki, et al.
bioRxiv · Mar 2026 · 5 citations

What Makes a Representation Good for Single-Cell Perturbation Prediction?
Wenkang Jiang, Yuhang Liu, Yichao Cai, et al.
May 2026 · 1 citation

Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction
S. Bhattacharya, Christian Gensbigler, Shaamil Karim, et al.
bioRxiv · Jun 2026 · 0 citations

Mechanisms Matter: Transportability of Cellular Perturbation Effects
Shi-ang Qi, Paidamoyo Chapfuwa
bioRxiv · May 2026 · 0 citations

VCBench: A Multi-Dimensional Benchmark for Single-Cell Foundation Models
L. Weidener, M. Brkić, M. Jovanović, et al.
bioRxiv · Jun 2026 · 0 citations · Influential

Citations

Total Citations: 116
Influential: 20
References: 0

GitHub

Stars: 622
Forks: 160
Open Issues: 58
Contributors: 16
Last Push: 1d ago
Language: Python

HuggingFace

Downloads: 255
Likes: 17
Last Modified: 5mo ago

Fields of citing research

Computer Science: 95%
Biology: 88%
Medicine: 35%
Chemistry: 7%
Engineering: 5%
Mathematics: 5%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 21 (Closed)

Usability (can I run it?): 13

Reproducibility (can I retrain it?): 31

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository, Research Paper, Official Website, HuggingFace Model, Dataset, Dataset
