Virtual cell foundation model pairing LLaMA-based cellular encoding with set-aware conditional flow matching to predict single-cell perturbation responses at atlas scale.
SCALE (Scalable Conditional Atlas-Level Endpoint transport) is a virtual cell foundation model that predicts how single cells respond to genetic, chemical, and cytokine perturbations directly from single-cell measurements. Released as a March 2026 bioRxiv preprint by researchers at the Shanghai Artificial Intelligence Laboratory and collaborators, it targets in silico experimentation: simulating perturbation outcomes that would otherwise require costly wet-lab screens.
The model addresses three persistent obstacles in virtual cell modeling: inefficient training and inference pipelines, unstable behavior when modeling sparse high-dimensional single-cell space, and evaluation protocols that reward reconstruction fidelity over biological accuracy. SCALE reframes perturbation prediction as an endpoint-oriented optimal-transport problem, jointly learning set-level cell-population representations and perturbation-conditioned state transitions rather than modeling individual cells in isolation.
By combining a LLaMA-style set encoder with conditional flow matching and a BioNeMo-based systems backbone, SCALE positions itself among recent large perturbation models such as STATE and Tahoe-trained foundation models, emphasizing both predictive accuracy and the engineering efficiency needed to train on atlas-scale data.
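To make the flow-matching idea concrete, the following is a minimal sketch of generic conditional flow matching with a straight-line path between a control state and a perturbed state. It is not SCALE's implementation: the function names (`cfm_training_pair`, `euler_integrate`, `oracle_velocity`), the toy data, and the one-to-one pairing of control and perturbed cells are all assumptions made for illustration. In a real model a neural network would be trained to regress the target velocity, and cells would not come pre-paired.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Straight-line flow matching: interpolated state and regression target.

    x0: control cells, shape (n_cells, n_genes)
    x1: perturbed cells paired with x0 (pairing is an assumption of this toy)
    t:  one time in [0, 1] per cell, shape (n_cells,)
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1   # point on the path at time t
    u_t = x1 - x0                   # constant target velocity along the path
    return x_t, u_t

def euler_integrate(velocity_fn, x0, cond, n_steps=10):
    """Transport x0 from t=0 to t=1 by Euler steps through a velocity field."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt, cond)
    return x

# Toy data: the "perturbation" shifts every control cell by a fixed vector.
rng = np.random.default_rng(0)
control = rng.normal(size=(4, 3))
shift = np.array([1.0, -2.0, 0.5])   # hypothetical perturbation effect
perturbed = control + shift

# What a velocity network would be trained on at t = 0.25.
x_t, u_t = cfm_training_pair(control, perturbed, t=np.full(4, 0.25))

def oracle_velocity(x, t, cond):
    """Stand-in for a trained network: returns the exact target velocity."""
    return np.broadcast_to(cond, x.shape)

predicted = euler_integrate(oracle_velocity, control, cond=shift)
```

With the oracle field, integration carries each control cell to its perturbed endpoint. A learned field would only approximate this, and the conditioning input would be an embedding of the perturbation rather than the shift itself.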
SCALE is an end-to-end model that jointly learns set-level representations and perturbation-conditioned state transitions. It pairs a LLaMA-style set encoder with a conditional flow-matching architecture intended to give stable transport-based prediction. Training and inference run on a BioNeMo-based framework that improves data throughput, distributed scalability, and deployment efficiency. The model is evaluated on Tahoe-100M, a giga-scale single-cell perturbation atlas, using a cell-level protocol centered on biologically meaningful metrics. Relative to STATE, it improves perturbation-discrimination correlation (PDCorr) by 12.02% and differential-expression overlap by 10.66%, and the preprint also reports speedups of 12.51x in pretraining and 1.29x in inference. The parameter count is not disclosed in the preprint.
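As an illustration of what a differential-expression overlap score can measure, the sketch below ranks genes by absolute mean shift from control and compares the top-k sets from observed and predicted perturbed cells. This is one common simplification, assumed here for illustration. The preprint's exact definition (statistical test, thresholds, choice of k) is not given in this summary, and the helper names `top_k_de_genes` and `de_overlap` are hypothetical.

```python
import numpy as np

def top_k_de_genes(control, perturbed, k):
    """Indices of the k genes with the largest absolute mean shift."""
    effect = np.abs(perturbed.mean(axis=0) - control.mean(axis=0))
    return set(np.argsort(effect)[::-1][:k].tolist())

def de_overlap(control, true_perturbed, pred_perturbed, k):
    """Fraction of the true top-k DE genes recovered by the prediction."""
    true_set = top_k_de_genes(control, true_perturbed, k)
    pred_set = top_k_de_genes(control, pred_perturbed, k)
    return len(true_set & pred_set) / k

# Toy example: 2 cells x 4 genes, control at zero expression.
control = np.zeros((2, 4))
true_perturbed = np.array([[3.0, 0.0, -2.0, 0.1],
                           [3.0, 0.0, -2.0, 0.1]])
pred_perturbed = np.array([[2.5, 0.2, 0.1, -1.0],
                           [2.5, 0.2, 0.1, -1.0]])

# True top-2 genes are {0, 2}; predicted top-2 genes are {0, 3}.
score = de_overlap(control, true_perturbed, pred_perturbed, k=2)
```

Here the prediction recovers one of the two strongest true responders, so the score is 0.5. A metric of this kind rewards getting the right genes to move, which is the sort of biological accuracy the evaluation protocol is described as prioritizing over reconstruction fidelity.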
SCALE is aimed at computational and experimental biologists who use virtual cell models to prioritize perturbations before committing to laboratory screens. By predicting population-level responses to genetic, chemical, or cytokine perturbations, it can support target discovery, drug-response forecasting, and hypothesis generation in single-cell pharmacology, while its efficient training pipeline makes atlas-scale modeling more accessible to groups with constrained compute.
SCALE contributes to the rapidly growing class of perturbation-trained virtual cell models by coupling a transport-based formulation with a production-grade systems backbone, demonstrating measurable gains on the Tahoe-100M benchmark over STATE. Its emphasis on biologically meaningful evaluation and large training-throughput speedups highlights a broader shift toward models that are both accurate and practical at atlas scale. As a recent preprint without released code or weights, its downstream adoption remains to be established.