bio.rodeo


Single-cell foundation models

SCALE

Shanghai AI Laboratory

Virtual cell foundation model pairing LLaMA-based cellular encoding with set-aware conditional flow matching to predict single-cell perturbation responses at atlas scale.

Released: March 2026

SCALE (Scalable Conditional Atlas-Level Endpoint transport) is a virtual cell foundation model for predicting how single cells respond to genetic, chemical, and cytokine perturbations directly from single-cell measurements. Released as a March 2026 bioRxiv preprint by researchers at the Shanghai Artificial Intelligence Laboratory and collaborators, it targets in silico experimentation: simulating perturbation outcomes that would otherwise require costly wet-lab screens.

The model addresses three persistent obstacles in virtual cell modeling: inefficient training and inference pipelines, unstable behavior when modeling sparse high-dimensional single-cell space, and evaluation protocols that reward reconstruction fidelity over biological accuracy. SCALE reframes perturbation prediction as an endpoint-oriented optimal-transport problem, jointly learning set-level cell-population representations and perturbation-conditioned state transitions rather than modeling individual cells in isolation.
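For orientation, a generic conditional flow-matching objective with a straight-line path between endpoints can be written as below. This is the textbook form, offered as an assumption about the general family of objectives; the preprint's exact loss, endpoint coupling, and conditioning may differ. The symbols $x_0$ (control cell state), $x_1$ (perturbed cell state), $c$ (perturbation condition), and $v_\theta$ (learned velocity field) are notation introduced for this sketch only.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1], \\
\mathcal{L}_{\mathrm{CFM}}(\theta) &= \mathbb{E}_{t,\,(x_0, x_1),\,c}\,
  \bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2 .
\end{aligned}
```

Under this reading, prediction amounts to integrating $\dot{x} = v_\theta(x, t, c)$ from $t = 0$ to $t = 1$, starting at a control state and ending at the predicted perturbed state.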

By combining a LLaMA-style set encoder with conditional flow matching and a BioNeMo-based systems backbone, SCALE positions itself among recent large perturbation models such as STATE and Tahoe-trained foundation models, emphasizing both predictive accuracy and the engineering efficiency needed to train on atlas-scale data.

#Key Features

  • Set-aware population modeling: Rather than predicting per-cell responses independently, SCALE learns set-level representations of cell populations, capturing perturbation-induced shifts in sparse, high-dimensional single-cell space.
  • Endpoint-oriented flow matching: A conditional flow-matching objective models perturbation as transport between control and perturbed cell-state endpoints, improving stability over reconstruction-centric approaches.
  • LLaMA-based cellular encoding: A LLaMA-style encoder provides the representational backbone for cellular state, adapting a proven language-model architecture to transcriptomic data.
  • Efficient systems backbone: A BioNeMo-based training and inference framework delivers a 12.51x pretraining speedup and 1.29x inference speedup over the prior state-of-the-art pipeline under matched system settings.
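The set-aware and transport ideas in the list above can be illustrated with a toy sketch. It is not SCALE's implementation: the mean-pooling `set_mean` is a hypothetical stand-in for the learned LLaMA-style set encoder, the velocity field is an oracle constant rather than a trained network, and all function names and expression values are made up for illustration.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight-line path from control state x0 to perturbed state x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity a flow-matching model would regress onto for a straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def set_mean(cells):
    """Permutation-invariant population summary (stand-in for a learned set encoder)."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]


def euler_transport(x, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy populations: 3 cells x 2 genes (illustrative values, not real data).
control = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
perturbed = [[2.0, 0.5], [3.0, 1.5], [4.0, 2.5]]

# Population-level shift between the control and perturbed endpoints.
shift = target_velocity(set_mean(control), set_mean(perturbed))

# Oracle velocity field: every cell moves along the population-level shift.
predicted = [euler_transport(cell, lambda x, t: shift) for cell in control]

# A training-time sample on the path for one cell pair, at t = 0.5.
midpoint = interpolate(control[0], perturbed[0], 0.5)
```

In this toy case the perturbation is a uniform shift, so transporting each control cell along the population-level velocity recovers the perturbed population. A trained model would replace the oracle with a network conditioned on the perturbation and the set-level representation.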

#Technical Details

SCALE instantiates an end-to-end formulation that jointly learns set-level representations and perturbation-conditioned state transitions, pairing a LLaMA-style set encoder with a conditional flow-matching architecture for stable transport-based prediction. Training and inference run on a BioNeMo-based framework that improves data throughput, distributed scalability, and deployment efficiency. The model is evaluated on Tahoe-100M, a giga-scale single-cell perturbation atlas, using a cell-level protocol centered on biologically meaningful metrics. On this benchmark it improves perturbation-discrimination correlation (PDCorr) by 12.02% and differential-expression overlap by 10.66% over STATE, alongside the reported 12.51x pretraining and 1.29x inference speedups. The preprint does not disclose a parameter count.
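To make the differential-expression overlap metric concrete, here is a minimal sketch of one common way to compute such a score: the fraction of the true top-k differentially expressed genes that also appear in the predicted top-k. The definition, the function names, the gene labels, and the log-fold-change values are all assumptions for illustration; the preprint's exact protocol may differ. The `relative_improvement` helper shows the relative reading of a percentage gain over a baseline, which is also an assumption, since the summary above does not say whether the reported percentages are relative or absolute.

```python
def top_k_genes(log_fold_changes, k):
    """Return the k gene names with the largest absolute log fold change."""
    ranked = sorted(
        log_fold_changes,
        key=lambda gene: abs(log_fold_changes[gene]),
        reverse=True,
    )
    return set(ranked[:k])


def de_overlap(true_lfc, pred_lfc, k):
    """Fraction of the true top-k DE genes recovered in the predicted top-k."""
    true_top = top_k_genes(true_lfc, k)
    pred_top = top_k_genes(pred_lfc, k)
    return len(true_top & pred_top) / k


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`, relative to the baseline."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical log fold changes for one perturbation (made-up values).
true_lfc = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 0.2, "GENE_D": 0.1}
pred_lfc = {"GENE_A": 1.8, "GENE_B": -0.1, "GENE_C": 0.9, "GENE_D": 0.0}

# True top-2 is {GENE_A, GENE_B}; predicted top-2 is {GENE_A, GENE_C}.
overlap = de_overlap(true_lfc, pred_lfc, k=2)
```

A score like this rewards recovering the genes that actually change under perturbation, which is the kind of biologically grounded signal that reconstruction error alone can miss.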

#Applications

SCALE is aimed at computational and experimental biologists who use virtual cell models to prioritize perturbations before committing to laboratory screens. By predicting population-level responses to genetic, chemical, or cytokine perturbations, it can support target discovery, drug-response forecasting, and hypothesis generation in single-cell pharmacology, while its efficient training pipeline makes atlas-scale modeling more accessible to groups with constrained compute.

#Impact

SCALE contributes to the rapidly growing class of perturbation-trained virtual cell models by coupling a transport-based formulation with a production-grade systems backbone, demonstrating measurable gains on the Tahoe-100M benchmark over STATE. Its emphasis on biologically meaningful evaluation and large training-throughput speedups highlights a broader shift toward models that are both accurate and practical at atlas scale. As a recent preprint without released code or weights, its downstream adoption remains to be established.

Tags

perturbation_response_prediction, virtual_cell_modeling, transformer, flow_matching, foundation_model, generative, single_cell_transcriptomics