CPA

Single-cell perturbation prediction model that forecasts transcriptional responses to drug combinations and doses never experimentally measured.

Released: April 2021

Understanding how individual cells respond to drug treatments is a central challenge in precision medicine. Measuring every possible drug combination and dose across diverse cell types is combinatorially intractable: even a few hundred compounds at multiple doses yield far more combined conditions than any laboratory can test exhaustively. CPA (Compositional Perturbation Autoencoder) addresses this problem by learning to predict single-cell transcriptional responses to drug combinations and dosages that were never experimentally observed.

CPA was developed by Mohammad Lotfollahi, Anna Klimovskaia, Carlo De Donno, and colleagues at Helmholtz Center Munich (Theis Lab) in collaboration with Facebook AI Research. The preprint appeared on bioRxiv in April 2021, and a substantially expanded version was published in Molecular Systems Biology in 2023. The key insight behind CPA is architectural: the model separates a cell's gene expression state into compositional components (a basal state representing the unperturbed cell, plus additive perturbation embeddings and covariate embeddings), so it can synthesize predictions for new combinations by composing the relevant embeddings at inference time.

This compositional design gives CPA an interpretability advantage over black-box deep learning models. Because each perturbation occupies a learnable vector in a shared embedding space, relationships between drugs can be examined directly through the geometry of that space. Drug similarity analysis and dose-response curve extrapolation both follow from the trained model without special-purpose machinery. CPA is implemented within the scvi-tools ecosystem and works with standard AnnData objects, so it fits into existing Scanpy-based single-cell analysis workflows.

Key Features

Compositional latent arithmetic: The model decomposes each cell's expression into a basal state plus additive perturbation and covariate embeddings, allowing prediction of unseen drug combinations by summing the corresponding learned vectors.
Adversarial disentanglement: Discriminator networks are trained simultaneously with the autoencoder to remove perturbation and covariate signal from the basal latent state, ensuring that the basal representation captures only cell-intrinsic variation.
Continuous dose-response modeling: Learnable dose-scaling networks map scalar dose values to perturbation embedding magnitudes, enabling the model to generate predicted expression profiles at arbitrary unobserved doses without discretization.
Out-of-distribution generalization: CPA can predict cellular responses to drug combinations that were entirely absent from training data, including cross-species and cross-cell-type extrapolation as validated in the original publication.
Interpretable drug embeddings: Learned drug representation vectors enable drug similarity analysis through clustering and dimensionality reduction, revealing biologically meaningful groupings of compounds with related mechanisms of action.
Uncertainty quantification: The model provides estimates of prediction confidence alongside point predictions, helping researchers identify conditions where experimental validation is most needed.
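
The drug-embedding feature above can be illustrated with a minimal sketch. It assumes each drug has already been assigned a learned vector and ranks drugs by cosine similarity. The drug names and the three-dimensional vectors are hypothetical placeholders, not values from a trained CPA model, and the helper functions are illustrative rather than part of the CPA package.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_drug(query, embeddings):
    """Return the drug whose embedding is most similar to that of `query`."""
    candidates = [name for name in embeddings if name != query]
    return max(
        candidates,
        key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
    )

# Hypothetical embeddings; a trained model would supply the real vectors.
drug_embeddings = {
    "drug_A": [0.9, 0.1, 0.0],
    "drug_B": [0.8, 0.2, 0.1],
    "drug_C": [-0.1, 0.0, 1.0],
}

closest = nearest_drug("drug_A", drug_embeddings)
print(closest)
```

In practice the same comparison would typically be run over all drug pairs and followed by clustering or a low-dimensional projection.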

Technical Details

CPA is a variational autoencoder with two key modifications. First, an adversarial training procedure is applied during encoding: after the encoder maps gene expression to a latent vector, discriminator networks receive that vector and attempt to predict the perturbation identity and covariate labels. The encoder is trained to simultaneously minimize the reconstruction loss and fool the discriminators, driving perturbation and covariate signal out of the basal latent representation. Second, perturbation and covariate embeddings are learned as separate parameter matrices; at inference time, the predicted expression for a given condition is obtained by decoding the sum of the basal latent vector, the perturbation embedding(s), and the covariate embedding.
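
The two modifications described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the CPA implementation: the latent vectors are toy two-dimensional lists, the decoder is a plain linear map, and the function names and the `adversarial_weight` parameter are hypothetical. The encoder objective is written in the common form of reconstruction loss minus a weighted discriminator loss, which is one way to express "minimize reconstruction error while fooling the discriminators".

```python
def compose_latent(z_basal, perturbation_embeddings, covariate_embedding):
    """Sum the basal state, every perturbation embedding, and the covariate embedding."""
    vectors = [z_basal, *perturbation_embeddings, covariate_embedding]
    return [sum(components) for components in zip(*vectors)]

def linear_decoder(z, weights):
    """Toy decoder: one weight row per gene, applied to the latent vector."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]

def encoder_objective(reconstruction_loss, discriminator_loss, adversarial_weight):
    """The encoder is rewarded when the discriminators do badly on the basal state."""
    return reconstruction_loss - adversarial_weight * discriminator_loss

# Hypothetical values for one cell treated with two drugs.
z_basal = [1.0, 0.0]
embedding_drug_a = [0.5, 0.0]
embedding_drug_b = [0.0, -0.5]
embedding_cell_type = [0.0, 1.0]

z = compose_latent(z_basal, [embedding_drug_a, embedding_drug_b], embedding_cell_type)
predicted_expression = linear_decoder(z, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(z, predicted_expression)
```

Predicting an unseen combination only changes which embeddings are passed to `compose_latent`; the basal state and decoder are reused unchanged.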

For dose modeling, each perturbation embedding is scaled by a small neural network that takes the scalar dose value as input, producing a dose-dependent perturbation vector. This allows continuous interpolation and extrapolation across dose values rather than requiring discretized dose bins. The model was validated extensively on the sci-Plex dataset (gene expression from ~50,000 cells treated with 188 compounds at 4 doses) and the Norman et al. 2019 combinatorial genetic perturbation dataset (containing 105,462 cells across 284 genetic conditions). On the genetic perturbation task, CPA was able to generate predictions for 97.6% of all possible pairwise genetic combinations (5,329 conditions) from training on a subset of single perturbations. The published MSB version introduced additional benchmarks demonstrating accurate cross-species perturbation transfer and integration with CRISPR-based perturbation atlases.
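
The dose-scaling idea can be sketched as a tiny network that maps a scalar dose to a scalar multiplier for the perturbation embedding. The one-hidden-layer ReLU architecture, the fixed weights, and the function names below are assumptions chosen for illustration; in a trained model the weights would be learned per perturbation, and the actual scaling function used by CPA may differ.

```python
def dose_scaler(dose, hidden_weights, hidden_biases, output_weights):
    """One-hidden-layer ReLU network mapping a scalar dose to a scalar scale."""
    hidden = [max(0.0, w * dose + b) for w, b in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(output_weights, hidden))

def scaled_perturbation(embedding, dose, params):
    """Multiply a perturbation embedding by its dose-dependent scale."""
    scale = dose_scaler(dose, *params)
    return [scale * x for x in embedding]

# Hypothetical parameters: (hidden_weights, hidden_biases, output_weights).
params = ([1.0, 2.0], [0.0, -1.0], [1.0, 0.5])
embedding = [2.0, -1.0]

# Because the scale is a continuous function of dose, any dose can be queried,
# including values between or beyond those measured in the experiment.
for dose in (0.0, 0.25, 1.0):
    print(dose, scaled_perturbation(embedding, dose, params))
```

With these toy parameters a dose of zero yields a zero vector, so the composed latent state reduces to the basal and covariate terms.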

Applications

CPA is primarily used for two broad purposes: computational drug screening and experimental design optimization. In drug screening, CPA can pre-screen predicted transcriptional responses for thousands of drug combinations before any laboratory validation is committed, narrowing the experimental space to the most promising candidates. The dose-response modeling capability is particularly valuable for identifying efficacious dose ranges and predicting synergistic or antagonistic interactions between drug pairs. In CRISPR functional genomics studies, CPA can impute the expected transcriptional output of genetic combinations not covered in a perturbation screen, enabling researchers to identify genetic interactions of interest without exhaustive combinatorial experimentation. The model is also used for data augmentation: generating synthetic single-cell perturbation data to train downstream classifiers and causal models where experimental data is sparse.
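
As a small illustration of the imputation use case, the sketch below lists the pairwise combinations that a screen did not measure and that would therefore be candidates for model prediction. The perturbation names and the `unmeasured_pairs` helper are hypothetical and only show the bookkeeping step, not CPA's own interface.

```python
from itertools import combinations

def unmeasured_pairs(perturbations, measured_pairs):
    """Return every unordered pair of perturbations absent from the screen."""
    measured = {frozenset(pair) for pair in measured_pairs}
    return [
        pair
        for pair in combinations(sorted(perturbations), 2)
        if frozenset(pair) not in measured
    ]

# Hypothetical screen: four single perturbations, two measured combinations.
perturbations = ["A", "B", "C", "D"]
measured = [("B", "A"), ("C", "D")]

to_predict = unmeasured_pairs(perturbations, measured)
print(to_predict)
```

Using `frozenset` makes the check order-independent, so a measured pair recorded as ("B", "A") still excludes ("A", "B").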

Impact

CPA established a new paradigm for perturbation modeling in single-cell genomics by demonstrating that compositional latent space arithmetic can generalize to genuinely novel combinations rather than merely interpolating within a training distribution. The paper has attracted substantial attention in the computational biology community and has been followed by a family of related models, including MultiCPA, which extends the framework to multimodal readouts such as imaging and proteomics alongside transcriptomics. The open-source scvi-tools integration has made CPA broadly accessible, and the model has been adopted in pharmaceutical industry workflows for computational drug combination screening. CPA also directly influenced the design of subsequent perturbation models including scGEN extensions and the GEARS graph-based approach, and it remains one of the central benchmarks against which new perturbation prediction methods are evaluated.

Citation

Lotfollahi, M., et al. (2023) Predicting cellular responses to complex perturbations in high‐throughput screens. Molecular Systems Biology.

DOI: 10.15252/msb.202211517

Recent citations

Papers that recently cited this model.

J. Chen, Yunqi Hong, Alexandra Bermudez, et al. Synthesizing Mechanistic Hypotheses from Single-Cell Omics via Discretized Feature Attribution and Empirical Language Model Grounding. bioRxiv, Jul 2026. Cited by 0.
Simone Bianco. The projection basis determines the information ceiling for perturbation prediction. bioRxiv, Jul 2026. Cited by 0.
Youssef Marrakchi, Davide D'Ascenzo, S. Montesano. Score Distributions, Not Cells: Evaluating Single-Cell Perturbations Under Class Overlap. Jul 2026. Cited by 0.

Top citations

The most-cited papers that cite this model.

Haotian Cui, Chloe Wang, Hassaan Maan, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, Feb 2024. Cited by 1.1K.
Minsheng Hao, Jing Gong, Xin Zeng, et al. Large Scale Foundation Model on Single-cell Transcriptomics. bioRxiv, Jun 2023. Cited by 526.
Yusuf H. Roohani, Kexin Huang, J. Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nature Biotechnology, Aug 2023. Cited by 335.
Maria Bordukova, Nikita Makarov, Raul Rodriguez-Esteban, et al. Generative artificial intelligence empowers digital twins in drug discovery and clinical trials. Expert Opinion on Drug Discovery, Oct 2023. Cited by 155.
Charlotte Bunne, Yusuf H. Roohani, Yanay Rosen, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, Dec 2024. Cited by 150.

Citations

Total Citations: 321
Influential: 30
References: 64

GitHub

Stars: 149
Forks: 30
Open Issues: 23
Contributors: 5
Last Push: 1y ago
Language: Python
License: BSD-3-Clause

Fields of citing research

Share of papers citing this model.

Computer Science: 90%
Biology: 86%
Medicine: 59%
Mathematics: 5%
Chemistry: 5%
Engineering: 4%
Environmental Science: 2%
Physics: 1%

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Score: 78 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 58
Model Openness Framework: Unclassified (missing required components)

