Theis Lab
Compositional Perturbation Autoencoder that predicts single-cell transcriptional responses to unseen drug combinations and doses using disentangled latent representations.
Understanding how individual cells respond to drug treatments is one of the most pressing challenges in precision medicine. Measuring every possible drug combination and dose across diverse cell types is combinatorially intractable: the number of conditions grows multiplicatively with the number of compounds, doses, and cell types, so even a screen of a few hundred compounds at multiple doses yields far more pairwise conditions than any laboratory can test exhaustively. CPA (Compositional Perturbation Autoencoder) addresses this problem by learning to predict single-cell transcriptional responses to drug combinations and doses that were never experimentally observed.
CPA was developed by Mohammad Lotfollahi, Anna Klimovskaia, Carlo De Donno, and colleagues at Helmholtz Center Munich (Theis Lab) in collaboration with Facebook AI Research. The preprint appeared on bioRxiv in April 2021, and a substantially expanded version was published in Molecular Systems Biology in 2023. The key insight behind CPA is architectural: the model separates a cell's gene expression state into compositional components, namely a basal state representing the unperturbed cell plus additive perturbation embeddings and covariate embeddings. It can therefore synthesize predictions for new combinations by summing the relevant embeddings at inference time.
This compositional design gives CPA an important advantage over black-box deep learning models. Because each perturbation occupies a learnable vector in a shared embedding space, the relationship between drugs can be interrogated directly by examining the geometry of that space. Drug similarity analysis and dose-response curve extrapolation both follow from the trained model without special-purpose machinery. CPA is implemented within the scvi-tools ecosystem and integrates with standard AnnData-based single-cell analysis workflows, which makes it accessible to researchers already working with Scanpy.
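One simple way to examine the geometry of the embedding space is pairwise cosine similarity between perturbation vectors. The sketch below is illustrative only: the drug names and the embedding matrix are made-up placeholders, not values or accessors from the CPA codebase, and in practice the matrix would be read from a trained model.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of an (n_drugs, dim) matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Hypothetical stand-ins for learned perturbation embeddings.
drug_names = ["drug_a", "drug_b", "drug_c"]
pert_embeddings = np.array([
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.4],
    [-0.2, 1.0, 0.0],
])

similarity = cosine_similarity_matrix(pert_embeddings)

def nearest_neighbor(i: int) -> str:
    """Name of the drug whose embedding is most similar to drug i."""
    row = similarity[i].copy()
    row[i] = -np.inf  # exclude the drug itself
    return drug_names[int(np.argmax(row))]

print(nearest_neighbor(0))
```

Drugs whose embeddings point in similar directions are predicted to shift expression in similar ways, so a nearest-neighbor lookup of this kind is a plausible starting point for drug similarity analysis.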
CPA is an autoencoder with two key modifications. First, the encoder is trained adversarially: after it maps gene expression to a latent vector, discriminator networks receive that vector and attempt to predict the perturbation identity and covariate labels. The encoder is trained to minimize the reconstruction loss while fooling the discriminators, which drives perturbation and covariate signal out of the basal latent representation. Second, perturbation and covariate embeddings are learned as separate parameter matrices. At inference time, the predicted expression for a given condition is obtained by decoding the sum of the basal latent vector, the perturbation embedding(s), and the covariate embedding.
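The composition step described above can be sketched with a toy model. Everything here is an assumption made for illustration: the encoder and decoder are single random linear maps (the real networks are nonlinear and trained), the dimensions are arbitrary, and names such as `encode_basal` and `predict` are hypothetical rather than part of any CPA API. The adversarial training itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, latent_dim = 20, 4
n_perts, n_covs = 3, 2

# Randomly initialised stand-ins for trained parameters (hypothetical).
W_enc = rng.normal(size=(n_genes, latent_dim))   # toy linear encoder
W_dec = rng.normal(size=(latent_dim, n_genes))   # toy linear decoder
pert_emb = rng.normal(size=(n_perts, latent_dim))
cov_emb = rng.normal(size=(n_covs, latent_dim))

def encode_basal(x: np.ndarray) -> np.ndarray:
    """Map expression (n_cells, n_genes) to a basal latent (n_cells, latent_dim)."""
    return x @ W_enc

def predict(x: np.ndarray, pert_ids: list, cov_id: int) -> np.ndarray:
    """Decode basal latent + summed perturbation embeddings + covariate embedding."""
    z = encode_basal(x) + pert_emb[pert_ids].sum(axis=0) + cov_emb[cov_id]
    return z @ W_dec

x = rng.normal(size=(5, n_genes))        # five toy cells
single = predict(x, [0], cov_id=1)       # one perturbation
combo = predict(x, [0, 2], cov_id=1)     # a combination formed by summing embeddings
```

Because this toy decoder is linear, adding a second perturbation shifts every cell's prediction by the same vector. With a nonlinear decoder, as in a real model, the same latent-space sum can produce non-additive effects in expression space.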
For dose modeling, each perturbation embedding is scaled by a small neural network that takes the scalar dose value as input, producing a dose-dependent perturbation vector. This allows continuous interpolation and extrapolation across dose values rather than requiring discretized dose bins. The model was validated on the sci-Plex dataset (gene expression from ~50,000 cells treated with 188 compounds at 4 doses) and on the Norman et al. 2019 combinatorial genetic perturbation dataset (105,462 cells across 284 genetic conditions). On the genetic perturbation task, CPA generated predictions for 97.6% of all possible pairwise genetic combinations (5,329 conditions) from training on a subset of single perturbations. The published MSB version introduced additional benchmarks demonstrating accurate cross-species perturbation transfer and integration with CRISPR-based perturbation atlases.
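The dose scaling described above can be illustrated with a minimal sketch. The network below is an assumption, not the published architecture: a one-hidden-layer ReLU network with random weights and zero biases stands in for the trained dose scaler, and `dose_scaler` and `dosed_embedding` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden = 4, 8

pert_embedding = rng.normal(size=latent_dim)  # stand-in for one learned drug embedding

# Hypothetical weights of a tiny dose network (random here, trained in practice).
W1, b1 = rng.normal(size=(1, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)

def dose_scaler(dose) -> np.ndarray:
    """Map scalar dose(s) to scalar scaling factor(s) with a one-hidden-layer MLP."""
    d = np.atleast_1d(np.asarray(dose, dtype=float))[:, None]  # (n, 1)
    h = np.maximum(d @ W1 + b1, 0.0)                           # ReLU hidden layer
    return (h @ W2 + b2)[:, 0]                                 # (n,)

def dosed_embedding(dose) -> np.ndarray:
    """Dose-dependent perturbation vectors, shape (n_doses, latent_dim)."""
    return dose_scaler(dose)[:, None] * pert_embedding

# Any real-valued dose can be queried, including values between or beyond
# those seen in training.
doses = np.array([0.0, 0.1, 0.5, 1.0, 2.0])
vectors = dosed_embedding(doses)
```

Since the scaler is a continuous function of the dose, no binning is needed. In this sketch the zero biases make a dose of zero yield a zero vector, which is a property of the toy weights rather than a guarantee of the trained model.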
CPA is primarily used for two broad purposes: computational drug screening and experimental design optimization. In drug screening, CPA can pre-screen predicted transcriptional responses for thousands of drug combinations before any laboratory validation, narrowing the experimental space to the most promising candidates. The dose-response modeling capability is particularly valuable for identifying efficacious dose ranges and for predicting synergistic or antagonistic interactions between drug pairs. In CRISPR functional genomics studies, CPA can impute the expected transcriptional output of genetic combinations not covered in a perturbation screen, so researchers can identify genetic interactions of interest without exhaustive combinatorial experimentation. The model is also used for data augmentation, generating synthetic single-cell perturbation data to train downstream classifiers and causal models where experimental data is sparse.
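A first step in this kind of imputation workflow is listing which pairwise combinations a screen did not measure. The helper below is a hypothetical illustration that uses only the Python standard library; the gene names are placeholders, and each returned pair would then be passed to a trained model for prediction.

```python
from itertools import combinations

def unobserved_pairs(perturbations, observed):
    """Return all unordered pairs of perturbations that are absent from `observed`."""
    observed_set = {frozenset(pair) for pair in observed}  # ignore order within a pair
    return [
        pair
        for pair in combinations(sorted(perturbations), 2)
        if frozenset(pair) not in observed_set
    ]

# Placeholder screen: four perturbations, two measured combinations.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
measured = [("gene_a", "gene_b"), ("gene_c", "gene_a")]

candidates = unobserved_pairs(genes, measured)
print(candidates)
```

The number of candidate pairs grows quadratically with the number of perturbations, which is why ranking model predictions is attractive before committing to experiments.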
CPA established a new paradigm for perturbation modeling in single-cell genomics by demonstrating that compositional latent space arithmetic can generalize to novel combinations rather than merely interpolating within a training distribution. The paper has attracted substantial attention in the computational biology community and has been followed by a family of related models, including MultiCPA, which extends the framework to multimodal readouts such as imaging and proteomics alongside transcriptomics. The open-source scvi-tools integration has made CPA broadly accessible, and the model has been adopted in pharmaceutical industry workflows for computational drug combination screening. CPA also influenced the design of subsequent perturbation models, including scGen extensions and the graph-based GEARS approach, and it remains one of the central benchmarks against which new perturbation prediction methods are evaluated.