Theis Lab
Variational autoencoder that predicts single-cell perturbation responses across cell types and species using latent space vector arithmetic.
Predicting how cells will respond to an experimental perturbation (a drug, an infection, a genetic manipulation) requires understanding what makes the responding cell type different from the reference and how that perturbation shifts gene expression in contexts that have already been measured. scGen (single-cell generative model) addresses this problem by exploiting a property of well-trained variational autoencoders: the arithmetic structure of their latent spaces. Rather than modeling the perturbation explicitly, scGen encodes unperturbed and perturbed cells into a shared latent space, computes a delta vector representing the perturbation effect, adds that delta to the latent encoding of unperturbed query cells, and decodes the result to obtain the predicted perturbed expression.
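The latent arithmetic described above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the original publication: f is the encoder, g the decoder, C and P are the sets of control and perturbed training cells, and x_q is an unperturbed query cell.

```latex
\delta = \frac{1}{|P|}\sum_{i \in P} f(x_i) \;-\; \frac{1}{|C|}\sum_{j \in C} f(x_j),
\qquad
\hat{x}_q = g\bigl(f(x_q) + \delta\bigr)
```

The perturbation itself is never parameterized. It enters only through the difference of the two latent means.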
scGen was developed by Mohammad Lotfollahi, F. Alexander Wolf, and Fabian J. Theis at Helmholtz Center Munich and published in Nature Methods in July 2019. It was among the first demonstrations that generative modeling with latent space arithmetic could transfer biological perturbation effects across cell types and even species, enabling computational prediction of experiments that are expensive or impossible to perform directly. The approach drew explicit inspiration from the vector arithmetic used in word embedding models like Word2Vec, where semantic relationships can be captured as vector offsets — here applied to the biological problem of predicting cellular state transitions.
The simplicity of the conceptual framework belies genuine power: scGen accurately predicted the transcriptional response of CD4+ T cells to interferon-beta stimulation from training data that included interferon-stimulated CD14+ monocytes but not interferon-stimulated T cells, and it correctly predicted human cell responses to LPS stimulation after training only on mouse LPS response data. These cross-cell-type and cross-species predictions demonstrated that scGen was learning something generalizable about perturbation biology rather than memorizing cell-type-specific patterns.
scGen uses a standard variational autoencoder architecture with a multi-layer perceptron encoder and decoder. The encoder maps log-normalized gene expression values to a Gaussian distribution over a low-dimensional latent space, typically 100 dimensions. The decoder maps sampled latent codes back to gene expression values. Perturbation prediction proceeds in three steps. First, a population of control cells (the reference condition) and perturbed cells of the same type are encoded to obtain their latent representations. Second, a delta vector is computed as the mean latent representation of the perturbed cells minus the mean of the control cells, capturing the direction and magnitude of the perturbation effect in latent space. Third, to predict the response of a target cell type, the latent coordinates of that cell type's control cells are shifted by the delta vector, and the result is decoded to yield predicted expression values.
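The three prediction steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the scGen implementation: the function names are hypothetical, and `toy_encode` / `toy_decode` are linear stand-ins for a trained encoder and decoder (a real model would use the mean of the encoder's Gaussian as the latent code).

```python
from statistics import fmean
from typing import Callable, List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*rows)]


def predict_perturbed(
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector], Vector],
    ctrl_source: List[Vector],
    pert_source: List[Vector],
    ctrl_target: List[Vector],
) -> List[Vector]:
    """Predict perturbed expression for target cells via latent arithmetic."""
    # Step 1: encode control and perturbed cells of the measured cell type.
    z_ctrl = [encode(x) for x in ctrl_source]
    z_pert = [encode(x) for x in pert_source]
    # Step 2: delta = mean(perturbed latents) - mean(control latents).
    delta = [p - c for p, c in zip(mean_vector(z_pert), mean_vector(z_ctrl))]
    # Step 3: shift each target control cell by delta, then decode.
    shifted = [[z + d for z, d in zip(encode(x), delta)] for x in ctrl_target]
    return [decode(z) for z in shifted]


# Hypothetical stand-ins for a trained model: scale by 2 and back.
def toy_encode(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def toy_decode(z: Vector) -> Vector:
    return [0.5 * v for v in z]


ctrl_source = [[1.0, 0.0], [3.0, 0.0]]   # control cells, measured cell type
pert_source = [[2.0, 4.0], [4.0, 4.0]]   # perturbed cells, measured cell type
ctrl_target = [[10.0, 1.0]]              # control cells, target cell type

predicted = predict_perturbed(
    toy_encode, toy_decode, ctrl_source, pert_source, ctrl_target
)
print(predicted)  # [[11.0, 5.0]]
```

With a linear encoder and decoder the result reduces to adding the mean expression difference to the target cell. The point of a trained nonlinear autoencoder is that the same latent shift can decode to a cell-type-specific response.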
The method was validated on several biological datasets. The Kang et al. 2018 dataset of peripheral blood mononuclear cells (PBMCs) stimulated with interferon-beta provided a cross-cell-type generalization benchmark: scGen was trained on all perturbed cell types except CD4+ T cells and predicted the T cell response with high accuracy, achieving a mean squared error substantially lower than baselines including raw expression averaging and PCA-based methods. The Haber et al. 2017 intestinal organoid dataset was used for dose-response analysis. For cross-species transfer, the Zheng et al. 2016 mouse LPS dataset was used for training and the human LPS response for evaluation, with predictions compared against measured differentially expressed genes (DEGs) in the Saliba et al. 2014 dataset. The model's batch correction performance was validated on an unpaired human and mouse pancreatic islet integration task.
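A held-out comparison of this kind can be scored as follows. The sketch assumes the metric is the mean squared error between the per-gene mean of predicted cells and the per-gene mean of the measured held-out cells. The exact evaluation protocol is not specified here, and all numbers are made-up toy values, not results from the paper.

```python
from statistics import fmean
from typing import List

Vector = List[float]


def mean_profile(cells: List[Vector]) -> Vector:
    """Per-gene mean expression across a population of cells."""
    return [fmean(gene_values) for gene_values in zip(*cells)]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length expression profiles."""
    return fmean([(x - y) ** 2 for x, y in zip(a, b)])


# Toy data (hypothetical): rows are cells, columns are genes.
observed = [[2.0, 4.0, 0.0], [4.0, 4.0, 2.0]]        # measured held-out perturbed cells
model_pred = [[3.0, 4.0, 0.0], [3.0, 5.0, 1.0]]      # predictions from the model
baseline_pred = [[1.0, 1.0, 0.0], [1.0, 3.0, 0.0]]   # predictions from a simple baseline

observed_mean = mean_profile(observed)
model_mse = mse(mean_profile(model_pred), observed_mean)
baseline_mse = mse(mean_profile(baseline_pred), observed_mean)

print(f"model MSE:    {model_mse:.4f}")
print(f"baseline MSE: {baseline_mse:.4f}")
```

Scoring mean profiles rather than individual cells is a deliberate simplification: predicted and measured cells are not paired, so a per-cell error is not defined without an additional matching step.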
scGen's primary application is the in silico prediction of perturbation experiments that are expensive or ethically constrained. In drug discovery, the model enables researchers to predict how a new cell type or patient-derived cell population will respond to a compound based on existing perturbation data from related cell types. In infectious disease research, where patient-specific single-cell data during active infection may be scarce, scGen can extrapolate infection responses from in vitro or murine data to predict human in vivo cell behavior. The batch correction functionality makes scGen useful for integrating datasets generated across different laboratories or sequencing platforms. The model has also been applied in developmental biology to predict how progenitor cell types will respond to differentiation signals based on measurements in related cell states.
scGen was one of the earliest demonstrations that the pretraining-then-prediction paradigm from natural language processing could be adapted to single-cell biology with quantitative accuracy and genuine cross-context generalization. Its publication in Nature Methods in 2019 preceded the current wave of large single-cell foundation models and established the conceptual template — encode cells into a shared latent space, perform arithmetic, decode — that many subsequent methods have built upon. The approach directly inspired later perturbation models including CPA, which extends scGen's linear latent arithmetic to handle drug combinations, doses, and covariates simultaneously in a more structured framework. The model's demonstration of cross-species transfer, in particular, opened research directions in leveraging murine experimental systems to inform predictions about human disease states. The codebase has been maintained and modernized within the scvi-tools ecosystem, ensuring that scGen remains accessible to researchers years after its initial publication.