bio.rodeo
© 2026 bio.rodeo. All rights reserved.

IMPA

Theis Lab

Generative image perturbation autoencoder that predicts cellular morphological responses to chemical and genetic perturbations using untreated cell images as input.

Released: 2024

Overview

High-content imaging is among the richest phenotypic readouts available in drug discovery: a single high-resolution microscopy image of a cell captures dozens of morphological features simultaneously, encoding information about nuclear shape, cytoskeletal organization, organelle distribution, and cell size that complements what transcriptomics can measure. Exhaustively screening every compound of interest by imaging is infeasible, however, and many candidate drugs show no activity in initial screens, leaving little experimental signal from which to identify promising directions computationally. IMPA (IMage Perturbation Autoencoder) addresses this by learning to predict what a cell would look like if treated with a given perturbation, using only an image of the untreated control cell as input.

IMPA was developed by Alessandro Palma, Fabian J. Theis, and Mohammad Lotfollahi at Helmholtz Center Munich. The preprint appeared on bioRxiv in July 2023 and the work was published in Nature Communications in January 2024. The model extends the perturbation prediction paradigm — previously applied to transcriptomics by scGen and CPA from the same lab — to the image domain, enabling in silico morphological profiling of compounds before experimental measurement.

IMPA operates as a generative style-transfer model: it separates the content of a cell image (cell-intrinsic structure, independent of treatment) from the style (perturbation-specific and batch-specific appearance), then recombines the content of an unseen control cell with the style of a target perturbation to synthesize the predicted treated cell image. This content-style disentanglement is learned adversarially, producing a model that can transfer perturbation morphology across cells and predict what untested perturbations would look like by learning their style embeddings. Crucially, the style and content components are learned in a way that also accounts for batch effects — a pervasive challenge in high-content imaging — allowing IMPA to model perturbation effects independently of technical variation across imaging runs.

Key Features

  • Counterfactual image generation: IMPA generates predicted images of cells under perturbation conditions they were never experimentally exposed to, using the untreated image as a content scaffold and the learned perturbation style as the modification.
  • Content-style disentanglement: The model explicitly separates perturbation-independent cellular structure (content) from perturbation-specific appearance (style) through adversarial training, enabling clean transfer of perturbation effects across individual cells.
  • Batch effect modeling: Style representations encode both perturbation identity and batch identity as separable components, allowing IMPA to remove technical batch-to-batch variation from generated images while preserving biologically meaningful perturbation effects.
  • Unseen perturbation prediction: For compounds structurally related to training set members, IMPA demonstrated improved performance over state-of-the-art generative baselines in predicting morphological responses for held-out perturbations.
  • Dual modality validation: IMPA was validated on both chemical perturbation datasets (breast cancer cells treated with small molecules) and genetic perturbation datasets (U2OS osteosarcoma cells from the RxRx1 dataset with hundreds of genetic conditions), demonstrating generality across perturbation types.
  • Population-level summary statistics: Beyond individual cell image generation, IMPA accurately predicts population-level morphological statistics — capturing how the distribution of cell morphologies shifts under treatment — which are the quantities most relevant for drug discovery applications.
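As a toy numeric illustration of the batch-aware style composition described in the features above — where the style code factors into separable perturbation and batch components — consider the following sketch. All names, dimensions, and the simple additive composition are illustrative assumptions, not IMPA's published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
STYLE_DIM = 8

# Stand-ins for learned look-up tables: one embedding per perturbation, one per batch.
perturbation_table = {p: rng.normal(size=STYLE_DIM) for p in ["DMSO", "drug_A", "drug_B"]}
batch_table = {b: rng.normal(size=STYLE_DIM) for b in ["plate_1", "plate_2"]}

def style_code(perturbation, batch):
    """Compose a style code from separable perturbation and batch components."""
    return perturbation_table[perturbation] + batch_table[batch]

# Batch correction: swap the batch component for a reference batch while
# keeping the perturbation component untouched.
observed = style_code("drug_A", "plate_2")
corrected = observed - batch_table["plate_2"] + batch_table["plate_1"]
assert np.allclose(corrected, style_code("drug_A", "plate_1"))
```

Because the two components are separable, technical variation can be removed from a style code without disturbing the perturbation signal it carries.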

Technical Details

IMPA is a generative adversarial network (GAN) based on an autoencoder architecture. The encoder maps an input cell image to two latent components: a content code representing perturbation-independent cellular structure and a style code representing the appearance associated with a specific perturbation and batch condition. A decoder reconstructs images from any combination of content and style codes. The adversarial training procedure uses discriminators to enforce that the content code is invariant to perturbation and batch identity, while the style code captures these sources of variation. Perturbation style embeddings are learned as look-up table vectors associated with each experimental condition.
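The encode/recombine/decode flow described above can be sketched with stand-in linear maps. Everything here — dimensions, weights, and function names — is a hypothetical toy standing in for the trained GAN components, not IMPA's actual networks:

```python
import numpy as np

rng = np.random.default_rng(1)
IMG_DIM, CONTENT_DIM, STYLE_DIM = 64, 16, 8

# Toy linear "networks" standing in for the trained encoder and decoder.
W_content = rng.normal(size=(CONTENT_DIM, IMG_DIM)) / np.sqrt(IMG_DIM)
W_decode = rng.normal(size=(IMG_DIM, CONTENT_DIM + STYLE_DIM)) / np.sqrt(CONTENT_DIM + STYLE_DIM)

def encode_content(image):
    """Map a flattened cell image to its perturbation-independent content code."""
    return W_content @ image

def decode(content, style):
    """Synthesize an image from any combination of content and style codes."""
    return W_decode @ np.concatenate([content, style])

control_image = rng.normal(size=IMG_DIM)        # untreated cell (flattened)
treated_style = rng.normal(size=STYLE_DIM)      # would come from the learned style table

# Recombine: control content + target perturbation style -> predicted treated image.
predicted = decode(encode_content(control_image), treated_style)
assert predicted.shape == (IMG_DIM,)
```

In the real model the encoder and decoder are deep convolutional networks and the invariance of the content code is enforced adversarially rather than by construction.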

At inference time, the content code of an untreated control cell is extracted by the encoder, and the style code of the target perturbation is retrieved from the learned embedding table. The decoder then synthesizes a predicted image from this content-style combination, representing the counterfactual treated cell. For unseen perturbations, style embeddings can be estimated from related compounds by averaging or interpolation in embedding space, enabling predictions even for compounds not present in training.
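A minimal sketch of the inference-time style lookup and the averaging heuristic for unseen compounds follows. The compound names and the use of a plain mean over related embeddings are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
STYLE_DIM = 8

# Stand-in for the learned style embedding table over training-set compounds
# (compound names are hypothetical examples).
style_table = {c: rng.normal(size=STYLE_DIM) for c in ["paclitaxel", "docetaxel", "nocodazole"]}

def style_for_unseen(related_compounds):
    """Estimate a style code for a compound absent from training by averaging
    the embeddings of structurally related training-set compounds."""
    return np.mean([style_table[c] for c in related_compounds], axis=0)

# Seen perturbation: direct table lookup.
seen_style = style_table["nocodazole"]

# Unseen compound: average the embeddings of its closest structural relatives.
unseen_style = style_for_unseen(["paclitaxel", "docetaxel"])
assert unseen_style.shape == (STYLE_DIM,)
```

The estimated style code would then be fed to the decoder together with a control cell's content code, exactly as for a seen perturbation.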

IMPA was trained and evaluated on two primary datasets. For chemical perturbations, the BBBC021 benchmark provided breast cancer cells treated with a library of small molecules imaged under standardized conditions. For genetic perturbations, U2OS osteosarcoma cells from the RxRx1 dataset (Recursion Pharmaceuticals) with hundreds of siRNA knockdown conditions were used. Evaluation used standard image quality metrics alongside biologically meaningful morphological profile distance metrics. IMPA outperformed existing generative baselines (including CycleGAN and DISCERN) on the unseen drug prediction task when compounds were structurally similar to training set members, with the expected caveat that prediction accuracy decreases for chemically distant compounds.
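One simple form of the morphological profile distance used in such evaluations is the distance between the mean feature vectors of a real and a generated cell population. The sketch below illustrates that idea on synthetic data; the specific metric and feature set are assumptions, not the paper's exact protocol:

```python
import numpy as np

def profile_distance(real_features, generated_features):
    """Euclidean distance between the mean morphological profiles of two
    cell populations (rows = cells, columns = morphological features)."""
    return float(np.linalg.norm(real_features.mean(axis=0) - generated_features.mean(axis=0)))

rng = np.random.default_rng(3)
real = rng.normal(loc=1.0, size=(200, 10))   # real treated cells
good = rng.normal(loc=1.0, size=(200, 10))   # generated cells matching the treated distribution
bad = rng.normal(loc=0.0, size=(200, 10))    # generated cells missing the treatment shift

# A faithful generation should sit closer to the real population in profile space.
assert profile_distance(real, good) < profile_distance(real, bad)
```

Population-level comparisons like this are what matter in practice, since drug discovery decisions rest on how the distribution of morphologies shifts, not on any single synthesized cell.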

Applications

IMPA is primarily targeted at pharmaceutical researchers working in phenotypic drug discovery, where the goal is to identify compounds that produce desired morphological changes in cells without prior knowledge of the molecular target. By predicting morphological responses for a library of compounds before committing to experimental screening, IMPA can triage large compound libraries and prioritize candidates for wet-lab validation, reducing the number of compounds that need to be physically tested. The batch effect correction capability also makes IMPA useful for harmonizing imaging datasets collected across different experimental campaigns or imaging systems, enabling large-scale meta-analyses of morphological phenotypes. In CRISPR functional genomics, IMPA can be used to predict the morphological consequences of genetic perturbations — complementing transcriptomics-based perturbation predictions to provide a more complete phenotypic picture of gene function.
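The triage workflow described above amounts to ranking candidate compounds by how close their predicted morphological profiles sit to a desired target phenotype. A hypothetical sketch of that prioritization step (compound names, profiles, and the distance-based ranking are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
N_FEATURES = 10

# Predicted mean morphological profile per candidate compound, as a model
# like IMPA could produce before any wet-lab screening (hypothetical data).
predicted_profiles = {f"compound_{i}": rng.normal(size=N_FEATURES) for i in range(50)}
target_phenotype = rng.normal(size=N_FEATURES)  # desired morphological change

def triage(profiles, target, top_k=5):
    """Rank compounds by closeness of their predicted profile to the target
    phenotype and return the top_k candidates for experimental validation."""
    ranked = sorted(profiles, key=lambda c: np.linalg.norm(profiles[c] - target))
    return ranked[:top_k]

shortlist = triage(predicted_profiles, target_phenotype)
assert len(shortlist) == 5
```

Only the shortlisted compounds would then be physically screened, which is where the reduction in experimental cost comes from.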

Impact

IMPA extends the single-cell perturbation prediction paradigm, pioneered in transcriptomics by scGen and CPA, to the image domain — demonstrating that the same conceptual approach of learning disentangled perturbation representations can transfer across biological measurement modalities. Published in Nature Communications, the work contributes to a growing body of evidence that generative modeling can produce useful predictions of cellular phenotype, not just gene expression. A key practical limitation acknowledged by the authors is that prediction quality degrades for compounds that are structurally dissimilar to training set members, which is expected for a style-transfer approach and motivates future work on more biologically grounded perturbation representations. IMPA is part of a broader research agenda at the Theis Lab exploring multimodal perturbation modeling, and its integration with the MultiCPA framework is a natural direction for bridging image and transcriptome perturbation predictions within a unified model.

Sources:

  • GitHub - theislab/IMPA
  • Predicting cell morphological responses to perturbations using generative modeling | Nature Communications
  • Predicting cell morphological responses to perturbations using generative modeling | bioRxiv

Tags

perturbation prediction, drug discovery, morphological profiling, autoencoder, generative adversarial network, generative, self-supervised, cell biology

Resources

  • GitHub Repository
  • Research Paper