A unified diffusion model enabling bidirectional transformation between protein amino acid sequences and fluorescence microscopy images for subcellular localization prediction.
Fluorescence microscopy is one of the most widely used techniques in cell biology, enabling researchers to visualize where specific proteins localize within living cells. Understanding subcellular localization — whether a protein resides in the nucleus, cytoplasm, mitochondria, endoplasmic reticulum, or other compartments — is fundamental to inferring its function, its role in signaling pathways, and its relevance to disease. Traditionally, determining where a protein localizes requires producing a tagged version of that protein, transfecting it into cells, imaging the result, and repeating this process for each protein of interest. This process is expensive and time-consuming, and it imposes a fundamental limitation: because fluorescence microscopy typically captures only one protein of interest at a time (with the remaining color channels occupied by morphological reference stains), it is difficult to study the spatial co-localization of multiple proteins simultaneously in the same cell.
CELL-Diff, developed by Donghui Zheng and Bo Huang at the University of California, San Francisco and presented as a preprint in October 2024, addresses this bottleneck directly. CELL-Diff is a unified diffusion model that learns the relationship between a protein's amino acid sequence and its corresponding fluorescence microscopy image, enabling bidirectional prediction: given a protein's sequence and reference morphology images of a cell, CELL-Diff generates a high-fidelity predicted fluorescence image showing where that protein localizes. Run in the reverse direction, the model takes only a fluorescence image and generates candidate protein sequences predicted to produce the observed localization pattern. This bidirectionality makes the model useful both for hypothesis generation (predicting where an uncharacterized protein localizes) and for inverse problems (inferring sequence properties from imaging observations). The model was accepted as a poster at ICML 2025 and is integrated into the CZI Virtual Cell Platform as part of the broader effort to build AI-driven virtual cell models.
CELL-Diff is notable for integrating two modalities — discrete protein sequences and continuous fluorescence microscopy images — within a single generative framework built on diffusion modeling. Earlier approaches to subcellular localization prediction typically used classification models that assigned a protein to one of a fixed set of compartments; CELL-Diff instead generates full images, preserving the continuous spatial distribution of protein localization and capturing complex, multi-compartment or heterogeneous patterns that categorical classifiers cannot represent.
Bidirectional sequence-image transformation: CELL-Diff supports both forward generation (sequence to image) and inverse generation (image to sequence), making it applicable to a range of experimental prediction and design tasks. The forward direction enables virtual imaging of uncharacterized proteins; the reverse enables sequence design for desired localization patterns.
Unified diffusion framework integrating discrete and continuous modalities: Protein sequences are discrete token sequences while fluorescence images are continuous pixel arrays — two modalities that require different mathematical treatments. CELL-Diff integrates a continuous diffusion model for image generation with a discrete model for sequence representation within a single unified architecture, enabling joint modeling without separate modality-specific pipelines (a toy sketch of the two corruption processes follows this feature list).
Transformer-based network backbone: The generative model is built on a transformer network that processes both sequence and image information, allowing the model to attend to relevant sequence features when generating localization patterns and vice versa. This attention-based integration enables the model to capture long-range dependencies between sequence motifs and spatial localization signals.
Morphology-conditioned image generation: Rather than generating localization images from sequence alone, CELL-Diff conditions generation on reference cell morphology images. This allows the model to generate protein localization patterns that are spatially consistent with the specific cell's shape, organelle positions, and structural context, producing predictions that reflect the actual cellular environment rather than a generic idealized cell.
Virtual co-localization of multiple proteins: Because CELL-Diff generates images conditioned on the same morphology reference, users can predict the localization of multiple proteins within the same virtual cell simultaneously. This overcomes the single-protein-per-experiment constraint of real fluorescence microscopy and enables computational studies of protein co-localization and spatial interactions that would require elaborate multi-round imaging protocols in the wet lab.
Training on Human Protein Atlas and OpenCell datasets: CELL-Diff was trained on two large-scale, publicly available fluorescence microscopy datasets with ground-truth sequence-to-image pairings, providing broad coverage of the human proteome and diverse subcellular compartments. Fine-tuning on OpenCell improves performance for proteins captured at native expression levels.
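To make the dual-modality setup concrete, the sketch below shows the two corruption processes such a framework pairs: Gaussian noising for the continuous image and absorbing-state masking for the discrete sequence. The linear schedule, mask token id, and tensor shapes are illustrative assumptions, not details taken from the paper.

```python
import math
import torch

T = 1000  # total number of corruption steps (illustrative)

def corrupt_image(x0: torch.Tensor, t: int):
    """Continuous (Gaussian) corruption of a localization image,
    using a toy linear alpha-bar schedule purely for illustration."""
    alpha_bar = 1.0 - t / T
    noise = torch.randn_like(x0)
    xt = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

def corrupt_sequence(tokens: torch.Tensor, t: int, mask_id: int = 20):
    """Discrete (absorbing-state) corruption of an amino-acid token
    sequence: each residue is masked independently with probability t/T."""
    drop = torch.rand(tokens.shape) < t / T
    return torch.where(drop, torch.full_like(tokens, mask_id), tokens)

# One paired example: a 20-residue toy protein and a one-channel 64x64 image.
seq = torch.randint(0, 20, (20,))
img = torch.randn(1, 64, 64)
xt, eps = corrupt_image(img, t=400)
seq_t = corrupt_sequence(seq, t=400)
```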
CELL-Diff is implemented as a transformer-based diffusion model that operates on paired inputs: a protein sequence (represented as a token sequence of amino acids) and a reference cell morphology image (multi-channel fluorescence image of cell structure markers such as nucleus and cytoplasm stains). The architecture integrates these two modalities through a shared attention mechanism, where sequence tokens and image patches interact across layers to build a joint representation that informs the generative process.
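As a rough illustration of this shared-attention design, the following sketch concatenates residue-token embeddings with ViT-style image-patch embeddings and runs them through a single transformer stack, so attention flows across modalities. All dimensions, layer counts, and the patch size are placeholders rather than the published configuration.

```python
import torch
import torch.nn as nn

class CrossModalBackbone(nn.Module):
    """Toy joint transformer: amino-acid tokens and image patches share
    one self-attention stack, so either modality can attend to the other.
    All sizes here are placeholders, not CELL-Diff's configuration."""
    def __init__(self, d=256, n_aa=21, img_ch=2, patch=8, layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(n_aa, d)
        # Non-overlapping patch embedding, as in ViT-style encoders.
        self.patch_emb = nn.Conv2d(img_ch, d, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens, image):
        s = self.tok_emb(tokens)                              # (B, L, d)
        p = self.patch_emb(image).flatten(2).transpose(1, 2)  # (B, P, d)
        joint = torch.cat([s, p], dim=1)                      # one shared stream
        return self.encoder(joint)

model = CrossModalBackbone()
out = model(torch.randint(0, 21, (1, 50)), torch.randn(1, 2, 64, 64))
print(out.shape)  # torch.Size([1, 114, 256]): 50 residue + 64 patch tokens
```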
The diffusion modeling framework operates in the image domain. Given a paired (sequence, morphology image) input, CELL-Diff is trained to denoise a noisy version of the target protein localization image, iteratively refining the prediction toward a clean, high-fidelity output. This process is analogous to standard conditional diffusion models used in image generation, with the protein sequence serving as the conditioning signal instead of a text prompt. The model integrates continuous diffusion (for the image) with a discrete representation scheme for the amino acid sequence, unified within the transformer backbone so that cross-modal attention can propagate information in both directions during generation.
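A minimal training-step sketch of this conditional denoising objective follows, assuming a standard epsilon-prediction loss, a toy linear noise schedule, and a hypothetical model signature taking (sequence, morphology, noisy image, timestep); none of these specifics are taken from the paper.

```python
import torch
import torch.nn.functional as F

def denoising_step(model, seq_tokens, morph_img, target_img, T=1000):
    """One conditional denoising training step: noise the target
    localization image, then predict that noise given the protein
    sequence and morphology reference as conditioning. The schedule
    is a toy linear one and `model`'s signature is hypothetical."""
    b = target_img.shape[0]
    t = torch.randint(1, T, (b,), device=target_img.device)
    a = (1.0 - t.float() / T).view(b, 1, 1, 1)  # alpha-bar, toy schedule
    noise = torch.randn_like(target_img)
    noisy = a.sqrt() * target_img + (1.0 - a).sqrt() * noise
    pred_noise = model(seq_tokens, morph_img, noisy, t)
    return F.mse_loss(pred_noise, noise)
```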
CELL-Diff was trained on the Human Protein Atlas (HPA) dataset, which contains fluorescence microscopy images for thousands of human proteins across multiple cell lines, paired with the corresponding protein sequences from UniProt. The model was subsequently fine-tuned on the OpenCell dataset, which provides images of proteins expressed at endogenous levels in human cells, offering a more physiologically accurate reference than overexpression systems. Training on HPA first provides broad proteome-wide coverage, while fine-tuning on OpenCell improves the fidelity of predictions for proteins in their natural abundance regime. Benchmark results demonstrate that CELL-Diff outperforms existing computational methods for generating high-fidelity protein localization images, as measured by image similarity metrics and biological accuracy of predicted localization compartments.
For the inverse direction (image to sequence), the model uses a reverse diffusion process that starts from a target localization image and generates candidate protein sequences conditioned on the observed spatial distribution. This inverse mapping is significantly more challenging than forward generation because the sequence-to-localization map is many-to-one: many distinct sequences produce similar localization patterns, so any single image is consistent with a large family of sequences. It nonetheless provides a novel tool for computational protein design with imaging-based fitness functions.
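One common way to realize such discrete generation is iterative unmasking: start from a fully masked sequence and progressively commit the highest-confidence residues while conditioning on the image throughout. The sketch below illustrates that scheme under assumed shapes and a hypothetical model interface; the paper's exact sampler may differ.

```python
import torch

@torch.no_grad()
def sample_sequence(model, loc_img, seq_len=200, steps=50, mask_id=20):
    """Image-to-sequence sampling via iterative unmasking (one common
    discrete-diffusion scheme). `model(seq, loc_img)` is a hypothetical
    call returning per-position logits of shape (1, seq_len, 21)."""
    seq = torch.full((1, seq_len), mask_id, dtype=torch.long)
    per_step = max(1, seq_len // steps)
    for _ in range(steps):
        probs = model(seq, loc_img).softmax(-1)
        conf, pred = probs.max(-1)                     # (1, L) each
        conf = conf.masked_fill(seq != mask_id, -1.0)  # keep revealed slots fixed
        idx = conf.topk(per_step, dim=-1).indices      # most confident masked slots
        seq[0, idx[0]] = pred[0, idx[0]]
    return seq
```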
CELL-Diff targets cell biologists, structural biologists, and computational protein designers who want to predict or design subcellular localization properties without running additional wet-lab imaging experiments. In the forward direction, the most immediate application is screening uncharacterized proteins — including predicted open reading frames from genome sequencing, novel synthetic protein designs, or disease-associated variants — for their expected subcellular localization. Rather than imaging each protein individually, researchers can generate virtual localization images from sequence alone and prioritize which proteins warrant experimental characterization. In drug discovery, CELL-Diff can predict whether a therapeutic protein or protein fragment will localize to the intended compartment before committing to cell-based experiments. The virtual co-localization capability is particularly powerful for studying protein-protein interactions through spatial proximity: by generating localization images for two proteins of interest conditioned on the same morphology reference, researchers can assess whether those proteins are likely to occupy the same cellular compartment. In the reverse direction, the image-to-sequence generative mode opens a new class of protein design tasks where the design objective is expressed as a microscopy image, enabling researchers to specify a desired localization pattern and computationally explore protein sequences predicted to achieve it. This connects the fields of protein design and cell imaging in a way that has not been accessible with previous tools.
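As a usage illustration of the co-localization workflow, a Pearson correlation over two predicted localization images is a simple quantitative readout; the `generate` calls in the comments are a stand-in for whatever sampling API the released model exposes, not a documented interface.

```python
import torch

def pearson_colocalization(img_a: torch.Tensor, img_b: torch.Tensor) -> float:
    """Pearson correlation between two predicted localization images,
    a standard quantitative co-localization readout."""
    a = img_a.flatten().float()
    b = img_b.flatten().float()
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (a.norm() * b.norm() + 1e-8))

# Hypothetical usage: `generate` stands in for the model's sampling API.
# img_a = generate(model, seq_protein_a, morphology_reference)
# img_b = generate(model, seq_protein_b, morphology_reference)
# print(pearson_colocalization(img_a, img_b))  # near 1.0 = strong co-localization
```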
CELL-Diff establishes a new modality connection in biological foundation modeling by demonstrating that protein sequences and fluorescence microscopy images can be jointly modeled within a single generative diffusion framework. Earlier protein localization predictors produced categorical compartment labels or probability distributions over fixed compartments, discarding the rich spatial information present in actual microscopy data. By generating full images, CELL-Diff preserves spatial heterogeneity, complex multi-compartment distributions, and cell-to-cell variability — features that are biologically meaningful but invisible to classification approaches. The model's acceptance at ICML 2025 reflects its novelty as a methodological contribution to generative modeling of biological imaging data, and its integration into the CZI Virtual Cell Platform positions it alongside a growing ecosystem of tools for AI-driven cell biology. A key limitation is that the model's predictions are conditioned on training data distributions from the HPA and OpenCell datasets, which cover human proteins in specific cell lines; performance may degrade for proteins from other organisms, highly unusual localization patterns, or cell types substantially different from those represented in training. The inverse (image-to-sequence) direction is also still an open scientific challenge, as the mapping from localization pattern to amino acid sequence is highly degenerate — many sequences produce similar localization — meaning the generated sequences require experimental validation. As the scale and diversity of publicly available paired sequence-imaging datasets grow, future versions of CELL-Diff and related models are likely to improve substantially in coverage and accuracy.