Self-supervised 3D masked autoencoder for volumetric fluorescence microscopy, aligned to ESM2 protein embeddings for localization and interaction prediction.
MAE3D-OpenCell is a self-supervised vision model that learns volumetric representations of cells directly from 3D fluorescence microscopy. Most representation learning on cellular imaging has historically collapsed image stacks into 2D — using maximum-intensity projections or single slices — which discards the spatial context of where proteins sit within the volume of a cell. This work, from the Marr Group at the Institute of AI for Health, Helmholtz Munich, asks whether retaining the full 3D structure yields better cellular representations, and demonstrates that it does: a 3D masked autoencoder (MAE-3D) consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks.
The model is pretrained on the public OpenCell dataset, which pairs a nucleus reference channel with an endogenously tagged protein channel across thousands of human proteins imaged by live-cell confocal microscopy. Beyond pure image reconstruction, MAE3D-OpenCell introduces a multimodal twist: image embeddings are aligned to protein language model (ESM2) embeddings of the corresponding protein sequence through an InfoNCE contrastive objective, injecting sequence-level biological priors into the visual representation.
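The contrastive alignment described above can be illustrated with a minimal NumPy sketch of an InfoNCE loss. This is not the authors' implementation: the symmetric (image-to-protein and protein-to-image) form, the temperature of 0.07, and the names `info_nce`, `image_emb`, and `protein_emb` are assumptions for illustration, and both inputs are assumed to have already been projected to a shared dimension.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(image_emb, protein_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    image_emb, protein_emb: arrays of shape (batch, dim); row i of each
    is assumed to describe the same protein (the positive pair). All
    other rows in the batch act as negatives.
    """
    img = l2_normalize(image_emb)
    prot = l2_normalize(protein_emb)
    logits = img @ prot.T / temperature          # (batch, batch) similarities
    idx = np.arange(logits.shape[0])             # positives sit on the diagonal
    loss_i2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_p2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2p + loss_p2i)
```

In a setup like the one described, `protein_emb` would come from the frozen ESM2 model and only the image side would receive gradients; the sketch computes the loss value only and leaves the training loop out.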
Released as a preprint in June 2026 (accepted to MICCAI 2026), the work sits at the intersection of self-supervised imaging foundation models and protein biology. It is distributed as a family of models — MAE2D, MAE3D, and their protein-aligned final versions (denoted MAE2D* and MAE3D*) — with the 3D, protein-aligned variant serving as the flagship.
The architecture is a transformer-based masked autoencoder operating on 3D inputs of shape (100, 2, 176, 176) — 100 Z-slices, two channels (nucleus and protein), at 176x176 resolution — with 75% patch masking; the 2D variants use the Z-max-projection (176x176, two channels). Cross-modal supervision comes from a frozen pretrained ESM2 protein language model via InfoNCE. On the OpenCell benchmark, the protein-aligned 3D model reaches state-of-the-art results: AUC_micro of 0.952 and F1_micro of 0.742 on protein subcellular localization (gains of +0.003 and +0.010 over baselines), and ROC-AUC of 0.865 on protein-protein interaction prediction (+0.025). The reference implementation pins Python 3.11.9, PyTorch 2.1.2, and CUDA 11.8 for reproducibility.
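To make the input shape and masking ratio concrete, here is a minimal NumPy sketch of MAE-style random masking on a (100, 2, 176, 176) volume. The patch size of (10, 16, 16) is a hypothetical choice that divides the volume evenly; the source does not state the patch size, and the function names `patchify_3d` and `random_masking` are illustrative rather than taken from the repository.

```python
import numpy as np

def patchify_3d(volume, patch=(10, 16, 16)):
    """Split a (Z, C, H, W) volume into flattened non-overlapping 3D patches.

    The patch size is an assumption for illustration. Returns an array of
    shape (num_patches, pz * C * ph * pw), one row per patch token.
    """
    z, c, h, w = volume.shape
    pz, ph, pw = patch
    assert z % pz == 0 and h % ph == 0 and w % pw == 0, "patch must tile the volume"
    gz, gh, gw = z // pz, h // ph, w // pw
    x = volume.reshape(gz, pz, c, gh, ph, gw, pw)
    x = x.transpose(0, 3, 5, 1, 2, 4, 6)         # (gz, gh, gw, pz, C, ph, pw)
    return x.reshape(gz * gh * gw, pz * c * ph * pw)

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of tokens and report which ones were masked.

    Returns (visible_tokens, mask, keep_idx), where mask is True for
    masked (hidden) patches and False for visible ones.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = tokens.shape[0]
    n_keep = int(n * (1.0 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx
```

Under these assumptions the volume yields a 10 x 11 x 11 grid of patches. Only the visible quarter is passed to the encoder, and the decoder is trained to reconstruct the masked patches, which is what keeps masked-autoencoder pretraining tractable on volumetric inputs.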
The learned embeddings support core spatial-proteomics tasks: predicting the subcellular localization of tagged proteins and inferring protein-protein interactions from imaging alone. Because the representations are self-supervised and protein-grounded, they are useful for cell biologists and high-content screening groups who want to mine large fluorescence microscopy collections without exhaustive manual annotation, and as a feature extractor for downstream classification or retrieval over volumetric cell images.
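As an illustration of the retrieval use case, the following sketch ranks a gallery of precomputed embeddings by cosine similarity to a query embedding. It assumes embeddings have already been extracted as fixed-length vectors; the function name `top_k_neighbors`, the choice of cosine similarity, and the value of `k` are assumptions, not part of the released code.

```python
import numpy as np

def top_k_neighbors(query, gallery, k=5):
    """Return indices and cosine similarities of the k nearest gallery rows.

    query:   array of shape (dim,), one cell or protein embedding.
    gallery: array of shape (n, dim), embeddings to search over.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                                  # cosine similarity per gallery row
    order = np.argsort(-sims)[:k]                 # highest similarity first
    return order, sims[order]
```

The same normalized embeddings could equally be fed to a lightweight classifier for localization labels; nearest-neighbor search is shown here only because it needs no training.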
By showing that 3D context plus protein language model alignment improves cellular representations, MAE3D-OpenCell strengthens the case for treating microscopy as genuinely volumetric, multimodal data rather than a stack of 2D pictures. A practical limitation is that the repository does not yet ship pretrained checkpoints: given how recent the preprint is, users must currently run pretraining themselves using the provided scripts and OpenCell data. Documentation lives in the repository README and YAML configs rather than a standalone model card or data card, so reproducibility depends on the included dependency pins and configuration files.
Kardoost, A., et al. (2026) 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy. arXiv.
DOI: 10.48550/arXiv.2606.23964