MAE3D-OpenCell is a self-supervised vision model that learns volumetric representations of cells directly from 3D fluorescence microscopy. Most representation learning on cellular imaging has historically collapsed image stacks into 2D — using maximum-intensity projections or single slices — which discards the spatial context of where proteins sit within the volume of a cell. This work, from the Marr Group at the Institute of AI for Health, Helmholtz Munich, asks whether retaining the full 3D structure yields better cellular representations, and demonstrates that it does: a 3D masked autoencoder (MAE-3D) consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks.

The model is pretrained on the public OpenCell dataset, which pairs a nucleus reference channel with an endogenously tagged protein channel across thousands of human proteins imaged by live-cell confocal microscopy. Beyond pure image reconstruction, MAE3D-OpenCell introduces a multimodal twist: image embeddings are aligned to protein language model (ESM2) embeddings of the corresponding protein sequence through an InfoNCE contrastive objective, injecting sequence-level biological priors into the visual representation.
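The contrastive alignment described above can be illustrated with a minimal NumPy sketch of an InfoNCE loss over a batch of paired embeddings. This is not the authors' implementation. It assumes that image and ESM2 embeddings have already been projected to a shared dimension, that the loss is applied symmetrically in both directions, and that the temperature is 0.07; none of these details are stated in the source, and the function name `info_nce` is hypothetical.

```python
import numpy as np

def info_nce(image_emb, protein_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, protein) embeddings.

    Assumptions (not from the source): both inputs share one dimension,
    the loss is symmetric, and temperature=0.07.
    """
    # L2-normalise so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    prot = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    logits = img @ prot.T / temperature  # (B, B); matched pairs sit on the diagonal

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->protein and protein->image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
protein = rng.normal(size=(8, 32))                   # stand-in for ESM2 embeddings
aligned = protein + 0.01 * rng.normal(size=(8, 32))  # images that match their protein
shuffled = aligned[::-1].copy()                      # images paired with the wrong protein

loss_aligned = info_nce(aligned, protein)
loss_shuffled = info_nce(shuffled, protein)
```

In this toy setup the correctly paired batch yields a much lower loss than the mispaired one, which is the signal that pulls each image embedding toward the embedding of its own protein sequence.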

Released as a preprint in June 2026 (accepted to MICCAI 2026), the work sits at the intersection of self-supervised imaging foundation models and protein biology. It is distributed as a family of models — MAE2D, MAE3D, and their protein-aligned final versions (denoted MAE2D* and MAE3D*) — with the 3D, protein-aligned variant serving as the flagship.

Key Features

Volumetric pretraining: A 3D masked autoencoder reconstructs masked patches across full image stacks (Z, C, H, W), preserving depth information that 2D projection-based methods discard.
Cross-modal protein alignment: An InfoNCE contrastive loss aligns image embeddings with ESM2 protein language model embeddings, grounding visual features in protein sequence identity.
Channel cross-attention: A dual-stream encoder/decoder applies position-wise attention between the nucleus reference channel and the protein channel, letting the model relate protein signal to subcellular context.
Frequency-domain regularization: An FFT-based reconstruction loss compares predictions and targets in the frequency domain, sharpening fine spatial detail beyond what pixel-space reconstruction alone recovers.
Staged training recipe: Models progress from a baseline MAE, to cross-attention and FFT-augmented variants, to a final protein-aligned model that resumes from the FFT checkpoint.
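As an illustration of the frequency-domain regularization listed above, the following NumPy sketch adds an FFT-based term to a pixel-space reconstruction loss. The exact formulation used by the authors is not given in the source, so the choice of a mean absolute spectral difference, the FFT over the Z, H and W axes only, and the weight `fft_weight=0.1` are all assumptions made for this example.

```python
import numpy as np

def fft_reconstruction_loss(pred, target):
    """Mean absolute difference between the spatial spectra of two volumes.

    Inputs use the (Z, C, H, W) layout; the FFT runs over Z, H and W,
    leaving the channel axis untouched (an assumption for this sketch).
    """
    pred_f = np.fft.fftn(pred, axes=(0, 2, 3), norm="ortho")
    target_f = np.fft.fftn(target, axes=(0, 2, 3), norm="ortho")
    return float(np.mean(np.abs(pred_f - target_f)))

def combined_loss(pred, target, fft_weight=0.1):
    """Pixel-space MSE plus a weighted frequency-domain term (weight is hypothetical)."""
    pixel = float(np.mean((pred - target) ** 2))
    return pixel + fft_weight * fft_reconstruction_loss(pred, target)

rng = np.random.default_rng(0)
target = rng.random((8, 2, 16, 16))                     # small toy volume
blurred = 0.5 * (target + np.roll(target, 1, axis=-1))  # crude blur along W

loss_same = combined_loss(target, target)
loss_blur = combined_loss(blurred, target)
```

A perfect reconstruction gives zero loss, while the blurred volume is penalized in both terms; the spectral term responds directly to the attenuated high frequencies.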

Technical Details

The architecture is a transformer-based masked autoencoder operating on 3D inputs of shape (100, 2, 176, 176): 100 Z-slices and two channels (nucleus and protein) at 176×176 resolution, with 75% of patches masked during pretraining. The 2D variants instead take the Z-max-projection (two channels at 176×176). Cross-modal supervision comes from a frozen, pretrained ESM2 protein language model via InfoNCE.

On the OpenCell benchmark, the protein-aligned 3D model reaches state-of-the-art results: an AUC_micro of 0.952 and an F1_micro of 0.742 on protein subcellular localization (gains of +0.003 and +0.010 over baselines), and a ROC-AUC of 0.865 on protein-protein interaction prediction (+0.025).

The reference implementation pins Python 3.11.9, PyTorch 2.1.2, and CUDA 11.8 for reproducibility.
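To make the input shapes and the 75% masking ratio concrete, here is a small NumPy sketch that builds a toy volume, derives the 2D max-projection input, and draws a random patch mask. The patch size of (10, 16, 16) is purely hypothetical; it was chosen only because it tiles the volume evenly, since the source does not state the patch size. The masking scheme (uniform random over patch indices) is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy volume in the (Z, C, H, W) layout: 100 slices, 2 channels, 176x176 pixels.
volume = rng.random((100, 2, 176, 176), dtype=np.float32)

# Input for the 2D variants: maximum-intensity projection along Z.
projection = volume.max(axis=0)  # shape (2, 176, 176)

# Hypothetical patch size (not from the source), chosen to tile the volume evenly.
patch_z, patch_h, patch_w = 10, 16, 16
grid = (100 // patch_z, 176 // patch_h, 176 // patch_w)  # patches along Z, H, W
num_patches = grid[0] * grid[1] * grid[2]                # patches per channel

mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)

# Uniform random masking: shuffle the patch indices and hide the first 75%.
perm = rng.permutation(num_patches)
mask = np.zeros(num_patches, dtype=bool)
mask[perm[:num_masked]] = True
visible_idx = np.flatnonzero(~mask)  # indices the encoder would actually see
```

Under these assumed patch dimensions, each channel is split into 1210 patches, of which 907 are hidden and 303 remain visible to the encoder.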

Applications

The learned embeddings support core spatial-proteomics tasks: predicting the subcellular localization of tagged proteins and inferring protein-protein interactions from imaging alone. Because the representations are self-supervised and protein-grounded, they are useful for cell biologists and high-content screening groups who want to mine large fluorescence microscopy collections without exhaustive manual annotation, and as a feature extractor for downstream classification or retrieval over volumetric cell images.
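The retrieval use case mentioned above can be sketched as a cosine-similarity nearest-neighbor search over precomputed embeddings. The embeddings here are random stand-ins rather than real model outputs, and the helper name `top_k_neighbors`, the embedding dimension of 64, and the gallery size are illustrative assumptions.

```python
import numpy as np

def top_k_neighbors(query, gallery, k=3):
    """Return indices and cosine similarities of the k gallery rows closest to query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every gallery embedding
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(50, 64))              # stand-in for precomputed cell embeddings
query = gallery[7] + 0.05 * rng.normal(size=64)  # a slightly perturbed copy of cell 7

idx, scores = top_k_neighbors(query, gallery)
```

With real embeddings, the same lookup could surface cells whose tagged proteins show similar localization patterns, without any manual labels.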

Impact

By showing that 3D context plus protein language model alignment improves cellular representations, MAE3D-OpenCell strengthens the case for treating microscopy as genuinely volumetric, multimodal data rather than a stack of 2D pictures. A practical limitation is that the repository does not yet ship pretrained checkpoints: because the preprint is very recent, users must currently run pretraining themselves using the provided scripts and the OpenCell data. Documentation lives in the repository README and YAML configs rather than in a standalone model card or data card, so reproducibility depends on the included dependency pins and configuration files.

Citation

3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy

Preprint

Kardoost, A., et al. (2026) 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy. arXiv.

DOI: 10.48550/arXiv.2606.23964

