MAE3D-OpenCell

Helmholtz Munich

Self-supervised 3D masked autoencoder for volumetric fluorescence microscopy, aligned to ESM2 protein embeddings for localization and interaction prediction.

Released: June 2026

MAE3D-OpenCell is a self-supervised vision model that learns volumetric representations of cells directly from 3D fluorescence microscopy. Most representation learning on cellular imaging has collapsed image stacks into 2D, using maximum-intensity projections or single slices, which discards the spatial context of where proteins sit within the volume of a cell. This work, from the Marr Group at the Institute of AI for Health, Helmholtz Munich, asks whether retaining the full 3D structure yields better cellular representations. The authors report that it does: a 3D masked autoencoder (MAE-3D) consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks.

The model is pretrained on the public OpenCell dataset, which pairs a nucleus reference channel with an endogenously tagged protein channel across thousands of human proteins imaged by live-cell confocal microscopy. Beyond pure image reconstruction, MAE3D-OpenCell introduces a multimodal twist: image embeddings are aligned to protein language model (ESM2) embeddings of the corresponding protein sequence through an InfoNCE contrastive objective, injecting sequence-level biological priors into the visual representation.
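The contrastive alignment described above can be illustrated with a minimal NumPy sketch of a one-directional InfoNCE loss. It is not the authors' implementation: the function name `info_nce`, the temperature of 0.07, and the assumption that image and ESM2 embeddings have already been projected to a shared dimension are all illustrative choices, and the random vectors stand in for real embeddings.

```python
import numpy as np

def info_nce(image_emb, protein_emb, temperature=0.07):
    """One-directional InfoNCE (hypothetical helper, not the repository's API).

    Row i of image_emb is treated as the positive pair of row i of
    protein_emb; every other row in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    prot = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    logits = img @ prot.T / temperature  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Stand-in data: 8 "protein" vectors and image vectors that nearly match them.
rng = np.random.default_rng(0)
protein = rng.normal(size=(8, 32))
aligned = protein + 0.01 * rng.normal(size=(8, 32))
shuffled = np.roll(aligned, 1, axis=0)  # break the image/protein pairing

loss_aligned = info_nce(aligned, protein)
loss_shuffled = info_nce(shuffled, protein)
print(loss_aligned, loss_shuffled)
```

The loss is low when each image embedding is closest to its own protein embedding and rises sharply when the pairing is broken, which is the pressure that pulls the visual representation toward sequence identity.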

Released as a preprint in June 2026 (accepted to MICCAI 2026), the work sits at the intersection of self-supervised imaging foundation models and protein biology. It is distributed as a family of models — MAE2D, MAE3D, and their protein-aligned final versions (denoted MAE2D* and MAE3D*) — with the 3D, protein-aligned variant serving as the flagship.

#Key Features

  • Volumetric pretraining: A 3D masked autoencoder reconstructs masked patches across full image stacks (Z, C, H, W), preserving depth information that 2D projection-based methods discard.
  • Cross-modal protein alignment: An InfoNCE contrastive loss aligns image embeddings with ESM2 protein language model embeddings, grounding visual features in protein sequence identity.
  • Channel cross-attention: A dual-stream encoder/decoder applies position-wise attention between the nucleus reference channel and the protein channel, letting the model relate protein signal to subcellular context.
  • Frequency-domain regularization: An FFT-based reconstruction loss in the frequency domain sharpens fine spatial detail beyond pixel-space reconstruction alone.
  • Staged training recipe: Models progress from a baseline MAE, to cross-attention and FFT-augmented variants, to a final protein-aligned model that resumes from the FFT checkpoint.
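As a rough illustration of the frequency-domain regularization item in the list above, the following NumPy sketch adds an FFT-based term to a pixel-space reconstruction loss. The exact form of the loss used by the model is not given here, so the L1 distance between spectra, the `fft_weight` of 0.1, and the function names are assumptions made for the example.

```python
import numpy as np

def fft_reconstruction_loss(pred, target):
    """Mean magnitude of the difference between 3D Fourier spectra.

    Inputs use the (Z, C, H, W) layout; the transform runs over the
    spatial axes Z, H, W and leaves the channel axis untouched.
    """
    pred_f = np.fft.fftn(pred, axes=(0, 2, 3), norm="ortho")
    target_f = np.fft.fftn(target, axes=(0, 2, 3), norm="ortho")
    # The complex difference penalises both amplitude and phase errors.
    return float(np.mean(np.abs(pred_f - target_f)))

def combined_loss(pred, target, fft_weight=0.1):
    """Pixel-space MSE plus a weighted frequency-domain term (weight is assumed)."""
    pixel = float(np.mean((pred - target) ** 2))
    return pixel + fft_weight * fft_reconstruction_loss(pred, target)

# Small stand-in volume: 4 Z-slices, 2 channels, 8x8 pixels.
rng = np.random.default_rng(0)
target = rng.random((4, 2, 8, 8))
noisy = target + 0.1 * rng.normal(size=target.shape)

loss_same = combined_loss(target, target)
loss_noisy = combined_loss(noisy, target)
print(loss_same, loss_noisy)
```

A perfect reconstruction gives zero for both terms, while any deviation is penalised in pixel space and in the spectrum.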

#Technical Details

The architecture is a transformer-based masked autoencoder operating on 3D inputs of shape (100, 2, 176, 176): 100 Z-slices, two channels (nucleus and protein), at 176×176 resolution, with 75% patch masking. The 2D variants use the Z-max-projection (176×176, two channels). Cross-modal supervision comes from a frozen pretrained ESM2 protein language model via InfoNCE. On the OpenCell benchmark, the protein-aligned 3D model is reported to reach state-of-the-art results: AUC_micro of 0.952 and F1_micro of 0.742 on protein subcellular localization (gains of +0.003 and +0.010 over baselines), and ROC-AUC of 0.865 on protein-protein interaction prediction (+0.025). The reference implementation pins Python 3.11.9, PyTorch 2.1.2, and CUDA 11.8 for reproducibility.
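To make the input shape and masking ratio concrete, here is a small sketch of how a (100, 2, 176, 176) volume could be tiled into 3D patches and 75% of them masked at random. The patch size of (10, 16, 16) is purely hypothetical, chosen because it divides the volume evenly, and whether the real model masks channels jointly or separately is not stated above; the helper names are illustrative.

```python
import numpy as np

def patch_grid(volume_shape, patch_size):
    """Number of patches along Z, H, W for a (Z, C, H, W) volume."""
    z, _, h, w = volume_shape
    pz, ph, pw = patch_size
    if z % pz or h % ph or w % pw:
        raise ValueError("patch size must divide the volume evenly")
    return (z // pz, h // ph, w // pw)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask (True = masked) and the sorted visible indices."""
    num_keep = int(num_patches * (1 - mask_ratio))
    order = rng.permutation(num_patches)
    visible = np.sort(order[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[visible] = False
    return mask, visible

rng = np.random.default_rng(0)
# Patch size is an assumption for illustration, not taken from the paper.
grid = patch_grid((100, 2, 176, 176), patch_size=(10, 16, 16))
num_patches = int(np.prod(grid))
mask, visible = random_mask(num_patches, mask_ratio=0.75, rng=rng)
print(grid, num_patches, int(mask.sum()), len(visible))
```

Under these assumed patch dimensions the grid is 10 × 11 × 11 = 1210 patches per channel, of which 302 stay visible to the encoder and 908 are reconstructed by the decoder.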

#Applications

The learned embeddings support core spatial-proteomics tasks: predicting the subcellular localization of tagged proteins and inferring protein-protein interactions from imaging alone. Because the representations are self-supervised and protein-grounded, they are useful for cell biologists and high-content screening groups who want to mine large fluorescence microscopy collections without exhaustive manual annotation, and as a feature extractor for downstream classification or retrieval over volumetric cell images.
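For the retrieval use case mentioned above, a feature extractor's output can be searched with plain cosine similarity. The sketch below uses random vectors in place of real MAE3D-OpenCell embeddings, and `top_k_neighbors` is a hypothetical helper, not part of the released code.

```python
import numpy as np

def top_k_neighbors(query, gallery, k=3):
    """Indices and cosine scores of the k gallery rows most similar to query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Stand-in gallery of 100 cell embeddings with 64 dimensions each.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 64))
# A query that is a slightly perturbed copy of gallery entry 42.
query = gallery[42] + 0.05 * rng.normal(size=64)

indices, scores = top_k_neighbors(query, gallery)
print(indices, scores)
```

With real embeddings, the nearest neighbours of a query cell would be candidates for sharing a localization pattern, which is one way to mine an unannotated collection.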

#Impact

By showing that 3D context plus protein language model alignment improves cellular representations, MAE3D-OpenCell advances the case for treating microscopy as genuinely volumetric, multimodal data rather than a stack of 2D pictures. A practical limitation is that the repository does not yet ship pretrained checkpoints. Given how recent the preprint is, users must currently run pretraining themselves using the provided scripts and OpenCell data. Documentation lives in the repository README and YAML configs rather than a standalone model card or data card, so reproducibility depends on the included dependency pins and configuration files.

Citation

3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy

Preprint

Kardoost, A., et al. (2026) 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy. arXiv.

DOI: 10.48550/arXiv.2606.23964

GitHub

Stars: 1
Forks: 0
Open issues: 0
Contributors: 2
Last push: 5d ago
Language: Python

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Score: 71 (Open)
Usability (can I run it?): 67
Reproducibility (can I retrain it?): 78
Model Openness Framework: Unclassified (missing required components)

Tags

cell_biology, contrastive_learning, fluorescence_microscopy, masked_autoencoder, multimodal, protein_protein_interaction, representation_learning, self_supervised, vision_transformer
