bio.rodeo

M3AE

Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University

Self-supervised medical vision-and-language pretraining via multi-modal masked autoencoders that reconstruct masked image patches and text tokens.

Released: September 2022

M3AE (Multi-Modal Masked Autoencoders) is a self-supervised framework for medical vision-and-language pre-training. It learns transferable cross-modal representations by reconstructing missing content from jointly masked medical images and their accompanying text. The work was introduced by Zhihong Chen and colleagues at the Shenzhen Research Institute of Big Data, the Chinese University of Hong Kong, Shenzhen, and Sun Yat-sen University, and was published at MICCAI 2022. It extends the masked-autoencoder paradigm (popularized for images by MAE and for language by BERT) into a unified multi-modal objective tailored to the medical domain, where paired radiology images and reports are abundant but expert labels are scarce.

The central insight is that images and text carry very different information densities, so a single shared masking strategy is suboptimal. M3AE applies an asymmetric masking scheme: a high masking ratio for image patches (which are spatially redundant) and a lower ratio for text tokens (which are information-dense). By forcing the model to reconstruct masked pixels and masked words from the surviving cross-modal context, M3AE learns to align visual and textual concepts without requiring any manual annotations during pre-training.
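The asymmetric masking idea can be illustrated with a minimal sketch. The helper names (`random_mask`, `asymmetric_mask`) are hypothetical, and the ratios of 0.75 for image patches and 0.15 for text tokens are illustrative assumptions borrowed from common MAE and BERT defaults; the text above does not state the exact values M3AE uses.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Pick round(num_items * mask_ratio) distinct indices to mask, uniformly at random."""
    num_masked = round(num_items * mask_ratio)
    return sorted(rng.sample(range(num_items), num_masked))


def asymmetric_mask(num_patches, num_tokens, image_ratio=0.75, text_ratio=0.15, seed=0):
    """Mask image patches heavily and text tokens lightly.

    The default ratios are assumptions for illustration, not values taken
    from the M3AE description above.
    """
    rng = random.Random(seed)
    return {
        "image": random_mask(num_patches, image_ratio, rng),
        "text": random_mask(num_tokens, text_ratio, rng),
    }


# A 14 x 14 grid of patches (196 total) paired with a 32-token caption.
masks = asymmetric_mask(num_patches=196, num_tokens=32)
print(len(masks["image"]), "of 196 patches masked")
print(len(masks["text"]), "of 32 tokens masked")
```

With these assumed ratios, most of the image is hidden while most of the caption stays visible, so the surviving text can guide reconstruction of the masked patches.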

As one of the early medical vision-language foundation models, M3AE demonstrated that generative masked reconstruction, rather than purely contrastive alignment, can serve as an effective pre-training signal for clinical multimodal data and can provide a strong initialization for a range of downstream radiology tasks.

#Key Features

  • Asymmetric multi-modal masking: Uses a considerably larger masking ratio for images than for text, reflecting the lower information density of visual patches versus language tokens.
  • Dual reconstruction objective: Jointly recovers masked image pixels and masked text tokens, learning fine-grained cross-modal correspondence as a byproduct of the reconstruction task.
  • Modality-specific decoders: Pairs a Transformer-based decoder for vision with a lightweight MLP decoder for language, matching each decoder to the abstraction level of its modality.
  • Multi-layer feature fusion: Draws visual and textual features from multiple encoder layers to handle the differing levels of abstraction across vision and language.
  • Self-supervised, label-free pre-training: Requires only paired medical image-caption data, avoiding the cost of expert annotation while producing reusable representations.

#Technical Details

M3AE pre-trains on paired medical image-text data drawn from ROCO (Radiology Objects in Context) and MedICaT, two large collections of radiology figures with captions. The architecture couples a vision-transformer image encoder with a transformer text encoder. Features taken from multiple layers feed two decoders: a Transformer decoder that reconstructs masked image patches and an MLP decoder that predicts masked text tokens. Training minimizes a combined pixel-reconstruction and masked-language-modeling loss. To evaluate transfer, the authors also constructed a medical vision-and-language benchmark spanning three task families. On this benchmark, the paper reports state-of-the-art results at the time of publication for medical visual question answering (VQA-RAD, SLAKE, and the VQA-Med 2019 set), medical image-text classification (MELINDA), and medical image-caption retrieval (ROCO), outperforming prior contrastive and supervised baselines after fine-tuning. Pre-trained and fine-tuned checkpoints are released alongside the official PyTorch implementation.
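The combined objective can be sketched in plain Python as follows. This is a toy illustration, not the official implementation: the function names are hypothetical, the equal weighting of the two terms is an assumption, and restricting both losses to masked positions follows the usual MAE and BERT convention rather than anything stated above.

```python
import math


def masked_pixel_mse(pred, target, masked_idx):
    """Mean squared error over the pixels of masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def masked_token_ce(logits, target_ids, masked_idx):
    """Mean cross-entropy over masked token positions only."""
    total = 0.0
    for i in masked_idx:
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target_ids[i]]
    return total / len(masked_idx) if masked_idx else 0.0


def combined_loss(pixel_loss, token_loss, image_weight=1.0, text_weight=1.0):
    """Weighted sum of the two reconstruction terms (equal weights are an assumption)."""
    return image_weight * pixel_loss + text_weight * token_loss


# Toy data: 3 patches of 2 pixels each, 2 tokens over a 4-word vocabulary.
pred_patches = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
true_patches = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
token_ids = [3, 0]

pixel = masked_pixel_mse(pred_patches, true_patches, masked_idx=[1, 2])
token = masked_token_ce(logits, token_ids, masked_idx=[0])
total = combined_loss(pixel, token)
print(pixel, token, total)
```

In the toy data the masked token has uniform logits over four words, so its cross-entropy equals log 4, while the pixel term averages the squared error over the two masked patches.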

#Applications

M3AE provides a pre-trained backbone that radiology and clinical-NLP researchers can fine-tune for downstream multimodal tasks such as answering questions about radiology images, retrieving relevant images from text queries (and vice versa), and classifying image-caption pairs. Because pre-training is label-free, it is well suited to medical settings where annotated data is limited but paired images and reports are plentiful, lowering the barrier to building task-specific models for radiology question answering, report-image alignment, and clinical decision-support prototypes.
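For the retrieval use case, a fine-tuned model would assign a matching score to each text-image pair, and candidates are then ranked by that score. The sketch below shows only the ranking and a Recall@K check; the score matrix is made-up illustrative data, and the helper names `rank_images` and `recall_at_k` are hypothetical rather than part of the released code.

```python
def rank_images(score_row):
    """Return image indices sorted from highest to lowest matching score."""
    return sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)


def recall_at_k(scores, k):
    """Fraction of text queries whose paired image (same index) appears in the top k."""
    hits = sum(1 for i, row in enumerate(scores) if i in rank_images(row)[:k])
    return hits / len(scores)


# scores[i][j]: hypothetical matching score between caption i and image j.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(rank_images(scores[1]))
print(recall_at_k(scores, 1))
print(recall_at_k(scores, 2))
```

The same ranking applies in the other direction (image to text) by transposing the score matrix.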

#Impact

As one of the early demonstrations that masked-reconstruction pre-training transfers effectively to medical vision-language data, M3AE helped establish generative self-supervision as a viable alternative to contrastive methods such as CLIP-style alignment in the clinical domain. Its strong benchmark results on VQA-RAD, SLAKE, and related datasets made it a frequently cited baseline in subsequent medical multimodal foundation-model research, and its publicly released code and checkpoints have supported reproduction and extension by the community. The primary limitation is scale: M3AE was trained on radiology figure-caption corpora rather than the much larger web-scale datasets used by later general-domain models, so its coverage is concentrated in radiology imaging.

Citation

Chen, Z., et al. (2022) Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training. International Conference on Medical Image Computing and Computer-Assisted Intervention.

DOI: 10.1007/978-3-031-16443-9_65

Citations

Total Citations: 183
Influential: 25
References: 30

GitHub

Stars: 132
Forks: 14
Open Issues: 10
Contributors: 1
Last Push: 3y ago
Language: Python

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 29 (Closed)
Usability (can I run it?): 24
Reproducibility (can I retrain it?): 23
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

autoencoder, image_text_retrieval, multimodal, radiology, representation_learning, self_supervised, vision_transformer, visual_question_answering
