M3AE

Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University

Self-supervised medical vision-and-language pretraining via multi-modal masked autoencoders that reconstruct masked image patches and text tokens.

Released: September 2022

M3AE (Multi-Modal Masked Autoencoders) is a self-supervised framework for medical vision-and-language pre-training that learns transferable cross-modal representations by reconstructing missing content from jointly masked medical images and their accompanying text. Introduced by Zhihong Chen and colleagues at the Shenzhen Research Institute of Big Data, the Chinese University of Hong Kong, Shenzhen, and Sun Yat-sen University, the work was published at MICCAI 2022. It extends the masked-reconstruction paradigm, popularized for images by MAE and for language by BERT, into a unified multi-modal objective tailored to the medical domain, where paired radiology images and text are abundant but expert labels are scarce.

The central insight is that images and text carry very different information densities, so a single shared masking strategy is suboptimal. M3AE applies an asymmetric masking scheme: a high masking ratio for image patches (which are spatially redundant) and a lower ratio for text tokens (which are information-dense). By forcing the model to reconstruct masked pixels and masked words from the surviving cross-modal context, M3AE learns to align visual and textual concepts without requiring any manual annotations during pre-training.
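The asymmetric masking idea can be illustrated with a minimal sketch. The ratios used here (0.75 for image patches, 0.15 for text tokens) and the function names are illustrative assumptions, not values or APIs taken from this page; the only property carried over from the description above is that the image ratio is much higher than the text ratio.

```python
import random

def sample_mask(num_items, mask_ratio, rng):
    """Pick which positions to mask: roughly mask_ratio of num_items, without repeats."""
    num_masked = round(num_items * mask_ratio)
    return sorted(rng.sample(range(num_items), num_masked))

def asymmetric_mask(num_patches, num_tokens,
                    image_ratio=0.75,  # assumed: high ratio, patches are redundant
                    text_ratio=0.15,   # assumed: low ratio, tokens are information-dense
                    seed=0):
    """Draw independent masks for the two modalities with different ratios."""
    rng = random.Random(seed)
    return {
        "image": sample_mask(num_patches, image_ratio, rng),
        "text": sample_mask(num_tokens, text_ratio, rng),
    }

# Hypothetical input sizes: a 14x14 patch grid and a 40-token caption.
masks = asymmetric_mask(num_patches=196, num_tokens=40)
print(len(masks["image"]), "of 196 patches masked")
print(len(masks["text"]), "of 40 tokens masked")
```

With these assumed settings most of the image is hidden while most of the caption remains visible, so the surviving text can guide pixel reconstruction and the surviving patches can guide word prediction.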

Positioned among early medical vision-language foundation models, M3AE demonstrated that generative masked reconstruction, rather than purely contrastive alignment, can serve as an effective pre-training signal for clinical multimodal data and provide a strong initialization for a range of downstream radiology tasks.

Key Features

Asymmetric multi-modal masking: Uses a considerably larger masking ratio for images than for text, reflecting the lower information density of visual patches versus language tokens.
Dual reconstruction objective: Jointly recovers masked image pixels and masked text tokens, learning fine-grained cross-modal correspondence as a byproduct of the reconstruction task.
Modality-specific decoders: Pairs a Transformer-based decoder for vision with a lightweight MLP decoder for language, matching each decoder to the abstraction level of its modality.
Multi-layer feature fusion: Draws visual and textual features from multiple encoder layers to handle the differing levels of abstraction across vision and language.
Self-supervised, label-free pre-training: Requires only paired medical image-caption data, avoiding the cost of expert annotation while producing reusable representations.

Technical Details

M3AE pre-trains on paired medical image-text data drawn from ROCO (Radiology Objects in Context) and MedICaT, two large collections of radiology figures with captions. The architecture couples a vision-transformer image encoder and a transformer text encoder whose multi-layer features feed a Transformer visual decoder and an MLP textual decoder; training minimizes a combined pixel-reconstruction and masked-language-modeling loss. The authors also constructed a medical vision-and-language benchmark spanning three task families to evaluate transfer. On downstream evaluation, M3AE achieves state-of-the-art results across medical visual question answering (VQA-RAD, SLAKE, and the VQA-Med 2019 set), medical image-text classification (MELINDA), and medical image-caption retrieval (ROCO), outperforming prior contrastive and supervised baselines after fine-tuning. Pre-trained and fine-tuned checkpoints are released alongside the official PyTorch implementation.
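The combined objective described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the official implementation: the function names are hypothetical, the two terms are weighted equally, and both losses are computed over masked positions only (a common convention for masked autoencoders that this page does not confirm for M3AE).

```python
import math

def masked_pixel_mse(pred_patches, target_patches, masked_idx):
    """Mean squared error over the pixels of masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred_patches[i], target_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

def masked_token_cross_entropy(logits, target_ids, masked_idx):
    """Average negative log-likelihood of the true token at each masked position."""
    total = 0.0
    for i in masked_idx:
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target_ids[i]]
    return total / len(masked_idx) if masked_idx else 0.0

def combined_loss(image_loss, text_loss, image_weight=1.0, text_weight=1.0):
    """Weighted sum of the two reconstruction terms (equal weights are an assumption)."""
    return image_weight * image_loss + text_weight * text_loss

# Toy example: two 2-pixel patches (patch 1 masked), two tokens over a 2-word vocabulary (token 0 masked).
mse = masked_pixel_mse([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 2.0]], masked_idx=[1])
ce = masked_token_cross_entropy([[0.0, 0.0], [0.0, 0.0]], target_ids=[0, 1], masked_idx=[0])
print(mse, ce, combined_loss(mse, ce))
```

In the toy example the pixel term is the average of two unit squared errors, and the text term is the cross-entropy of a uniform prediction over two words.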

Applications

M3AE provides a pre-trained backbone that radiology and clinical-NLP researchers can fine-tune for downstream multimodal tasks such as answering questions about radiology images, retrieving relevant images from text queries (and vice versa), and classifying image-caption pairs. Because pre-training is label-free, it is well suited to medical settings where annotated data is limited but paired images and reports are plentiful, lowering the barrier to building task-specific models for radiology question answering, report-image alignment, and clinical decision-support prototypes.

Impact

As one of the early demonstrations that masked-reconstruction pre-training transfers effectively to medical vision-language data, M3AE helped establish generative self-supervision as a viable alternative to contrastive methods such as CLIP-style alignment in the clinical domain. Its strong benchmark results on VQA-RAD, SLAKE, and related datasets made it a frequently cited baseline in subsequent medical multimodal foundation-model research, and its publicly released code and checkpoints have supported reproduction and extension by the community. The primary limitation is scale: M3AE was trained on radiology figure-caption corpora rather than the much larger web-scale datasets used by later general-domain models, so its coverage is concentrated in radiology imaging.

Citation

Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training

Chen, Z., et al. (2022). Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).

DOI: 10.1007/978-3-031-16443-9_65

Recent citations

Papers that recently cited this model.

Research on the application of LLaVA model based on QLoRA fine-tuning in medical teaching
Shiling Zhou, Fengmei Qin
PLoS ONE · Jul 2026 · 0 citations

A knowledge enhanced framework for interpretable medical visual question and answering via large foundation model
Xinyan Deng, Yinxin Xu, Xiaorou Zheng, et al.
Multimedia Systems · Jul 2026 · 0 citations

Lar-Net: A hierarchical referring image segmentation framework for poisonous weed identification on the Tibetan Plateau
Qing Dong, Chunmei Li, Hao Wang, et al.
Ecological Informatics · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, et al.
arXiv.org · May 2023 · 347 citations

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Mar 2023 · 313 citations

Pre-trained Language Models in Biomedical Domain: A Systematic Survey
Benyou Wang, Qianqian Xie, Jiahuan Pei, et al.
ACM Computing Surveys · Oct 2021 · 238 citations

Large AI Models in Health Informatics: Applications, Challenges, and the Future
Jianing Qiu, Lin Li, Jiankai Sun, et al.
IEEE Journal of Biomedical and Health Informatics · Mar 2023 · 213 citations

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024 · 134 citations

Citations

Total Citations: 192

Influential: 25

References: 30

GitHub

Stars: 134

Forks: 14

Open Issues: 10

Contributors: 1

Last Push: 3y ago

Language: Python

Fields of citing research

Computer Science: 100%
Medicine: 88%
Engineering: 16%
Environmental Science: 2%
Biology: 2%
Physics: 1%
Agricultural and Food Sciences: 1%
Linguistics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

29 · Closed

Usability (can I run it?): 24

Reproducibility (can I retrain it?): 23

Model Openness Framework

Unclassified

Restrictive license on core components

