Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University
Self-supervised medical vision-and-language pretraining via multi-modal masked autoencoders that reconstruct masked image patches and text tokens.
M3AE (Multi-Modal Masked Autoencoders) is a self-supervised framework for medical vision-and-language pre-training that learns transferable cross-modal representations by reconstructing missing content from jointly masked medical images and their accompanying text. Introduced by Zhihong Chen and colleagues at the Shenzhen Research Institute of Big Data, the Chinese University of Hong Kong, Shenzhen, and Sun Yat-sen University, the work was published at MICCAI 2022. It extends the masked-autoencoder paradigm (popularized for images by MAE and for language by BERT) into a unified multi-modal objective tailored to the medical domain, where paired radiology images and reports are abundant but expert labels are scarce.
The central insight is that images and text carry very different information densities, so a single shared masking strategy is suboptimal. M3AE applies an asymmetric masking scheme: a high masking ratio for image patches (which are spatially redundant) and a lower ratio for text tokens (which are information-dense). By forcing the model to reconstruct masked pixels and masked words from the surviving cross-modal context, M3AE learns to align visual and textual concepts without requiring any manual annotations during pre-training.
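The asymmetric masking idea can be illustrated with a minimal sketch. It uses only the Python standard library and is not the official implementation: the function names (`sample_mask`, `asymmetric_mask`) are hypothetical, and the default ratios of 0.75 for image patches and 0.15 for text tokens are illustrative assumptions echoing common MAE and BERT conventions, not values stated in this description.

```python
import random


def sample_mask(num_items, mask_ratio, rng):
    """Pick round(num_items * mask_ratio) distinct indices uniformly at random."""
    num_masked = round(num_items * mask_ratio)
    return sorted(rng.sample(range(num_items), num_masked))


def asymmetric_mask(num_patches, num_tokens,
                    image_ratio=0.75, text_ratio=0.15, seed=0):
    """Mask image patches and text tokens with different ratios.

    image_ratio and text_ratio are assumed illustrative values:
    a high ratio for spatially redundant patches, a low ratio
    for information-dense tokens.
    """
    rng = random.Random(seed)
    image_masked = sample_mask(num_patches, image_ratio, rng)
    text_masked = sample_mask(num_tokens, text_ratio, rng)
    return image_masked, text_masked


# Example: a 14 x 14 patch grid (196 patches) and a 20-token caption.
image_masked, text_masked = asymmetric_mask(num_patches=196, num_tokens=20)
print(len(image_masked), "of 196 patches masked")
print(len(text_masked), "of 20 tokens masked")
```

With these assumed ratios, most of the image is hidden while most of the caption stays visible, so the surviving text can guide reconstruction of the missing patches and the surviving patches can help recover the few masked words.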
Positioned among early medical vision-language foundation models, M3AE demonstrated that generative masked reconstruction, rather than purely contrastive alignment, can serve as an effective pre-training signal for clinical multimodal data, providing strong initialization for a range of downstream radiology tasks.
M3AE pre-trains on paired medical image-text data drawn from ROCO (Radiology Objects in Context) and MedICaT, two large collections of radiology figures with captions. The architecture couples a vision-transformer image encoder and a transformer text encoder whose multi-layer features feed a Transformer visual decoder and an MLP textual decoder; training minimizes a combined pixel-reconstruction and masked-language-modeling loss. The authors also constructed a medical vision-and-language benchmark spanning three task families to evaluate transfer. On downstream evaluation, M3AE achieves state-of-the-art results across medical visual question answering (VQA-RAD, SLAKE, and the VQA-Med 2019 set), medical image-text classification (MELINDA), and medical image-caption retrieval (ROCO), outperforming prior contrastive and supervised baselines after fine-tuning. Pre-trained and fine-tuned checkpoints are released alongside the official PyTorch implementation.
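The combined objective described above can be sketched as the sum of a pixel-reconstruction term and a masked-language-modeling term, each computed only over masked positions. This is a hedged, standard-library-only illustration, not the released PyTorch code: the function names, the use of mean squared error for pixels, and the equal default weights (`image_weight`, `text_weight`) are assumptions made for the example.

```python
import math


def masked_mse(pred_patches, target_patches, masked_idx):
    """Mean squared error over the pixels of masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred_patches[i], target_patches[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def masked_cross_entropy(logits, target_ids, masked_idx):
    """Average cross-entropy over masked token positions only."""
    total = 0.0
    for i in masked_idx:
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target_ids[i]]
    return total / len(masked_idx) if masked_idx else 0.0


def combined_loss(pred_patches, target_patches, image_masked,
                  logits, target_ids, text_masked,
                  image_weight=1.0, text_weight=1.0):
    """Weighted sum of the two reconstruction terms (weights are assumed)."""
    return (image_weight * masked_mse(pred_patches, target_patches, image_masked)
            + text_weight * masked_cross_entropy(logits, target_ids, text_masked))


# Toy example: two 2-pixel patches (patch 1 masked) and one masked token
# scored against a 4-word vocabulary with uniform logits.
pred = [[0.0, 0.0], [1.0, 1.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
logits = [[0.0, 0.0, 0.0, 0.0]]
loss = combined_loss(pred, target, [1], logits, [2], [0])
print(loss)
```

Restricting both terms to masked positions keeps the model from being rewarded for simply copying visible input, which is the usual design choice in masked-autoencoder training.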
M3AE provides a pre-trained backbone that radiology and clinical-NLP researchers can fine-tune for downstream multimodal tasks such as answering questions about radiology images, retrieving relevant images from text queries (and vice versa), and classifying image-caption pairs. Because pre-training is label-free, it is well suited to medical settings where annotated data is limited but paired images and reports are plentiful, lowering the barrier to building task-specific models for radiology question answering, report-image alignment, and clinical decision-support prototypes.
As one of the early demonstrations that masked-reconstruction pre-training transfers effectively to medical vision-language data, M3AE helped establish generative self-supervision as a viable alternative to contrastive methods such as CLIP-style alignment in the clinical domain. Its strong benchmark results on VQA-RAD, SLAKE, and related datasets made it a frequently cited baseline in subsequent medical multimodal foundation-model research, and its publicly released code and checkpoints have supported reproduction and extension by the community. The primary limitation is scale: M3AE was trained on radiology figure-caption corpora rather than the much larger web-scale datasets used by later general-domain models, so its coverage is concentrated in radiology imaging.
Chen, Z., et al. (2022) Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training. International Conference on Medical Image Computing and Computer-Assisted Intervention.
DOI: 10.1007/978-3-031-16443-9_65