MaCo

Chest X-ray foundation model that pairs masked image modeling with image-report contrastive alignment for zero-shot diagnosis and phrase grounding.

Released: September 2024

MaCo (Masked Contrastive) is a chest X-ray foundation model that learns transferable image representations by jointly aligning radiographs with their free-text clinical reports. It addresses a persistent tension in medical vision-language pretraining: methods that excel at coarse, global image-report matching (enabling zero-shot diagnosis) often lose the fine-grained, pixel-level understanding needed for grounding, segmentation, and detection, while methods optimized for dense prediction tend to sacrifice zero-shot transfer. MaCo aims to deliver both within a single pretraining recipe.

The model was developed by Weijian Huang, Cheng Li, Shanshan Wang and colleagues across the Shenzhen Institute of Advanced Technology (Chinese Academy of Sciences), Pengcheng Laboratory, Harvard University, and Shanghai AI Laboratory, and was published in Nature Communications in September 2024. Its central contribution is to combine masked image modeling with contrastive image-report alignment, and to introduce a correlation weighting mechanism that adjusts how strongly individual masked image patches are matched to the report.

By unifying these objectives, MaCo positions itself alongside contrastive medical vision-language models such as ConVIRT, GLoRIA, BioViL, and MGCA, but distinguishes itself through its granular, patch-level alignment strategy that supports both label-free prediction and dense downstream tasks.

Key Features

Masked contrastive pretraining: Combines a masked autoencoding reconstruction objective with cross-modal contrastive learning, so the encoder simultaneously captures local image structure and global image-report semantics.
Correlation weighting mechanism: A learnable module generates per-patch importance scores from the masked position map (via a softplus-activated fully connected layer), reweighting both the contrastive logits and the loss to prioritize report-relevant regions.
Zero-shot diagnosis: Performs label-free classification by comparing image embeddings to text prompts, removing the need for task-specific annotated training data.
Granular downstream transfer: A single pretrained backbone supports fine-tuning for classification, semantic segmentation, object detection, and phrase grounding.
Open implementation and weights: Released under an MIT license with MAE and MaCo checkpoints, plus task pipelines for fine-tuning, segmentation (MMSegmentation), and detection (ViTDet).
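
The correlation weighting idea listed above can be illustrated with a small, dependency-free sketch. It is not the released implementation: the layer shape, the masking convention (1 = masked), the averaging of visible-patch scores into one scalar, and all function names are assumptions made only for illustration. The description above states only that a softplus-activated fully connected layer maps the masked position map to importance scores, which then reweight the contrastive logits and the loss.

```python
import math

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def patch_scores(mask, weights, bias):
    """Fully connected layer followed by softplus over a flattened mask.

    mask    : list of 0/1 flags, 1 = patch was masked out (assumed convention)
    weights : hypothetical learned matrix; weights[i][j] maps mask[j] to score i
    bias    : hypothetical learned bias, one value per patch
    """
    scores = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * m for w, m in zip(row, mask)) + b
        scores.append(softplus(pre_activation))  # scores are always positive
    return scores

def sample_weight(mask, scores):
    """Average the scores of visible patches into one scalar (an assumption)."""
    visible = [s for s, m in zip(scores, mask) if m == 0]
    return sum(visible) / len(visible) if visible else 0.0

def reweight_logit(similarity, weight, temperature=0.03):
    """Scale an image-report similarity by the correlation weight."""
    return weight * similarity / temperature

def weighted_loss(loss, weight):
    """Scale a per-sample contrastive loss by the same weight."""
    return weight * loss
```

With all-zero weights and biases, every score equals softplus(0) = ln 2, so each sample receives the same weight. A trained layer would instead assign different weights depending on which patches remain visible.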

Technical Details

MaCo pairs a ViT-B/16 image encoder with a BERT text encoder (width 768). It is pretrained on MIMIC-CXR v2, which comprises 377,110 chest X-rays associated with 227,827 clinical reports, using a learnable temperature initialized at 0.03 and a loss weight of 0.9 to balance the reconstruction and contrastive terms. Pretraining takes roughly 3.5 hours on four NVIDIA A100 GPUs at a batch size of 512.

Across six open-source X-ray datasets, MaCo was reported to outperform 10 state-of-the-art approaches. Representative results:

Zero-shot classification (AUC): 77.3% on NIH ChestX-ray, 88.6% on RSNA, and 90.4% on SIIM.
Phrase grounding: 25.5% mIoU on MS-CXR (CNR 1.144).
Fully supervised segmentation (Dice): 89.4% on SIIM and 75.1% on COVID Rural.
Detection: 19.2% mAP on RSNA.
Fine-tuned classification (AUC): 88.9% on CheXpert and 85.9% on NIH.
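
To make the pretraining objective mentioned in this section more concrete, here is a minimal sketch of how a reconstruction term and a contrastive term might be combined. Several points are assumptions rather than details from the paper: the contrastive term is written as a symmetric InfoNCE loss, the reconstruction term as a mean squared error over masked patches, and the 0.9 weight is applied to the reconstruction term in a convex combination (the text above does not say which term it scales). The temperature is a fixed float here, whereas in the model it is learnable.

```python
import math

def log_softmax(row):
    """Log-softmax of one list of logits, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(sim, temperature=0.03):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of image i and report j; matching
    pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def reconstruction_loss(pred, target, mask):
    """Mean squared error over masked patches only (1 = masked)."""
    total, count = 0.0, 0
    for p_patch, t_patch, m in zip(pred, target, mask):
        if m == 1:
            total += sum((p - t) ** 2 for p, t in zip(p_patch, t_patch))
            count += len(p_patch)
    return total / count if count else 0.0

def total_loss(recon, contrast, weight=0.9):
    """Convex combination; which term gets 0.9 is an assumption."""
    return weight * recon + (1.0 - weight) * contrast
```

A small temperature such as 0.03 sharpens the softmax, so even modest similarity gaps between matched and unmatched pairs produce a confident assignment.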

Applications

MaCo is intended for computer-aided diagnosis and radiology research workflows where annotated data is scarce. Its zero-shot capability lets clinicians and researchers screen for pathologies using natural-language prompts without curating labeled training sets, while its phrase-grounding ability can localize findings described in a report to specific image regions, supporting explainable diagnosis. The shared backbone also serves as a strong initialization for downstream classification, segmentation, and detection pipelines, benefiting groups building chest X-ray analysis tools with limited labeled data.
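
The zero-shot workflow described above reduces to comparing one image embedding against a set of prompt embeddings. The sketch below shows that comparison in isolation. The encoders are not included, the three-dimensional embeddings and prompt names are invented for illustration, and reusing 0.03 as the softmax temperature is an assumption.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_probs(image_emb, prompt_embs, temperature=0.03):
    """Softmax over image-prompt cosine similarities.

    prompt_embs maps a prompt label to its text embedding.
    """
    names = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[n]) / temperature for n in names]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_embedding = [0.9, 0.1, 0.0]
prompts = {
    "pneumothorax": [1.0, 0.0, 0.0],
    "no pneumothorax": [0.0, 1.0, 0.0],
}
probabilities = zero_shot_probs(image_embedding, prompts)
predicted = max(probabilities, key=probabilities.get)
```

In practice the image embedding would come from the pretrained image encoder and each prompt embedding from the text encoder. No labeled training images are involved at any point.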

Impact

By demonstrating that masked image modeling and contrastive report alignment can be reconciled through patch-level correlation weighting, MaCo contributes to the broader effort to build general-purpose medical imaging foundation models. Its publication in Nature Communications and release of code and pretrained weights under a permissive MIT license lower the barrier for reproducible benchmarking and downstream reuse in radiology AI. As with most chest X-ray models, its evaluation centers on MIMIC-CXR-style data, so generalization across imaging hardware, populations, and clinical settings remains an important consideration before deployment.

Citation

Enhancing representation in radiography-reports foundation model: a granular alignment algorithm using masked contrastive learning

Huang, W., et al. (2024) Enhancing representation in radiography-reports foundation model: a granular alignment algorithm using masked contrastive learning. Nature Communications.

DOI: 10.1038/s41467-024-51749-0

Recent citations

Papers that recently cited this model.

An interpretable multimodal transformer for medical report generation via hierarchical semantics and clinical labeling
Jia Sheng Yang, Chenbo Xia, Wei Li, et al.
Engineering applications of artificial intelligence · Jul 2026
Citations: 0

Medical Vision-Language Models: Existing Technologies, Clinical Applications and Future Directions
Le Zou, Mengyu Ma, Jun Li, et al.
Italian National Conference on Sensors · Jun 2026
Citations: 0

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling
Zhemin Zhang, Weijie Chen, David Le, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery
Yu Zhang, Xiusi Chen, Bowen Jin, et al.
Conference on Empirical Methods in Natural Language Processing · Jun 2024
Citations: 126

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad, Reza Azad, Sania Eskandari, et al.
arXiv.org · Oct 2023
Citations: 125

Swin-UMamba†: Adapting Mamba-Based Vision Foundation Models for Medical Image Segmentation
Jiarun Liu, Hao Yang, Hong-Yu Zhou, et al.
IEEE Transactions on Medical Imaging · Nov 2024
Citations: 90

Multimodal Large Language Models in Medical Imaging: Current State and Future Directions
Yoojin Nam, Dong Yeong Kim, Sunggu Kyung, et al.
Korean Journal of Radiology · Aug 2025
Citations: 60

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging
Yingshu Li, Yunyi Liu, Zhanyu Wang, et al.
medRxiv · Oct 2023
Citations: 46

Citations

Total Citations: 76
Influential: 0
References: 61

GitHub

Stars: 12
Forks: 0
Open Issues: 1
Contributors: 1
Last Push: 1y ago
Language: Python
License: MIT

Fields of citing research

Computer Science: 96%
Medicine: 91%
Engineering: 28%
Physics: 1%
Mathematics: 1%
Biology: 1%
Psychology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

74 · Open

Usability (can I run it?): 94

Reproducibility (can I retrain it?): 57

Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository
Research Paper
