A masked contrastive chest X-ray foundation model that aligns radiograph patches with report text for zero-shot and fine-grained diagnosis.
MaCo (Masked Contrastive) is a chest X-ray foundation model that learns transferable image representations by jointly aligning radiographs with their free-text clinical reports. It addresses a persistent tension in medical vision-language pretraining: methods that excel at coarse, global image-report matching (enabling zero-shot diagnosis) often lose the fine-grained, pixel-level understanding needed for grounding, segmentation, and detection, while methods optimized for dense prediction tend to sacrifice zero-shot transfer. MaCo aims to deliver both within a single pretraining recipe.
The model was developed by Weijian Huang, Cheng Li, Shanshan Wang and colleagues across the Shenzhen Institute of Advanced Technology (Chinese Academy of Sciences), Pengcheng Laboratory, Harvard University, and Shanghai AI Laboratory, and was published in Nature Communications in September 2024. Its central contribution is to combine masked image modeling with contrastive image-report alignment, and to introduce a correlation weighting mechanism that adjusts how strongly individual masked image patches are matched to the report.
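The idea of weighting patches before matching them to the report can be illustrated with a small sketch. This is not the authors' implementation: the text above does not say how the correlation weights are computed, so the weights here are set by hand, and the names `weighted_pool` and `cosine` as well as the toy embeddings are hypothetical. Plain Python lists stand in for tensors.

```python
import math

def weighted_pool(patches, weights):
    """Weighted average of patch embeddings (each a list of floats)."""
    total = sum(weights)
    dim = len(patches[0])
    return [
        sum(w * p[d] for p, w in zip(patches, weights)) / total
        for d in range(dim)
    ]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy data: three visible patch embeddings and one report embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
report = [1.0, 0.0]

# Uniform pooling treats every patch as equally relevant to the report.
uniform = weighted_pool(patches, [1.0, 1.0, 1.0])

# ASSUMPTION: hand-set weights standing in for learned correlation weights.
weighted = weighted_pool(patches, [3.0, 0.5, 0.5])

print(cosine(uniform, report), cosine(weighted, report))
```

In this toy case, up-weighting the patch that matches the report raises the pooled image-report similarity, which is the kind of effect a per-patch weighting is meant to control.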
By unifying these objectives, MaCo positions itself alongside contrastive medical vision-language models such as ConVIRT, GLoRIA, BioViL, and MGCA, but distinguishes itself through its granular, patch-level alignment strategy that supports both label-free prediction and dense downstream tasks.
MaCo pairs a ViT-B/16 image encoder with a BERT text encoder (width 768). It is pretrained on MIMIC-CXR v2, comprising 377,110 chest X-rays associated with 227,827 clinical reports, using a learnable temperature initialized at 0.03 and a loss weight of 0.9 balancing the reconstruction and contrastive terms. Pretraining runs in roughly 3.5 hours on four NVIDIA A100 GPUs at a batch size of 512.
Across six open-source X-ray datasets, MaCo was reported to outperform 10 state-of-the-art approaches. Representative results include zero-shot classification AUCs of 77.3% on NIH ChestX-ray, 88.6% on RSNA, and 90.4% on SIIM; phrase grounding on MS-CXR at 25.5% mIoU (CNR 1.144); fully supervised segmentation Dice of 89.4% on SIIM and 75.1% on COVID Rural; detection at 19.2% mAP on RSNA; and fine-tuned classification AUCs of 88.9% on CheXpert and 85.9% on NIH.
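The way a temperature and a loss weight enter such an objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the released training code: it uses a standard symmetric InfoNCE loss, fixes the temperature at its initial value of 0.03 rather than learning it, and assumes the 0.9 weight multiplies the reconstruction term (the text above does not say which term it applies to). The function names and toy embeddings are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(image_emb, text_emb, temperature=0.03):
    """Symmetric contrastive loss over a batch of L2-normalised embeddings.

    Row k of image_emb is the positive pair of row k of text_emb.
    """
    n = len(image_emb)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image-to-text direction: each row should pick out its own column.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def mse(pred, target):
    """Mean squared error, standing in for the masked-patch reconstruction loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(rec, con, weight=0.9):
    # ASSUMPTION: `weight` scales reconstruction and (1 - weight) scales contrastive.
    return weight * rec + (1.0 - weight) * con

# Hypothetical toy batch of two already-normalised image/report embedding pairs.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]

aligned = info_nce(image_emb, text_emb)         # correct pairing
shuffled = info_nce(image_emb, text_emb[::-1])  # deliberately mismatched pairing
rec = mse([0.2, 0.4], [0.0, 0.5])

print(total_loss(rec, aligned), total_loss(rec, shuffled))
```

A small temperature such as 0.03 sharpens the softmax over the batch, so mismatched pairs are penalised heavily while correctly matched pairs contribute almost nothing to the contrastive term.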
MaCo is intended for computer-aided diagnosis and radiology research workflows where annotated data is scarce. Its zero-shot capability lets clinicians and researchers screen for pathologies using natural-language prompts without curating labeled training sets, while its phrase-grounding ability can localize findings described in a report to specific image regions, supporting explainable diagnosis. The shared backbone also serves as a strong initialization for downstream classification, segmentation, and detection pipelines, benefiting groups building chest X-ray analysis tools with limited labeled data.
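Zero-shot screening with natural-language prompts generally works by comparing an image embedding against the embeddings of candidate prompts. The sketch below illustrates that scoring step only; it does not call MaCo's encoders, whose interface is not described here. The function `zero_shot_scores`, the prompt wording, the toy embeddings, and the reuse of the 0.03 temperature at inference are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_scores(image_emb, prompt_embs, temperature=0.03):
    """Softmax over temperature-scaled cosine similarities to each prompt.

    image_emb: list of floats from a (hypothetical) image encoder.
    prompt_embs: dict mapping a prompt string to its text embedding.
    """
    img = normalize(image_emb)
    sims = {
        label: sum(a * b for a, b in zip(img, normalize(emb))) / temperature
        for label, emb in prompt_embs.items()
    }
    m = max(sims.values())
    exps = {label: math.exp(s - m) for label, s in sims.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

# Hypothetical embeddings; in practice these would come from the two encoders.
image_emb = [0.9, 0.1]
prompt_embs = {
    "pneumothorax": [1.0, 0.0],
    "no pneumothorax": [0.0, 1.0],
}

scores = zero_shot_scores(image_emb, prompt_embs)
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Because the labels are ordinary strings, adding a new pathology to the screen only requires adding another prompt, with no labeled training set.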
By demonstrating that masked image modeling and contrastive report alignment can be reconciled through patch-level correlation weighting, MaCo contributes to the broader effort to build general-purpose medical imaging foundation models. Its publication in Nature Communications and release of code and pretrained weights under a permissive MIT license lower the barrier for reproducible benchmarking and downstream reuse in radiology AI. As with most chest X-ray models, its evaluation centers on MIMIC-CXR-style data, so generalization across imaging hardware, populations, and clinical settings remains an important consideration before deployment.
Huang, W., et al. (2024) Enhancing representation in radiography-reports foundation model: a granular alignment algorithm using masked contrastive learning. Nature Communications.
DOI: 10.1038/s41467-024-51749-0