Boston University / University of Pittsburgh
Vision-language foundation model pre-trained on screening mammogram-report pairs to improve data efficiency and robustness in breast cancer detection.
Mammo-CLIP is a vision-language foundation model for mammography, introduced by researchers at Boston University and the University of Pittsburgh and presented at MICCAI 2024 (early accept, top 11%). It is the first CLIP-style model pre-trained on a substantial corpus of screening mammogram-report pairs, adapting the contrastive image-text pretraining paradigm to the breast-imaging domain. The model targets a persistent bottleneck in mammography AI: high-quality labeled mammograms are scarce, expensive to annotate, and unevenly distributed across institutions, which limits the diversity and size of training sets for breast cancer detection systems.
By learning a joint representation of mammographic images and their accompanying radiology reports, Mammo-CLIP produces image embeddings that transfer efficiently to downstream tasks with far fewer labeled examples than conventional supervised pipelines require. This data efficiency, together with improved robustness across datasets, is the model's central contribution. It situates mammography alongside the broader move in medical imaging toward report-supervised foundation models, where free-text clinical narratives serve as a rich, naturally occurring source of weak supervision.
Alongside the model, the authors propose Mammo-FActOR, a feature attribution method that maps learned representations back to individual sentences in the radiology report, providing spatially grounded, sentence-level interpretation of what the model has learned.
Mammo-CLIP follows the CLIP contrastive image-text architecture. The image encoder is an EfficientNet convolutional backbone (released in B5 and B2 variants), and the text encoder is BioClinicalBERT, a domain-adapted BERT pre-trained on clinical notes. The two encoders are aligned through a contrastive objective over mammogram-report pairs drawn from an in-house screening dataset; the authors additionally describe an image-label pretraining variant using the public VinDr-Mammo dataset. The model is evaluated on two public mammography datasets across tasks including classification and localization of mammographic attributes (such as masses and calcifications) relevant to breast cancer detection, where it improves over supervised baselines particularly in low-label regimes. The in-house image-text corpus is not publicly released, but the codebase documents the data format so practitioners can pretrain on their own paired data. Released checkpoints and downstream task models are distributed under a CC BY-NC-SA 4.0 license for non-commercial research and education.
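As a rough illustration of the contrastive objective described above, the following is a minimal sketch of a generic CLIP-style symmetric loss over a batch of paired image and report embeddings. It is not the authors' training code: the function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration, and the actual Mammo-CLIP objective may include additional terms. Plain Python is used so the sketch runs without any deep learning framework.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched pairs.

    image_emb[i] and text_emb[i] are assumed to come from the same
    mammogram-report pair; every other combination in the batch is
    treated as a negative. The temperature is an illustrative constant.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and report j, scaled.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: for each image, the matching report
    # (the diagonal entry) should receive the highest probability.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy stand-ins for encoder outputs (hypothetical values).
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_reports_matched = [[1.0, 0.0], [0.0, 1.0]]
toy_reports_swapped = [[0.0, 1.0], [1.0, 0.0]]

print("matched pairs:", clip_contrastive_loss(toy_images, toy_reports_matched))
print("swapped pairs:", clip_contrastive_loss(toy_images, toy_reports_swapped))
```

In this toy setup the loss is close to zero when each image embedding lines up with its own report and large when the pairs are swapped, which is the signal that pulls the two encoders into a shared space during pretraining.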
Mammo-CLIP is aimed at researchers and clinical AI developers building breast cancer screening and triage tools. Its data-efficient embeddings make it a practical starting point for fine-tuning detectors and localizers when annotated mammograms are limited, and its cross-dataset robustness is valuable for groups deploying across institutions with differing imaging equipment and populations. The Mammo-FActOR attribution method supports model auditing and clinician-facing explanation by tying predictions to specific report findings. Because pretraining can be reproduced on local image-report pairs, hospitals can adapt the approach to their own data without sharing protected images.
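One common way to use pretrained embeddings when labels are scarce is a linear probe: the image encoder stays frozen and only a small linear classifier is trained on its outputs. The sketch below shows that idea with a logistic-regression probe in plain Python. The embeddings, labels, learning rate, and function names are all hypothetical stand-ins; in practice the vectors would come from a released image encoder, and this is not the authors' evaluation code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, embedding):
    """Probability of the positive class for one frozen embedding."""
    return sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)


def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit logistic regression on fixed embeddings with batch gradient descent.

    Only the probe's weights and bias are updated; the embeddings are
    treated as constants, mirroring a frozen encoder.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            err = predict(weights, bias, x) - y  # gradient of log loss wrt logit
            for k in range(dim):
                grad_w[k] += err * x[k] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Four labeled examples standing in for a small annotated set
# (hypothetical 2-D embeddings; real ones have many more dimensions).
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = [1, 1, 0, 0]

probe_w, probe_b = train_linear_probe(toy_embeddings, toy_labels)
print("held-out positive-like:", predict(probe_w, probe_b, [0.95, 0.15]))
print("held-out negative-like:", predict(probe_w, probe_b, [0.15, 0.95]))
```

Because only a handful of probe parameters are learned, this setup can work with far fewer annotated studies than training a full detector from scratch; full fine-tuning of the encoder is the heavier alternative when more labels are available.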
As one of the first vision-language foundation models tailored to mammography, Mammo-CLIP demonstrated that report-supervised contrastive pretraining, already established in chest radiography, extends effectively to breast imaging and meaningfully reduces label requirements. Its MICCAI 2024 acceptance, together with checkpoints and code made publicly available under a non-commercial CC BY-NC-SA 4.0 license, has made it a reference point and reusable backbone for subsequent mammography AI research. Key limitations include the non-commercial license, reliance on a non-public in-house pretraining corpus, and evaluation focused on screening mammography rather than the full breadth of breast-imaging modalities.
Ghosh, S., et al. (2024) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024.
DOI: 10.1007/978-3-031-72390-2_59