Mammo-CLIP

Boston University / University of Pittsburgh

Vision-language foundation model pre-trained on screening mammogram-report pairs to improve data efficiency and robustness in breast cancer detection.

Released: May 2024

Mammo-CLIP is a vision-language foundation model for mammography, introduced by researchers at Boston University and the University of Pittsburgh and presented at MICCAI 2024 (early accept, top 11%). The authors describe it as the first CLIP-style model pre-trained on a substantial corpus of screening mammogram-report pairs, adapting the contrastive image-text pretraining paradigm to the breast-imaging domain. The model targets a persistent bottleneck in mammography AI: high-quality labeled mammograms are scarce, expensive to annotate, and unevenly distributed across institutions, which limits the diversity and size of training sets for breast cancer detection systems.

By learning a joint representation of mammographic images and their accompanying radiology reports, Mammo-CLIP produces image embeddings that transfer efficiently to downstream tasks with far fewer labeled examples than conventional supervised pipelines require. This data efficiency, together with improved robustness across datasets, is the model's central contribution. It situates mammography alongside the broader move in medical imaging toward report-supervised foundation models, where free-text clinical narratives serve as a rich, naturally occurring source of weak supervision.

Alongside the model, the authors propose Mammo-FActOR, a feature attribution method that maps learned representations back to individual sentences in the radiology report, providing spatially grounded, sentence-level interpretation of what the model has learned.
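The idea of tying spatial image features to report sentences can be illustrated with a generic similarity map. The sketch below is not the Mammo-FActOR algorithm, whose details are given in the paper; it only assumes that each spatial location of a feature map and each report sentence have already been embedded in a shared space. The names `cosine` and `sentence_similarity_map`, and the toy vectors, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_similarity_map(feature_grid, sentence_emb):
    """Score every spatial location against one sentence embedding.

    feature_grid is a 2D grid (rows x cols) of feature vectors and
    sentence_emb is one vector in the same (assumed shared) space.
    """
    return [[cosine(cell, sentence_emb) for cell in row] for row in feature_grid]

# Toy 2x2 feature grid with 2-dimensional features (illustrative values only).
feature_grid = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.0, 0.0]],
]
sentence_emb = [1.0, 0.0]  # hypothetical embedding of a sentence such as "There is a mass."
heat = sentence_similarity_map(feature_grid, sentence_emb)
```

Locations whose features point in the same direction as the sentence embedding score close to 1, which is the kind of signal a sentence-level attribution method can turn into a spatial explanation.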

Key Features

Report-supervised pretraining: Learns from paired screening mammograms and the FINDINGS/IMPRESSION sections of radiology reports, using clinical text as weak supervision instead of relying solely on manual image labels.
Data efficiency: Transfers to downstream classification and localization tasks with substantially fewer labeled examples, addressing the chronic scarcity of annotated mammograms.
Cross-dataset robustness: Demonstrates strong, stable performance on public benchmarks, mitigating the diversity and size limitations of single-institution datasets.
Mammo-FActOR interpretability: A novel feature attribution method that links representations to sentence-level report content, supporting spatial and clinical interpretation of predictions.
Publicly available weights and code: Pre-trained EfficientNet-B5 and EfficientNet-B2 checkpoints and the full PyTorch implementation are publicly available under a non-commercial CC BY-NC-SA 4.0 license.
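As a rough illustration of how the FINDINGS and IMPRESSION sections mentioned above might be pulled out of a free-text report before pairing with an image, here is a minimal sketch. The header format (`FINDINGS:` and `IMPRESSION:` at the start of a line), the function name `extract_sections`, and the example report are assumptions for illustration; real reports vary by institution, and this is not the preprocessing code from the Mammo-CLIP repository.

```python
import re

# Assumed header convention: the section name at the start of a line, then a colon.
SECTION_PATTERN = re.compile(
    r"^(FINDINGS|IMPRESSION)\s*:\s*", re.IGNORECASE | re.MULTILINE
)

def extract_sections(report):
    """Return a dict mapping each found section name to its whitespace-normalized text.

    Limitation: text runs until the next FINDINGS/IMPRESSION header or the end
    of the report, so any other trailing section would be absorbed.
    """
    sections = {}
    matches = list(SECTION_PATTERN.finditer(report))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        sections[match.group(1).upper()] = " ".join(body.split())
    return sections

# Synthetic example report (not real patient data).
example_report = """HISTORY: Screening.
FINDINGS: Scattered fibroglandular densities. No suspicious mass.
IMPRESSION: Negative. BI-RADS 1.
"""
sections = extract_sections(example_report)
```

The extracted text can then be split into sentences and tokenized for the text encoder.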

Technical Details

Mammo-CLIP follows the CLIP contrastive image-text architecture. The image encoder is an EfficientNet convolutional backbone (released in B5 and B2 variants), and the text encoder is BioClinicalBERT, a domain-adapted BERT pre-trained on clinical notes. The two encoders are aligned through a contrastive objective over mammogram-report pairs drawn from an in-house screening dataset; the authors additionally describe an image-label pretraining variant using the public VinDr-Mammo dataset. The model is evaluated on two public mammography datasets across tasks including classification and localization of mammographic attributes (such as masses and calcifications) relevant to breast cancer detection, where it improves over supervised baselines particularly in low-label regimes. The in-house image-text corpus is not publicly released, but the codebase documents the data format so practitioners can pretrain on their own paired data. Released checkpoints and downstream task models are distributed under a CC BY-NC-SA 4.0 license for non-commercial research and education.
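The contrastive alignment described above can be sketched with the standard symmetric CLIP-style loss. This is a dependency-free illustration of the general technique rather than the released PyTorch implementation; the function names and the `temperature=0.07` default are assumptions, and the exact loss variant used by the authors should be checked against the paper and code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings.

    Pair i of image_embs and text_embs is treated as the positive pair;
    every other combination in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Rows index images, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # rows index texts
    image_to_text = cross_entropy_diag(logits)
    text_to_image = cross_entropy_diag(logits_t)
    return 0.5 * (image_to_text + text_to_image)

# Toy batch: matched pairs give a much lower loss than swapped pairs.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a real setup the image embeddings would come from the EfficientNet encoder and the text embeddings from BioClinicalBERT, each projected to a common dimension before the loss is computed.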

Applications

Mammo-CLIP is aimed at researchers and clinical AI developers building breast cancer screening and triage tools. Its data-efficient embeddings make it a practical starting point for fine-tuning detectors and localizers when annotated mammograms are limited, and its cross-dataset robustness is valuable for groups deploying across institutions with differing imaging equipment and populations. The Mammo-FActOR attribution method supports model auditing and clinician-facing explanation by tying predictions to specific report findings. Because pretraining can be reproduced on local image-report pairs, hospitals can adapt the approach to their own data without sharing protected images.
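One common way to use pretrained embeddings when labels are scarce is a linear probe: the encoder is frozen and only a small classifier is trained on its outputs. The sketch below assumes embeddings have already been extracted and uses plain gradient descent on a logistic-regression head; the function names, hyperparameters, and toy data are hypothetical, and this is not the fine-tuning code shipped with Mammo-CLIP.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of the positive class for one embedding."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on frozen embeddings with batch gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            err = predict(w, b, x) - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy "embeddings" standing in for frozen image-encoder outputs (illustrative only).
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
toy_labels = [1, 1, 0, 0]  # e.g. attribute present / absent
w, b = train_linear_probe(toy_embeddings, toy_labels)
```

Because only the small head is trained, a probe like this can be fit on a handful of labeled studies, which is the regime where pretrained representations are expected to help most.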

Impact

As one of the first vision-language foundation models tailored to mammography, Mammo-CLIP demonstrated that report-supervised contrastive pretraining, already established in chest radiography, extends effectively to breast imaging and meaningfully reduces label requirements. Its MICCAI 2024 acceptance, together with checkpoints and code released under a non-commercial CC BY-NC-SA 4.0 license, has made it a reference point and reusable backbone for subsequent mammography AI research. Key limitations include the non-commercial license, reliance on a non-public in-house pretraining corpus, and evaluation focused on screening mammography rather than the full breadth of breast-imaging modalities.

Citation

Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

Ghosh, S., et al. (2024) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024.

DOI: 10.1007/978-3-031-72390-2_59

Citations

Total Citations: 0

Influential: 0

References: 63

GitHub

Stars: 97

Forks: 34

Open Issues: 0

Contributors: 3

Last Push: 3mo ago

Language: Python

HuggingFace

Downloads: 45

Likes: 3

Last Modified: 8mo ago

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 27 (Closed)

Usability (can I run it?): 22

Reproducibility (can I retrain it?): 16

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
