PMC-CLIP

Biomedical vision-language model trained contrastively on 1.6M figure-caption pairs mined from PubMed Central open-access articles.

Released: March 2023

PMC-CLIP is a biomedical vision-language foundation model that adapts the contrastive language-image pre-training (CLIP) recipe to the medical domain by learning from figures and captions extracted from the scientific literature. A central obstacle to building general-purpose medical image-text models has been the scarcity of large, openly licensed paired data: biomedical images are spread across modalities (radiology, histopathology, microscopy, gross pathology), and most curated sets are either small or locked behind clinical-privacy barriers. PMC-CLIP addresses this by mining PubMed Central's Open Access subset to build a corpus roughly eight times larger than prior biomedical image-text collections.

The model and its companion dataset, PMC-OA, were introduced by Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie of Shanghai Jiao Tong University, and published at MICCAI 2023 (with the preprint released in March 2023). PMC-OA contains roughly 1.6 million image-caption pairs, about eight times larger than previous biomedical image-text datasets, and is constructed with finer-grained alignment that splits compound figures into subfigures matched to their corresponding subcaptions.

Pairing this dataset with a CLIP-style dual-encoder objective gives PMC-CLIP reusable image and text representations that transfer to a range of downstream medical imaging tasks with little or no task-specific data collection. This places it alongside contemporaries such as BiomedCLIP and PubMedCLIP in the open biomedical vision-language landscape.

Key Features

Literature-scale paired corpus: The PMC-OA dataset of ~1.6M image-caption pairs is harvested from PubMed Central Open Access, roughly 8x larger than earlier biomedical image-text sets and spanning many imaging modalities.
Subfigure-level alignment: Compound figures are decomposed into subfigures and matched to subcaptions, yielding finer-grained, less noisy supervision than whole-figure pairing.
Dual-encoder with MLM auxiliary: A ResNet-50 image encoder and a transformer text encoder are trained with contrastive alignment plus a masked language modeling objective to strengthen the language side.
Open weights and data: Pre-trained checkpoints and the PMC-OA dataset are released under an MIT license via GitHub and HuggingFace for reuse.
Transferable representations: The learned embeddings support retrieval, zero-shot and fine-tuned classification, and serve as a backbone for medical visual question answering.
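
The subfigure-level alignment feature can be illustrated with a toy example. The sketch below is not the PMC-OA construction pipeline, which this page does not describe in detail. It assumes that subfigures have already been cropped and tagged with panel letters, and that captions mark panels as "(A)", "(B)", and so on. The function names split_subcaptions and pair_subfigures and the file names are hypothetical.

```python
import re

# Assumption: panels are introduced by a single letter in parentheses, e.g. "(A)".
PANEL_MARKER = re.compile(r"\(([A-Za-z])\)\s*")


def split_subcaptions(caption):
    """Split a compound caption into a shared preamble and per-panel subcaptions."""
    # With one capture group, re.split returns
    # [preamble, label1, text1, label2, text2, ...].
    parts = PANEL_MARKER.split(caption)
    preamble = parts[0].strip()
    subcaptions = {}
    for label, text in zip(parts[1::2], parts[2::2]):
        subcaptions[label.upper()] = text.strip()
    return preamble, subcaptions


def pair_subfigures(subfigure_ids, caption):
    """Match each cropped subfigure to the subcaption that carries its panel letter.

    subfigure_ids maps a panel letter to an image identifier. Panels with no
    matching subcaption are skipped rather than paired with the whole caption.
    """
    _, subcaptions = split_subcaptions(caption)
    pairs = []
    for label, image_id in subfigure_ids.items():
        text = subcaptions.get(label.upper())
        if text:
            pairs.append((image_id, text))
    return pairs


if __name__ == "__main__":
    caption = "Imaging findings. (A) Axial CT showing a lesion. (B) H&E stain at 20x."
    subfigures = {"A": "fig1_a.jpg", "B": "fig1_b.jpg", "C": "fig1_c.jpg"}
    for image_id, text in pair_subfigures(subfigures, caption):
        print(image_id, "->", text)
```

Skipping unmatched panels trades coverage for cleaner supervision, which is the motivation given above for subfigure-level pairing over whole-figure pairing.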

Technical Details

PMC-CLIP uses a dual-encoder contrastive architecture. The image encoder is a ResNet-50 (the RN50_fusion4 variant in the released code), and the text encoder is a transformer built on HuggingFace transformers. A fusion module and an auxiliary masked language modeling (MLM) loss complement the standard image-text contrastive objective. Training is performed on the ~1.6M PMC-OA pairs. Evaluated as a pre-trained backbone, PMC-CLIP reported state-of-the-art results on several benchmarks at its release, including gains of +8.1% R@10 on image-text retrieval (ROCO) and +3.9% accuracy on image classification (MedMNIST), along with improvements on medical VQA. The official repository releases pre-trained checkpoints (the PMC_CLIP beta weights, plus extracted image and text encoders), and the PMC-OA dataset is available on HuggingFace.
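
The combination of a contrastive objective with an auxiliary MLM loss can be sketched numerically. The NumPy code below is a minimal illustration of a generic CLIP-style symmetric loss plus a masked-token cross-entropy. It is not taken from the released PMC-CLIP code. The temperature of 0.07, the mlm_weight parameter, and all function names are assumptions made for the example.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of each matrix is assumed to be a matched pair."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def mlm_loss(token_logits, target_ids, masked):
    """Cross-entropy over masked token positions only.

    token_logits: (num_tokens, vocab_size) scores from the text encoder
    target_ids:   (num_tokens,) original token ids
    masked:       (num_tokens,) boolean, True where the input token was masked
    """
    log_probs = log_softmax(token_logits, axis=1)
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked[masked].mean()


def total_loss(image_emb, text_emb, token_logits, target_ids, masked, mlm_weight=1.0):
    """Contrastive loss plus a weighted MLM term (the weight is a hypothetical knob)."""
    return contrastive_loss(image_emb, text_emb) + mlm_weight * mlm_loss(
        token_logits, target_ids, masked
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 16))        # stand-in image embeddings
    texts = rng.normal(size=(4, 16))         # stand-in caption embeddings
    token_logits = rng.normal(size=(6, 50))  # 6 tokens, toy vocabulary of 50
    targets = rng.integers(0, 50, size=6)
    masked = np.array([True, False, False, True, False, True])
    print(total_loss(images, texts, token_logits, targets, masked))
```

In this formulation the contrastive term pulls matched figure-caption pairs together across the batch, while the MLM term is computed on the text side alone, which is one way to read the description of MLM as strengthening the language encoder.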

Applications

PMC-CLIP serves researchers building medical image-text systems without the resources to assemble large paired datasets. Its embeddings support cross-modal retrieval (finding relevant figures from text queries and vice versa), zero-shot and fine-tuned classification across imaging modalities, and act as a pre-trained backbone for medical visual question answering and report-related tasks. Because both weights and data are openly available, it is a practical starting point for transfer learning in radiology, histopathology, and general biomedical imaging pipelines, and a baseline for subsequent biomedical vision-language research.
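
Once image and text embeddings are available, the retrieval and zero-shot uses described above reduce to cosine-similarity ranking. The sketch below assumes the embeddings have already been computed by some encoder pair and uses small hand-made vectors as stand-ins. It does not call the PMC-CLIP checkpoints, and the function names and class labels are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(image_emb, class_text_emb, class_names):
    """Assign each image the class whose text-prompt embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return [class_names[i] for i in sims.argmax(axis=1)]


def retrieve(query_emb, gallery_emb, k):
    """Return the indices of the k most similar gallery items for each query."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    return np.argsort(-sims, axis=1)[:, :k]


def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose true match (same row index) is in the top k."""
    topk = retrieve(query_emb, gallery_emb, k)
    truth = np.arange(len(query_emb))[:, None]
    return float((topk == truth).any(axis=1).mean())


if __name__ == "__main__":
    class_names = ["chest X-ray", "histopathology", "microscopy", "CT"]
    # Stand-in embeddings: one axis per class for the text prompts, and
    # slightly perturbed copies for the images.
    text_emb = np.eye(4, 8)
    image_emb = text_emb + 0.1

    print(zero_shot_classify(image_emb, text_emb, class_names))
    print(retrieve(text_emb[:1], image_emb, k=2))  # text-to-image retrieval
    print(recall_at_k(image_emb, text_emb, k=1))   # image-to-text R@1
```

The recall_at_k helper follows the usual definition behind metrics such as R@10: a query counts as a hit if its paired item appears anywhere in the top k results.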

Impact

By demonstrating that PubMed Central figures and captions can be mined at scale into a high-quality training signal, PMC-CLIP helped establish literature-derived image-text corpora as a viable foundation for biomedical multimodal learning. The PMC-OA dataset and released checkpoints have been widely reused as benchmarks and backbones, and the model is frequently cited alongside BiomedCLIP and PubMedCLIP as a reference point for open medical CLIP systems. Its main limitations stem from the literature source: figures skew toward publication-worthy or illustrative cases and captions are written for expert readers, so coverage and label distribution differ from routine clinical data, and the ResNet-50 backbone is modest by current standards.

Citation

PMC-CLIP: Contrastive Language-Image Pre-training Using Biomedical Documents

Lin, W., et al. (2023) PMC-CLIP: Contrastive Language-Image Pre-training Using Biomedical Documents. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023.

DOI: 10.1007/978-3-031-43993-3_51

Citations

Total Citations: 0

Influential: 0

References: 60

GitHub

Stars: 241

Forks: 18

Open Issues: 2

Contributors: 2

Last Push: 1y ago

Language: Python

License: MIT

Openness

bio.rodeo openness: Fully open · usable and reproducible

63 (Partial)

Usability (can I run it?): 71

Reproducibility (can I retrain it?): 60

Model Openness Framework: Unclassified

Restrictive license on core components

