A biomedical vision-language model trained with contrastive learning on 1.6M image-caption pairs (PMC-OA) mined from PubMed Central open-access articles.
PMC-CLIP is a biomedical vision-language foundation model that adapts the contrastive language-image pre-training (CLIP) recipe to the medical domain by learning from figures and captions extracted from the scientific literature. A central obstacle to building general-purpose medical image-text models has been the scarcity of large, openly licensed paired data: biomedical images are spread across modalities (radiology, histopathology, microscopy, gross pathology), and most curated sets are small or sit behind clinical-privacy barriers. PMC-CLIP addresses this by mining PubMed Central's Open Access subset to build a corpus roughly eight times larger than prior biomedical image-text collections.
The model and its companion dataset, PMC-OA, were introduced by Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie of Shanghai Jiao Tong University, and published at MICCAI 2023 (with the preprint released in March 2023). PMC-OA contains roughly 1.6 million image-caption pairs, about eight times larger than previous biomedical image-text datasets, and is constructed with finer-grained alignment that splits compound figures into subfigures matched to their corresponding subcaptions.
By pairing this dataset with a CLIP-style dual-encoder objective, PMC-CLIP provides reusable image and text representations that transfer to a range of downstream medical imaging tasks without task-specific data collection, positioning it alongside contemporaries such as BiomedCLIP and PubMedCLIP in the open biomedical vision-language landscape.
PMC-CLIP uses a dual-encoder contrastive architecture: a ResNet-50 image encoder (the RN50_fusion4 variant in the released code) and a transformer-based text encoder built on HuggingFace transformers, with a fusion module and an auxiliary masked language modeling loss complementing the standard image-text contrastive objective. Training is performed on the ~1.6M PMC-OA pairs. Evaluated as a pre-trained backbone, PMC-CLIP reports state-of-the-art results across several benchmarks at its release, including +8.1% R@10 on image-text retrieval (ROCO), +3.9% accuracy on image classification (MedMNIST), and improvements on Medical VQA. The official repository releases pre-trained checkpoints (the PMC_CLIP beta weights, plus extracted image and text encoders) and the PMC-OA dataset on HuggingFace.
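The image-text contrastive objective mentioned above can be illustrated with a small, dependency-free sketch. This is not the released training code: it omits the fusion module and the auxiliary masked language modeling loss, the function names are hypothetical, and the temperature of 0.07 is an assumed value borrowed from the original CLIP setup rather than a setting confirmed for PMC-CLIP. It only shows the symmetric cross-entropy over a batch similarity matrix, where the i-th image and the i-th caption form the positive pair.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarities between every image and every caption, divided by
    the temperature. The value 0.07 is an assumption, not a PMC-CLIP setting."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: average of image-to-text and text-to-image terms."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))
```

With perfectly aligned toy embeddings the loss is close to zero, and it grows when captions are paired with the wrong images, which is the signal the encoders are trained to minimize.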
PMC-CLIP serves researchers building medical image-text systems without the resources to assemble large paired datasets. Its embeddings support cross-modal retrieval (finding relevant figures from text queries and vice versa), zero-shot and fine-tuned classification across imaging modalities, and act as a pre-trained backbone for medical visual question answering and report-related tasks. Because both weights and data are openly available, it is a practical starting point for transfer learning in radiology, histopathology, and general biomedical imaging pipelines, and a baseline for subsequent biomedical vision-language research.
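As a rough illustration of how such embeddings are typically used for cross-modal retrieval, the sketch below ranks candidates by cosine similarity and computes Recall@K, the metric family behind the R@10 figure reported on ROCO. The embeddings here are toy vectors standing in for encoder outputs, and the helper names are hypothetical; nothing in this block calls the PMC-CLIP repository or reflects its evaluation scripts.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def recall_at_k(query_embs, candidate_embs, k):
    """Fraction of queries whose matching candidate (same index) is in the top k."""
    hits = sum(
        1 for i, q in enumerate(query_embs)
        if i in rank_candidates(q, candidate_embs)[:k]
    )
    return hits / len(query_embs)

# Toy stand-ins for caption and figure embeddings (hypothetical values).
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
text_to_image_r1 = recall_at_k(text_embs, image_embs, k=1)
```

Zero-shot classification follows the same pattern under these assumptions: embed one text prompt per class, call `rank_candidates` with the image embedding as the query, and take the first index as the predicted class.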
By demonstrating that PubMed Central figures and captions can be mined at scale into a high-quality training signal, PMC-CLIP helped establish literature-derived image-text corpora as a viable foundation for biomedical multimodal learning. The PMC-OA dataset and released checkpoints have been widely reused as benchmarks and backbones, and the model is frequently cited alongside BiomedCLIP and PubMedCLIP as a reference point for open medical CLIP systems. Its main limitations stem from the literature source: figures skew toward publication-worthy or illustrative cases and captions are written for expert readers, so coverage and label distribution differ from routine clinical data, and the ResNet-50 backbone is modest by current standards.
Lin, W., et al. (2023) PMC-CLIP: Contrastive Language-Image Pre-training Using Biomedical Documents. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023.
DOI: 10.1007/978-3-031-43993-3_51