A CLIP model fine-tuned on ROCO medical image-caption pairs to provide a medical-domain visual encoder for tasks such as medical visual question answering.
PubMedCLIP is a medical-domain adaptation of OpenAI's CLIP (Contrastive Language-Image Pre-training), fine-tuned on radiology image-caption pairs to produce visual and text encoders better suited to medical imagery than general-purpose CLIP. It was developed by Sedigheh Eslami, Christoph Meinel, and Gerard de Melo at the Hasso Plattner Institute, first released as a preprint in December 2021 and published in the Findings of EACL 2023.
The work was motivated by a concrete question: does the strong transferability CLIP shows on natural images carry over to medicine, where images (X-rays, CT, MRI, histopathology) and the accompanying clinical language differ sharply from web photographs and captions? To test this, the authors continued CLIP's contrastive pre-training on the Radiology Objects in COntext (ROCO) dataset and then plugged the resulting encoder into established medical visual question answering (MedVQA) pipelines as a drop-in visual feature extractor.
PubMedCLIP is best understood as an early, focused entry in the family of biomedical CLIP variants. It is narrower in scope than later, larger models such as BiomedCLIP and PMC-CLIP, which were trained on millions of figure-caption pairs mined across the full PubMed Central literature; PubMedCLIP instead fine-tunes on the more curated, radiology-centric ROCO corpus and is evaluated primarily as a component within MedVQA systems rather than as a general foundation encoder.
PubMedCLIP keeps CLIP's dual-encoder design—a vision encoder (ResNet-50, ResNet-50x4, or ViT-B/32) paired with a Transformer text encoder—and fine-tunes it with the symmetric image-text contrastive loss on the ROCO dataset, which contains roughly 80,000 radiology image-caption pairs sourced from open-access PubMed Central articles. The fine-tuned visual encoder is then frozen and used to supply image features to two MedVQA systems, MEVF and QCR, replacing or complementing their MAML-based visual networks. Evaluated on the VQA-RAD and SLAKE MedVQA benchmarks, PubMedCLIP improves overall answer accuracy by up to roughly 3% over the baseline visual encoders, with the largest gains on the ViT-B/32 backbone. The study also surfaces distributional differences between the two benchmarks and highlights how MedVQA behaves differently from general-domain VQA.
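The symmetric image-text contrastive loss mentioned above can be illustrated with a minimal sketch in plain Python. This is not the authors' training code: the function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions chosen for illustration (CLIP itself learns the temperature during training). The sketch treats row i of the image batch and row i of the text batch as the matching pair and averages the image-to-text and text-to-image cross-entropy terms.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style symmetric loss; pair i of images matches pair i of texts.

    The temperature value is an illustrative assumption, not a setting
    taken from the PubMedCLIP paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should pick out its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same idea applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch of two pairs: correctly aligned versus swapped captions.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the aligned pairs produce a much smaller loss than the swapped ones, which is the signal that fine-tuning uses to pull matching radiology images and captions together in the shared embedding space.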
PubMedCLIP is primarily used as a pre-trained visual (and text) encoder for medical imaging tasks where paired image-text supervision is scarce. Its most direct application is medical visual question answering, but the encoder is also useful for image-text retrieval, zero-shot or few-shot classification of radiology images, and as an initialization for downstream medical vision models. Researchers building MedVQA systems or benchmarking biomedical vision-language encoders benefit most, and the released backbones make it straightforward to slot PubMedCLIP into existing pipelines.
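Zero-shot classification with a CLIP-style encoder reduces to comparing one image embedding against the embeddings of candidate text prompts. The sketch below assumes the embeddings have already been computed by the image and text encoders; the three-dimensional vectors, the label strings, and the helper names are hypothetical placeholders, not outputs of the released PubMedCLIP weights or part of any library API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image embedding.

    prompt_embs maps a label to the (precomputed) embedding of its text prompt.
    """
    scores = {
        label: cosine_similarity(image_emb, emb)
        for label, emb in prompt_embs.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings standing in for real encoder outputs.
prompt_embs = {
    "chest X-ray": [0.9, 0.1, 0.0],
    "brain MRI": [0.1, 0.9, 0.2],
}
image_emb = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

The same similarity scores also serve image-text retrieval: ranking captions by their score against a query image, or images against a query caption, instead of taking only the top label.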
As one of the first demonstrations that CLIP-style contrastive pre-training can be specialized for radiology and improve MedVQA, PubMedCLIP helped establish the template that later, larger biomedical CLIP models, including PMC-CLIP and BiomedCLIP, would scale up. It remains a widely cited reference and a common baseline encoder in medical vision-language research, and its openly released weights continue to be reused, including through community mirrors on Hugging Face. Its main limitations are its comparatively small, radiology-focused training corpus and an evaluation centered on two MedVQA benchmarks, both of which limit how broadly its representations generalize relative to later foundation-scale models.
Eslami, S., Meinel, C., & de Melo, G. (2023). PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? Findings of the Association for Computational Linguistics: EACL 2023.
DOI: 10.18653/v1/2023.findings-eacl.88