PubMedCLIP

Medical-domain CLIP fine-tuned on radiology image-caption pairs from ROCO, serving as a drop-in visual encoder for medical visual question answering.

Released: December 2021

PubMedCLIP is a medical-domain adaptation of OpenAI's CLIP (Contrastive Language-Image Pre-training), fine-tuned on radiology image-caption pairs to produce visual and text encoders better suited to medical imagery than general-purpose CLIP. It was developed by Sedigheh Eslami, Christoph Meinel, and Gerard de Melo at the Hasso Plattner Institute, first released as a preprint in December 2021 and published in the Findings of EACL 2023.

The work was motivated by a concrete question: does the strong transferability CLIP shows on natural images carry over to medicine, where images (X-rays, CT, MRI, histopathology) and the accompanying clinical language differ sharply from web photographs and captions? To test this, the authors continued CLIP's contrastive pre-training on the Radiology Objects in COntext (ROCO) dataset and then plugged the resulting encoder into established medical visual question answering (MedVQA) pipelines as a drop-in visual feature extractor.

PubMedCLIP is best understood as an early, focused entry in the family of biomedical CLIP variants. It is narrower in scope than later, larger models such as BiomedCLIP and PMC-CLIP, which were trained on millions of figure-caption pairs mined across the full PubMed Central literature; PubMedCLIP instead fine-tunes on the more curated, radiology-centric ROCO corpus and is evaluated primarily as a component within MedVQA systems rather than as a general foundation encoder.

Key Features

Medical-domain fine-tuning: Continues CLIP's image-text contrastive objective on ROCO radiology image-caption pairs, shifting the encoder toward clinical visual concepts and terminology.
Drop-in visual encoder: Designed to replace or augment the visual feature extractor in existing MedVQA architectures with minimal changes to the downstream pipeline.
Multiple backbones released: Checkpoints are provided for ResNet-50, ResNet-50x4, and ViT-B/32 image encoders, letting users trade off accuracy and compute.
Open and reproducible: Code and pre-trained weights are released under the MIT License, with integration scripts for the MEVF and QCR MedVQA frameworks.

Technical Details

PubMedCLIP keeps CLIP's dual-encoder design—a vision encoder (ResNet-50, ResNet-50x4, or ViT-B/32) paired with a Transformer text encoder—and fine-tunes it with the symmetric image-text contrastive loss on the ROCO dataset, which contains roughly 80,000 radiology image-caption pairs sourced from open-access PubMed Central articles. The fine-tuned visual encoder is then frozen and used to supply image features to two MedVQA systems, MEVF and QCR, replacing or complementing their MAML-based visual networks. Evaluated on the VQA-RAD and SLAKE MedVQA benchmarks, PubMedCLIP improves overall answer accuracy by up to roughly 3% over the baseline visual encoders, with the largest gains on the ViT-B/32 backbone. The study also surfaces distributional differences between the two benchmarks and highlights how MedVQA behaves differently from general-domain VQA.
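The symmetric image-text contrastive loss mentioned above can be sketched in a few lines. This is a minimal, dependency-free illustration of the general CLIP-style objective, not the authors' training code: the function names are hypothetical, the embeddings are plain Python lists, and the fixed temperature of 0.07 is an assumption (CLIP learns this value during training).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    Row i of image_embs and row i of text_embs are assumed to be a matched
    image-caption pair; every other pairing in the batch acts as a negative.
    The fixed temperature is an assumption made for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity of image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each image should pick out its own caption.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n

    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)
```

With correctly aligned pairs the loss approaches zero, while a batch whose captions are shuffled relative to their images yields a much larger value. That gap is the signal the fine-tuning uses to pull matched radiology images and captions together.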

Applications

PubMedCLIP is primarily used as a pre-trained visual (and text) encoder for medical imaging tasks where paired image-text supervision is scarce. Its most direct application is medical visual question answering, but the encoder is also useful for image-text retrieval, zero-shot or few-shot classification of radiology images, and as an initialization for downstream medical vision models. Researchers building MedVQA systems or benchmarking biomedical vision-language encoders benefit most, and the released backbones make it straightforward to slot PubMedCLIP into existing pipelines.
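As a rough illustration of the zero-shot classification use case, the sketch below scores one image embedding against one text embedding per candidate label and converts the similarities into probabilities. It assumes the embeddings have already been produced by the image and text encoders; the function names, the example labels, the toy vectors, and the similarity scale of 100 are all assumptions made for this example, not part of the released code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs, scale=100.0):
    """Return a label -> probability mapping for one image.

    prompt_embs maps each candidate label to the text embedding of a prompt
    describing it. The scale factor is an assumed value for this sketch.
    """
    labels = list(prompt_embs)
    logits = [scale * cosine(image_emb, prompt_embs[label]) for label in labels]

    # Softmax over the scaled similarities, with max subtraction for stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy vectors standing in for real encoder outputs (hypothetical values).
example_probs = zero_shot_classify(
    [0.9, 0.1, 0.0],
    {"chest X-ray": [1.0, 0.0, 0.0], "brain MRI": [0.0, 1.0, 0.0]},
)
predicted_label = max(example_probs, key=example_probs.get)
```

No classifier is trained here: the candidate labels are supplied only as text prompts, which is why this approach is attractive when labeled medical images are scarce.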

Impact

As one of the first demonstrations that CLIP-style contrastive pre-training can be specialized for radiology and improve MedVQA, PubMedCLIP helped establish the template that later, larger biomedical CLIP models—including PMC-CLIP and BiomedCLIP—would scale up. It remains a widely cited reference and a common baseline encoder in medical vision-language research, and its openly released weights continue to be reused, including through community HuggingFace mirrors. Its main limitations are its comparatively small, radiology-focused training corpus and an evaluation centered on two MedVQA benchmarks, which bound how broadly its representations generalize relative to later foundation-scale models.

Citation

PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain?

Eslami, S., Meinel, C., and de Melo, G. (2023). PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? Findings of the Association for Computational Linguistics: EACL 2023.

DOI: 10.18653/v1/2023.findings-eacl.88

Recent citations

Papers that recently cited this model.

Research on the application of LLaVA model based on QLoRA fine-tuning in medical teaching
Shiling Zhou, Fengmei Qin
PLoS ONE · Jul 2026
Citations: 0

TCLA: Training-Free Class-wise Logit Adaptation for Medical Vision-Language Models
Tianyou Jiang, Ziyu Zhou
Jul 2026
Citations: 0

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models
Hyunjae Kim, Dain Kim, Pan Xiao, et al.
Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Chunyuan Li, Cliff Wong, Sheng Zhang, et al.
Neural Information Processing Systems · Jun 2023
Citations: 1.8K

A visual–language foundation model for pathology image analysis using medical Twitter
Zhi Huang, Federico Bianchi, Mert Yuksekgonul, et al.
Nature Medicine · Aug 2023
Citations: 767

Vision-language models for medical report generation and visual question answering: a review
Iryna Hartsock, Ghulam Rasool
Frontiers Artif. Intell. · Mar 2024
Citations: 246

ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset
Johannes Rückert, Louise Bloch, Raphael Brüngel, et al.
Scientific Data · May 2024
Citations: 108

CLIP in Medical Imaging: A Comprehensive Survey
Zihao Zhao, Yuxiao Liu, Han Wu, et al.
arXiv.org · 2023
Citations: 107

Citations

Total Citations: 311

Influential: 16

References: 41

GitHub

Stars: 183

Forks: 32

Open Issues: 9

Contributors: 2

Last Push: 1y ago

Language: Python

License: MIT

HuggingFace

Downloads: 6.7K

Likes: 23

Last Modified: 2y ago

Pipeline: zero-shot-image-classification

Fields of citing research

Computer Science: 99%
Medicine: 89%
Engineering: 11%
Environmental Science: 2%
Linguistics: 2%
Biology: 1%
Art: 1%
Philosophy: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 75 (Open)

Usability (can I run it?): 99

Reproducibility (can I retrain it?): 48

Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository · Research Paper · HuggingFace Model


