CXR-CLIP

Large-scale chest X-ray vision-language pretraining model that learns image-report alignment for zero-shot and few-shot radiograph classification.

Released: October 2023

CXR-CLIP is a vision-language foundation model for chest radiography, developed by researchers at Kakao Brain (a research subsidiary of Kakao Corporation) and published at MICCAI 2023. It adapts the CLIP (Contrastive Language-Image Pre-training) paradigm to the medical imaging domain, learning a shared embedding space in which chest X-ray images and their corresponding radiology reports are aligned. Once trained, the model can classify findings on new radiographs in a zero-shot manner—using only natural-language label prompts and no task-specific labeled training data.

The central problem CXR-CLIP addresses is data scarcity in medical imaging. High-quality, expert-annotated chest X-ray datasets are expensive and time-consuming to assemble, which limits the scale at which conventional supervised classifiers can be trained. CXR-CLIP sidesteps this bottleneck by treating routinely generated radiology reports as a weak supervisory signal and by converting existing image-label datasets into image-text pairs through prompt engineering, allowing label-only collections to be folded into language-image pretraining.
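The label-to-text conversion described above can be illustrated with a minimal sketch. The template wording, the `labels_to_text` helper, and the binary label format are assumptions made for illustration; the prompts actually used by the authors may differ.

```python
import random

# Hypothetical prompt templates; the authors' actual wording may differ.
POSITIVE_TEMPLATES = ["There is {finding}.", "Findings suggest {finding}."]
NEGATIVE_TEMPLATES = ["There is no {finding}.", "No evidence of {finding}."]


def labels_to_text(labels, rng):
    """Turn a {finding: 0/1} label dict into a short pseudo-report."""
    sentences = []
    for finding, present in labels.items():
        templates = POSITIVE_TEMPLATES if present else NEGATIVE_TEMPLATES
        sentences.append(rng.choice(templates).format(finding=finding))
    rng.shuffle(sentences)  # vary sentence order across samples
    return " ".join(sentences)


rng = random.Random(0)
text = labels_to_text({"cardiomegaly": 1, "pleural effusion": 0}, rng)
print(text)
```

Each labeled image then gets a generated sentence set that can be fed to the text encoder in the same way as a real report.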

The work sits within a growing family of medical CLIP variants (such as ConVIRT, GLoRIA, and BioViL) that bring contrastive image-text learning to radiology. CXR-CLIP's distinguishing contribution is its emphasis on scaling pretraining data and on capturing study-level structure—the fact that a single radiographic study often contains multiple images (e.g., frontal and lateral views) and a report composed of several sections.
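To make the notion of study-level structure concrete, here is a small sketch of how a study with several views and report sections might be represented and sampled. The `Study` schema and the `sample_pairs` helper are hypothetical, and the fallback of repeating a single item is an assumption; a real pipeline would likely apply augmentation so that the two copies differ.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Study:
    """One radiographic study: several views plus a sectioned report (hypothetical schema)."""
    image_paths: list
    sections: dict = field(default_factory=dict)  # e.g. {"findings": ..., "impression": ...}


def sample_pairs(study, rng):
    """Pick two images and two texts from a single study.

    If only one image or one non-empty section exists, the same item is
    returned twice (assumption; augmentation would normally be applied).
    """
    images = list(study.image_paths)
    texts = [t for t in study.sections.values() if t]
    if not images or not texts:
        raise ValueError("study needs at least one image and one non-empty section")
    img_pair = rng.sample(images, 2) if len(images) >= 2 else [images[0], images[0]]
    txt_pair = rng.sample(texts, 2) if len(texts) >= 2 else [texts[0], texts[0]]
    return img_pair, txt_pair
```

The two images and two texts drawn from one study are the inputs that study-level contrastive terms would compare.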

Key Features

Zero-shot classification: Classifies pathologies on unseen chest X-rays from text prompts alone, without requiring labeled fine-tuning data for each new task.
Image-label to image-text conversion: Uses prompt engineering to recast image-label datasets as image-text pairs, expanding the pool of usable pretraining data beyond report-paired collections.
Study-level contrastive learning: Introduces two auxiliary contrastive losses, an image contrastive loss (ICL) across multiple views and a text contrastive loss (TCL) across report sections, to learn characteristics defined at the study level.
Multiple encoder backbones: Provides both a ResNet-50 (CNN) and a Swin-Tiny (vision transformer) image encoder, paired with a BioClinicalBERT text encoder.
Open weights and code: Pretrained checkpoints and training code are publicly released for both backbones across three dataset configurations.

Technical Details

CXR-CLIP couples an image encoder (ResNet-50 or Swin-Tiny) with the BioClinicalBERT text encoder. Training uses a contrastive InfoNCE-style image-text objective, augmented by the image (ICL) and text (TCL) contrastive losses, which align the multiple views and the report sections within a study. The model was pretrained on large public chest X-ray corpora in three configurations of increasing scale: MIMIC-CXR alone; MIMIC-CXR plus CheXpert; and MIMIC-CXR plus CheXpert plus ChestX-ray14. The label-only datasets were converted into text via prompt engineering. The authors report that CXR-CLIP outperforms state-of-the-art methods trained under the same conditions on classification, and that enlarging the pretraining dataset improves classification performance at the cost of only a marginal drop in image-text retrieval. Evaluation spanned held-out benchmarks including VinDr-CXR, RSNA-Pneumonia, SIIM-Pneumothorax, and OpenI.
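The training objective can be sketched in plain Python as a symmetric InfoNCE loss plus image-image and text-text terms. This is a simplified illustration, not the authors' implementation: the averaging over the four image-text pairings, the weights `w_icl` and `w_tcl`, and the temperature of 0.07 are assumptions, and real training code would operate on batched tensors in a deep learning framework.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def similarity_logits(a, b, temperature):
    """Cosine similarity matrix between two batches, divided by temperature."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    logits = similarity_logits(a, b, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def study_level_loss(img1, img2, txt1, txt2, w_icl=1.0, w_tcl=1.0):
    """Image-text loss plus image-image (ICL) and text-text (TCL) terms.

    The pairing scheme and weights are illustrative assumptions.
    """
    pairs = [(img1, txt1), (img1, txt2), (img2, txt1), (img2, txt2)]
    clip = sum(info_nce(i, t) for i, t in pairs) / len(pairs)
    icl = info_nce(img1, img2)  # two views of the same study should match
    tcl = info_nce(txt1, txt2)  # two report sections of the same study should match
    return clip + w_icl * icl + w_tcl * tcl
```

In this sketch, row `i` of every batch belongs to the same study, so the diagonal of each similarity matrix holds the positive pairs and all other entries act as negatives.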

Applications

CXR-CLIP is most useful as a label-efficient backbone for chest radiograph interpretation. Its zero-shot capability lets researchers and clinical teams probe for new findings simply by writing text prompts, which is valuable for rare conditions where labeled data is scarce or for rapidly prototyping triage tools. The learned embeddings also support image-text retrieval—surfacing radiographs similar to a textual query or matching reports to images—and serve as strong pretrained features for downstream fine-tuning on specific classification tasks with limited annotations. The released ResNet-50 and Swin-Tiny checkpoints give practitioners a ready starting point for transfer learning in radiology pipelines.
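As a rough illustration of zero-shot classification and retrieval in a shared embedding space, the sketch below scores one image embedding against label-prompt embeddings and ranks a gallery by similarity. The toy 3-dimensional vectors stand in for encoder outputs, and the function names and temperature value are hypothetical; this does not reflect the API of the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def zero_shot_probs(image_emb, prompt_embs, temperature=0.07):
    """Softmax over cosine similarities between one image and each label prompt."""
    labels = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[k]) / temperature for k in labels]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {k: e / z for k, e in zip(labels, exps)}


def retrieve(query_emb, gallery, top_k=1):
    """Rank gallery items (id -> embedding) by cosine similarity to the query."""
    ranked = sorted(gallery, key=lambda k: cosine(query_emb, gallery[k]), reverse=True)
    return ranked[:top_k]


# Toy embeddings standing in for text-encoder and image-encoder outputs.
prompts = {
    "pneumothorax": [1.0, 0.1, 0.0],
    "no pneumothorax": [0.0, 1.0, 0.1],
}
image = [0.9, 0.2, 0.0]
probs = zero_shot_probs(image, prompts)
print(max(probs, key=probs.get))
```

Adding a new finding in this setup only requires embedding a new pair of prompts; no classifier head is retrained.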

Impact

CXR-CLIP contributed to the wave of medical vision-language models demonstrating that contrastive image-report pretraining can rival or exceed supervised baselines while drastically reducing annotation requirements. By open-sourcing both code and pretrained weights under a CC BY-NC 4.0 license, the Kakao Brain team lowered the barrier for other groups to build on study-aware contrastive pretraining for radiography. Its main limitations are those shared across the approach: the model is specialized to chest X-rays and inherits biases from the public training corpora, and the non-commercial license restricts certain downstream uses. Nonetheless, it remains a frequently referenced baseline in medical CLIP research.

Citations

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

You, K., et al. (2023) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. International Conference on Medical Image Computing and Computer-Assisted Intervention.

DOI: 10.1007/978-3-031-43895-0_10

CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training

Preprint

You, K., et al. (2023) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. arXiv preprint arXiv:2310.13292.

DOI: 10.48550/arXiv.2310.13292

Recent citations

Papers that recently cited this model.

Medical report generation via knowledge distillation and medical keywords
Lili Huang, Yiming Cao, Xiaowei Zhao, et al.
Neurocomputing · 2026 · 0 citations

Decoupling Language Guidance from Backbones for Text-Guided Medical Segmentation
Yung-Hsing Liu, Xuan Fang, Haijin Zeng, et al.
Jul 2026 · 0 citations

APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms
Juntao Jiang, Jin-Feng Bai, Linxuan Fan, et al.
Jun 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Medical Image Analysis
Zongwei Zhou, V. Sodha, Jiaxuan Pang, et al.
458 citations

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024 · 134 citations

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery
Yu Zhang, Xiusi Chen, Bowen Jin, et al.
Conference on Empirical Methods in Natural Language Processing · Jun 2024 · 126 citations

A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan, Seowung Leem, Kyle B. See, et al.
IEEE Reviews in Biomedical Engineering · Jun 2024 · 115 citations

CLIP in Medical Imaging: A Comprehensive Survey
Zihao Zhao, Yuxiao Liu, Han Wu, et al.
arXiv.org · 2023 · 107 citations

Citations

Total Citations: 138
Influential: 15
References: 30

GitHub

Stars: 123
Forks: 15
Open Issues: 2
Contributors: 2
Last Push: 1y ago
Language: Python

Fields of citing research

Computer Science: 99%
Medicine: 93%
Engineering: 13%
Physics: 3%
Biology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
18 · Closed
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 22

Model Openness Framework: Unclassified
Restrictive license on core components

Resources

GitHub Repository · Research Paper
