bio.rodeo

CXR-CLIP

Kakao Brain

Large-scale chest X-ray vision-language pretraining model that learns image-report alignment for zero-shot and few-shot radiograph classification.

Released: October 2023

CXR-CLIP is a vision-language foundation model for chest radiography, developed by researchers at Kakao Brain (a research subsidiary of Kakao Corporation) and published at MICCAI 2023. It adapts the CLIP (Contrastive Language-Image Pre-training) paradigm to the medical imaging domain, learning a shared embedding space in which chest X-ray images and their corresponding radiology reports are aligned. Once trained, the model can classify findings on new radiographs in a zero-shot manner—using only natural-language label prompts and no task-specific labeled training data.
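The zero-shot step described above can be sketched in a few lines. This is a minimal illustration, not the released CXR-CLIP code: the embeddings are toy vectors standing in for encoder outputs, and `zero_shot_classify`, the prompt labels, and the temperature value are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities to each label prompt."""
    logits = {label: cosine(image_emb, emb) / temperature
              for label, emb in prompt_embs.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - max_logit) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "pneumonia": [1.0, 0.0, 0.1],
    "no pneumonia": [0.0, 1.0, 0.3],
}

probs = zero_shot_classify(image_emb, prompt_embs)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

In a real pipeline the image vector would come from the ResNet-50 or Swin-Tiny encoder and the prompt vectors from the text encoder; no labeled examples of the target finding are needed, only the prompt text.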

The central problem CXR-CLIP addresses is data scarcity in medical imaging. High-quality, expert-annotated chest X-ray datasets are expensive and time-consuming to assemble, which limits the scale at which conventional supervised classifiers can be trained. CXR-CLIP sidesteps this bottleneck by treating routinely generated radiology reports as a weak supervisory signal and by converting existing image-label datasets into image-text pairs through prompt engineering, allowing label-only collections to be folded into language-image pretraining.
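One way to picture the image-label to image-text conversion is a small template function. The sketch below is illustrative only: the templates and the name `labels_to_prompt` are hypothetical, and the wording actually used by the CXR-CLIP authors may differ.

```python
import random

# Hypothetical templates; the wording used by the CXR-CLIP authors may differ.
POSITIVE_TEMPLATES = ["There is {finding}.", "Findings suggest {finding}."]
NEGATIVE_TEMPLATES = ["There is no {finding}.", "No evidence of {finding}."]

def labels_to_prompt(labels, rng):
    """Turn a {finding: 0 or 1} label dict into a short report-like text."""
    sentences = []
    for finding, present in labels.items():
        templates = POSITIVE_TEMPLATES if present else NEGATIVE_TEMPLATES
        sentences.append(rng.choice(templates).format(finding=finding))
    rng.shuffle(sentences)  # vary sentence order so prompts are less repetitive
    return " ".join(sentences)

rng = random.Random(0)  # seeded so the example is repeatable
labels = {"cardiomegaly": 1, "pleural effusion": 0}
prompt = labels_to_prompt(labels, rng)
print(prompt)
```

Each labeled image paired with such a generated text can then be treated like an image-report pair during contrastive pretraining.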

The work sits within a growing family of medical CLIP variants (such as ConVIRT, GLoRIA, and BioViL) that bring contrastive image-text learning to radiology. CXR-CLIP's distinguishing contribution is its emphasis on scaling pretraining data and on capturing study-level structure—the fact that a single radiographic study often contains multiple images (e.g., frontal and lateral views) and a report composed of several sections.

#Key Features

  • Zero-shot classification: Classifies pathologies on unseen chest X-rays from text prompts alone, without requiring labeled fine-tuning data for each new task.
  • Image-label to image-text conversion: Uses prompt engineering to recast image-label datasets as image-text pairs, expanding the pool of usable pretraining data beyond report-paired collections.
  • Study-level contrastive learning: Introduces two auxiliary contrastive losses—an image-specific contrastive loss (ICL) across multiple views and a text-specific contrastive loss (TCL) across report sections—to learn characteristics defined at the study level.
  • Multiple encoder backbones: Provides both a ResNet-50 (CNN) and a Swin-Tiny (vision transformer) image encoder, paired with a BioClinicalBERT text encoder.
  • Open weights and code: Pretrained checkpoints and training code are publicly released for both backbones across three dataset configurations.
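The study-level idea in the list above (two views and two report sections drawn from one study) can be sketched as a sampling step. This is an assumption-laden illustration rather than the authors' data loader: `sample_study_views`, the `augment` callback, the fallback rules, and the study dictionary layout are all hypothetical.

```python
import random

def sample_study_views(study, rng, augment):
    """Pick two images and two texts from one study for study-level contrast."""
    images = study["images"]
    if len(images) >= 2:
        img_a, img_b = rng.sample(images, 2)    # e.g. frontal and lateral views
    else:
        # Single-image study: assume an augmented copy serves as the second view.
        img_a, img_b = images[0], augment(images[0])

    sections = study["sections"]
    if len(sections) >= 2:
        txt_a, txt_b = rng.sample(sections, 2)  # e.g. findings and impression
    else:
        txt_a = txt_b = sections[0]             # fall back to the same text twice
    return (img_a, img_b), (txt_a, txt_b)

def fake_augment(image_id):
    """Placeholder augmentation that only tags the identifier."""
    return image_id + "#aug"

rng = random.Random(0)
study = {
    "images": ["frontal.png", "lateral.png"],
    "sections": ["Findings: clear lungs.", "Impression: normal."],
}
image_pair, text_pair = sample_study_views(study, rng, fake_augment)

single = {"images": ["frontal.png"], "sections": ["Impression: normal."]}
single_images, single_texts = sample_study_views(single, rng, fake_augment)
```

Under this reading, ICL would pull the two sampled images together and TCL the two sampled texts, alongside the usual image-text alignment.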

#Technical Details

CXR-CLIP couples an image encoder (ResNet-50 or Swin-Tiny) with the BioClinicalBERT text encoder. It is trained with a contrastive InfoNCE-style objective augmented by the image (ICL) and text (TCL) contrastive losses, which align multiple views and report sections within a study. The model was pretrained on combinations of large public chest X-ray corpora (MIMIC-CXR, CheXpert, and ChestX-ray14) in three configurations of increasing scale: MIMIC-CXR alone; MIMIC-CXR plus CheXpert; and MIMIC-CXR plus CheXpert plus ChestX-ray14. The label-only datasets were converted into text via prompt engineering. The authors report that, on classification, CXR-CLIP outperforms state-of-the-art methods trained under the same conditions, and that enlarging the pretraining dataset improves classification performance while incurring only a marginal trade-off in image-text retrieval. Evaluation spanned held-out benchmarks including VinDr-CXR, RSNA-Pneumonia, SIIM-Pneumothorax, and OpenI.
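For readers unfamiliar with an InfoNCE-style objective, the following is a minimal pure-Python sketch of the symmetric image-text form commonly used in CLIP-like training. It is a generic illustration under assumed settings (toy 2-D embeddings, temperature 0.07), not the CXR-CLIP training code, and the function names are invented for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where queries[i] should match keys[i] among all keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # log-sum-exp with the max subtracted for stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denom - logits[i]
    return loss / len(q)

def symmetric_clip_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    return 0.5 * (info_nce(image_embs, text_embs, temperature)
                  + info_nce(text_embs, image_embs, temperature))

# Toy batch: matching pairs share an index.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is near zero when each image is closest to its own text and large when pairs are mismatched. One plausible reading of ICL and TCL is that the same `info_nce` form is applied to image-image and text-text pairs from the same study, with the terms added together; how the terms are weighted is not stated here.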

#Applications

CXR-CLIP is most useful as a label-efficient backbone for chest radiograph interpretation. Its zero-shot capability lets researchers and clinical teams probe for new findings simply by writing text prompts, which is valuable for rare conditions where labeled data is scarce or for rapidly prototyping triage tools. The learned embeddings also support image-text retrieval—surfacing radiographs similar to a textual query or matching reports to images—and serve as strong pretrained features for downstream fine-tuning on specific classification tasks with limited annotations. The released ResNet-50 and Swin-Tiny checkpoints give practitioners a ready starting point for transfer learning in radiology pipelines.
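Text-to-image retrieval over the shared embedding space reduces to ranking stored image embeddings by similarity to a query embedding. The sketch below assumes precomputed vectors; the gallery contents, identifiers, and the `retrieve` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_emb, gallery, k=2):
    """Return the k gallery ids most similar to the query, best match first."""
    ranked = sorted(gallery,
                    key=lambda item_id: cosine(query_emb, gallery[item_id]),
                    reverse=True)
    return ranked[:k]

# Hypothetical image embeddings keyed by study id.
gallery = {
    "study_001": [0.9, 0.1],
    "study_002": [0.1, 0.9],
    "study_003": [0.7, 0.3],
}
text_query_emb = [1.0, 0.0]  # stand-in for an encoded text query
top = retrieve(text_query_emb, gallery, k=2)
print(top)
```

Swapping the roles (an image embedding as the query, report embeddings as the gallery) gives report retrieval with the same code.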

#Impact

CXR-CLIP contributed to the wave of medical vision-language models demonstrating that contrastive image-report pretraining can rival or exceed supervised baselines while drastically reducing annotation requirements. By open-sourcing both code and pretrained weights under a CC BY-NC 4.0 license, the Kakao Brain team lowered the barrier for other groups to build on study-aware contrastive pretraining for radiography. Its main limitations are those shared across the approach: the model is specialized to chest X-rays and inherits biases from the public training corpora, and the non-commercial license rules out commercial downstream uses. Nonetheless, it remains a frequently referenced baseline in medical CLIP research.

Citations

You, K., et al. (2023). CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). DOI: 10.1007/978-3-031-43895-0_10

Preprint: You, K., et al. (2023). CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. arXiv. DOI: 10.48550/arXiv.2310.13292

Citations

Total Citations: 133
Influential: 15
References: 30

GitHub

Stars: 121
Forks: 15
Open Issues: 2
Contributors: 2
Last Push: 1y ago
Language: Python

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 18 (Closed)
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 22
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

bert, chest_x_ray, cnn, contrastive_learning, image_text_retrieval, multimodal, radiology, representation_learning, self_supervised, vision_transformer, zero_shot_classification

Resources

GitHub Repository
Research Paper