Large-scale chest X-ray vision-language pretraining model that learns image-report alignment for zero-shot and few-shot radiograph classification.
CXR-CLIP is a vision-language foundation model for chest radiography, developed by researchers at Kakao Brain (a research subsidiary of Kakao Corporation) and published at MICCAI 2023. It adapts the CLIP (Contrastive Language-Image Pre-training) paradigm to the medical imaging domain, learning a shared embedding space in which chest X-ray images and their corresponding radiology reports are aligned. Once trained, the model can classify findings on new radiographs in a zero-shot manner—using only natural-language label prompts and no task-specific labeled training data.
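The zero-shot step described above can be sketched as a softmax over cosine similarities between one image embedding and one text embedding per label prompt. This is a minimal illustration, not the released implementation: the toy vectors stand in for encoder outputs, and the function names and the temperature of 0.07 are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, prompt_embs, temperature=0.07):
    """Softmax over cosine similarities between one image and each label prompt."""
    img = l2_normalize(image_emb)
    logits = []
    for emb in prompt_embs:
        txt = l2_normalize(emb)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for encoder outputs (hypothetical values).
prompts = ["no pneumothorax", "pneumothorax"]
prompt_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
image_emb = [0.1, 0.9, 0.1]

probs = zero_shot_probs(image_emb, prompt_embs)
predicted = prompts[probs.index(max(probs))]
```

In practice the two embeddings would come from the trained image and text encoders; only the prompt list changes when probing for a new finding.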
The central problem CXR-CLIP addresses is data scarcity in medical imaging. High-quality, expert-annotated chest X-ray datasets are expensive and time-consuming to assemble, which limits the scale at which conventional supervised classifiers can be trained. CXR-CLIP sidesteps this bottleneck by treating routinely generated radiology reports as a weak supervisory signal and by converting existing image-label datasets into image-text pairs through prompt engineering, allowing label-only collections to be folded into language-image pretraining.
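As a rough sketch of how a label-only record might be turned into report-like text, the snippet below fills sentence templates from a dictionary of binary labels. The templates and the `labels_to_text` helper are hypothetical and chosen only for illustration; the prompts actually used by the authors are not reproduced here.

```python
import random

# Hypothetical templates, not the ones used by CXR-CLIP.
POSITIVE_TEMPLATES = ["There is {finding}.", "Findings suggest {finding}."]
NEGATIVE_TEMPLATES = ["There is no {finding}.", "No evidence of {finding}."]

def labels_to_text(labels, rng):
    """Turn a {finding: 0 or 1} label dict into one synthetic report-like string."""
    sentences = []
    for finding, present in labels.items():
        templates = POSITIVE_TEMPLATES if present else NEGATIVE_TEMPLATES
        sentences.append(rng.choice(templates).format(finding=finding))
    rng.shuffle(sentences)  # vary sentence order across samples
    return " ".join(sentences)

rng = random.Random(0)  # seeded so the example is reproducible
text = labels_to_text({"pleural effusion": 1, "cardiomegaly": 0}, rng)
```

Sampling among several templates and shuffling sentence order gives each labeled image varied text, which is one plausible way to make label-derived pairs resemble free-text reports.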
The work sits within a growing family of medical CLIP variants (such as ConVIRT, GLoRIA, and BioViL) that bring contrastive image-text learning to radiology. CXR-CLIP's distinguishing contribution is its emphasis on scaling pretraining data and on capturing study-level structure—the fact that a single radiographic study often contains multiple images (e.g., frontal and lateral views) and a report composed of several sections.
CXR-CLIP couples an image encoder (ResNet-50 or Swin-Tiny) with the BioClinicalBERT text encoder. Training uses a contrastive InfoNCE-style image-text objective, augmented by image (ICL) and text (TCL) contrastive losses that align multiple views and report sections within a study. The model was pretrained on combinations of large public chest X-ray corpora (MIMIC-CXR, CheXpert, and ChestX-ray14) in three configurations of increasing scale: MIMIC-CXR alone, then with CheXpert added, then with ChestX-ray14 added as well. The label-only datasets were converted into text via prompt engineering. The authors report that CXR-CLIP outperforms state-of-the-art methods trained under the same conditions on classification tasks, and that enlarging the pretraining dataset improves classification performance while incurring only a marginal trade-off in image-text retrieval. Evaluation spanned held-out benchmarks including VinDr-CXR, RSNA-Pneumonia, SIIM-Pneumothorax, and OpenI.
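The training objective can be illustrated with a small, dependency-free sketch: a symmetric InfoNCE loss over a batch of paired embeddings, reused for the image-text, image-image (ICL), and text-text (TCL) terms. The way the terms are combined, the weights `w_icl` and `w_tcl`, and the temperature are assumptions for illustration and may differ from the paper's exact formulation.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(a_batch, b_batch, temperature=0.07):
    """Symmetric InfoNCE: row i of a_batch should match row i of b_batch."""
    a = [normalize(v) for v in a_batch]
    b = [normalize(v) for v in b_batch]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], with the matching pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_sum = peak + math.log(sum(math.exp(z - peak) for z in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))

def study_loss(img1, img2, txt1, txt2, w_icl=1.0, w_tcl=1.0):
    """Image-text loss plus image-image (ICL) and text-text (TCL) terms."""
    return (info_nce(img1, txt1)
            + w_icl * info_nce(img1, img2)
            + w_tcl * info_nce(txt1, txt2))

# Two toy studies, each with two image views and two text views (hypothetical values).
img1 = [[1.0, 0.0], [0.0, 1.0]]
img2 = [[0.9, 0.1], [0.1, 0.9]]
txt1 = [[1.0, 0.1], [0.1, 1.0]]
txt2 = [[0.8, 0.2], [0.2, 0.8]]

aligned = study_loss(img1, img2, txt1, txt2)
# Reversing the text order breaks the image-text pairing, so the loss should rise.
misaligned = study_loss(img1, img2, txt1[::-1], txt2[::-1])
```

Here `img1`/`img2` stand for embeddings of two views from the same study and `txt1`/`txt2` for two text segments of its report; a real training loop would compute these with the encoders and backpropagate through an autodiff framework.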
CXR-CLIP is most useful as a label-efficient backbone for chest radiograph interpretation. Its zero-shot capability lets researchers and clinical teams probe for new findings simply by writing text prompts, which is valuable for rare conditions where labeled data is scarce or for rapidly prototyping triage tools. The learned embeddings also support image-text retrieval—surfacing radiographs similar to a textual query or matching reports to images—and serve as strong pretrained features for downstream fine-tuning on specific classification tasks with limited annotations. The released ResNet-50 and Swin-Tiny checkpoints give practitioners a ready starting point for transfer learning in radiology pipelines.
CXR-CLIP contributed to the wave of medical vision-language models demonstrating that contrastive image-report pretraining can rival or exceed supervised baselines while drastically reducing annotation requirements. By open-sourcing both code and pretrained weights under a CC BY-NC 4.0 license, the Kakao Brain team lowered the barrier for other groups to build on study-aware contrastive pretraining for radiography. Its main limitations are those shared across the approach: the model is specialized to chest X-rays and inherits biases from the public training corpora, and the non-commercial license restricts certain downstream uses. Nonetheless, it remains a frequently referenced baseline in medical CLIP research.
You, K., et al. (2023) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. International Conference on Medical Image Computing and Computer-Assisted Intervention.
DOI: 10.1007/978-3-031-43895-0_10
arXiv DOI: 10.48550/arXiv.2310.13292