Self-supervised vision-language model for zero-shot detection of chest X-ray pathologies, trained on image-report pairs without explicit labels.
CheXzero is a self-supervised vision-language model that detects pathologies in chest X-rays without being trained on explicit pathology labels. Developed by Ekin Tiu, Pranav Rajpurkar, and colleagues at Stanford University and published in Nature Biomedical Engineering in September 2022, it adapts the contrastive language-image pre-training (CLIP) paradigm to radiology by learning directly from raw chest radiographs paired with their free-text clinical reports.
The central problem CheXzero addresses is the labeling bottleneck in medical imaging AI. Conventional supervised classifiers require large datasets annotated by experts for each target pathology, an expensive and time-consuming process that limits how many conditions a model can recognize. By learning the joint structure of images and the natural-language descriptions radiologists already write, CheXzero instead performs zero-shot classification: at inference time it scores an image against text prompts (for example, "pulmonary edema" versus "no pulmonary edema") and can flag findings it was never explicitly trained to detect.
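The prompt-based scoring described above can be sketched in a few lines. This is a minimal illustration, not the released CheXzero code: it assumes the image and the two prompts have already been embedded by the encoders, uses toy three-dimensional vectors in place of real embeddings, and the names `cosine`, `zero_shot_probability`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probability(image_emb, pos_text_emb, neg_text_emb, temperature=1.0):
    """Two-way softmax over the similarities to a positive and a negative prompt.

    Returns the softmax weight assigned to the positive prompt.
    `temperature` is an illustrative knob, not a documented CheXzero setting.
    """
    pos_logit = cosine(image_emb, pos_text_emb) / temperature
    neg_logit = cosine(image_emb, neg_text_emb) / temperature
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos_logit, neg_logit)
    e_pos = math.exp(pos_logit - m)
    e_neg = math.exp(neg_logit - m)
    return e_pos / (e_pos + e_neg)


# Toy embeddings standing in for encoder outputs (made-up values).
image_emb = [0.9, 0.1, 0.2]
edema_emb = [1.0, 0.0, 0.1]      # stands in for the prompt "pulmonary edema"
no_edema_emb = [0.0, 1.0, 0.1]   # stands in for the prompt "no pulmonary edema"

p_edema = zero_shot_probability(image_emb, edema_emb, no_edema_emb)
print(f"score for 'pulmonary edema': {p_edema:.3f}")
```

Because the pathology is named only in the prompt text, querying a different finding under this scheme means embedding a different pair of prompts; nothing in the image encoder changes.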
This made CheXzero a landmark demonstration that self-supervised, report-driven pretraining can reach expert-level performance on chest X-ray interpretation, influencing a subsequent wave of CLIP-style medical foundation models.
CheXzero adapts the CLIP dual-encoder architecture, pairing a ViT-B/32 Vision Transformer image encoder with a 12-layer, 63-million-parameter Transformer text encoder. The two encoders are trained with a contrastive objective that aligns each chest X-ray with its corresponding radiology report in a shared embedding space. Training used 377,110 image-report pairs from the MIMIC-CXR dataset, with the "impression" section of each report extracted as the text supervision signal. On the CheXpert competition test set, an ensemble of CheXzero models achieved a mean AUC of 0.889 across five pathologies (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), 0.042 below the top fully supervised method (Deep AUC Maximization, 0.931) despite using no explicit pathology labels during training. On PadChest, it achieved AUC > 0.900 on 14 findings and AUC ≥ 0.700 on 53 of 107 radiographic findings, including many not present during training.
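The contrastive objective mentioned above can be illustrated with a small, dependency-free sketch of a CLIP-style symmetric loss. It is an assumption-laden toy, not the CheXzero training code: embeddings are plain Python lists, the function names are hypothetical, and the fixed `temperature=0.07` is a common starting value for this kind of loss rather than a setting taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-by-text similarity matrix.

    Pair i (image i, report i) is the positive; every other pairing in the
    batch acts as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and report j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each image should select its own report.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each report should select its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch of two pairs: correctly matched versus deliberately swapped reports.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(f"matched loss: {matched:.4f}, swapped loss: {swapped:.4f}")
```

The matched batch yields a much lower loss than the swapped one, which is the pressure that pulls each radiograph toward its own report in the shared embedding space.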
CheXzero is most useful where labeled training data is scarce or where the set of clinically relevant findings is broad and evolving. Because it classifies via text prompts, radiologists and researchers can query new or rare pathologies without retraining, supporting rapid prototyping of triage and decision-support tools, retrospective dataset curation, and screening workflows in resource-limited settings. Its label-free design also lowers the barrier for institutions that hold large archives of reports and images but lack the annotation budget to build supervised models for each condition.
CheXzero was an influential proof that self-supervised pretraining on radiology reports could reach expert-level performance on chest X-ray interpretation, and it helped catalyze the adoption of CLIP-style contrastive learning across medical imaging. As a highly cited, openly released model, it became a common baseline and starting point for later chest-radiograph foundation models and vision-language systems in healthcare. Important limitations remain: performance depends on the quality and phrasing of text prompts; the training data derives largely from single-institution sources that may not reflect all populations or imaging conditions; and the model is a research artifact, not a clinical device with regulatory clearance, so prospective validation is required before deployment.
Tiu, E., et al. (2022) Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature Biomedical Engineering.
DOI: 10.1038/s41551-022-00936-9