An instruction-tuned vision-language foundation model from Stanford for interpreting and summarizing chest X-rays across eight clinical task types.
CheXagent is an instruction-tuned vision-language foundation model developed by Stanford University's AIMI (Artificial Intelligence in Medicine and Imaging) center to streamline the interpretation of chest X-rays (CXRs), the most commonly performed medical imaging exam worldwide. Introduced in January 2024, it tackles a persistent bottleneck in radiology: building generalist models that can read CXRs is hampered by the scarcity of large vision-language CXR datasets, the lack of a clinical language model able to parse radiology reports, and the absence of standardized benchmarks for fair evaluation.
The Stanford team addressed all three gaps together. They curated CheXinstruct, a large-scale instruction-tuning corpus assembled from 28 publicly available datasets, trained an 8-billion-parameter model that couples a clinical text decoder with a CXR-specialized vision encoder, and released CheXbench, an evaluation framework spanning eight clinically meaningful task types ranging from image perception to textual understanding. CheXagent sits alongside contrastive CXR models such as CheXzero and CXR-CLIP but is distinguished by its generative, instruction-following design that produces free-text answers and report drafts rather than fixed labels.
Quantitative evaluations and qualitative review by five expert radiologists showed CheXagent outperforming previously developed general-domain and medical-domain foundation models on CheXbench. A later revision extended the work with a clinical reader study, which measured the real-world effect on report-writing efficiency, and with additional model variants (the CheXagent-2 byproducts).
CheXagent is an 8-billion-parameter vision-language model built in stages: the authors first trained a clinical large language model to parse radiology reports, then trained a vision encoder to represent CXR images, and finally trained a bridging network and instruction-tuned the full system on CheXinstruct. CheXinstruct is curated from 28 publicly available datasets (MIMIC-CXR among them), giving broad coverage of CXR appearances and report styles. The model produces free-text outputs and supports zero-shot and few-shot prompting across heterogeneous tasks. On CheXbench, which covers eight task types including view and disease classification, findings and impression generation, and visual question answering, CheXagent surpasses prior general- and medical-domain foundation models in both quantitative evaluation and expert radiologist review. The CheXagent-2 byproducts later released by the lab include smaller variants built on a SigLIP-based vision encoder and the RadPhi-2 clinical decoder. Models and code are released for research use only and are explicitly not intended for clinical deployment.
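The three-part design described above (vision encoder, bridging network, language decoder) can be pictured with a toy data-flow sketch. This is not CheXagent's implementation: a single linear projection stands in for the bridging network, and the `ToyBridge` class, the dimensions, and the random features are all hypothetical, chosen only to show how image features could end up in the same sequence as text embeddings.

```python
import random
from dataclasses import dataclass

Vector = list[float]


def linear(x: Vector, weight: list[Vector]) -> Vector:
    """Multiply a vector by an (out_dim x in_dim) weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


@dataclass
class ToyBridge:
    """Hypothetical stand-in for a bridging network: one linear map per patch."""

    weight: list[Vector]  # shape: text_dim x vision_dim

    def __call__(self, patch_features: list[Vector]) -> list[Vector]:
        # Project every vision feature into the decoder's embedding size.
        return [linear(p, self.weight) for p in patch_features]


def build_decoder_input(
    image_tokens: list[Vector], text_embeddings: list[Vector]
) -> list[Vector]:
    """Prepend projected image tokens to the text token embeddings."""
    return image_tokens + text_embeddings


# Illustrative sizes only; the real model's dimensions are far larger.
VISION_DIM, TEXT_DIM = 4, 6
rng = random.Random(0)

bridge = ToyBridge(
    [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]
)
patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]  # fake image features
prompt_embeddings = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(5)]  # fake text tokens

sequence = build_decoder_input(bridge(patches), prompt_embeddings)
print(len(sequence), len(sequence[0]))  # 8 6
```

The point of the sketch is the shape bookkeeping: three image features of size 4 become three tokens of size 6, which then sit in front of five text tokens so that a decoder could attend over all eight.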
CheXagent targets radiology workflows where chest X-ray volume creates reporting backlogs and turnaround pressure. It can draft structured findings and impressions for radiologist review, answer clinical questions about an image, classify views and abnormalities, and serve as a research backbone for downstream CXR tasks. The Stanford reader study demonstrated tangible benefit to trainees and attending radiologists by accelerating report writing while preserving diagnostic quality, suggesting value as an assistive drafting tool. Researchers also benefit from CheXinstruct and CheXbench as shared resources for training and benchmarking new CXR models.
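For research experiments with the question-answering and drafting uses listed above, a loading routine might look like the following sketch. It assumes the `StanfordAIMI/CheXagent-8b` checkpoint loads through the Hugging Face `transformers` auto classes with `trust_remote_code=True`, and that the prompt follows a `USER: ... ASSISTANT:` template; both should be checked against the published model card. The helper names `build_prompt` and `answer_question` are hypothetical.

```python
MODEL_ID = "StanfordAIMI/CheXagent-8b"  # checkpoint name as published by the lab


def build_prompt(question: str) -> str:
    # Assumed chat template; confirm against the model card before relying on it.
    return f" USER: <s>{question} ASSISTANT: <s>"


def answer_question(image_path: str, question: str, device: str = "cuda") -> str:
    """Load the checkpoint and generate a free-text answer for one image.

    Research use only: the output needs expert review and is not a diagnosis.
    """
    # Heavy imports are deferred so build_prompt works without a GPU stack.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    dtype = torch.float16
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    generation_config = GenerationConfig.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)

    image = Image.open(image_path).convert("RGB")
    # The processor call signature comes from the checkpoint's remote code
    # and is an assumption here.
    inputs = processor(
        images=[image], text=build_prompt(question), return_tensors="pt"
    ).to(device=device, dtype=dtype)
    output_ids = model.generate(**inputs, generation_config=generation_config)[0]
    return processor.tokenizer.decode(output_ids, skip_special_tokens=True)


print(build_prompt("Is there a pleural effusion?"))
```

Calling `answer_question("study.png", "Describe the findings.")` would download the weights on first use and needs a GPU with enough memory for an 8-billion-parameter model in half precision; `study.png` is a placeholder path.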
By releasing the model, the CheXinstruct dataset, and the CheXbench benchmark together, CheXagent gave the CXR community an integrated, reproducible foundation for generative chest X-ray interpretation and helped establish instruction-tuned vision-language models as a competitive paradigm in medical imaging. The science is openly licensed (the arXiv paper, architecture, and CheXbench results are CC-BY-4.0), but the released weights (StanfordAIMI/CheXagent-8b) and the GitHub code are governed by a non-commercial, research-use-only license (CC-BY-NC-ND), so they are not freely reusable for commercial or derivative work. Together with the lab's follow-on CheXagent-2 variants, it has become a common reference point for evaluating radiology foundation models. The chief limitation is that the model is validated for research only and not approved for clinical use; like other generative report tools, it requires expert oversight to guard against hallucinated or incorrect findings.
Chen, Z., et al. (2024). A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation.
DOI: 10.48550/arXiv.2401.12208