bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Imaging · Language model

CheXagent

Stanford University

An instruction-tuned vision-language foundation model from Stanford for interpreting and summarizing chest X-rays across eight clinical task types.

Released: January 2024
Parameters: 8 billion

CheXagent is an instruction-tuned vision-language foundation model developed by Stanford University's AIMI (Artificial Intelligence in Medicine and Imaging) center to streamline the interpretation of chest X-rays (CXRs), the most commonly performed medical imaging exam worldwide. Introduced in January 2024, it tackles a persistent bottleneck in radiology: building generalist models that can read CXRs is hampered by the scarcity of large vision-language CXR datasets, the lack of a clinical language model able to parse radiology reports, and the absence of standardized benchmarks for fair evaluation.

The Stanford team addressed all three gaps together. They curated CheXinstruct, a large-scale instruction-tuning corpus assembled from 28 publicly available datasets, trained an 8-billion-parameter model that couples a clinical text decoder with a CXR-specialized vision encoder, and released CheXbench, an evaluation framework spanning eight clinically meaningful task types ranging from image perception to textual understanding. CheXagent sits alongside contrastive CXR models such as CheXzero and CXR-CLIP but is distinguished by its generative, instruction-following design that produces free-text answers and report drafts rather than fixed labels.

Quantitative evaluations and qualitative review by five expert radiologists showed CheXagent outperforming previously developed general-domain and medical-domain foundation models on CheXbench. A later revision extended the work with a clinical reader study that measured the effect on report-writing efficiency, and with additional model variants (the follow-on CheXagent-2 models).

#Key Features

  • Instruction-following CXR interpretation: A single checkpoint handles diverse tasks — view classification, disease identification, findings/impression generation, and visual question answering — by responding to natural-language instructions.
  • Purpose-built clinical components: The system pairs a language decoder adapted for parsing radiology reports with a vision encoder specialized for representing CXR images, bridged by a connector network trained for the medical domain.
  • CheXinstruct training corpus: A large-scale instruction-tuning dataset compiled from 28 public CXR datasets, designed to teach the model the full breadth of interpretation subtasks.
  • CheXbench benchmark: A standardized evaluation suite covering eight task types across image perception and textual understanding, enabling reproducible, head-to-head comparison of CXR foundation models.
  • Clinical efficiency gains: In a reader study, drafting reports with CheXagent assistance saved residents roughly 36% of their writing time. Writing efficiency improved in 81% of resident cases and 61% of attending cases, without compromising report quality.
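
The instruction-following design in the list above can be illustrated with a minimal sketch: a single entry point, with the task selected only by the wording of the instruction. The template strings and the `build_instruction` helper are hypothetical and are not CheXagent's actual prompt format; only the task names come from the list above.

```python
# Hypothetical instruction templates; CheXagent's real prompt wording may differ.
TASK_TEMPLATES = {
    "view_classification": "What is the view of this chest X-ray?",
    "disease_identification": "Does this chest X-ray show {finding}?",
    "findings_generation": "Describe the findings in this chest X-ray.",
    "impression_generation": "Summarize these findings as an impression: {findings}",
    "visual_question_answering": "{question}",
}


def build_instruction(task: str, **fields: str) -> str:
    """Return the natural-language instruction for one task type."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    # str.format fills the placeholders; a missing field raises KeyError.
    return TASK_TEMPLATES[task].format(**fields)


if __name__ == "__main__":
    print(build_instruction("disease_identification", finding="pleural effusion"))
```

In this sketch the same model call would receive the image plus whichever instruction string is built, which is what lets one checkpoint cover classification, generation, and question answering.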

#Technical Details

CheXagent is an 8-billion-parameter vision-language model built in stages: the authors first trained a clinical large language model to parse radiology reports, then trained a vision encoder to represent CXR images, and finally trained a bridging network and instruction-tuned the full system on CheXinstruct. CheXinstruct is curated from 28 publicly available datasets (including MIMIC-CXR), giving broad coverage of CXR appearances and report styles. The model produces free-text outputs and supports zero-shot and few-shot prompting across heterogeneous tasks.

On CheXbench, whose eight task types include view and disease classification, findings/impression generation, and visual question answering, CheXagent surpasses prior general- and medical-domain foundation models in both quantitative evaluation and expert review. The follow-on CheXagent-2 models released by the lab include smaller variants built on a SigLIP-based vision encoder and the RadPhi-2 clinical decoder. Models and code are released for research use only and are explicitly not intended for clinical deployment.
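
The staged construction described in this section can be written down as a small data structure. This is a sketch for orientation only: the stage order follows the prose, but the component labels and the assumption about which weights are updated in each stage (in particular, that instruction tuning updates every component) are illustrative and are not confirmed by this summary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    trains: Tuple[str, ...]  # components updated in this stage (assumed)
    goal: str


# Order follows the description above; labels and trainable sets are assumptions.
STAGES = (
    Stage("clinical_llm", ("language_decoder",), "parse radiology reports"),
    Stage("vision_encoder", ("vision_encoder",), "represent CXR images"),
    Stage("bridge", ("bridging_network",), "connect image features to the decoder"),
    Stage(
        "instruction_tuning",
        ("language_decoder", "vision_encoder", "bridging_network"),
        "follow instructions from CheXinstruct",
    ),
)


def first_trained_in(component: str) -> str:
    """Return the name of the earliest stage that updates `component`."""
    for stage in STAGES:
        if component in stage.trains:
            return stage.name
    raise KeyError(component)


if __name__ == "__main__":
    for number, stage in enumerate(STAGES, start=1):
        print(f"{number}. {stage.name}: trains {', '.join(stage.trains)} to {stage.goal}")
```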

#Applications

CheXagent targets radiology workflows where chest X-ray volume creates reporting backlogs and turnaround pressure. It can draft structured findings and impressions for radiologist review, answer clinical questions about an image, classify views and abnormalities, and serve as a research backbone for downstream CXR tasks. The Stanford reader study demonstrated tangible benefit to trainees and attending radiologists by accelerating report writing while preserving diagnostic quality, suggesting value as an assistive drafting tool. Researchers also benefit from CheXinstruct and CheXbench as shared resources for training and benchmarking new CXR models.
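
To make the reader-study figure concrete, the sketch below computes a relative time saving, assuming the reported percentage is the reduction in drafting time relative to unassisted drafting. The timings are hypothetical values chosen only to illustrate a saving of about 36%; they are not data from the study.

```python
def relative_time_saving(baseline_seconds: float, assisted_seconds: float) -> float:
    """Fraction of drafting time saved relative to the unassisted baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (baseline_seconds - assisted_seconds) / baseline_seconds


if __name__ == "__main__":
    # Hypothetical timings: 600 s unassisted versus 384 s with a model draft.
    saving = relative_time_saving(baseline_seconds=600, assisted_seconds=384)
    print(f"{saving:.0%}")
```

A negative result from this function would mean the assisted draft took longer than the baseline, which is why per-case percentages (such as the share of cases with improved efficiency) are reported alongside the average.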

#Impact

By releasing the model, the CheXinstruct dataset, and the CheXbench benchmark together, CheXagent gave the CXR community an integrated, reproducible foundation for generative chest X-ray interpretation and helped establish instruction-tuned vision-language models as a competitive paradigm in medical imaging. The science is openly licensed (the arXiv paper, architecture, and CheXbench results are CC-BY-4.0), but the released weights (StanfordAIMI/CheXagent-8b) and the GitHub code are governed by a non-commercial, research-use-only license (CC-BY-NC-ND), so they are not freely reusable for commercial or derivative work. Together with the lab's follow-on CheXagent-2 variants, it has become a widely referenced point of comparison for evaluating radiology foundation models. The chief limitation is that the model is validated for research only and is not approved for clinical use; like other generative report tools, it requires expert oversight to guard against hallucinated or incorrect findings.
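
The difference between the two licenses named above comes down to the Creative Commons elements in each identifier. The helper below is a simplified, hypothetical reading of those elements (BY, NC, ND, SA) for identifiers of the form `CC-BY-...`; it is a sketch for comparison and does not cover every Creative Commons license or replace the license text.

```python
def cc_permissions(license_id: str) -> dict:
    """Summarize the elements of a CC-BY-family license identifier."""
    parts = license_id.upper().split("-")
    if parts[0] != "CC":
        raise ValueError(f"not a CC-BY-family identifier: {license_id!r}")
    # Keep only alphabetic elements; a version suffix such as "4.0" is ignored.
    elements = {part for part in parts[1:] if part.isalpha()}
    return {
        "attribution_required": "BY" in elements,
        "commercial_use": "NC" not in elements,   # NC = NonCommercial
        "derivatives": "ND" not in elements,      # ND = NoDerivatives
        "share_alike": "SA" in elements,          # SA = ShareAlike
    }


if __name__ == "__main__":
    for license_id in ("CC-BY-4.0", "CC-BY-NC-ND"):
        print(license_id, cc_permissions(license_id))
```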

Citation

A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation

Preprint

Chen, Z., et al. (2024) A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation.

DOI: 10.48550/arXiv.2401.12208

Citations

Total Citations: 70
Influential: 11
References: 0

GitHub

Stars: 226
Forks: 27
Open Issues: 5
Contributors: 1
Last Push: 1y ago
Language: Python

HuggingFace

Downloads: 1.3K
Likes: 46
Last Modified: 2y ago
Pipeline: text-generation

Openness

bio.rodeo openness: 32 (Closed; low usability and reproducibility)
Usability (can I run it?): 34
Reproducibility (can I retrain it?): 11
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

chest_x_ray, foundation_model, image_classification, instruction_tuning, multimodal, radiology, report_generation, transformer, vision_language_model, visual_question_answering
