Seoul National University / Gwangju Institute of Science and Technology
A publicly available vision-language model that interprets chest X-rays and generates radiology reports, built on a CXR-specific image encoder and LLaMA-2 and released under a non-commercial license.
CXR-LLaVA is a publicly available multimodal large language model that interprets chest radiographs (CXRs) and produces free-text radiology reports. Developed by radiologists at Seoul National University Hospital together with AI researchers at the Gwangju Institute of Science and Technology, it was first released as a preprint in October 2023 and published in European Radiology in 2025. The model adapts the LLaVA (Large Language and Vision Assistant) recipe to the radiology domain, pairing a chest-X-ray-specific image encoder with a general-purpose language model so that a single system can describe findings, answer questions, and draft structured reports from an input image.
The central problem CXR-LLaVA addresses is that general-purpose vision-language models, including GPT-4-Vision and Gemini-Pro-Vision at the time of writing, perform poorly on chest radiographs because their image encoders were never exposed to large volumes of radiology data. CXR-LLaVA tackles this by first pretraining its vision encoder on hundreds of thousands of labeled CXRs, giving the downstream language model a representation that already captures clinically meaningful imaging features such as consolidation, effusion, cardiomegaly, and pneumothorax.
The authors released code, model weights, and a public demo. The release is covered by a non-commercial CC-BY-NC-4.0 license plus the LLaMA-2 community license, so usage is restricted to research and non-commercial settings. Even so, CXR-LLaVA became one of the more accessible reference implementations for radiology-specific multimodal LLMs, sitting alongside related efforts such as LLaVA-Rad and other report-generation systems in the medical imaging landscape.
The latest version (v2) couples a ViT-L/16 vision transformer encoder with a LLaMA-2-7B-Chat language backbone, processing grayscale CXR images at 512x512 resolution. Training used roughly 592,580–659,287 publicly available chest radiographs aggregated from open datasets including CheXpert, MIMIC-CXR, NIH ChestX-ray, PadChest, VinDr-CXR, BrixIA, and the RSNA COVID-19 detection challenge; of these, several hundred thousand carried abnormality labels and over 200,000 included free-text reports. Training proceeded in stages: vision-encoder pretraining on labeled images, followed by image-text alignment and instruction tuning on report data. In a reader study, board-certified radiologists judged that the model produced acceptable autonomous reports in 72.7% of cases.
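As a rough illustration of the encoder input described above, the following sketch works out how many patches a vision transformer with 16-pixel patches produces from a 512x512 image. The function name patch_grid is hypothetical and not part of the CXR-LLaVA code. Whether the released model adds a class token, or reduces the number of tokens before they reach the language model, is not stated here, so the result is only the raw patch count.

```python
def patch_grid(image_size: int = 512, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square ViT input.

    Hypothetical helper for illustration only; the defaults mirror the
    512x512 input and ViT-L/16 patch size mentioned in the text.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid()
# 512 / 16 = 32 patches per side, 32 * 32 = 1024 patches in total
print(per_side, total)

# A single-channel (grayscale) patch holds 16 * 16 * 1 = 256 pixel values.
values_per_patch = 16 * 16 * 1
print(values_per_patch)
```

For comparison, the same arithmetic on a 224x224 input gives 14 patches per side and 196 patches in total, so a 512x512 input gives the encoder roughly five times as many patches to work with.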
CXR-LLaVA targets radiology research workflows where automated chest-X-ray interpretation is useful: drafting preliminary reports to reduce reporting burden, serving as a teaching and second-read aid, powering visual question answering over radiographs, and providing a reproducible baseline for groups building or benchmarking medical multimodal LLMs. Because weights and a demo are publicly available for non-commercial use, both clinical-AI researchers and machine-learning practitioners can evaluate it directly or fine-tune it for downstream radiology tasks. The authors explicitly caution against unvalidated clinical use.
CXR-LLaVA demonstrated that domain-specific pretraining of the vision encoder is key to making LLaVA-style models effective on medical images, and its open release made it a practical reference point for radiology vision-language research. By outperforming leading general-purpose multimodal models on chest-X-ray findings and publishing in a major radiology journal, it helped establish report generation as a credible benchmark task for medical foundation models. Its main limitations are its non-commercial license, restriction to single-view grayscale CXRs at fixed resolution, and the usual caveats around hallucination and numerical reliability that accompany report-generating LLMs.
Lee, S., et al. (2025) CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. European Radiology.
DOI: 10.1007/s00330-024-11339-6