CXR-LLaVA

Seoul National University / Gwangju Institute of Science and Technology

Chest X-ray vision-language model that generates free-text radiology reports, pairing a CXR-specific image encoder with a 7B LLaMA-2 language model.

Released: October 2023

Parameters: 7 Billion

CXR-LLaVA is a publicly available multimodal large language model that interprets chest radiographs (CXRs) and produces free-text radiology reports. Developed by radiologists at Seoul National University Hospital together with AI researchers at the Gwangju Institute of Science and Technology, it was first released as a preprint in October 2023 and published in European Radiology in 2025. The model adapts the LLaVA (Large Language and Vision Assistant) recipe to the radiology domain, pairing a chest-X-ray-specific image encoder with a general-purpose language model so that a single system can describe findings, answer questions, and draft structured reports from an input image.

The central problem CXR-LLaVA addresses is that general-purpose vision-language models, including GPT-4-Vision and Gemini-Pro-Vision (the two baselines it was compared against), perform poorly on chest radiographs because their image encoders were never exposed to large volumes of radiology data. CXR-LLaVA tackles this by first pretraining its vision encoder on hundreds of thousands of labeled CXRs, giving the downstream language model a representation that already captures clinically meaningful imaging features such as consolidation, effusion, cardiomegaly, and pneumothorax.

The authors released code, model weights, and a public demo under a non-commercial CC-BY-NC-4.0 license plus the LLaMA-2 community license, so usage is restricted to research and non-commercial settings. Within those limits, CXR-LLaVA became one of the more accessible reference implementations for radiology-specific multimodal LLMs, sitting alongside related efforts such as LLaVA-Rad and other report-generation systems in the medical imaging landscape.

Key Features

CXR-specific vision encoder: The image encoder is pretrained on labeled chest radiographs before instruction tuning, so the model starts from representations tuned to thoracic pathology rather than natural images.
Report generation and dialogue: It generates full radiologic reports, offers differential diagnoses, and supports interactive visual question answering about a given chest X-ray.
Publicly available weights and demo: Code, checkpoints, and a hosted web demo are publicly accessible under a non-commercial CC-BY-NC-4.0 plus LLaMA-2 license, enabling reproducible evaluation and research use without retraining.
Strong reported accuracy: On detection of major radiographic findings it reported F1 scores of 0.81 (internal test) and 0.62 (external validation), exceeding GPT-4-Vision and Gemini-Pro-Vision on the same tasks.
Research-only licensing: Released under a Creative Commons non-commercial license and dependent on the LLaMA-2 license, it is intended for research rather than clinical decision-making.
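
The F1 scores quoted above combine precision and recall into one number. As a rough illustration of the metric itself, the sketch below computes F1 from raw detection counts. The function name `f1_score` and the counts are hypothetical and chosen only for the arithmetic; they are not taken from the paper, and how the authors aggregated scores across findings is not restated here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 from true-positive, false-positive, and false-negative counts.

    F1 is the harmonic mean of precision and recall, which simplifies to
    2*TP / (2*TP + FP + FN).
    """
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # No positives were predicted or present; treat F1 as 0 by convention.
        return 0.0
    return 2 * tp / denominator


# Hypothetical counts for a single finding (illustration only, not paper data).
example = f1_score(tp=40, fp=10, fn=10)
print(round(example, 2))  # prints 0.8
```

Because F1 ignores true negatives, it stays informative when most radiographs are negative for a given finding, which is one reason it is a common choice for detection tasks like this.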

Technical Details

The latest version (v2) couples a ViT-L/16 vision transformer encoder with a LLaMA-2-7B-Chat language backbone, processing grayscale CXR images at 512×512 resolution. Training used between 592,580 and 659,287 publicly available chest radiographs aggregated from open datasets including CheXpert, MIMIC-CXR, NIH ChestX-ray, PadChest, VinDr-CXR, BrixIA, and the RSNA COVID-19 detection challenge; of these, several hundred thousand carried abnormality labels and over 200,000 included free-text reports. Training proceeded in stages: vision-encoder pretraining on labeled images, followed by image-text alignment and instruction tuning on report data. In a reader study, board-certified radiologists judged that the model produced acceptable autonomous reports in 72.7% of cases.
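
To make the encoder geometry concrete, the sketch below works out how many image patches a ViT with 16-pixel patches produces from a 512×512 input. The helper name `vit_patch_grid` is hypothetical. It covers only the patch arithmetic; whether the model adds a class token or pools patches before passing them to the language model is not stated here and is not assumed.

```python
def vit_patch_grid(image_size: int = 512, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image.

    A ViT splits the image into non-overlapping square patches, so the
    image side must be an exact multiple of the patch side.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = vit_patch_grid()  # defaults: 512x512 input, 16-pixel patches
print(per_side, total)  # prints 32 1024
```

For comparison, the same arithmetic at the 224×224 resolution common for natural-image ViTs gives 14 patches per side and 196 in total, so the 512×512 input yields roughly five times as many patches for the encoder to process.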

Applications

CXR-LLaVA targets radiology research workflows where automated chest-X-ray interpretation is useful: drafting preliminary reports to reduce reporting burden, serving as a teaching and second-read aid, powering visual question answering over radiographs, and providing a reproducible baseline for groups building or benchmarking medical multimodal LLMs. Because weights and a demo are publicly available for non-commercial use, both clinical-AI researchers and machine-learning practitioners can evaluate it directly or fine-tune it for downstream radiology tasks. The authors explicitly caution against unvalidated clinical use.

Impact

CXR-LLaVA demonstrated that domain-specific pretraining of the vision encoder is key to making LLaVA-style models effective on medical images, and its open release made it a practical reference point for radiology vision-language research. By outperforming leading general-purpose multimodal models on chest-X-ray findings and publishing in a major radiology journal, it helped establish report generation as a credible benchmark task for medical foundation models. Its main limitations are its non-commercial license, restriction to single-view grayscale CXRs at fixed resolution, and the usual caveats around hallucination and numerical reliability that accompany report-generating LLMs.

Citation

CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images

Lee, S., et al. (2025) CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. European Radiology.

DOI: 10.1007/s00330-024-11339-6

Recent citations

Papers that recently cited this model.

Language-Guided Segmentation of Medical Images: A Review of Foundation Models
Saqib Qamar
Bioengineering · Jul 2026
Citations: 0

Multimodal AI in healthcare: Review of vision-language foundation models for real-world medical applications.
Taha Razzaq, Murtaza Taj, Asim Iqbal
Journal of Biomedical Informatics · Jul 2026
Citations: 0

Benchmarking Multimodal Large Language Models for Cardiopulmonary Findings on Chest Radiographs: Sex-Stratified Discrimination and Operating Characteristics
Matteo Haupt, Arne Bischoff, Myriam Atoubi, et al.
Diagnostics · Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Generalist foundation models from a multimodal dataset for 3D computed tomography.
I. Hamamci, Sezgin Er, Furkan Almas, et al.
Nature Biomedical Engineering · Mar 2024
Citations: 177

Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study
Zhiwu Lin, Yuanyuan Li, Min-chia Wu, et al.
Clinical and Experimental Medicine · Jun 2025
Citations: 12

Primer on large language models: an educational overview for intensivists
Daphna Idan, Sharon Einav
Critical Care · Jun 2025
Citations: 9

Can Large Language Models Challenge CNNS in Medical Image Analysis?
Shibbir Ahmed, Shahnewaz Karim Sakib, Anindya Bijoy Das
International Conference on Information Photonics · May 2025
Citations: 7

MoMA: a mixture-of-multimodal-agents architecture for enhancing clinical prediction modelling
Jifan Gao, Mahmudur Rahman, J. Caskey, et al.
npj Digital Medicine · Aug 2025
Citations: 4

Citations

Total Citations: 39
Influential: 2
References: 28

GitHub

Stars: 54
Forks: 5
Open Issues: 4
Contributors: 1
Last Push: 2y ago
Language: Python

HuggingFace

Downloads: 189
Likes: 6
Last Modified: 2y ago
Pipeline: feature-extraction

Fields of citing research

Medicine: 97%
Computer Science: 82%
Engineering: 8%
Education: 3%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 27 (Closed)
Usability (can I run it?): 29
Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository, Research Paper, HuggingFace Model, Demo
