Imaging, Language model

CXR-LLaVA

Seoul National University / Gwangju Institute of Science and Technology

A publicly available vision-language model that interprets chest X-rays and generates radiology reports, built on a CXR-specific image encoder and LLaMA-2 and released under a non-commercial license.

Released: October 2023
Parameters: 7 billion

CXR-LLaVA is a publicly available multimodal large language model that interprets chest radiographs (CXRs) and produces free-text radiology reports. Developed by radiologists at Seoul National University Hospital together with AI researchers at the Gwangju Institute of Science and Technology, it was first released as a preprint in October 2023 and published in European Radiology in 2025. The model adapts the LLaVA (Large Language and Vision Assistant) recipe to the radiology domain, pairing a chest-X-ray-specific image encoder with a general-purpose language model so that a single system can describe findings, answer questions, and draft structured reports from an input image.

The central problem CXR-LLaVA addresses is that general-purpose vision-language models — including GPT-4-Vision and Gemini-Pro-Vision at the time of writing — perform poorly on chest radiographs because their image encoders were never exposed to large volumes of radiology data. CXR-LLaVA tackles this by first pretraining its vision encoder on hundreds of thousands of labeled CXRs, giving the downstream language model a representation that already captures clinically meaningful imaging features such as consolidation, effusion, cardiomegaly, and pneumothorax.

The authors released code, model weights, and a public demo, which made CXR-LLaVA one of the more accessible reference implementations for radiology-specific multimodal LLMs, alongside related efforts such as LLaVA-Rad and other report-generation systems in the medical imaging landscape. The release is covered by a non-commercial CC-BY-NC-4.0 license plus the LLaMA-2 community license, so usage is restricted to research and non-commercial settings.

#Key Features

  • CXR-specific vision encoder: The image encoder is pretrained on labeled chest radiographs before instruction tuning, so the model starts from representations tuned to thoracic pathology rather than natural images.
  • Report generation and dialogue: It generates full radiologic reports, offers differential diagnoses, and supports interactive visual question answering about a given chest X-ray.
  • Publicly available weights and demo: Code, checkpoints, and a hosted web demo are publicly accessible, enabling reproducible evaluation and research use without retraining.
  • Strong reported accuracy: For detection of major radiographic findings, the authors report F1 scores of 0.81 (internal test) and 0.62 (external validation), exceeding GPT-4-Vision and Gemini-Pro-Vision on the same tasks.
  • Research-only licensing: Released under the non-commercial CC-BY-NC-4.0 license and subject to the LLaMA-2 community license, it is intended for research rather than clinical decision-making.
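
The F1 figures quoted in the list above combine precision and recall for each radiographic finding. The following is a minimal sketch of that calculation, assuming binary per-finding labels and an unweighted (macro) average across findings; the averaging scheme, the function names, and the toy labels are illustrative assumptions and are not taken from the CXR-LLaVA paper or code.

```python
from typing import Dict, List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(labels: Dict[str, List[int]], preds: Dict[str, List[int]]) -> float:
    """Unweighted mean of the per-finding F1 scores (assumed averaging scheme)."""
    scores = [f1_score(labels[name], preds[name]) for name in labels]
    return sum(scores) / len(scores)


# Toy data for four hypothetical radiographs; 1 = finding present, 0 = absent.
toy_labels = {
    "cardiomegaly": [1, 0, 1, 0],
    "effusion": [1, 1, 0, 0],
}
toy_preds = {
    "cardiomegaly": [1, 0, 0, 0],  # one missed case
    "effusion": [1, 1, 1, 0],      # one false alarm
}

if __name__ == "__main__":
    for finding in toy_labels:
        print(finding, round(f1_score(toy_labels[finding], toy_preds[finding]), 3))
    print("macro F1", round(macro_f1(toy_labels, toy_preds), 3))
```

Because F1 ignores true negatives, it is a common choice for findings that are rare in a test set, where plain accuracy would look high even for a model that never flags anything.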

#Technical Details

The latest version (v2) couples a ViT-L/16 vision transformer encoder with a LLaMA-2-7B-Chat language backbone and processes grayscale CXR images at 512×512 resolution. Training used between 592,580 and 659,287 publicly available chest radiographs aggregated from open datasets including CheXpert, MIMIC-CXR, NIH ChestX-ray, PadChest, VinDr-CXR, BrixIA, and the RSNA COVID-19 detection challenge; of these, several hundred thousand carried abnormality labels and over 200,000 included free-text reports. Training proceeded in stages: vision-encoder pretraining on labeled images, followed by image-text alignment and instruction tuning on report data. In a reader study, board-certified radiologists judged that the model produced acceptable autonomous reports in 72.7% of cases.
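
To make the encoder geometry concrete, here is a small sketch of how a ViT with 16-pixel patches would tile a 512×512 grayscale image. It assumes non-overlapping square patches and a single input channel; whether the released model feeds one channel or replicates it to three, and whether it adds a class token or reduces the token count before the language model, is not stated above. The helper names `patch_grid` and `patchify` are hypothetical.

```python
from typing import List

IMAGE_SIZE = 512  # input resolution stated in the text
PATCH_SIZE = 16   # the "/16" in ViT-L/16
CHANNELS = 1      # assumption: one grayscale channel


def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    return image_size // patch_size


def patchify(image: List[List[float]], patch_size: int) -> List[List[float]]:
    """Split a square 2D image into flattened, non-overlapping patches (row-major)."""
    side = patch_grid(len(image), patch_size)
    patches = []
    for gy in range(side):
        for gx in range(side):
            patch = [
                image[gy * patch_size + dy][gx * patch_size + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


grid = patch_grid(IMAGE_SIZE, PATCH_SIZE)               # patches per side
num_patches = grid * grid                               # patch tokens per image
values_per_patch = PATCH_SIZE * PATCH_SIZE * CHANNELS   # pixel values per patch

if __name__ == "__main__":
    print(f"{grid} x {grid} grid, {num_patches} patches, "
          f"{values_per_patch} values each")
```

Under these assumptions each radiograph becomes a 32 × 32 grid of 1,024 patches, which helps explain why a fixed input resolution matters: changing the resolution changes the number of tokens the encoder sees.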

#Applications

CXR-LLaVA targets radiology research workflows where automated chest-X-ray interpretation is useful: drafting preliminary reports to reduce reporting burden, serving as a teaching and second-read aid, powering visual question answering over radiographs, and providing a reproducible baseline for groups building or benchmarking medical multimodal LLMs. Because weights and a demo are publicly available for non-commercial use, both clinical-AI researchers and machine-learning practitioners can evaluate it directly or fine-tune it for downstream radiology tasks. The authors explicitly caution against unvalidated clinical use.

#Impact

CXR-LLaVA demonstrated that domain-specific pretraining of the vision encoder is key to making LLaVA-style models effective on medical images, and its open release made it a practical reference point for radiology vision-language research. By outperforming leading general-purpose multimodal models on chest-X-ray findings and publishing in a major radiology journal, it helped establish report generation as a credible benchmark task for medical foundation models. Its main limitations are its non-commercial license, restriction to single-view grayscale CXRs at fixed resolution, and the usual caveats around hallucination and numerical reliability that accompany report-generating LLMs.

Citation

Lee, S., et al. (2025) CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. European Radiology.

DOI: 10.1007/s00330-024-11339-6

Citations

Total Citations: 31
Influential: 1
References: 28

GitHub

Stars: 54
Forks: 5
Open Issues: 4
Contributors: 1
Last Push: 2y ago
Language: Python

HuggingFace

Downloads: 362
Likes: 6
Last Modified: 2y ago
Pipeline: image-feature-extraction

Openness

bio.rodeo openness: 27 (Closed; low usability and reproducibility)
Usability (can I run it?): 29
Reproducibility (can I retrain it?): 18
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

abnormality_classification, chest_x_ray, instruction_tuning, medical_visual_question_answering, multimodal, radiology, radiology_report_generation, transformer, vision_transformer

Resources

GitHub Repository, Research Paper, HuggingFace Model, Demo