LLaVA-Rad

A 7B-parameter chest X-ray vision-language model that drafts the findings section of a radiology report and is small enough to run on a single GPU.

Released: February 2025

Parameters: 7 Billion

LLaVA-Rad is a lightweight, publicly downloadable multimodal foundation model that generates radiology findings from chest X-rays. Given a frontal chest radiograph, and optionally a free-text reason for the exam, the model produces the "findings" section of a radiology report. It was developed by Microsoft Research with collaborators at the University of Washington, Stanford University, and other institutions, and was published in Nature Communications in 2025.

Automated report generation from medical images is a long-standing goal: radiologists face heavy reporting workloads, and draft findings could accelerate review. Large proprietary multimodal models such as GPT-4V and Med-PaLM M (84B parameters) had already been applied to this task, but they are expensive, closed, and difficult to deploy in clinical settings constrained by privacy and compute. LLaVA-Rad targets this gap with a 7-billion-parameter model that runs inference on a single V100 GPU and can be trained on an 8×A100 cluster in roughly one day, making domain adaptation practical for individual institutions.
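As a rough check on the single-GPU claim, the memory needed for the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch, not a measurement: the numeric precision used for inference is not stated here, so several precisions are shown as assumptions, and activations, the KV cache, and the vision encoder are ignored. For comparison, the V100 ships in 16 GB and 32 GB variants.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal size of the 7B language model; the vision encoder and projector
# are left out of this estimate (assumption for illustration).
N_PARAMS = 7e9

# Candidate precisions; which one is used in practice is an assumption.
PRECISIONS = [("fp32", 4), ("fp16", 2), ("int8", 1)]

for name, nbytes in PRECISIONS:
    print(f"{name}: {weight_memory_gb(N_PARAMS, nbytes):.0f} GB")
```

At half precision the weights come to about 14 GB, which is why a 7B model is plausible on one V100 while an 84B model is not.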

The work also introduces CheXprompt, an automated GPT-4-based metric for scoring the factual correctness of generated reports against ground truth, addressing the well-known limitation that lexical overlap scores (such as ROUGE) correlate poorly with clinical accuracy.
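The weakness of lexical overlap can be illustrated with a toy example. The sketch below uses a simplified unigram F1 score in the spirit of ROUGE-1; it is not the official ROUGE implementation and has nothing to do with how CheXprompt is implemented. The three sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over whitespace-separated lowercase tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from any dataset.
reference = "there is a small left pleural effusion"
contradicts = "there is no left pleural effusion"        # clinically wrong
paraphrase = "small effusion in the left pleural space"  # clinically right

print(f"contradiction: {unigram_f1(contradicts, reference):.3f}")
print(f"paraphrase:    {unigram_f1(paraphrase, reference):.3f}")
```

The negated sentence shares more words with the reference than the correct paraphrase does, so it scores higher even though it reverses the finding. A factuality-oriented metric is meant to penalize exactly this kind of error.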

Key Features

Lightweight and deployable: At 7B parameters, LLaVA-Rad runs inference on a single V100 GPU, lowering the barrier for on-premises clinical research compared to large closed multimodal models.
Domain-specific image encoder: It pairs a Vicuna-7B language backbone with BiomedCLIP-CXR, a chest-X-ray-specialized vision encoder built on the BiomedCLIP framework, rather than a general-purpose vision model.
Large multi-source training corpus: Trained on 697,435 image-report pairs drawn from seven datasets spanning the US, New Zealand, Brazil, Vietnam, Spain, and China, improving robustness across institutions and populations.
Factuality-aware evaluation: The accompanying CheXprompt metric uses GPT-4 to assess clinical correctness of findings, providing a more meaningful signal than lexical-overlap scores.
Publicly downloadable, research-only license: Code, model checkpoints, and the evaluation framework are available on GitHub and Hugging Face, but under the non-commercial Microsoft Research License (research use only, no redistribution, not OSI-approved), with additional LLaMA/Vicuna/GPT-4 term dependencies and explicit no-clinical-use terms. This is not an open-source release. (The Hugging Face "Apache-2.0" badge is misleading; the actual license tag is "other.")

Technical Details

LLaVA-Rad follows the LLaVA and LLaVA-Med architecture: image features from the BiomedCLIP-CXR vision encoder are projected into the token embedding space of a Vicuna-7B v1.5 language model via a learned projector. Training proceeds in stages, aligning the visual representation to the language model before fine-tuning on chest-X-ray report generation, with the projector and decoder layers trained on MIMIC-CXR data. When only structured labels were available for a source, GPT-4 was used to synthesize report-style text. The 697,435-pair corpus aggregates seven geographically diverse datasets.

On standard radiology report-generation benchmarks, LLaVA-Rad outperforms substantially larger models including GPT-4V and Med-PaLM M (84B), establishing state-of-the-art results on report generation and cross-modal retrieval despite its compact size.
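The projection step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the released implementation: the projector is shown as a single linear layer, the dimensions are tiny placeholders, the weights and features are random, and placing the image tokens before the text embeddings is an assumption about prompt layout. All names (`linear`, `project_patches`, `D_VISION`, `D_LM`) are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def project_patches(patch_features, weight, bias):
    """Map each vision-encoder patch vector into the language model's space."""
    return [linear(p, weight, bias) for p in patch_features]


random.seed(0)
D_VISION, D_LM = 4, 6      # toy sizes; real models use far larger dimensions
N_PATCHES, N_TEXT = 3, 2   # toy sequence lengths

# Stand-ins for the learned projector and for encoder / embedding outputs.
weight = [[random.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_LM)]
bias = [0.0] * D_LM
patch_features = [[random.random() for _ in range(D_VISION)]
                  for _ in range(N_PATCHES)]
text_embeddings = [[random.random() for _ in range(D_LM)]
                   for _ in range(N_TEXT)]

# Projected image tokens share the text embedding width, so the two
# sequences can be concatenated and fed to the decoder as one input.
image_tokens = project_patches(patch_features, weight, bias)
decoder_input = image_tokens + text_embeddings

print(len(decoder_input), len(decoder_input[0]))
```

The point of the sketch is the shape bookkeeping: once the projector maps vision features to the language model's embedding width, the decoder can treat image patches as additional tokens.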

Applications

LLaVA-Rad is intended as a research tool for automated chest-X-ray report drafting, cross-modal retrieval, and as a base model for further domain adaptation by hospitals and academic groups that lack the resources to deploy frontier multimodal systems. Its modest compute footprint makes it suitable for privacy-sensitive, on-premises experimentation. The authors are explicit that the model is for research only and must not be used for direct clinical care or diagnostic decision-making.

Impact

By demonstrating that a 7B-parameter model can surpass much larger proprietary systems on chest-X-ray reporting, LLaVA-Rad challenged the assumption that medical multimodal performance requires massive scale, and made high-quality radiology report generation accessible to the broader research community. Its release of code, weights, and the CheXprompt factuality metric provides a reusable foundation for benchmarking and extending medical vision-language models. The model sits alongside contemporaneous efforts such as Microsoft's MAIRA series, distinguished primarily by its lightweight and reproducible design. Two constraints on real-world deployment remain important: the research-only Microsoft Research License, which permits no commercial use or redistribution and bars clinical use, and the inherent risks of automated clinical text generation.

Citation

A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings

Chaves, J. M. Z., et al. (2025) A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings. Nature Communications.

DOI: 10.1038/s41467-025-58344-x

Recent citations

Papers that recently cited this model.

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models
Hyunjae Kim, Dain Kim, Pan Xiao, et al.
Jul 2026
Citations: 0

Multimodal AI in healthcare: Review of vision-language foundation models for real-world medical applications.
Taha Razzaq, Murtaza Taj, Asim Iqbal
Journal of Biomedical Informatics · Jul 2026
Citations: 0

Benchmarking Multimodal Large Language Models for Cardiopulmonary Findings on Chest Radiographs: Sex-Stratified Discrimination and Operating Characteristics
Matteo Haupt, Arne Bischoff, Myriam Atoubi, et al.
Diagnostics · Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Jiazhen Pan, Che Liu, Junde Wu, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Feb 2025
Citations: 171

MAIRA-2: Grounded Radiology Report Generation
Shruthi Bannur, Kenza Bouzid, D. C. Castro, et al.
arXiv.org · Jun 2024
Citations: 139 (Influential)

Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
L. Blankemeier, J. Cohen, Ashwin Kumar, et al.
Nature · Jun 2024
Citations: 128

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging
Noel C. F. Codella, Ying Jin, Shrey Jain, et al.
arXiv.org · Oct 2024
Citations: 43

X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Qianchu Liu, Sheng Zhang, Guanghui Qin, et al.
arXiv.org · May 2025
Citations: 28

Citations

Total Citations: 73
Influential: 5
References: 71

GitHub

Stars: 58
Forks: 13
Open Issues: 12
Contributors: 2
Last Push: 6mo ago
Language: Python

HuggingFace

Downloads: 524
Likes: 25
Last Modified: 2mo ago

Fields of citing research

Computer Science: 94%
Medicine: 93%
Engineering: 12%
Psychology: 3%
Linguistics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
35 · Closed
Usability (can I run it?): 32
Reproducibility (can I retrain it?): 22

Model Openness Framework: Unclassified
Restrictive license on core components

Resources

GitHub Repository · Research Paper · HuggingFace Model

