Lightweight 7B vision-language foundation model from Microsoft Research, released research-only under the Microsoft Research License, that generates radiology findings from chest X-rays.
LLaVA-Rad is a lightweight, publicly downloadable multimodal foundation model that generates radiology findings from chest X-rays. Given a frontal chest radiograph—and optionally a free-text reason for the exam—the model produces the "findings" section of a radiology report. It was developed by Microsoft Research with collaborators at the University of Washington, Stanford University, and other institutions, and was published in Nature Communications in 2025.
Automated report generation from medical images is a long-standing goal: radiologists face heavy reporting workloads, and draft findings could accelerate review. Large proprietary multimodal models such as GPT-4V and Med-PaLM M (84B parameters) have been applied to this task, but they are expensive, closed, and difficult to deploy in clinical settings constrained by privacy and compute. LLaVA-Rad targets this gap with a 7-billion-parameter model that runs inference on a single V100 GPU and can be trained on an 8×A100 cluster in roughly one day, making domain adaptation practical for individual institutions.
The work also introduces CheXprompt, an automated GPT-4-based metric for scoring the factual correctness of generated reports against ground truth, addressing the well-known limitation that lexical overlap scores (such as ROUGE) correlate poorly with clinical accuracy.
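The weakness of lexical overlap scores can be illustrated with a toy example. The sketch below is not CheXprompt and not an official ROUGE implementation: it is a minimal ROUGE-L F1 over whitespace tokens, and the three sentences are invented for illustration. It shows how a candidate that contradicts the reference can score far higher than one that loosely agrees with it in different words.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1: lowercased whitespace tokens, no stemming."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented sentences, for illustration only.
reference = "no pneumothorax or pleural effusion is seen"
wrong = "pneumothorax and pleural effusion is seen"       # contradicts the reference
paraphrase = "the lungs show neither collapse nor fluid"  # loosely agrees with it

score_wrong = rouge_l_f1(reference, wrong)            # high overlap, wrong meaning
score_paraphrase = rouge_l_f1(reference, paraphrase)  # no shared tokens

print(f"contradictory candidate: {score_wrong:.3f}")
print(f"paraphrased candidate:   {score_paraphrase:.3f}")
```

Dropping the single word "no" reverses the clinical meaning yet leaves most of the token sequence intact, which is the kind of error a factuality-oriented metric is meant to catch.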
LLaVA-Rad follows the LLaVA and LLaVA-Med architecture: image features from the BiomedCLIP-CXR vision encoder are projected into the token embedding space of a Vicuna-7B v1.5 language model via a learned projector. Training proceeds in stages: the visual representation is first aligned to the language model, and the model is then fine-tuned on chest-X-ray report generation, with the projector and decoder layers trained on MIMIC-CXR data. The training corpus of 697,435 image-report pairs aggregates seven geographically diverse datasets; for sources that provided only structured labels, GPT-4 was used to synthesize report-style text. On standard radiology report-generation benchmarks, LLaVA-Rad outperforms substantially larger models including GPT-4V and Med-PaLM M (84B), establishing state-of-the-art results on report generation and cross-modal retrieval despite its compact size.
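The projection step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: the two-layer MLP with GELU follows the LLaVA-1.5 style and is assumed here, the dimensions are toy values, the weights are random rather than learned, and all names (`project`, `init_linear`, `VISION_DIM`, `TEXT_DIM`) are hypothetical.

```python
import math
import random


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Affine map: weight is a list of output rows, each of length len(vec)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_linear(rng, in_dim, out_dim):
    """Random stand-in for learned parameters (assumption: small Gaussian init)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(patch_features, params):
    """Map each vision-encoder patch feature into the language model's embedding space."""
    (w1, b1), (w2, b2) = params
    projected = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, w1, b1)]
        projected.append(linear(hidden, w2, b2))
    return projected


# Toy sizes, assumed for illustration; real encoders and decoders are far larger.
VISION_DIM, TEXT_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

rng = random.Random(0)
params = (init_linear(rng, VISION_DIM, TEXT_DIM), init_linear(rng, TEXT_DIM, TEXT_DIM))

# Stand-ins for vision-encoder outputs and for embedded prompt tokens.
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)]

image_tokens = project(patch_features, params)
# The decoder sees projected image tokens and text embeddings as one sequence.
decoder_input = image_tokens + text_embeddings

print(len(decoder_input), "positions of width", len(decoder_input[0]))
```

The point of the sketch is the shape contract: after projection, each image patch has the same width as a text token embedding, so the language model can attend over both in a single sequence.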
LLaVA-Rad is intended as a research tool for automated chest-X-ray report drafting, cross-modal retrieval, and as a base model for further domain adaptation by hospitals and academic groups that lack the resources to deploy frontier multimodal systems. Its modest compute footprint makes it suitable for privacy-sensitive, on-premises experimentation. The authors are explicit that the model is for research only and must not be used for direct clinical care or diagnostic decision-making.
By demonstrating that a 7B-parameter model can surpass much larger proprietary systems on chest-X-ray reporting, LLaVA-Rad challenged the assumption that medical multimodal performance requires massive scale, and made high-quality radiology report generation accessible to the broader research community. Its release of code, weights, and the CheXprompt factuality metric provides a reusable foundation for benchmarking and extending medical vision-language models. The model sits alongside contemporaneous efforts such as Microsoft's MAIRA series, distinguished primarily by its lightweight and reproducible design. Two constraints on real-world deployment remain important: the research-only Microsoft Research License, which permits no commercial use or redistribution and bars clinical use, and the inherent risks of automated clinical text generation.
Chaves, J. M. Z., et al. (2025). A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings. Nature Communications.
DOI: 10.1038/s41467-025-58344-x