bio.rodeo
© 2026 Pulsatance. All rights reserved.
Imaging · Language model

LLaVA-Rad

Microsoft Research

Lightweight 7B vision-language foundation model from Microsoft Research, released research-only under the Microsoft Research License, that generates radiology findings from chest X-rays.

Released: February 2025
Parameters: 7 billion

LLaVA-Rad is a lightweight, publicly downloadable multimodal foundation model that generates radiology findings from chest X-rays. Given a frontal chest radiograph—and optionally a free-text reason for the exam—the model produces the "findings" section of a radiology report. It was developed by Microsoft Research with collaborators at the University of Washington, Stanford University, and other institutions, and was published in Nature Communications in 2025.

Automated report generation from medical images is a long-standing goal: radiologists face heavy reporting workloads, and draft findings could accelerate review. Large proprietary multimodal models such as GPT-4V and Med-PaLM M (84B parameters) had been applied to this task, but they are expensive, closed, and difficult to deploy in clinical settings constrained by privacy and compute. LLaVA-Rad targets this gap with a 7-billion-parameter model that runs inference on a single V100 GPU and can be trained on an 8×A100 cluster in roughly one day, making domain adaptation practical for individual institutions.
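As a rough sanity check on the single-V100 claim, the sketch below estimates the memory needed just to hold 7 billion weights at a few common precisions. The bytes-per-parameter values and the 16 GB and 32 GB V100 capacities are assumptions not stated on this page. The estimate ignores the vision encoder, activations, and the KV cache, so it should be read as a lower bound rather than a measured footprint.

```python
# Back-of-envelope weight-memory estimate; a lower bound, not a measurement.
PARAMS = 7_000_000_000  # nominal "7B" parameter count quoted above

# Assumed storage width per parameter, in bytes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# V100 memory variants in GB (assumption: decimal gigabytes for simplicity).
V100_GB = (16, 32)


def weight_memory_gb(params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for precision, width in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(PARAMS, width)
    fits = [cap for cap in V100_GB if gb < cap]
    print(f"{precision}: {gb:.0f} GB of weights, fits on V100 variants: {fits}")
```

Under these assumptions, half-precision weights come to about 14 GB, which is consistent with single-GPU inference, while full-precision weights would need the larger card.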

The work also introduces CheXprompt, an automated GPT-4-based metric for scoring the factual correctness of generated reports against ground truth, addressing the well-known limitation that lexical overlap scores (such as ROUGE) correlate poorly with clinical accuracy.
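The weakness of lexical overlap can be seen in a toy example. The sketch below uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (it is not the official ROUGE implementation, and it is not CheXprompt). The three sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (ROUGE-1-like); not the official ROUGE."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for this illustration.
reference = "no pleural effusion or pneumothorax"
contradiction = "pleural effusion and pneumothorax"  # opposite clinical meaning
paraphrase = "neither pneumothorax nor fluid in the pleural space is seen"

print(round(unigram_f1(contradiction, reference), 2))  # 0.67
print(round(unigram_f1(paraphrase, reference), 2))     # 0.27
```

The clinically contradictory sentence scores far higher than the faithful paraphrase, which is the kind of failure a factuality-oriented metric is meant to avoid.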

#Key Features

  • Lightweight and deployable: At 7B parameters, LLaVA-Rad runs inference on a single V100 GPU, lowering the barrier for on-premises clinical research compared to large closed multimodal models.
  • Domain-specific image encoder: It pairs a Vicuna-7B language backbone with BiomedCLIP-CXR, a chest-X-ray-specialized vision encoder built on the BiomedCLIP framework, rather than a general-purpose vision model.
  • Large multi-source training corpus: Trained on 697,435 image-report pairs drawn from seven datasets spanning the US, New Zealand, Brazil, Vietnam, Spain, and China, improving robustness across institutions and populations.
  • Factuality-aware evaluation: The accompanying CheXprompt metric uses GPT-4 to assess clinical correctness of findings, providing a more meaningful signal than lexical-overlap scores.
  • Publicly downloadable, research-only license: Code, model checkpoints, and the evaluation framework are available on GitHub and Hugging Face, but under the non-commercial Microsoft Research License (research use only, no redistribution, not OSI-approved). The release is also subject to the terms of LLaMA, Vicuna, and GPT-4 and explicitly prohibits clinical use, so it is not an open-source release. (The Hugging Face "Apache-2.0" badge is misleading; the actual license tag is "other.")

#Technical Details

LLaVA-Rad follows the LLaVA and LLaVA-Med architecture: image features from the BiomedCLIP-CXR vision encoder are projected into the token embedding space of a Vicuna-7B v1.5 language model via a learned projector. Training proceeds in stages, aligning the visual representation to the language model before fine-tuning on chest-X-ray report generation, with the projector and decoder layers trained on MIMIC-CXR data. When only structured labels were available for a source, GPT-4 was used to synthesize report-style text. The 697,435-pair corpus aggregates seven geographically diverse datasets. On standard radiology report-generation benchmarks, LLaVA-Rad outperforms substantially larger models including GPT-4V and Med-PaLM M (84B), establishing state-of-the-art results on report generation and cross-modal retrieval despite its compact size.
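The projection step described above can be sketched in plain Python. This is a minimal illustration with toy dimensions and random weights, not the released implementation. The two-layer MLP with GELU mirrors the LLaVA-1.5 projector design and is assumed here rather than confirmed by this page. The token counts, the ordering of image and text tokens, and all function names are hypothetical.

```python
import math
import random

random.seed(0)

# Toy sizes for readability; the real encoder and Vicuna-7B widths are far larger.
NUM_PATCHES, VISION_DIM, LM_DIM = 4, 6, 8


def random_matrix(rows: int, cols: int) -> list[list[float]]:
    """Small random matrix standing in for features or learned weights."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain (rows x inner) @ (inner x cols) matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def project(patch_features, w1, w2):
    """Two-layer MLP projector: vision features -> language-model embedding space."""
    hidden = [[gelu(v) for v in row] for row in matmul(patch_features, w1)]
    return matmul(hidden, w2)


# Stand-ins for the vision encoder output and the embedded text prompt.
patch_features = random_matrix(NUM_PATCHES, VISION_DIM)
text_embeddings = random_matrix(3, LM_DIM)  # 3 hypothetical prompt tokens

w1 = random_matrix(VISION_DIM, LM_DIM)
w2 = random_matrix(LM_DIM, LM_DIM)

image_tokens = project(patch_features, w1, w2)

# The decoder consumes projected image tokens alongside the text embeddings.
decoder_input = image_tokens + text_embeddings
print(len(decoder_input), len(decoder_input[0]))  # 7 8
```

After projection, each image patch has the same width as a text token embedding, so the language model can attend over both in a single sequence.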

#Applications

LLaVA-Rad is intended as a research tool for automated chest-X-ray report drafting and cross-modal retrieval, and as a base model for further domain adaptation by hospitals and academic groups that lack the resources to deploy frontier multimodal systems. Its modest compute footprint makes it suitable for privacy-sensitive, on-premises experimentation. The authors are explicit that the model is for research only and must not be used for direct clinical care or diagnostic decision-making.

#Impact

By demonstrating that a 7B-parameter model can surpass much larger proprietary systems on chest-X-ray reporting, LLaVA-Rad challenged the assumption that medical multimodal performance requires massive scale, and made high-quality radiology report generation accessible to the broader research community. Its release of code, weights, and the CheXprompt factuality metric provides a reusable foundation for benchmarking and extending medical vision-language models. The model sits alongside contemporaneous efforts such as Microsoft's MAIRA series, distinguished primarily by its lightweight and reproducible design. Its research-only Microsoft Research License (which permits no commercial use or redistribution and bars clinical use) and the inherent risks of automated clinical text generation nevertheless remain important constraints on real-world deployment.

Citation

A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings

Chaves, J. M. Z., et al. (2025) A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings. Nature Communications.

DOI: 10.1038/s41467-025-58344-x


Citations

Total Citations: 63
Influential: 4
References: 71

GitHub

Stars: 58
Forks: 13
Open Issues: 12
Contributors: 2
Last Push: 4mo ago
Language: Python

HuggingFace

Downloads: 1.2K
Likes: 24
Last Modified: 26d ago


Openness

bio.rodeo openness: 35 (Closed · low usability and reproducibility)
Usability (can I run it?): 32
Reproducibility (can I retrain it?): 22
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

chest_x_ray, foundation_model, image_text_retrieval, multimodal, radiology, report_generation, transformer, vision_transformer
