LLaVA-Med

Biomedical vision-language assistant for question answering on radiology and pathology images, adapted from LLaVA on PubMed Central captions.

Released: June 2023

Parameters: 7 Billion

LLaVA-Med (Large Language-and-Vision Assistant for Biomedicine) is a multimodal conversational model that answers open-ended questions about biomedical images such as histopathology slides, radiographs, CT and MRI scans, and gross pathology photographs. Introduced by Microsoft Research in June 2023 and published at NeurIPS 2023, it adapts the general-domain LLaVA vision-language assistant to medicine, aiming to bring GPT-4-style multimodal conversation into a domain where general models struggle because they have never seen the specialized vocabulary or imaging modalities.

The central contribution is a cost-efficient recipe for domain adaptation. Rather than training from scratch, the authors leverage PMC-15M—a large-scale collection of 15 million figure-caption pairs extracted from PubMed Central articles—and use GPT-4 to generate biomedical instruction-following dialogue from the captions. The full medical assistant can be trained in under 15 hours on eight A100 GPUs, which the paper's title highlights as training "in one day." This made high-quality biomedical multimodal assistants reproducible on modest academic budgets.

LLaVA-Med sits at the intersection of biomedical pathology/imaging and large language models, and it became an influential reference point for the wave of open biomedical vision-language models that followed. A later v1.5 checkpoint rebuilt the assistant on Mistral-7B-Instruct.

Key Features

Curriculum (two-stage) training: The model first aligns biomedical visual concepts with language using figure-caption pairs, then learns conversational reasoning from GPT-4-generated instruction data—mirroring how a person studies a new field before applying it.
GPT-4-generated instruction data: Self-instruct-style multimodal dialogues are synthesized from PubMed Central captions, removing the need for costly expert annotation to build instruction-following data.
Broad modality coverage: Training figures span histology, radiology (X-ray, CT, MRI), gross pathology, and other common biomedical image types.
Efficient adaptation: Full training completes in under 15 hours on eight A100 GPUs, making the recipe accessible to academic labs.
Open release: Code, instruction-tuning data, and model weights are released, with an updated LLaVA-Med v1.5 checkpoint built on Mistral-7B available on Hugging Face.

Technical Details

LLaVA-Med inherits the LLaVA architecture: a CLIP-style vision encoder produces image features that a projection layer maps into the embedding space of a large language model (Vicuna in the original release; Mistral-7B-Instruct-v0.2 in v1.5, roughly 7B parameters). Stage one (concept alignment) trains only the projection on roughly 500K figure-caption pairs sampled from PMC-15M, teaching the model to ground biomedical visual concepts. Stage two fine-tunes on about 60K GPT-4- generated multi-round instruction-following conversations, giving the assistant open-ended dialogue ability. On three established biomedical visual question answering benchmarks—VQA-RAD, SLAKE, and PathVQA—a fine-tuned LLaVA-Med matches or exceeds prior supervised state-of-the-art on several metrics, particularly for open-set (free-form) questions. The repository also ships a GPT-assisted evaluation pipeline and a Gradio web UI.

Applications

LLaVA-Med is intended as a research tool and conversational assistant for biomedical image understanding: answering questions about pathology and radiology figures, captioning biomedical images, and serving as a strong baseline or starting checkpoint for downstream medical VQA and report-generation systems. Researchers benefit from its open weights and data recipe, which let them reproduce or extend domain-specific multimodal assistants. Microsoft explicitly restricts the released model to research and reproducibility, prohibiting clinical care or clinical decision-making use.

Impact

LLaVA-Med demonstrated that a capable biomedical multimodal assistant could be built cheaply by combining web-scale figure-caption data with synthetic GPT-4-generated instructions, and its recipe was widely adopted and extended by subsequent open biomedical vision-language models. As one of the earliest openly released medical instruction-tuned multimodal models, it became a common baseline in medical VQA research. Its main limitations are honestly noted by the authors: it is English-only, evaluated on a narrow set of benchmarks, can hallucinate, may inherit biases from the academic-publication distribution of PMC-15M, and is not validated or approved for clinical use.

Citation

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Preprint

Li, C., et al. (2023) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. Neural Information Processing Systems.

DOI: 10.48550/arXiv.2306.00890

Recent citations

Papers that recently cited this model.

Hierarchical multi-type annotation fusion with uncertainty-aware cross-attention for chest X-ray classification
S. Thota, Fayadh S. Alenezi, Kemal Polat, et al.
Applied Soft Computing · Oct 2026
0
Beyond textual rationales: Anatomy-grounded chain-of-thought for traceable radiology reasoning
Shengzhi Wang, Kai Wu, Jun Yang, et al.
Knowledge-Based Systems · Sep 2026
0
HAL: Accurate, Private, and Efficient Sample Alignment for Multimodal Federated Learning
Xiaokai Zhou, Xiao Yan, Xinyan Li, et al.
2026
0

Top citations

The most-cited papers that cite this model.

Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, et al.
Computer Vision and Pattern Recognition · Oct 2023
5.2K
A survey on multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, et al.
National Science Review · Jun 2023
1.4KInfluential
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, et al.
arXiv.org · Jun 2024
760
DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model
Zhenhua Xu, Yujia Zhang, Enze Xie, et al.
IEEE Robotics and Automation Letters · Oct 2023
655
Med-Flamingo: a Multimodal Medical Few-shot Learner
Michael Moor, Qian Huang, Shirley Wu, et al.
ML4H@NeurIPS · Jul 2023
566Influential

Citations

Total Citations1.9K

Influential240

References56

GitHub

Stars2.2K

Forks292

Open Issues107

Contributors7

Last Push1y ago

LanguagePython

HuggingFace

Downloads12.2K

Likes125

Last Modified8mo ago

Pipelineimage-text-to-text

Fields of citing research

Computer Science27%
Medicine19%
Engineering4%
Biology1%
Linguistics1%
Environmental Science0%
Physics0%
Agricultural and Food Sciences0%

Share of papers citing this model.

Openness

bio.rodeo opennessClosed · low usability and reproducibility

28Closed

Usability — can I run it?27

Reproducibility — can I retrain it?13

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository Research Paper HuggingFace Model Documentation

Key Features

Curriculum (two-stage) training: The model first aligns biomedical visual concepts with language using figure-caption pairs, then learns conversational reasoning from GPT-4-generated instruction data—mirroring how a person studies a new field before applying it.

GPT-4-generated instruction data: Self-instruct-style multimodal dialogues are synthesized from PubMed Central captions, removing the need for costly expert annotation to build instruction-following data.

Broad modality coverage: Training figures span histology, radiology (X-ray, CT, MRI), gross pathology, and other common biomedical image types.

Efficient adaptation: Full training completes in under 15 hours on eight A100 GPUs, making the recipe accessible to academic labs.

Open release: Code, instruction-tuning data, and model weights are released, with an updated LLaVA-Med v1.5 checkpoint built on Mistral-7B available on Hugging Face.

Technical Details

Applications

Impact

Top citations

The most-cited papers that cite this model.

LLaVA-Med

#Key Features

#Technical Details

#Applications

#Impact

Citation

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Recent citations

Top citations

Related models

Citations

GitHub

HuggingFace

Fields of citing research

Openness

Tags

Resources

LLaVA-Med

#Key Features

#Technical Details

#Applications

#Impact

Citation

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Recent citations

Top citations

Related models

Citations

GitHub

HuggingFace

Fields of citing research

Openness

Tags

Resources

Key Features

Technical Details

Applications

Impact

Key Features

Technical Details

Applications

Impact