A biomedical vision-language assistant from Microsoft Research, adapted from LLaVA via curriculum learning on PubMed Central figure-caption pairs and GPT-4-generated instructions.
LLaVA-Med (Large Language-and-Vision Assistant for Biomedicine) is a multimodal conversational model that answers open-ended questions about biomedical images such as histopathology slides, radiographs, CT and MRI scans, and gross pathology photographs. Introduced by Microsoft Research in June 2023 and published at NeurIPS 2023, it adapts the general-domain LLaVA vision-language assistant to medicine. The goal is GPT-4-style multimodal conversation in a domain where general-purpose models struggle, because their training data contains little of the specialized vocabulary and few of the imaging modalities.
The central contribution is a cost-efficient recipe for domain adaptation. Rather than training from scratch, the authors draw on PMC-15M, a collection of 15 million figure-caption pairs extracted from PubMed Central articles, and use GPT-4 to generate biomedical instruction-following dialogue from the captions. The full assistant can be trained in under 15 hours on eight A100 GPUs (fewer than 120 GPU-hours), which the paper's title highlights as training "in one day." That budget made a capable biomedical multimodal assistant reproducible for academic groups with access to eight such GPUs.
LLaVA-Med sits at the intersection of biomedical imaging (including pathology) and large language models, and it became an influential reference point for the wave of open biomedical vision-language models that followed. A later v1.5 checkpoint rebuilt the assistant on Mistral-7B-Instruct.
LLaVA-Med inherits the LLaVA architecture: a CLIP-style vision encoder produces image features that a projection layer maps into the embedding space of a large language model (Vicuna in the original release; Mistral-7B-Instruct-v0.2 in v1.5, roughly 7B parameters).
Training follows a two-stage curriculum. Stage one (concept alignment) trains only the projection on roughly 500K figure-caption pairs sampled from PMC-15M, teaching the model to ground biomedical visual concepts. Stage two (instruction tuning) fine-tunes on about 60K GPT-4-generated multi-round instruction-following conversations, giving the assistant open-ended dialogue ability.
On three established biomedical visual question answering benchmarks (VQA-RAD, SLAKE, and PathVQA), a fine-tuned LLaVA-Med matches or exceeds prior supervised state-of-the-art on several metrics, particularly for open-set (free-form) questions. The repository also ships a GPT-assisted evaluation pipeline and a Gradio web UI.
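The two training stages can be sketched as a freeze schedule over the three components named above. This is a minimal illustration in plain Python, not the repository's training code: the module names, the `Stage` class, and the helper functions are hypothetical. The text only states that stage one trains the projection alone; the choice to update both the projection and the language model in stage two, with the vision encoder kept frozen, is an assumption based on the usual LLaVA setup.

```python
from dataclasses import dataclass

# The three components named in the text, in data-flow order.
# These names are illustrative, not identifiers from the LLaVA-Med code.
MODULES = ("vision_encoder", "projection", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    approx_examples: int
    trainable: frozenset  # names of modules whose weights are updated


CURRICULUM = (
    Stage(
        name="concept_alignment",
        data="PMC-15M figure-caption pairs",
        approx_examples=500_000,
        # Stated in the text: only the projection is trained here.
        trainable=frozenset({"projection"}),
    ),
    Stage(
        name="instruction_tuning",
        data="GPT-4-generated multi-round conversations",
        approx_examples=60_000,
        # Assumption: projection and language model are both updated,
        # while the vision encoder stays frozen.
        trainable=frozenset({"projection", "language_model"}),
    ),
)


def frozen_modules(stage):
    """Return the modules kept fixed in a stage, in data-flow order."""
    return [m for m in MODULES if m not in stage.trainable]


def describe(curriculum):
    """Build one human-readable summary line per stage."""
    return [
        f"stage {i} ({s.name}): ~{s.approx_examples:,} examples, "
        f"train {sorted(s.trainable)}, freeze {frozen_modules(s)}"
        for i, s in enumerate(curriculum, start=1)
    ]


if __name__ == "__main__":
    for line in describe(CURRICULUM):
        print(line)
```

In a real training script, the same schedule would typically be applied by toggling gradient updates on the corresponding parameter groups before each stage.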
LLaVA-Med is intended as a research tool and conversational assistant for biomedical image understanding: answering questions about pathology and radiology figures, captioning biomedical images, and serving as a strong baseline or starting checkpoint for downstream medical VQA and report-generation systems. Researchers benefit from its open weights and data recipe, which let them reproduce or extend domain-specific multimodal assistants. Microsoft explicitly restricts the released model to research and reproducibility purposes and prohibits its use in clinical care or clinical decision-making.
LLaVA-Med demonstrated that a capable biomedical multimodal assistant could be built cheaply by combining web-scale figure-caption data with synthetic GPT-4-generated instructions, and its recipe was widely adopted and extended by subsequent open biomedical vision-language models. As one of the earliest openly released medical instruction-tuned multimodal models, it became a common baseline in medical VQA research. The authors note several limitations: the model is English-only, is evaluated on a narrow set of benchmarks, can hallucinate, may inherit biases from the academic-publication distribution of PMC-15M, and is neither validated nor approved for clinical use.
Li, C., et al. (2023) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. Neural Information Processing Systems.
DOI: 10.48550/arXiv.2306.00890