UC Santa Cruz / Huazhong University of Science and Technology / Harvard University / Stanford University
A medical multimodal large language model pretrained on the 25M-image MedTrinity-25M dataset, achieving state-of-the-art accuracy on biomedical visual question answering.
LLaVA-Tri is a medical multimodal large language model that pairs a vision encoder with a large language model to answer questions and generate text about medical images. It was introduced in August 2024 by researchers at UC Santa Cruz (the VLAA lab), Huazhong University of Science and Technology, Harvard University, and Stanford University as the flagship demonstration model for MedTrinity-25M, a large-scale multimodal medical dataset, and the work was accepted to ICLR 2025.
The central contribution of the paper is MedTrinity-25M itself: a dataset of over 25 million medical images spanning 10 imaging modalities, with multigranular annotations covering more than 65 diseases. Rather than relying on scarce paired image-text reports, the authors built an automated pipeline that generates image-ROI-description triplets, attaching both global captions and localized region-of-interest descriptions to each image. LLaVA-Tri exists to show that this richly annotated corpus translates into a stronger medical vision-language model.
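To make the triplet idea concrete, here is a minimal sketch of how one image-ROI-description record could be represented. The class name, field names, and bounding-box convention are assumptions for illustration, not the dataset's actual schema, and the example values are placeholders rather than real annotations.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triplet:
    """Hypothetical record for one image-ROI-description triplet."""
    image_path: str                     # path to the medical image (assumed field)
    modality: str                       # imaging modality label, e.g. "MRI"
    roi_box: Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels
    global_caption: str                 # image-level description
    roi_description: str                # localized description of the ROI


def roi_area(t: Triplet) -> int:
    """Area of the ROI box in pixels; 0 if the box is degenerate."""
    x0, y0, x1, y1 = t.roi_box
    return max(0, x1 - x0) * max(0, y1 - y0)


# Placeholder values only; not taken from MedTrinity-25M.
example = Triplet(
    image_path="scan_0001.png",
    modality="MRI",
    roi_box=(40, 60, 120, 160),
    global_caption="Placeholder global caption.",
    roi_description="Placeholder description of the highlighted region.",
)

print(roi_area(example))
```

Keeping the global caption and the ROI description as separate fields mirrors the two annotation granularities described above; the released data may encode regions differently (for example as masks rather than boxes).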
LLaVA-Tri is built on the LLaVA architecture and aligned to the medical domain through a tri-stage training recipe. At the time of publication it reported state-of-the-art results on the standard biomedical visual question answering (VQA) benchmarks. This places it alongside models such as LLaVA-Med in the medical multimodal assistant landscape, with substantially richer training supervision.
LLaVA-Tri follows the LLaVA design: a vision encoder is connected to a LLaMA3 language model through a projection layer, with added multiscale feature extraction. Training proceeds in three stages. The first stage aligns visual and language representations on 600K image-text pairs drawn from PMC-15M. The second stage performs multigranular alignment on the 25M-image MedTrinity-25M corpus, which provides global captions plus region-of-interest descriptions for more than 65 diseases across 10 modalities. The final stage fine-tunes on each target benchmark (reported as 15 epochs). On the three standard biomedical VQA benchmarks, LLaVA-Tri reports 75.0% accuracy on VQA-RAD, 87.8% on SLAKE, and 65.3% on PathVQA, which was state of the art on all three at publication. The official code is released under the UCSC-VLAA GitHub organization, with per-benchmark checkpoints and the MedTrinity-25M dataset available on Hugging Face.
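The three-stage recipe can be summarized as a small schedule object. This is a sketch under stated assumptions: the `Stage` class, the stage names, and `build_schedule` are hypothetical and do not come from the official code. Only the data sources, the 600K pair count, and the 15 fine-tuning epochs are taken from the description above; values the text does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one training stage."""
    name: str                   # illustrative stage label (assumed naming)
    data: str                   # data source named in the text
    num_samples: Optional[int]  # None where the text gives no single count
    epochs: Optional[int]       # None where the text does not state epochs


def build_schedule(benchmark: str) -> List[Stage]:
    """Return the three stages in order for a given target benchmark."""
    return [
        # Stage 1: align visual and language representations.
        Stage("feature_alignment", "PMC-15M subset", 600_000, None),
        # Stage 2: multigranular alignment; the count is approximate
        # ("over 25 million images").
        Stage("multigranular_alignment", "MedTrinity-25M", 25_000_000, None),
        # Stage 3: per-benchmark fine-tuning, reported as 15 epochs.
        Stage("benchmark_finetuning", benchmark, None, 15),
    ]


for stage in build_schedule("VQA-RAD"):
    print(stage.name, "on", stage.data)
```

Because the last stage is parameterized by benchmark, calling `build_schedule` once each for VQA-RAD, SLAKE, and PathVQA reflects the per-benchmark checkpoints mentioned above.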
LLaVA-Tri targets medical image understanding tasks such as visual question answering, captioning, and report generation across radiology, pathology, and other clinical imaging modalities. It is primarily a research artifact demonstrating the value of large-scale multigranular annotation, useful to machine-learning researchers building medical vision-language assistants and to groups that want a strong baseline or starting checkpoint for medical VQA. As with all such models, outputs are not clinically validated and the model is intended for research rather than direct diagnostic use.
The lasting contribution of this work is MedTrinity-25M, one of the largest multigranular medical multimodal datasets and a widely cited resource for training medical vision-language models. LLaVA-Tri demonstrates that the dataset's automated, ROI-aware annotation pipeline yields measurable downstream gains: it reached state-of-the-art VQA accuracy at publication and provides a reusable recipe for aligning general-purpose LLaVA-style models to medicine. Acceptance to ICLR 2025 and the public release of data, code, and checkpoints have made the resources a common foundation for subsequent medical multimodal research.
Xie, Y., et al. (2024). MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine. International Conference on Learning Representations.
DOI: 10.48550/arXiv.2408.02900