LLaVA-Tri

UC Santa Cruz / Huazhong University of Science and Technology / Harvard University / Stanford University

Medical vision-language model trained on the MedTrinity-25M dataset, answering questions and generating text about radiology and histology images.

Released: August 2024

LLaVA-Tri is a medical multimodal large language model that pairs a vision encoder with a large language model to answer questions and generate text about medical images. Researchers at UC Santa Cruz (the VLAA lab), Huazhong University of Science and Technology, Harvard University, and Stanford University introduced it in August 2024 as the flagship demonstration model for MedTrinity-25M, a large-scale multimodal medical dataset. The work was accepted to ICLR 2025.

The central contribution of the paper is MedTrinity-25M itself: a dataset of over 25 million medical images spanning 10 imaging modalities, with multigranular annotations covering more than 65 diseases. Rather than relying on scarce paired image-text reports, the authors built an automated pipeline that generates image-ROI-description triplets, attaching both global captions and localized region-of-interest descriptions to each image. LLaVA-Tri exists to show that this richly annotated corpus translates into a stronger medical vision-language model.
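To make the image-ROI-description triplet concrete, here is a minimal sketch of how one such record might be represented. The class names, field names, bounding-box layout, and sample values are all assumptions for illustration; the actual MedTrinity-25M schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RegionOfInterest:
    # Bounding box as (x_min, y_min, x_max, y_max) in pixels.
    # This layout is an assumption, not the dataset's documented format.
    box: Tuple[int, int, int, int]
    description: str  # localized, region-level description


@dataclass
class Triplet:
    image_path: str      # hypothetical path to the image file
    modality: str        # e.g. "CT", "MRI", "histopathology"
    global_caption: str  # whole-image description
    rois: List[RegionOfInterest] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten the global caption and ROI descriptions into one string."""
        parts = [self.global_caption]
        for roi in self.rois:
            parts.append(f"Region {roi.box}: {roi.description}")
        return " ".join(parts)


# Made-up sample record, not taken from the dataset.
sample = Triplet(
    image_path="example/ct_0001.png",
    modality="CT",
    global_caption="Axial chest CT slice.",
    rois=[RegionOfInterest((120, 80, 180, 140), "Rounded opacity in the left upper lobe.")],
)
print(sample.to_text())
```

Keeping the global caption and the ROI descriptions as separate fields mirrors the "multigranular" idea: a training pipeline can use either level alone or combine them.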

Built on the LLaVA architecture and adapted to the medical domain through a tri-stage training recipe, LLaVA-Tri reported state-of-the-art results at publication on three standard biomedical visual question answering (VQA) benchmarks: VQA-RAD, SLAKE, and PathVQA. It sits alongside models such as LLaVA-Med in the medical multimodal assistant landscape, but is trained with substantially richer supervision.

Key Features

Tri-stage training recipe: Training proceeds through concept alignment on 600K image-text pairs from PMC-15M, multigranular alignment on MedTrinity-25M, and task-specific fine-tuning on individual VQA benchmarks.
Multigranular supervision: Leverages MedTrinity-25M's image-ROI-description triplets so the model learns both whole-image semantics and localized region-of-interest detail rather than image-level captions alone.
LLaMA3 language backbone: Integrates LLaMA3 to strengthen the model's linguistic reasoning over medical text compared to earlier LLaVA-Med releases.
Multiscale visual features: Incorporates multiscale feature extraction to improve performance on fine-grained visual findings across modalities.
Broad modality coverage: Trained on data spanning 10 imaging modalities, including radiology, pathology, and other clinical image types.
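The tri-stage recipe listed above can be sketched as an ordered sequence of stages, each starting from the previous stage's checkpoint. This is only a structural sketch: `Stage`, `run_recipe`, and `fake_train` are hypothetical names, and the stand-in training function does no real training. Only the stage order, the data sources, and the 600K pair count come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    goal: str
    num_pairs: Optional[int] = None  # filled in only where the text states a count


STAGES: List[Stage] = [
    Stage("concept_alignment", "PMC-15M subset",
          "align visual and language representations", 600_000),
    Stage("multigranular_alignment", "MedTrinity-25M",
          "learn global captions and ROI-level descriptions"),
    Stage("task_finetuning", "target VQA benchmark",
          "specialize on a single benchmark"),
]


def run_recipe(stages: List[Stage], train_fn: Callable[[str, Stage], str]) -> str:
    """Run stages in order, feeding each stage's checkpoint into the next."""
    checkpoint = "base"
    for stage in stages:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint


def fake_train(checkpoint: str, stage: Stage) -> str:
    # Stand-in for a real training loop: it only records the lineage.
    return f"{checkpoint}->{stage.name}"


final = run_recipe(STAGES, fake_train)
print(final)
```

Because the last stage is run once per benchmark, a real pipeline would likely branch after the second stage and produce one checkpoint per target dataset.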

Technical Details

LLaVA-Tri follows the LLaVA design: a vision encoder is connected to a LLaMA3 language model through a projection layer, with added multiscale feature extraction.

Training has three stages. The first aligns visual and language representations on 600K image-text pairs drawn from PMC-15M. The second performs multigranular alignment on the MedTrinity-25M corpus, which provides global captions plus region-of-interest descriptions for more than 65 diseases across 10 modalities. The final stage fine-tunes on each target benchmark (reported as 15 epochs).

On the three standard biomedical VQA benchmarks, LLaVA-Tri reports 75.0% accuracy on VQA-RAD, 87.8% on SLAKE, and 65.3% on PathVQA, state-of-the-art on all three at publication.

The official code is released on GitHub under UCSC-VLAA, and per-benchmark checkpoints and the MedTrinity-25M dataset are available on Hugging Face.
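For readers unfamiliar with how VQA accuracy figures like those above are typically computed, the following is a minimal sketch of exact-match accuracy over normalized answers. It is an assumption that this resembles the evaluation used for LLaVA-Tri; the paper's protocol (for example, how open-ended and closed-ended questions are handled) may differ. The function names and sample answers are made up for illustration.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up answers, not benchmark data.
preds = ["Yes", "no.", "left lung", "Yes"]
refs = ["yes", "no", "right lung", "no"]
print(accuracy(preds, refs))  # 50.0
```

Exact match is a strict criterion: "left lung" versus "right lung" counts as fully wrong, which is usually the desired behavior for clinical questions.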

Applications

LLaVA-Tri targets medical image understanding tasks such as visual question answering, captioning, and report generation across radiology, pathology, and other clinical imaging modalities. It is primarily a research artifact demonstrating the value of large-scale multigranular annotation, useful to machine-learning researchers building medical vision-language assistants and to groups that want a strong baseline or starting checkpoint for medical VQA. As with all such models, outputs are not clinically validated and the model is intended for research rather than direct diagnostic use.

Impact

The lasting contribution of this work is MedTrinity-25M, one of the largest multigranular medical multimodal datasets and a widely cited resource for training medical vision-language models. LLaVA-Tri demonstrates that the dataset's automated, ROI-aware annotation pipeline yields measurable downstream gains: it set state-of-the-art VQA accuracy at publication and provides a reusable recipe for aligning general-purpose LLaVA-style models to medicine. Acceptance to ICLR 2025 and the public release of data, code, and checkpoints have made the resources a common foundation for subsequent medical multimodal research.

Citation

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Preprint

Xie, Y., et al. (2024) MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine. International Conference on Learning Representations.

DOI: 10.48550/arXiv.2408.02900

Recent citations

Papers that recently cited this model.

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models
Zhuoyuan Fu, Zeshang Li, Yiqiong Zhang, et al. · Jul 2026 · Citations: 0

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning
Junha Jung, Minbyul Jeong, Suhyeon Lim, et al. · Jun 2026 · Citations: 1

Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning
Kaitao Chen, Weiqian Zhao, Jiamin Wu, et al. · Jun 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He, Rui Mao, Qika Lin, et al. · Information Fusion · Oct 2023 · Citations: 328

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, X. Liu, et al. · Information Fusion · May 2024 · Citations: 116

MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
Juncheng Wu, Wenlong Deng, Xingxuan Li, et al. · arXiv.org · Apr 2025 · Citations: 93

Large-vocabulary segmentation for medical images with text prompts
Ziheng Zhao, Yao Zhang, Chaoyi Wu, et al. · npj Digital Medicine · Dec 2023 · Citations: 77

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
Vishwesh Nath, Wenqi Li, Dong Yang, et al. · Computer Vision and Pattern Recognition · Nov 2024 · Citations: 61

Citations

Total Citations: 94
Influential: 12
References: 76

GitHub

Stars: 413
Forks: 31
Open Issues: 11
Contributors: 1
Last Push: 1y ago
Language: Python

HuggingFace

Downloads: 4
Likes: 0
Last Modified: 1y ago

Fields of citing research

Computer Science: 100%
Medicine: 89%
Engineering: 14%
Mathematics: 1%
Education: 1%
Biology: 1%
Linguistics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Overall: 30 (Closed)
Usability (can I run it?): 24
Reproducibility (can I retrain it?): 23

Model Openness Framework: Unclassified · Restrictive license on core components

Resources

GitHub Repository · Research Paper · Official Website · HuggingFace Model · Dataset
