bio.rodeo

Language model, Pathology

LLaVA-Tri

UC Santa Cruz / Huazhong University of Science and Technology / Harvard University / Stanford University

A medical multimodal large language model pretrained on the 25M-image MedTrinity-25M dataset, achieving state-of-the-art accuracy on biomedical visual question answering.

Released: August 2024

LLaVA-Tri is a medical multimodal large language model that pairs a vision encoder with a large language model to answer questions and generate text about medical images. It was introduced in August 2024 by researchers at UC Santa Cruz (the VLAA lab), Huazhong University of Science and Technology, Harvard University, and Stanford University as the flagship demonstration model for MedTrinity-25M, a large-scale multimodal medical dataset, and the work was accepted to ICLR 2025.

The central contribution of the paper is MedTrinity-25M itself: a dataset of over 25 million medical images spanning 10 imaging modalities, with multigranular annotations covering more than 65 diseases. Rather than relying on scarce paired image-text reports, the authors built an automated pipeline that generates image-ROI-description triplets, attaching both global captions and localized region-of-interest descriptions to each image. LLaVA-Tri exists to show that this richly annotated corpus translates into a stronger medical vision-language model.
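To make the image-ROI-description triplet concrete, here is a minimal sketch of how one such record could be represented and flattened into a text target. The class names, field names, bounding-box encoding, and output format are all assumptions for illustration; they are not the dataset's published schema, and the example values are placeholders rather than real MedTrinity-25M content.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RegionOfInterest:
    # Bounding box as (x_min, y_min, x_max, y_max) in pixels.
    # This encoding is an assumption, not the dataset's documented schema.
    box: Tuple[int, int, int, int]
    description: str  # localized text about this region


@dataclass(frozen=True)
class Triplet:
    image_path: str               # hypothetical file path
    modality: str                 # imaging modality label
    global_caption: str           # whole-image description
    rois: List[RegionOfInterest]  # localized annotations


def to_training_text(t: Triplet) -> str:
    """Flatten one triplet into a single text target: global caption first, then each ROI."""
    lines = [f"[{t.modality}] {t.global_caption}"]
    for i, roi in enumerate(t.rois, start=1):
        lines.append(f"ROI {i} {roi.box}: {roi.description}")
    return "\n".join(lines)


# Placeholder values for illustration only; not taken from MedTrinity-25M.
example = Triplet(
    image_path="images/example_0001.png",
    modality="CT",
    global_caption="placeholder global caption",
    rois=[RegionOfInterest((10, 20, 110, 140), "placeholder region description")],
)
print(to_training_text(example))
```

Keeping the global caption and the per-region descriptions in one record is what lets a model see both granularities for the same image during training.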

Built on the LLaVA architecture and aligned to the medical domain through a tri-stage training recipe, LLaVA-Tri reports state-of-the-art results on the standard biomedical visual question answering (VQA) benchmarks. This places it alongside models such as LLaVA-Med among medical multimodal assistants, while drawing on substantially richer training supervision.

#Key Features

  • Tri-stage training recipe: Training proceeds through concept alignment on 600K image-text pairs from PMC-15M, multigranular alignment on MedTrinity-25M, and task-specific fine-tuning on individual VQA benchmarks.
  • Multigranular supervision: Leverages MedTrinity-25M's image-ROI-description triplets so the model learns both whole-image semantics and localized region-of-interest detail rather than image-level captions alone.
  • LLaMA3 language backbone: Integrates LLaMA3 to strengthen the model's linguistic reasoning over medical text compared to earlier LLaVA-Med releases.
  • Multiscale visual features: Incorporates multiscale feature extraction to improve performance on fine-grained visual findings across modalities.
  • Broad modality coverage: Trained on data spanning 10 imaging modalities, including radiology, pathology, and other clinical image types.
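The ordering of the tri-stage recipe can be sketched as a small pipeline. This is only an outline under stated assumptions: the stage labels, the `run_recipe` and `fake_train` helpers, and the idea that every benchmark fine-tune branches from one shared stage-two checkpoint are illustrative choices, not the authors' training code.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A "checkpoint" here is just the ordered history of (stage, data) steps that produced it.
Checkpoint = Tuple[Tuple[str, str], ...]
TrainFn = Callable[[str, str, Optional[Checkpoint]], Checkpoint]

# Shared stages, in order. Stage names are hypothetical labels.
SHARED_STAGES: List[Tuple[str, str]] = [
    ("concept_alignment", "PMC-15M (600K image-text pairs)"),
    ("multigranular_alignment", "MedTrinity-25M"),
]
BENCHMARKS = ["VQA-RAD", "SLAKE", "PathVQA"]


def run_recipe(train_fn: TrainFn) -> Dict[str, Checkpoint]:
    """Run the shared stages once, then fine-tune a separate copy per benchmark."""
    ckpt: Optional[Checkpoint] = None
    for stage, data in SHARED_STAGES:
        ckpt = train_fn(stage, data, ckpt)
    return {b: train_fn("task_finetuning", b, ckpt) for b in BENCHMARKS}


def fake_train(stage: str, data: str, ckpt: Optional[Checkpoint]) -> Checkpoint:
    """Stand-in for a real training loop: only records what would have run."""
    return (ckpt or ()) + ((stage, data),)


checkpoints = run_recipe(fake_train)
for name, history in checkpoints.items():
    print(name, "<-", " -> ".join(stage for stage, _ in history))
```

Swapping `fake_train` for a real training function would keep the same control flow: two alignment stages in sequence, followed by one fine-tuned checkpoint per benchmark.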

#Technical Details

LLaVA-Tri follows the LLaVA design: a vision encoder is connected to a LLaMA3 language model through a projection layer, with added multiscale feature extraction. Training has three stages. The first aligns visual and language representations on 600K image-text pairs drawn from PMC-15M. The second performs multigranular alignment on the 25M-image MedTrinity-25M corpus, which provides global captions plus region-of-interest descriptions for more than 65 diseases across 10 modalities. The third fine-tunes on each target benchmark (reported as 15 epochs). On the three standard biomedical VQA benchmarks, LLaVA-Tri reports 75.0% accuracy on VQA-RAD, 87.8% on SLAKE, and 65.3% on PathVQA, which was state of the art on all three at publication. The official code is released in the UCSC-VLAA GitHub repository, with per-benchmark checkpoints and the MedTrinity-25M dataset available on Hugging Face.
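The role of the projection layer in a LLaVA-style model can be illustrated with a dependency-free toy: visual features are mapped into the language model's embedding size and then joined with the text token embeddings. This is a sketch under assumptions, not the released implementation. The function names, the single linear map, the tiny dimensions, and placing visual tokens before the text are simplifications, and the multiscale feature extraction is not modeled.

```python
from typing import List

Vector = List[float]


def project(features: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply y = W x + b to each visual feature vector.

    weight has one row per output dimension (shape d_llm x d_v).
    """
    return [
        [sum(x * w for x, w in zip(feat, row)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_llm_input(visual_tokens: List[Vector], text_embeddings: List[Vector]) -> List[Vector]:
    """Place projected visual tokens ahead of the text embeddings (a simplification)."""
    return visual_tokens + text_embeddings


# Toy sizes: 2 image patches with d_v = 2, projected to d_llm = 3.
patch_features = [[1.0, 2.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
text_embeddings = [[0.1, 0.2, 0.3]] * 4  # 4 placeholder text-token embeddings

visual_tokens = project(patch_features, W, b)
sequence = build_llm_input(visual_tokens, text_embeddings)
print(visual_tokens)   # [[1.0, 2.0, 3.5], [0.0, 1.0, 1.5]]
print(len(sequence))   # 6
```

Because the projected visual tokens share the text embeddings' dimensionality, the language model can attend over image and text positions in a single sequence.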

#Applications

LLaVA-Tri targets medical image understanding tasks such as visual question answering, captioning, and report generation across radiology, pathology, and other clinical imaging modalities. It is primarily a research artifact demonstrating the value of large-scale multigranular annotation, useful to machine-learning researchers building medical vision-language assistants and to groups that want a strong baseline or starting checkpoint for medical VQA. As with all such models, outputs are not clinically validated and the model is intended for research rather than direct diagnostic use.

#Impact

The lasting contribution of this work is MedTrinity-25M, one of the largest multigranular medical multimodal datasets and a widely cited resource for training medical vision-language models. LLaVA-Tri demonstrates that the dataset's automated, ROI-aware annotation pipeline yields measurable downstream gains, setting state-of-the-art VQA accuracy and providing a reusable recipe for aligning general-purpose LLaVA-style models to medicine. Acceptance to ICLR 2025 and the public release of data, code, and checkpoints have made the resources a common foundation for subsequent medical multimodal research.

Citation

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Preprint

Xie, Y., et al. (2024) MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine. International Conference on Learning Representations.

DOI: 10.48550/arXiv.2408.02900

Citations

Total Citations: 84
Influential: 9
References: 76

GitHub

Stars: 409
Forks: 30
Open Issues: 11
Contributors: 1
Last Push: 11mo ago
Language: Python

HuggingFace

Downloads: 2
Likes: 0
Last Modified: 11mo ago

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 30 (Closed)
Usability (can I run it?): 24
Reproducibility (can I retrain it?): 23
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

foundation_model, histology, multimodal, radiology, report_generation, transformer, vision_transformer, visual_question_answering

Resources

GitHub Repository, Research Paper, Official Website, HuggingFace Model, Dataset