bio.rodeo

Pathology · Language model

MedVInT

Shanghai Jiao Tong University / Shanghai AI Laboratory

A generative medical visual question answering model that aligns a medical vision encoder with a large language model, trained on the 227k-pair PMC-VQA dataset.

Released: May 2023

MedVInT (Medical Visual Instruction Tuning) is a generative foundation model for medical visual question answering (VQA), introduced in the PMC-VQA paper by Xiaoman Zhang, Chaoyi Wu, Weidi Xie and colleagues at Shanghai Jiao Tong University and Shanghai AI Laboratory in May 2023. The model addresses a central limitation of earlier medical VQA systems, which treated the task as classification over a fixed answer vocabulary. By reframing medical VQA as an open-ended generative problem, MedVInT can produce free-form answers to clinical image questions rather than selecting from a predefined label set.

The model is trained on PMC-VQA, a large-scale dataset the authors built with a scalable generation pipeline drawing on figures and captions from PubMed Central open-access articles. PMC-VQA contains 227k question-answer pairs spanning 149k images across diverse imaging modalities and diseases, making it substantially broader in scope than the small, modality-specific benchmarks that preceded it.
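As a quick sense of scale, the rounded figures above imply about one and a half question-answer pairs per image. The snippet below only does that arithmetic; the counts are the rounded 227k and 149k quoted here rather than exact dataset sizes, and the variable names are illustrative.

```python
# Rounded counts quoted in the text (not exact dataset sizes).
qa_pairs = 227_000
images = 149_000

# Average number of question-answer pairs attached to each image.
pairs_per_image = qa_pairs / images
print(f"{pairs_per_image:.2f} QA pairs per image on average")
```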

MedVInT sits at the intersection of medical imaging and multimodal language modeling. It pairs a domain-adapted vision encoder with a medical large language model, building on the same group's PMC-CLIP and PMC-LLaMA work. Alongside the model, the authors set up a public leaderboard to standardize evaluation of generative medical VQA systems.

#Key Features

  • Generative VQA formulation: Treats medical question answering as open-ended text generation rather than fixed-vocabulary classification, enabling free-form answers across modalities and clinical topics.
  • Two architectural variants: MedVInT-TE uses an encoder-style language model with a masked-language-modeling objective, while MedVInT-TD uses a decoder-style autoregressive LLM; both share a common vision pathway.
  • Domain-adapted backbones: Pairs a ResNet-50 vision encoder from PMC-CLIP with the PMC-LLaMA language model; both were pretrained on biomedical literature rather than on generic web-scale data.
  • Large, broad training corpus: Trained on PMC-VQA's 227k QA pairs over 149k images, covering many modalities and diseases sourced from open-access PubMed Central figures.
  • Open release: The code is MIT-licensed, and the weights for both variants and the PMC-VQA dataset are released on Hugging Face.
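To make the difference between the two variants concrete, here is a minimal sketch of how a single question-answer pair could be turned into a training example under each objective. It is a toy illustration, not the released training code: the `Question: ... Answer:` template, the whitespace tokenization, the `[MASK]` string, and the choice to compute the loss only on answer tokens are all assumptions, and the image features are omitted entirely.

```python
MASK = "[MASK]"  # hypothetical mask token for the toy example


def build_te_example(question, answer):
    """Encoder-style (TE) sketch: answer tokens are masked in the input
    and become the prediction targets (masked language modeling)."""
    q_tokens = question.split()
    a_tokens = answer.split()
    inputs = ["Question:"] + q_tokens + ["Answer:"] + [MASK] * len(a_tokens)
    # None marks positions ignored by the loss; only masked slots get labels.
    labels = [None] * (len(q_tokens) + 2) + a_tokens
    return inputs, labels


def build_td_example(question, answer):
    """Decoder-style (TD) sketch: next-token prediction, with the loss
    restricted to answer tokens (an assumption of this sketch)."""
    prompt = ["Question:"] + question.split() + ["Answer:"]
    a_tokens = answer.split()
    sequence = prompt + a_tokens
    inputs = sequence[:-1]   # every token except the last
    targets = sequence[1:]   # the same sequence shifted left by one
    # 0 for targets that are still part of the prompt, 1 for answer tokens.
    loss_mask = [0] * (len(prompt) - 1) + [1] * len(a_tokens)
    return inputs, targets, loss_mask


te_inputs, te_labels = build_te_example("What modality is shown?", "chest X-ray")
td_inputs, td_targets, td_mask = build_td_example("What modality is shown?", "chest X-ray")
```

In the TE sketch the model sees the full sequence at once and fills in the masked slots, while in the TD sketch each answer token is predicted from the tokens to its left, which is what lets a decoder generate answers of arbitrary length at inference time.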

#Technical Details

MedVInT connects a pretrained vision encoder to a large language model through a trainable projection module. The vision pathway uses a ResNet-50 from PMC-CLIP, with either a 2-layer MLP or a 12-layer transformer projecting visual features into the language space. The TE (text-encoder) variant builds on encoder language models such as PubMedBERT, LLaMA-ENC, or PMC-LLaMA-ENC, with a 4-layer multimodal transformer decoder trained from scratch and a masked-language-modeling objective. The TD (text-decoder) variant uses decoder-style LLMs (LLaMA-7B or PMC-LLaMA-7B) directly as the multimodal decoder, and is first pretrained on PMC-OA image captioning before VQA fine-tuning.

On the PMC-VQA test set, the strongest TD configuration (PMC-CLIP plus PMC-LLaMA) reaches 40.3% accuracy on multiple-choice and 33.6% on open-ended questions. On established benchmarks, MedVInT-TD attains 73.7%/86.8% (open/closed) on VQA-RAD and 84.5%/86.3% on SLAKE.
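The projection step described in this section can be sketched in a few lines. The example below is a dependency-free toy of the 2-layer MLP option only, under several assumptions that are not taken from the source: the ReLU activation, the tiny dimensions, the random initialization, and the prefix-style concatenation of visual tokens before the text embeddings. The names `MLPProjector` and `build_llm_input` are hypothetical.

```python
import random


def linear(x, w, b):
    """Compute y = x @ w + b for one vector; w has shape (len(x), len(b))."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MLPProjector:
    """Toy 2-layer MLP mapping vision features into the LLM embedding size."""

    def __init__(self, d_vision, d_hidden, d_llm, seed=0):
        rng = random.Random(seed)
        self.w1 = init_matrix(d_vision, d_hidden, rng)
        self.b1 = [0.0] * d_hidden
        self.w2 = init_matrix(d_hidden, d_llm, rng)
        self.b2 = [0.0] * d_llm

    def __call__(self, visual_tokens):
        # Project each visual token independently.
        return [linear(relu(linear(t, self.w1, self.b1)), self.w2, self.b2)
                for t in visual_tokens]


def build_llm_input(projected_visual, text_embeddings):
    """Assumed prefix layout: visual tokens first, then the text tokens."""
    return projected_visual + text_embeddings


rng = random.Random(1)
visual_tokens = [[rng.random() for _ in range(8)] for _ in range(4)]      # 4 tokens, dim 8
text_embeddings = [[rng.random() for _ in range(12)] for _ in range(3)]   # 3 tokens, dim 12

projector = MLPProjector(d_vision=8, d_hidden=16, d_llm=12)
llm_input = build_llm_input(projector(visual_tokens), text_embeddings)
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is the shape contract: whatever the vision encoder emits, the projector must output vectors of the language model's embedding width so that image and text tokens can share one input sequence.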

#Applications

MedVInT targets clinical and research scenarios where natural-language reasoning over medical images is useful, such as radiology and pathology image interpretation, automated report drafting, medical education, and interactive diagnostic support tools. Because it generates free-form answers across many modalities, it is better suited than fixed-label classifiers to the open-ended, heterogeneous questions that arise in real clinical workflows. The released weights and PMC-VQA dataset also serve as a baseline and training resource for researchers building and benchmarking medical multimodal assistants.

#Impact

By reframing medical VQA as generation and providing a large, openly licensed dataset and leaderboard, MedVInT and PMC-VQA contributed to the wave of medical multimodal language models that followed. PMC-VQA has become a widely used benchmark for evaluating generative medical VQA systems, and the accompanying PMC-CLIP and PMC-LLaMA backbones are frequently reused as biomedical foundation components. The model's main limitation is accuracy: even the strongest reported configuration answers only 40.3% of multiple-choice and 33.6% of open-ended PMC-VQA test questions correctly, well below clinical reliability. Generative medical VQA at this level is a research tool rather than a deployable diagnostic system.

Citation

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Preprint

Zhang, X., et al. (2023) PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv.org.

DOI: 10.48550/arXiv.2305.10415

Citations

Total Citations: 347
Influential: 33
References: 72

GitHub

Stars: 233
Forks: 16
Open Issues: 15
Contributors: 2
Last Push: 1y ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 2
Last Modified: 2y ago

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 83 (Open)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 78
Model Openness Framework: Unclassified (missing required components)

Tags

generative, histology, medical_image_understanding, multimodal, radiology, transformer, vision_transformer, visual_instruction_tuning, visual_question_answering

Resources

GitHub Repository · Research Paper · HuggingFace Model · HuggingFace Model · Dataset