MedVInT

Shanghai Jiao Tong University / Shanghai AI Laboratory

Generative medical visual question answering model that pairs a vision encoder with a language model, trained on the 227k-pair PMC-VQA dataset.

Released: May 2023

MedVInT (Medical Visual Instruction Tuning) is a generative foundation model for medical visual question answering (VQA), introduced in the PMC-VQA paper by Xiaoman Zhang, Chaoyi Wu, Weidi Xie and colleagues at Shanghai Jiao Tong University and Shanghai AI Laboratory in May 2023. The model addresses a central limitation of earlier medical VQA systems, which treated the task as classification over a fixed answer vocabulary. By reframing medical VQA as an open-ended generative problem, MedVInT can produce free-form answers to clinical image questions rather than selecting from a predefined label set.

The model is trained on PMC-VQA, a large-scale dataset the authors built with a scalable generation pipeline drawing on figures and captions from PubMed Central open-access articles. PMC-VQA contains 227k question-answer pairs spanning 149k images across diverse imaging modalities and diseases, making it substantially broader in scope than the small, modality-specific benchmarks that preceded it.

MedVInT sits at the intersection of medical imaging and multimodal language modeling. It pairs a domain-adapted vision encoder with a medical large language model, building on the same group's PMC-CLIP and PMC-LLaMA work. Alongside the model, the authors set up a public leaderboard to standardize evaluation of generative medical VQA systems.

Key Features

Generative VQA formulation: Treats medical question answering as open-ended text generation rather than fixed-vocabulary classification, enabling free-form answers across modalities and clinical topics.
Two architectural variants: MedVInT-TE uses an encoder-style language model with a masked-language-modeling objective, while MedVInT-TD uses a decoder-style autoregressive LLM; both share a common vision pathway.
Domain-adapted backbones: Combines the ResNet-50 vision encoder from PMC-CLIP with the PMC-LLaMA language model, both pretrained on biomedical literature, rather than generic web-scale backbones.
Large, broad training corpus: Trained on PMC-VQA's 227k QA pairs over 149k images, covering many modalities and diseases sourced from open-access PubMed Central figures.
Open release: Code is MIT-licensed, and model weights for both variants, along with the PMC-VQA dataset, are released on Hugging Face.
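
To make the contrast between fixed-vocabulary classification and open-ended generation concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not MedVInT's code: `ANSWER_VOCAB`, `classify_answer`, `generate_answer`, and `make_scripted_model` are hypothetical names, and the scripted stub stands in for a language model that would, in a real system, be conditioned on image features.

```python
from typing import Callable, Dict, List

# Hypothetical closed answer set, as a classification-style VQA system would use.
ANSWER_VOCAB: List[str] = ["yes", "no", "chest x-ray", "mri"]


def classify_answer(scores: Dict[str, float]) -> str:
    """Fixed-vocabulary VQA: pick the best-scoring label from the closed set.

    Answers outside ANSWER_VOCAB can never be returned, however high they score.
    """
    return max(ANSWER_VOCAB, key=lambda a: scores.get(a, float("-inf")))


def generate_answer(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> str:
    """Generative VQA: decode token by token until eos or the length limit."""
    context = list(prompt)
    answer: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == eos:
            break
        context.append(token)
        answer.append(token)
    return " ".join(answer)


def make_scripted_model(script: List[str], eos: str = "<eos>") -> Callable[[List[str]], str]:
    """Toy stand-in for an LLM: replays a fixed token script, then emits eos."""
    remaining = iter(script)

    def next_token(context: List[str]) -> str:
        return next(remaining, eos)

    return next_token


if __name__ == "__main__":
    print(classify_answer({"yes": 0.2, "mri": 0.7}))
    model = make_scripted_model(["left", "lower", "lobe", "opacity"])
    print(generate_answer(model, ["what", "abnormality", "is", "seen", "?"]))
```

The classifier can only ever return one of the four listed labels, whereas the decoding loop can emit any token sequence the underlying model produces, which is the property the generative formulation relies on.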

Technical Details

MedVInT connects a pretrained vision encoder to a large language model through a trainable projection module. The vision pathway uses a ResNet-50 from PMC-CLIP, with either a 2-layer MLP or a 12-layer transformer projecting visual features into the language space.

The TE (text-encoder) variant builds on encoder language models such as PubMedBERT, LLaMA-ENC, or PMC-LLaMA-ENC, with a 4-layer multimodal transformer decoder trained from scratch and a masked-language-modeling objective. The TD (text-decoder) variant uses decoder-style LLMs (LLaMA-7B or PMC-LLaMA-7B) directly as the multimodal decoder, and is first pretrained on PMC-OA image captioning before VQA fine-tuning.

On the PMC-VQA test set, the strongest TD configuration (PMC-CLIP plus PMC-LLaMA) reaches 40.3% accuracy on multiple-choice and 33.6% on open-ended questions. On established benchmarks, MedVInT-TD attains 73.7%/86.8% (open/closed) on VQA-RAD and 84.5%/86.3% on SLAKE.
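
The projection step described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch rather than the released implementation: the class and function names, the ReLU activation, the omission of biases, the toy dimensions, and the choice to place projected visual tokens before the text embeddings are all assumptions made for the example. A real model would use a deep learning framework and learned weights.

```python
import random
from typing import List

Vector = List[float]


def init_linear(in_dim: int, out_dim: int, rng: random.Random) -> List[Vector]:
    """Random weight matrix of shape (out_dim, in_dim); biases omitted for brevity."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(x: Vector, weights: List[Vector]) -> Vector:
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class MLPProjector:
    """2-layer MLP mapping visual features into the language embedding space.

    The ReLU between the layers is an assumption for this sketch.
    """

    def __init__(self, vis_dim: int, hidden_dim: int, lm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = init_linear(vis_dim, hidden_dim, rng)
        self.w2 = init_linear(hidden_dim, lm_dim, rng)

    def __call__(self, visual_tokens: List[Vector]) -> List[Vector]:
        # Project each visual token independently.
        return [linear(relu(linear(tok, self.w1)), self.w2) for tok in visual_tokens]


def build_lm_input(
    visual_tokens: List[Vector],
    text_embeddings: List[Vector],
    projector: MLPProjector,
) -> List[Vector]:
    """Concatenate projected visual tokens with text embeddings (ordering assumed)."""
    return projector(visual_tokens) + list(text_embeddings)


if __name__ == "__main__":
    projector = MLPProjector(vis_dim=8, hidden_dim=16, lm_dim=4)
    image_tokens = [[0.5] * 8 for _ in range(3)]     # stand-in for encoder output
    question_embeds = [[0.0] * 4 for _ in range(5)]  # stand-in for token embeddings
    sequence = build_lm_input(image_tokens, question_embeds, projector)
    print(len(sequence), len(sequence[0]))
```

The point of the sketch is the shape contract: whatever the vision encoder emits, the projector must output vectors with the language model's embedding width so that both kinds of token can share one input sequence.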

Applications

MedVInT targets clinical and research scenarios where natural-language reasoning over medical images is useful, such as radiology and pathology image interpretation, automated report drafting, medical education, and interactive diagnostic support tools. Because it generates free-form answers across many modalities, it is better suited than fixed-label classifiers to the open-ended, heterogeneous questions that arise in real clinical workflows. The released weights and PMC-VQA dataset also serve as a baseline and training resource for researchers building and benchmarking medical multimodal assistants.

Impact

By reframing medical VQA as generation and providing a large, openly licensed dataset and leaderboard, MedVInT and PMC-VQA helped catalyze the wave of medical multimodal language models that followed. PMC-VQA has become a widely used benchmark for evaluating generative medical VQA systems, and the accompanying PMC-CLIP and PMC-LLaMA backbones are frequently reused as biomedical foundation components. The model's main limitation is accuracy: even the strongest configuration reaches only 40.3% on PMC-VQA multiple-choice questions and 33.6% on open-ended ones, well below clinical reliability. Generative medical VQA at this level is a research tool rather than a deployable diagnostic system.

Citation

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Preprint

Zhang, X., et al. (2023) PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv.org.

DOI: 10.48550/arXiv.2305.10415

Recent citations

Papers that recently cited this model.

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models
Zhuoyuan Fu, Zeshang Li, Yiqiong Zhang, et al.
Jul 2026 · 0 citations

MedRealMM: A Real-World Multimodal Benchmark for Chinese Online Medical Consultation
Runhan Shi, Quan Zhou, Yuqian Xu, et al.
Jul 2026 · 0 citations

Does AI Understand Imaging? A Systematic Benchmark of Agentic AI for Computational Imaging Tasks
Ethan Chung, Chuanjun Zheng, Jasper Tan, et al.
Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, et al.
International Conference on Learning Representations · Oct 2023
1.6K citations · Influential

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Zhe Chen, Weiyun Wang, Yue Cao, et al.
arXiv.org · Dec 2024
1.5K citations

A survey on multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, et al.
National Science Review · Jun 2023
1.4K citations · Influential

Instruction Tuning for Large Language Models: A Survey
Shengyu Zhang, Linfeng Dong, Xiaoya Li, et al.
ACM Computing Surveys · Aug 2023
879 citations

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, et al.
arXiv.org · Mar 2025
566 citations

Citations

Total Citations: 367
Influential: 36
References: 72

GitHub

Stars: 236
Forks: 16
Open Issues: 15
Contributors: 2
Last Push: 1y ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 2
Last Modified: 3y ago

Fields of citing research

Computer Science: 99%
Medicine: 74%
Engineering: 10%
Linguistics: 3%
Biology: 2%
Mathematics: 1%
Environmental Science: 1%
Physics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

83 (Open)

Usability (can I run it?): 94

Reproducibility (can I retrain it?): 78

Model Openness Framework: Unclassified (missing required components)

