Shanghai Jiao Tong University / Shanghai AI Laboratory
A generative medical visual question answering model that aligns a medical vision encoder with a large language model, trained on the 227k-pair PMC-VQA dataset.
MedVInT (Medical Visual Instruction Tuning) is a generative foundation model for medical visual question answering (VQA). It was introduced in May 2023 in the PMC-VQA paper by Xiaoman Zhang, Chaoyi Wu, Weidi Xie and colleagues at Shanghai Jiao Tong University and Shanghai AI Laboratory. The model addresses a central limitation of earlier medical VQA systems, which treated the task as classification over a fixed answer vocabulary. By reframing medical VQA as an open-ended generative problem, MedVInT can produce free-form answers to clinical image questions rather than selecting from a predefined label set.
The model is trained on PMC-VQA, a large-scale dataset the authors built with a scalable generation pipeline drawing on figures and captions from PubMed Central open-access articles. PMC-VQA contains 227k question-answer pairs spanning 149k images across diverse imaging modalities and diseases, making it substantially broader in scope than the small, modality-specific benchmarks that preceded it.
MedVInT sits at the intersection of medical imaging and multimodal language modeling. It pairs a domain-adapted vision encoder with a medical large language model, building on the same group's PMC-CLIP and PMC-LLaMA work. Alongside the model, the authors established a public leaderboard to standardize evaluation of generative medical VQA systems.
MedVInT connects a pretrained vision encoder to a large language model through a trainable projection module. The vision pathway uses a ResNet-50 from PMC-CLIP, with either a 2-layer MLP or a 12-layer transformer projecting visual features into the language space. The TE (text-encoder) variant builds on encoder language models such as PubMedBERT, LLaMA-ENC, or PMC-LLaMA-ENC, with a 4-layer multimodal transformer decoder trained from scratch and a masked-language-modeling objective. The TD (text-decoder) variant uses decoder-style LLMs (LLaMA-7B or PMC-LLaMA-7B) directly as the multimodal decoder, and is first pretrained on PMC-OA image captioning before VQA fine-tuning. On the PMC-VQA test set, the strongest TD configuration (PMC-CLIP plus PMC-LLaMA) reaches 40.3% accuracy on multiple-choice and 33.6% on open-ended questions. On established benchmarks, MedVInT-TD attains 73.7%/86.8% (open/closed) on VQA-RAD and 84.5%/86.3% on SLAKE.
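The projection step described above can be illustrated with a minimal, dependency-free sketch. It is not the released implementation. The toy dimensions, the ReLU activation, the random weights, and the choice to prepend projected visual tokens to the question's token embeddings are all assumptions made for illustration, and the helper names (`init_linear`, `mlp_project`, `build_decoder_input`) are hypothetical.

```python
import math
import random


def init_linear(rng, in_dim, out_dim):
    """Create a randomly initialised (weight, bias) pair for y = W x + b."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_project(visual_tokens, layer1, layer2):
    """2-layer MLP mapping each visual token into the language embedding space.

    The ReLU between the two layers is an assumption of this sketch.
    """
    projected = []
    for token in visual_tokens:
        hidden = [max(0.0, h) for h in linear(token, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


def build_decoder_input(visual_tokens, text_embeddings, layer1, layer2):
    """Prepend projected visual tokens to the text embeddings (assumed layout)."""
    return mlp_project(visual_tokens, layer1, layer2) + text_embeddings


# Toy sizes, far smaller than a real ResNet-50 feature map or a 7B LLM.
VISUAL_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12
rng = random.Random(0)
layer1 = init_linear(rng, VISUAL_DIM, HIDDEN_DIM)
layer2 = init_linear(rng, HIDDEN_DIM, LLM_DIM)

# Stand-ins for 4 visual tokens from the encoder and 6 question-token embeddings.
visual_tokens = [[rng.gauss(0, 1) for _ in range(VISUAL_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(6)]

sequence = build_decoder_input(visual_tokens, text_embeddings, layer1, layer2)
print(len(sequence), len(sequence[0]))  # 10 tokens, each of width LLM_DIM (12)
```

The point of the sketch is the shape contract: whatever the vision encoder emits, the projection must output vectors of the language model's embedding width so that image and text tokens can share one input sequence.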
MedVInT targets clinical and research scenarios where natural-language reasoning over medical images is useful, such as radiology and pathology image interpretation, automated report drafting, medical education, and interactive diagnostic support tools. Because it generates free-form answers across many modalities, it is better suited than fixed-label classifiers to the open-ended, heterogeneous questions that arise in real clinical workflows. The released weights and PMC-VQA dataset also serve as a baseline and training resource for researchers building and benchmarking medical multimodal assistants.
By reframing medical VQA as generation and providing a large, openly licensed dataset and leaderboard, MedVInT and PMC-VQA helped catalyze the wave of medical multimodal language models that followed. PMC-VQA has become a widely used benchmark for evaluating generative medical VQA systems, and the accompanying PMC-CLIP and PMC-LLaMA backbones are frequently reused as biomedical foundation components. The model's main limitation is accuracy: even the best configurations remain well below clinical reliability, underscoring that current generative medical VQA is a research tool rather than a deployable diagnostic system.
Zhang, X., et al. (2023) PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv.org.
DOI: 10.48550/arXiv.2305.10415