Technical University of Munich / Imperial College London / University of Oxford
A 2B-parameter medical vision-language model that uses reinforcement learning (GRPO) to produce explicit, human-interpretable reasoning for radiology visual question answering.
MedVLM-R1 is a compact medical vision-language model (VLM) that generates explicit, natural-language reasoning alongside its answers to questions about radiology images. It targets a central trust problem in medical AI: most diagnostic models output a final answer without showing how they arrived at it, which limits clinical confidence. Rather than relying on supervised fine-tuning over chains of reasoning, MedVLM-R1 uses a reinforcement learning (RL) framework that rewards the model for discovering human-interpretable reasoning paths on its own, without any reasoning references in the training data.
The model was introduced in February 2025 by Jiazhen Pan, Che Liu, Daniel Rueckert and colleagues at the Technical University of Munich, Imperial College London, and the University of Oxford, and the work was subsequently accepted at MICCAI 2025. It applies the DeepSeek-R1-style "incentivized reasoning" recipe — popularized for text language models — to the multimodal medical imaging domain, where interpretable, verifiable reasoning is especially valuable.
MedVLM-R1 sits at the intersection of medical imaging analysis and reasoning language models. Its central finding is that a small 2B-parameter model, trained with RL on only 600 visual question answering (VQA) samples, can outperform conventionally fine-tuned models trained on more than a million samples, while also producing transparent reasoning traces.
MedVLM-R1 is built on the Qwen2-VL-2B-Instruct backbone, a vision-language transformer pairing a vision encoder with a 2B-parameter language model. It is trained with GRPO, a reinforcement learning algorithm that compares groups of sampled responses and optimizes toward higher-reward outputs using two simple, verifiable reward functions: one for matching the correct multiple-choice answer and one for adhering to the required output format, in which <think> reasoning is followed by a final answer. No supervised reasoning labels are used. Training used 600 MRI VQA samples from the HuatuoGPT-Vision dataset, with evaluation drawn from OmniMedVQA across MRI, CT, and X-ray. On these benchmarks, MedVLM-R1 raised accuracy from 55.11% to 78.22% and exceeded the performance of much larger VLMs fine-tuned on over one million samples. The authors also report failure cases in which the generated reasoning is superficial or contradictory, noting that reasoning quality does not always track answer correctness.
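The two rewards and the group comparison described above can be illustrated with a minimal Python sketch. This is not the authors' training code. The <answer> tag, the exact regular expressions, the equal weighting of the two rewards, and the function names are all assumptions made for illustration. The advantage calculation follows the commonly described GRPO normalization, in which each reward is centered and scaled by its group's mean and standard deviation.

```python
import re
import statistics

# Assumed output template: <think>...</think> followed by <answer>...</answer>.
# The <answer> tag is a hypothetical choice for this sketch.
FORMAT_RE = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the reasoning-then-answer template."""
    return 1.0 if FORMAT_RE.match(response) else 0.0


def accuracy_reward(response: str, correct_choice: str) -> float:
    """Return 1.0 if the tagged answer starts with the correct choice letter."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    # Accept forms such as "B", "(B)" or "B) glioma", but not a word like "Brain".
    letter = re.match(r"\(?([A-Za-z])\b", match.group(1).strip())
    if letter is None:
        return 0.0
    return 1.0 if letter.group(1).upper() == correct_choice.upper() else 0.0


def total_reward(response: str, correct_choice: str) -> float:
    """Sum the two verifiable rewards (equal weighting is an assumption)."""
    return format_reward(response) + accuracy_reward(response, correct_choice)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, several responses would be sampled for the same image and question, each scored with `total_reward`, and the resulting list passed to `group_relative_advantages`. Responses that score above the group average receive a positive advantage and are reinforced, and those below it receive a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.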
MedVLM-R1 is aimed at medical visual question answering and radiology decision support, where clinicians and researchers benefit from seeing a model's reasoning rather than only its final answer. Its compact size makes it practical to deploy in resource-constrained settings, and its data-efficient RL recipe offers a template for building interpretable diagnostic assistants in specialties where large labeled reasoning datasets are scarce. Released openly, it is well suited to research on trustworthy medical AI, reasoning evaluation, and modality transfer.
MedVLM-R1 is an early demonstration that DeepSeek-R1-style reinforcement learning, which elicited emergent reasoning in text language models, transfers to small medical vision-language models. By showing that a 2B-parameter model trained on hundreds (not millions) of samples can both outperform larger supervised baselines and produce interpretable reasoning, it highlights RL as a data-efficient path toward transparent clinical AI. Its open weights and code, and acceptance at MICCAI 2025, have made it a reference point for subsequent work on reasoning-centric medical VLMs, while the authors' candid analysis of superficial and contradictory reasoning underscores that faithful reasoning remains an open challenge.
Pan, J., et al. (2025) MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. International Conference on Medical Image Computing and Computer-Assisted Intervention.
DOI: 10.48550/arXiv.2502.19634
Pan, J., et al. (2025) MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. Medical Image Computing and Computer Assisted Intervention – MICCAI 2025.
DOI: 10.1007/978-3-032-04981-0_32