Emory University / University of Southern California / University of Tokyo / Johns Hopkins University / Georgia Institute of Technology
A reinforcement-learning-trained medical vision-language model for generalizable reasoning across eight imaging modalities and five clinical question types.
Med-R1 is a vision-language model (VLM) for medical reasoning that is trained with reinforcement learning rather than the supervised fine-tuning typically used to adapt general VLMs to clinical tasks. Introduced in March 2025 by researchers at Emory University, the University of Southern California, the University of Tokyo, Johns Hopkins University, and the Georgia Institute of Technology, it targets a persistent weakness of medical VLMs: models tuned on one imaging modality or question format often fail to transfer to others, limiting their usefulness across the heterogeneous landscape of clinical imaging.
The central idea is to apply Group Relative Policy Optimization (GRPO) — the reward-guided strategy popularized by DeepSeek-R1 — to a compact open VLM, encouraging it to learn generalizable decision policies instead of memorizing dataset-specific annotations. Built on the 2-billion-parameter Qwen2-VL-2B-Instruct backbone, Med-R1 spans eight imaging modalities (CT, MRI, ultrasound, X-ray, fundus photography, OCT, dermoscopy, and microscopy) and five clinical question types, positioning it as a broad-coverage medical reasoning model rather than a single-task classifier.
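The "group relative" part of GRPO can be illustrated with a small sketch. It assumes the commonly described formulation in which several responses are sampled for one image-question pair and each response's reward is standardized against the mean and standard deviation of its own group. The function name, the reward values, and the use of the population standard deviation are illustrative assumptions, not details taken from the Med-R1 release.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each sampled response's reward against its own group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one image-question pair.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to provide the baseline. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.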
A notable empirical finding is that explicit chain-of-thought reasoning is not always beneficial in this setting. The authors report that a "No-Thinking" variant, which omits intermediate reasoning steps, can outperform the reasoning-augmented model, suggesting that in medical VQA the quality and domain alignment of reasoning — not its mere presence — drive performance.
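One way to picture the difference between the two variants is through a rule-based reward that checks output format and answer correctness. The sketch below is hypothetical: the `<think>`/`<answer>` tag convention, the A to E choice letters, the reward values, and the function name are all assumptions made for illustration, and the summary above does not specify how Med-R1 actually scores responses.

```python
import re

# Assumed output formats (not taken from the Med-R1 release):
#   thinking:    <think>...</think><answer>B</answer>
#   no-thinking: <answer>B</answer>
THINK_RE = re.compile(
    r"\s*<think>.+?</think>\s*<answer>\s*([A-E])\s*</answer>\s*", re.DOTALL
)
NO_THINK_RE = re.compile(r"\s*<answer>\s*([A-E])\s*</answer>\s*")


def rule_based_reward(response, gold_choice, thinking=True):
    """Return a format reward (0 or 1) plus an accuracy reward (0 or 1)."""
    pattern = THINK_RE if thinking else NO_THINK_RE
    match = pattern.fullmatch(response)
    if match is None:
        return 0.0  # malformed output earns nothing
    correct = match.group(1) == gold_choice
    return 1.0 + (1.0 if correct else 0.0)


# Hypothetical responses to a multiple-choice question whose answer is B.
with_reasoning = "<think>Opacity in the left lower lobe.</think><answer>B</answer>"
answer_only = "<answer>B</answer>"
```

Under this sketch the "No-Thinking" setting simply drops the reasoning requirement from the format check, so the reward depends only on a well-formed, correct final answer.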
Med-R1 fine-tunes Qwen2-VL-2B-Instruct using GRPO, with input images resized to 384x384 pixels. Training and evaluation use the open-access portion of the OmniMedVQA benchmark — roughly 82,000 images and 89,000 visual question-answer pairs spanning the eight modalities and five question types — split 80/20 for training and testing. The release provides separate cross-modality and cross-task checkpoints. Reported results include a 29.94% average accuracy improvement over the Qwen2-VL-2B base model and a 32.06% gain in cross-question-type generalization, with the 2B model outperforming Qwen2-VL-72B on the studied medical reasoning tasks.
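The 80/20 split can be sketched as a seeded shuffle followed by a cut. This is a minimal illustration that assumes a simple random split over question-answer identifiers. Whether the actual split is made per image or per question, and whether it is stratified by modality or question type, is not stated here, and the helper name and the integer IDs are hypothetical.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle deterministically, then cut into 80% train and 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for question-answer pair identifiers.
qa_ids = list(range(1000))
train_ids, test_ids = split_80_20(qa_ids)
```

Fixing the seed keeps the partition reproducible, which matters when comparing checkpoints evaluated on the same held-out pairs.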
Med-R1 is aimed at medical visual question answering, where a clinician or downstream system supplies an image and a natural-language question and the model returns an answer, optionally with a reasoning trace. Because it generalizes across modalities and task formats without per-task retraining, it is well suited to building flexible diagnostic-support prototypes, triage assistants, and educational tools that must handle radiology, ophthalmology, dermatology, and pathology imagery side by side. Its small footprint makes it attractive for resource-constrained or on-premise deployments where larger VLMs are impractical.
Med-R1 contributes to a fast-growing line of work applying DeepSeek-R1-style reinforcement learning to multimodal medical models, appearing alongside closely related efforts such as MedVLM-R1. Its main contributions are evidence that GRPO can yield strong cross-modality and cross-task generalization in a small open VLM, and the counterintuitive observation that explicit reasoning steps are not universally helpful for medical VQA. As a research artifact with openly released weights and data, it lowers the barrier for studying RL-based medical reasoning. The work first appeared as an arXiv preprint, and its evaluation is confined to the OmniMedVQA benchmark, so reported gains should be interpreted as benchmark results rather than validated clinical performance.
Lai, Y., et al. (2025) Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models. IEEE Transactions on Medical Imaging.
DOI: 10.48550/arXiv.2503.13939
DOI: 10.1109/TMI.2026.3661001