Shanghai AI Laboratory / Fuzhou University / Shanghai Innovation Institute / Fudan University / Monash University / University of Washington / Stanford University
A reinforcement-learning-enhanced general medical vision-language model that adds step-by-step reasoning for medical image diagnosis and visual question answering.
GMAI-VL-R1 is a multimodal medical reasoning model that augments a general medical vision-language model with explicit, step-by-step reasoning learned through reinforcement learning (RL). It addresses a recurring weakness of existing general medical AI systems: while they can describe a medical image or answer a direct question, they often lack the structured reasoning needed for complex clinical decision-making, where intermediate inference steps matter as much as the final answer.
The model was introduced in April 2025 by researchers from Shanghai Artificial Intelligence Laboratory (the "uni-medical" group), Fuzhou University, Shanghai Innovation Institute, Fudan University, Monash University, the University of Washington, and Stanford University. It belongs to the broader GMAI-VL family of general medical AI vision-language models but distinguishes itself by being trained with verifiable-reward RL rather than supervised fine-tuning alone. The authors are among the first to apply Group Relative Policy Optimization (GRPO) to the multimodal medical domain at scale.
GMAI-VL-R1 fits into a fast-growing line of "reasoning-enhanced" medical multimodal models (alongside efforts such as MedVLM-R1), where RL on verifiable medical questions is used to elicit chain-of-thought style reasoning that generalizes to unseen tasks better than memorization-driven supervised training.
GMAI-VL-R1 is built on the Qwen2.5-VL vision-language backbone, with a primary 7B-parameter model and a smaller 3B variant. Training applies GRPO, a policy-gradient RL method that computes advantages over groups of sampled responses with KL-divergence regularization against a reference policy, using correctness on multiple-choice medical questions as the verifiable reward signal. The GMAI-Reasoning10K training set aggregates roughly 10,000 questions distilled from 95 public medical datasets across five imaging modalities. In evaluation, the 7B RL-tuned model improves over its supervised fine-tuning baseline on several benchmarks: accuracy on the GMAI-MMBench validation split rises to 43.14%, a gain of about 3 points, with comparable gains on MMMU, MMMU-pro, and MedXpertQA-MM. The authors report that RL training generalizes better on out-of-distribution tasks, while supervised fine-tuning retains an edge on some in-distribution benchmarks such as OmniMedVQA.
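The group-relative advantage at the core of GRPO can be illustrated with a minimal Python sketch. It is not the authors' training code: the `<answer>...</answer>` output format, the function names, and the binary 0/1 reward are assumptions made for illustration, and the policy-gradient update and the KL penalty against the reference policy are omitted.

```python
import re
import statistics

# Assumed (hypothetical) output format: the final choice is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*([A-E])\s*</answer>", re.IGNORECASE)


def correctness_reward(response, gold):
    """Verifiable reward: 1.0 if the tagged option letter matches the gold label, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0  # no parseable answer counts as incorrect
    return 1.0 if match.group(1).upper() == gold.upper() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group of responses sampled for one multiple-choice question whose gold answer is "B".
group = [
    "<think>Consolidation in the right lower lobe.</think><answer>B</answer>",
    "<think>Looks like effusion.</think><answer>C</answer>",
    "The answer is B",  # correct letter but not in the assumed format
    "<answer> b </answer>",
]
rewards = [correctness_reward(r, "B") for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group, a question on which every sampled response earns the same reward contributes no learning signal; only groups with mixed outcomes push the policy toward the better responses.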
GMAI-VL-R1 targets medical image interpretation and visual question answering across radiology (X-ray, CT, MRI), ophthalmology (OCT), and ultrasound, with use cases in diagnostic support, clinical decision assistance, and medical education. By producing explicit reasoning chains rather than bare answers, it is better suited to settings where clinicians need to inspect and verify the rationale behind a model's output. As a research artifact with open weights and data, it also serves as a reproducible baseline for studying reinforcement learning approaches to medical multimodal reasoning.
GMAI-VL-R1 is one of the early demonstrations that GRPO-style reinforcement learning with verifiable rewards can be applied to general medical vision-language models, extending the "reasoning model" paradigm popularized in general LLMs into the multimodal medical domain. Its public release of code, the GMAI-Reasoning10K dataset, and model weights lowers the barrier for follow-up work on RL-based medical reasoning. The finding that RL improves out-of-distribution generalization relative to supervised fine-tuning offers practical guidance for building medical AI systems that must operate across heterogeneous imaging sources. As a preprint, its benchmark numbers should be read as research results pending peer review, and like other medical multimodal models it is intended for research rather than autonomous clinical use.
Su, Y., et al. (2025) GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning. arXiv.org.
DOI: 10.48550/arXiv.2504.01886