Hong Kong University of Science and Technology / Sun Yat-sen University
Region-aware bilingual (Chinese-English) medical multimodal LLM that handles image- and region-level vision-language tasks across eight imaging modalities.
MedRegA is a region-aware, bilingual (Chinese-English) medical multimodal large language model designed to handle a broad spectrum of biomedical vision-language tasks within a single generalist system. Most medical MLLMs reason over a whole image at once, which makes their outputs difficult to interpret and prone to overlooking the small anatomical structures or lesions that drive clinical decisions. MedRegA addresses this by explicitly grounding its reasoning in image regions, mimicking the clinical workflow in which a radiologist surveys an entire scan and then focuses attention on specific areas before reaching a conclusion.
The model was developed by Lehan Wang, Haonan Wang, Honglong Yang, and Xiaomeng Li of the Hong Kong University of Science and Technology, together with radiologist collaborators Jiaji Mao, Zehong Yang, and Jun Shen from Sun Yat-sen Memorial Hospital, Sun Yat-sen University. It was first released as a preprint in October 2024 and accepted to ICLR 2025.
To train region-aware behavior, the authors introduce MedRegInstruct, an instruction-tuning corpus in which samples are paired with the coordinates of body structures or lesions. This lets MedRegA serve as an interpretable generalist that can both answer questions about whole images and localize, identify, and report on specific anatomical regions across eight medical imaging modalities.
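As a rough illustration of what pairing a sample with region coordinates can look like, the Python sketch below defines a hypothetical record type. The field names (`image_path`, `instruction`, `response`, `regions`), the `(x1, y1, x2, y2)` box convention, the 0 to 1000 normalization, and all example values are assumptions made for this sketch; the actual MedRegInstruct schema is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A labeled bounding box (hypothetical schema, not the released format)."""
    label: str
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class RegionSample:
    """One instruction-tuning sample paired with region coordinates."""
    image_path: str
    instruction: str
    response: str
    regions: list[Region] = field(default_factory=list)


def normalize_box(box, width, height, scale=1000):
    """Map a pixel box to integer coordinates in [0, scale] (assumed convention)."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    )


# Placeholder values only; none of this is taken from the dataset.
sample = RegionSample(
    image_path="example_image.png",
    instruction="Locate the structure named in the question.",
    response="The structure is inside the marked region.",
    regions=[Region(label="structure", box=(512, 768, 768, 1024))],
)

# The 2048 x 2048 image size is made up for the example.
normalized = normalize_box(sample.regions[0].box, width=2048, height=2048)
print(normalized)
```

Normalizing boxes to a fixed range is one common way to keep coordinates independent of image resolution, which matters when a corpus mixes modalities with very different image sizes.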
MedRegA is built on the InternVL-Chat-V1-2 backbone, a roughly 40-billion-parameter vision-language model that couples a vision transformer image encoder with a large language model decoder. The authors adapt this generalist foundation through instruction tuning on MedRegInstruct, which augments standard image-text supervision with explicit region coordinates so the model learns to attend to and describe localized structures. Training data is drawn from eight public medical imaging repositories spanning chest X-ray (MIMIC-CXR, VinDr-CXR, VinDr-PCXR), mammography (VinDr-Mammo), spine X-ray (VinDr-SpineXR), dermatology (ISIC), histopathology (PanNuke), and the large segmentation collection SA-Med-2D-20M. Across image-level and region-level benchmarks, the authors report competitive or superior performance relative to general and medical MLLMs on visual question answering, report generation, medical image classification, and region detection, with the region-grounding capability providing interpretability that whole-image models lack.
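Region detection is usually scored by comparing predicted boxes against reference boxes. The sketch below computes intersection over union (IoU) for two axis-aligned boxes. It is a generic metric shown for illustration only: the text above does not state which metric or threshold the MedRegA benchmarks use, and the function name `box_iou` and the `(x1, y1, x2, y2)` convention are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give no intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes for the example.
predicted = (100, 100, 300, 300)
reference = (200, 200, 400, 400)
print(box_iou(predicted, reference))
```

A prediction is often counted as correct when its IoU with the reference exceeds a chosen threshold; the threshold is a property of the evaluation protocol, not of the metric.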
MedRegA targets clinical and research scenarios where both broad coverage and fine-grained localization matter. Radiologists and clinicians can use it to generate grounded reports in which findings are tied to specific image regions, to detect and identify anatomical structures or lesions, and to answer questions across diverse modalities, all from one model rather than a collection of task-specific tools. Its bilingual support makes it particularly relevant for Chinese-language clinical workflows, and the region-grounded output offers a degree of transparency useful for second-read assistance and education.
By coupling generalist breadth with explicit region grounding, MedRegA advances the interpretability of medical MLLMs, an area where opaque whole-image reasoning has limited clinical trust. Its acceptance at ICLR 2025 and the open release of code, weights, and the MedRegInstruct dataset lower the barrier for follow-on work on region-aware medical vision-language modeling. As with other research-stage medical MLLMs, the model is not a cleared clinical device, and its reported gains are benchmark-based; real-world deployment would require prospective validation. Even so, MedRegA provides a reproducible foundation and a reusable region-annotated dataset for the community.
Wang, L., et al. (2024) Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks. International Conference on Learning Representations.
DOI: 10.48550/arXiv.2410.18387