MedRegA

Hong Kong University of Science and Technology / Sun Yat-sen University

Region-aware bilingual medical multimodal LLM that handles image- and region-level vision-language tasks across eight imaging modalities.

Released: October 2024

Parameters: 40 Billion

MedRegA is a region-aware, bilingual (Chinese-English) medical multimodal large language model designed to handle a broad spectrum of biomedical vision-language tasks within a single generalist system. Most medical MLLMs reason over a whole image at once, which makes their outputs difficult to interpret and prone to overlooking the small anatomical structures or lesions that drive clinical decisions. MedRegA addresses this by explicitly grounding its reasoning in image regions, mimicking the clinical workflow in which a radiologist surveys an entire scan and then focuses attention on specific areas before reaching a conclusion.

The model was developed by Lehan Wang, Haonan Wang, Honglong Yang, and Xiaomeng Li of the Hong Kong University of Science and Technology, together with radiologist collaborators Jiaji Mao, Zehong Yang, and Jun Shen from Sun Yat-sen Memorial Hospital, Sun Yat-sen University. It was first released as a preprint in October 2024 and accepted to ICLR 2025.

To train region-aware behavior, the authors introduce MedRegInstruct, an instruction-tuning corpus in which samples are paired with the coordinates of body structures or lesions. This lets MedRegA serve as an interpretable generalist that can both answer questions about whole images and localize, identify, and report on specific anatomical regions across eight medical imaging modalities.
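As a rough illustration of pairing a sample with region coordinates, the sketch below models one instruction record. The field names (image_path, instruction, answer, regions), the normalized [0, 1] box convention, and the example values are all hypothetical; the actual MedRegInstruct schema is not specified in this description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (x1, y1, x2, y2), each normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    label: str  # name of the body structure or lesion
    box: Box    # where it sits in the image


@dataclass
class RegionSample:
    image_path: str
    instruction: str
    answer: str
    regions: List[Region] = field(default_factory=list)


def is_valid_box(box: Box) -> bool:
    """Check that a box lies inside the image and has positive area."""
    x1, y1, x2, y2 = box
    return 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0


# Illustrative record only; not taken from the dataset.
sample = RegionSample(
    image_path="example_cxr.png",
    instruction="Where is the heart in this image?",
    answer="The heart is located in the marked region.",
    regions=[Region(label="heart", box=(0.35, 0.40, 0.70, 0.80))],
)
```

Keeping the regions as a list lets one sample carry several structures or lesions, which is what a grounded report would need.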

Key Features

Region-aware reasoning: MedRegA introduces three region-centric tasks — Region-to-Text Identification, Text-to-Region Detection, and Grounded Report Generation — that tie language outputs to explicit bounding-box coordinates for interpretable, localizable predictions.
Bilingual operation: The model handles both English and Chinese medical instructions, broadening its applicability to clinical and research settings in Chinese-speaking healthcare systems.
Generalist across modalities: A single model spans eight imaging modalities and multiple body parts, covering visual question answering, report generation, and image classification alongside the region tasks.
MedRegInstruct dataset: A large-scale instruction corpus pairing medical images with region coordinates, built from eight public sources including MIMIC-CXR, SA-Med-2D-20M, PanNuke, ISIC, and the VinDr family of datasets.
Open release: Code is MIT-licensed, with model weights and the MedRegInstruct dataset published on Hugging Face.
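To make the idea of tying language outputs to bounding-box coordinates concrete, here is a minimal parsing sketch. It assumes an InternVL-style markup of the form <ref>label</ref><box>[[x1, y1, x2, y2]]</box> with integer coordinates on a 0 to 1000 scale; whether MedRegA emits exactly this format is an assumption, and the function name parse_grounded_text and the sample string are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed markup (not confirmed for MedRegA):
#   <ref>label</ref><box>[[x1, y1, x2, y2]]</box>
PATTERN = re.compile(
    r"<ref>(.*?)</ref>\s*<box>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]</box>"
)


def parse_grounded_text(
    text: str, width: int, height: int, scale: int = 1000
) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Extract (label, pixel_box) pairs from grounded model output."""
    results = []
    for match in PATTERN.finditer(text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(g) for g in match.groups()[1:])
        # Rescale from the assumed 0..scale grid to pixel coordinates.
        box = (
            round(x1 * width / scale),
            round(y1 * height / scale),
            round(x2 * width / scale),
            round(y2 * height / scale),
        )
        results.append((label, box))
    return results


# Hypothetical output string, for illustration only.
output = "There is <ref>cardiomegaly</ref><box>[[350, 400, 700, 800]]</box> in the image."
print(parse_grounded_text(output, width=2000, height=1000))
# prints [('cardiomegaly', (700, 400, 1400, 800))]
```

Text without any box markup yields an empty list, so the same parser can be run on image-level answers without special casing.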

Technical Details

MedRegA is built on the InternVL-Chat-V1-2 backbone, a roughly 40-billion-parameter vision-language model that couples a vision transformer image encoder with a large language model decoder. The authors adapt this generalist foundation through instruction tuning on MedRegInstruct, which augments standard image-text supervision with explicit region coordinates so the model learns to attend to and describe localized structures. Training data is drawn from eight public medical imaging repositories spanning chest X-ray (MIMIC-CXR, VinDr-CXR, VinDr-PCXR), mammography (VinDr-Mammo), spine X-ray (VinDr-SpineXR), dermatology (ISIC), histopathology (PanNuke), and the large segmentation collection SA-Med-2D-20M. Across image-level and region-level benchmarks, MedRegA reports competitive or superior performance relative to general and medical MLLMs on visual question answering, report generation, medical image classification, and region detection, with the region-grounding capability providing interpretability that whole-image models lack.
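Region detection output is commonly scored by comparing predicted and reference boxes with intersection over union (IoU). The sketch below shows that standard computation; it is offered as background, and whether the MedRegA benchmarks use this exact metric or any particular threshold is not stated here. The function name box_iou is illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes: 1.0
print(box_iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: 0.0
```

A prediction is then typically counted as correct when its IoU with the reference box exceeds a chosen threshold, which makes the metric sensitive to the small structures and lesions that region grounding is meant to capture.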

Applications

MedRegA targets clinical and research scenarios where both broad coverage and fine-grained localization matter. Radiologists and clinicians can use it to generate grounded reports in which findings are tied to specific image regions, to detect and identify anatomical structures or lesions, and to answer questions across diverse modalities — all from one model rather than a collection of task-specific tools. Its bilingual support makes it particularly relevant for Chinese-language clinical workflows, and the region-grounding output offers a degree of transparency useful for second-read assistance and education.

Impact

By coupling generalist breadth with explicit region grounding, MedRegA advances the interpretability of medical MLLMs, an area where opaque whole-image reasoning has limited clinical trust. Its acceptance at ICLR 2025 and the open release of code, weights, and the MedRegInstruct dataset lower the barrier for follow-on work on region-aware medical vision-language modeling. As with other research-stage medical MLLMs, the model is not a cleared clinical device, and its reported gains are benchmark-based, so real-world deployment would require prospective validation. Even so, MedRegA provides a reproducible foundation and a reusable region-annotated dataset for the community.

Citation

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Preprint

Wang, L., et al. (2024) Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks. International Conference on Learning Representations.

DOI: 10.48550/arXiv.2410.18387

Recent citations

Papers that recently cited this model.

MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration
Jiahui Peng, He Yao, Jingwen Li, et al.
Apr 2026
Citations: 0

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
Yang Yu, Dunyuan Xu, Yaoqian Li, et al.
Apr 2026
Citations: 0

NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
Y. Cheng, Runkai Zhao, Weidong Cai
Mar 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Yunkai Dang, Kaichen Huang, Jiahao Huo, et al.
arXiv.org · Dec 2024
Citations: 74

Multimodal Large Language Models in Medical Imaging: Current State and Future Directions
Yoojin Nam, Dong Yeong Kim, Sunggu Kyung, et al.
Korean Journal of Radiology · Aug 2025
Citations: 60

Large Language Model With Region-Guided Referring and Grounding for CT Report Generation
Zhixuan Chen, Yequan Bie, Haibo Jin, et al.
IEEE Transactions on Medical Imaging · Nov 2024
Citations: 29

Token Activation Map to Visually Explain Multimodal LLMs
Yi Li, Hualiang Wang, Xinpeng Ding, et al.
IEEE International Conference on Computer Vision · Jun 2025
Citations: 22

MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization
Huihui Xu, Yuanpeng Nie, Hualiang Wang, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Jul 2025
Citations: 17

Citations

Total Citations: 29

Influential: 2

References: 63

GitHub

Stars: 46

Forks: 2

Open Issues: 3

Contributors: 1

Last Push: 9mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 13

Likes: 0

Last Modified: 1y ago

Pipeline: image-text-to-text

Fields of citing research

Computer Science: 100%
Medicine: 92%
Engineering: 27%
Linguistics: 4%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

65 (Partial)

Usability (can I run it?): 71

Reproducibility (can I retrain it?): 57

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset
