bio.rodeo


Pathology · Language model

MedRegA

Hong Kong University of Science and Technology / Sun Yat-sen University

Region-aware bilingual (Chinese-English) medical multimodal LLM that handles image- and region-level vision-language tasks across eight imaging modalities.

Released: October 2024
Parameters: approximately 40 billion

MedRegA is a region-aware, bilingual (Chinese-English) medical multimodal large language model designed to handle a broad spectrum of biomedical vision-language tasks within a single generalist system. Most medical MLLMs reason over a whole image at once, which makes their outputs difficult to interpret and prone to overlooking the small anatomical structures or lesions that drive clinical decisions. MedRegA addresses this by explicitly grounding its reasoning in image regions, mimicking the clinical workflow in which a radiologist surveys an entire scan and then focuses attention on specific areas before reaching a conclusion.

The model was developed by Lehan Wang, Haonan Wang, Honglong Yang, and Xiaomeng Li of the Hong Kong University of Science and Technology, together with radiologist collaborators Jiaji Mao, Zehong Yang, and Jun Shen from Sun Yat-sen Memorial Hospital, Sun Yat-sen University. It was first released as a preprint in October 2024 and accepted to ICLR 2025.

To train region-aware behavior, the authors introduce MedRegInstruct, an instruction-tuning corpus in which samples are paired with the coordinates of body structures or lesions. This lets MedRegA serve as an interpretable generalist that can both answer questions about whole images and localize, identify, and report on specific anatomical regions across eight medical imaging modalities.
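As a rough illustration of what "samples paired with coordinates" could look like, the sketch below models one instruction-tuning record in Python. The field names, the example file name, and the 0 to 1000 normalization are assumptions made for illustration; the actual MedRegInstruct schema is not described here and may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A labeled bounding box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    def normalized(self, width: int, height: int, scale: int = 1000) -> List[int]:
        """Rescale pixel coordinates to integers in [0, scale] (assumed convention)."""
        return [
            round(self.x1 / width * scale),
            round(self.y1 / height * scale),
            round(self.x2 / width * scale),
            round(self.y2 / height * scale),
        ]


@dataclass
class RegionSample:
    """One hypothetical instruction-tuning record: image, prompt, answer, regions."""
    image_path: str
    width: int
    height: int
    instruction: str
    answer: str
    regions: List[Region] = field(default_factory=list)


# Hypothetical example record; the values are made up for illustration.
sample = RegionSample(
    image_path="example_cxr.png",
    width=2000,
    height=2500,
    instruction="Where is the cardiac silhouette?",
    answer="cardiac silhouette",
    regions=[Region("cardiac silhouette", 800, 1250, 1400, 2000)],
)

print(sample.regions[0].normalized(sample.width, sample.height))
# [400, 500, 700, 800]
```

Normalizing coordinates to a fixed integer range is one common way to make box positions independent of image resolution, which is convenient when boxes are written out as text tokens.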

#Key Features

  • Region-aware reasoning: MedRegA introduces three region-centric tasks (Region-to-Text Identification, Text-to-Region Detection, and Grounded Report Generation) that tie language outputs to explicit bounding-box coordinates, so predictions can be localized and inspected.
  • Bilingual operation: The model handles both English and Chinese medical instructions, broadening its applicability to clinical and research settings in Chinese-speaking healthcare systems.
  • Generalist across modalities: A single model spans eight imaging modalities and multiple body parts, covering visual question answering, report generation, and image classification alongside the region tasks.
  • MedRegInstruct dataset: A large-scale instruction corpus pairing medical images with region coordinates, built from eight public sources including MIMIC-CXR, SA-Med-2D-20M, PanNuke, ISIC, and the VinDr family of datasets.
  • Open release: Code is MIT-licensed, with model weights and the MedRegInstruct dataset published on Hugging Face.
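To make the link between language output and bounding-box coordinates concrete, here is a minimal sketch of how a downstream tool might pull boxes out of generated text. It assumes a hypothetical `<box>[x1, y1, x2, y2]</box>` tag format with coordinates normalized to 0 to 1000; the format MedRegA actually emits is not specified here, so treat the pattern and the helper names as placeholders.

```python
import re

# Assumed output format: <box>[x1, y1, x2, y2]</box> with integer coordinates.
BOX_PATTERN = re.compile(
    r"<box>\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*</box>"
)


def extract_boxes(text):
    """Return every (x1, y1, x2, y2) tuple found inside the assumed <box> tags."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def to_pixels(box, width, height, scale=1000):
    """Map a box normalized to [0, scale] back to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * width / scale,
        y1 * height / scale,
        x2 * width / scale,
        y2 * height / scale,
    )


# Made-up model output used only to exercise the parser.
output = "A nodule is seen in the right upper lobe <box>[120, 80, 310, 240]</box>."
boxes = extract_boxes(output)
print(boxes)                              # [(120, 80, 310, 240)]
print(to_pixels(boxes[0], 1000, 2000))    # (120.0, 160.0, 310.0, 480.0)
```

Text with no box tags simply yields an empty list, so the same helper can be applied to image-level answers without special-casing.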

#Technical Details

MedRegA is built on the InternVL-Chat-V1-2 backbone, a roughly 40-billion-parameter vision-language model that couples a vision transformer image encoder with a large language model decoder. The authors adapt this generalist foundation through instruction tuning on MedRegInstruct, which augments standard image-text supervision with explicit region coordinates so the model learns to attend to and describe localized structures.

Training data is drawn from eight public medical imaging repositories spanning chest X-ray (MIMIC-CXR, VinDr-CXR, VinDr-PCXR), mammography (VinDr-Mammo), spine X-ray (VinDr-SpineXR), dermatology (ISIC), histopathology (PanNuke), and the large segmentation collection SA-Med-2D-20M.

Across image-level and region-level benchmarks, MedRegA reports competitive or superior performance relative to general and medical MLLMs on visual question answering, report generation, medical image classification, and region detection. Its region-grounding capability adds interpretability that models reasoning only over whole images lack.
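Region detection is commonly scored by comparing predicted and reference boxes with intersection over union (IoU). The text above does not say which metric or threshold the MedRegA benchmarks use, so the sketch below is a generic illustration of the idea; the function names and the 0.5 threshold are assumptions, not the authors' evaluation code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def hit_rate(predictions, references, threshold=0.5):
    """Fraction of paired boxes whose IoU meets the (assumed) threshold."""
    if not references:
        return 0.0
    hits = sum(1 for p, r in zip(predictions, references) if iou(p, r) >= threshold)
    return hits / len(references)


# Made-up boxes: the first pair overlaps fully, the second pair is disjoint.
preds = [(0, 0, 10, 10), (0, 0, 1, 1)]
refs = [(0, 0, 10, 10), (2, 2, 3, 3)]
print(hit_rate(preds, refs))  # 0.5
```

A threshold-based hit rate like this rewards boxes that land roughly on the right structure, while the raw IoU value gives a finer-grained view of localization quality.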

#Applications

MedRegA targets clinical and research scenarios where both broad coverage and fine-grained localization matter. Radiologists and clinicians can use it to generate grounded reports in which findings are tied to specific image regions, to detect and identify anatomical structures or lesions, and to answer questions across diverse modalities, all from one model rather than a collection of task-specific tools. Its bilingual support makes it particularly relevant for Chinese-language clinical workflows, and the region-grounded output offers a degree of transparency useful for second-read assistance and education.

#Impact

By coupling generalist breadth with explicit region grounding, MedRegA advances the interpretability of medical MLLMs, an area where opaque whole-image reasoning has limited clinical trust. Its acceptance at ICLR 2025 and the open release of code, weights, and the MedRegInstruct dataset lower the barrier for follow-on work on region-aware medical vision-language modeling. As with other research-stage medical MLLMs, the model is not a cleared clinical device, and its reported gains are benchmark-based; real-world deployment would require prospective validation. Even so, MedRegA provides a reproducible foundation and a reusable region-annotated dataset for the community.

Citation

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Preprint

Wang, L., et al. (2024). Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks. arXiv preprint arXiv:2410.18387. Accepted to the International Conference on Learning Representations (ICLR) 2025.

DOI: 10.48550/arXiv.2410.18387


Citations

Total Citations: 26
Influential: 1
References: 63

GitHub

Stars: 45
Forks: 2
Open Issues: 3
Contributors: 1
Last Push: 7mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 14
Likes: 0
Last Modified: 1y ago
Pipeline: image-text-to-text


Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 65 (Partial)
Usability (can I run it?): 71
Reproducibility (can I retrain it?): 57
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

histology, image_classification, instruction_tuning, language_model, multimodal, object_detection, radiology, report_generation, transformer, vision_transformer, visual_question_answering

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset