MedPLIB

Baidu / China Agricultural University / Chinese Academy of Sciences / Peking University

Biomedical multimodal LLM that answers questions about medical images and returns pixel-level segmentation masks, using a mixture-of-experts design.

Released: December 2024

Parameters: 12 Billion

MedPLIB (Medical Pixel-Level Insight for Biomedicine) is a biomedical multimodal large language model (MLLM) that extends conversational image understanding down to the pixel level. Most biomedical MLLMs operate only at the whole-image scale, answering questions about an entire scan or slide. MedPLIB additionally accepts pixel-level prompts (points, bounding boxes, and free-form shapes) and produces pixel-level grounding: it returns segmentation masks that localize the structures it describes. This connects high-level visual question answering with the fine-grained spatial reasoning that clinical interpretation often requires.

The model was introduced in December 2024 in the paper "Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine" and was accepted to AAAI 2025. It was developed in a collaboration led by Baidu, with contributors from China Agricultural University, the Chinese Academy of Sciences, and Peking University. Alongside the model, the authors released MeCoVQA (Medical Complex Vision Question Answering), a dataset of roughly 310,000 question-answer pairs spanning eight medical imaging modalities and covering complex, grounding-oriented instructions.

A central contribution is the use of a mixture-of-experts (MoE) design that lets a single model serve both broad language-and-vision dialogue and precise pixel grounding without paying the full cost of a monolithic model at inference time.
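The single-expert idea can be illustrated with a toy dispatcher. This is a minimal sketch, not MedPLIB's implementation: the two experts are plain Python functions standing in for neural network blocks, and the keyword-based `route` rule and the `GROUNDING_CUES` list are hypothetical stand-ins, since this entry does not describe how the real router decides. The only point it demonstrates is that each query runs exactly one expert.

```python
from typing import Callable, Dict, List


def visual_language_expert(query: str) -> str:
    """Toy stand-in for the visual-language expert (returns a text answer)."""
    return f"[text answer to: {query}]"


def pixel_grounding_expert(query: str) -> str:
    """Toy stand-in for the pixel-grounding expert (returns a mask placeholder)."""
    return f"[segmentation mask for: {query}]"


EXPERTS: Dict[str, Callable[[str], str]] = {
    "visual_language": visual_language_expert,
    "pixel_grounding": pixel_grounding_expert,
}

# Assumption: a keyword rule replaces the model's actual routing mechanism,
# which is not described in this entry.
GROUNDING_CUES = ("segment", "mask", "outline", "locate")


def route(query: str) -> str:
    """Pick the name of the single expert that should handle this query."""
    lowered = query.lower()
    if any(cue in lowered for cue in GROUNDING_CUES):
        return "pixel_grounding"
    return "visual_language"


def answer(query: str, call_log: List[str]) -> str:
    """Run exactly one expert and record which one was activated."""
    name = route(query)
    call_log.append(name)
    return EXPERTS[name](query)


if __name__ == "__main__":
    log: List[str] = []
    print(answer("Segment the liver in this CT slice", log))
    print(answer("What modality is this image?", log))
    print(log)  # one entry per query: only one expert ran each time
```

Because the unused expert is never called, the per-query cost depends on the activated expert rather than on the total parameter count, which is the trade-off the paragraph above describes.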

Key Features

Pixel-level grounding: Returns segmentation masks tied to natural-language queries, so the model can both name and spatially localize anatomical structures and findings in medical images.
Pixel-level prompting: Accepts user-supplied points, bounding boxes, and arbitrary shapes as visual prompts, enabling region-specific questions rather than whole-image-only interaction.
Mixture-of-experts routing: Coordinates a visual-language expert and a pixel-grounding expert; only a single expert is activated per query, keeping inference cost close to a 7B model despite a ~12B total parameter footprint.
Multi-stage training: A staged curriculum first trains the experts on their respective skills, then aligns them, avoiding the task interference common when one model is jointly trained on dialogue and segmentation.
MeCoVQA dataset: A purpose-built corpus of ~310K complex VQA and grounding pairs across eight modalities, released to support training and evaluation of pixel-aware biomedical MLLMs.
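The three prompt types listed above (points, bounding boxes, and arbitrary shapes) can all be viewed as ways of selecting a region of pixels. The sketch below rasterizes each one into a binary mask using only the standard library. It is an illustration under assumptions: the function names, the `(x, y)` coordinate convention, and the mask-as-nested-list representation are hypothetical and are not taken from the MedPLIB codebase, which may encode region prompts differently.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # mask[row][col]; 1 marks a selected pixel


def point_prompt(height: int, width: int, x: int, y: int, radius: int = 0) -> Mask:
    """Select every pixel within `radius` of the clicked pixel (x, y)."""
    return [
        [1 if (col - x) ** 2 + (row - y) ** 2 <= radius ** 2 else 0
         for col in range(width)]
        for row in range(height)
    ]


def box_prompt(height: int, width: int, x0: int, y0: int, x1: int, y1: int) -> Mask:
    """Select pixels inside an axis-aligned box with inclusive corners."""
    return [
        [1 if x0 <= col <= x1 and y0 <= row <= y1 else 0 for col in range(width)]
        for row in range(height)
    ]


def _inside(px: float, py: float, vertices: Sequence[Tuple[float, float]]) -> bool:
    """Even-odd ray-casting test: is the point (px, py) inside the polygon?"""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def shape_prompt(height: int, width: int,
                 vertices: Sequence[Tuple[float, float]]) -> Mask:
    """Select pixels whose centers fall inside a free-form polygon."""
    return [
        [1 if _inside(col + 0.5, row + 0.5, vertices) else 0 for col in range(width)]
        for row in range(height)
    ]


def area(mask: Mask) -> int:
    """Count the selected pixels in a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    print(area(point_prompt(5, 5, 2, 2, radius=1)))
    print(area(box_prompt(4, 4, 1, 1, 2, 2)))
    print(area(shape_prompt(4, 4, [(0, 0), (4, 0), (0, 4)])))
```

Reducing every prompt to a mask gives one common representation for region-specific questions, whatever tool the user drew the region with.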

Technical Details

MedPLIB couples a CLIP-ViT-Large/14-336 vision encoder and a LLaMA-7B language backbone with a SAM-Med2D-based segmentation module for mask generation. The mixture-of-experts architecture totals roughly 12B parameters but activates a single expert per query, so inference cost is close to that of a 7B model. Training follows a multi-stage strategy that develops the visual-language and pixel-grounding experts separately before coordinating them, which mitigates cross-task interference. On zero-shot pixel grounding, the authors report that MedPLIB leads the strongest small baseline models by about 19.7 mDice points and the strongest larger models by roughly 15.6 mDice points, alongside competitive medical visual question answering performance. Code, the MeCoVQA dataset, and model checkpoints (released as MedPLIB-7b-2e) are publicly available.
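The grounding results above are stated in mDice, the mean of the Dice coefficient over a set of predicted and reference masks. For binary masks $P$ and $T$, Dice is $2|P \cap T| / (|P| + |T|)$. The sketch below computes it in plain Python. Two details are assumptions rather than facts from this entry: scoring a pair of empty masks as 1.0 is a common convention, and averaging uniformly over all pairs may differ from the grouping (for example per dataset or per modality) used in the paper's evaluation.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # mask[row][col]; values are 0 or 1


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient between two binary masks of identical shape."""
    intersection = sum(
        p & t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Unweighted mean of Dice over (prediction, reference) pairs."""
    return sum(dice(pred, target) for pred, target in pairs) / len(pairs)


if __name__ == "__main__":
    reference = [[1, 0], [0, 0]]
    over_segmented = [[1, 1], [0, 0]]   # one correct pixel, one extra pixel
    exact = [[1, 0], [0, 0]]

    print(dice(over_segmented, reference))
    print(dice(exact, reference))
    print(mean_dice([(over_segmented, reference), (exact, reference)]))
```

Dice is often reported on a 0 to 100 scale, in which case a margin quoted in "points" corresponds to differences of the value above multiplied by 100.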

Applications

MedPLIB targets workflows where clinicians and researchers need to both query and localize content in medical images across modalities such as CT, MRI, X-ray, pathology, and others. Useful tasks include answering region-specific questions about a scan, grounding a described finding to an explicit segmentation mask, and supporting interactive, prompt-driven inspection of medical images. These capabilities are relevant to radiology and pathology assistance, medical education, dataset annotation, and as a building block for downstream biomedical vision-language systems.

Impact

MedPLIB is among the first biomedical MLLMs to integrate conversational understanding with pixel-level grounding in a unified, efficiency-conscious model, and it pairs that capability with an openly released model, code, and the MeCoVQA dataset. By demonstrating that a mixture-of-experts strategy can serve both dialogue and segmentation at single-expert inference cost, it offers a practical template for fine-grained, multimodal biomedical AI. As with most medical foundation models, outputs require expert validation and the model is not intended for autonomous clinical decision-making; reported gains are benchmark results that warrant prospective evaluation before clinical use.

Citation

Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Preprint

Huang, X., et al. (2024) Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine. AAAI Conference on Artificial Intelligence.

DOI: 10.48550/arXiv.2412.09278

Recent citations

Papers that recently cited this model.

Aloe-Vision: Robust Vision-Language Models for Healthcare
Jaume Guasch-Martí, Enrique Lopez-Cuena, Martín Suárez-Fernández, et al.
Jun 2026
Citations: 0

MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models
Aofei Chang, Le Huang, A. Boyd, et al.
Jun 2026
Citations: 0

MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration
Jiahui Peng, He Yao, Jingwen Li, et al.
Apr 2026
Citations: 0 (marked Influential)

Top citations

The most-cited papers that cite this model.

MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis
Chunzheng Zhu, Yangfang Lin, Sheng Chen, et al.
AAAI Conference on Artificial Intelligence · Nov 2025
Citations: 24

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Shengyuan Liu, Boyun Zheng, Wenting Chen, et al.
May 2025
Citations: 15

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding
Jingkun Yue, Siqi Zhang, Zinan Jia, et al.
arXiv.org · May 2025
Citations: 9

UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
Linshan Wu, Yuxiang Nie, Sunan He, et al.
arXiv.org · Apr 2025
Citations: 9 (marked Influential)

Bridging modalities with AI: a review of AI advances in multimodal biomedical imaging
Le Minh Thao Doan, Kaveh Shahhosseini, Suraj Verma, et al.
Communications Engineer · Feb 2026
Citations: 7

Citations

Total Citations: 39

Influential: 6

References: 42

GitHub

Stars: 134

Forks: 9

Open Issues: 11

Contributors: 2

Last Push: 18d ago

Language: Python

HuggingFace

Downloads: 10

Likes: 2

Last Modified: 1y ago

Fields of citing research

Computer Science: 100%
Medicine: 91%
Engineering: 17%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall score: 80 (Open)

Usability (can I run it?): 91

Reproducibility (can I retrain it?): 65

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository · Research Paper · HuggingFace Model
