Baidu / China Agricultural University / Chinese Academy of Sciences / Peking University
Biomedical multimodal LLM with pixel-level insight, combining visual question answering, pixel-grounded prompts, and segmentation via a mixture-of-experts design.
MedPLIB (Medical Pixel-Level Insight for Biomedicine) is a biomedical multimodal large language model (MLLM) that extends conversational image understanding down to the pixel level. Most biomedical MLLMs operate only at the whole-image scale, answering questions about an entire scan or slide. MedPLIB additionally accepts pixel-level prompts (points, bounding boxes, and free-form shapes) and produces pixel-level grounding, returning segmentation masks that localize the structures it describes. This bridges the gap between high-level visual question answering and the fine-grained spatial reasoning that clinical interpretation often requires.
The model was introduced in December 2024 in the paper "Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine" and was accepted to AAAI 2025. It was developed in a collaboration led by Baidu, with contributors from China Agricultural University, the Chinese Academy of Sciences, and Peking University. Alongside the model, the authors released MeCoVQA (Medical Complex Vision Question Answering), a dataset of roughly 310,000 question-answer pairs spanning eight medical imaging modalities and covering complex, grounding-oriented instructions.
A central contribution is the use of a mixture-of-experts (MoE) design that lets a single model serve both broad language-and-vision dialogue and precise pixel grounding without paying the full cost of a monolithic model at inference time.
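To make the cost argument concrete, the following is a minimal, generic sketch of top-1 expert routing, in which a gate scores every expert but only the highest-scoring one is evaluated. It is not MedPLIB's implementation: the function names, the toy experts, the gate weights, and the routing granularity are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(
    x: Vector,
    gate_weights: Sequence[Vector],
    experts: Sequence[Expert],
) -> Tuple[int, Vector]:
    """Score each expert with a linear gate, then run only the best one."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(logits)
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert is called; the others add no compute here.
    # Its output is scaled by the gate probability, a common top-1 convention.
    output = [probs[chosen] * v for v in experts[chosen](x)]
    return chosen, output


# Hypothetical toy experts standing in for a dialogue and a grounding expert.
def dialogue_expert(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def grounding_expert(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


GATE = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gate weights, one row per expert
EXPERTS = [dialogue_expert, grounding_expert]

index, routed = top1_route([3.0, 0.5], GATE, EXPERTS)
print(index)  # index of the single expert that was run
```

All experts must be stored, so total parameter count grows with the number of experts, but each forward pass touches only one of them, which is why active compute can stay close to that of a single dense model.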
MedPLIB couples a CLIP-ViT-Large/14-336 vision encoder and a LLaMA-7B language backbone with a SAM-Med2D-based segmentation module for mask generation. The mixture-of-experts architecture totals roughly 12B parameters but activates a single expert at inference, giving an effective inference cost of about 7B parameters. Training follows a multi-stage strategy that develops the visual-language and pixel-grounding experts separately before coordinating them, which mitigates cross-task interference. In zero-shot pixel grounding, MedPLIB reports leading the best small and large baseline models by about 19.7 and 15.6 mDice points, respectively, alongside competitive medical visual question answering performance. Code, the MeCoVQA dataset, and model checkpoints (released as MedPLIB-7b-2e) are publicly available.
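For readers unfamiliar with the mDice metric, here is a minimal sketch of the Dice coefficient on binary masks and its mean over a set of predictions. The exact evaluation protocol behind the reported numbers (how empty masks are scored, and whether averaging is per image or per class) is not stated here, so those choices are assumptions, as are the function names.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel labels


def dice(pred: Mask, target: Mask) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks of equal shape."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    overlap = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    size = sum(map(sum, pred)) + sum(map(sum, target))
    if size == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * overlap / size


def mean_dice(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-mask Dice scores (assumed per-image averaging)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("need equally sized, non-empty lists of masks")
    scores: List[float] = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


predictions = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
references = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(round(mean_dice(predictions, references), 4))  # → 0.8333
```

In the example, the first pair overlaps on one pixel out of three labeled pixels in total (Dice 2/3) and the second pair matches exactly (Dice 1), giving a mean of 5/6.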
MedPLIB targets workflows where clinicians and researchers need to both query and localize content in medical images across modalities such as CT, MRI, X-ray, pathology, and others. Useful tasks include answering region-specific questions about a scan, grounding a described finding to an explicit segmentation mask, and supporting interactive, prompt-driven inspection of medical images. These capabilities are relevant to radiology and pathology assistance, medical education, and dataset annotation, and they can serve as a building block for downstream biomedical vision-language systems.
MedPLIB is among the first biomedical MLLMs to integrate conversational understanding with pixel-level grounding in a unified, efficiency-conscious model, and it pairs that capability with an openly released model, code, and the MeCoVQA dataset. By demonstrating that a mixture-of-experts strategy can serve both dialogue and segmentation at single-expert inference cost, it offers a practical template for fine-grained, multimodal biomedical AI. As with most medical foundation models, outputs require expert validation and the model is not intended for autonomous clinical decision-making; reported gains are benchmark results that warrant prospective evaluation before clinical use.
Huang, X., et al. (2024) Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine. AAAI Conference on Artificial Intelligence.
DOI: 10.48550/arXiv.2412.09278