bio.rodeo
© 2026 Pulsatance. All rights reserved.
Imaging, Language model

MedPLIB

Baidu / China Agricultural University / Chinese Academy of Sciences / Peking University

Biomedical multimodal LLM with pixel-level insight, combining visual question answering, pixel-grounded prompts, and segmentation via a mixture-of-experts design.

Released: December 2024
Parameters: 12 Billion

MedPLIB (Medical Pixel-Level Insight for Biomedicine) is a biomedical multimodal large language model (MLLM) that extends conversational image understanding down to the pixel level. Most biomedical MLLMs operate only at the whole-image scale, answering questions about an entire scan or slide. MedPLIB additionally accepts pixel-level prompts (points, bounding boxes, and free-form shapes) and produces pixel-level grounding, returning segmentation masks that localize the structures it describes. This connects high-level visual question answering with the fine-grained spatial reasoning that clinical interpretation often requires.

The model was introduced in December 2024 in the paper "Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine" and was accepted to AAAI 2025. It was developed in a collaboration led by Baidu, with contributors from China Agricultural University, the Chinese Academy of Sciences, and Peking University. Alongside the model, the authors released MeCoVQA (Medical Complex Vision Question Answering), a dataset of roughly 310,000 question-answer pairs spanning eight medical imaging modalities and covering complex, grounding-oriented instructions.

A central contribution is a mixture-of-experts (MoE) design that lets a single model serve both broad language-and-vision dialogue and precise pixel grounding while activating only one expert per query, so inference does not pay for the full parameter count.

#Key Features

  • Pixel-level grounding: Returns segmentation masks tied to natural-language queries, so the model can both name and spatially localize anatomical structures and findings in medical images.
  • Pixel-level prompting: Accepts user-supplied points, bounding boxes, and arbitrary shapes as visual prompts, enabling region-specific questions rather than whole-image-only interaction.
  • Mixture-of-experts routing: Coordinates a visual-language expert and a pixel-grounding expert; only a single expert is activated per query, keeping inference cost close to a 7B model despite a ~12B total parameter footprint.
  • Multi-stage training: A staged curriculum first trains the experts on their respective skills, then aligns them, avoiding the task interference common when one model is jointly trained on dialogue and segmentation.
  • MeCoVQA dataset: A purpose-built corpus of ~310K complex VQA and grounding pairs across eight modalities, released to support training and evaluation of pixel-aware biomedical MLLMs.
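
The single-expert routing described in the list above can be illustrated with a small toy. This is a minimal sketch, not MedPLIB's implementation: the keyword-based `route` function stands in for whatever gating the model actually uses, and the 2B shared / 5B per-expert split is an assumption chosen only so that the totals line up with the "~12B total, ~7B active" figures quoted here. All names (`Expert`, `route`, `answer`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Illustrative sizes in billions of parameters. The split below is an
# assumption picked only so the totals match the "~12B total, ~7B active"
# figures quoted in the text; it is not taken from the model.
SHARED_PARAMS_B = 2.0
EXPERT_PARAMS_B = 5.0


@dataclass(frozen=True)
class Expert:
    name: str
    params_b: float
    run: Callable[[str], str]


# Two placeholder experts; the lambdas only label which expert handled the query.
EXPERTS: Dict[str, Expert] = {
    "vqa": Expert("visual-language", EXPERT_PARAMS_B, lambda q: f"text answer for: {q}"),
    "grounding": Expert("pixel-grounding", EXPERT_PARAMS_B, lambda q: f"mask for: {q}"),
}


def route(query: str) -> str:
    """Toy keyword router standing in for the model's real gating (assumption)."""
    cues = ("segment", "outline", "locate", "mask")
    return "grounding" if any(cue in query.lower() for cue in cues) else "vqa"


def answer(query: str) -> str:
    """Run exactly one expert for the query."""
    return EXPERTS[route(query)].run(query)


def total_params_b() -> float:
    """Parameters that must be stored: shared weights plus every expert."""
    return SHARED_PARAMS_B + sum(e.params_b for e in EXPERTS.values())


def active_params_b(query: str) -> float:
    """Parameters actually used for one query: shared weights plus one expert."""
    return SHARED_PARAMS_B + EXPERTS[route(query)].params_b
```

Under these assumed sizes, `total_params_b()` gives the stored footprint while `active_params_b(query)` gives the per-query cost, which is why a model can be large on disk yet comparatively cheap to run.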

#Technical Details

MedPLIB couples a CLIP-ViT-Large/14-336 vision encoder and a LLaMA-7B language backbone with a SAM-Med2D-based segmentation module for mask generation. The mixture-of-experts architecture totals roughly 12B parameters but activates a single expert at inference, giving an effective ~7B inference cost. Training follows a multi-stage strategy that separately develops the visual-language and pixel-grounding experts before coordinating them, mitigating cross-task interference. On zero-shot pixel grounding, MedPLIB reports a margin of about 19.7 mDice points over comparable small models and roughly 15.6 mDice points over larger models, alongside competitive medical visual question answering performance. Code, the MeCoVQA dataset, and model checkpoints (released as MedPLIB-7b-2e) are publicly available.
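
For readers unfamiliar with the mDice metric used in the grounding results above, the following sketch computes the Dice coefficient between binary masks and averages it over a set of cases. It assumes masks are given as nested lists of 0/1 values and that two empty masks score 1.0; evaluation protocols differ on that edge case, and the paper's exact averaging scheme is not specified here. The function names are hypothetical.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = foreground


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient: 2 * |P ∩ T| / (|P| + |T|)."""
    intersection = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p & t
            pred_area += p
            target_area += t
    denominator = pred_area + target_area
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator


def mean_dice(preds: List[Mask], targets: List[Mask]) -> float:
    """Mean Dice over paired predictions and ground-truth masks."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and the same length")
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

Dice ranges from 0 (no overlap) to 1 (identical masks). If the reported margins are read as points on a 0 to 100 scale, which is a common convention but an assumption here, a 19.7-point gap corresponds to a difference of 0.197 in mean Dice.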

#Applications

MedPLIB targets workflows where clinicians and researchers need to both query and localize content in medical images across modalities such as CT, MRI, X-ray, pathology, and others. Useful tasks include answering region-specific questions about a scan, grounding a described finding to an explicit segmentation mask, and supporting interactive, prompt-driven inspection of medical images. These capabilities are relevant to radiology and pathology assistance, medical education, dataset annotation, and as a building block for downstream biomedical vision-language systems.
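
To make region-specific prompting concrete, the sketch below models the three prompt types mentioned on this page (points, bounding boxes, free-form shapes) and rasterizes each into a binary region mask. This is only an illustration of the idea under stated assumptions: the class names, the inclusive box corners, and the choice to represent a free-form shape as an explicit pixel set are all hypothetical and are not taken from MedPLIB's code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple, Union


@dataclass(frozen=True)
class PointPrompt:
    x: int
    y: int


@dataclass(frozen=True)
class BoxPrompt:
    # Inclusive corner coordinates (assumed convention).
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class ShapePrompt:
    # Free-form region given as explicit (x, y) pixels (assumed representation).
    pixels: FrozenSet[Tuple[int, int]]


Prompt = Union[PointPrompt, BoxPrompt, ShapePrompt]


def to_mask(prompt: Prompt, width: int, height: int) -> List[List[int]]:
    """Rasterize a prompt into a height x width binary mask (1 = prompted pixel)."""
    mask = [[0] * width for _ in range(height)]
    if isinstance(prompt, PointPrompt):
        coords = [(prompt.x, prompt.y)]
    elif isinstance(prompt, BoxPrompt):
        coords = [
            (x, y)
            for y in range(prompt.y0, prompt.y1 + 1)
            for x in range(prompt.x0, prompt.x1 + 1)
        ]
    else:
        coords = list(prompt.pixels)
    for x, y in coords:
        # Silently drop pixels that fall outside the image.
        if 0 <= x < width and 0 <= y < height:
            mask[y][x] = 1
    return mask
```

For example, `to_mask(BoxPrompt(1, 1, 2, 2), 4, 4)` marks a 2 by 2 block of pixels. Reducing every prompt type to the same mask form is one simple way to let a single interface accept points, boxes, and arbitrary shapes.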

#Impact

MedPLIB is among the first biomedical MLLMs to integrate conversational understanding with pixel-level grounding in a unified, efficiency-conscious model, and it pairs that capability with an openly released model, code, and the MeCoVQA dataset. By demonstrating that a mixture-of-experts strategy can serve both dialogue and segmentation at single-expert inference cost, it offers a practical template for fine-grained, multimodal biomedical AI. As with most medical foundation models, outputs require expert validation and the model is not intended for autonomous clinical decision-making; reported gains are benchmark results that warrant prospective evaluation before clinical use.

Citation

Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Preprint

Huang, X., et al. (2024) Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine. AAAI Conference on Artificial Intelligence.

DOI: 10.48550/arXiv.2412.09278

Citations

Total Citations: 33
Influential: 4
References: 42

GitHub

Stars: 130
Forks: 9
Open Issues: 11
Contributors: 1
Last Push: 1y ago
Language: Python

HuggingFace

Downloads: 7
Likes: 2
Last Modified: 11mo ago

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 80 (Open)
Usability (can I run it?): 91
Reproducibility (can I retrain it?): 65
Model Openness Framework: Unclassified (missing required components)

Tags

histology, medical_image_grounding, mixture_of_experts, multi_task, multimodal, radiology, segmentation, transformer, vision_transformer, visual_question_answering, zero_shot

Resources

GitHub Repository, Research Paper, HuggingFace Model