MIMO

Medical vision-language model that takes visual prompts on an image and returns answers grounded in pixel-level segmentation masks.

Released: October 2025

Parameters: 7 Billion

Most medical vision-language models share two practical limitations. They condition only on text instructions, ignoring the visual cues a clinician would naturally point to ("what is this mass?"), and they return free-text answers that are disconnected from the specific image regions they describe. This makes it hard to verify whether a model's reasoning is actually anchored to the right anatomy. MIMO (Medical vision language model with visual referring Multimodal Input and pixel grounding Multimodal Output) targets both gaps in a single architecture.

MIMO was developed by researchers at Peking University (Schools of Software and Microelectronics and of Computer Science, with affiliated hospitals and the National Engineering Research Center for Software Engineering) together with collaborators at Augusta University, and presented at CVPR 2025 with an accompanying preprint posted in October 2025. It lets users combine visual prompts (points, boxes, or scribbles on the image) with textual instructions to interrogate complex medical images, and it grounds the medical terminology in its answers to exact pixel-level locations via segmentation masks.

Positioned within the fast-growing family of medical multimodal LLMs, MIMO's contribution is the tight coupling of visual referring input and pixel-grounded output in one instruction-following system, rather than treating visual question answering and segmentation as separate models. The authors release model code and a large supporting dataset to enable this combined input/output regime.

Key Features

Visual referring input: Users supply visual prompts (points, boxes, scribbles) alongside text, so the model attends to the exact region a clinician is asking about instead of relying on text descriptions alone.
Pixel-grounded output: Medical terms in the generated answer are tied to segmentation masks, making responses verifiable against specific image regions.
Unified multi-task inference: A single model performs language-guided segmentation, visual-prompt perception, grounded question answering, and visual-prompt-assisted QA without task-specific retraining.
Eight imaging modalities: Training spans CT, MRI, X-ray, ultrasound, PET, fundus, dermoscopy, and endoscopy, giving broad coverage across radiology and other clinical imaging.
Released code and dataset: Code and the large-scale MIMOSeg dataset are made available to support the visual-referring, pixel-grounding setting.
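To make the visual referring input concrete, the three prompt kinds listed above (points, boxes, scribbles) can be sketched as small data types that are rasterized into a binary "referred region" mask. This is an illustrative sketch only: the class names, the `rasterize` helper, and the mask encoding are assumptions made for this example and are not taken from the MIMO code base.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Pixel = Tuple[int, int]  # (row, col)


@dataclass(frozen=True)
class PointPrompt:  # hypothetical type: a single clicked pixel
    row: int
    col: int


@dataclass(frozen=True)
class BoxPrompt:  # hypothetical type: box with inclusive corners
    top: int
    left: int
    bottom: int
    right: int


@dataclass(frozen=True)
class ScribblePrompt:  # hypothetical type: pixels touched by a free-hand stroke
    pixels: Tuple[Pixel, ...]


VisualPrompt = Union[PointPrompt, BoxPrompt, ScribblePrompt]


def rasterize(prompt: VisualPrompt, height: int, width: int) -> List[List[int]]:
    """Turn a prompt into a height x width binary mask (1 = referred pixel)."""
    if isinstance(prompt, PointPrompt):
        pixels = [(prompt.row, prompt.col)]
    elif isinstance(prompt, BoxPrompt):
        pixels = [
            (r, c)
            for r in range(prompt.top, prompt.bottom + 1)
            for c in range(prompt.left, prompt.right + 1)
        ]
    elif isinstance(prompt, ScribblePrompt):
        pixels = list(prompt.pixels)
    else:
        raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")

    mask = [[0] * width for _ in range(height)]
    for r, c in pixels:
        if 0 <= r < height and 0 <= c < width:  # drop pixels outside the image
            mask[r][c] = 1
    return mask


# Toy usage on a 4 x 5 "image": a box around rows 1-2, columns 1-3.
for row in rasterize(BoxPrompt(top=1, left=1, bottom=2, right=3), height=4, width=5):
    print(row)
```

Reducing every prompt kind to one mask is a design choice of this sketch; a real system may instead encode points and boxes as coordinates, as SAM-style prompt encoders do.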

Technical Details

MIMO couples a CLIP ViT-H/14 vision encoder with a Vicuna-7B (LLaMA-based) language model and a SAM (Segment Anything Model) encoder/decoder for mask generation. A Multi-modal Input Aligner extracts instruction-guided information from the fused visual and textual features so that visual prompts and text jointly steer the language model, while the SAM decoder produces the pixel-level masks that ground the output. The LLM is adapted with LoRA (α=8), and the system was trained for roughly 10–12 days on 4 A800 GPUs (Adam, learning rate 3e-4, batch size 40).

To supervise the combined input/output regime, the team built MIMOSeg, a dataset of roughly 895,000 samples assembled from about one million pixel-level medical samples across eight imaging modalities. It is organized along four perspectives: language-guided segmentation (~255K), visual-prompt perception (~255K), QA with segmentation (~182K), and visual-prompt-assisted QA (~181K). The authors report improvements over existing medical vision-language baselines on these grounded QA and segmentation tasks.
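The MIMOSeg composition quoted above can be turned into per-task shares with a few lines of arithmetic. The sketch below uses only the rounded figures from this page; the variable names are illustrative. Note that the four rounded task counts add up to about 873K, somewhat below the roughly 895K headline figure. The page does not explain the difference, so the shares should be read as indicative rather than exact.

```python
# Approximate MIMOSeg task counts as quoted on this page (thousands of samples).
TASK_COUNTS_K = {
    "language-guided segmentation": 255,
    "visual-prompt perception": 255,
    "QA with segmentation": 182,
    "visual-prompt-assisted QA": 181,
}
STATED_TOTAL_K = 895  # "roughly 895,000 samples"


def composition(counts):
    """Return each task's share of the summed task counts."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


listed_total_k = sum(TASK_COUNTS_K.values())
shares = composition(TASK_COUNTS_K)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"listed total: {listed_total_k}K vs stated total: ~{STATED_TOTAL_K}K")
```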

Applications

MIMO is aimed at interactive clinical and research workflows where a user wants to ask grounded questions about a specific structure in a scan. A radiologist could circle a lesion and ask the model to characterize it, receiving both a textual description and a segmentation mask delineating exactly what the model referred to; similar interactions apply to dermoscopy, endoscopy, fundus, and ultrasound images. Because answers are pixel-grounded, the model is better suited to verification and education than text-only assistants, and its multi-task design lets the same system support segmentation, perception, and question answering in a unified interface.
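One way to use pixel-grounded answers for verification is to compare the returned mask with the region the user pointed at. The sketch below assumes a hypothetical `GroundedAnswer` container (answer text plus one binary mask per grounded term) and computes two simple checks: intersection-over-union with the user's box, and the fraction of the mask that lies inside it. None of these names come from MIMO itself, and the masks are toy data.

```python
from dataclasses import dataclass
from typing import Dict, List

Mask = List[List[int]]  # binary mask, 1 = pixel belongs to the region


@dataclass
class GroundedAnswer:
    """Hypothetical container: answer text plus one mask per grounded term."""
    text: str
    term_masks: Dict[str, Mask]


def box_mask(top: int, left: int, bottom: int, right: int,
             height: int, width: int) -> Mask:
    """Binary mask of an inclusive box, standing in for the user's referred region."""
    return [
        [1 if top <= r <= bottom and left <= c <= right else 0 for c in range(width)]
        for r in range(height)
    ]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def coverage(mask: Mask, region: Mask) -> float:
    """Fraction of the mask's pixels that fall inside the region."""
    inside = sum(x & y for rm, rr in zip(mask, region) for x, y in zip(rm, rr))
    area = sum(map(sum, mask))
    return inside / area if area else 0.0


# Toy example: a 2 x 2 predicted mask inside a 2 x 3 user box on a 4 x 5 image.
answer = GroundedAnswer(
    text="toy answer mentioning a mass",
    term_masks={"mass": [[0, 0, 0, 0, 0],
                         [0, 1, 1, 0, 0],
                         [0, 1, 1, 0, 0],
                         [0, 0, 0, 0, 0]]},
)
referred = box_mask(1, 1, 2, 3, height=4, width=5)
for term, mask in answer.term_masks.items():
    print(term, round(iou(mask, referred), 2), coverage(mask, referred))
```

A low coverage value would flag an answer whose grounded term points somewhere other than the region the user asked about, which is the kind of check a text-only answer does not allow.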

Impact

MIMO is an early example of bringing both visual-referring input and pixel-level grounded output together in one medical multimodal LLM, addressing a recurring trust problem with text-only medical assistants whose answers cannot be checked against the image. Its acceptance at CVPR 2025, the release of code and the 895K-sample MIMOSeg dataset, and coverage of eight imaging modalities give the broader community a concrete benchmark and resource for grounded medical visual-language modeling. As a recent 7B-parameter system built on Vicuna and SAM, its real-world clinical reliability and generalization beyond the curated MIMOSeg distribution remain to be validated independently, and no model card or HuggingFace deployment has been confirmed at the time of writing.

Citations

MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output

Preprint

Chen, Y., et al. (2025) MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. Computer Vision and Pattern Recognition.

DOI: 10.48550/arXiv.2510.10011

MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output

Chen, Y., et al. (2025) MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. Computer Vision and Pattern Recognition.

DOI: 10.1109/CVPR52734.2025.02303

Recent citations

Papers that recently cited this model.

Bridging the Post-discharge Gap: A Traceable Multi-agent Framework for Safe and Continuous Care
Runwei Guan, Yi Zhou, Heyi Lin, et al.
Jun 2026
0
Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis
Sanket Kachole, Siddhesh P. Thakur, S. Innani, et al.
Jun 2026
0
MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models
Aofei Chang, Le Huang, A. Boyd, et al.
Jun 2026
0 · Influential

Top citations

The most-cited papers that cite this model.

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, et al.
arXiv.org · Sep 2025
18
Foundation Models in Biomedical Imaging: Turning Hype into Reality
A. Muneer, Kai Zhang, I. Hamdi, et al.
arXiv.org · Dec 2025
5
A lung CT vision foundation model facilitating disease diagnosis and medical imaging
Zebin Gao, Guoxun Zhang, Hengrui Liang, et al.
Nature Communications · Dec 2025
4
Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Guoxin Wang, Jun Zhao, Xinyi Liu, et al.
arXiv.org · Sep 2025
4
MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data
Mengmeng Zhang, Xiaoping Wu, Hao Luo, et al.
arXiv.org · Jan 2026
3

Citations

Total Citations: 27

Influential: 2

References: 99

GitHub

Stars: 12

Forks: 1

Open Issues: 7

Contributors: 1

Last Push: 1y ago

Fields of citing research

Computer Science: 100%
Medicine: 84%
Engineering: 12%
Environmental Science: 4%
Biology: 4%
Psychology: 4%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

11 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 13

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository Research Paper
