Peking University / Augusta University
A medical vision-language model that accepts visual-referring multimodal input and produces pixel-grounded multimodal output, jointly answering and segmenting medical images.
Most medical vision-language models share two practical limitations. They condition only on text instructions, ignoring the visual cues a clinician would naturally point to ("what is this mass?"), and they return free-text answers that are disconnected from the specific image regions they describe. This makes it hard to verify whether a model's reasoning is actually anchored to the right anatomy. MIMO (Medical vision language model with visual referring Multimodal Input and pixel grounding Multimodal Output) targets both gaps in a single architecture.
MIMO was developed by researchers at Peking University (Schools of Software and Microelectronics and of Computer Science, with affiliated hospitals and the National Engineering Research Center for Software Engineering) together with collaborators at Augusta University, and presented at CVPR 2025 with an accompanying preprint posted in October 2025. It lets users combine visual prompts (points, boxes, or scribbles on the image) with textual instructions to interrogate complex medical images, and it grounds the medical terminology in its answers to exact pixel-level locations via segmentation masks.
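As a rough illustration of what a visual-referring request could look like, the sketch below pairs a text instruction with point, box, or scribble prompts and rasterizes one prompt into a binary region mask. The source only says that such prompts are combined with text; it does not describe MIMO's actual input interface, so every name here (`VisualPrompt`, `ReferringQuery`, `rasterize`) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col) pixel coordinate


@dataclass
class VisualPrompt:
    """Hypothetical container for one user-drawn cue on the image."""
    kind: str            # "point", "box", or "scribble"
    coords: List[Point]  # 1 point, 2 opposite box corners, or N scribble pixels


@dataclass
class ReferringQuery:
    """Hypothetical request: a text instruction plus visual prompts."""
    instruction: str
    prompts: List[VisualPrompt]


def rasterize(prompt: VisualPrompt, height: int, width: int) -> List[List[int]]:
    """Turn a prompt into a binary mask (1 = pixel the user referred to)."""
    mask = [[0] * width for _ in range(height)]
    if prompt.kind == "box":
        (r0, c0), (r1, c1) = prompt.coords
        # Fill the inclusive rectangle, clipped to the image bounds.
        for r in range(max(0, min(r0, r1)), min(height, max(r0, r1) + 1)):
            for c in range(max(0, min(c0, c1)), min(width, max(c0, c1) + 1)):
                mask[r][c] = 1
    elif prompt.kind in ("point", "scribble"):
        # Mark only the listed pixels; strokes are not interpolated here.
        for r, c in prompt.coords:
            if 0 <= r < height and 0 <= c < width:
                mask[r][c] = 1
    else:
        raise ValueError(f"unknown prompt kind: {prompt.kind}")
    return mask


query = ReferringQuery(
    instruction="What is this mass?",
    prompts=[VisualPrompt(kind="box", coords=[(1, 1), (2, 3)])],
)
region = rasterize(query.prompts[0], height=4, width=5)
```

A real system would encode such a region together with the image features rather than pass a raw mask, but the structure shows the pairing of "where" (the prompt) with "what" (the instruction).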
Positioned within the fast-growing family of medical multimodal LLMs, MIMO's contribution is the tight coupling of visual referring input and pixel-grounded output in one instruction-following system, rather than treating visual question answering and segmentation as separate models. The authors release model code and a large supporting dataset to enable this combined input/output regime.
MIMO couples a CLIP ViT-H/14 vision encoder with a Vicuna-7B (LLaMA-based) language model and a SAM (Segment Anything Model) encoder/decoder for mask generation. A Multi-modal Input Aligner extracts instruction-guided information from the fused visual and textual features so that visual prompts and text jointly steer the language model, while the SAM decoder produces the pixel-level masks that ground the output. The LLM is adapted with LoRA (α=8), and the system was trained for roughly 10–12 days on 4 A800 GPUs (Adam, learning rate 3e-4, batch size 40).
To supervise the combined input/output regime, the team built MIMOSeg, a dataset of roughly 895,000 samples organized along four perspectives: language-guided segmentation (~255K), visual-prompt perception (~255K), QA with segmentation (~182K), and visual-prompt-assisted QA (~181K). It was assembled from about one million pixel-level medical samples across eight imaging modalities. The authors report improvements over existing medical vision-language baselines on these grounded QA and segmentation tasks.
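To make the LoRA adaptation mentioned above more concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python. It uses the conventional formulation y = Wx + (α/r)·B(Ax), with the frozen weight W and the trainable low-rank factors A and B. Only α=8 comes from the source; the rank r, the layer shapes, and the function names are assumptions for illustration and do not reflect MIMO's released code.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float],
                 alpha: float = 8.0) -> List[float]:
    """Compute y = W x + (alpha / r) * B (A x).

    w: frozen weight, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    The rank r is read from a; its value here is an assumption,
    since the source reports only alpha = 8.
    """
    rank = len(a)
    scale = alpha / rank
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


# Toy example with rank 1 on a 2x2 layer.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
x = [1.0, 2.0]

b_init = [[0.0], [0.0]]    # B starts at zero, so the adapter is a no-op
b_trained = [[0.5], [0.0]]

y_init = lora_forward(w, a, b_init, x)
y_trained = lora_forward(w, a, b_trained, x)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, which is why LoRA fine-tuning starts from the pretrained model's behavior and only gradually departs from it.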
MIMO is aimed at interactive clinical and research workflows where a user wants to ask grounded questions about a specific structure in a scan. A radiologist could circle a lesion and ask the model to characterize it, receiving both a textual description and a segmentation mask delineating exactly what the model referred to; similar interactions apply to dermoscopy, endoscopy, fundus, and ultrasound images. Because answers are pixel-grounded, the model is better suited to verification and education than text-only assistants, and its multi-task design lets the same system support segmentation, perception, and question answering in a unified interface.
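One simple way a pixel-grounded answer could be checked is to measure how much of the returned mask falls inside the region the user pointed at. The sketch below computes intersection-over-union and a containment ratio for two binary masks. The function name and the toy masks are illustrative assumptions; this is not MIMO's evaluation protocol or part of its released code.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, 1 = foreground pixel


def mask_overlap(pred: Mask, ref: Mask) -> Tuple[float, float]:
    """Return (IoU, containment) for two equally sized binary masks.

    IoU         = |pred AND ref| / |pred OR ref|
    containment = |pred AND ref| / |pred|  (share of the prediction
                  that lies inside the referred region)
    """
    inter = union = pred_area = 0
    for prow, rrow in zip(pred, ref):
        for p, r in zip(prow, rrow):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
            pred_area += 1 if p else 0
    iou = inter / union if union else 0.0
    containment = inter / pred_area if pred_area else 0.0
    return iou, containment


# Region the user circled (hypothetical) and mask returned by a model.
referred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
predicted = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
iou, containment = mask_overlap(predicted, referred)
```

A low containment value would flag an answer whose mask drifts away from the structure the user asked about, which is the kind of check a text-only answer cannot support.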
MIMO is an early example of bringing both visual-referring input and pixel-level grounded output together in one medical multimodal LLM, addressing a recurring trust problem with text-only medical assistants whose answers cannot be checked against the image. Its acceptance at CVPR 2025, the release of code and the 895K-sample MIMOSeg dataset, and coverage of eight imaging modalities give the broader community a concrete benchmark and resource for grounded medical visual-language modeling. As a recent 7B-parameter system built on Vicuna and SAM, its real-world clinical reliability and generalization beyond the curated MIMOSeg distribution remain to be validated independently, and no model card or HuggingFace deployment has been confirmed at the time of writing.
Chen, Y., et al. (2025) MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. Computer Vision and Pattern Recognition.
DOI: 10.48550/arXiv.2510.10011
Chen, Y., et al. (2025) MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. Computer Vision and Pattern Recognition.
DOI: 10.1109/CVPR52734.2025.02303