bio.rodeo
Imaging foundation models
Imaging, Language model

MIMO

Peking University / Augusta University

A medical vision-language model that accepts visual-referring multimodal input and produces pixel-grounded multimodal output, jointly answering and segmenting medical images.

Released: October 2025
Parameters: 7 billion

Most medical vision-language models share two practical limitations. They condition only on text instructions, ignoring the visual cues a clinician would naturally point to ("what is this mass?"), and they return free-text answers that are disconnected from the specific image regions they describe. This makes it hard to verify whether a model's reasoning is actually anchored to the right anatomy. MIMO (Medical vision language model with visual referring Multimodal Input and pixel grounding Multimodal Output) targets both gaps in a single architecture.

MIMO was developed by researchers at Peking University (Schools of Software and Microelectronics and of Computer Science, with affiliated hospitals and the National Engineering Research Center for Software Engineering) together with collaborators at Augusta University, and presented at CVPR 2025 with an accompanying preprint posted in October 2025. It lets users combine visual prompts (points, boxes, or scribbles on the image) with textual instructions to interrogate complex medical images, and it grounds the medical terminology in its answers to exact pixel-level locations via segmentation masks.

Positioned within the fast-growing family of medical multimodal LLMs, MIMO's contribution is the tight coupling of visual referring input and pixel-grounded output in one instruction-following system, rather than treating visual question answering and segmentation as separate models. The authors release model code and a large supporting dataset to enable this combined input/output regime.

#Key Features

  • Visual referring input: Users supply visual prompts (points, boxes, scribbles) alongside text, so the model attends to the exact region a clinician is asking about instead of relying on text descriptions alone.
  • Pixel-grounded output: Medical terms in the generated answer are tied to segmentation masks, making responses verifiable against specific image regions.
  • Unified multi-task inference: A single model performs language-guided segmentation, visual-prompt perception, grounded question answering, and visual-prompt-assisted QA without task-specific retraining.
  • Eight imaging modalities: Training spans CT, MRI, X-ray, ultrasound, PET, fundus, dermoscopy, and endoscopy, giving broad coverage across radiology and other clinical imaging.
  • Released code and dataset: Code and the large-scale MIMOSeg dataset are made available to support the visual-referring, pixel-grounding setting.
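To make the visual-referring idea concrete, here is a minimal sketch of how point, box, and scribble prompts could be represented and turned into a binary region mask. This is an illustration only: the class names, the `radius` parameter, and the `rasterize` helper are hypothetical and are not taken from the MIMO codebase, which may encode prompts quite differently.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class PointPrompt:
    x: int
    y: int
    radius: int = 2  # hypothetical: pixels around the click treated as referred


@dataclass(frozen=True)
class BoxPrompt:
    x0: int
    y0: int
    x1: int  # inclusive right edge
    y1: int  # inclusive bottom edge


@dataclass(frozen=True)
class ScribblePrompt:
    points: Tuple[Tuple[int, int], ...]  # (x, y) pixels along the stroke


Prompt = Union[PointPrompt, BoxPrompt, ScribblePrompt]


def rasterize(prompt: Prompt, height: int, width: int) -> List[List[int]]:
    """Return a height x width mask with 1 where the prompt marks the image."""
    mask = [[0] * width for _ in range(height)]
    if isinstance(prompt, PointPrompt):
        r = prompt.radius
        for y in range(max(0, prompt.y - r), min(height, prompt.y + r + 1)):
            for x in range(max(0, prompt.x - r), min(width, prompt.x + r + 1)):
                # Keep only pixels inside the disc around the click.
                if (x - prompt.x) ** 2 + (y - prompt.y) ** 2 <= r * r:
                    mask[y][x] = 1
    elif isinstance(prompt, BoxPrompt):
        for y in range(max(0, prompt.y0), min(height, prompt.y1 + 1)):
            for x in range(max(0, prompt.x0), min(width, prompt.x1 + 1)):
                mask[y][x] = 1
    else:
        for x, y in prompt.points:
            # Stroke pixels that fall outside the image are ignored.
            if 0 <= x < width and 0 <= y < height:
                mask[y][x] = 1
    return mask
```

A region mask like this is one simple way to hand "the area the clinician pointed at" to downstream code, regardless of which prompt type produced it.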

#Technical Details

MIMO couples a CLIP ViT-H/14 vision encoder with a Vicuna-7B (LLaMA-based) language model and a SAM (Segment Anything Model) encoder/decoder for mask generation. A Multi-modal Input Aligner extracts instruction-guided information from the fused visual and textual features so that visual prompts and text jointly steer the language model, while the SAM decoder produces the pixel-level masks that ground the output. The LLM is adapted with LoRA (α=8), and the system was trained for roughly 10–12 days on four A800 GPUs using Adam with a learning rate of 3e-4 and a batch size of 40.

To supervise the combined input/output regime, the team built MIMOSeg, a dataset of roughly 895,000 samples organized along four perspectives: language-guided segmentation (~255K), visual-prompt perception (~255K), QA with segmentation (~182K), and visual-prompt-assisted QA (~181K). The dataset was assembled from about one million pixel-level medical samples across eight imaging modalities. The authors report improvements over existing medical vision-language baselines on these grounded QA and segmentation tasks.
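For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it applies to a frozen weight matrix: the adapted weight is W + (α / r) · B · A, where only the small matrices A and B are trained. The value α = 8 comes from the description above; the rank is not stated here, so `LORA_RANK` is a placeholder assumption, and the helper names are hypothetical. This is generic LoRA arithmetic in plain Python, not MIMO's training code.

```python
from typing import List

Matrix = List[List[float]]

LORA_ALPHA = 8.0  # reported above
LORA_RANK = 8     # assumption: the rank is not given in this profile


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_adapted_weight(
    w: Matrix, down: Matrix, up: Matrix, alpha: float, rank: int
) -> Matrix:
    """Return w + (alpha / rank) * up @ down.

    w:    frozen base weight, d_out x d_in
    down: trainable projection A, rank x d_in
    up:   trainable projection B, d_out x rank
    """
    assert len(down) == rank, "down must have `rank` rows"
    assert all(len(row) == rank for row in up), "up must have `rank` columns"
    scale = alpha / rank
    delta = matmul(up, down)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

In the usual LoRA setup, B starts at zero, so the adapted weight initially equals the frozen weight and the adapter only gradually shifts the model's behavior during fine-tuning.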

#Applications

MIMO is aimed at interactive clinical and research workflows where a user wants to ask grounded questions about a specific structure in a scan. A radiologist could circle a lesion and ask the model to characterize it, receiving both a textual description and a segmentation mask delineating exactly what the model referred to; similar interactions apply to dermoscopy, endoscopy, fundus, and ultrasound images. Because answers are pixel-grounded, the model is better suited to verification and education than text-only assistants, and its multi-task design lets the same system support segmentation, perception, and question answering in a unified interface.
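One way a user interface could exploit pixel-grounded answers for verification is to compare each term's mask with the region the user originally marked. The sketch below uses the Dice overlap coefficient for that comparison. It is an assumption-laden illustration: the `terms_matching_region` helper, the dictionary layout mapping terms to masks, and the 0.5 threshold are all hypothetical, and nothing here is claimed to be how MIMO or its evaluation works.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary mask, 1 = pixel belongs to the region


def dice(a: Mask, b: Mask) -> float:
    """Dice overlap: 2 * |A and B| / (|A| + |B|); 1.0 if both masks are empty."""
    intersection = sum(
        pa & pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b)
    )
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 1.0 if total == 0 else 2.0 * intersection / total


def terms_matching_region(
    grounded: Dict[str, Mask], region: Mask, threshold: float = 0.5
) -> List[str]:
    """List answer terms whose mask overlaps the user-marked region enough.

    `grounded` maps a term from the answer to its segmentation mask
    (hypothetical layout); `threshold` is an arbitrary illustrative cutoff.
    """
    return [
        term for term, mask in grounded.items() if dice(mask, region) >= threshold
    ]
```

For example, if a radiologist circles a lesion, terms whose masks barely overlap the circled area could be flagged for a closer look instead of being accepted at face value.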

#Impact

MIMO is an early example of bringing both visual-referring input and pixel-level grounded output together in one medical multimodal LLM, addressing a recurring trust problem with text-only medical assistants whose answers cannot be checked against the image. Its acceptance at CVPR 2025, the release of code and the 895K-sample MIMOSeg dataset, and coverage of eight imaging modalities give the broader community a concrete benchmark and resource for grounded medical vision-language modeling. As a recent 7B-parameter system built on Vicuna and SAM, its real-world clinical reliability and its generalization beyond the curated MIMOSeg distribution still need independent validation. No model card or Hugging Face deployment had been confirmed at the time of writing.

Citations

MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output

Preprint

Chen, Y., et al. (2025) MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. Computer Vision and Pattern Recognition.

DOI: 10.48550/arXiv.2510.10011

MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output

Chen, Y., et al. (2025) MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. Computer Vision and Pattern Recognition.

DOI: 10.1109/CVPR52734.2025.02303

Citations

Total Citations: 23
Influential: 2
References: 99

GitHub

Stars: 12
Forks: 1
Open Issues: 7
Contributors: 1
Last Push: 1y ago

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 11 (Closed)
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 13
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

histology, instruction_tuning, multimodal, radiology, segmentation, transformer, vision_transformer, visual_grounding, visual_question_answering

Resources

GitHub Repository, Research Paper