Hong Kong University of Science and Technology / Weill Cornell Medicine / Harvard University
Universal foundation model that jointly generates diagnostic text and segments the corresponding targets across ten biomedical imaging modalities.
UniBiomed is a universal foundation model for grounded biomedical image interpretation: rather than producing a free-text finding or a segmentation mask in isolation, it generates a diagnostic description and simultaneously delineates the exact image regions that justify each finding. This pairing of textual reasoning with pixel-level evidence directly targets the interpretability gap that has limited clinical trust in medical-imaging AI, where a prediction is only actionable if a clinician can see what the model is looking at.
The model was developed by Linshan Wu, Hao Chen, and colleagues at The Hong Kong University of Science and Technology (HKUST), with collaborators at Weill Cornell Medicine and Harvard University, and first released as a preprint on arXiv in April 2025. It is positioned as the first model to unify grounded interpretation across the breadth of biomedical imaging, succeeding modality-specific or task-specific systems such as BiomedParse, LISA, and MedPLIB.
UniBiomed's central claim is generality. A single set of weights spans ten imaging modalities and five task families (segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation), so clinicians no longer need to pre-diagnose images or hand-craft textual and visual prompts before analysis.
UniBiomed combines a multi-modal large language model (MLLM) with the Segment Anything Model (SAM): the MLLM produces diagnostic text and emits grounding tokens that drive SAM to segment the referenced anatomy or lesion, allowing diverse tasks to be cast in a single universal training objective. Training used a curated corpus of over 27 million image–region–text triplets spanning the ten modalities. The authors validated the model on 84 datasets (70 internal and 14 external), reporting state-of-the-art results across all five task families. Reported gains include a 10.25% Dice improvement over BiomedParse on 60 segmentation datasets, +3.86% Dice and +3.29% accuracy over LISA on grounded disease recognition, +8.32% ROI-classification accuracy over MedPLIB, and region-aware report-generation scores of 52.4 BLEU-1, 30.4 METEOR, and 47.9 ROUGE-L.
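The segmentation gains above are reported in Dice, which measures the overlap between a predicted mask and a reference mask. As a minimal sketch of how that score is computed for a single pair of binary masks (the function name `dice_score` and the toy masks are illustrative assumptions, not taken from the UniBiomed code or data):

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1.

    Illustrative helper only; not part of the UniBiomed release.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    pred_sum = 0
    target_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            intersection += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0

    if pred_sum + target_sum == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


# Toy 3x3 masks: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 1],
          [0, 0, 0]]

print(dice_score(pred, target))  # 2 * 2 / (3 + 3)
```

Benchmark figures are typically averages of this per-mask score over many images and datasets; the exact averaging protocol used by the authors is not described here.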
UniBiomed is aimed at clinical and research workflows where an interpretable, evidence-linked output matters more than a bare prediction. Radiologists and pathologists can use it to draft region-grounded reports, flag and outline suspicious findings, and answer visual questions about a study, while researchers benefit from a single backbone that generalizes across modalities instead of maintaining separate segmentation and reporting models. Because every textual finding is tied to a mask, downstream reviewers can quickly verify or contest the model's reasoning.
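To illustrate what tying each finding to a mask can look like for a reviewer, here is a minimal sketch of one possible record format. The `GroundedFinding` class, the helper names, and the example finding are hypothetical; the actual output format of UniBiomed is not specified in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1 (assumed representation)


@dataclass
class GroundedFinding:
    """Hypothetical pairing of one textual finding with its supporting mask."""
    text: str
    mask: Mask


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the mask, or None if it is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


def review_summary(findings: List[GroundedFinding]) -> List[str]:
    """One line per finding, so a reviewer can see where to look."""
    lines = []
    for finding in findings:
        box = bounding_box(finding.mask)
        area = sum(1 for row in finding.mask for value in row if value)
        if box is None:
            where = "no region (needs review)"
        else:
            where = f"bbox={box}, area={area}px"
        lines.append(f"{finding.text} -> {where}")
    return lines


# Made-up example data for illustration only.
findings = [
    GroundedFinding(
        text="Nodule in right upper lobe",
        mask=[[0, 0, 0],
              [0, 1, 1],
              [0, 0, 1]],
    ),
]

for line in review_summary(findings):
    print(line)
```

A finding whose mask is empty is surfaced explicitly rather than dropped, since an ungrounded statement is exactly the case a reviewer would want to contest.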
By unifying grounded segmentation and diagnostic language across ten modalities in one openly released model, UniBiomed offers a template for interpretable, general-purpose medical-imaging AI and a strong baseline for grounded interpretation that subsequent multimodal clinical models can build on. The work remains a preprint: although its weights and data have been released, its real-world clinical value still awaits prospective validation. Like other large multimodal medical models, its outputs require expert oversight before any diagnostic use.
Wu, L., et al. (2025) UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation. arXiv.org.
DOI: 10.48550/arXiv.2504.21336