A biomedical foundation model for joint segmentation, detection, and recognition across nine imaging modalities using natural language prompts.
BiomedParse is a biomedical foundation model from Microsoft Research that unifies segmentation, detection, and recognition of anatomical structures and abnormalities across nine imaging modalities within a single framework. Rather than requiring manual bounding boxes or point clicks, BiomedParse accepts a natural language description of a target structure and returns a pixel-level segmentation mask, making it usable by any researcher who can describe what they are looking for. The model was published in Nature Methods in November 2024.
Most biomedical image analysis systems are modality-specific, task-specific, or require expert interaction through point and bounding-box prompts. BiomedParse addresses all three constraints simultaneously. Given a text prompt such as "liver" or "polyp", the model produces segmentation masks across CT, MRI, X-ray, pathology slides, endoscopy, ultrasound, fundus photography, dermoscopy, and optical coherence tomography (OCT) images — without any modality-specific fine-tuning.
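The prompt-driven interface can be illustrated with a minimal sketch. The names below (`run_biomedparse`, the dummy image) are hypothetical placeholders rather than the published repository's API, and the model call is stubbed so the example runs end to end:

```python
import numpy as np

def run_biomedparse(image: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical text-prompted inference wrapper: returns a pixel-wise
    probability map with the input's height and width. Stubbed with zeros
    so the sketch runs end to end."""
    return np.zeros(image.shape[:2], dtype=np.float32)

# A 3-channel image from any of the nine modalities; a real script would load
# one, e.g. np.asarray(Image.open("ct_slice.png").convert("RGB")).
img = np.zeros((512, 512, 3), dtype=np.uint8)
for prompt in ["liver", "kidney", "tumor"]:   # plain-language targets, one mask each
    prob = run_biomedparse(img, prompt)
    mask = prob > 0.5                         # binarize the probability map
    print(prompt, int(mask.sum()), "pixels")
```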
A key insight behind the model is that training segmentation, detection, and recognition jointly produces mutual regularization: each task improves the others through shared representations. This joint learning formulation allows BiomedParse to learn more generalizable visual features than single-task baselines trained on the same data.
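As a concrete illustration of how joint training couples the tasks, here is a hedged sketch of a combined objective in PyTorch. The loss terms and weights are illustrative, not the paper's exact formulation; the point is that gradients from all three heads flow into the shared encoder, which is the mechanism behind the mutual regularization:

```python
import torch
import torch.nn.functional as F

def joint_loss(mask_logits, mask_gt, det_logits, det_gt, cls_logits, cls_gt,
               w_seg=1.0, w_det=1.0, w_rec=1.0):
    """Illustrative multi-task objective: segmentation + detection + recognition."""
    seg = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)  # pixel-wise masks
    det = F.binary_cross_entropy_with_logits(det_logits, det_gt)    # object presence
    rec = F.cross_entropy(cls_logits, cls_gt)                       # object label
    # All three terms backpropagate into shared encoder weights, so each
    # task regularizes the representations the others rely on.
    return w_seg * seg + w_det * det + w_rec * rec
```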
BiomedParse is built on the SEEM (Segment Everything Everywhere All at Once) framework, extended with biomedical domain adaptations. The image encoder is a Focal Vision Transformer initialized from a pretrained checkpoint and fine-tuned on biomedical images. The text encoder is PubMedBERT, which provides biomedical domain-specific language representations for interpreting clinical terminology in prompts. A transformer-based mask decoder cross-attends over image and text features to produce pixel-wise segmentation probability maps at the input image resolution. An auxiliary meta-object classifier, trained on 15 intermediate semantic categories such as organ, abnormality, and histology, supplies the image encoder with coarse object-level supervision alongside the fine-grained mask decoder.
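The data flow can be summarized in a schematic PyTorch module. Everything below is a simplified stand-in (a strided convolution for the Focal ViT, an embedding table for PubMedBERT, a single cross-attention layer for the SEEM-style decoder); the dimensions and vocabulary size are assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class BiomedParseSketch(nn.Module):
    def __init__(self, d=256, n_meta=15, vocab=30522):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, d, kernel_size=16, stride=16)  # stand-in for Focal ViT
        self.text_encoder = nn.Embedding(vocab, d)                       # stand-in for PubMedBERT
        self.decoder = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.mask_head = nn.Linear(d, 1)                                 # pixel-wise mask logits
        self.meta_head = nn.Linear(d, n_meta)                            # 15 meta-object categories

    def forward(self, image, token_ids):
        feat = self.image_encoder(image)                          # B x d x H' x W'
        B, d, H, W = feat.shape
        pix = feat.flatten(2).transpose(1, 2)                     # B x H'W' x d pixel tokens
        txt = self.text_encoder(token_ids).mean(1, keepdim=True)  # pooled prompt embedding
        fused, _ = self.decoder(pix, txt, txt)                    # pixels cross-attend to text
        mask_logits = self.mask_head(fused).view(B, 1, H, W)      # upsampled to input size in practice
        meta_logits = self.meta_head(pix.mean(1))                 # coarse semantics on encoder features
        return mask_logits, meta_logits

model = BiomedParseSketch()
masks, meta = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 8)))
print(masks.shape, meta.shape)  # torch.Size([2, 1, 14, 14]) torch.Size([2, 15])
```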
Training data was constructed from 45 publicly available biomedical segmentation datasets and comprises 1.1 million images, 3.4 million image-mask-label triples, and 6.8 million image-mask-description triples after GPT-4 synthesis of synonymous descriptions for each label. On a held-out test set of 102,855 instances spanning all nine modalities and 64 major object types, BiomedParse achieved the highest Dice scores in every modality among the methods compared, with a statistically significant improvement over MedSAM even when MedSAM was given oracle bounding boxes (p < 10^-4). On detection of irregularly shaped objects, it improved Dice by 39.6% over the best competing method; on recognition, it improved F1 by 74.5% relative to Grounding DINO.
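For reference, the Dice score used in these evaluations measures overlap between the predicted and ground-truth masks, ranging from 0 (no overlap) to 1 (perfect agreement). A self-contained example on a toy pair of masks:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient: 2|P ∩ G| / (|P| + |G|), with eps guarding empty masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice(pred, gt), 3))  # 2*2 / (3+3) = 0.667
```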
BiomedParse is designed for researchers and clinical informaticists who need rapid, text-driven quantification of anatomical objects across large image cohorts without modality-specific tooling. In radiology, it enables automated organ and lesion segmentation in CT and MRI without drawing bounding boxes. In pathology, it segments cells, glands, and tissue structures from whole-slide image patches using plain-language descriptions. In ophthalmology, it handles retinal structure and lesion segmentation in fundus photographs and OCT volumes. It is equally applicable to dermatology (skin lesion delineation in dermoscopy) and endoscopy (polyp and mucosal abnormality segmentation). Beyond inference, BiomedParse can accelerate expert annotation workflows by generating initial segmentation masks from text descriptions that human annotators then refine.
BiomedParse represents a meaningful step toward general-purpose biomedical image analysis, demonstrating that a single model can match or exceed task-specific and modality-specific baselines when trained with a sufficiently diverse and well-harmonized dataset. Its release on HuggingFace under the Apache-2.0 license, alongside the full BiomedParseData training corpus, lowers the barrier for downstream fine-tuning and benchmarking. Notable limitations include 2D-only processing in the v1 model (3D volumetric inference requires slice-by-slice application; a sketch of this workaround follows below), a closed object vocabulary of 82 trained types that may not generalize to novel structures, and sensitivity to the alignment between user prompts and the training ontology. The model has not undergone regulatory review and should not be used for clinical decision-making without further validation.
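A hedged sketch of that slice-by-slice workaround, reusing the hypothetical `run_biomedparse` stub from the earlier example. Note that each slice is segmented independently, so no cross-slice consistency is enforced:

```python
import numpy as np

def run_biomedparse(image: np.ndarray, prompt: str) -> np.ndarray:
    """Same hypothetical 2D entry point as the earlier sketch (stubbed)."""
    return np.zeros(image.shape[:2], dtype=np.float32)

def segment_volume(volume: np.ndarray, prompt: str) -> np.ndarray:
    """volume: Z x H x W (e.g., a CT series); returns a Z x H x W boolean mask."""
    masks = []
    for z in range(volume.shape[0]):
        rgb = np.repeat(volume[z][..., None], 3, axis=-1)  # grayscale slice -> 3-channel
        masks.append(run_biomedparse(rgb, prompt) > 0.5)   # independent 2D inference per slice
    return np.stack(masks)                                 # no cross-slice consistency enforced

vol_mask = segment_volume(np.zeros((40, 224, 224), dtype=np.float32), "liver")
print(vol_mask.shape)  # (40, 224, 224)
```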
Zhao, T., et al. (2024). A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities. Nature Methods.
DOI: 10.1038/s41592-024-02499-w