MIT CSAIL / Massachusetts General Hospital
Interactive foundation model for biomedical image segmentation, prompted with scribbles, clicks, and bounding boxes to segment unseen structures.
ScribblePrompt is an interactive segmentation foundation model for biomedical imaging that lets a user delineate anatomical or pathological structures by drawing scribbles, placing clicks, or dragging bounding boxes, rather than training a new model for each task. It addresses a persistent bottleneck in medical image analysis: manual annotation is slow and expensive, and task-specific models do not generalize to the long tail of structures, modalities, and acquisition protocols that clinicians and researchers actually encounter. By treating segmentation as a promptable, iterative process, the model produces accurate masks for structures and image types it never saw during training.
The model was developed by Hallee Wong, Marianne Rakic, John Guttag, and Adrian Dalca at MIT CSAIL, with clinical affiliation to Massachusetts General Hospital, and was presented at ECCV 2024 (preprint December 2023). It is part of a wave of promptable segmentation systems inspired by the Segment Anything Model (SAM), but is purpose-built for the heterogeneity of biomedical data, where SAM and similar natural-image models tend to underperform.
ScribblePrompt is released in two variants: ScribblePrompt-UNet, an efficient fully-convolutional network, and ScribblePrompt-SAM, which adapts the SAM architecture. Both are designed for fast, responsive inference so that a human can refine a prediction in real time.
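One way to picture a single refinement step is as choosing where the next corrective click should go once a prediction is on screen. The sketch below is a generic heuristic common in interactive-segmentation work, not the released ScribblePrompt code: it compares a predicted mask with a reference mask and places the click in whichever error type (missed pixels or spurious pixels) is larger. The function name `next_correction_click` and its return format are assumptions made for illustration.

```python
import numpy as np


def next_correction_click(pred, target, rng=None):
    """Choose a corrective click from the dominant error region.

    Hypothetical helper for illustration only. Returns (row, col, is_positive),
    or None when the prediction already matches the reference mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    false_neg = target & ~pred  # structure pixels the prediction missed
    false_pos = pred & ~target  # background pixels wrongly included
    if not false_neg.any() and not false_pos.any():
        return None

    # A positive click adds missed structure; a negative click removes excess.
    is_positive = bool(false_neg.sum() >= false_pos.sum())
    rows, cols = np.nonzero(false_neg if is_positive else false_pos)
    i = int(rng.integers(rows.size))
    return int(rows[i]), int(cols[i]), is_positive
```

In an interactive tool the human plays this role directly; a simulated version like this is mainly useful for automated evaluation or for generating training interactions.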
ScribblePrompt is trained on a collection of 65 diverse biomedical imaging datasets spanning many modalities (including MRI, CT, ultrasound, X-ray, and microscopy), combining real labels with synthetically generated ones to broaden coverage. A central methodological contribution is the algorithm that simulates human interactions during training: it produces varied, realistic scribbles, clicks, and bounding boxes so the network learns to interpret partial, ambiguous, and iteratively refined prompts. ScribblePrompt-UNet uses an efficient fully-convolutional encoder-decoder, while ScribblePrompt-SAM fine-tunes the Segment Anything Model backbone. Evaluation across unseen datasets and a controlled user study showed it surpassing baselines including SAM and SAM-Med2D on accuracy while remaining fast enough for interactive use.
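To make the idea of simulated interactions concrete, the following is a minimal sketch of how prompts could be derived from a ground-truth mask. It is not the authors' simulation algorithm, which the summary above describes only at a high level. The function name `simulate_prompts`, its parameters, and the straight-line scribble are all assumptions chosen for brevity; realistic scribbles would be curved, variable in thickness, and less regular.

```python
import numpy as np


def _sample_points(rows, cols, k, rng):
    """Sample up to k (row, col) points without replacement."""
    if rows.size == 0:
        return []
    idx = rng.choice(rows.size, size=min(k, rows.size), replace=False)
    return [(int(rows[i]), int(cols[i])) for i in idx]


def simulate_prompts(mask, n_clicks=3, box_jitter=2, rng=None):
    """Derive a loose box, clicks, and a crude scribble from a binary mask.

    Hypothetical helper for illustration; not the released training code.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask is empty")
    h, w = mask.shape
    rows, cols = np.nonzero(mask)

    # Tight box around the structure, loosened by random jitter so it
    # resembles an imprecise, hand-dragged box: (top, left, bottom, right).
    jitter = rng.integers(0, box_jitter + 1, size=4)
    box = (
        max(int(rows.min()) - int(jitter[0]), 0),
        max(int(cols.min()) - int(jitter[1]), 0),
        min(int(rows.max()) + int(jitter[2]), h - 1),
        min(int(cols.max()) + int(jitter[3]), w - 1),
    )

    # Positive clicks fall inside the structure, negative clicks outside it.
    positive = _sample_points(rows, cols, n_clicks, rng)
    bg_rows, bg_cols = np.nonzero(~mask)
    negative = _sample_points(bg_rows, bg_cols, n_clicks, rng)

    # Crude scribble: a straight line between two in-mask pixels, keeping
    # only the line pixels that lie inside the structure.
    i, j = rng.choice(rows.size, size=2, replace=rows.size < 2)
    r0, c0, r1, c1 = int(rows[i]), int(cols[i]), int(rows[j]), int(cols[j])
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    rr = np.rint(np.linspace(r0, r1, n)).astype(int)
    cc = np.rint(np.linspace(c0, c1, n)).astype(int)
    keep = mask[rr, cc]
    scribble = np.zeros_like(mask)
    scribble[rr[keep], cc[keep]] = True

    return {"box": box, "positive": positive,
            "negative": negative, "scribble": scribble}
```

Sampling a fresh set of prompts for the same mask on every training step is what exposes a network to partial and inconsistent input; passing a seeded `rng` makes the sampling reproducible.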
ScribblePrompt is aimed at researchers and clinicians who need to annotate or segment biomedical images at scale, such as building labeled datasets for downstream models, quantifying lesions or organs in research studies, or prototyping segmentation for a new modality without collecting task-specific training data. Its browser-based demo and lightweight UNet variant make it practical for labs without large compute budgets, and its interactive design fits naturally into human-in-the-loop annotation pipelines where an expert verifies and corrects each mask.
ScribblePrompt demonstrated that a single promptable model, trained with carefully simulated interactions, can serve as a general-purpose annotation tool across the fragmented landscape of biomedical imaging, where most prior work was narrowly task-specific. By releasing both model weights (Apache 2.0) and an interactive demo, the authors made the approach immediately usable, and the measured reductions in annotation time point to concrete value for dataset creation and clinical research. The accompanying MedScribble dataset of multi-annotator scribble annotations also provides a benchmark resource for the interactive-segmentation community. As a biomedical counterpart to SAM-style promptable models, it remains a reference point for scribble- and click-based medical image segmentation.
Wong, H. E., et al. (2024) ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image. European Conference on Computer Vision (ECCV).
DOI: 10.1007/978-3-031-73661-2_12
Wong, H. E., et al. (2023) ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image. arXiv preprint.
DOI: 10.48550/arXiv.2312.07381