Bowang Lab / University Health Network / University of Toronto / Vector Institute / Western University / New York University / Yale University
A promptable foundation model for universal medical image segmentation, fine-tuned from SAM on 1.57M image-mask pairs spanning 10 imaging modalities and 30+ cancer types.
MedSAM is a promptable foundation model for universal medical image segmentation, developed by Jun Ma, Bo Wang, and colleagues at the University Health Network, University of Toronto, the Vector Institute, and collaborating institutions, and published in Nature Communications in January 2024. It adapts Meta AI's Segment Anything Model (SAM) — a general-purpose natural-image segmentation model — to the medical domain, where SAM's zero-shot performance is unreliable because biomedical images differ sharply from web photographs in contrast, texture, and object boundaries.
Image segmentation is a foundational step in nearly every quantitative medical imaging workflow, from delineating tumors for radiotherapy planning to measuring organ volumes. Historically this required a separate specialist model trained per task and modality, each demanding large annotated datasets and brittle to distribution shift. MedSAM instead provides a single model that segments arbitrary structures across modalities when given a bounding-box prompt, collapsing dozens of task-specific pipelines into one interactive tool.
By training on the largest and most diverse medical segmentation corpus assembled at the time, MedSAM demonstrated that the prompt-driven, foundation-model paradigm transfers to medicine. It has become one of the most widely adopted reference points for promptable segmentation in biomedical imaging and seeded a family of follow-up work.
MedSAM retains SAM's three-part architecture: a ViT-Base image encoder, a prompt encoder, and a lightweight mask decoder. It was initialized from pretrained SAM weights; during fine-tuning the prompt encoder was frozen while the image encoder and mask decoder were updated, and only bounding-box prompts were used to keep the interface simple and clinically practical. Training used 1,570,263 image-mask pairs curated from publicly available sources, spanning 10 modalities and over 30 cancer types, on 20 A100 GPUs. Evaluation covered 86 internal and 60 external segmentation tasks. Across these, MedSAM substantially outperformed the original SAM (improvements ranging from roughly 15% to over 50% on hard tasks such as nasopharynx cancer) and was competitive with or superior to specialist segmentation networks while generalizing far better to out-of-distribution data.
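The fine-tuning setup described above (prompt encoder frozen, image encoder and mask decoder updated) can be illustrated with a minimal PyTorch sketch. Everything here is an assumption made for illustration: `ToySegmenter` is a hypothetical stand-in with SAM-like submodule names, not the real model, and the optimizer, learning rate, and loss are placeholders rather than the authors' training configuration.

```python
import torch
from torch import nn


class ToySegmenter(nn.Module):
    """Hypothetical stand-in with SAM-like submodule names; not the real model."""

    def __init__(self) -> None:
        super().__init__()
        self.image_encoder = nn.Conv2d(1, 4, kernel_size=3, padding=1)
        self.prompt_encoder = nn.Linear(4, 4)  # consumes a box (x0, y0, x1, y1)
        self.mask_decoder = nn.Conv2d(4, 1, kernel_size=1)

    def forward(self, image: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        features = self.image_encoder(image)                 # (B, 4, H, W)
        prompt = self.prompt_encoder(box)[:, :, None, None]  # (B, 4, 1, 1)
        return self.mask_decoder(features + prompt)          # (B, 1, H, W) logits


model = ToySegmenter()

# Freeze the prompt encoder; the other two parts stay trainable.
for param in model.prompt_encoder.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer (values are illustrative).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One illustrative update step on random data with an all-background target.
image = torch.randn(2, 1, 16, 16)
box = torch.tensor([[2.0, 2.0, 10.0, 10.0], [4.0, 1.0, 12.0, 9.0]])
target = torch.zeros(2, 1, 16, 16)
frozen_before = model.prompt_encoder.weight.clone()

loss = nn.functional.binary_cross_entropy_with_logits(model(image, box), target)
loss.backward()
optimizer.step()
```

After the step, the prompt encoder's weights are unchanged and carry no gradient, while the image encoder and mask decoder have been updated. The same pattern would apply to a real checkpoint that exposes comparably named submodules.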
MedSAM serves radiologists, pathologists, and medical imaging researchers as a general-purpose segmentation backbone. Typical uses include accelerating manual annotation of tumors and organs, generating training labels for downstream specialist models, supporting radiotherapy and surgical planning, and providing a strong baseline for new segmentation benchmarks. Because it accepts simple box prompts, it integrates cleanly into interactive annotation tools and clinical research pipelines without per-task retraining.
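As a rough illustration of what a box prompt looks like in practice, the sketch below derives one from an existing binary mask, which is one common way to produce prompts for experiments or to bootstrap annotation from a coarse label. The helper name `box_from_mask`, the `margin` parameter, and the `(x_min, y_min, x_max, y_max)` pixel convention are assumptions for this example and are not taken from the MedSAM code.

```python
import numpy as np


def box_from_mask(mask: np.ndarray, margin: int = 0) -> tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) enclosing the nonzero pixels of a 2D mask.

    The box is expanded by `margin` pixels on each side and clipped to the image.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty; no box can be derived")
    height, width = mask.shape
    x_min = max(int(xs.min()) - margin, 0)
    y_min = max(int(ys.min()) - margin, 0)
    x_max = min(int(xs.max()) + margin, width - 1)
    y_max = min(int(ys.max()) + margin, height - 1)
    return x_min, y_min, x_max, y_max


# Toy 8x10 mask with a rectangular "lesion" in rows 2..4, columns 3..6.
mask = np.zeros((8, 10), dtype=np.uint8)
mask[2:5, 3:7] = 1

box = box_from_mask(mask, margin=1)
print(box)  # (2, 1, 7, 5)
```

In an interactive tool the box would come from a user's click-and-drag instead, but the resulting four coordinates play the same role as the prompt.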
MedSAM was among the first works to demonstrate that the promptable foundation-model paradigm transfers effectively to medical imaging, and it has been heavily cited and adopted as a reference baseline across the field. Its public code and weights catalyzed a wave of follow-up models — including video- and 3D-oriented successors such as MedSAM2 — and helped establish interactive, prompt-driven segmentation as a practical standard for biomedical image analysis. Limitations remain: the model depends on a human-supplied prompt rather than fully automatic detection, performs best on well-bounded structures, and inherits coverage gaps from its training modalities, leaving fully automated and 3D/temporal segmentation as active areas of extension.
Ma, J., et al. (2024). Segment anything in medical images. Nature Communications.
DOI: 10.1038/s41467-024-44824-z