University of Oxford / National University of Singapore
SAM2-based foundation model that segments 2D and 3D medical images by treating volumes and image sets as video object tracking.
Medical SAM 2 (MedSAM-2) is a promptable segmentation foundation model that adapts Meta's Segment Anything Model 2 (SAM 2) to medical imaging. Its central idea is to reframe both 2D and 3D medical segmentation as a video object tracking problem: a 3D volume is processed as a sequence of frames, and even an unordered collection of unrelated 2D images can be treated as a pseudo-video. This unifies what are usually separate 2D and 3D segmentation pipelines under a single architecture and inference paradigm.
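The volume-as-video framing can be illustrated with a short NumPy sketch. It is not taken from the MedSAM-2 code base: the function names `volume_to_frames` and `images_to_pseudo_video`, the min-max intensity scaling, and the three-channel replication are all assumptions made here for illustration.

```python
import numpy as np


def volume_to_frames(volume, axis=0):
    """Split a 3D volume into an ordered list of 2D frames (hypothetical helper)."""
    volume = np.asarray(volume, dtype=np.float32)
    lo, hi = float(volume.min()), float(volume.max())
    scale = (hi - lo) if hi > lo else 1.0
    normalized = (volume - lo) / scale  # rescale intensities to [0, 1]
    slices = np.moveaxis(normalized, axis, 0)  # put the slicing axis first
    # Assumption: replicate the single channel three times to mimic RGB input.
    return [np.repeat(s[..., None], 3, axis=-1) for s in slices]


def images_to_pseudo_video(images):
    """Treat a collection of unrelated 2D images as a frame sequence.

    The order carries no temporal meaning; it only fixes the sequence
    in which frames would be visited.
    """
    return [np.asarray(img, dtype=np.float32) for img in images]


# A toy "CT volume" with 2 slices of 3 x 4 pixels.
toy_volume = np.arange(2 * 3 * 4).reshape(2, 3, 4)
frames = volume_to_frames(toy_volume)
```

Under these assumptions, a scan with N slices along the chosen axis becomes N frames that a video segmentation model could consume one after another.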
The model was developed by Jiayuan Zhu, Abdullah Hamdi, Yunli Qi, Yueming Jin, and Junde Wu at the University of Oxford and the National University of Singapore, with the work first posted to arXiv in August 2024. It addresses a persistent limitation of SAM-style models in medicine: out-of-the-box SAM and SAM 2 perform inconsistently on medical modalities such as CT, MRI, ultrasound, and fundus imaging, and propagating a single prompt across an entire 3D scan or image set is non-trivial.
MedSAM-2 sits alongside other SAM derivatives for medicine (such as MedSAM v1, SAM-Med2D, and the later bowang-lab MedSAM2) but is distinguished by its memory-driven, tracking-based formulation and its "One-Prompt Segmentation" capability, which lets a single annotated example drive segmentation of many subsequent images.
MedSAM-2 builds directly on the SAM 2 backbone, which couples a Hiera image encoder with a memory attention module and a streaming memory bank. The authors fine-tune this pipeline for medical data and replace the default temporal memory with a self-sorting memory bank, which curates the most informative embeddings and uses them to condition predictions on later frames. The released checkpoints are fine-tuned on public datasets including REFUGE (optic cup, 2D fundus) and BTCV (abdominal multi-organ, 3D CT). Evaluation spans roughly 25 segmentation tasks across more than a dozen benchmarks, covering abdominal organs, kidney and liver tumors, breast and nasopharynx cancer, vestibular schwannoma, mediastinal lymph nodes, cerebral and coronary arteries, white blood cells, retinal vessels, and mandibles. Across these 2D and 3D tasks, the paper reports state-of-the-art or competitive Dice scores relative to SAM, SAM 2, MedSAM, and specialist segmentation networks.
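To make the idea of a memory bank that curates embeddings more concrete, here is a toy sketch. It should not be read as the authors' implementation: the class name `SelfSortingMemoryBank`, the confidence threshold, the fixed capacity, and the cosine-similarity eviction rule are assumptions, representing one plausible reading of "most informative" as high confidence and low redundancy.

```python
import numpy as np


class SelfSortingMemoryBank:
    """Toy bank that keeps confident, mutually dissimilar embeddings (illustrative)."""

    def __init__(self, capacity=4, min_confidence=0.5):
        self.capacity = capacity              # assumed fixed bank size
        self.min_confidence = min_confidence  # assumed admission threshold
        self.entries = []                     # list of (confidence, unit vector)

    def add(self, embedding, confidence):
        """Try to store an embedding; return True if it was admitted."""
        if confidence < self.min_confidence:
            return False
        vec = np.asarray(embedding, dtype=np.float64).ravel()
        vec = vec / (np.linalg.norm(vec) + 1e-12)  # unit norm for cosine similarity
        self.entries.append((float(confidence), vec))
        while len(self.entries) > self.capacity:
            self._drop_most_redundant()
        # Keep the bank ordered by confidence rather than by arrival time.
        self.entries.sort(key=lambda entry: entry[0], reverse=True)
        return True

    def _drop_most_redundant(self):
        """Find the most similar pair and evict its less confident member."""
        vecs = np.stack([v for _, v in self.entries])
        sim = vecs @ vecs.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        drop = i if self.entries[i][0] < self.entries[j][0] else j
        self.entries.pop(int(drop))


bank = SelfSortingMemoryBank(capacity=2)
bank.add([1.0, 0.0], confidence=0.9)
bank.add([0.99, 0.1], confidence=0.6)  # near-duplicate of the first entry
bank.add([0.0, 1.0], confidence=0.8)   # dissimilar, so it displaces the duplicate
```

In this sketch the order in which frames arrive does not decide what is remembered, which is the property that matters when the "video" is an unordered set of images.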
MedSAM-2 targets researchers and clinicians who need to delineate anatomical structures, lesions, and tumors across diverse imaging modalities without training a bespoke model for each task. Its tracking-based design is well suited to volumetric annotation in radiology, where a clinician can prompt a single slice and propagate the contour through an entire CT or MRI scan, and its One-Prompt mode supports efficient batch labeling of large 2D image collections in pathology, ophthalmology, and microscopy. The model is most useful as an interactive annotation accelerator and a strong baseline for medical segmentation pipelines.
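The single-slice workflow can be sketched as a frame-visiting schedule. The helper below is hypothetical and assumes a bidirectional sweep (forward from the prompted slice to the end of the scan, then backward to the start); the source does not specify the traversal order MedSAM-2 uses.

```python
def propagation_order(num_slices, prompt_idx):
    """Return (forward, backward) slice indices for a bidirectional sweep.

    Hypothetical helper: the forward pass starts at the prompted slice,
    and the backward pass covers the slices before it in reverse order.
    """
    if not 0 <= prompt_idx < num_slices:
        raise ValueError("prompt_idx must be a valid slice index")
    forward = list(range(prompt_idx, num_slices))
    backward = list(range(prompt_idx - 1, -1, -1))
    return forward, backward


# Prompting slice 2 of a 5-slice scan.
forward, backward = propagation_order(5, 2)
```

Every slice is visited exactly once, so one prompt on a mid-volume slice can in principle reach both ends of the scan.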
By recasting medical segmentation as video tracking and adding a memory bank tailored to non-temporal medical data, MedSAM-2 demonstrated that SAM 2 can be adapted into a general-purpose, promptable tool for both 2D and 3D imaging. Released openly with code and weights, it became a widely cited reference point in the rapidly growing family of SAM-based medical segmentation models and helped popularize the volume-as-video framing. As a preprint built on a fast-moving foundation, its reported benchmarks should be read as a snapshot, and like other interactive SAM derivatives it still depends on user prompts and can struggle on modalities far from its fine-tuning data.
Zhu, J., et al. (2024) Medical SAM 2: Segment medical images as video via Segment Anything Model 2. arXiv.org.
DOI: 10.48550/arXiv.2408.00874