Beijing Academy of Artificial Intelligence
A promptable 3D foundation model for volumetric CT segmentation of 200+ anatomical categories using point, box, and text prompts.
SegVol is a universal, interactive foundation model for volumetric medical image segmentation, developed by researchers at the Beijing Academy of Artificial Intelligence (BAAI) and collaborators and introduced in November 2023. It tackles a long-standing bottleneck in 3D medical imaging: most segmentation models are trained for a single organ or task, requiring a new model and labeled dataset for every structure of interest. SegVol instead provides one model that can segment more than 200 anatomical categories across whole CT volumes, driven by user-supplied prompts.
The model's central novelty is its support for both spatial and semantic prompting. Users can specify a target with point or bounding-box prompts (in the style of the Segment Anything Model, SAM) or with free-text prompts naming an anatomical structure, and SegVol returns a 3D mask for the corresponding region. This makes it a 3D, text-aware analogue of interactive 2D segmentation models and fills a gap left by earlier promptable models such as SAM, which were designed for natural 2D images rather than volumetric medical scans.
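The three prompt types can be pictured as simple records. The sketch below is illustrative only: the class names `PointPrompt`, `BoxPrompt`, and `TextPrompt`, their fields, and the `is_spatial` helper are hypothetical and are not taken from the SegVol codebase.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical prompt records for illustration; not SegVol's actual API.

@dataclass(frozen=True)
class PointPrompt:
    zyx: Tuple[int, int, int]  # voxel index of the click
    positive: bool = True      # True = inside the target, False = background

@dataclass(frozen=True)
class BoxPrompt:
    zyx_min: Tuple[int, int, int]  # inclusive lower corner
    zyx_max: Tuple[int, int, int]  # inclusive upper corner

    def __post_init__(self) -> None:
        # Reject boxes whose lower corner lies beyond the upper corner.
        if any(lo > hi for lo, hi in zip(self.zyx_min, self.zyx_max)):
            raise ValueError("box min corner must not exceed max corner")

@dataclass(frozen=True)
class TextPrompt:
    category: str  # free-text anatomical name, e.g. "liver"

Prompt = Union[PointPrompt, BoxPrompt, TextPrompt]

def is_spatial(prompt: Prompt) -> bool:
    """Return True for point/box prompts, False for text prompts."""
    return isinstance(prompt, (PointPrompt, BoxPrompt))

prompts = [
    PointPrompt((40, 128, 130)),
    BoxPrompt((30, 100, 100), (60, 160, 170)),
    TextPrompt("liver"),
]
kinds = ["spatial" if is_spatial(p) else "semantic" for p in prompts]
print(kinds)
```

Keeping spatial and semantic prompts as distinct types makes it explicit which inputs carry coordinates in the volume and which carry only a category name.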
SegVol was accepted as a spotlight at NeurIPS 2024. By releasing pretrained weights, training code, and a large curated dataset, the authors positioned it as an accessible base model for radiology segmentation that researchers can apply zero-shot or fine-tune for specific clinical tasks.
SegVol couples a 3D Vision Transformer (ViT) image encoder with a SAM-style promptable segmentation decoder, plus a text encoder that maps anatomical names into the prompt space for semantic segmentation. The ViT encoder is pretrained with self-supervision for 2,000 epochs on about 90,000 unlabeled CT volumes, and the model is then trained with supervision on roughly 6,000 labeled volumes drawn from 25 public datasets aggregated into the M3D-Seg collection (5,772 3D images and 149,196 mask annotations spanning datasets such as BTCV, AMOS22, TotalSegmentator, KiTS, CHAOS, and the Medical Segmentation Decathlon).
On a benchmark of 22 anatomical segmentation tasks, SegVol outperforms competing methods on 19, with improvements of up to 37.24% in Dice score over the runner-up.
At inference time SegVol uses a zoom-out-zoom-in strategy: a coarse pass over a downsampled view of the volume localizes the target, and a finer pass then refines the mask within that region at the original resolution. This keeps whole-volume inference tractable while preserving fine boundary detail.
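As a rough illustration of the zoom-out-zoom-in idea, the sketch below first segments a downsampled copy of the volume to find a coarse region of interest, then re-segments only that region at full resolution. It is a toy under stated assumptions, not SegVol's implementation: `toy_segment` is a hypothetical stand-in for the network (a plain intensity threshold), strided slicing replaces proper resampling, and the `stride` and `margin` parameters are made up for the example. The `dice` helper is the standard overlap metric 2|A∩B| / (|A| + |B|).

```python
import numpy as np

def toy_segment(volume: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model: threshold voxel intensity."""
    return volume > 0.5

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice = 2|A∩B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def zoom_out_zoom_in(volume: np.ndarray, stride: int = 4, margin: int = 4) -> np.ndarray:
    # Zoom out: segment a coarse copy (every `stride`-th voxel along each axis).
    coarse = toy_segment(volume[::stride, ::stride, ::stride])
    full = np.zeros(volume.shape, dtype=bool)
    if not coarse.any():
        return full  # nothing found at the coarse scale

    # Map the coarse bounding box back to full-resolution indices, padded by `margin`.
    idx = np.argwhere(coarse)
    lo = np.maximum(idx.min(axis=0) * stride - margin, 0)
    hi = np.minimum((idx.max(axis=0) + 1) * stride + margin, volume.shape)

    # Zoom in: segment only the cropped region at full resolution.
    roi = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    full[roi] = toy_segment(volume[roi])
    return full

# Synthetic volume: a bright 20 x 24 x 28 block on a dark 64^3 background.
volume = np.zeros((64, 64, 64), dtype=np.float32)
truth = np.zeros(volume.shape, dtype=bool)
truth[10:30, 20:44, 30:58] = True
volume[truth] = 1.0

mask = zoom_out_zoom_in(volume)
print(dice(mask, truth))
```

The fine pass here touches only the padded bounding box rather than the whole volume, which is where the cost saving comes from; a real pipeline would run the network (typically with a sliding window) in place of the threshold.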
SegVol targets radiology and medical-imaging research where annotating 3D CT volumes is expensive and slow. Researchers can use it to generate masks for organs, tissues, and lesions across many anatomical targets from a single checkpoint, accelerating dataset annotation, organ measurement, and downstream pipelines such as treatment planning or quantitative imaging studies. The text-prompt interface is particularly useful for semantic segmentation when no manual click is available, and the spatial prompts support interactive correction for clinicians and annotators.
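For the organ-measurement use case, a predicted binary mask converts to a physical volume by multiplying the foreground voxel count by the volume of one voxel. The sketch below assumes the mask is a NumPy boolean array and that the voxel spacing in millimetres is known from the scan header; the function name `mask_volume_ml` and the example spacing are illustrative and not part of SegVol.

```python
import numpy as np

def mask_volume_ml(mask: np.ndarray, spacing_mm: tuple) -> float:
    """Volume of a binary mask in millilitres (1 mL = 1000 mm^3)."""
    voxel_mm3 = float(np.prod(spacing_mm))  # volume of a single voxel
    return int(np.count_nonzero(mask)) * voxel_mm3 / 1000.0

# Synthetic mask: 20 * 40 * 50 = 40,000 foreground voxels.
mask = np.zeros((50, 100, 100), dtype=bool)
mask[10:30, 20:60, 20:70] = True

# Assumed spacing: 2 mm slices, 1 mm x 1 mm in-plane.
print(mask_volume_ml(mask, spacing_mm=(2.0, 1.0, 1.0)))
```

The spacing must be given in the same axis order as the mask array, otherwise anisotropic scans will produce a wrong volume.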
SegVol helped establish promptable 3D foundation models as a practical direction for medical image segmentation, extending the SAM paradigm from 2D natural images to volumetric clinical scans while adding text-driven semantic control. Its NeurIPS 2024 spotlight, broad anatomical coverage, and fully open release of weights, code, and the M3D-Seg dataset have made it a widely referenced baseline and starting point for subsequent universal medical segmentation work. The main limitations are its focus on CT (rather than MRI or other modalities) and the usual caveat that prompt-driven outputs benefit from expert review before clinical use.
Du, Y., et al. (2023). SegVol: Universal and Interactive Volumetric Medical Image Segmentation. Neural Information Processing Systems.
DOI: 10.48550/arXiv.2311.13385