City University of Hong Kong / Johns Hopkins University / NVIDIA
A CLIP text-driven universal model for organ segmentation and tumor detection on abdominal CT, segmenting 25 organs and 6 tumor types with zero-shot extension to new categories.
The CLIP-Driven Universal Model addresses a long-standing fragmentation problem in medical image segmentation: most CT segmentation networks are trained on a single, partially-labeled dataset covering only a handful of organs, which forces researchers to maintain dozens of narrow, dataset-specific models. This work, presented at ICCV 2023 by researchers at City University of Hong Kong, Johns Hopkins University, and NVIDIA, instead assembles 14 public datasets into one training corpus and trains a single network that segments 25 organs and detects 6 tumor types across the abdomen.
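One common way to train a single network on partially-labeled datasets is to compute the loss only over the classes that a scan's source dataset actually annotates. The text above does not specify the loss, so the following is a minimal sketch under that assumption; the function name `masked_bce`, the class list, and the toy values are all hypothetical.

```python
import numpy as np

def masked_bce(probs, targets, annotated, eps=1e-7):
    """Binary cross-entropy averaged over annotated classes only.

    probs, targets: arrays of shape (n_classes, n_voxels)
    annotated:      boolean array of shape (n_classes,), True where the
                    source dataset provides labels for that class
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    per_class = -(targets * np.log(probs)
                  + (1.0 - targets) * np.log(1.0 - probs)).mean(axis=1)
    if not annotated.any():
        return 0.0
    # Classes without annotations contribute nothing to the loss.
    return float(per_class[annotated].mean())

# Hypothetical scan from a dataset that labels liver and spleen but not pancreas.
class_names = ["liver", "pancreas", "spleen"]
annotated = np.array([True, False, True])

targets = np.array([[1, 1, 0, 0],
                    [0, 0, 0, 0],   # placeholder: pancreas is unlabeled here
                    [0, 0, 1, 1]], dtype=float)
probs = np.array([[0.9, 0.8, 0.2, 0.1],
                  [0.5, 0.5, 0.5, 0.5],
                  [0.1, 0.2, 0.7, 0.9]])

loss = masked_bce(probs, targets, annotated)

# Changing the prediction for the unlabeled class leaves the loss unchanged.
probs_changed = probs.copy()
probs_changed[1] = 0.99
loss_changed = masked_bce(probs_changed, targets, annotated)
```

With this masking, a missing pancreas label is treated as "unknown" rather than "background", so the network is not penalized for segmenting a structure that a given dataset simply did not annotate.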
The central innovation is replacing the conventional one-hot label encoding—which treats every anatomical class as independent and orthogonal—with text embeddings produced by CLIP (Contrastive Language-Image Pre-training). By encoding class names through CLIP's language model, the network captures the semantic and anatomical relationships between structures (for example, that a liver tumor is a substructure of the liver), yielding a more structured feature space and a label-efficient way to learn from datasets that each annotate only some organs.
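The contrast between one-hot labels and text embeddings can be illustrated with a small sketch. The 4-dimensional vectors below are hand-made stand-ins chosen for illustration, not real CLIP outputs (actual CLIP text embeddings have hundreds of dimensions and come from the pretrained text encoder); the helper name is also hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

class_names = ["liver", "liver tumor", "right kidney"]

# One-hot encoding: every pair of distinct classes is orthogonal,
# so the encoding carries no notion of anatomical relatedness.
one_hot = np.eye(len(class_names))

# Hypothetical stand-ins for text embeddings (NOT real CLIP outputs).
toy_text_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],   # liver
    [0.8, 0.1, 0.5, 0.2],   # liver tumor
    [0.1, 0.9, 0.0, 0.2],   # right kidney
])

one_hot_sim = cosine_similarity_matrix(one_hot)
text_sim = cosine_similarity_matrix(toy_text_embeddings)

print("one-hot  liver vs liver tumor :", one_hot_sim[0, 1])
print("embedded liver vs liver tumor :", round(float(text_sim[0, 1]), 3))
print("embedded liver vs right kidney:", round(float(text_sim[0, 2]), 3))
```

In the one-hot case every off-diagonal similarity is zero, whereas the embedded classes can place "liver" closer to "liver tumor" than to "right kidney". That kind of relational structure is what a learned text encoder is meant to supply.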
Because new classes are introduced through text prompts rather than new output heads, the model is extensible: additional organs or tumor types can be incorporated without retraining from scratch, and the framework supports zero-shot recognition of categories described in natural language. This positions it as a foundation-style backbone for abdominal CT analysis rather than a one-off task model.
The architecture pairs a 3D segmentation backbone, released in both a lightweight U-Net variant (~19M parameters) and a Swin UNETR transformer variant (~62M parameters), with a CLIP text branch. Class-name prompts are passed through the frozen CLIP text encoder, and the resulting embeddings condition a text-driven controller that generates the parameters of the segmentation head, allowing a fixed backbone to produce masks for an open vocabulary of classes. The model was trained on 3,410 CT scans drawn from the 14 assembled datasets and evaluated on 6,162 external CT scans from three additional cohorts. It ranked first on the Medical Segmentation Decathlon (MSD) public leaderboard and achieved state-of-the-art performance on the Beyond The Cranial Vault (BTCV) multi-organ benchmark. Pretrained weights for both backbones are publicly released, and the code is distributed under a CC BY-NC-ND 4.0 license (attribution, non-commercial, no derivatives).
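The text-driven controller described above can be sketched in NumPy as follows. This is a minimal illustration, not the released implementation: the dimensions, the single linear layer standing in for the controller, the two-layer 1x1x1 convolutional head, and the random vectors standing in for frozen CLIP text embeddings and backbone features are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the published configuration).
TEXT_DIM, FEAT_CH, HIDDEN = 16, 4, 4

# Head parameters: conv1 (FEAT_CH -> HIDDEN) and conv2 (HIDDEN -> 1), with biases.
N_PARAMS = FEAT_CH * HIDDEN + HIDDEN + HIDDEN + 1

# Controller: one linear layer here, standing in for a small MLP.
W_ctrl = rng.normal(scale=0.1, size=(N_PARAMS, TEXT_DIM + FEAT_CH))
b_ctrl = np.zeros(N_PARAMS)

def controller(text_emb, global_img_feat):
    """Map a (text embedding, global image feature) pair to head parameters."""
    return W_ctrl @ np.concatenate([text_emb, global_img_feat]) + b_ctrl

def dynamic_head(features, params):
    """Apply a class-specific two-layer 1x1x1 conv head.

    features: (FEAT_CH, D, H, W). A 1x1x1 convolution is a per-voxel
    linear map over channels, so einsum is enough for this sketch.
    """
    i = 0
    w1 = params[i:i + FEAT_CH * HIDDEN].reshape(HIDDEN, FEAT_CH)
    i += FEAT_CH * HIDDEN
    b1 = params[i:i + HIDDEN]
    i += HIDDEN
    w2 = params[i:i + HIDDEN].reshape(1, HIDDEN)
    i += HIDDEN
    b2 = params[i:i + 1]
    h = np.einsum("oc,cdhw->odhw", w1, features) + b1[:, None, None, None]
    h = np.maximum(h, 0.0)  # ReLU
    logits = np.einsum("oc,cdhw->odhw", w2, h) + b2[:, None, None, None]
    return 1.0 / (1.0 + np.exp(-logits[0]))  # per-voxel foreground probability

# Stand-ins for backbone features and a globally pooled image feature.
features = rng.normal(size=(FEAT_CH, 2, 3, 3))
global_feat = features.mean(axis=(1, 2, 3))

# Stand-ins for frozen text-encoder outputs, one vector per class prompt.
text_embeddings = {name: rng.normal(size=TEXT_DIM) for name in ["liver", "pancreas"]}

masks = {name: dynamic_head(features, controller(emb, global_feat))
         for name, emb in text_embeddings.items()}

# Adding a class only requires one more text embedding, not a new output head.
text_embeddings["pancreatic tumor"] = rng.normal(size=TEXT_DIM)
masks["pancreatic tumor"] = dynamic_head(
    features, controller(text_embeddings["pancreatic tumor"], global_feat))
```

Because the head weights are generated per class from the prompt, the number of trainable output layers does not grow with the number of classes. Each class yields its own binary mask from the same backbone features.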
The model targets radiology and oncology workflows that require consistent multi-organ segmentation and tumor detection from abdominal CT, including treatment planning, tumor burden quantification, large-scale population imaging studies, and the automated generation of organ annotations for downstream research datasets. Its extensibility makes it useful as a base model that hospitals and research groups can adapt to new organs or pathologies with limited additional labeling, while the released U-Net and Swin UNETR checkpoints lower the barrier to deployment for groups without the resources to assemble large multi-dataset training corpora themselves.
By demonstrating that CLIP text embeddings can unify partially-labeled medical datasets into a single high-performing segmentation model, this work helped popularize language-driven, universal approaches to medical image analysis and shifted attention away from siloed, dataset-specific networks. Its top MSD leaderboard ranking and state-of-the-art BTCV results made it a widely cited reference point, and the authors later extended the approach into universal, extensible language-vision models for abdominal CT. Principal limitations are its non-commercial license, its focus on CT of the abdomen (rather than other modalities or anatomical regions), and the dependence of zero-shot performance on how well novel structures are described by text prompts.
Liu, J., et al. (2023) CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. IEEE International Conference on Computer Vision.
DOI: 10.48550/arXiv.2301.00785 (arXiv preprint)
DOI: 10.1109/ICCV51070.2023.01934 (ICCV proceedings)