CLIP-Driven Universal Model

City University of Hong Kong / Johns Hopkins University / NVIDIA

Abdominal CT segmentation model driven by CLIP text embeddings, covering 25 organs and 6 tumor types with zero-shot extension to new categories.

Released: October 2023

The CLIP-Driven Universal Model addresses a long-standing fragmentation problem in medical image segmentation. Most CT segmentation networks are trained on a single, partially labeled dataset covering only a handful of organs, which forces researchers to maintain dozens of narrow, dataset-specific models. This work, presented at ICCV 2023 by researchers at City University of Hong Kong, Johns Hopkins University, and NVIDIA, instead assembles 14 public datasets into one training corpus and trains a single network that segments 25 organs and detects 6 tumor types across the abdomen.

The central innovation is replacing the conventional one-hot label encoding, which treats every anatomical class as independent and orthogonal, with text embeddings produced by CLIP (Contrastive Language-Image Pre-training). Encoding class names through CLIP's text encoder lets the network capture semantic and anatomical relationships between structures (for example, that a liver tumor is a substructure of the liver). The result is a more structured feature space and a label-efficient way to learn from datasets that each annotate only some organs.
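The contrast between one-hot labels and text embeddings can be illustrated with a small sketch. The vectors below are hand-picked toy values, not real CLIP outputs, and the names `toy_text` and `similarity_table` are hypothetical. The point is only that one-hot codes make every pair of classes orthogonal, whereas dense embeddings can place related classes close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

classes = ["liver", "liver tumor", "spleen"]

# One-hot labels: each class gets its own axis, so every pair is orthogonal.
one_hot = {name: [1.0 if i == j else 0.0 for j in range(len(classes))]
           for i, name in enumerate(classes)}

# Toy stand-ins for text embeddings (assumed values, NOT real CLIP vectors),
# chosen so that "liver" and "liver tumor" point in similar directions.
toy_text = {
    "liver":       [0.9, 0.1, 0.2],
    "liver tumor": [0.8, 0.5, 0.1],
    "spleen":      [0.1, 0.2, 0.9],
}

def similarity_table(embeddings):
    """Cosine similarity for every unordered pair of class names."""
    names = list(embeddings)
    return {(a, b): cosine(embeddings[a], embeddings[b])
            for a in names for b in names if a < b}

one_hot_sims = similarity_table(one_hot)
toy_sims = similarity_table(toy_text)

for pair in toy_sims:
    print(pair, "one-hot:", one_hot_sims[pair], "toy text:", round(toy_sims[pair], 3))
```

With these toy values, every one-hot pair has similarity zero, while the toy "liver" and "liver tumor" vectors are more similar to each other than "liver" is to "spleen". In the actual model the embeddings would come from the CLIP text encoder rather than being set by hand.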

Because new classes are introduced through text prompts rather than new output heads, the model is extensible: additional organs or tumor types can be incorporated without retraining from scratch, and the framework supports zero-shot recognition of categories described in natural language. This positions it as a foundation-style backbone for abdominal CT analysis rather than a one-off task model.

Key Features

CLIP text-driven labels: Class names are embedded with CLIP's language encoder, so anatomical relationships inform the segmentation decoder instead of treating each organ as an independent one-hot class.
Universal multi-dataset training: A single model is trained across 14 assembled datasets that are individually only partially labeled, unifying organ and tumor annotations into one network.
Broad coverage: Segments 25 organs and detects 6 tumor types (liver, kidney, pancreas, lung, hepatic vessel, and colon tumors) from abdominal CT.
Zero-shot extensibility: New categories can be added via text prompts without architectural changes, enabling generalization to organs and pathologies beyond the original label set.
Efficient inference: Roughly 6x faster than running an ensemble of dataset-specific models, since one forward pass covers all targets.
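Training on partially labeled datasets requires that classes a dataset does not annotate contribute no supervision. A minimal sketch of that idea follows, assuming a per-class binary cross-entropy loss and reducing each class to a single scalar probability for brevity (a real implementation would operate on voxel-wise tensors). The function name `masked_bce` and the example values are hypothetical.

```python
import math

def masked_bce(probs, targets, annotated):
    """Mean binary cross-entropy over annotated classes only.

    probs     : dict class -> predicted foreground probability (0 < p < 1)
    targets   : dict class -> 0/1 ground truth for the annotated classes
    annotated : set of classes the source dataset actually labels
    """
    losses = []
    for name, p in probs.items():
        if name not in annotated:
            continue  # unlabeled in this dataset: no loss term, hence no gradient
        y = targets[name]
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses) if losses else 0.0

# A sample from a hypothetical liver-only dataset: spleen and pancreas
# predictions exist but are ignored because no ground truth is available.
probs = {"liver": 0.9, "spleen": 0.2, "pancreas": 0.7}
loss = masked_bce(probs, targets={"liver": 1}, annotated={"liver"})
print(round(loss, 4))
```

Because unannotated classes are skipped rather than treated as background, a missing spleen label in a liver-only dataset is not mistaken for "no spleen present".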

Technical Details

The architecture pairs a 3D segmentation backbone with a CLIP text branch. The backbone is released in two variants: a lightweight U-Net (~19M parameters) and a Swin UNETR transformer (~62M parameters). Class-name prompts are passed through the frozen CLIP text encoder, and the resulting embeddings condition a text-driven controller that generates the parameters of the segmentation head, so a fixed backbone can produce masks for an open vocabulary of classes.

The model was trained on 3,410 CT scans drawn from the 14 assembled datasets and evaluated on 6,162 external CT scans from three additional cohorts. It ranked first on the Medical Segmentation Decathlon (MSD) public leaderboard and achieved state-of-the-art performance on the Beyond The Cranial Vault (BTCV) multi-organ benchmark. Pretrained weights for both backbones are publicly released, and the code is distributed under a CC BY-NC-ND 4.0 (non-commercial, no-derivatives) license.
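The text-driven controller described above can be sketched in a few lines. This is a deliberately simplified illustration, not the released implementation: it uses an untrained, randomly initialized linear controller, toy three-dimensional "text embeddings", and a single-layer head, all of which are assumptions made for brevity. The names `make_controller` and `segment` are hypothetical.

```python
import math
import random

def linear(weights, bias, x):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def make_controller(text_dim, feat_dim, seed=0):
    """Random (untrained) controller mapping a text embedding to head parameters."""
    rng = random.Random(seed)
    n_params = feat_dim + 1  # one weight per feature channel, plus a bias
    weights = [[rng.uniform(-1, 1) for _ in range(text_dim)] for _ in range(n_params)]
    bias = [0.0] * n_params
    return weights, bias

def segment(voxel_features, text_embedding, controller):
    """Apply a per-class head whose parameters are generated from the text embedding."""
    params = linear(*controller, text_embedding)
    head_w, head_b = params[:-1], params[-1]
    # A 1x1x1 convolution is the same linear map applied independently at every voxel.
    logits = [sum(w * f for w, f in zip(head_w, feats)) + head_b
              for feats in voxel_features]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # foreground probability

controller = make_controller(text_dim=3, feat_dim=4)
voxels = [[0.2, -0.1, 0.5, 0.3], [0.9, 0.4, -0.7, 0.0]]  # 2 voxels, 4 channels each
liver_prompt = [0.9, 0.1, 0.2]   # toy embedding, not a real CLIP vector
spleen_prompt = [0.1, 0.2, 0.9]  # toy embedding, not a real CLIP vector

liver_probs = segment(voxels, liver_prompt, controller)
spleen_probs = segment(voxels, spleen_prompt, controller)
print(liver_probs, spleen_probs)
```

The backbone features (`voxels`) are computed once and reused, while each class prompt yields its own head parameters. Adding a class therefore means supplying another embedding rather than adding an output channel, which is the property the extensibility claim relies on.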

Applications

The model targets radiology and oncology workflows that require consistent multi-organ segmentation and tumor detection from abdominal CT, including treatment planning, tumor burden quantification, large-scale population imaging studies, and the automated generation of organ annotations for downstream research datasets. Its extensibility makes it useful as a base model that hospitals and research groups can adapt to new organs or pathologies with limited additional labeling, while the released U-Net and Swin UNETR checkpoints lower the barrier to deployment for groups without the resources to assemble large multi-dataset training corpora themselves.

Impact

By demonstrating that CLIP text embeddings can unify partially labeled medical datasets into a single high-performing segmentation model, this work helped popularize language-driven, universal approaches to medical image analysis and shifted attention away from siloed, dataset-specific networks. Its top MSD leaderboard ranking and state-of-the-art BTCV results made it a widely cited reference point, and the authors later extended the approach into universal, extensible language-vision models for abdominal CT. Its principal limitations are the non-commercial license, the focus on abdominal CT (rather than other modalities or anatomical regions), and the dependence of zero-shot performance on how well novel structures are described by text prompts.

Citations

CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection

Preprint

Liu, J., et al. (2023) CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. arXiv preprint arXiv:2301.00785.

DOI: 10.48550/arXiv.2301.00785

CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection

Liu, J., et al. (2023) CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. IEEE/CVF International Conference on Computer Vision (ICCV).

DOI: 10.1109/ICCV51070.2023.01934

Recent citations

Papers that recently cited this model.

VL-Grading: A vision–language distillation framework for preoperative grading of clear cell renal cell carcinoma from multi-phase contrast-enhanced ultrasound
Min Zou, Yuhang Liu, Jiang Shang, et al.
Biomedical Signal Processing and Control · 2026
Citations: 0

Foundation Models vs. Radiomics for Lung Computed Tomography: A Benchmark of Feature Extractors, Classification Heads, and Segmentation Choices
Nils Neukirch, Martin H. Maurer, Nils Strodthoff
Jul 2026
Citations: 0

APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms
Juntao Jiang, Jin-Feng Bai, Linxuan Fan, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Medical Image Analysis
Zongwei Zhou, V. Sodha, Jiaxuan Pang, et al.
Citations: 458

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
Qihang Zhou, Guansong Pang, Yu Tian, et al.
International Conference on Learning Representations · Oct 2023
Citations: 403

SAM-Med3D: Towards General-Purpose Segmentation Models for Volumetric Medical Images
Haoyu Wang, Sizheng Guo, Jin Ye, et al.
ECCV Workshops · Oct 2023
Citations: 173

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024
Citations: 134

MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images
A. Diaz-Pinto, Sachidanand Alle, Alvin Ihsani, et al.
Medical Image Analysis · Mar 2022
Citations: 130

Citations

Total Citations: 356
Influential: 42
References: 120

GitHub

Stars: 677
Forks: 78
Open Issues: 15
Contributors: 7
Last Push: 9mo ago
Language: Python

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 1y ago

Fields of citing research

Computer Science: 99%
Medicine: 87%
Engineering: 33%
Environmental Science: 2%
Physics: 1%
Biology: 1%
Geography: 0%
Geology: 0%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Overall: 26 (Closed)
Usability (can I run it?): 21
Reproducibility (can I retrain it?): 15

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
