bio.rodeo

CLIP-Driven Universal Model

City University of Hong Kong / Johns Hopkins University / NVIDIA

A CLIP text-driven universal model for organ segmentation and tumor detection on abdominal CT, segmenting 25 organs and 6 tumor types with zero-shot extension to new categories.

Released: October 2023

The CLIP-Driven Universal Model addresses a long-standing fragmentation problem in medical image segmentation: most CT segmentation networks are trained on a single, partially-labeled dataset covering only a handful of organs, which forces researchers to maintain dozens of narrow, dataset-specific models. This work, presented at ICCV 2023 by researchers at City University of Hong Kong, Johns Hopkins University, and NVIDIA, instead assembles 14 public datasets into one training corpus and trains a single network that segments 25 organs and detects 6 tumor types across the abdomen.

The central innovation is replacing the conventional one-hot label encoding, which treats every anatomical class as independent and orthogonal, with text embeddings produced by CLIP (Contrastive Language-Image Pre-training). By encoding class names through CLIP's text encoder, the network captures the semantic and anatomical relationships between structures (for example, that a liver tumor is a substructure of the liver). This yields a more structured feature space and a label-efficient way to learn from datasets that each annotate only some organs.
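
The contrast between one-hot labels and text embeddings can be illustrated with a small Python sketch. The vectors in `toy_text` are hand-picked stand-ins chosen for illustration, not real CLIP outputs, and the helper names are hypothetical; the point is only that one-hot codes carry no notion of similarity, while an embedding space can place "liver" and "liver tumor" close together.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

classes = ["liver", "liver tumor", "kidney"]

# One-hot encoding: one axis per class, so every pair is orthogonal.
one_hot = {
    name: [1.0 if i == j else 0.0 for j in range(len(classes))]
    for i, name in enumerate(classes)
}

# ASSUMPTION: toy vectors standing in for text-encoder outputs.
# They are made up for illustration and are NOT real CLIP embeddings.
toy_text = {
    "liver":       [0.9, 0.1, 0.0, 0.2],
    "liver tumor": [0.8, 0.1, 0.5, 0.2],
    "kidney":      [0.1, 0.9, 0.0, 0.2],
}

def similarity_table(embeddings):
    """Pairwise cosine similarity for every pair of class names."""
    return {
        (a, b): round(cosine(embeddings[a], embeddings[b]), 3)
        for a, b in combinations(classes, 2)
    }

if __name__ == "__main__":
    print("one-hot:", similarity_table(one_hot))
    print("toy text:", similarity_table(toy_text))
```

With one-hot codes every pairwise similarity is zero, so the network receives no hint that two classes are related. In the toy embedding space the liver and liver tumor vectors are more similar to each other than either is to the kidney vector, which is the kind of structure the text branch is meant to supply.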

Because new classes are introduced through text prompts rather than new output heads, the model is extensible: additional organs or tumor types can be incorporated without retraining from scratch, and the framework supports zero-shot recognition of categories described in natural language. This positions it as a foundation-style backbone for abdominal CT analysis rather than a one-off task model.

#Key Features

  • CLIP text-driven labels: Class names are embedded with CLIP's text encoder, so anatomical relationships inform the segmentation decoder rather than each organ being treated as an independent one-hot class.
  • Universal multi-dataset training: A single model is trained across 14 assembled datasets that are individually only partially labeled, unifying organ and tumor annotations into one network.
  • Broad coverage: Segments 25 organs and detects 6 tumor types (liver, kidney, pancreas, lung, hepatic vessel, and colon tumors) from abdominal CT.
  • Zero-shot extensibility: New categories can be added via text prompts without architectural changes, enabling generalization to organs and pathologies beyond the original label set.
  • Efficient inference: Roughly 6x faster than running an ensemble of dataset-specific models, since one forward pass covers all targets.
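
Training on partially labeled datasets, as described in the list above, requires some way to avoid penalizing the model for classes a given dataset never annotates. One common approach is to mask the loss so that only annotated classes contribute. The Python sketch below shows this for a single voxel; the function name, the per-voxel simplification, and the example numbers are assumptions for illustration, not the released training code.

```python
import math

def masked_bce(probs, targets, annotated):
    """Mean binary cross-entropy over annotated classes only.

    probs:     class name -> predicted probability for one voxel
    targets:   class name -> 0/1 label (unlabeled classes default to 0)
    annotated: set of class names the source dataset actually labels
    """
    if not annotated:
        raise ValueError("at least one annotated class is required")
    eps = 1e-7  # keeps log() away from 0
    losses = []
    for name in annotated:
        p = min(max(probs[name], eps), 1.0 - eps)
        y = targets[name]
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

# Hypothetical voxel from a dataset that annotates only the liver.
# The model (correctly) also fires on kidney, but this dataset has no
# kidney labels, so its target is 0 by default.
probs = {"liver": 0.8, "kidney": 0.9, "pancreas": 0.1}
targets = {"liver": 1, "kidney": 0, "pancreas": 0}

masked = masked_bce(probs, targets, annotated={"liver"})
naive = masked_bce(probs, targets, annotated=set(probs))

if __name__ == "__main__":
    print(f"masked loss: {masked:.4f}")
    print(f"naive loss:  {naive:.4f}")
```

The naive loss treats the missing kidney label as a true negative and punishes a confident kidney prediction, while the masked loss ignores classes outside the dataset's label set.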

#Technical Details

The architecture pairs a 3D segmentation backbone with a CLIP text branch. The backbone is released in two variants: a lightweight U-Net (~19M parameters) and a Swin UNETR transformer (~62M parameters). Class-name prompts are passed through the frozen CLIP text encoder, and the resulting embeddings condition a text-driven controller that generates the parameters of the segmentation head, allowing a fixed backbone to produce masks for an open vocabulary of classes.

The model was trained on 3,410 CT scans drawn from the 14 assembled datasets and evaluated on 6,162 external CT scans from three additional cohorts. It ranked first on the Medical Segmentation Decathlon (MSD) public leaderboard and achieved state-of-the-art performance on the Beyond The Cranial Vault (BTCV) multi-organ benchmark.

Pretrained weights for both backbones are publicly released, and the code is distributed under a CC BY-NC-ND 4.0 (non-commercial, no-derivatives) license.
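
The idea of a text-driven controller that generates segmentation-head parameters can be sketched in a few lines of plain Python. Everything here is a simplifying assumption: the controller is a single random linear map, the head is one per-voxel linear layer followed by a sigmoid, the dimensions are toy sizes, and the prompt vectors are made-up stand-ins for frozen text-encoder outputs. The released model is considerably more elaborate; this only shows why adding a class needs a new prompt rather than a new output head.

```python
import math
import random

FEATURE_DIM = 4  # channels of a per-voxel backbone feature (toy size)
TEXT_DIM = 3     # length of a text embedding (toy size)

rng = random.Random(0)  # seeded so the sketch is reproducible

# Hypothetical controller: a linear map from a text embedding to the
# FEATURE_DIM weights plus 1 bias of a per-class head.
controller = [
    [rng.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)]
    for _ in range(FEATURE_DIM + 1)
]

def head_params(text_embedding):
    """Generate (weights, bias) of the class-specific head from text."""
    out = [sum(w * t for w, t in zip(row, text_embedding)) for row in controller]
    return out[:-1], out[-1]

def predict(voxel_feature, text_embedding):
    """Probability that the voxel belongs to the prompted class."""
    weights, bias = head_params(text_embedding)
    logit = sum(w * f for w, f in zip(weights, voxel_feature)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def segment_voxel(voxel_feature, prompts):
    """One prediction per prompt; the backbone feature is shared."""
    return {name: predict(voxel_feature, emb) for name, emb in prompts.items()}

# ASSUMPTION: toy stand-ins for frozen text-encoder outputs.
prompts = {
    "liver":  [0.9, 0.1, 0.0],
    "kidney": [0.1, 0.9, 0.0],
}
voxel = [0.5, -0.2, 0.3, 0.1]  # made-up backbone feature for one voxel

before = segment_voxel(voxel, prompts)

# Adding a class is just adding a prompt; the controller and the
# backbone feature are untouched.
prompts["spleen"] = [0.2, 0.2, 0.8]
after = segment_voxel(voxel, prompts)

if __name__ == "__main__":
    print(before)
    print(after)
```

Because the head parameters are a function of the prompt, the set of output classes is determined at call time by the prompt dictionary, and predictions for existing classes do not change when a new prompt is added.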

#Applications

The model targets radiology and oncology workflows that require consistent multi-organ segmentation and tumor detection from abdominal CT, including treatment planning, tumor burden quantification, large-scale population imaging studies, and the automated generation of organ annotations for downstream research datasets. Its extensibility makes it useful as a base model that hospitals and research groups can adapt to new organs or pathologies with limited additional labeling, while the released U-Net and Swin UNETR checkpoints lower the barrier to deployment for groups without the resources to assemble large multi-dataset training corpora themselves.

#Impact

By demonstrating that CLIP text embeddings can unify partially-labeled medical datasets into a single high-performing segmentation model, this work helped popularize language-driven, universal approaches to medical image analysis and shifted attention away from siloed, dataset-specific networks. Its top MSD leaderboard ranking and state-of-the-art BTCV results made it a widely cited reference point, and the authors later extended the approach into universal, extensible language-vision models for abdominal CT. Principal limitations are its non-commercial license, its focus on CT of the abdomen (rather than other modalities or anatomical regions), and the dependence of zero-shot performance on how well novel structures are described by text prompts.

Citations

Liu, J., et al. (2023) CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. arXiv preprint. DOI: 10.48550/arXiv.2301.00785

Liu, J., et al. (2023) CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. IEEE/CVF International Conference on Computer Vision (ICCV). DOI: 10.1109/ICCV51070.2023.01934

Citations

Total Citations: 350
Influential: 43
References: 120

GitHub

Stars: 676
Forks: 79
Open Issues: 15
Contributors: 7
Last Push: 7mo ago
Language: Python

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 1y ago

Openness

bio.rodeo openness: 26 (Closed · low usability and reproducibility)
Usability (can I run it?): 21
Reproducibility (can I retrain it?): 15
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

cnn, ct_imaging, multimodal, radiology, segmentation, transfer_learning, transformer, tumor_detection, zero_shot

Resources

GitHub Repository · Research Paper · HuggingFace Model