bio.rodeo

VoCo

Hong Kong University of Science and Technology

A volume contrastive self-supervised framework that pretrains 3D medical image encoders by predicting the anatomical position of sub-volumes within CT scans.

Released: February 2024

VoCo (Volume Contrast) is a self-supervised learning framework for pretraining 3D medical image encoders, designed to reduce the heavy annotation burden that limits volumetric tasks such as CT segmentation. Most self-supervised methods for medical imaging adapt natural-image objectives — masked reconstruction or instance discrimination — that ignore a property unique to clinical scans: human anatomy is highly structured, and organs occupy consistent relative positions across patients. VoCo turns this anatomical regularity into a free supervision signal.

The core idea is a contextual position-prediction task. From a 3D volume, the framework extracts a set of non-overlapping "base" crops that tile distinct anatomical regions and are encouraged to be mutually discriminative in feature space. A randomly sampled sub-volume is then matched against these base crops by contrastive similarity, and the model learns to predict the proportional overlap — effectively asking "where in the body does this patch belong?" Solving this task forces the encoder to internalize organ positions and geometric relationships without any manual labels.
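
The position-prediction task described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, the regular cubic grid of base crops, the clamping of cosine similarity to [0, 1], and the exact loss form are all assumptions made for the example.

```python
import itertools
import math


def overlap_targets(crop_start, crop_size, grid_shape):
    """Fraction of a random crop's volume that falls inside each base crop.

    Assumption: base crops are cubes of side `crop_size` tiling the volume
    as a regular grid of shape `grid_shape`, and the random crop has the
    same size as a base crop.
    """
    targets = []
    for idx in itertools.product(*(range(n) for n in grid_shape)):
        fraction = 1.0
        for start, cell in zip(crop_start, idx):
            base_start = cell * crop_size
            lo = max(start, base_start)
            hi = min(start + crop_size, base_start + crop_size)
            # Overlap length along this axis, as a fraction of the crop side.
            fraction *= max(hi - lo, 0) / crop_size
        targets.append(fraction)
    return targets


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def position_prediction_loss(crop_feat, base_feats, targets, eps=1e-6):
    """Penalize the gap between predicted similarity and true overlap.

    Assumed form: mean of -log(1 - |target - similarity|), with the
    similarity clamped to [0, 1] so it is comparable to a proportion.
    """
    total = 0.0
    for base, target in zip(base_feats, targets):
        similarity = min(max(cosine(crop_feat, base), 0.0), 1.0)
        total -= math.log(max(1.0 - abs(target - similarity), eps))
    return total / len(targets)


# A 64-voxel crop placed at (32, 16, 0) over a 2 x 2 x 1 grid of base crops.
print(overlap_targets((32, 16, 0), 64, (2, 2, 1)))
# -> [0.375, 0.125, 0.375, 0.125]
```

When the random crop lies fully inside the tiled region, the targets sum to 1, so they can be read as a soft assignment of the crop to anatomical zones.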

VoCo was introduced by Linshan Wu, Jiaxin Zhuang, and Hao Chen at the Hong Kong University of Science and Technology and published at CVPR 2024. The authors released pretrained backbones and the curated pretraining datasets, and have since extended the approach to a substantially larger-scale follow-up for general-purpose 3D medical representation learning.

Key Features

  • Position-aware pretext task: Instead of generic masking, VoCo predicts the anatomical region a sub-volume belongs to, directly encoding organ-position priors that transfer well to downstream localization and segmentation.
  • Contrastive base-crop assignment: Base crops are enforced to be feature-discriminative, providing stable contrastive targets that represent distinct anatomical zones within each scan.
  • Label-free pretraining: The entire objective is derived from spatial geometry, so large unlabeled CT collections can be used without annotation.
  • Released pretrained backbones: Apache-2.0 checkpoints are provided for models pretrained on 10k and 160k CT volumes, ready for fine-tuning on new tasks.
  • Open pretraining corpus: The aggregated 10k-volume dataset, assembled from public collections, is published on Hugging Face for reproducibility.
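
The "feature-discriminative base crops" idea in the list above can be illustrated with a small regularizer that penalizes similarity between the features of different base crops. This is a hedged sketch under assumptions: the function name `base_crop_regularizer`, the use of absolute cosine similarity, and the log-based penalty are choices made for the example and are not confirmed details of the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def base_crop_regularizer(base_feats, eps=1e-6):
    """Mean penalty over all ordered pairs of distinct base crops.

    Assumed form: -log(1 - |cosine similarity|), which is zero for
    orthogonal features and grows as two base crops become similar.
    Requires at least two base crops.
    """
    total = 0.0
    pairs = 0
    for i, feat_i in enumerate(base_feats):
        for j, feat_j in enumerate(base_feats):
            if i == j:
                continue
            similarity = abs(cosine(feat_i, feat_j))
            # Clamp to eps so near-identical features give a finite penalty.
            total -= math.log(max(1.0 - similarity, eps))
            pairs += 1
    return total / pairs


# Orthogonal features incur no penalty; identical features incur a large one.
print(base_crop_regularizer([[1.0, 0.0], [0.0, 1.0]]))
print(base_crop_regularizer([[1.0, 1.0], [1.0, 1.0]]))
```

Keeping base-crop features far apart gives the position-prediction task distinct targets to match against, which is the stabilizing role the list attributes to the base crops.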

Technical Details

VoCo uses a Swin Transformer encoder in a SwinUNETR-style architecture, a widely used backbone for 3D volumetric medical analysis. Pretraining data is aggregated from open-source CT collections (including BTCV, TCIA-COVID-19, LUNA16, STOIC21, TotalSegmentator, FLARE23, LIDC, and HNSCC) to form a 10k-volume corpus; a model pretrained on a larger 160k-volume collection is also released. The released checkpoints are distributed under Apache-2.0. In the CVPR 2024 evaluation, VoCo was assessed across six downstream 3D medical tasks (covering segmentation and classification) and outperformed prior self-supervised approaches, including masked-image-modeling and instance-discrimination baselines, indicating that explicit position supervision yields stronger transfer than position-agnostic pretext tasks.

Applications

VoCo targets researchers and clinical-AI developers building 3D medical image models where labeled volumes are scarce and expensive. Its pretrained encoders serve as a strong initialization for organ and tumor segmentation, lesion detection, and volumetric classification, shortening training time and improving accuracy when fine-tuning on small task-specific datasets. Because the pretext task captures anatomy, it is especially well suited to whole-body and multi-organ CT workflows, and the released datasets make it a practical starting point for benchmarking new self-supervised methods.

Impact

VoCo demonstrated that domain-specific pretext design — exploiting the spatial regularity of human anatomy — can outperform generic self-supervised objectives borrowed from natural images, an influential point for the volumetric medical imaging community. The open release of pretrained weights and a curated multi-source CT corpus lowered the barrier to entry for 3D medical pretraining and has been adopted as a baseline in subsequent work. The authors' large-scale extension, scaling to far more volumes and tasks, builds directly on this framework. The approach is specialized for CT-style volumetric data, and transfer to other modalities such as MRI or to non-axial anatomy may require adaptation of the position-prediction scheme.

Citations

Preprint: Wu, L., et al. (2024). VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis. arXiv. DOI: 10.48550/arXiv.2402.17300

Conference paper: Wu, L., et al. (2024). VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis. Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR52733.2024.02158

Citations

Total Citations: 109
Influential: 19
References: 74

GitHub

Stars: 228
Forks: 17
Open Issues: 3
Contributors: 1
Last Push: 6mo ago
Language: Python
License: Apache-2.0

Openness

bio.rodeo openness: 69, Partial (scale label: Fully open · usable and reproducible)
Usability (can I run it?): 77
Reproducibility (can I retrain it?): 61
Model Openness Framework: Unclassified (missing required components)

Tags

contrastive_learning, ct, foundation_model, radiology, representation_learning, segmentation, self_supervised, self_supervised_pretraining, swin_transformer, vision_transformer

Resources

GitHub Repository, Research Paper, Dataset