VoCo

Hong Kong University of Science and Technology

Self-supervised pretraining framework for 3D medical image encoders that learns anatomy by predicting where a sub-volume sits within a CT scan.

Released: February 2024

VoCo (Volume Contrast) is a self-supervised learning framework for pretraining 3D medical image encoders, designed to reduce the heavy annotation burden that limits volumetric tasks such as CT segmentation. Most self-supervised methods for medical imaging adapt natural-image objectives — masked reconstruction or instance discrimination — that ignore a property unique to clinical scans: human anatomy is highly structured, and organs occupy consistent relative positions across patients. VoCo turns this anatomical regularity into a free supervision signal.

The core idea is a contextual position-prediction task. From a 3D volume, the framework extracts a set of non-overlapping "base" crops that tile distinct anatomical regions and are encouraged to be mutually discriminative in feature space. A randomly sampled sub-volume is then matched against these base crops by contrastive similarity, and the model learns to predict the proportional overlap — effectively asking "where in the body does this patch belong?" Solving this task forces the encoder to internalize organ positions and geometric relationships without any manual labels.
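The position-prediction target can be made concrete with a small sketch. It assumes axis-aligned crops described as (start, size) voxel boxes, takes the fraction of the random crop's volume covered by each base crop as the label, and compares those labels with cosine similarities between feature vectors. The function names, the 2x2 grid, the clipping of similarities to [0, 1], and the mean-absolute-error loss are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np


def overlap_proportions(crop_start, crop_size, base_starts, base_size):
    """Fraction of the random crop's volume that lies inside each base crop."""
    crop_start = np.asarray(crop_start, dtype=float)
    crop_size = np.asarray(crop_size, dtype=float)
    base_starts = np.asarray(base_starts, dtype=float)
    base_size = np.asarray(base_size, dtype=float)
    # Intersection box between the crop and every base crop.
    lo = np.maximum(base_starts, crop_start)
    hi = np.minimum(base_starts + base_size, crop_start + crop_size)
    inter = np.clip(hi - lo, 0.0, None).prod(axis=1)
    return inter / crop_size.prod()


def cosine_similarity(query, bases):
    """Cosine similarity between one query vector and each row of `bases`."""
    query = np.asarray(query, dtype=float)
    bases = np.asarray(bases, dtype=float)
    q = query / np.linalg.norm(query)
    b = bases / np.linalg.norm(bases, axis=1, keepdims=True)
    return b @ q


def position_loss(query_feat, base_feats, labels):
    """Assumed loss: mean absolute gap between similarities and overlap labels."""
    sims = np.clip(cosine_similarity(query_feat, base_feats), 0.0, 1.0)
    return float(np.abs(sims - labels).mean())


# Hypothetical 2x2 grid of 64^3 base crops tiling a 128x128x64 volume.
base_starts = [(0, 0, 0), (0, 64, 0), (64, 0, 0), (64, 64, 0)]
# A random crop centred on the grid overlaps each base crop equally.
labels = overlap_proportions((32, 32, 0), (64, 64, 64), base_starts, (64, 64, 64))

# Toy features standing in for encoder outputs.
base_feats = np.eye(4)
query_feat = np.ones(4)
loss = position_loss(query_feat, base_feats, labels)
```

Here every label equals 0.25 because the crop straddles all four base crops evenly. In practice the feature vectors would come from the encoder being pretrained, and the loss would be minimized over many randomly placed crops.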

VoCo was introduced by Linshan Wu, Jiaxin Zhuang, and Hao Chen at the Hong Kong University of Science and Technology and published at CVPR 2024. The authors released pretrained backbones and the curated pretraining datasets, and have since extended the approach to a substantially larger-scale follow-up for general-purpose 3D medical representation learning.

Key Features

Position-aware pretext task: Instead of generic masking, VoCo predicts the anatomical region a sub-volume belongs to, directly encoding organ-position priors that transfer well to downstream localization and segmentation.
Contrastive base-crop assignment: Base crops are enforced to be feature-discriminative, providing stable contrastive targets that represent distinct anatomical zones within each scan.
Label-free pretraining: The entire objective is derived from spatial geometry, so large unlabeled CT collections can be used without annotation.
Released pretrained backbones: Apache-2.0 checkpoints are provided for models pretrained on 10k and 160k CT volumes, ready for fine-tuning on new tasks.
Open pretraining corpus: The aggregated 10k-volume dataset, assembled from public collections, is published on Hugging Face for reproducibility.
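The "feature-discriminative" requirement on base crops can be illustrated with a simple penalty on pairwise similarity. The sketch below assumes each base crop has already been encoded into a feature vector and penalizes the mean absolute cosine similarity between different base crops. The function name and the exact form of the penalty are assumptions for illustration, not the formulation in the paper.

```python
import numpy as np


def base_crop_separation_penalty(base_features):
    """Mean absolute cosine similarity between distinct base-crop features.

    Returns 0 when all base features are mutually orthogonal and 1 when
    they all point in the same (or opposite) direction.
    """
    f = np.asarray(base_features, dtype=float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    sim = f @ f.T
    n = sim.shape[0]
    # Ignore the diagonal: each crop is trivially similar to itself.
    off_diagonal = sim[~np.eye(n, dtype=bool)]
    return float(np.abs(off_diagonal).mean())


well_separated = base_crop_separation_penalty(np.eye(4))
collapsed = base_crop_separation_penalty(np.ones((4, 8)))
```

A penalty of this kind discourages the encoder from mapping every region of a scan to the same feature, which would make the base crops useless as position anchors.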

Technical Details

VoCo uses a Swin Transformer encoder in a SwinUNETR-style architecture, a widely used backbone for 3D volumetric medical image analysis. Pretraining data is aggregated from open-source CT collections, including BTCV, TCIA-COVID-19, LUNA16, STOIC21, TotalSegmentator, FLARE23, LIDC, and HNSCC, to form a 10k-volume corpus; a model pretrained on a larger 160k-volume corpus has also been released. The released checkpoints are distributed under Apache-2.0. In the CVPR 2024 evaluation, VoCo was assessed on six downstream 3D medical tasks covering segmentation and classification and outperformed prior self-supervised approaches, including masked-image-modeling and instance-discrimination baselines. The results suggest that explicit position supervision transfers better than position-agnostic pretext tasks.

Applications

VoCo targets researchers and clinical-AI developers building 3D medical image models where labeled volumes are scarce and expensive. Its pretrained encoders serve as a strong initialization for organ and tumor segmentation, lesion detection, and volumetric classification, shortening training time and improving accuracy when fine-tuning on small task-specific datasets. Because the pretext task captures anatomy, it is especially well suited to whole-body and multi-organ CT workflows, and the released datasets make it a practical starting point for benchmarking new self-supervised methods.
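Using a pretrained encoder as initialization usually means copying only the weights that the downstream model can accept. The sketch below is framework-agnostic and assumes checkpoints are plain dictionaries mapping parameter names to arrays; the `encoder.` prefix, the parameter names, and the helper itself are hypothetical and do not reflect the layout of the released VoCo checkpoints.

```python
import numpy as np


def select_transferable(pretrained, target, prefix="encoder."):
    """Pick pretrained entries that match the target model by name and shape.

    Returns (picked, skipped): a dict of weights safe to copy, and the names
    that were left out (wrong prefix, missing in target, or shape mismatch).
    """
    picked, skipped = {}, []
    for name, weights in pretrained.items():
        matches = (
            name.startswith(prefix)
            and name in target
            and target[name].shape == weights.shape
        )
        if matches:
            picked[name] = weights
        else:
            skipped.append(name)
    return picked, skipped


# Hypothetical checkpoint: encoder weights plus a pretraining-only head.
pretrained = {
    "encoder.patch_embed.weight": np.zeros((48, 1, 2, 2, 2)),
    "encoder.layer1.weight": np.zeros((96, 48)),
    "pretext_head.weight": np.zeros((16, 96)),
}
# Hypothetical downstream model: same encoder, new segmentation head.
target = {
    "encoder.patch_embed.weight": np.ones((48, 1, 2, 2, 2)),
    "encoder.layer1.weight": np.ones((96, 48)),
    "seg_head.weight": np.ones((14, 96)),
}

picked, skipped = select_transferable(pretrained, target)
target.update(picked)  # the task-specific head keeps its fresh initialization
```

Filtering by both name and shape keeps the task-specific head untouched, so it can be trained from scratch on the small labeled dataset while the encoder starts from pretrained weights.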

Impact

VoCo showed that a domain-specific pretext task, one that exploits the spatial regularity of human anatomy, can outperform generic self-supervised objectives borrowed from natural images. The open release of pretrained weights and a curated multi-source CT corpus lowered the barrier to entry for 3D medical pretraining, and VoCo has been adopted as a baseline in subsequent work. The authors' large-scale extension builds directly on this framework and scales it to more volumes and tasks. The approach is specialized for CT-style volumetric data; transfer to other modalities such as MRI, or to non-axial anatomy, may require adapting the position-prediction scheme.

Citations

VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

Preprint

Wu, L., et al. (2024) VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis. Computer Vision and Pattern Recognition.

DOI: 10.48550/arXiv.2402.17300

VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

Wu, L., et al. (2024) VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis. Computer Vision and Pattern Recognition.

DOI: 10.1109/CVPR52733.2024.02158

Recent citations

Papers that recently cited this model.

Topology-Driven Transferability Estimation for 3D Medical Vision Foundation Models
Jiaqi Tang, Shaoyang Zhang, Fandong Zhang, et al.
Jul 2026
Citations: 0 · Influential

StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT
Weiru Wang, S. Olthuis, E. Lavrova, et al.
Jun 2026
Citations: 0 · Influential

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes
Renjie Liang, Z. Fan, Jinqian Pan, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024
Citations: 134

Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: Applications, Challenges, and Future Directions
Kai Sun, Siyan Xue, Fuchun Sun, et al.
Artif. Intell. Medicine · Dec 2024
Citations: 39

Large-Scale 3D Medical Image Pre-Training With Geometric Context Priors
Linshan Wu, Jiaxin Zhuang, Hao Chen
IEEE Transactions on Pattern Analysis and Machine Intelligence · Oct 2024
Citations: 37 · Influential

Revisiting MAE pre-training for 3D medical image segmentation
Tassilo Wald, Constantin Ulrich, Stanislav Lukyanenko, et al.
Computer Vision and Pattern Recognition · Oct 2024
Citations: 34 · Influential

Advancing Volumetric Medical Image Segmentation via Global-Local Masked Autoencoders
Jiafan Zhuang, Luyang Luo, Hao Chen
IEEE Transactions on Medical Imaging · Jun 2023
Citations: 33

Citations

Total Citations: 113

Influential: 21

References: 74

GitHub

Stars: 230

Forks: 18

Open Issues: 3

Contributors: 1

Last Push: 7mo ago

Language: Python

License: Apache-2.0

Fields of citing research

Computer Science: 99%
Medicine: 96%
Engineering: 32%
Biology: 3%
Environmental Science: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

69 · Partial

Usability (can I run it?): 77

Reproducibility (can I retrain it?): 61

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository
Research Paper
Dataset
