Hong Kong University of Science and Technology
A volume contrastive self-supervised framework that pretrains 3D medical image encoders by predicting the anatomical position of sub-volumes within CT scans.
VoCo (Volume Contrast) is a self-supervised learning framework for pretraining 3D medical image encoders, designed to reduce the heavy annotation burden that limits volumetric tasks such as CT segmentation. Most self-supervised methods for medical imaging adapt natural-image objectives — masked reconstruction or instance discrimination — that ignore a property unique to clinical scans: human anatomy is highly structured, and organs occupy consistent relative positions across patients. VoCo turns this anatomical regularity into a free supervision signal.
The core idea is a contextual position-prediction task. From a 3D volume, the framework extracts a set of non-overlapping "base" crops that tile distinct anatomical regions and are encouraged to be mutually discriminative in feature space. A randomly sampled sub-volume is then matched against these base crops by contrastive similarity, and the model learns to predict the proportional overlap — effectively asking "where in the body does this patch belong?" Solving this task forces the encoder to internalize organ positions and geometric relationships without any manual labels.
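The position-prediction task described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`overlap_fractions`, `cosine_similarity`, `position_loss`), the regular grid layout of base crops, and the mean-absolute-error loss are all assumptions chosen for clarity, and a real pipeline would obtain the feature vectors from the 3D encoder rather than from arrays supplied by hand.

```python
import numpy as np


def overlap_fractions(crop_start, crop_size, grid_shape, base_size):
    """Fraction of the random crop's volume that lies inside each base crop.

    Base crops are assumed (for illustration) to tile the volume as a regular,
    non-overlapping grid of boxes of size `base_size`, starting at the origin.
    """
    crop_start = np.asarray(crop_start, dtype=float)
    crop_size = np.asarray(crop_size, dtype=float)
    base_size = np.asarray(base_size, dtype=float)
    crop_end = crop_start + crop_size

    fractions = np.zeros(grid_shape, dtype=float)
    for idx in np.ndindex(*grid_shape):
        base_start = np.asarray(idx) * base_size
        base_end = base_start + base_size
        # Per-axis intersection length, clipped at zero when boxes are disjoint.
        inter = np.clip(
            np.minimum(crop_end, base_end) - np.maximum(crop_start, base_start),
            0.0,
            None,
        )
        fractions[idx] = inter.prod() / crop_size.prod()
    return fractions.ravel()


def cosine_similarity(query, bases):
    """Cosine similarity between one query feature (d,) and base features (n, d)."""
    q = query / np.linalg.norm(query)
    b = bases / np.linalg.norm(bases, axis=1, keepdims=True)
    return b @ q


def position_loss(pred, target):
    """Stand-in loss (an assumption): mean absolute error between the
    similarities, clipped to [0, 1], and the overlap fractions."""
    return float(np.mean(np.abs(np.clip(pred, 0.0, 1.0) - target)))


# A 2 x 2 x 1 grid of 64^3 base crops and a random crop straddling all four.
labels = overlap_fractions(
    crop_start=(32, 32, 0),
    crop_size=(64, 64, 64),
    grid_shape=(2, 2, 1),
    base_size=(64, 64, 64),
)

# Hypothetical encoder outputs: four base features and one query feature.
rng = np.random.default_rng(0)
base_features = rng.normal(size=(4, 16))
query_feature = base_features.mean(axis=0)

predicted = cosine_similarity(query_feature, base_features)
loss = position_loss(predicted, labels)
```

In this toy setup the random crop overlaps each of the four base crops equally, so every target is 0.25. Training would push the clipped similarities toward those targets, which is one simple way to express "where in the body does this patch belong?" as a regression on overlap.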
VoCo was introduced by Linshan Wu, Jiaxin Zhuang, and Hao Chen at the Hong Kong University of Science and Technology and published at CVPR 2024. The authors released pretrained backbones and the curated pretraining datasets, and have since extended the approach in a substantially larger-scale follow-up aimed at general-purpose 3D medical representation learning.
VoCo uses a Swin Transformer encoder in a SwinUNETR-style architecture, a widely used backbone for 3D volumetric medical analysis. Pretraining data is aggregated from open-source CT collections (including BTCV, TCIA-COVID-19, LUNA16, STOIC21, TotalSegmentator, FLARE23, LIDC, and HNSCC) to form a 10k-volume corpus; a model pretrained on a larger 160k-volume corpus has also been released. The released checkpoints are distributed under Apache-2.0.
In the CVPR 2024 evaluation, VoCo was assessed on six downstream 3D medical tasks covering segmentation and classification, where it outperformed prior self-supervised approaches such as masked-image-modeling and instance-discrimination baselines. The results suggest that explicit position supervision transfers better than pretext tasks that ignore anatomical position.
VoCo targets researchers and clinical-AI developers building 3D medical image models where labeled volumes are scarce and expensive. Its pretrained encoders serve as a strong initialization for organ and tumor segmentation, lesion detection, and volumetric classification, shortening training time and improving accuracy when fine-tuning on small task-specific datasets. Because the pretext task captures anatomy, it is especially well suited to whole-body and multi-organ CT workflows, and the released datasets make it a practical starting point for benchmarking new self-supervised methods.
VoCo demonstrated that domain-specific pretext design — exploiting the spatial regularity of human anatomy — can outperform generic self-supervised objectives borrowed from natural images, an influential point for the volumetric medical imaging community. The open release of pretrained weights and a curated multi-source CT corpus lowered the barrier to entry for 3D medical pretraining and has been adopted as a baseline in subsequent work. The authors' large-scale extension, scaling to far more volumes and tasks, builds directly on this framework. The approach is specialized for CT-style volumetric data, and transfer to other modalities such as MRI or to non-axial anatomy may require adaptation of the position-prediction scheme.
Wu, L., et al. (2024) VoCo: A Simple-Yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis. Computer Vision and Pattern Recognition.
DOI: 10.1109/CVPR52733.2024.02158 (arXiv preprint DOI: 10.48550/arXiv.2402.17300)