Imperial College London / University of Oxford / University of Science and Technology of China / Peking University / Hong Kong University of Science and Technology
Text-informed self-supervised vision-language pretraining for 3D CT volumes, enabling zero-shot classification, retrieval, report generation, and segmentation.
T3D is a self-supervised vision-language pretraining framework for three-dimensional medical images, designed to learn transferable representations of volumetric CT scans by aligning them with their paired radiology reports. Most medical vision-language pretraining (MedVLP) methods operate on 2D images such as chest X-rays, leaving volumetric modalities like CT relatively underserved despite their central role in clinical diagnosis. T3D addresses this gap by treating each CT volume and its free-text report as a paired training signal, producing image encoders that can be applied to downstream tasks with little or no task-specific labeled data.
The framework was introduced in December 2023 by Che Liu and colleagues at Imperial College London, working with collaborators at the University of Oxford, the University of Science and Technology of China, Peking University, and the Hong Kong University of Science and Technology. A central contribution is the release of CT-3DVLP, described by the authors as the first and largest public 3D volume-report dataset for this task, which provides a shared benchmark for an area that previously lacked large-scale public data.
Rather than applying naive CLIP-style global alignment between a whole volume and its report, T3D introduces Text-informed Multi-view Alignment (TMA). TMA enforces consistency across different augmented views of the same volume-report pair while injecting textual features into fine-grained visual representations, encouraging the encoder to capture clinically meaningful structure rather than superficial correlations.
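To make the contrast with plain global alignment concrete, here is a minimal sketch of a multi-view objective. It is not the authors' implementation. It assumes embeddings are already computed, uses a standard symmetric InfoNCE loss for volume-report alignment, and adds a hypothetical consistency term (mean squared distance between the normalized embeddings of two augmented views). The function names, the temperature, and the `weight` parameter are illustrative assumptions, and the text-informed injection of report features into fine-grained visual features is not modeled.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]
    return total / len(queries)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric image-to-text and text-to-image InfoNCE (CLIP-style)."""
    return 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))

def view_consistency(view_a, view_b):
    """Mean squared distance between two views' embeddings of each volume."""
    total = 0.0
    for a, b in zip(view_a, view_b):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(view_a)

def multi_view_loss(view_a, view_b, txt, weight=1.0, temperature=0.07):
    """Align both views with the report and keep the two views consistent.

    Hypothetical combination for illustration; not the published TMA loss.
    """
    view_a = [normalize(v) for v in view_a]
    view_b = [normalize(v) for v in view_b]
    txt = [normalize(v) for v in txt]
    align = clip_loss(view_a, txt, temperature) + clip_loss(view_b, txt, temperature)
    return align + weight * view_consistency(view_a, view_b)

# Toy batch of two volume-report pairs with 2-D embeddings.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
reports = [[1.0, 0.1], [0.1, 1.0]]
print(multi_view_loss(view_a, view_b, reports))
```

In this toy setup, correctly paired reports give a much lower loss than swapped reports, and the consistency term is zero only when both views of a volume map to the same embedding.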
T3D pairs a 3D vision encoder (a 3D ResNet-50 from the MONAI implementation) with the Med-CPT text encoder to embed CT volumes and their reports into a shared space. Volumes are resampled to [1, 1, 4] mm spacing and resized to 256x256x128 before training.
The CT-3DVLP dataset aggregates 52,639 paired CT volumes and radiology reports drawn from three public sources: CT-RATE (25,691), INSPECT (20,400), and BIMCV-R (6,548).
On the CT-RATE zero-shot abnormality classification benchmark, T3D reports an AUC of 73.7, accuracy of 69.0, and F1 of 72.5. After fine-tuning, the learned features yield strong segmentation performance, including Dice scores of 89.83 on AMOS and 70.12 on MSD-Lung, and the model improves over prior methods on cross-modal retrieval. Across the evaluated tasks, T3D consistently outperforms CLIP-style baselines and 2D-adapted approaches.
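The preprocessing described above (resampling to [1, 1, 4] mm spacing, then resizing to 256x256x128) can be illustrated with a small geometry sketch. It only computes grid sizes and scale factors and performs no interpolation. The function names, the (x, y, z) axis order, and the example scan dimensions are assumptions made for illustration, not details taken from the source.

```python
def resampled_shape(shape, spacing, target_spacing=(1.0, 1.0, 4.0)):
    """Voxel grid size after resampling so the physical extent is preserved.

    shape:   voxels per axis, assumed (x, y, z) order
    spacing: millimetres per voxel along the same axes
    """
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing, target_spacing)
    )

def resize_factors(shape, target_shape=(256, 256, 128)):
    """Per-axis scale factors needed to reach the fixed network input size."""
    return tuple(t / n for n, t in zip(shape, target_shape))

# Hypothetical scan: 512 x 512 x 320 voxels at 0.75 x 0.75 x 1.5 mm.
original_shape = (512, 512, 320)
original_spacing = (0.75, 0.75, 1.5)

after_resample = resampled_shape(original_shape, original_spacing)
print(after_resample)                  # grid size at [1, 1, 4] mm spacing
print(resize_factors(after_resample))  # scaling to reach 256 x 256 x 128
```

In practice the interpolation itself would be done by an imaging library; MONAI, for example, provides `Spacing` and `Resize` transforms for these two steps. The source states only that the ResNet-50 comes from MONAI, so whether the authors used these transforms is not confirmed.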
T3D targets radiology workflows that rely on volumetric CT, where labeled data is expensive to obtain. The pretrained encoder can be used without task-specific training for zero-shot screening of common findings, for retrieving similar prior cases by image or by text query, and as an initialization that improves data efficiency when fine-tuning organ or lesion segmentation models. Researchers building 3D medical foundation models benefit both from the pretrained encoder and from CT-3DVLP as a standardized pretraining and evaluation corpus.
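The zero-shot screening and retrieval uses described above both reduce to comparing embeddings in the shared space. The following is a minimal sketch under stated assumptions: embeddings are given as plain lists, zero-shot scoring contrasts one positive and one negative text prompt per finding (a common CLIP-style convention, not confirmed as the T3D protocol), and all names, the temperature, and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def zero_shot_score(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Softmax probability that the positive prompt matches the image."""
    pos = cosine(image_emb, pos_text_emb) / temperature
    neg = cosine(image_emb, neg_text_emb) / temperature
    m = max(pos, neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
    return e_pos / (e_pos + e_neg)

def retrieve(query_emb, candidates, top_k=3):
    """Return candidate ids ranked by cosine similarity to the query."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [case_id for case_id, _ in ranked[:top_k]]

# Toy 3-D embeddings standing in for encoder outputs.
volume = [0.9, 0.1, 0.0]
prompt_present = [1.0, 0.0, 0.0]  # e.g. a "finding is present" prompt
prompt_absent = [0.0, 1.0, 0.0]   # e.g. a "finding is absent" prompt
print(zero_shot_score(volume, prompt_present, prompt_absent))

prior_cases = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.0, 1.0, 0.0],
    "case_c": [0.7, 0.7, 0.0],
}
print(retrieve(volume, prior_cases, top_k=2))
```

The same `retrieve` function covers both image-to-image and text-to-image queries, because the query only needs to be an embedding in the shared space.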
By demonstrating that report-guided self-supervision transfers effectively to volumetric imaging, T3D helped extend the medical vision-language paradigm from 2D to 3D and established an early public benchmark for the task through CT-3DVLP. The work has been cited by subsequent 3D MedVLP efforts and contributed to a growing line of research on scalable language-image pretraining for CT. A practical limitation is that, per the authors, full data and code were slated for release upon publication, so availability of pretrained weights should be confirmed against the latest repository before relying on them in production.
Liu, C., et al. (2023) T3D: Advancing 3D Medical Vision-Language Pre-Training by Learning Multi-View Visual Consistency. 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
DOI: 10.1109/ICCVW69036.2025.00698
arXiv DOI: 10.48550/arXiv.2312.01529