T3D

Imperial College London / University of Oxford / University of Science and Technology of China / Peking University / Hong Kong University of Science and Technology

Vision-language pretraining for 3D CT volumes, aligning scans with their radiology reports for zero-shot classification, retrieval, and segmentation.

Released: December 2023

T3D is a self-supervised vision-language pretraining framework for three-dimensional medical images, designed to learn transferable representations of volumetric CT scans by aligning them with their paired radiology reports. Most medical vision-language pretraining (MedVLP) methods operate on 2D images such as chest X-rays, leaving volumetric modalities like CT relatively underserved despite their central role in clinical diagnosis. T3D addresses this gap by treating each CT volume and its free-text report as a paired training signal, producing image encoders that can be applied to downstream tasks with little or no task-specific labeled data.

The framework was introduced in December 2023 by Che Liu and colleagues at Imperial College London, working with collaborators at the University of Oxford, the University of Science and Technology of China, Peking University, and the Hong Kong University of Science and Technology. A central contribution is the release of CT-3DVLP, described by the authors as the first and largest public 3D volume-report dataset for this task, which provides a shared benchmark for an area that previously lacked large-scale public data.

Rather than applying naive CLIP-style global alignment between a whole volume and its report, T3D introduces Text-informed Multi-view Alignment (TMA). TMA enforces consistency across different augmented views of the same volume-report pair while injecting textual features into fine-grained visual representations, encouraging the encoder to capture clinically meaningful structure rather than superficial correlations.
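To make the contrast concrete, the following is a minimal sketch of a CLIP-style global alignment loss together with a generic cross-view consistency term. It is not the authors' TMA objective: the way TMA injects text features into fine-grained visual representations is not modeled here, and the function names, the temperature of 0.07, and the equal weighting of the two terms are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

def global_alignment_loss(a, b, temperature=0.07):
    """Symmetric CLIP-style loss between two batches of paired embeddings."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def multi_view_loss(view_a, view_b, report_emb, temperature=0.07):
    """Hypothetical combination (assumption, not TMA itself):
    align each view with the report, and the two views with each other."""
    text_term = 0.5 * (global_alignment_loss(view_a, report_emb, temperature)
                       + global_alignment_loss(view_b, report_emb, temperature))
    view_term = global_alignment_loss(view_a, view_b, temperature)
    return text_term + view_term

# Toy 2-D embeddings for a batch of two volume-report pairs.
reports = [[1.0, 0.0], [0.0, 1.0]]
view_a = [[0.9, 0.1], [0.1, 0.9]]   # first augmented view of each volume
view_b = [[0.8, 0.2], [0.2, 0.8]]   # second augmented view of each volume
shuffled = [view_a[1], view_a[0]]   # view A deliberately paired with the wrong report

aligned = multi_view_loss(view_a, view_b, reports)
misaligned = multi_view_loss(shuffled, view_b, reports)
print(f"aligned={aligned:.4f} misaligned={misaligned:.4f}")
```

Correctly paired views and reports give a lower loss than the shuffled batch, which is the behavior any such alignment objective is meant to reward.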

Key Features

Volumetric vision-language pretraining: Operates directly on 3D CT volumes paired with radiology reports, extending self-supervised MedVLP beyond the more common 2D X-ray setting.
Text-informed Multi-view Alignment (TMA): Enforces representation consistency across multiple views of a volume while integrating report-derived text features into fine-grained visual embeddings.
Zero-shot capability: The pretrained encoder supports zero-shot classification and cross-modal retrieval without downstream label supervision.
Broad downstream transfer: A single pretrained model transfers to classification, cross-modal retrieval, report generation, and semantic segmentation in both unimodal and cross-modal settings.
Public benchmark dataset: Accompanied by CT-3DVLP, a large public 3D volume-report corpus that lowers the barrier to reproducible 3D MedVLP research.
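The zero-shot capability listed above can be illustrated with a small sketch. It assumes a common prompt-pair scheme in which a volume embedding is compared against the text embeddings of a "finding present" and a "finding absent" prompt; the source does not specify T3D's prompting scheme, so the function name, the temperature, and the toy vectors below are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probability(volume_emb, present_emb, absent_emb, temperature=0.07):
    """Softmax over cosine similarity to a 'present' and an 'absent' prompt.

    Returns the probability assigned to the 'present' prompt.
    """
    v = normalize(volume_emb)
    pos = sum(a * b for a, b in zip(v, normalize(present_emb))) / temperature
    neg = sum(a * b for a, b in zip(v, normalize(absent_emb))) / temperature
    m = max(pos, neg)  # stabilise the exponentials
    e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
    return e_pos / (e_pos + e_neg)

# Toy embeddings standing in for encoder outputs (assumed values).
volume = [0.9, 0.2, 0.1]
prompt_present = [1.0, 0.0, 0.0]   # e.g. text embedding of "finding is present"
prompt_absent = [0.0, 1.0, 0.0]    # e.g. text embedding of "finding is absent"

p_present = zero_shot_probability(volume, prompt_present, prompt_absent)
print(f"P(present) = {p_present:.4f}")
```

No labeled examples are used at inference time; the decision comes entirely from the shared embedding space learned during pretraining.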

Technical Details

T3D pairs a 3D vision encoder (a 3D ResNet-50 from the MONAI implementation) with the Med-CPT text encoder to embed CT volumes and their reports into a shared space. Volumes are resampled to [1, 1, 4] mm spacing and resized to 256x256x128 voxels before training.

The CT-3DVLP dataset aggregates 52,639 paired CT volumes and radiology reports drawn from three public sources: CT-RATE (25,691), INSPECT (20,400), and BIMCV-R (6,548).

On the CT-RATE zero-shot abnormality classification benchmark, T3D reports an AUC of 73.7, accuracy of 69.0, and F1 of 72.5. After fine-tuning, the learned features yield strong segmentation performance, including Dice scores of 89.83 on AMOS and 70.12 on MSD-Lung, and the model improves over prior methods on cross-modal retrieval. Across the evaluated tasks, T3D consistently outperforms CLIP-style baselines and 2D-adapted approaches.
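The preprocessing geometry and dataset composition stated above can be checked with a few lines of arithmetic. This is only a sketch: the (x, y, z) axis order, the example scan dimensions, and the helper names are assumptions, and the actual pipeline would perform the interpolation with an imaging library rather than just computing shapes.

```python
TARGET_SPACING_MM = (1.0, 1.0, 4.0)   # from the text; (x, y, z) order is assumed
TARGET_SHAPE = (256, 256, 128)        # from the text

def resampled_shape(shape, spacing_mm, target_spacing_mm=TARGET_SPACING_MM):
    """Voxel grid size after resampling to the target spacing.

    Physical extent (voxels * spacing) is preserved along each axis.
    """
    return tuple(round(n * s / t)
                 for n, s, t in zip(shape, spacing_mm, target_spacing_mm))

def resize_factors(shape, target_shape=TARGET_SHAPE):
    """Per-axis zoom factors needed to reach the fixed network input size."""
    return tuple(t / n for n, t in zip(shape, target_shape))

# Pair counts per source as listed in the text.
CT_3DVLP_SOURCES = {"CT-RATE": 25_691, "INSPECT": 20_400, "BIMCV-R": 6_548}

# Hypothetical scan: 512 x 512 x 320 voxels at 0.75 x 0.75 x 1.5 mm.
shape_after_resampling = resampled_shape((512, 512, 320), (0.75, 0.75, 1.5))
factors = resize_factors(shape_after_resampling)
total_pairs = sum(CT_3DVLP_SOURCES.values())

print("after resampling:", shape_after_resampling)
print("resize factors:", tuple(round(f, 3) for f in factors))
print("total volume-report pairs:", total_pairs)
```

For this example scan the resampled grid is 384 x 384 x 120 voxels, so the final resize shrinks the in-plane axes and slightly stretches the through-plane axis. The three source counts sum to the 52,639 pairs quoted for CT-3DVLP.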

Applications

T3D targets radiology workflows that rely on volumetric CT, where labeled data is expensive to obtain. The pretrained encoder can be used off the shelf for zero-shot screening of common findings, for retrieving similar prior cases by image or by text query, and as an initialization that improves data efficiency when fine-tuning organ or lesion segmentation models. Researchers building 3D medical foundation models benefit both from the released encoder and from CT-3DVLP as a standardized pretraining and evaluation corpus.
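Retrieval of prior cases by text query reduces to ranking stored volume embeddings by similarity to a query embedding. The sketch below shows that ranking and a recall@k check on toy vectors; the case identifiers, embeddings, and helper names are hypothetical, and a real system would obtain the embeddings from the pretrained encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_emb, gallery, k=2):
    """Return the ids of the k gallery volumes most similar to the query."""
    ranked = sorted(gallery,
                    key=lambda case_id: cosine(query_emb, gallery[case_id]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired case id appears in the top k results."""
    hits = sum(1 for case_id, q in queries.items()
               if case_id in retrieve(q, gallery, k))
    return hits / len(queries)

# Toy volume embeddings keyed by a hypothetical case id.
gallery = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.0, 1.0, 0.0],
    "case_c": [0.7, 0.7, 0.0],
}
# Toy report (text) embeddings for the same cases.
text_queries = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [0.1, 0.9, 0.1],
    "case_c": [0.6, 0.8, 0.0],
}

top_for_a = retrieve(text_queries["case_a"], gallery, k=1)
r_at_1 = recall_at_k(text_queries, gallery, k=1)
print(top_for_a, r_at_1)
```

The same ranking function serves image-to-image lookup if the query is a volume embedding instead of a text embedding.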

Impact

By demonstrating that report-guided self-supervision transfers effectively to volumetric imaging, T3D helped extend the medical vision-language paradigm from 2D to 3D and established an early public benchmark for the task through CT-3DVLP. The work has been cited by subsequent 3D MedVLP efforts and contributed to a growing line of research on scalable language-image pretraining for CT. A practical limitation is that, per the authors, full data and code were slated for release upon publication, so availability of pretrained weights should be confirmed against the latest repository before relying on them in production.

Citations

T3D: Advancing 3D Medical Vision-Language Pre-Training by Learning Multi-View Visual Consistency

Liu, C., et al. (2023) T3D: Advancing 3D Medical Vision-Language Pre-Training by Learning Multi-View Visual Consistency. 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

DOI: 10.1109/ICCVW69036.2025.00698

T3D: Advancing 3D Medical Vision-Language Pre-Training by Learning Multi-View Visual Consistency

Preprint

Liu, C., et al. (2023) T3D: Advancing 3D Medical Vision-Language Pre-Training by Learning Multi-View Visual Consistency. arXiv preprint arXiv:2312.01529.

DOI: 10.48550/arXiv.2312.01529

Recent citations

Papers that recently cited this model.

MAC-Splat: Multi-Attribute Consistency for High-Fidelity Sparse-View Reconstruction
Jinqian Yang, Yichen Wu, Wanhua Li, et al.
Jul 2026
Citations: 0
Medical Vision-Language Models: Existing Technologies, Clinical Applications and Future Directions
Le Zou, Mengyu Ma, Jun Li, et al.
Italian National Conference on Sensors · Jun 2026
Citations: 0
ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training
Rongsheng Wang, Fenghe Tang, Zihang Jiang, et al.
May 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024
Citations: 134
Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement
Che Liu, Zhongwei Wan, Ouyang Cheng, et al.
International Conference on Machine Learning · Mar 2024
Citations: 92
IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-Training
Che Liu, Sibo Cheng, Miaojing Shi, et al.
IEEE Transactions on Medical Imaging · Oct 2023
Citations: 42
BIMCV-R: A Landmark Dataset for 3D CT Text-Image Retrieval
Yinda Chen, Che Liu, Xiaoyu Liu, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Mar 2024
Citations: 41
Learning Multiscale Consistency for Self-Supervised Electron Microscopy Instance Segmentation
Yinda Chen, Wei Huang, Xiaoyu Liu, et al.
IEEE International Conference on Acoustics, Speech, and Signal Processing · Aug 2023
Citations: 27

Citations

Total Citations: 16

Influential: 0

References: 57

Fields of citing research

Computer Science: 100%
Medicine: 73%
Engineering: 40%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

12 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 18

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

Research Paper · Official Website
