PTUnifier

Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University / Shenzhen Research Institute of Big Data

Medical vision-language pretraining unifying fusion-encoder and dual-encoder designs, handling image-only, text-only, and paired inputs in one model.

Released: February 2023

Medical vision-and-language pretraining (Med-VLP) learns generic representations from large collections of medical images paired with their textual reports, then transfers them to downstream tasks such as visual question answering, report generation, and cross-modal retrieval. Before PTUnifier, the field was split between two incompatible architectural families: fusion-encoder models that jointly attend over image-text pairs (strong on multimodal reasoning, but requiring paired inputs) and dual-encoder models that encode each modality separately (efficient for retrieval and uni-modal tasks, but weaker at fine-grained fusion). A model trained as one type could not serve the task formats native to the other.

PTUnifier, introduced by Zhihong Chen and colleagues at the Chinese University of Hong Kong, Shenzhen, Sun Yat-sen University, and the Shenzhen Research Institute of Big Data, unifies these two paradigms within a single pretrained model. Published at ICCV 2023 (preprint February 2023), it inserts learnable visual prompts and textual prompts that act as a feature bank storing the most representative images and texts. When an input lacks one modality, the corresponding prompts stand in for the missing image or text, letting the same network process image-only, text-only, and image-text-pair inputs without architectural changes.

By making prompts the bridge between encoder types, PTUnifier turns what had been a hard design choice into a single configurable model that covers uni-modal, cross-modal, and multimodal medical tasks.

Key Features

Unified architecture: A single pretrained model subsumes both fusion-encoder and dual-encoder behavior, eliminating the need to choose an architecture upfront for a given downstream task.
Soft prompts as a feature bank: Visual and textual prompts store representative features and substitute for absent modalities, so image-only, text-only, and paired inputs all flow through the same network.
Dynamic prompt pool: Rather than a fixed set of prompts, a pool is sampled dynamically during training to improve diversity, scalability, and representativeness of the stored features.
Broad task coverage: One model handles classification, report summarization, image-to-text generation, cross-modal retrieval, and visual question answering.
Complementary design: The prompting mechanism is largely orthogonal to existing Med-VLP methods, making it a drop-in extension rather than a competing framework.
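
The prompt mechanism described in the features above can be illustrated with a small sketch. It is a toy reading, not the released implementation: the pools here hold fixed random vectors rather than learned prompts, and the selection rule (top-k cosine similarity against a mean-pooled query from the modality that is present) is an assumption made for illustration. The names `select_prompts`, `build_inputs`, and the pool sizes are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def mean_pool(tokens):
    """Average a list of token vectors into a single query vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def select_prompts(pool, query, k):
    """Return the k pool entries most similar to the query (assumed rule)."""
    ranked = sorted(pool, key=lambda prompt: cosine(prompt, query), reverse=True)
    return ranked[:k]


def build_inputs(image_tokens, text_tokens, visual_pool, textual_pool, k=4):
    """Fill in a missing modality with prompts so the network always sees both sides."""
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality must be present")
    if image_tokens is None:
        # Text-only input: visual prompts stand in for the missing image.
        image_tokens = select_prompts(visual_pool, mean_pool(text_tokens), k)
    elif text_tokens is None:
        # Image-only input: textual prompts stand in for the missing text.
        text_tokens = select_prompts(textual_pool, mean_pool(image_tokens), k)
    return image_tokens, text_tokens


def random_vectors(rng, count, dim):
    """Stand-in for learned embeddings; real prompts would be trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8
visual_pool = random_vectors(rng, 16, DIM)
textual_pool = random_vectors(rng, 16, DIM)
image = random_vectors(rng, 5, DIM)  # five image patch tokens
text = random_vectors(rng, 3, DIM)   # three text tokens

paired = build_inputs(image, text, visual_pool, textual_pool)
image_only = build_inputs(image, None, visual_pool, textual_pool)
text_only = build_inputs(None, text, visual_pool, textual_pool)

for name, (img, txt) in [("paired", paired), ("image-only", image_only), ("text-only", text_only)]:
    print(f"{name}: {len(img)} image-side tokens, {len(txt)} text-side tokens")
```

In all three cases the downstream network receives an image-side and a text-side sequence, which is the property that lets one model accept paired, image-only, and text-only inputs.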

Technical Details

PTUnifier builds on a METER-style transformer backbone that combines a vision transformer image encoder and a text encoder with a multimodal fusion module. It is pretrained with standard Med-VLP objectives, including masked language modeling, image-text matching, and image-text contrastive learning, on radiology-focused image-text corpora: ROCO, MIMIC-CXR, and MedICaT.

Evaluation spans three groups of tasks: uni-modal (multi-label classification on CheXpert, RSNA Pneumonia classification, RadNLI, and report summarization), cross-modal (ROCO retrieval and MIMIC-CXR report generation), and multimodal (visual question answering on VQA-RAD, SLAKE, and MedVQA-2019). Across this suite, the unified model reports results that were competitive with or at the state of the art at the time of publication. Pretrained checkpoints and the training and fine-tuning code are released in the official repository.
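
Of the pretraining objectives named above, image-text contrastive learning is the easiest to show in isolation. The following is a minimal sketch of a symmetric InfoNCE-style loss in plain Python. The temperature value of 0.07, the L2 normalization, and the function names are assumptions for illustration; the text above does not specify how PTUnifier configures this objective.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def log_softmax_at(logits, target):
    """Log-probability of the target index under a numerically stable softmax."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[target] - log_sum_exp


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss; row i of each list is a matched pair."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # Scaled cosine similarity between every image and every text in the batch.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Image-to-text: each image should pick out its own report.
    i2t = -sum(log_softmax_at(sim[r], r) for r in range(n)) / n
    # Text-to-image: each report should pick out its own image.
    t2i = -sum(log_softmax_at([sim[r][c] for r in range(n)], c) for c in range(n)) / n
    return 0.5 * (i2t + t2i)


images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.2, 0.8]]   # text i resembles image i
swapped = [matched[1], matched[0]]   # pairs deliberately misaligned

loss_matched = contrastive_loss(images, matched)
loss_swapped = contrastive_loss(images, swapped)
print(f"matched pairs: {loss_matched:.4f}")
print(f"swapped pairs: {loss_swapped:.4f}")
```

The loss is low when each image is closest to its own text and high when the pairing is scrambled, which is the signal that pulls matched image and report embeddings together during pretraining.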

Applications

PTUnifier is aimed at researchers building clinical AI systems over radiology data, especially chest X-rays and broader radiology image-report collections. Because one pretrained model serves retrieval, report generation, summarization, classification, and VQA, teams can fine-tune a single backbone across an entire medical imaging workflow instead of maintaining separate fusion- and dual-encoder models. This is useful for prototyping diagnostic assistants, report-drafting tools, and image search over institutional archives.

Impact

PTUnifier demonstrated that the long-standing fusion- versus dual-encoder divide in medical vision-language pretraining could be resolved with a prompt bank rather than a new architecture, and its ICCV 2023 publication and public code made it a reference point for subsequent unified Med-VLP work (for example, later efforts such as MedUnifier that add vision-generation objectives). Its main limitations mirror the field: pretraining is concentrated on radiology and chest X-ray data, MIMIC-CXR access requires PhysioNet credentialing, and performance on modalities outside the training distribution is not guaranteed.

Citations

Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Chen, Z., et al. (2023) Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. IEEE/CVF International Conference on Computer Vision (ICCV).

DOI: 10.1109/ICCV51070.2023.02139

Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Preprint

Chen, Z., et al. (2023) Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. arXiv preprint arXiv:2302.08958.

DOI: 10.48550/arXiv.2302.08958

Recent citations

Papers that recently cited this model.

Medical Vision-Language Models: Existing Technologies, Clinical Applications and Future Directions
Le Zou, Mengyu Ma, Jun Li, et al.
Italian National Conference on Sensors · Jun 2026 · Cited by 0

Deep learning Algorithm for Wound assessment after total kNee (DAWN) arthroplasty
S.Y. Pendyala, Adi Vijay, Nimra Akram, et al.
Bone & Joint Open · May 2026 · Cited by 0

MG-3D: Multi-grained knowledge-enhanced vision-language pre-training for 3D medical image analysis
Xuefeng Ni, Linshan Wu, Jiaxin Zhuang, et al.
Medical Image Analysis · Mar 2026 · Cited by 1

Top citations

The most-cited papers that cite this model.

Prompt Engineering for Healthcare: Methodologies and Applications
Jiaqi Wang, Enze Shi, Sigang Yu, et al.
Meta-Radiology · Apr 2023 · Cited by 179

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad, Reza Azad, Sania Eskandari, et al.
arXiv.org · Oct 2023 · Cited by 125

Multimodal generative AI for medical image interpretation
Vishwanatha M. Rao, Michael Hla, Michael Moor, et al.
Nature · Mar 2025 · Cited by 92

C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion
Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, et al.
International Conference on Learning Representations · Mar 2024 · Cited by 89

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin, Yifan Zhu, Xin Mei, et al.
Information Fusion · Aug 2024 · Cited by 85

Citations

Total Citations: 53
Influential: 5
References: 72

GitHub

Stars: 78
Forks: 3
Open Issues: 7
Contributors: 1
Last Push: 2y ago
Language: Python

Fields of citing research

Computer Science: 98%
Medicine: 91%
Engineering: 17%
Physics: 2%
Biology: 2%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 56 (Partial)
Usability (can I run it?): 62
Reproducibility (can I retrain it?): 50

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · Official Website
