Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University / Shenzhen Research Institute of Big Data
Prompt-based medical vision-language pretraining that unifies fusion-encoder and dual-encoder architectures, handling image-only, text-only, and image-text inputs in one model.
Medical vision-and-language pretraining (Med-VLP) learns generic representations from large collections of medical images paired with their textual reports, then transfers them to downstream tasks such as visual question answering, report generation, and cross-modal retrieval. Before PTUnifier, the field was split between two incompatible architectural families: fusion-encoder models that jointly attend over image-text pairs (strong on multimodal reasoning, but requiring paired inputs) and dual-encoder models that encode each modality separately (efficient for retrieval and uni-modal tasks, but weaker at fine-grained fusion). A model trained as one type could not serve the task formats native to the other.
PTUnifier, introduced by Zhihong Chen and colleagues at the Chinese University of Hong Kong, Shenzhen; Sun Yat-sen University; and the Shenzhen Research Institute of Big Data, unifies these two paradigms within a single pretrained model. Published at ICCV 2023 (preprint February 2023), it adds learnable visual and textual prompts that act as a feature bank storing the most representative images and texts. When an input lacks one modality, the corresponding prompts stand in for the missing image or text, so the same network processes image-only, text-only, and image-text-pair inputs without architectural changes.
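The stand-in behaviour described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_pool`, `select_prompts`, `build_inputs`), the mean-pooled query, the dot-product scoring, and the top-k selection rule are all assumptions made for illustration, and the pools are fixed lists here, whereas in a trained model they would be learnable parameters.

```python
def mean_pool(tokens):
    """Average a list of equal-length token vectors into one query vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_prompts(query, pool, k):
    """Return the k pool vectors scoring highest against the query.

    Assumption: similarity is a plain dot product and selection is top-k.
    """
    ranked = sorted(pool, key=lambda prompt: dot(query, prompt), reverse=True)
    return ranked[:k]


def build_inputs(image_tokens, text_tokens, visual_pool, textual_pool, k=2):
    """Fill a missing modality with prompts chosen from the matching pool.

    Pass None for the modality that is absent. The present modality is
    pooled into a query, which picks stand-in prompts for the absent one.
    """
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality must be provided")
    if image_tokens is None:
        # Text-only input: visual prompts stand in for the missing image.
        image_tokens = select_prompts(mean_pool(text_tokens), visual_pool, k)
    elif text_tokens is None:
        # Image-only input: textual prompts stand in for the missing text.
        text_tokens = select_prompts(mean_pool(image_tokens), textual_pool, k)
    return image_tokens, text_tokens
```

With a text-only input, `build_inputs(None, text_tokens, visual_pool, textual_pool)` returns a pair of sequences that a downstream fusion module could consume exactly as it would a real image-text pair, which is the property that lets one network cover all three input formats.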
By making prompts the bridge between encoder types, PTUnifier turns what had been a hard design choice into a single configurable model that covers uni-modal, cross-modal, and multimodal medical tasks.
PTUnifier builds on a METER-style transformer backbone that combines a vision transformer image encoder and a text encoder with a multimodal fusion module. It is pretrained with standard Med-VLP objectives, including masked language modeling, image-text matching, and image-text contrastive learning, on radiology-focused image-text corpora: ROCO, MIMIC-CXR, and MedICaT. Evaluation spans uni-modal tasks (multi-label classification on CheXpert, RSNA Pneumonia classification, RadNLI, report summarization), cross-modal tasks (ROCO retrieval and MIMIC-CXR report generation), and multimodal visual question answering on VQA-RAD, SLAKE, and MedVQA-2019. Across this suite, the unified model reported results ranging from competitive to state-of-the-art at the time of publication. Pretrained checkpoints and training/fine-tuning code are released in the official repository.
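Of the objectives listed above, image-text contrastive learning is the one that most directly supports dual-encoder use. The sketch below shows a generic symmetric InfoNCE-style loss over a batch of paired embeddings. It is a textbook formulation rather than the PTUnifier training code: the function names, the cosine similarity, the temperature value of 0.07, and the equal weighting of the two directions are assumptions of this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i]).

    Each image should score highest against its own text and vice versa;
    every other item in the batch serves as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[r][c]: temperature-scaled cosine similarity of image r and text c.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    image_to_text = sum(cross_entropy_row(sim[r], r) for r in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

A batch whose image and text embeddings are correctly aligned yields a loss near zero, while the same batch with the texts shuffled yields a much larger value, which is the signal that pulls matching image-report pairs together during pretraining.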
PTUnifier is aimed at researchers building clinical AI systems over radiology data, especially chest X-rays and broader radiology image-report collections. Because one pretrained model serves retrieval, report generation, summarization, classification, and VQA, teams can fine-tune a single backbone across an entire medical imaging workflow instead of maintaining separate fusion- and dual-encoder models. This is useful for prototyping diagnostic assistants, report-drafting tools, and image search over institutional archives.
PTUnifier demonstrated that the long-standing divide between fusion-encoder and dual-encoder models in medical vision-language pretraining could be resolved with a prompt bank rather than a new architecture. Its ICCV 2023 publication and public code made it a reference point for subsequent unified Med-VLP work, such as MedUnifier, which adds vision-generation objectives. Its main limitations mirror those of the field: pretraining is concentrated on radiology and chest X-ray data, MIMIC-CXR access requires PhysioNet credentialing, and performance on modalities outside the training distribution is not guaranteed.
Chen, Z., et al. (2023) Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. IEEE International Conference on Computer Vision. DOI: 10.1109/ICCV51070.2023.02139
Chen, Z., et al. (2023) Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. arXiv preprint. DOI: 10.48550/arXiv.2302.08958