Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University
Knowledge-enhanced medical vision-and-language pre-training framework that aligns, reasons over, and learns from structured medical knowledge for radiology image-text tasks.
ARL (Align, Reason and Learn) is a medical vision-and-language pre-training (Med-VLP) framework that explicitly incorporates structured medical knowledge into the learning of joint image-text representations for radiology. It was introduced by Zhihong Chen, Guanbin Li, and Xiang Wan in the paper "Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge," presented at the 30th ACM International Conference on Multimedia (ACM MM 2022).
General-domain vision-and-language models such as ViLT and METER learn alignments between images and text from large web corpora, but they lack the specialized domain expertise that clinicians bring to interpreting medical images. ARL addresses this gap by treating curated medical knowledge as an intermediate medium that connects the visual and linguistic modalities. Rather than relying solely on co-occurrence statistics in image-report pairs, the framework draws on a medical knowledge base in three ways: to align entities (for example, anatomical structures and findings) across the two modalities, to reason over multi-modal evidence, and to guide pretext tasks that emphasize clinically salient information.
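As an illustration of a pretext task that emphasizes clinically salient information, the sketch below masks report tokens for masked language modeling and picks positions inside medical-entity spans before any others. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `knowledge_guided_mask`, the half-open `entity_spans` format, and the idea that an upstream entity linker supplies those spans are all hypothetical.

```python
import random


def knowledge_guided_mask(tokens, entity_spans, mask_ratio=0.15,
                          mask_token="[MASK]", seed=0):
    """Mask tokens, preferring positions that fall inside entity spans.

    tokens:       list of str (a tokenized report sentence)
    entity_spans: list of (start, end) half-open index pairs; assumed to come
                  from a hypothetical entity linker over a knowledge base
    Returns (masked_tokens, masked_positions).
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_ratio))

    # Positions covered by at least one entity span, then all remaining ones.
    entity_set = {i for start, end in entity_spans for i in range(start, end)}
    entity_positions = sorted(entity_set)
    other_positions = [i for i in range(len(tokens)) if i not in entity_set]

    # Shuffle within each group, but keep entity positions ahead of the rest.
    rng.shuffle(entity_positions)
    rng.shuffle(other_positions)
    chosen = sorted((entity_positions + other_positions)[:n_to_mask])

    masked = list(tokens)
    for i in chosen:
        masked[i] = mask_token
    return masked, chosen


report = "there is a small pleural effusion in the left lung".split()
spans = [(4, 6), (8, 10)]  # "pleural effusion", "left lung"
masked, positions = knowledge_guided_mask(report, spans, mask_ratio=0.3)
print(masked)
print(positions)
```

With a budget of three masks and four entity tokens available, every masked position falls on an entity token. Non-entity tokens are masked only once the entity positions are exhausted.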
The framework is pre-trained on radiology image-text data and then fine-tuned on a benchmark of downstream tasks assembled by the authors, who report state-of-the-art results on all of them at the time of publication. ARL sits in the lineage of knowledge-augmented medical multimodal models and has served as a reference baseline for subsequent Med-VLP work.
ARL follows a dual-encoder plus fusion architecture: separate transformer-based uni-modal encoders process the radiology image and the associated text, and a multi-modal fusion transformer integrates the two streams, all enhanced with a medical knowledge component. The implementation builds on the METER framework and initializes from METER weights, with knowledge extraction drawing on OpenKE. Pre-training data are drawn from radiology image-text corpora including ROCO, MedICaT, and MIMIC-CXR (with associated metadata and CheXpert labels). The model is evaluated on downstream tasks including medical visual question answering (VQA-RAD, SLAKE, and MedVQA-2019) and medical image classification (MELINDA), where the authors report state-of-the-art accuracy across the benchmark relative to prior Med-VLP baselines. The released repository is implemented primarily in Python and provides scripts for both pre-training and fine-tuning.
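The data flow of a dual-encoder plus fusion design can be sketched in a few lines of NumPy. This is a toy sketch under stated assumptions, not the released model: random projections stand in for the trained uni-modal encoders, fusion is a single head of cross-attention rather than a transformer stack, and the knowledge component is modeled by letting text tokens attend to entity embeddings as extra context, which is one plausible reading of the description above. All names (`encode_image`, `encode_text`, `fuse`) and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy hidden size


def linear(x, out_dim):
    """Random projection standing in for a trained layer."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head attention: each query row attends over the context rows."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context


def encode_image(patches):
    return linear(patches, DIM)  # stand-in for the vision encoder


def encode_text(token_embeddings):
    return linear(token_embeddings, DIM)  # stand-in for the language encoder


def fuse(image_feats, text_feats, entity_feats):
    # Assumption: knowledge enters fusion as extra context for the text stream.
    text_context = np.concatenate([image_feats, entity_feats], axis=0)
    fused_text = text_feats + cross_attention(text_feats, text_context)
    fused_image = image_feats + cross_attention(image_feats, text_feats)
    return fused_image, fused_text


patches = rng.standard_normal((9, 48))     # 9 flattened image patches
tokens = rng.standard_normal((6, 32))      # 6 report-token embeddings
entities = rng.standard_normal((3, DIM))   # 3 knowledge-entity embeddings

fused_image, fused_text = fuse(encode_image(patches), encode_text(tokens), entities)

# Mean-pool each stream and concatenate into one joint representation,
# which a downstream head (for example, a VQA classifier) could consume.
pooled = np.concatenate([fused_image.mean(axis=0), fused_text.mean(axis=0)])
print(fused_image.shape, fused_text.shape, pooled.shape)
```

Each stream keeps its own sequence length after fusion, so the joint vector has a fixed size regardless of how many patches or tokens the input contains.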
ARL targets radiology workflows where image understanding must be coupled with clinical text. Its primary applications are medical visual question answering, where a model answers natural-language questions about a radiograph, and medical image classification, where reports and images jointly inform diagnostic labels. Such capabilities are useful for clinical decision support, automated report analysis, education, and as a pre-trained backbone that downstream researchers can fine-tune for specialized radiology tasks with limited labeled data.
By demonstrating that explicitly injecting structured medical knowledge improves vision-and-language pre-training, ARL helped establish knowledge enhancement as a recurring theme in subsequent Med-VLP research and served as a competitive baseline for later models such as soft-prompt-based unified Med-VLP approaches. As an ACM MM 2022 paper with public code, it lowered the barrier for reproducing and extending knowledge-augmented medical multimodal pre-training. Its scope is limited to the radiology image-text domain and to the specific benchmarks evaluated, so results may not transfer directly to other imaging modalities or to settings without curated knowledge bases.
Chen, Z., Li, G., and Wan, X. (2022). Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM 2022).
DOI: 10.1145/3503161.3547948