ARL (Align, Reason and Learn)

Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen / Sun Yat-sen University

Medical vision-language pretraining framework that injects structured medical knowledge into radiology image-text learning for VQA and retrieval.

Released: September 2022

ARL (Align, Reason and Learn) is a medical vision-and-language pre-training (Med-VLP) framework that explicitly incorporates structured medical knowledge into the learning of joint image-text representations for radiology. It was introduced by Zhihong Chen, Guanbin Li, and Xiang Wan in the paper "Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge," presented at the 30th ACM International Conference on Multimedia (ACM MM 2022).

General-domain vision-and-language models such as ViLT and METER learn alignments between images and text from large web corpora, but they lack the specialized domain expertise that clinicians bring to interpreting medical images. ARL addresses this gap by treating curated medical knowledge as an intermediate medium that connects the visual and linguistic modalities. Rather than relying solely on co-occurrence statistics in image-report pairs, the framework uses a medical knowledge base in three ways: to align the two modalities around shared entities (for example, anatomical structures and findings), to reason over multi-modal evidence, and to construct pretext tasks that emphasize clinically salient information.

The framework is pre-trained on radiology image-text data and then fine-tuned on a benchmark of downstream tasks that the authors assembled; they report state-of-the-art results on all of these tasks at the time of publication. ARL sits in the lineage of knowledge-augmented medical multimodal models and has served as a reference baseline for subsequent Med-VLP work.

Key Features

Knowledge-guided alignment: Uses structured medical knowledge as the intermediate medium between vision and language, aligning uni-modal encoder representations around shared clinical entities rather than relying only on raw image-text correspondence.
Reasoning-enhanced fusion: Augments the multi-modal fusion module with knowledge so it can reason over combined visual and textual evidence, improving performance on tasks that require clinical inference.
Knowledge-driven pretext tasks: Designs pre-training objectives that direct the model's attention toward the most diagnostically important information in images and reports.
Unified Med-VLP benchmark: Introduces a medical vision-and-language benchmark spanning three downstream task types to standardize evaluation of pre-trained models.
Open implementation: Code and pre-training configuration are released publicly, building on the OpenKE, ViLT, METER, and MAE codebases.
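
The knowledge-driven pretext tasks listed above can be illustrated with a small sketch of entity-aware masking for masked language modeling. This is not the authors' implementation: the entity list, the function names, and the masking ratio are all hypothetical, and a real system would link tokens to a medical knowledge base rather than match a hard-coded set of words.

```python
import random

# Hypothetical stand-in for entity surface forms drawn from a knowledge base.
ENTITY_TERMS = {"effusion", "cardiomegaly", "pneumothorax", "lung", "heart"}


def choose_mask_positions(tokens, mask_ratio=0.15, entity_terms=ENTITY_TERMS, rng=None):
    """Pick token indices to mask, preferring tokens that match known entities."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    n_mask = max(1, round(len(tokens) * mask_ratio))
    entity_idx = [i for i, t in enumerate(tokens) if t.lower() in entity_terms]
    other_idx = [i for i, t in enumerate(tokens) if t.lower() not in entity_terms]
    rng.shuffle(entity_idx)
    rng.shuffle(other_idx)
    # Entity positions come first, so they are masked before ordinary words.
    return sorted((entity_idx + other_idx)[:n_mask])


def apply_mask(tokens, positions, mask_token="[MASK]"):
    """Return a copy of tokens with the chosen positions replaced by mask_token."""
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked


tokens = "small left pleural effusion with mild cardiomegaly".split()
positions = choose_mask_positions(tokens, mask_ratio=0.3)
masked = apply_mask(tokens, positions)
print(masked)
```

With seven tokens and a ratio of 0.3, two positions are masked, and both fall on the entity words because entities are ranked ahead of ordinary tokens. The same idea could in principle be applied to image patches, but that is beyond this sketch.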

Technical Details

ARL follows a dual-encoder plus fusion architecture: separate transformer-based uni-modal encoders process the radiology image and the associated text, and a multi-modal fusion transformer integrates the two streams. A medical knowledge component enhances each of these stages. The implementation builds on the METER framework and initializes from METER weights, with knowledge extraction drawing on OpenKE.

Pre-training data are drawn from radiology image-text corpora including ROCO, MedICaT, and MIMIC-CXR (with associated metadata and CheXpert labels).

The model is evaluated on downstream tasks including medical visual question answering (VQA-RAD, SLAKE, and MedVQA-2019) and medical image classification (MELINDA), as well as image-text retrieval. The authors report state-of-the-art results across the benchmark relative to prior Med-VLP baselines. The released repository is implemented primarily in Python and provides scripts for both pre-training and fine-tuning.
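
OpenKE, mentioned above, is a toolkit for knowledge graph embedding, and TransE is one of the classic models it provides. This page does not say which embedding model ARL uses, so the following is only a sketch under the assumption of a TransE-style scorer. The entity names, relation name, and two-dimensional vectors are invented for illustration.

```python
import math


def transe_score(head, relation, tail):
    """Return the L2 distance ||h + r - t||; lower means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy 2-d embeddings, invented for illustration only.
entity_emb = {
    "pneumonia": [1.0, 0.0],
    "lung": [1.0, 1.0],
    "femur": [-2.0, 3.0],
}
relation_emb = {"finding_site": [0.0, 1.0]}

plausible = transe_score(entity_emb["pneumonia"], relation_emb["finding_site"], entity_emb["lung"])
implausible = transe_score(entity_emb["pneumonia"], relation_emb["finding_site"], entity_emb["femur"])
print(plausible, implausible)
```

In a trained model the vectors are learned so that true triples receive small distances. The resulting entity embeddings are the kind of signal a knowledge-enhanced encoder can consume alongside image and text features.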

Applications

ARL targets radiology workflows where image understanding must be coupled with clinical text. Its primary applications are medical visual question answering, where a model answers natural-language questions about a radiograph, and medical image classification, where reports and images jointly inform diagnostic labels. Such capabilities are useful for clinical decision support, automated report analysis, education, and as a pre-trained backbone that downstream researchers can fine-tune for specialized radiology tasks with limited labeled data.
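
Medical VQA results on datasets such as VQA-RAD and SLAKE are commonly broken down into closed-ended (for example, yes/no) and open-ended questions. The sketch below shows one way to compute such a breakdown, assuming exact-match scoring after lowercasing. The record fields (`type`, `answer`, `prediction`) are hypothetical and do not come from the ARL repository.

```python
from collections import defaultdict


def vqa_accuracy(records):
    """Compute exact-match accuracy overall and per question type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        correct = rec["prediction"].strip().lower() == rec["answer"].strip().lower()
        # Every record counts toward "overall" and toward its own question type.
        for key in ("overall", rec["type"]):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up predictions, for illustration only.
records = [
    {"type": "closed", "answer": "yes", "prediction": "Yes"},
    {"type": "closed", "answer": "no", "prediction": "yes"},
    {"type": "closed", "answer": "no", "prediction": "no"},
    {"type": "open", "answer": "left lung", "prediction": "left lung"},
    {"type": "open", "answer": "pleural effusion", "prediction": "pneumothorax"},
]
scores = vqa_accuracy(records)
print(scores)
```

Reporting the two question types separately matters because closed-ended questions have a small answer space, so a single overall number can hide weak open-ended performance.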

Impact

By demonstrating that explicitly injecting structured medical knowledge improves vision-and-language pre-training, ARL helped establish knowledge enhancement as a recurring theme in subsequent Med-VLP research and served as a competitive baseline for later models such as soft-prompt-based unified Med-VLP approaches. As an ACM MM 2022 paper with public code, it lowered the barrier for reproducing and extending knowledge-augmented medical multimodal pre-training. Its scope is limited to the radiology image-text domain and to the specific benchmarks evaluated, so results may not transfer directly to other imaging modalities or to settings without curated knowledge bases.

Citation

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Chen, Z., Li, G., and Wan, X. (2022). Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM 2022).

DOI: 10.1145/3503161.3547948

Recent citations

Papers that recently cited this model.

Research on the application of LLaVA model based on QLoRA fine-tuning in medical teaching
Shiling Zhou, Fengmei Qin
PLoS ONE · Jul 2026
Citations: 0

Medical Vision-Language Models: Existing Technologies, Clinical Applications and Future Directions
Le Zou, Mengyu Ma, Jun Li, et al.
Italian National Conference on Sensors · Jun 2026
Citations: 0

Disease-Centric Vision-Language Pretraining with Hybrid Visual Encoding for 3D Computed Tomography
Bowen Shi, Weiwei Cao, Ruifeng Yuan, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He, Rui Mao, Qika Lin, et al.
Information Fusion · Oct 2023
Citations: 328

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Mar 2023
Citations: 313

Knowledge-enhanced visual-language pre-training on chest radiology images
Xiaoman Zhang, Chaoyi Wu, Ya Zhang, et al.
Nature Communications · Feb 2023
Citations: 245

Pre-trained Language Models in Biomedical Domain: A Systematic Survey
Benyou Wang, Qianqian Xie, Jiahuan Pei, et al.
ACM Computing Surveys · Oct 2021
Citations: 238

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, et al.
IEEE International Conference on Computer Vision · Jan 2023
Citations: 224

Citations

Total Citations: 113

Influential: 7

References: 61

GitHub

Stars: 38

Forks: 2

Open Issues: 6

Contributors: 1

Last Push: 3y ago

Language: Python

Fields of citing research

Computer Science: 100%
Medicine: 88%
Engineering: 12%
Linguistics: 4%
Physics: 1%
Agricultural and Food Sciences: 1%
Biology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 29 (Closed)

Usability (can I run it?): 29

Reproducibility (can I retrain it?): 15

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository
Research Paper
