Qilin-Med-VL

Chinese medical vision-language model pairing a Vision Transformer with an LLM to caption medical images and answer clinical questions in Chinese.

Released: October 2023

Qilin-Med-VL is, according to its authors, the first Chinese large vision-language model (VLM) built specifically for general healthcare. It was released in October 2023 by Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua, a team led from Alibaba Group with academic collaborators. It targets a specific gap: most medical multimodal models are English-centric, leaving Chinese-language clinical and biomedical imagery poorly served. By jointly reasoning over a medical image and a Chinese-language prompt, the model can generate descriptive captions and answer free-form clinical questions about the visual content.

The model follows the LLaVA recipe for connecting vision and language: a pretrained Vision Transformer (ViT) image encoder is bridged to a foundational large language model through a learned projection, so that visual tokens become inputs the LLM can attend to alongside text. Rather than training from scratch, Qilin-Med-VL adapts these pretrained components to the medical domain via curriculum-style fine-tuning, making it relatively economical to produce while still covering diverse image types found in the biomedical literature.
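As a rough illustration of the connector idea described above, the sketch below projects visual tokens into the language model's embedding width and prepends them to the text embeddings. It is a toy, pure-Python sketch under stated assumptions: the function names, the single linear layer, and the tiny dimensions are hypothetical and are not taken from the Qilin-Med-VL code base.

```python
# Toy sketch of a LLaVA-style connector (illustrative only; names and
# dimensions are hypothetical, not taken from the Qilin-Med-VL code).

def project_visual_tokens(visual_tokens, weight, bias):
    """Apply a linear projection to every visual token.

    visual_tokens: list of vectors, each of length d_vision
    weight:        list of d_model rows, each of length d_vision
    bias:          list of length d_model
    Returns a list of vectors, each of length d_model.
    """
    projected = []
    for token in visual_tokens:
        row = [
            sum(t * w for t, w in zip(token, weight_row)) + b
            for weight_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Prepend projected visual tokens to the text embeddings so the
    language model can attend to both in a single sequence."""
    return project_visual_tokens(visual_tokens, weight, bias) + text_embeddings


if __name__ == "__main__":
    # Two image patches with d_vision = 2, projected to d_model = 3.
    patches = [[1.0, 2.0], [3.0, 4.0]]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 1.0]
    prompt = [[0.5, 0.5, 0.5]]  # one text-token embedding, d_model = 3
    sequence = build_llm_input(patches, prompt, weight, bias)
    print(len(sequence), "tokens of width", len(sequence[0]))
```

In a real system the projection would be a learned layer operating on tensors; the point here is only that the connector changes the width of each visual token so the combined sequence is uniform.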

Alongside the model, the team released ChiMed-VL, a large Chinese medical image-text corpus that doubles as the training resource and a benchmark for future Chinese medical VLMs. Together they form part of the broader Qilin-Med family of Chinese medical AI models from the same group.

Key Features

Chinese-first medical multimodality: Designed to interpret medical images and respond to questions in Chinese, addressing the scarcity of non-English medical vision-language systems.
LLaVA-style architecture: Couples a pretrained ViT image encoder with a foundational LLM through a projection layer, reusing strong unimodal backbones instead of training a multimodal model from scratch.
Two-stage curriculum training: A feature-alignment stage first connects the vision and language representations, followed by an instruction-tuning stage that teaches the model to follow clinical prompts and hold dialogue.
ChiMed-VL dataset: A curated corpus of more than 1M image-text pairs spanning radiology, pathology, and other biomedical image types, released to support reproducibility and benchmarking.
Open weights: Base and chat checkpoints are distributed on Hugging Face under an Apache 2.0 license, enabling downstream research and adaptation.

Technical Details

Qilin-Med-VL is a vision-language transformer assembled in the LLaVA paradigm: a pretrained Vision Transformer encodes the input image into visual tokens, a trainable projection maps those tokens into the language model's embedding space, and a foundational LLM consumes the combined image and Chinese-text sequence to generate responses.

Training proceeds in two stages. Feature alignment tunes the cross-modal connector so that visual and textual features are compatible; instruction tuning then adapts the model to follow medical questions and multi-turn prompts. The ChiMed-VL dataset underpins both stages. It is organized into an alignment split of 580,014 context-and-description image-text pairs and an instruction split of 469,441 question-answer pairs, for 1,049,455 pairs in total. Much of the imagery is drawn from PubMed Central figures with Chinese captions and translations.

Weights are released as Qilin-Med-VL (base) and Qilin-Med-VL-Chat checkpoints on Hugging Face. The repositories ship only minimal model and dataset cards, so architectural specifics and quantitative benchmark scores are documented primarily in the paper rather than in standalone cards.
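The two-stage schedule and the ChiMed-VL split sizes given in this section can be summarized in a small configuration sketch. The split sizes come from the text; the `Stage` class, the module names, and the choice of which modules are trainable in the second stage are assumptions that follow the general LLaVA recipe, since the text here only states that the first stage tunes the cross-modal connector.

```python
# Illustrative training-schedule sketch. Split sizes are from the text above;
# class and module names are hypothetical, and the stage-2 trainable set is an
# assumption borrowed from the general LLaVA recipe.
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    num_pairs: int
    trainable: tuple  # names of modules whose weights are updated


ALIGNMENT = Stage(
    name="feature_alignment",
    num_pairs=580_014,           # context-and-description image-text pairs
    trainable=("projector",),    # only the cross-modal connector is tuned
)

INSTRUCTION = Stage(
    name="instruction_tuning",
    num_pairs=469_441,                 # question-answer pairs
    trainable=("projector", "llm"),    # assumption, not confirmed by the source
)

SCHEDULE = (ALIGNMENT, INSTRUCTION)


def is_trainable(stage, module):
    """Return True if `module` would receive gradient updates in `stage`."""
    return module in stage.trainable


def total_pairs(stages):
    """Sum the number of image-text pairs across all stages."""
    return sum(stage.num_pairs for stage in stages)


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(stage.name, stage.num_pairs, stage.trainable)
    print("total pairs:", total_pairs(SCHEDULE))
```

Keeping the vision encoder out of both trainable sets reflects the common practice of freezing it; whether Qilin-Med-VL does so is best checked against the paper.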

Applications

Qilin-Med-VL is aimed at Chinese-language clinical and research settings where a practitioner or researcher wants to query a medical image in natural language: generating draft captions for radiology or pathology figures, answering exam-style or descriptive visual questions, and supporting medical education and literature understanding. Because the weights and the ChiMed-VL corpus are openly available, it also serves as a base for researchers building or benchmarking Chinese medical multimodal assistants. As with all such systems, outputs are research artifacts and require expert validation before any clinical use.

Impact

Qilin-Med-VL helped establish Chinese medical vision-language modeling as a distinct research direction, and its accompanying ChiMed-VL dataset has become a reusable resource for training and evaluating later Chinese medical VLMs. By demonstrating that LLaVA-style adaptation of pretrained vision and language backbones can be extended to a new language and the medical domain with an openly released corpus, it lowered the barrier for non-English medical multimodal research. Its main limitations are the sparse model and dataset documentation and the dependence on literature-derived images, which constrain coverage of real clinical workflows.

Citation

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Preprint

Liu, J., et al. (2023) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare. arXiv.org.

DOI: 10.48550/arXiv.2310.17956

Recent citations

Papers that recently cited this model.

Benchmarking multimodal large language models for medicinal plant identification
Yue Jiang, Zhenzhong Dai, Wen Jin, et al.
Frontiers in Plant Science · Jun 2026
Cited by: 0

Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Yifan Gao, Tao Zhou, Yi Zhou, et al.
Apr 2026
Cited by: 0

LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration
Gökçe İnal, Pouyan Navard, Alper Yilmaz
Mar 2026
Cited by: 0

Top citations

The most-cited papers that cite this model.

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, et al.
arXiv.org · May 2023
Cited by: 347

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Hongjian Zhou, Boyang Gu, Xinyu Zou, et al.
arXiv.org · Nov 2023
Cited by: 234

The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni, Federico Cocchi, Luca Barsellotti, et al.
Annual Meeting of the Association for Computational Linguistics · Feb 2024
Cited by: 185

Benchmarking Large Language Models on CMExam - A Comprehensive Chinese Medical Exam Dataset
Junling Liu, Peilin Zhou, Y. Hua, et al.
Neural Information Processing Systems · Jun 2023
Cited by: 136

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024
Cited by: 134

Citations

Total Citations: 82
Influential: 6
References: 31

GitHub

Stars: 65
Forks: 10
Open Issues: 7
Contributors: 2
Last Push: 2y ago
Language: Python

HuggingFace

Downloads: 6
Likes: 1
Last Modified: 2y ago
Pipeline: text-generation

Fields of citing research

Computer Science: 100%
Medicine: 71%
Engineering: 7%
Linguistics: 6%
Environmental Science: 4%
Physics: 2%
Business: 1%
Biology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 44 (Partial)
Usability (can I run it?): 47
Reproducibility (can I retrain it?): 42

Model Openness Framework: Unclassified
Restrictive license on core components

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset
