Presented by its authors as the first Chinese medical large vision-language model, pairing a pretrained ViT with an LLM to interpret medical images and answer clinical questions in Chinese.
Qilin-Med-VL is, according to its authors, the first Chinese large vision-language model (VLM) built specifically for general healthcare. It was released in October 2023 by Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua, a team led from Alibaba Group with academic collaborators. The model targets a notable gap: most medical multimodal models are English-centric, leaving Chinese-language clinical and biomedical imagery poorly served. By jointly reasoning over a medical image and a Chinese-language prompt, it can generate descriptive captions and answer free-form clinical questions about the visual content.
The model follows the LLaVA recipe for connecting vision and language: a pretrained Vision Transformer (ViT) image encoder is bridged to a foundational large language model through a learned projection, so that visual tokens become inputs the LLM can attend to alongside text. Rather than training from scratch, Qilin-Med-VL adapts these pretrained components to the medical domain via curriculum-style fine-tuning, making it relatively economical to produce while still covering diverse image types found in the biomedical literature.
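As a rough illustration of the bridging step described above, the following sketch projects toy visual tokens into a toy LLM embedding space and places them in front of the text embeddings. It assumes a single linear map as the connector and an image-before-text ordering; the dimensions, function names, and random weights are hypothetical and are not taken from the Qilin-Med-VL code.

```python
import random

VISION_DIM = 8   # toy size (assumption); real ViT features are much wider
LLM_DIM = 12     # toy size (assumption); real LLM embeddings are much wider


def make_projection(vision_dim, llm_dim, seed=0):
    """Random stand-in for learned projection weights (vision_dim x llm_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(llm_dim)]
            for _ in range(vision_dim)]


def project(visual_tokens, weight):
    """Map each visual token from vision space into the LLM embedding space."""
    columns = list(zip(*weight))  # columns of the weight matrix
    return [[sum(v * w for v, w in zip(token, col)) for col in columns]
            for token in visual_tokens]


def build_llm_input(visual_tokens, text_embeddings, weight):
    """Concatenate projected image tokens with text embeddings.

    Image tokens come first here by assumption, not by specification.
    """
    return project(visual_tokens, weight) + text_embeddings


rng = random.Random(1)
visual_tokens = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]  # 4 image patches
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]   # 3 prompt tokens
weight = make_projection(VISION_DIM, LLM_DIM)

sequence = build_llm_input(visual_tokens, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # 7 rows, each of width LLM_DIM
```

After projection, every row has the same width, so the language model can attend over image and text positions uniformly. In a real model the projection weights would be learned rather than sampled.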
Alongside the model, the team released ChiMed-VL, a large Chinese medical image-text corpus that doubles as the training resource and a benchmark for future Chinese medical VLMs. Together they form part of the broader Qilin-Med family of Chinese medical AI models from the same group.
Qilin-Med-VL is a vision-language transformer assembled in the LLaVA paradigm: a
pretrained Vision Transformer encodes the input image into visual tokens, a
trainable projection maps those tokens into the language model's embedding space,
and a foundational LLM consumes the combined image and Chinese-text sequence to
generate responses. Training proceeds in two stages: feature alignment, which tunes
the cross-modal connector so visual and textual features are compatible, followed by
instruction tuning, which adapts the model to follow medical questions and
multi-turn prompts. The ChiMed-VL dataset underpins both stages and is organized
into an alignment split of 580,014 context-and-description image-text pairs and an
instruction split of 469,441 question-answer pairs, for 1,049,455 pairs in
total; much of the imagery is drawn from PubMed Central figures with Chinese
captions and translations. Weights are released as Qilin-Med-VL (base) and
Qilin-Med-VL-Chat checkpoints on Hugging Face. The repositories ship only
minimal model and dataset cards, so architectural specifics and quantitative
benchmark scores are documented primarily in the paper rather than in standalone
cards.
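The two-stage schedule and the dataset split sizes described above can be summarized in a small configuration sketch. The component names and the choice of which parts are trainable during instruction tuning are assumptions borrowed from the general LLaVA recipe, not details confirmed here. Only the connector being tuned during alignment and the pair counts come from the description above.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts of the model.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    dataset_split: str
    num_pairs: int
    trainable: frozenset  # components updated in this stage


STAGES = (
    # Stage 1: only the cross-modal connector is tuned.
    Stage("feature_alignment", "alignment", 580_014, frozenset({"projector"})),
    # Stage 2: unfreezing the LLM as well is an assumption (LLaVA-style),
    # not something stated in the description above.
    Stage("instruction_tuning", "instruction", 469_441,
          frozenset({"projector", "llm"})),
)


def frozen_components(stage):
    """Return the components that stay fixed during the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


total_pairs = sum(stage.num_pairs for stage in STAGES)

for stage in STAGES:
    print(stage.name, "frozen:", frozen_components(stage))
print("total pairs:", total_pairs)  # 580,014 + 469,441 = 1,049,455
```

Keeping the schedule as data makes it easy to check which components receive gradients in each stage before wiring it into an actual training loop.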
Qilin-Med-VL is aimed at Chinese-language clinical and research settings where a practitioner or researcher wants to query a medical image in natural language: generating draft captions for radiology or pathology figures, answering exam-style or descriptive visual questions, and supporting medical education and literature understanding. Because the weights and the ChiMed-VL corpus are openly available, it also serves as a base for researchers building or benchmarking Chinese medical multimodal assistants. As with all such systems, outputs are research artifacts and require expert validation before any clinical use.
Qilin-Med-VL helped establish Chinese medical vision-language modeling as a distinct research direction, and its accompanying ChiMed-VL dataset has become a reusable resource for training and evaluating later Chinese medical VLMs. By demonstrating that LLaVA-style adaptation of pretrained vision and language backbones can be extended to a new language and the medical domain with an openly released corpus, it lowered the barrier for non-English medical multimodal research. Its main limitations are the sparse model and dataset documentation and the dependence on literature-derived images, which constrain coverage of real clinical workflows.
Liu, J., et al. (2023) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare. arXiv.org.
DOI: 10.48550/arXiv.2310.17956