bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Language model, Pathology

Qilin-Med-VL

Alibaba Group

The first Chinese medical large vision-language model, pairing a pretrained ViT with an LLM to interpret medical images and answer clinical questions in Chinese.

Released: October 2023

Qilin-Med-VL is, according to its authors, the first Chinese large vision-language model (VLM) built specifically for general healthcare. It was released in October 2023 by Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua, a team led from Alibaba Group with academic collaborators. It targets a specific gap: most medical multimodal models are English-centric, leaving Chinese-language clinical and biomedical imagery poorly served. By jointly reasoning over a medical image and a Chinese-language prompt, the model can generate descriptive captions and answer free-form clinical questions about the visual content.

The model follows the LLaVA recipe for connecting vision and language: a pretrained Vision Transformer (ViT) image encoder is bridged to a foundational large language model through a learned projection, so that visual tokens become inputs the LLM can attend to alongside text. Rather than training from scratch, Qilin-Med-VL adapts these pretrained components to the medical domain via curriculum-style fine-tuning, making it relatively economical to produce while still covering diverse image types found in the biomedical literature.
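The bridging step described above can be pictured with a small, dependency-free sketch. It is illustrative only and not the authors' code: the toy dimensions, the random features, the single linear layer used as the connector, and the helper names `linear_project` and `build_llm_inputs` are all assumptions made for this example.

```python
import random


def linear_project(features, weight, bias):
    """Map each row of `features` (n x d_in) to d_out dimensions.

    `weight` is a d_in x d_out matrix stored as a list of rows.
    """
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in features
    ]


def build_llm_inputs(visual_tokens, text_embeddings):
    """Place the projected visual tokens ahead of the text embeddings."""
    return visual_tokens + text_embeddings


# Toy sizes chosen only for illustration (hypothetical, not the model's real sizes).
NUM_PATCHES, VISION_DIM, LLM_DIM, NUM_TEXT_TOKENS = 4, 6, 8, 3

rng = random.Random(0)
# Stand-in for the ViT output: one feature vector per image patch.
patch_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
# Stand-in for the learned projection parameters.
weight = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
bias = [0.0] * LLM_DIM
# Stand-in for the embedded Chinese-language prompt.
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

visual_tokens = linear_project(patch_features, weight, bias)
llm_inputs = build_llm_inputs(visual_tokens, text_embeddings)
print(len(llm_inputs), len(llm_inputs[0]))  # 7 8
```

After projection the visual tokens have the same width as the text embeddings, which is what lets the language model attend to both in one sequence.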

Alongside the model, the team released ChiMed-VL, a large Chinese medical image-text corpus that doubles as the training resource and a benchmark for future Chinese medical VLMs. Together they form part of the broader Qilin-Med family of Chinese medical AI models from the same group.

#Key Features

  • Chinese-first medical multimodality: Designed to interpret medical images and respond to questions in Chinese, addressing the scarcity of non-English medical vision-language systems.
  • LLaVA-style architecture: Couples a pretrained ViT image encoder with a foundational LLM through a projection layer, reusing strong unimodal backbones instead of training a multimodal model from scratch.
  • Two-stage curriculum training: A feature-alignment stage first connects the vision and language representations, followed by an instruction-tuning stage that teaches the model to follow clinical prompts and hold dialogue.
  • ChiMed-VL dataset: A curated corpus of more than 1M image-text pairs spanning radiology, pathology, and other biomedical image types, released to support reproducibility and benchmarking.
  • Open weights: Base and chat checkpoints are distributed on Hugging Face under an Apache 2.0 license, enabling downstream research and adaptation.
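The two-stage curriculum listed above can be summarized as a small training plan. This is a sketch under stated assumptions: the page says only that the alignment stage tunes the cross-modal connector, so the choice of which modules are frozen or trainable in each stage follows common LLaVA-style practice rather than anything confirmed here, and the module and stage names are hypothetical labels.

```python
from dataclasses import dataclass

# Hypothetical module labels for the three components of the model.
MODULES = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset


# Assumption: the vision encoder stays frozen throughout, the projector is
# trained in both stages, and the LLM is updated only during instruction tuning.
STAGES = (
    Stage("feature_alignment", "image-description pairs", frozenset({"projector"})),
    Stage("instruction_tuning", "question-answer pairs", frozenset({"projector", "llm"})),
)


def requires_grad(stage):
    """Return a module -> bool map saying which parts receive gradient updates."""
    return {module: module in stage.trainable for module in MODULES}


for stage in STAGES:
    print(stage.name, stage.data, requires_grad(stage))
```

In a real training script, a map like this would typically be applied by toggling gradient tracking on each module's parameters before the stage begins.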

#Technical Details

Qilin-Med-VL is a vision-language transformer assembled in the LLaVA paradigm: a pretrained Vision Transformer encodes the input image into visual tokens, a trainable projection maps those tokens into the language model's embedding space, and a foundational LLM consumes the combined image and Chinese-text sequence to generate responses. Training proceeds in two stages—feature alignment, which tunes the cross-modal connector so visual and textual features are compatible, and instruction tuning, which adapts the model to follow medical questions and multi-turn prompts. The ChiMed-VL dataset underpins both stages and is organized into an alignment split of 580,014 context-and-description image-text pairs and an instruction split of 469,441 question-answer pairs, exceeding one million pairs in total; much of the imagery is drawn from PubMed Central figures with Chinese captions and translations. Weights are released as Qilin-Med-VL (base) and Qilin-Med-VL-Chat checkpoints on Hugging Face. The repositories ship only minimal model and dataset cards, so architectural specifics and quantitative benchmark scores are documented primarily in the paper rather than in standalone cards.
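As a quick check on the dataset figures quoted above, the short snippet below adds the two ChiMed-VL splits and reports each split's share of the total. The variable names are illustrative; the counts are taken directly from the text.

```python
# Split sizes as stated in the paragraph above.
ALIGNMENT_PAIRS = 580_014    # context-and-description image-text pairs
INSTRUCTION_PAIRS = 469_441  # question-answer pairs

total_pairs = ALIGNMENT_PAIRS + INSTRUCTION_PAIRS
alignment_share = 100 * ALIGNMENT_PAIRS / total_pairs
instruction_share = 100 * INSTRUCTION_PAIRS / total_pairs

print(f"total pairs: {total_pairs:,}")                 # total pairs: 1,049,455
print(f"alignment share: {alignment_share:.1f}%")      # alignment share: 55.3%
print(f"instruction share: {instruction_share:.1f}%")  # instruction share: 44.7%
```

The total of 1,049,455 pairs is consistent with the "more than 1M" figure, with the alignment split making up slightly more than half of the corpus.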

#Applications

Qilin-Med-VL is aimed at Chinese-language clinical and research settings where a practitioner or researcher wants to query a medical image in natural language: generating draft captions for radiology or pathology figures, answering exam-style or descriptive visual questions, and supporting medical education and literature understanding. Because the weights and the ChiMed-VL corpus are openly available, it also serves as a base for researchers building or benchmarking Chinese medical multimodal assistants. As with all such systems, outputs are research artifacts and require expert validation before any clinical use.

#Impact

Qilin-Med-VL helped establish Chinese medical vision-language modeling as a distinct research direction, and its accompanying ChiMed-VL dataset has become a reusable resource for training and evaluating later Chinese medical VLMs. By demonstrating that LLaVA-style adaptation of pretrained vision and language backbones can be extended to a new language and the medical domain with an openly released corpus, it lowered the barrier for non-English medical multimodal research. Its main limitations are the sparse model and dataset documentation and the dependence on literature-derived images, which constrain coverage of real clinical workflows.

Citation

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

Preprint

Liu, J., et al. (2023) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare. arXiv.org.

DOI: 10.48550/arXiv.2310.17956

Citations

Total Citations: 81
Influential: 7
References: 31

GitHub

Stars: 65
Forks: 10
Open Issues: 7
Contributors: 2
Last Push: 2y ago
Language: Python

HuggingFace

Downloads: 6
Likes: 1
Last Modified: 2y ago
Pipeline: text-generation

Openness

bio.rodeo openness: Closed · low usability and reproducibility
44 (Partial)
Usability (can I run it?): 47
Reproducibility (can I retrain it?): 42
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

histology, instruction_tuning, medical_image_captioning, multimodal, radiology, transformer, vision_transformer, visual_question_answering

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset