A generalist medical multimodal LLM built on Qwen2.5-VL for unified medical image understanding, visual question answering, report generation, and clinical reasoning across 12+ imaging modalities.
Lingshu is a generalist medical multimodal large language model (MLLM) developed by the LASA Team at Alibaba DAMO Academy and Hupan Lab, released as an arXiv preprint in June 2025. It targets unified medical understanding and reasoning across both images and text, addressing three recurring limitations of prior medical MLLMs: a narrow scope of medical knowledge, elevated hallucination risk, and weak multi-step reasoning in complex clinical scenarios.
Rather than specializing in a single modality, Lingshu supports more than twelve medical imaging types—including X-ray, CT, MRI, ultrasound, histopathology, and fundus photography—within one model. The authors pair this breadth with a comprehensive data curation pipeline that assembles knowledge from medical images, medical text, and general-domain data, then synthesizes accurate captions, visual question-answering (VQA) samples, and reasoning traces to teach the model both perception and clinical inference.
Lingshu sits in the fast-growing space of open medical foundation models alongside efforts such as LLaVA-Med and Med-Gemini, but is distinguished by its multi-stage training recipe, reinforcement learning with verifiable rewards, and an accompanying open evaluation toolkit (MedEvalKit). The model is released under an MIT license in 7B, 8B, and 32B variants on HuggingFace.
Lingshu is built on the Qwen2.5-VL vision-language architecture, combining a vision transformer image encoder with a transformer language model, and is released in 7B, 8B, and 32B-parameter configurations. Training proceeds through multiple stages of curation and supervised learning followed by reinforcement learning with verifiable rewards. On the 7B model, the authors report a medical multimodal VQA average of 61.8% and a medical textual QA average of 52.8%, with strong report-generation scores on MIMIC-CXR, CheXpert Plus, and IU-Xray (e.g., ROUGE-L 30.8, CIDEr 109.4, RaTE 52.1). The flagship Lingshu-32B is reported to outperform leading proprietary systems including GPT-4.1 and Claude Sonnet 4 on most multimodal QA and report-generation tasks, while consistently surpassing existing open-source medical MLLMs.
Lingshu is aimed at clinical and research settings that require reasoning over heterogeneous medical imaging and text. Practical use cases include medical visual question answering across radiology, pathology, and ophthalmology images; automated drafting of radiology reports from chest X-rays; answering text-based clinical and exam-style questions; and supporting decision workflows that require multi-step diagnostic reasoning. Because the weights are openly licensed in multiple sizes, hospitals, biomedical NLP groups, and developers can fine-tune Lingshu on local data or integrate it into downstream clinical-AI pipelines.
As an openly licensed, multi-size medical MLLM that reports outperforming both open-source peers and frontier proprietary models on many medical benchmarks, Lingshu lowers the barrier to building competitive medical reasoning systems without API dependence. The accompanying MedEvalKit further contributes a shared evaluation standard for the field, which can improve the comparability of future medical MLLM results. As a 2025 preprint, its long-term clinical impact remains to be established, and—like all medical LLMs—its outputs require expert validation and carry hallucination and safety risks that preclude unsupervised clinical use.
Team, L., et al. (2025) Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv.org.
DOI: 10.48550/arXiv.2506.07044Papers that recently cited this model.
The most-cited papers that cite this model.
Not enough data