Lingshu

Generalist medical multimodal LLM for image understanding, visual question answering, and report generation across twelve-plus imaging modalities.

Released: June 2025

Lingshu is a generalist medical multimodal large language model (MLLM) developed by the LASA Team at Alibaba DAMO Academy and Hupan Lab, released as an arXiv preprint in June 2025. It targets unified medical understanding and reasoning across both images and text, addressing three recurring limitations of prior medical MLLMs: a narrow scope of medical knowledge, elevated hallucination risk, and weak multi-step reasoning in complex clinical scenarios.

Rather than specializing in a single modality, Lingshu supports more than twelve medical imaging types—including X-ray, CT, MRI, ultrasound, histopathology, and fundus photography—within one model. The authors pair this breadth with a comprehensive data curation pipeline that assembles knowledge from medical images, medical text, and general-domain data, then synthesizes accurate captions, visual question-answering (VQA) samples, and reasoning traces to teach the model both perception and clinical inference.

Lingshu sits in the fast-growing space of open medical foundation models alongside efforts such as LLaVA-Med and Med-Gemini, but is distinguished by its multi-stage training recipe, reinforcement learning with verifiable rewards, and an accompanying open evaluation toolkit (MedEvalKit). The model is released under an MIT license in 7B, 8B, and 32B variants on HuggingFace.

Key Features

Unified multimodal coverage: A single model handles 12+ imaging modalities (X-ray, CT, MRI, ultrasound, histopathology, fundus, and more) together with medical text, supporting VQA, report generation, and textual QA.
Comprehensive data curation: Training data is curated from medical imaging, medical texts, and general corpora, with synthesized captions, VQA pairs, and reasoning samples to inject domain knowledge while limiting hallucination.
Multi-stage training with RLVR: A staged training procedure culminates in reinforcement learning with verifiable rewards (RLVR), explicitly strengthening complex clinical reasoning beyond supervised fine-tuning.
Open model family: Released under MIT in 7B, 8B, and 32B sizes, allowing local deployment, fine-tuning, and reproduction across a range of compute budgets.
MedEvalKit benchmark suite: The team ships a standardized evaluation framework that consolidates major multimodal and text-based medical benchmarks to enable consistent, comparable assessment.

Technical Details

Lingshu is built on the Qwen2.5-VL vision-language architecture, combining a vision transformer image encoder with a transformer language model, and is released in 7B, 8B, and 32B-parameter configurations. Training proceeds through multiple stages of curation and supervised learning followed by reinforcement learning with verifiable rewards. On the 7B model, the authors report a medical multimodal VQA average of 61.8% and a medical textual QA average of 52.8%, with strong report-generation scores on MIMIC-CXR, CheXpert Plus, and IU-Xray (e.g., ROUGE-L 30.8, CIDEr 109.4, RaTE 52.1). The flagship Lingshu-32B is reported to outperform leading proprietary systems including GPT-4.1 and Claude Sonnet 4 on most multimodal QA and report-generation tasks, while consistently surpassing existing open-source medical MLLMs.

Applications

Lingshu is aimed at clinical and research settings that require reasoning over heterogeneous medical imaging and text. Practical use cases include medical visual question answering across radiology, pathology, and ophthalmology images; automated drafting of radiology reports from chest X-rays; answering text-based clinical and exam-style questions; and supporting decision workflows that require multi-step diagnostic reasoning. Because the weights are openly licensed in multiple sizes, hospitals, biomedical NLP groups, and developers can fine-tune Lingshu on local data or integrate it into downstream clinical-AI pipelines.

Impact

As an openly licensed, multi-size medical MLLM that reports outperforming both open-source peers and frontier proprietary models on many medical benchmarks, Lingshu lowers the barrier to building competitive medical reasoning systems without API dependence. The accompanying MedEvalKit further contributes a shared evaluation standard for the field, which can improve the comparability of future medical MLLM results. As a 2025 preprint, its long-term clinical impact remains to be established, and—like all medical LLMs—its outputs require expert validation and carry hallucination and safety risks that preclude unsupervised clinical use.

Citation

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Preprint

Team, L., et al. (2025) Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv.org.

DOI: 10.48550/arXiv.2506.07044

Recent citations

Papers that recently cited this model.

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models
Zhuoyuan Fu, Zeshang Li, Yiqiong Zhang, et al.
Jul 2026
0
The Path to Self-Evolving Clinical Systems: Scaling Medical Agents from Assistance to Autonomy
Chunzheng Zhu, Lei Tian, Bohan Tan, et al.
Jul 2026
0
Policy-Driven CT-Agent: Modeling Phase-Aware Diagnostic Control for Clinically Consistent CT Reasoning
Yanmeng Dong, Han Li, Yujia Li, et al.
Jul 2026
0

Top citations

The most-cited papers that cite this model.

Baichuan-M2: Scaling Medical Capability with Large Verifier System
Chengfeng Dou, Chong Liu, Fan Yang, et al.
arXiv.org · Sep 2025
36
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Yu Sun, Xingyu Qian, Weiwen Xu, et al.
Conference on Empirical Methods in Natural Language Processing · Jun 2025
23
Pillar-0: A New Frontier for Radiology Foundation Models
Kumar Krishna Agrawal, Longchao Liu, Long Lian, et al.
arXiv.org · Nov 2025
18Influential
Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation
Jiaying Wu, Zihang Fu, Haonan Wang, et al.
arXiv.org · Oct 2025
13
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
Haozhe Gong, Xiaozhong Ji, Yuansen Liu, et al.
arXiv.org · Nov 2025
12

Citations

Total Citations198

Influential29

References0

GitHub

Stars3

Forks0

Open Issues0

Contributors2

Last Push10mo ago

LanguageHTML

HuggingFace

Downloads145.9K

Likes78

Last Modified10mo ago

Pipelineimage-text-to-text

Fields of citing research

Computer Science99%
Medicine94%
Engineering11%
Biology1%
Mathematics1%
Agricultural and Food Sciences1%
Environmental Science1%
Psychology1%

Share of papers citing this model.

Openness

bio.rodeo opennessOpen weights · open weights, closed recipe

70Open

Usability — can I run it?100

Reproducibility — can I retrain it?30

open weights, closed recipe

Model Openness Framework

Unclassified

No formal model card / data card

Resources

GitHub Repository Research Paper HuggingFace Model

Key Features

Unified multimodal coverage: A single model handles 12+ imaging modalities (X-ray, CT, MRI, ultrasound, histopathology, fundus, and more) together with medical text, supporting VQA, report generation, and textual QA.

Comprehensive data curation: Training data is curated from medical imaging, medical texts, and general corpora, with synthesized captions, VQA pairs, and reasoning samples to inject domain knowledge while limiting hallucination.

Multi-stage training with RLVR: A staged training procedure culminates in reinforcement learning with verifiable rewards (RLVR), explicitly strengthening complex clinical reasoning beyond supervised fine-tuning.

Open model family: Released under MIT in 7B, 8B, and 32B sizes, allowing local deployment, fine-tuning, and reproduction across a range of compute budgets.

MedEvalKit benchmark suite: The team ships a standardized evaluation framework that consolidates major multimodal and text-based medical benchmarks to enable consistent, comparable assessment.

Technical Details

Applications

Impact

Recent citations

Papers that recently cited this model.

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models

Zhuoyuan Fu, Zeshang Li, Yiqiong Zhang, et al.

Jul 2026

The Path to Self-Evolving Clinical Systems: Scaling Medical Agents from Assistance to Autonomy

Chunzheng Zhu, Lei Tian, Bohan Tan, et al.

Jul 2026

Policy-Driven CT-Agent: Modeling Phase-Aware Diagnostic Control for Clinically Consistent CT Reasoning

Yanmeng Dong, Han Li, Yujia Li, et al.

Jul 2026

Lingshu

#Key Features

#Technical Details

#Applications

#Impact

Citation

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Recent citations

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models

The Path to Self-Evolving Clinical Systems: Scaling Medical Agents from Assistance to Autonomy

Policy-Driven CT-Agent: Modeling Phase-Aware Diagnostic Control for Clinically Consistent CT Reasoning

Top citations

Related models

Citations

GitHub

HuggingFace

Fields of citing research

Openness

Tags

Resources

Lingshu

#Key Features

#Technical Details

#Applications

#Impact

Citation

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Recent citations

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models

The Path to Self-Evolving Clinical Systems: Scaling Medical Agents from Assistance to Autonomy

Policy-Driven CT-Agent: Modeling Phase-Aware Diagnostic Control for Clinically Consistent CT Reasoning

Top citations

Related models

Citations

GitHub

HuggingFace

Fields of citing research

Openness

Tags

Resources

Key Features

Technical Details

Applications

Impact

Key Features

Technical Details

Applications

Impact