bio.rodeo
ModelsOrganizationsLeaderboardAbout
bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

AboutFAQSubmit a modelContact
© 2026 Pulsatance. All rights reserved. ~
Built by Pulsatance
Imaging foundation models
ImagingLanguage model

Lingshu

DAMO Academy / Hupan Lab

A generalist medical multimodal LLM built on Qwen2.5-VL for unified medical image understanding, visual question answering, report generation, and clinical reasoning across 12+ imaging modalities.

Released: June 2025

Lingshu is a generalist medical multimodal large language model (MLLM) developed by the LASA Team at Alibaba DAMO Academy and Hupan Lab, released as an arXiv preprint in June 2025. It targets unified medical understanding and reasoning across both images and text, addressing three recurring limitations of prior medical MLLMs: a narrow scope of medical knowledge, elevated hallucination risk, and weak multi-step reasoning in complex clinical scenarios.

Rather than specializing in a single modality, Lingshu supports more than twelve medical imaging types—including X-ray, CT, MRI, ultrasound, histopathology, and fundus photography—within one model. The authors pair this breadth with a comprehensive data curation pipeline that assembles knowledge from medical images, medical text, and general-domain data, then synthesizes accurate captions, visual question-answering (VQA) samples, and reasoning traces to teach the model both perception and clinical inference.

Lingshu sits in the fast-growing space of open medical foundation models alongside efforts such as LLaVA-Med and Med-Gemini, but is distinguished by its multi-stage training recipe, reinforcement learning with verifiable rewards, and an accompanying open evaluation toolkit (MedEvalKit). The model is released under an MIT license in 7B, 8B, and 32B variants on HuggingFace.

#Key Features

  • Unified multimodal coverage: A single model handles 12+ imaging modalities (X-ray, CT, MRI, ultrasound, histopathology, fundus, and more) together with medical text, supporting VQA, report generation, and textual QA.
  • Comprehensive data curation: Training data is curated from medical imaging, medical texts, and general corpora, with synthesized captions, VQA pairs, and reasoning samples to inject domain knowledge while limiting hallucination.
  • Multi-stage training with RLVR: A staged training procedure culminates in reinforcement learning with verifiable rewards (RLVR), explicitly strengthening complex clinical reasoning beyond supervised fine-tuning.
  • Open model family: Released under MIT in 7B, 8B, and 32B sizes, allowing local deployment, fine-tuning, and reproduction across a range of compute budgets.
  • MedEvalKit benchmark suite: The team ships a standardized evaluation framework that consolidates major multimodal and text-based medical benchmarks to enable consistent, comparable assessment.

#Technical Details

Lingshu is built on the Qwen2.5-VL vision-language architecture, combining a vision transformer image encoder with a transformer language model, and is released in 7B, 8B, and 32B-parameter configurations. Training proceeds through multiple stages of curation and supervised learning followed by reinforcement learning with verifiable rewards. On the 7B model, the authors report a medical multimodal VQA average of 61.8% and a medical textual QA average of 52.8%, with strong report-generation scores on MIMIC-CXR, CheXpert Plus, and IU-Xray (e.g., ROUGE-L 30.8, CIDEr 109.4, RaTE 52.1). The flagship Lingshu-32B is reported to outperform leading proprietary systems including GPT-4.1 and Claude Sonnet 4 on most multimodal QA and report-generation tasks, while consistently surpassing existing open-source medical MLLMs.

#Applications

Lingshu is aimed at clinical and research settings that require reasoning over heterogeneous medical imaging and text. Practical use cases include medical visual question answering across radiology, pathology, and ophthalmology images; automated drafting of radiology reports from chest X-rays; answering text-based clinical and exam-style questions; and supporting decision workflows that require multi-step diagnostic reasoning. Because the weights are openly licensed in multiple sizes, hospitals, biomedical NLP groups, and developers can fine-tune Lingshu on local data or integrate it into downstream clinical-AI pipelines.

#Impact

As an openly licensed, multi-size medical MLLM that reports outperforming both open-source peers and frontier proprietary models on many medical benchmarks, Lingshu lowers the barrier to building competitive medical reasoning systems without API dependence. The accompanying MedEvalKit further contributes a shared evaluation standard for the field, which can improve the comparability of future medical MLLM results. As a 2025 preprint, its long-term clinical impact remains to be established, and—like all medical LLMs—its outputs require expert validation and carry hallucination and safety risks that preclude unsupervised clinical use.

Citation

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Preprint

Team, L., et al. (2025) Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv.org.

DOI: 10.48550/arXiv.2506.07044

Recent citations

Papers that recently cited this model.

Not enough citation data yet.

Top citations

The most-cited papers that cite this model.

Not enough citation data yet.

Citations

Total Citations154
Influential24
References0

GitHub

Stars3
Forks0
Open Issues0
Contributors2
Last Push8mo ago
LanguageHTML

HuggingFace

Downloads3.8K
Likes77
Last Modified8mo ago
Pipelineimage-text-to-text

Fields of citing research

Not enough data

Openness

bio.rodeo opennessOpen weights · open weights, closed recipe
70Open
Usability — can I run it?100
Reproducibility — can I retrain it?30
open weights, closed recipe
Model Openness Framework
Unclassified
No formal model card / data card

Tags

clinical_reasoningfoundation_modelhistologymedical_visual_question_answeringmultimodalradiologyreinforcement_learningreport_generationtransformervision_transformer

Resources

GitHub RepositoryResearch PaperHuggingFace Model