Med-MoE

Zhejiang University / National University of Singapore / Peking University

Lightweight mixture-of-experts medical vision-language model routing visual question answering and image classification to domain-specific experts.

Released: April 2024

Med-MoE is a lightweight medical vision-language framework that brings the mixture-of-experts (MoE) paradigm to multimodal clinical AI. It was developed by researchers at Zhejiang University, the National University of Singapore, and Peking University, and published in the Findings of the Association for Computational Linguistics: EMNLP 2024. The model targets a practical problem in medical AI deployment: state-of-the-art medical multimodal large language models such as LLaVA-Med are powerful but heavy, making them difficult to run in resource-constrained clinical settings where compute, memory, and latency budgets are tight.

Rather than scaling up a single dense model, Med-MoE replaces the feed-forward layers of a compact language backbone with a set of domain-specific experts that are selectively activated by a trainable router. Inputs from different medical imaging domains, such as radiology and pathology, are handled by experts specialized for that domain, while a shared meta expert captures cross-domain knowledge. Because only a subset of experts fires for any given input, the model activates roughly 30-50% of its parameters per forward pass, delivering the capacity benefits of a larger model at a fraction of the inference cost.
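The routing idea described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the class name `SparseMoELayer`, the toy linear experts, the top-k selection rule, and the choice to always add the meta expert's output are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(in_dim, out_dim, rng):
    """Return a toy linear map (random weights) standing in for an expert."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return apply


class SparseMoELayer:
    """Hypothetical sparse MoE block: router + domain experts + meta expert."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = make_linear(dim, num_experts, rng)   # one score per expert
        self.experts = [make_linear(dim, dim, rng) for _ in range(num_experts)]
        self.meta_expert = make_linear(dim, dim, rng)      # assumed always active

    def forward(self, x):
        probs = softmax(self.router(x))
        # Keep only the top_k highest-scoring experts; the rest are never run.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        out = self.meta_expert(x)
        for i in chosen:
            weight = probs[i] / norm          # renormalize over chosen experts
            expert_out = self.experts[i](x)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out, chosen


layer = SparseMoELayer(dim=8, num_experts=4, top_k=2)
output, chosen = layer.forward([0.5] * 8)
print(len(output), chosen)
```

Only the experts listed in `chosen` are evaluated for this input, which is where the compute saving of sparse activation comes from.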

The framework addresses both discriminative tasks (medical image classification) and generative tasks (open- and closed-ended visual question answering) within a single unified architecture, positioning it as an efficiency-focused alternative to larger medical VLMs for VQA and classification workflows.

Key Features

Domain-specific expert routing: A trainable router selects among experts specialized for distinct medical imaging domains, with a meta expert that retains shared, cross-domain knowledge.
Sparse activation for efficiency: Only approximately 30-50% of model parameters are activated per input, substantially lowering inference compute relative to dense medical VLMs of comparable capability.
Compact language backbones: Built on lightweight LLMs—Phi-2 (2.7B) and StableLM-1.6B—making the framework deployable in constrained environments.
Three-stage training recipe: Multimodal medical alignment, instruction tuning with trainable routing, and domain-specific MoE tuning are applied in sequence to specialize the experts.
Unified discriminative and generative handling: One model covers both image classification and open/closed-ended VQA rather than requiring separate task-specific systems.

Technical Details

Med-MoE converts a dense compact LLM into a sparse mixture-of-experts model by expanding selected feed-forward blocks into multiple expert copies governed by a learned router. Two backbones are released: Phi-2 (2.7B parameters) and StableLM-1.6B. Training proceeds in three phases—first aligning medical image features to the LLM token space, then instruction tuning while learning the routing function, and finally domain-specific MoE tuning that couples the router with the selectively activated experts. Training data is drawn from the LLaVA-Med data pipeline. The model is evaluated on the standard medical VQA benchmarks VQA-RAD, SLAKE, and Path-VQA, plus medical image classification, where it reports performance on par with or exceeding state-of-the-art baselines while activating only about 30-50% of its parameters. Code and three-stage checkpoints are released under the Apache-2.0 license.
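To make the "activated parameters" figure concrete, the sketch below computes the share of parameters touched in one forward pass of a generic sparse MoE model. The function name `activated_fraction` and all sizes in the example are hypothetical and chosen only so the arithmetic is easy to follow; they are not Med-MoE's actual parameter counts.

```python
def activated_fraction(shared, per_expert, num_experts, top_k, meta=0.0):
    """Fraction of parameters used per forward pass in a sparse MoE model.

    shared      -- parameters every input uses (attention, embeddings, ...)
    per_expert  -- parameters in one routed expert
    num_experts -- number of routed experts stored in the model
    top_k       -- number of routed experts run per input
    meta        -- parameters of an always-active shared expert (assumption)
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared + num_experts * per_expert + meta
    active = shared + top_k * per_expert + meta
    return active / total


# Made-up sizes in arbitrary units: 1 shared, 4 experts of 1 each, 1 meta.
# Active = 1 + 1 + 1 = 3, total = 1 + 4 + 1 = 6.
fraction = activated_fraction(shared=1.0, per_expert=1.0,
                              num_experts=4, top_k=1, meta=1.0)
print(fraction)  # 0.5
```

The fraction falls as more experts are stored relative to the number run per input, which is how a sparse model can hold more capacity than it pays for at inference time.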

Applications

Med-MoE is suited to medical visual question answering and medical image classification across radiology and pathology imaging, where it can answer open-ended and closed-ended questions about scans or histology images. Its small footprint and sparse activation make it attractive for research groups and clinical-adjacent settings that need multimodal medical reasoning without the infrastructure required to serve large dense models, including on-premise or edge-style deployments where data governance and latency matter.

Impact

Med-MoE demonstrates that mixture-of-experts routing can deliver competitive medical vision-language performance at a fraction of the activated parameters, offering a concrete path toward efficient, deployable clinical multimodal models. By open-sourcing code, data pipelines, and weights for both backbones under a permissive license, the authors lowered the barrier for reproducing and extending lightweight medical VLMs. The work contributes to a growing line of research applying sparse expert architectures to specialized biomedical domains, where heterogeneous imaging modalities make domain-specialized experts a natural fit. Its main limitations stem from the modest scale of its backbones and its focus on VQA and classification benchmarks rather than broader clinical tasks.

Citation

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Jiang, S., et al. (2024). Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024.

DOI: 10.18653/v1/2024.findings-emnlp.221

Recent citations

Papers that recently cited this model.

Medical Question Answering: A Comprehensive Multimodal and LLM-Driven Survey
Eya Mhedhbi, Xiang Zhu, Muhammad Ayaz, et al.
Computer Methods and Programs in Biomedicine · Jul 2026
Citations: 0

\(M^{3}QuestionIng\): Multi-modal Multi-span Medical Question Answering
Anisha Saha, Vaibhav Rathore, Abhishek Tiwari, et al.
ACM Transactions on Computing for Healthcare · Jun 2026
Citations: 0

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
T. Halder, Akash Ghosh, Subhadip Baidya, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Efficient multimodal large language models: a survey
Yizhang Jin, Jian Li, Yexin Liu, et al.
Visual Intelligence · May 2024
Citations: 118

Large language models for disease diagnosis: a scoping review
Shuang Zhou, Zidu Xu, Mian Zhang, et al.
npj Artificial Intelligence · Jun 2025
Citations: 64

Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar, Latif U. Khan, Waseem Ullah, et al.
IEEE Internet of Things Journal · Feb 2025
Citations: 45

JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin, Zixu Lin, Haoyu Chen, et al.
Computer Vision and Pattern Recognition · Apr 2025
Citations: 40

From large language models to multimodal AI: a scoping review on the potential of generative AI in medicine
L. Buess, Matthias Keicher, Nassir Navab, et al.
Biomedical Engineering Letters · Feb 2025
Citations: 35

Citations

Total Citations: 89
Influential: 3
References: 63

GitHub

Stars: 158
Forks: 12
Open Issues: 0
Contributors: 2
Last Push: 1y ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 1y ago

Fields of citing research

Computer Science: 99%
Medicine: 71%
Engineering: 11%
Linguistics: 6%
Environmental Science: 2%
Physics: 1%
Mathematics: 1%
Education: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 81 (Open)
Usability (can I run it?): 91
Reproducibility (can I retrain it?): 70
Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository
Research Paper
HuggingFace Model
