Zhejiang University / National University of Singapore / Peking University
A lightweight mixture-of-experts medical vision-language model that routes between domain-specific experts for VQA and image classification while activating only 30-50% of parameters.
Med-MoE is a lightweight medical vision-language framework that brings the mixture-of-experts (MoE) paradigm to multimodal clinical AI. It was developed by researchers at Zhejiang University, the National University of Singapore, and Peking University, and published in the Findings of the Association for Computational Linguistics: EMNLP 2024. The model targets a practical problem in medical AI deployment: state-of-the-art medical multimodal large language models such as LLaVA-Med are powerful but heavy, making them difficult to run in resource-constrained clinical settings where compute, memory, and latency budgets are tight.
Rather than scaling up a single dense model, Med-MoE replaces the feed-forward layers of a compact language backbone with a set of domain-specific experts that are selectively activated by a trainable router. Different medical imaging domains, such as radiology and pathology, are handled by experts specialized for that data, while a shared meta expert captures cross-domain knowledge. Because only a subset of experts fires for any given input, the model activates roughly 30-50% of its parameters per forward pass, offering much of the capacity of a larger model at a fraction of the inference cost.
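The routing idea described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The class names (`Expert`, `SparseMoELayer`), the layer sizes, the number of experts, the top-k value, and the choice to add the meta expert's output to the weighted expert outputs are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Two-layer feed-forward block with ReLU (hypothetical expert shape)."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = make_matrix(d_hidden, d_model, rng)
        self.w_out = make_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class SparseMoELayer:
    """Router + domain experts + an always-on meta expert (assumed design)."""

    def __init__(self, d_model=8, d_hidden=16, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = make_matrix(num_experts, d_model, rng)
        self.experts = [Expert(d_model, d_hidden, rng) for _ in range(num_experts)]
        self.meta_expert = Expert(d_model, d_hidden, rng)

    def forward(self, x):
        # Score every domain expert, then keep only the top-k.
        probs = softmax(matvec(self.router, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)

        # Assumption: the meta expert runs for every token.
        out = self.meta_expert(x)
        for i in chosen:
            weight = probs[i] / norm  # renormalise over the selected experts
            expert_out = self.experts[i](x)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out, chosen
```

Calling `SparseMoELayer().forward([0.5] * 8)` returns an 8-dimensional output and the indices of the two experts that were run. The unselected experts are never evaluated, which is where the inference savings come from.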
The framework addresses both discriminative tasks (medical image classification) and generative tasks (open- and closed-ended visual question answering) within a single unified architecture, positioning it as an efficiency-focused alternative to larger medical VLMs for VQA and classification workflows.
Med-MoE converts a dense compact LLM into a sparse mixture-of-experts model by expanding selected feed-forward blocks into multiple expert copies governed by a learned router. Two backbones are released: Phi-2 (2.7B parameters) and StableLM-1.6B. Training proceeds in three phases: first, medical image features are aligned to the LLM token space; second, the model is instruction-tuned while the routing function is learned; third, domain-specific MoE tuning couples the router with the selectively activated experts. Training data is drawn from the LLaVA-Med data pipeline. The model is evaluated on the standard medical VQA benchmarks VQA-RAD, SLAKE, and Path-VQA, plus medical image classification, where it reports performance on par with or exceeding state-of-the-art baselines while activating only about 30-50% of its parameters. Code and checkpoints for all three training stages are released under the Apache-2.0 license.
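To make the activated-parameter figure concrete, the following back-of-the-envelope sketch shows how the active fraction follows from the split between shared and expert parameters. The function name `active_fraction` and every number used with it are hypothetical; they are not taken from the released checkpoints, and the accounting assumes one always-on meta expert of the same size as a domain expert.

```python
def active_fraction(shared, per_expert, num_domain_experts, top_k,
                    has_meta_expert=True):
    """Fraction of parameters used in one forward pass.

    shared             -- parameters every token passes through
                          (attention, embeddings, non-expanded blocks)
    per_expert         -- parameters in a single expert
    num_domain_experts -- routed experts available to the router
    top_k              -- routed experts actually run per token
    has_meta_expert    -- assumption: one extra expert that always runs
    """
    meta = 1 if has_meta_expert else 0
    total = shared + per_expert * (num_domain_experts + meta)
    active = shared + per_expert * (top_k + meta)
    return active / total


# Illustrative numbers only (in millions of parameters).
example = active_fraction(shared=800, per_expert=400,
                          num_domain_experts=8, top_k=2)
```

With these made-up values the model holds 4,400M parameters in total and runs 2,000M per token, so a little under half of the parameters are active. Adding more routed experts while holding `top_k` fixed pushes the fraction lower.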
Med-MoE is suited to medical visual question answering and medical image classification across radiology and pathology imaging, where it can answer open-ended and closed-ended questions about scans or histology images. Its small footprint and sparse activation make it attractive for research groups and clinical-adjacent settings that need multimodal medical reasoning without the infrastructure required to serve large dense models, including on-premise or edge-style deployments where data governance and latency matter.
Med-MoE demonstrates that mixture-of-experts routing can deliver competitive medical vision-language performance at a fraction of the activated parameters, offering a concrete path toward efficient, deployable clinical multimodal models. By open-sourcing code, data pipelines, and weights for both backbones under a permissive license, the authors lowered the barrier for reproducing and extending lightweight medical VLMs. The work contributes to a growing line of research applying sparse expert architectures to specialized biomedical domains, where heterogeneous imaging modalities make domain-specialized experts a natural fit. Its main limitations stem from the modest scale of its backbones and its focus on VQA and classification benchmarks rather than broader clinical tasks.
Jiang, S., et al. (2024). Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024.
DOI: 10.18653/v1/2024.findings-emnlp.221