Imaging · Language model · Pathology

Med-MoE

Zhejiang University / National University of Singapore / Peking University

A lightweight mixture-of-experts medical vision-language model that routes between domain-specific experts for VQA and image classification while activating only 30-50% of parameters.

Released: April 2024

Med-MoE is a lightweight medical vision-language framework that brings the mixture-of-experts (MoE) paradigm to multimodal clinical AI. It was developed by researchers at Zhejiang University, the National University of Singapore, and Peking University, and published in the Findings of the Association for Computational Linguistics: EMNLP 2024. The model targets a practical problem in medical AI deployment: state-of-the-art medical multimodal large language models such as LLaVA-Med are powerful but heavy, making them difficult to run in resource-constrained clinical settings where compute, memory, and latency budgets are tight.

Rather than scaling up a single dense model, Med-MoE replaces the feed-forward layers of a compact language backbone with a set of domain-specific experts that a trainable router activates selectively. Different medical imaging domains, such as radiology and pathology, are handled by experts specialized for those domains, while a shared meta expert captures cross-domain knowledge. Because only a subset of experts fires for any given input, the model activates roughly 30-50% of its parameters per forward pass, offering much of the capacity of a larger model at a lower inference cost.
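The routing idea can be illustrated with a minimal sketch in plain Python. It assumes top-k softmax gating over the domain experts and a meta expert whose output is always added; this page does not state Med-MoE's exact gating rule or how the meta expert is combined, so `route_token`, `moe_layer`, and the toy experts are hypothetical names chosen for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=1):
    """Return {expert_index: weight} for the top_k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    (Assumed gating rule, not confirmed by the source.)
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(x, router_logits, domain_experts, meta_expert, top_k=1):
    """Combine an always-on meta expert with the routed domain experts."""
    weights = route_token(router_logits, top_k)
    out = list(meta_expert(x))  # shared, cross-domain path
    for index, weight in weights.items():
        expert_out = domain_experts[index](x)  # only selected experts run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, weights

# Toy stand-ins for expert feed-forward blocks (hypothetical).
toy_experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
toy_meta = lambda x: list(x)

output, used = moe_layer([1.0, 2.0], [0.1, 2.0, -1.0], toy_experts, toy_meta)
print(output, used)
```

With `top_k=1`, only one of the three toy experts is evaluated for the input, which is where the compute saving of sparse activation comes from.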

The framework addresses both discriminative tasks (medical image classification) and generative tasks (open- and closed-ended visual question answering) within a single unified architecture, positioning it as an efficiency-focused alternative to larger medical VLMs for VQA and classification workflows.

#Key Features

  • Domain-specific expert routing: A trainable router selects among experts specialized for distinct medical imaging domains, with a meta expert that retains shared, cross-domain knowledge.
  • Sparse activation for efficiency: Roughly 30-50% of model parameters are activated per input, substantially lowering inference compute relative to dense medical VLMs of comparable capability.
  • Compact language backbones: Built on the lightweight LLMs Phi-2 (2.7B) and StableLM-1.6B, making the framework deployable in constrained environments.
  • Three-stage training recipe: Multimodal medical alignment, instruction tuning with trainable routing, and domain-specific MoE tuning are applied in sequence to specialize the experts.
  • Unified discriminative and generative handling: One model covers both image classification and open/closed-ended VQA rather than requiring separate task-specific systems.
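
To make the sparse-activation figure concrete, the following sketch computes the share of parameters used per input in a generic MoE model. The function `active_fraction` and every number passed to it are hypothetical; they are not Med-MoE's actual parameter counts, which this page does not break down.

```python
def active_fraction(shared, per_expert, num_experts, top_k, always_on=1):
    """Fraction of parameters used per input in a sparse MoE model.

    shared:      parameters every input uses (attention, embeddings, ...)
    per_expert:  parameters in one expert, summed over all MoE layers
    num_experts: routed experts available per MoE layer
    top_k:       routed experts the router activates per input
    always_on:   experts that bypass the router (e.g. a meta expert)
    """
    if not 0 <= top_k <= num_experts:
        raise ValueError("top_k must be between 0 and num_experts")
    total = shared + per_expert * (num_experts + always_on)
    active = shared + per_expert * (top_k + always_on)
    return active / total

# Hypothetical sizes in billions of parameters, chosen only for illustration.
fraction = active_fraction(shared=0.5, per_expert=0.5, num_experts=4, top_k=1)
print(f"{fraction:.0%} of parameters active")
```

The fraction falls as more routed experts are added at a fixed `top_k`, and rises as the shared (non-expert) portion of the model grows.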

#Technical Details

Med-MoE converts a dense compact LLM into a sparse mixture-of-experts model by expanding selected feed-forward blocks into multiple expert copies governed by a learned router. Two backbones are released: Phi-2 (2.7B parameters) and StableLM-1.6B. Training proceeds in three phases: first aligning medical image features to the LLM token space, then instruction tuning while learning the routing function, and finally domain-specific MoE tuning that couples the router with the selectively activated experts. Training data is drawn from the LLaVA-Med data pipeline. The model is evaluated on the standard medical VQA benchmarks VQA-RAD, SLAKE, and Path-VQA, as well as on medical image classification. The authors report performance on par with or exceeding state-of-the-art baselines while activating only about 30-50% of the model's parameters. Code and checkpoints for the three training stages are released under the Apache-2.0 license.
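
The step of expanding a feed-forward block into expert copies can be sketched as follows. This is an assumption-laden illustration: it supposes each expert starts as an exact copy of the dense block's weights, which the wording "expert copies" suggests but this page does not confirm. The helper names (`ffn`, `upcycle`, `mixture`) and the tiny weights are hypothetical.

```python
import copy

def linear(weights, bias, x):
    """Compute W x + b for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def ffn(params, x):
    """Tiny two-layer feed-forward block with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(params["w1"], params["b1"], x)]
    return linear(params["w2"], params["b2"], hidden)

def upcycle(dense_params, num_experts):
    """Create independent expert copies of one dense feed-forward block."""
    return [copy.deepcopy(dense_params) for _ in range(num_experts)]

def mixture(experts, gate_weights, x):
    """Gate-weighted sum of expert outputs."""
    outputs = [ffn(params, x) for params in experts]
    width = len(outputs[0])
    return [sum(g * out[j] for g, out in zip(gate_weights, outputs))
            for j in range(width)]

# Hypothetical dense block: 2 inputs, 2 hidden units, 1 output.
dense = {
    "w1": [[1.0, -1.0], [0.5, 2.0]],
    "b1": [0.0, 1.0],
    "w2": [[1.0, 1.0]],
    "b2": [0.5],
}
experts = upcycle(dense, num_experts=3)
x = [2.0, 1.0]
print(ffn(dense, x), mixture(experts, [0.25, 0.25, 0.5], x))
```

Under this assumption, any gate weights that sum to 1 reproduce the dense block's output at initialization, so the experts only diverge once later training stages update them separately.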

#Applications

Med-MoE is suited to medical visual question answering and medical image classification across radiology and pathology imaging, where it can answer open-ended and closed-ended questions about scans or histology images. Its small footprint and sparse activation make it attractive for research groups and clinical-adjacent settings that need multimodal medical reasoning without the infrastructure required to serve large dense models, including on-premise or edge-style deployments where data governance and latency matter.

#Impact

Med-MoE demonstrates that mixture-of-experts routing can deliver competitive medical vision-language performance at a fraction of the activated parameters, offering a concrete path toward efficient, deployable clinical multimodal models. By open-sourcing code, data pipelines, and weights for both backbones under a permissive license, the authors lowered the barrier for reproducing and extending lightweight medical VLMs. The work contributes to a growing line of research applying sparse expert architectures to specialized biomedical domains, where heterogeneous imaging modalities make domain-specialized experts a natural fit. Its main limitations stem from the modest scale of its backbones and its focus on VQA and classification benchmarks rather than broader clinical tasks.

Citation

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Jiang, S., et al. (2024). Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024.

DOI: 10.18653/v1/2024.findings-emnlp.221

Citations

Total Citations: 81
Influential: 2
References: 63

GitHub

Stars: 158
Forks: 12
Open Issues: 0
Contributors: 2
Last Push: 11mo ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 1y ago

Openness

bio.rodeo openness: 81 (Open). Fully open · usable and reproducible
Usability (can I run it?): 91
Reproducibility (can I retrain it?): 70
Model Openness Framework: Unclassified (missing required components)

Tags

histology · image_classification · instruction_tuning · mixture_of_experts · multimodal · radiology · transformer · visual_question_answering

Resources

GitHub Repository · Research Paper · HuggingFace Model