Hong Kong Baptist University / Johns Hopkins University
A Mixture-of-Experts foundation model for medical multimodal image segmentation that generalizes across imaging modalities and clinical centers.
M4oE (Medical Multimodal Mixture of Experts) is a foundation model for medical image segmentation designed to handle the heterogeneity that arises when imaging data are drawn from different modalities and different clinical centers. A persistent obstacle in medical imaging is that a model trained on one modality or one institution's acquisition protocol often degrades sharply when applied elsewhere, because anatomical appearance, contrast, and noise characteristics vary substantially across CT, MRI, and other modalities, and across scanners. M4oE tackles this with modality-specific experts, each capturing domain knowledge for a particular data source, whose contributions are dynamically weighted by a learned gating network.
The model was introduced by Yufeng Jiang (Hong Kong Baptist University) and Yiqing Shen (Johns Hopkins University) in a paper first posted to arXiv in May 2024 and subsequently accepted to MICCAI 2024, one of the principal venues for medical image computing. It sits within the recent wave of generalist medical segmentation models—alongside efforts such as STU-Net, MED3D, and SAM-Med2D—but distinguishes itself by using a Mixture-of-Experts (MoE) formulation to achieve cross-modality and cross-center generalization rather than relying on a single monolithic backbone.
By routing each input through the most relevant experts, M4oE aims to deliver strong segmentation accuracy across diverse datasets while keeping the active parameter footprint small, an attractive property for clinical deployment where compute and annotation budgets are constrained.
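As a rough illustration of why routing can keep the active parameter footprint small, the sketch below selects the top-k experts from a vector of gate scores and computes what fraction of all parameters is used for one input. It is a generic sparse Mixture-of-Experts sketch, not the authors' implementation: the description above does not say how many experts are active per input, so `k`, the function names `top_k_route` and `active_fraction`, and all parameter counts are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts (hypothetical helper).

    Returns {expert_index: weight}, with weights renormalized to sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def active_fraction(shared_params, params_per_expert, num_experts, k):
    """Fraction of all parameters touched when only k experts run.

    Assumes (for illustration only) equally sized experts plus a shared trunk.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return active / total


# Illustrative numbers only: three experts, one active per input.
routing = top_k_route([2.0, 0.5, -1.0], k=2)
fraction = active_fraction(shared_params=10, params_per_expert=30,
                           num_experts=3, k=1)
```

With these made-up sizes, running one of three experts touches 40 of 100 parameter units; the real ratio depends on the actual architecture and routing rule.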
M4oE adopts a Mixture-of-Experts framework on top of a SwinUNet (a Swin Transformer encoder with a U-Net decoder). Modality-specific experts are initialized independently so that each learns features encoding the domain characteristics of its own modality, and a gating network produces weights that combine the expert outputs dynamically during fine-tuning. The authors evaluate the model across three modalities using three public abdominal and lesion segmentation datasets: FLARE22, AMOS2022, and ATLAS2023. On these benchmarks M4oE reports improvements of 3.45% over STU-Net-L, 5.11% over MED3D, and 11.93% over SAM-Med2D, while using only about 30% of the parameters of the comparison methods and reducing training time by roughly 7 hours. The full codebase (architecture, training, and inference scripts) is public on GitHub, but the repository ships no LICENSE file, so the code is all-rights-reserved by default even though the paper itself is CC-BY-4.0. The only released weights are a third-party pretrained Swin Transformer initialization (linked via Google Drive), not trained M4oE model weights, so users must train the model themselves, with the option to pretrain on custom datasets via masked autoencoding (MAE).
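The gating step described above, in which a gating network produces weights that combine expert outputs, can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes a single linear gate followed by a softmax and plain lists in place of feature maps; the names `gate_weights`, `combine_experts`, and `gate_matrix` are hypothetical, and the actual M4oE gate inside the SwinUNet may be structured differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(features, gate_matrix):
    """Assumed linear gate: one logit per expert, then softmax.

    features    -- pooled feature vector for one input
    gate_matrix -- one row of gate parameters per expert
    """
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate_matrix]
    return softmax(logits)


def combine_experts(expert_outputs, weights):
    """Weighted sum of per-expert output vectors of equal length."""
    length = len(expert_outputs[0])
    return [
        sum(w * out[j] for w, out in zip(weights, expert_outputs))
        for j in range(length)
    ]


# Illustrative values only: three experts, two-dimensional outputs.
features = [1.0, 0.0]
gate_matrix = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # untrained gate
weights = gate_weights(features, gate_matrix)  # uniform over the experts
mixed = combine_experts([[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]], weights)
```

An all-zero gate weights every expert equally; once the gate parameters are trained, the weights shift toward the experts whose modality matches the input.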
M4oE targets multi-organ and lesion segmentation tasks in clinical and research radiology, where labeled data are scarce and acquisition protocols vary widely across hospitals. Its expert-routing design is well suited to settings that must process several imaging modalities—such as multi-phase abdominal CT and MRI—under a single deployable model. Researchers building generalist medical segmentation pipelines benefit from the reduced parameter and training cost, and clinical teams gain a framework intended to remain robust when moved between institutions without retraining a separate model for each site.
As a MICCAI 2024 contribution, M4oE adds to the growing body of work showing that conditional computation and Mixture-of-Experts routing can address the modality and domain-shift problems that limit conventional medical segmentation networks. Its emphasis on efficiency, matching or exceeding larger baselines with a fraction of the parameters, reflects a broader push toward practical, deployable medical foundation models rather than parameter-heavy generalists. The model is relatively new and has been evaluated on only three datasets, so its generalization to additional modalities and larger multi-center cohorts remains to be established. The public (though unlicensed) code repository lets others reproduce the architecture and training recipe and extend the expert-routing approach to new clinical settings, but reusers should note that no trained M4oE weights are distributed and the code carries no open-source license.
Jiang, Y. & Shen, Y. (2024) M4oE: A Foundation Model for Medical Multimodal Image Segmentation with Mixture of Experts. International Conference on Medical Image Computing and Computer-Assisted Intervention.
DOI: 10.1007/978-3-031-72390-2_58