University of Oxford / GSK.ai / Amazon Web Services / University of Rochester / Tencent AI Lab / Shanghai Jiao Tong University / Westlake University
Multimodal, multidomain, multilingual medical foundation model that performs zero-shot clinical diagnosis and report generation from chest X-ray and CT images across English and Chinese.
M3FM (Multimodal, Multidomain, Multilingual Foundation Model) is a medical foundation model built for zero-shot clinical diagnosis across imaging modalities, disease domains, and languages. It was developed by researchers at the University of Oxford together with collaborators at GlaxoSmithKline, Amazon, the University of Rochester, Tencent, Shanghai Jiao Tong University, and Westlake University, and published in npj Digital Medicine in February 2025.
A central obstacle in clinical AI is the scarcity of labeled data for rare diseases and for languages other than English, which prevents conventional supervised models from generalizing to new conditions or populations. M3FM addresses this by learning a shared visual-textual representation space from large public medical corpora, so that the aligned visual features can be used to classify diseases and generate diagnostic reports for categories and languages never seen with explicit labels during training.
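To make the zero-shot idea concrete, here is a minimal sketch of classification in a shared image-text embedding space. It is not M3FM's code: the function names, the three-dimensional toy embeddings, the label prompts, and the temperature of 0.07 (a common CLIP-style default) are all illustrative assumptions. In a real system the vectors would come from the trained vision and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Score an image embedding against one text embedding per label.

    prompt_embs maps label -> text embedding (hypothetical values here).
    Returns the best label and a softmax distribution over all labels.
    """
    labels = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[l]) / temperature for l in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy embeddings standing in for encoder outputs (assumed, not real data).
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "pneumonia": [1.0, 0.0, 0.1],
    "no finding": [0.0, 1.0, 0.0],
}
label, probs = zero_shot_classify(image_emb, prompt_embs)
print(label, probs)
```

Because the labels are supplied as text at inference time, adding a new disease category, or a prompt written in another language that the text encoder supports, would only require a new entry in `prompt_embs` and no retraining of the classifier.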
The model is organized into two components: MultiMedCLIP, which aligns medical images with clinical text across domains and languages, and MultiMedLM, which generates multilingual diagnostic reports. Together they let a single pretrained system support both classification and report generation without task-specific retraining, positioning M3FM among the wave of vision-language foundation models that aim to make clinical AI portable across modalities and health systems.
MultiMedCLIP pairs a multidomain vision encoder (a ViT/CLIP backbone) with a multilingual text encoder, trained with a contrastive objective to align medical images and reports. MultiMedLM uses a BERT-Base backbone configured with six encoder layers and three decoder layers for multilingual report generation. Pretraining draws on English-centric public corpora, principally MIMIC-CXR (377,110 chest X-rays with 227,835 reports) and the COVID-19-CT-CXR collection, with Chinese supervision created by machine-translating English reports. On report generation, the model reaches 13.7 BLEU-4 in the zero-shot CXR-to-English setting, rising to 16.3 BLEU-4 with 10% labeled data and 20.3 BLEU-4 under full supervision; disease classification AUC ranges from roughly 0.793 to 0.983 depending on the dataset and training regime.
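The contrastive objective mentioned above can be illustrated with a small sketch. The description here only says that MultiMedCLIP is trained with "a contrastive objective"; the symmetric, CLIP-style InfoNCE form below, the temperature value, and the function names are assumptions made for illustration and may differ from the loss the authors actually used.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: pair k of images and texts is the positive,
    every other pairing in the batch is a negative (assumed formulation)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity of image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: matched pairs give a low loss, mismatched pairs a high one.
aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned, shuffled)
```

Minimizing a loss of this kind pulls each image embedding toward its paired report and away from the other reports in the batch, which is what produces a shared space in which images and text can be compared directly.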
M3FM is aimed at clinical settings where labeled training data is scarce or absent, such as rare diseases, emerging infectious outbreaks, and non-English-speaking populations. It can screen and classify chest X-ray and CT studies and draft diagnostic reports in English or Chinese, making it useful for radiology triage, decision support, and research into label-efficient medical AI. Because diagnosis and reporting share one pretrained backbone, the model can be adapted to new tasks with little or no additional annotation, lowering the barrier for deploying imaging AI in resource-limited health systems.
M3FM demonstrates that a single foundation model can transfer across imaging modalities, disease categories, and languages without retraining, directly tackling two persistent equity gaps in clinical AI: rare conditions and underserved languages. By open-sourcing its code under the Apache-2.0 license, the Oxford/GSK-led team makes the approach reproducible for the research community, though pretrained weights are not distributed and Chinese supervision relies on machine translation rather than native-language reports. As an early multimodal, multilingual medical foundation model, it contributes to the broader effort to build clinical AI systems that generalize beyond the well-resourced, English-only settings in which most models are trained.
Liu, F., et al. (2025) A multimodal multidomain multilingual medical foundation model for zero shot clinical diagnosis. npj Digital Medicine.
DOI: 10.1038/s41746-024-01339-7