M3FM

University of Oxford / GSK.ai / Amazon Web Services / University of Rochester / Tencent AI Lab / Shanghai Jiao Tong University / Westlake University

Multimodal medical imaging foundation model for zero-shot clinical diagnosis and report generation from chest X-ray and CT in English and Chinese.

Released: February 2025

M3FM (Multimodal, Multidomain, Multilingual Foundation Model) is a medical foundation model built for zero-shot clinical diagnosis across imaging modalities, disease domains, and languages. It was developed by researchers at the University of Oxford together with collaborators at GlaxoSmithKline, Amazon, the University of Rochester, Tencent, Shanghai Jiao Tong University, and Westlake University, and published in npj Digital Medicine in February 2025.

A central obstacle in clinical AI is the scarcity of labeled data for rare diseases and for languages other than English, which prevents conventional supervised models from generalizing to new conditions or populations. M3FM addresses this by learning a shared visual-textual representation space from large public medical corpora, so that the aligned visual features can be used to classify diseases and generate diagnostic reports for categories and languages never seen with explicit labels during training.

The model is organized into two components: MultiMedCLIP, which aligns medical images with clinical text across domains and languages, and MultiMedLM, which generates multilingual diagnostic reports. Together they let a single pretrained system support both classification and report generation without task-specific retraining, which places M3FM among the vision-language foundation models intended to make clinical AI portable across modalities and health systems.

Key Features

Zero-shot clinical diagnosis: Aligned visual representations from MultiMedCLIP can be used directly for disease classification on unseen categories, avoiding the need for labeled examples of every target condition.
Multidomain imaging: Operates across chest X-ray (CXR) and computed tomography (CT), covering 16 diseases including 14 non-infectious and 2 infectious conditions.
Multilingual reporting: MultiMedLM produces diagnostic reports in both English and Chinese, targeting the data scarcity that hinders non-English clinical AI.
Contrastive vision-language alignment: A CLIP-style objective maps images and reports into a shared latent space, enabling cross-modal and cross-lingual transfer.
Label-efficient adaptation: Performance improves as labeled data is added. For CXR-to-English report generation, BLEU-4 rises from 13.7 in the zero-shot setting to 16.3 with 10% labeled data and 20.3 under full supervision.
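
As a rough illustration of the zero-shot classification idea described above, the sketch below scores one image embedding against text-prompt embeddings by cosine similarity and turns the scores into probabilities with a softmax. It is not taken from the M3FM code: the function names, the toy three-dimensional vectors, and the temperature value of 0.07 are all assumptions made for the example, and a real system would obtain the embeddings from the trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Return {label: probability} for one image embedding.

    prompt_embs maps each candidate label to the embedding of a text
    prompt describing it. Hypothetical helper, not the M3FM API.
    """
    labels = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[label]) / temperature
              for label in labels]
    # Numerically stable softmax over the candidate labels.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings standing in for encoder outputs (assumed values).
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "pneumonia": [1.0, 0.0, 0.1],
    "no finding": [0.0, 1.0, 0.0],
}

probs = zero_shot_classify(image_emb, prompt_embs)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Because the label set is just a dictionary of prompts, a new disease category can be added by supplying one more prompt embedding, with no labeled images for that category.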

Technical Details

MultiMedCLIP pairs a multidomain vision encoder (a ViT/CLIP backbone) with a multilingual text encoder, trained with a contrastive objective to align medical images and reports. MultiMedLM uses a BERT-Base backbone configured with six encoder layers and three decoder layers for multilingual report generation. Pretraining draws on English-centric public corpora, principally MIMIC-CXR (377,110 chest X-rays with 227,835 reports) and the COVID-19-CT-CXR collection, with Chinese supervision created by machine-translating English reports. On report generation, the model reaches 13.7 BLEU-4 in the zero-shot CXR-to-English setting, rising to 16.3 BLEU-4 with 10% labeled data and 20.3 BLEU-4 under full supervision; disease classification AUC ranges from roughly 0.793 to 0.983 depending on the dataset and training regime.
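
The contrastive alignment mentioned for MultiMedCLIP can be sketched with the symmetric loss commonly used in CLIP-style training: within a batch, each image should score highest against its own report and each report against its own image. The code below is a minimal pure-Python sketch under that assumption, not the authors' implementation; `clip_style_loss`, the temperature of 0.07, and the toy two-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same
    study; every other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Correctly paired toy embeddings versus deliberately swapped ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]],
                           [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is close to zero when pairs line up and large when they are swapped, which is the signal that pulls matching images and reports together in the shared space.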

Applications

M3FM is aimed at clinical settings where labeled training data is scarce or absent, such as rare diseases, emerging infectious outbreaks, and non-English-speaking populations. It can screen and classify chest X-ray and CT studies and draft diagnostic reports in English or Chinese, making it useful for radiology triage, decision support, and research into label-efficient medical AI. Because diagnosis and reporting share one pretrained backbone, the model can be adapted to new tasks with little or no additional annotation, lowering the barrier for deploying imaging AI in resource-limited health systems.

Impact

M3FM demonstrates that a single foundation model can transfer across imaging modalities, disease categories, and languages without retraining, directly tackling two persistent equity gaps in clinical AI: rare conditions and underserved languages. By open-sourcing its code under the Apache-2.0 license, the Oxford/GSK-led team makes the approach reproducible for the research community, though pretrained weights are not distributed and Chinese supervision relies on machine translation rather than native-language reports. As an early multimodal, multilingual medical foundation model, it contributes to the broader effort to build clinical AI systems that generalize beyond the well-resourced, English-only settings in which most models are trained.

Citation

A multimodal multidomain multilingual medical foundation model for zero shot clinical diagnosis

Liu, F., et al. (2025) A multimodal multidomain multilingual medical foundation model for zero shot clinical diagnosis. npj Digital Medicine.

DOI: 10.1038/s41746-024-01339-7

Recent citations

Papers that recently cited this model.

Generative Artificial Intelligence and Large Language Models in Clinical Oncology
Yunfang Yu, Zhenhui Zhao, Zehua Wang, et al.
MedComm · Jun 2026 · Citations: 0

BRIDGE: benchmarking large language models for understanding real-world clinical practice texts
Jiageng Wu, Bowen Gu, Ren Zhou, et al.
Nature Biomedical Engineering · Jun 2026 · Citations: 0

Deep learning advancements for cardiovascular diseases (CVDs) diagnosis: Imaging modalities, challenges, and future perspectives
Inayatul Haq, Haomin Liang, Ke Zeng, et al.
Biomedical Signal Processing and Control · Jun 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Jiageng Wu, B. Gu, Ren Zhou, et al.
arXiv.org · Apr 2025 · Citations: 16

Keyword-based AI assistance in the generation of radiology reports: A pilot study
Fei Dong, Shouping Nie, Manling Chen, et al.
npj Digital Medicine · Aug 2025 · Citations: 10

Multimodal large language models in medical research and clinical practice: Development, applications, challenges and future
Peng Jun Xu, Shuang Kan, Jing Jin, et al.
Neurocomputing · Oct 2025 · Citations: 5

Automated Classification of Public Transport Complaints via Text Mining Using LLMs and Embeddings
Daniyar Rakhimzhanov, Saule Belginova, D. Yedilkhan
Inf. · Jul 2025 · Citations: 5

When Mathematical Methods Meet Artificial Intelligence and Mobile Edge Computing
Yuzhu Liang, Xiaotong Bi, Ruihan Shen, et al.
Mathematics · May 2025 · Citations: 5

Citations

Total Citations: 38
Influential: 1
References: 29

GitHub

Stars: 19
Forks: 3
Open Issues: 1
Contributors: 1
Last Push: 1y ago
Language: Python
License: Apache-2.0

Fields of citing research

Computer Science: 95%
Medicine: 84%
Engineering: 24%
Linguistics: 8%
Biology: 5%
Political Science: 3%
Materials Science: 3%
Economics: 3%

Share of papers citing this model.

Openness

bio.rodeo openness: 60 (Partial)
Usability (can I run it?): 64
Reproducibility (can I retrain it?): 55

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository
Research Paper
