RadFM

Shanghai Jiao Tong University / Shanghai AI Laboratory

Radiology foundation model that reads interleaved 2D and 3D scans with text for diagnosis, visual question answering, and report generation.

Released: August 2023

Parameters: 14 Billion

RadFM (Radiology Foundation Model) is a generalist multimodal foundation model for radiology that processes arbitrary combinations of 2D and 3D medical scans interleaved with natural language text. Unlike earlier medical vision-language models that were restricted to single 2D images (typically chest X-rays), RadFM accepts multiple images of mixed dimensionality within a single prompt and generates free-form text, enabling it to address modality recognition, disease diagnosis, visual question answering, report generation, and rationale diagnosis within one unified interface.

The model was developed by researchers at Shanghai Jiao Tong University and Shanghai Artificial Intelligence Laboratory. The original preprint was posted in August 2023, and the peer-reviewed version was published in Nature Communications in August 2025. Its central contribution is treating radiology as a visually conditioned text-generation problem at web scale, training across dozens of modalities and anatomical regions rather than specializing on a single body part or scanner type.

A key enabling artifact is MedMD, a large-scale medical multimodal dataset of roughly 16 million 2D and 3D scans paired with text and covering more than 5,000 diseases. The team contributed four new datasets toward this corpus and released a cleaned, radiology-focused subset (RadMD) for instruction tuning, positioning RadFM as one of the first openly released foundation models built to reason jointly over volumetric and planar medical imaging.

Key Features

Unified 2D and 3D handling: A 3D Vision Transformer with a Perceiver module encodes both planar and volumetric scans into a fixed 32-token representation per image, so CT, MRI, X-ray, and other modalities share one input pathway.
Interleaved multi-image prompts: The model accepts several images intermixed with text in a single query, mirroring how radiologists reason across prior studies and multiple views rather than over one image at a time.
Generalist task coverage: A single checkpoint performs modality recognition, closed- and open-ended diagnosis, visual question answering, report generation, and rationale generation without task-specific heads.
Open weights and data: The 14B-parameter checkpoint is released on HuggingFace under an MIT license, with training data tables and the RadMD subset published to support reproduction and downstream fine-tuning.
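To make the fixed-length encoding concrete, here is a minimal sketch of Perceiver-style resampling in plain Python. It is not RadFM's implementation: it assumes a single attention head, random stand-ins for the learned latent queries, no key/value projections, and a toy embedding width (`DIM`). The patch counts are illustrative too. The only point is that a fixed set of 32 queries yields 32 output tokens whether the input is a small 2D patch grid or a much longer 3D patch sequence.

```python
import math
import random

NUM_LATENTS = 32  # fixed visual-token count per image, as described above
DIM = 8           # toy embedding width (assumption; real models are far wider)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(patch_tokens, latents):
    """Single-head cross-attention: each latent query attends over all patches.

    Returns len(latents) vectors regardless of how many patch tokens come in.
    """
    scale = 1.0 / math.sqrt(len(latents[0]))
    out = []
    for query in latents:
        scores = [scale * sum(q * k for q, k in zip(query, patch))
                  for patch in patch_tokens]
        weights = softmax(scores)
        # Weighted average of patch vectors (values equal keys in this toy version).
        out.append([sum(w * patch[d] for w, patch in zip(weights, patch_tokens))
                    for d in range(len(query))])
    return out


def random_tokens(count, rng):
    """Random vectors standing in for patch embeddings or learned queries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


rng = random.Random(0)
latents = random_tokens(NUM_LATENTS, rng)   # stand-in for learned latent queries
xray_patches = random_tokens(196, rng)      # hypothetical 2D image: 14 x 14 patches
ct_patches = random_tokens(196 * 8, rng)    # hypothetical volume: 8 depth slabs

xray_tokens = resample(xray_patches, latents)
ct_tokens = resample(ct_patches, latents)
print(len(xray_tokens), len(ct_tokens))
```

Both inputs come out as 32 vectors of width `DIM`, which is what lets planar and volumetric scans share one downstream pathway.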

Technical Details

RadFM couples a 3D ViT vision encoder and Perceiver resampler to a MedLLaMA-13B language backbone (a medical-domain-adapted LLaMA-13B), for roughly 14 billion total parameters. Images are inserted into the text stream via special placeholder tokens whose embeddings are replaced by the visual tokens, so the autoregressive decoder attends over vision and language in a single sequence. The model is trained by visually conditioned generative pre-training on MedMD (~16M 2D/3D scans across 5,000+ diseases) and then instruction-tuned on RadMD, a cleaned set of about 3 million radiologic visual-language pairs. The authors contributed four new datasets (PMC-Inline, RP3D, PMC-CaseReport, and MPx), adding roughly 13 million 2D images and 615,000 3D scans. On the RadBench benchmark, which spans five radiology task categories, RadFM outperforms prior accessible multimodal models including OpenFlamingo, MedFlamingo, MedVInT, and GPT-4V on both automatic metrics and human evaluation.
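The placeholder mechanism described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released code: the placeholder string `<image_slot>`, the whitespace tokenizer, the `toy_text_embedding` function, and the tiny embedding width are all hypothetical. Only the 32 slots per image come from the description above.

```python
TOKENS_PER_IMAGE = 32         # visual tokens per image, as described above
PLACEHOLDER = "<image_slot>"  # hypothetical special-token name
DIM = 4                       # toy embedding width (assumption)


def toy_text_embedding(token):
    """Deterministic stand-in for a learned embedding-table lookup."""
    seed = sum(ord(ch) for ch in token)
    return [float((seed * (d + 1)) % 7) for d in range(DIM)]


def expand_prompt(segments):
    """Flatten interleaved ("text", str) / ("image", index) segments into tokens.

    Each image contributes TOKENS_PER_IMAGE placeholder tokens in place.
    """
    tokens = []
    for kind, value in segments:
        if kind == "text":
            tokens.extend(value.split())  # whitespace split, for illustration only
        else:
            tokens.extend([PLACEHOLDER] * TOKENS_PER_IMAGE)
    return tokens


def fuse(tokens, visual_tokens_per_image):
    """Swap each placeholder embedding for the next visual token, in order."""
    flat_visual = [vec for image in visual_tokens_per_image for vec in image]
    if tokens.count(PLACEHOLDER) != len(flat_visual):
        raise ValueError("placeholder count does not match visual token count")
    visual_iter = iter(flat_visual)
    return [next(visual_iter) if tok == PLACEHOLDER else toy_text_embedding(tok)
            for tok in tokens]


segments = [
    ("text", "Compare the prior study"),
    ("image", 0),
    ("text", "with the current scan"),
    ("image", 1),
    ("text", "and describe any change."),
]
# Constant vectors stand in for the vision encoder's output for each image.
visual = [
    [[1.0] * DIM for _ in range(TOKENS_PER_IMAGE)],
    [[2.0] * DIM for _ in range(TOKENS_PER_IMAGE)],
]

tokens = expand_prompt(segments)
embeddings = fuse(tokens, visual)
print(len(tokens), len(embeddings))
```

Because the images occupy ordinary positions in the sequence, the decoder needs no separate fusion module: the same causal attention covers text and visual slots, and any number of images can be interleaved so long as the placeholder and visual-token counts agree.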

Applications

RadFM targets radiologists, clinical researchers, and medical-AI developers who need a single model that reasons over heterogeneous imaging studies. Potential uses include drafting preliminary radiology reports, answering clinical questions grounded in a patient's scans, triaging across modalities, and serving as a pretrained backbone for fine-tuning on institution-specific tasks. Because it ingests 3D volumes directly, it is particularly relevant for CT and MRI workflows that earlier 2D-only vision-language models could not address.

Impact

RadFM was among the first openly released generalist radiology foundation models to natively span 2D and 3D imaging with interleaved image-text prompting. Its publicly available MedMD/RadMD data and MIT-licensed weights have made it a common baseline and starting point for subsequent medical multimodal research. Its peer-reviewed publication in Nature Communications reflects broader adoption of the generalist, instruction-tuned paradigm in medical imaging. As with all such models, outputs are not clinically validated for autonomous diagnosis, performance varies across underrepresented modalities and populations, and human expert oversight remains essential before any clinical use.

Citation

Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data

Wu, C., et al. (2025) Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nature Communications.

DOI: 10.1038/s41467-025-62385-7

Recent citations

Papers that recently cited this model.

MonteRET: AI Agent Enhancing Multimodal LLMs with Multi-granularity Knowledge Retrieval for Chest CT Report Generation
Yi Lin, Yihao Ding, E. Benishay, et al.
Jul 2026
0 · Influential
Language-Guided Segmentation of Medical Images: A Review of Foundation Models
Saqib Qamar
Bioengineering · Jul 2026
0
Illusion of competence: vision–language models provide confident but inaccurate explanations in cytological diagnostics
I. Kukuljan, Muhammed Furkan Dasdelen, Julia Schäfer, et al.
Scientific Reports · Jul 2026
0

Top citations

The most-cited papers that cite this model.

Generalist foundation models from a multimodal dataset for 3D computed tomography.
I. Hamamci, Sezgin Er, Furkan Almas, et al.
Nature Biomedical Engineering · Mar 2024
177
M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
Fan Bai, Yuxin Du, Tiejun Huang, et al.
arXiv.org · Mar 2024
159
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
L. Blankemeier, J. Cohen, Ashwin Kumar, et al.
Nature · Jun 2024
128
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai, Jike Zhong, Ming Li, et al.
IEEE Transactions on Medical Imaging · Mar 2025
126
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Binxu Li, Tian Yan, Yuanting Pan, et al.
Conference on Empirical Methods in Natural Language Processing · Jul 2024
120 · Influential

Citations

Total Citations: 263

Influential: 24

References: 72

GitHub

Stars: 561

Forks: 65

Open Issues: 23

Contributors: 2

Last Push: 1y ago

Language: Python

License: MIT

HuggingFace

Downloads: 0

Likes: 20

Last Modified: 2y ago

Fields of citing research

Medicine: 98%
Computer Science: 96%
Engineering: 20%
Biology: 2%
Linguistics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

84 · Open

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 62

Model Openness Framework: Class III, Open Model

Resources

GitHub Repository, Research Paper, Official Website, HuggingFace Model, Dataset
