M3D

Beijing Academy of Artificial Intelligence

Multimodal large language model for 3D medical imaging that handles report generation, visual question answering, and segmentation on CT volumes.

Released: March 2024

M3D is a multimodal large language model (MLLM) family built for 3D medical image analysis, addressing a gap left by most medical vision-language models that operate only on 2D slices or photographs. Volumetric modalities such as CT and MRI carry rich spatial context across hundreds of slices, and clinical reasoning depends on relationships that span the full volume. M3D extends the instruction-following MLLM paradigm to this 3D setting, letting a single model read a whole scan and respond to natural-language queries about it.

Developed by the Distributed and Collaborative AI Lab at the Beijing Academy of Artificial Intelligence (BAAI) and released as a preprint in March 2024 by Fan Bai, Yuxin Du, Tiejun Huang, Max Q.-H. Meng, and Bo Zhao, the project couples three pieces: M3D-Data, a large-scale 3D multimodal dataset; M3D-LaMed, the model itself; and M3D-Bench, an evaluation suite spanning eight tasks. Together they form one of the first end-to-end stacks for general-purpose 3D radiology understanding, with code and weights released openly (though the training data is now access-restricted, as noted below).

What makes M3D notable is breadth from a single checkpoint. Rather than training a separate network per task, the model performs image-text retrieval, report generation, closed- and open-ended visual question answering, referring expression comprehension and generation, positioning, semantic segmentation, and referring expression segmentation, all driven by language prompts.

Key Features

3D-native understanding: A 3D vision transformer encoder (M3D-CLIP) ingests entire CT/MRI volumes, preserving cross-slice spatial relationships that 2D slice-based models discard.
Unified multi-task model: One M3D-LaMed checkpoint covers eight distinct tasks, from report generation to referring expression segmentation, selected via natural-language instructions.
Promptable segmentation: A dedicated segmentation module lets the model produce voxel masks in response to text queries, linking language references to precise anatomical regions.
Large-scale dataset: M3D-Data contributes 120K image-text pairs and 662K instruction-response pairs, among the largest 3D multimodal medical resources at release, though availability is now constrained (see below).
Open weights and code: Model checkpoints (Apache-2.0) and the BAAI-DCAI/M3D code stack including M3D-Bench (MIT) are publicly available. The training data is more restricted: the M3D-Cap dataset is currently under a DMCA takedown and inaccessible, and the Radiopaedia-derived report data is limited to non-commercial use.
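To make the "3D-native" point concrete, the sketch below counts how many tokens a 3D vision transformer would produce from a single volume, compared with one 2D slice. The volume shape (32×256×256) and patch size (4×16×16) are illustrative assumptions, not values stated on this page, and `patch_grid` and `num_tokens` are hypothetical helper names.

```python
from math import prod


def patch_grid(volume_shape, patch_shape):
    """Return the number of patches along each axis (depth, height, width)."""
    if len(volume_shape) != 3 or len(patch_shape) != 3:
        raise ValueError("expected 3D shapes (depth, height, width)")
    grid = []
    for size, patch in zip(volume_shape, patch_shape):
        if size % patch != 0:
            raise ValueError(f"axis of size {size} is not divisible by patch {patch}")
        grid.append(size // patch)
    return tuple(grid)


def num_tokens(volume_shape, patch_shape):
    """Total number of patch tokens the encoder would receive."""
    return prod(patch_grid(volume_shape, patch_shape))


# Assumed shapes for illustration only.
volume = (32, 256, 256)   # depth, height, width in voxels
patch = (4, 16, 16)       # 3D patch covering several adjacent slices

print(patch_grid(volume, patch))                 # (8, 16, 16)
print(num_tokens(volume, patch))                 # 2048
print(num_tokens((1, 256, 256), (1, 16, 16)))    # 256 tokens for a single 2D slice
```

Because each 3D patch spans several adjacent slices, cross-slice context is already present in every token, which a per-slice 2D encoder cannot provide.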

Technical Details

M3D-LaMed connects a pretrained 3D vision transformer encoder (M3D-CLIP, trained contrastively on the image-text pairs) to a large language model backbone through a projection layer, following the visual-instruction-tuning recipe popularized by 2D MLLMs but adapted to volumetric input. Two backbone variants are released: M3D-LaMed-Phi-3-4B (built on Microsoft's ~4B-parameter Phi-3) and M3D-LaMed-Llama-2-7B (built on Meta's 7B Llama-2). A promptable segmentation component, conditioned on the model's hidden states, outputs 3D masks for spatial tasks. Training proceeds from contrastive vision-language pretraining through instruction tuning on the 662K instruction-response pairs of M3D-Data. The authors evaluate on M3D-Bench across all eight tasks, reporting that the Phi-3 variant is both lighter and competitive with or stronger than the larger Llama-2 configuration on several benchmarks.
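The contrastive pretraining step can be illustrated with a small sketch. The page says only that M3D-CLIP is "trained contrastively"; the symmetric CLIP-style cross-entropy below, the temperature of 0.07, and the toy 2-dimensional embeddings are all assumptions made for illustration, and the function names are hypothetical rather than taken from the M3D code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k of image_embs is assumed to match pair k of text_embs.
    """
    if len(image_embs) != len(text_embs):
        raise ValueError("need one text embedding per image embedding")
    n = len(image_embs)
    logits = [[cosine_similarity(i, t) / temperature for t in text_embs]
              for i in image_embs]
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy embeddings: matched pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]

aligned = symmetric_contrastive_loss(images, texts)
shuffled = symmetric_contrastive_loss(images, list(reversed(texts)))
print(aligned < shuffled)  # True: correctly paired batches give a lower loss
```

Minimizing a loss of this kind pulls each volume embedding toward its own report and away from the other reports in the batch, which is what later makes the encoder useful for retrieval and as the visual front end of the language model.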

Applications

M3D targets radiology and clinical research workflows that involve 3D scans: drafting preliminary radiology reports from CT volumes, answering structured questions about findings, retrieving similar cases by image or text, localizing and segmenting anatomy or lesions from free-text descriptions, and supporting interactive review where a clinician queries a scan in natural language. Researchers building medical vision-language systems can also use the released code and benchmark, and the dataset where it remains accessible, as a shared foundation for training and comparison.
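Case retrieval by image or text typically reduces to nearest-neighbour search in a shared embedding space. The following is a minimal sketch under that assumption: the embeddings are made-up 3-dimensional vectors standing in for encoder outputs, and `retrieve_top_k` is a hypothetical helper, not part of the M3D code.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def retrieve_top_k(query_emb, case_embs, k=2):
    """Return the ids of the k cases most similar to the query embedding."""
    query = normalize(query_emb)
    scored = []
    for case_id, emb in case_embs.items():
        score = sum(a * b for a, b in zip(query, normalize(emb)))
        scored.append((score, case_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case_id for _, case_id in scored[:k]]


# Made-up embeddings for three stored scans and one text query.
cases = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [0.0, 1.0, 0.0],
    "case_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

print(retrieve_top_k(query, cases, k=2))  # ['case_a', 'case_c']
```

The same function serves text-to-image and image-to-image lookup, since only the source of the query embedding changes.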

Impact

By extending general-purpose multimodal LLMs to native 3D medical imaging and releasing data, models, and benchmark together, M3D helped catalyze a wave of 3D medical vision-language research, with subsequent systems explicitly building on or benchmarking against it. Its eight-task formulation offered an early template for unified 3D radiology assistants. Limitations remain typical of the class: training data is dominated by CT (with report sources such as Radiopaedia carrying non-commercial terms), evaluation is largely automatic rather than prospective clinical validation, and generated reports require expert oversight before any clinical use.

Citation

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Preprint

Bai, F., et al. (2024) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv.org.

DOI: 10.48550/arXiv.2404.00578

Recent citations

Papers that recently cited this model.

Multi-LLM Collaborative MRI Report Generation for Visual Instruction Tuning in Brain Oncology
Sinyoung Ra, Jonghun Kim, Hyunjin Park
Jul 2026 · 0 citations

MonteRET: AI Agent Enhancing Multimodal LLMs with Multi-granularity Knowledge Retrieval for Chest CT Report Generation
Yi Lin, Yihao Ding, E. Benishay, et al.
Jul 2026 · 0 citations · Influential

Towards Enhancing 3D Spatial Reasoning in Medical Multimodal Large Language Models
Zhuoyuan Fu, Zeshang Li, Yiqiong Zhang, et al.
Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, X. Liu, et al.
Information Fusion · May 2024 · 116 citations

Next-generation agentic AI for transforming healthcare
Nalan Karunanayake
Informatics and Health · 2025 · 113 citations

A Survey on Benchmarks of Multimodal Large Language Models
Jian Li, Weiheng Lu
arXiv.org · Aug 2024 · 86 citations

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin, Yifan Zhu, Xin Mei, et al.
Information Fusion · Aug 2024 · 85 citations · Influential

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts
Qiuhui Chen, Xinyue Hu, Zirui Wang, et al.
Asian Conference on Computer Vision · May 2023 · 85 citations

Citations

Total Citations: 172
Influential: 43
References: 68

GitHub

Stars: 454
Forks: 33
Open Issues: 29
Contributors: 2
Last Push: 1y ago
Language: Python
License: MIT

HuggingFace

Downloads: 924
Likes: 8
Last Modified: 2y ago
Pipeline: text-generation

Fields of citing research

Computer Science: 98%
Medicine: 91%
Engineering: 26%
Linguistics: 3%
Environmental Science: 1%
Biology: 1%
Mathematics: 1%
Physics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Openness score: 77 (Open)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 55

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository
Research Paper
HuggingFace Model
Dataset


