Beijing Academy of Artificial Intelligence
A multimodal large language model for 3D medical imaging, handling retrieval, report generation, VQA, positioning, and segmentation on CT volumes.
M3D is a multimodal large language model (MLLM) family built for 3D medical image analysis, addressing a gap left by most medical vision-language models that operate only on 2D slices or photographs. Volumetric modalities such as CT and MRI carry rich spatial context across hundreds of slices, and clinical reasoning depends on relationships that span the full volume. M3D extends the instruction-following MLLM paradigm to this 3D setting, letting a single model read a whole scan and respond to natural-language queries about it.
M3D was developed by the Distributed and Collaborative AI Lab at the Beijing Academy of Artificial Intelligence (BAAI) and released as a preprint in March 2024 by Fan Bai, Yuxin Du, Tiejun Huang, Max Q.-H. Meng, and Bo Zhao. The project couples three pieces: M3D-Data, a large-scale 3D multimodal dataset; M3D-LaMed, the model itself; and M3D-Bench, an evaluation suite spanning eight tasks. Together they form one of the first end-to-end stacks for general-purpose 3D radiology understanding, with code and weights released openly, though access to the training data is now restricted.
What makes M3D notable is breadth from a single checkpoint. Rather than training a separate network per task, the model performs image-text retrieval, report generation, closed- and open-ended visual question answering, positioning (referring expression comprehension and generation), semantic segmentation, and referring expression segmentation, all driven by language prompts.
M3D-LaMed connects a pretrained 3D vision transformer encoder (M3D-CLIP, trained contrastively on the image-text pairs of M3D-Data) to a large language model backbone through a projection layer, following the visual-instruction-tuning recipe popularized by 2D MLLMs but adapted to volumetric input. Two backbone variants are released: M3D-LaMed-Phi-3-4B (built on Microsoft's ~4B-parameter Phi-3) and M3D-LaMed-Llama-2-7B (built on Meta's 7B Llama-2). A promptable segmentation component, conditioned on the model's hidden states, outputs 3D masks for spatial tasks. Training proceeds from contrastive vision-language pretraining through instruction tuning on the 662K instruction-response pairs of M3D-Data. The authors evaluate on M3D-Bench across all eight tasks and report that the lighter Phi-3 variant is competitive with or stronger than the larger Llama-2 configuration on several benchmarks.
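The contrastive pretraining stage can be illustrated with a small sketch. The text above says only that M3D-CLIP is trained contrastively on image-text pairs; the symmetric cross-entropy form, the temperature of 0.07, and the toy 2-dimensional embeddings below are assumptions borrowed from common CLIP-style setups, not details confirmed for M3D.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: pair i of images and texts is the positive, all others are negatives.

    The temperature value is an assumed default, not a setting taken from M3D.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine(image_i, text_j) / temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: same idea on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy batch of two hypothetical scan/report embedding pairs.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = symmetric_contrastive_loss(aligned_images, aligned_texts)
loss_shuffled = symmetric_contrastive_loss(aligned_images, shuffled_texts)
print(loss_aligned < loss_shuffled)
```

Matched pairs give a lower loss than mismatched ones, which is the signal that pulls a scan embedding toward its own report and away from the others in the batch.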
M3D targets radiology and clinical research workflows that involve 3D scans: drafting preliminary radiology reports from CT volumes, answering structured questions about findings, retrieving similar cases by image or text, localizing and segmenting anatomy or lesions from free-text descriptions, and supporting interactive review where a clinician queries a scan in natural language. Researchers building medical vision-language systems also benefit from the released dataset and benchmark as a shared foundation for training and comparison.
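Case retrieval of the kind described above is typically a nearest-neighbour search in a shared embedding space. The following is a minimal sketch under that assumption; the case identifiers, the 3-dimensional vectors, and the `retrieve_top_k` helper are hypothetical and stand in for embeddings that a contrastively trained encoder would produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, index, k=2):
    """Rank stored cases by similarity to the query and return the k best (id, score) pairs."""
    scored = [
        (case_id, cosine_similarity(query_emb, emb))
        for case_id, emb in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical index: case id -> embedding of a stored scan or report.
case_index = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [0.1, 0.9, 0.1],
    "case_c": [0.7, 0.3, 0.1],
}

# Hypothetical embedding of a text or image query.
query = [1.0, 0.2, 0.0]

top_cases = retrieve_top_k(query, case_index, k=2)
top_ids = [case_id for case_id, _ in top_cases]
print(top_ids)
```

Because text and image embeddings live in the same space, the same ranking function serves text-to-image and image-to-text lookups; only the source of `query` changes.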
By extending general-purpose multimodal LLMs to native 3D medical imaging and releasing data, models, and benchmark together, M3D helped catalyze a wave of 3D medical vision-language research, with subsequent systems explicitly building on or benchmarking against it. Its eight-task formulation offered an early template for unified 3D radiology assistants. Limitations remain typical of the class: training data is dominated by CT (with report sources such as Radiopaedia carrying non-commercial terms), evaluation is largely automatic rather than prospective clinical validation, and generated reports require expert oversight before any clinical use.
Bai, F., et al. (2024) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv.org.
DOI: 10.48550/arXiv.2404.00578