bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Imaging foundation models
Imaging · Language model

M3D

Beijing Academy of Artificial Intelligence

A multimodal large language model for 3D medical imaging, handling retrieval, report generation, VQA, positioning, and segmentation on CT volumes.

Released: March 2024

M3D is a multimodal large language model (MLLM) family built for 3D medical image analysis, addressing a gap left by most medical vision-language models that operate only on 2D slices or photographs. Volumetric modalities such as CT and MRI carry rich spatial context across hundreds of slices, and clinical reasoning depends on relationships that span the full volume. M3D extends the instruction-following MLLM paradigm to this 3D setting, letting a single model read a whole scan and respond to natural-language queries about it.

M3D was developed by the Distributed and Collaborative AI Lab at the Beijing Academy of Artificial Intelligence (BAAI) and released as a preprint in March 2024 by Fan Bai, Yuxin Du, Tiejun Huang, Max Q.-H. Meng, and Bo Zhao. The project couples three pieces: M3D-Data, a large-scale 3D multimodal dataset; M3D-LaMed, the model itself; and M3D-Bench, an evaluation suite spanning eight tasks. Together they form one of the first end-to-end stacks for general-purpose 3D radiology understanding. Code and weights are released openly, though the training data is now access-restricted, as noted below.

What makes M3D notable is breadth from a single checkpoint. Rather than training a separate network per task, the model performs eight tasks, all driven by language prompts: image-text retrieval, report generation, closed-ended and open-ended visual question answering, positioning (referring expression comprehension and referring expression generation), semantic segmentation, and referring expression segmentation.

#Key Features

  • 3D-native understanding: A 3D vision transformer encoder (M3D-CLIP) ingests entire CT/MRI volumes, preserving cross-slice spatial relationships that 2D slice-based models discard.
  • Unified multi-task model: One M3D-LaMed checkpoint covers eight distinct tasks, from report generation to referring expression segmentation, selected via natural-language instructions.
  • Promptable segmentation: A dedicated segmentation module lets the model produce voxel masks in response to text queries, linking language references to precise anatomical regions.
  • Large-scale dataset: M3D-Data contributes 120K image-text pairs and 662K instruction-response pairs, among the largest 3D multimodal medical resources at release, though availability is now constrained (see below).
  • Open weights and code: Model checkpoints (Apache-2.0) and the BAAI-DCAI/M3D code stack including M3D-Bench (MIT) are publicly available. The training data is more restricted: the M3D-Cap dataset is currently under a DMCA takedown and inaccessible, and the Radiopaedia-derived report data is limited to non-commercial use.
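To make the 3D-native point concrete, the sketch below counts how many tokens a 3D vision transformer would see when a volume is cut into non-overlapping 3D patches, compared with a 2D model looking at a single slice. The volume shape (32, 256, 256), the patch size (4, 16, 16), and the helper `count_patch_tokens` are illustrative assumptions, not values or code taken from this page.

```python
from math import prod


def count_patch_tokens(volume_shape, patch_shape):
    """Return the number of non-overlapping 3D patches (tokens) in a volume.

    Both arguments are (depth, height, width) tuples. This is a hypothetical
    helper for illustration, not part of the M3D code base.
    """
    if len(volume_shape) != 3 or len(patch_shape) != 3:
        raise ValueError("expected (depth, height, width) for both arguments")
    for dim, patch in zip(volume_shape, patch_shape):
        if dim % patch != 0:
            raise ValueError(f"dimension {dim} is not divisible by patch size {patch}")
    return prod(dim // patch for dim, patch in zip(volume_shape, patch_shape))


# Assumed, illustrative sizes in (depth, height, width) order.
volume_shape = (32, 256, 256)
patch_shape = (4, 16, 16)

# Tokens covering the whole volume, each spanning several adjacent slices.
tokens_3d = count_patch_tokens(volume_shape, patch_shape)

# Tokens a 2D model would get from one slice with 16x16 patches.
tokens_per_slice = count_patch_tokens((1, 256, 256), (1, 16, 16))

print(tokens_3d, tokens_per_slice)
```

Because each 3D patch spans several adjacent slices, every token already carries some cross-slice context, which a per-slice 2D tokenization cannot provide.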

#Technical Details

M3D-LaMed connects a pretrained 3D vision transformer encoder (M3D-CLIP, trained contrastively on the image-text pairs) to a large language model backbone through a projection layer, following the visual-instruction-tuning recipe popularized by 2D MLLMs but adapted to volumetric input. Two backbone variants are released: M3D-LaMed-Phi-3-4B (built on Microsoft's ~4B-parameter Phi-3) and M3D-LaMed-Llama-2-7B (built on Meta's 7B Llama-2). A promptable segmentation component, conditioned on the model's hidden states, outputs 3D masks for spatial tasks. Training proceeds from contrastive vision-language pretraining through instruction tuning on the 662K instruction-response pairs of M3D-Data. The authors evaluate on M3D-Bench across all eight tasks, reporting that the Phi-3 variant is both lighter and competitive with or stronger than the larger Llama-2 configuration on several benchmarks.
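The contrastive pretraining step can be illustrated with a generic CLIP-style symmetric loss. This is a minimal pure-Python sketch of that general objective, not the authors' implementation: the page says only that M3D-CLIP is trained contrastively on image-text pairs, so the temperature of 0.07, the toy two-dimensional embeddings, and all function names are assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against a single correct index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images matches pair k of texts."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # Similarity matrix: rows are images, columns are texts.
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy embeddings (assumed): aligned pairs versus deliberately swapped pairs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is small when each volume embedding sits closest to its own report embedding and large when the pairs are mismatched, which is the property that later makes image-text retrieval possible with the same encoder.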

#Applications

M3D targets radiology and clinical research workflows that involve 3D scans: drafting preliminary radiology reports from CT volumes, answering structured questions about findings, retrieving similar cases by image or text, localizing and segmenting anatomy or lesions from free-text descriptions, and supporting interactive review where a clinician queries a scan in natural language. Researchers building medical vision-language systems also benefit from the open benchmark and, where it remains accessible, the dataset as a shared foundation for training and comparison.
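Retrieving similar cases by image or text typically reduces to ranking stored embeddings by similarity to a query embedding. The sketch below shows that ranking step with cosine similarity in pure Python. The three-dimensional vectors, the report identifiers, and the function names are hypothetical, and in practice the embeddings would come from the model's image and text encoders rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query, emb), cid) for cid, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]


# Hypothetical embedding of a query scan and of three stored reports.
query_embedding = [0.9, 0.1, 0.0]
report_embeddings = {
    "report_a": [1.0, 0.0, 0.0],
    "report_b": [0.0, 1.0, 0.0],
    "report_c": [0.5, 0.5, 0.0],
}

ranking = rank_by_similarity(query_embedding, report_embeddings)
print(ranking)
```

The same function works in the other direction (text query, scan candidates) as long as both modalities are embedded into one shared space.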

#Impact

By extending general-purpose multimodal LLMs to native 3D medical imaging and releasing data, models, and benchmark together, M3D helped catalyze a wave of 3D medical vision-language research, with subsequent systems explicitly building on or benchmarking against it. Its eight-task formulation offered an early template for unified 3D radiology assistants. Limitations remain typical of the class: training data is dominated by CT (with report sources such as Radiopaedia carrying non-commercial terms), evaluation is largely automatic rather than prospective clinical validation, and generated reports require expert oversight before any clinical use.

Citation

M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Preprint

Bai, F., et al. (2024) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv.org.

DOI: 10.48550/arXiv.2404.00578

Citations

Total Citations: 159
Influential: 42
References: 68

GitHub

Stars: 442
Forks: 32
Open Issues: 29
Contributors: 2
Last Push: 1y ago
Language: Python
License: MIT

HuggingFace

Downloads: 1K
Likes: 8
Last Modified: 1y ago
Pipeline: text-generation

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 77 (Open)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 55
Model Openness Framework: Unclassified (missing required components)

Tags

ct, image_text_retrieval, instruction_tuning, language_model, mri, multimodal, radiology, report_generation, segmentation, transformer, vision_transformer, visual_question_answering

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset