MAIRA-2

Microsoft Research multimodal LLM for grounded chest X-ray report generation, localizing each described finding with bounding boxes on the image.

Released: June 2024

Parameters: 7 billion

MAIRA-2 is a radiology-specific multimodal large language model from Microsoft Research, introduced in June 2024, that generates findings sections of chest X-ray reports directly from imaging and clinical context. Its defining contribution is grounded report generation: alongside the narrative text, MAIRA-2 emits bounding boxes that localize each described finding on the frontal image, tying the language of a report to concrete spatial evidence. This addresses a long-standing trust and verifiability problem in automated radiology reporting, where free-text generators produce plausible prose without indicating where on the image a finding was observed.

Most prior report-generation systems consume a single image and emit unstructured text, leaving clinicians unable to check whether a stated abnormality corresponds to a real region. MAIRA-2 instead conditions on a richer, more realistic reporting context—the current frontal view, an optional lateral view, a prior frontal image and its report, and structured indication, technique, and comparison fields—mirroring how radiologists actually work. The authors also formalize and benchmark the grounded reporting task and introduce RadFact, an LLM-based metric that scores report correctness and completeness sentence by sentence.
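As a rough illustration of what sentence-by-sentence scoring can look like, the sketch below computes a correctness score (share of generated sentences supported by the reference report) and a completeness score (share of reference sentences supported by the generated report). It is not the RadFact implementation: the function names are hypothetical, and `naive_entails` is a deliberately crude string-matching stand-in for the LLM judgment that RadFact relies on.

```python
from typing import Callable, List, Tuple


def _normalize(sentence: str) -> str:
    """Lowercase, trim surrounding spaces/periods, and collapse whitespace."""
    return " ".join(sentence.lower().strip(" .").split())


def naive_entails(premises: List[str], hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: exact match after normalization."""
    return _normalize(hypothesis) in {_normalize(p) for p in premises}


def sentence_level_scores(
    generated: List[str],
    reference: List[str],
    entails: Callable[[List[str], str], bool] = naive_entails,
) -> Tuple[float, float]:
    """Return (correctness, completeness) as fractions of supported sentences."""
    # Correctness: generated sentences that the reference report supports.
    correct = sum(entails(reference, s) for s in generated)
    # Completeness: reference sentences that the generated report supports.
    covered = sum(entails(generated, s) for s in reference)
    correctness = correct / len(generated) if generated else 0.0
    completeness = covered / len(reference) if reference else 0.0
    return correctness, completeness


generated = ["No pleural effusion.", "Right lower lobe opacity."]
reference = [
    "Right lower lobe opacity.",
    "No pleural effusion.",
    "Heart size is normal.",
]
scores = sentence_level_scores(generated, reference)
print(scores)
```

In this toy case every generated sentence is supported, but one reference sentence is missing from the generated report, so completeness is lower than correctness. Swapping `naive_entails` for a real entailment judge would leave the scoring logic unchanged.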

Developed by Shruthi Bannur, Kenza Bouzid, and colleagues at Microsoft Research, MAIRA-2 builds on the earlier MAIRA-1 system and is released with open weights for research use.

Key Features

Grounded findings localization: Each described finding can be accompanied by zero or more bounding boxes on the current frontal image, making generated reports spatially verifiable.
Realistic reporting context: Accepts frontal and lateral views, a prior study (image plus report), and indication/technique/comparison fields, rather than a single isolated image.
CXR-specialized vision encoder: Uses RAD-DINO-MAIRA-2, a chest-X-ray-tuned image encoder, kept frozen and paired with a trained projection layer into the language model.
Multiple output modes: Produces conventional narrative reports, fully grounded reports, or phrase-grounding outputs that localize a supplied finding description.
Sentence-level evaluation: Accompanied by RadFact, an LLM-driven metric quantifying correctness and completeness at the level of individual report sentences.
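One way to picture a grounded report is as a list of sentences, each carrying zero or more boxes. The sketch below is illustrative only: the class and function names are hypothetical, and the box convention (corner coordinates normalized to the range 0 to 1) is an assumption, not the model's documented output format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention for this sketch: (x_min, y_min, x_max, y_max),
# each value normalized to [0, 1] relative to the frontal image.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedSentence:
    text: str
    boxes: List[Box] = field(default_factory=list)  # zero or more boxes


def is_valid_box(box: Box) -> bool:
    """A box is valid if it lies inside the image and has positive area."""
    x0, y0, x1, y1 = box
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0


def ungrounded_text(report: List[GroundedSentence]) -> str:
    """Drop the boxes to recover a conventional narrative report."""
    return " ".join(sentence.text for sentence in report)


report = [
    GroundedSentence(
        "Opacity in the right lower lung zone.",
        [(0.10, 0.55, 0.40, 0.85)],
    ),
    GroundedSentence("No pneumothorax."),  # a finding with no box
]
all_valid = all(is_valid_box(b) for s in report for b in s.boxes)
print(ungrounded_text(report), all_valid)
```

Keeping boxes attached to individual sentences is what lets a reader check each statement against a specific image region, while a sentence with an empty box list still reads as ordinary report text.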

Technical Details

MAIRA-2 couples a frozen RAD-DINO-MAIRA-2 vision encoder with a projection layer trained from scratch and a fully fine-tuned Vicuna-7B-v1.5 language backbone (roughly 7B parameters). Image embeddings are projected into the LLM token space and interleaved with the structured textual context, allowing a single autoregressive model to handle narrative generation, grounded generation, and phrase grounding. Training drew on a mix of public and private chest X-ray corpora: MIMIC-CXR (USA), PadChest (Spain), and a private USMix set, combining roughly 226,000 ungrounded examples with about 57,000 grounded examples carrying box annotations. On these data, MAIRA-2 reports state-of-the-art results on existing report-generation benchmarks (MIMIC-CXR and PadChest) and establishes baselines for the new grounded reporting task.
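The projection-and-interleave step can be sketched with toy numbers. Everything here is an assumption made for illustration: the dimensions are tiny, a single linear layer stands in for the projection (the text above only says "a projection layer"), and `IMAGE_SLOT`, `project`, and `build_sequence` are hypothetical names rather than anything from the MAIRA-2 code.

```python
from typing import List, Sequence, Union

Vector = List[float]
IMAGE_SLOT = "<image>"  # hypothetical marker for where image tokens are spliced in


def project(patch: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear map from encoder space to LLM token space (one output per row)."""
    return [
        sum(w * x for w, x in zip(row, patch)) + b
        for row, b in zip(weight, bias)
    ]


def build_sequence(
    text_items: Sequence[Union[str, Vector]],
    projected_patches: List[Vector],
) -> List[Vector]:
    """Replace each image marker with the projected patch embeddings."""
    sequence: List[Vector] = []
    for item in text_items:
        if item == IMAGE_SLOT:
            sequence.extend(projected_patches)
        else:
            sequence.append(item)
    return sequence


# Toy setup: encoder dimension 3, LLM embedding dimension 2.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
patches = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # stand-ins for frozen-encoder outputs

projected = [project(p, weight, bias) for p in patches]
text_items = [[0.1, 0.2], IMAGE_SLOT, [0.3, 0.4]]  # text embeddings around the image
sequence = build_sequence(text_items, projected)
print(len(sequence), sequence)
```

Only `weight` and `bias` correspond to trained projection parameters in this picture; the patch vectors would come from the frozen encoder, and the resulting mixed sequence is what the language model would consume autoregressively.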

Applications

MAIRA-2 targets research into trustworthy automated radiology reporting, where a draft report and its grounded localizations can be cross-checked against the underlying image. Potential use cases include assisting radiologists with report drafting, surfacing where in an image a finding originates, supporting education by linking report language to anatomy, and serving as a strong open baseline for benchmarking multimodal medical models. Microsoft restricts the release to research use only and explicitly states it is not intended for clinical practice.

Impact

By introducing grounded report generation and the RadFact evaluation framework, MAIRA-2 reframed radiology report generation from a pure text task into a spatially grounded one, raising the bar for verifiability in medical vision-language models. Its open weights, CXR-specific RAD-DINO encoder, and detailed benchmarking on MIMIC-CXR and PadChest have made it a frequently cited reference point (153 citing papers at the time of this listing) for subsequent grounded reporting and medical multimodal work. Key limitations are its research-only license, grounding restricted to the single current frontal image, and reliance on a partly private training corpus, which constrain reproducibility and direct clinical deployment.

Citation

MAIRA-2: Grounded Radiology Report Generation

Preprint

Bannur, S., et al. (2024) MAIRA-2: Grounded Radiology Report Generation. arXiv.org.

DOI: 10.48550/arXiv.2406.04449

Recent citations

Papers that recently cited this model.

The Path to Self-Evolving Clinical Systems: Scaling Medical Agents from Assistance to Autonomy
Chunzheng Zhu, Lei Tian, Bohan Tan, et al.
Jul 2026
Citations: 0

GRCD: Grounded Region Change Detection for Multi-Finding Chest X-Ray Pairs
O. R. R. Aranya, Peyman Najafirad, Kevin Desai
Jul 2026
Citations: 0

Discrete Diffusion Language Models for Interactive Radiology Report Drafting
Max Van Puyvelde, Halil Ibrahim Gulluk, W. Criekinge, et al.
Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, X. Liu, et al.
Information Fusion · May 2024
Citations: 116

Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation
Cheng-Yi Li, Kao-Jung Chang, Cheng-Fu Yang, et al.
Nature Communications · Jul 2024
Citations: 73

Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Zhihe Yang, Xufang Luo, Dongqi Han, et al.
Computer Vision and Pattern Recognition · Jan 2025
Citations: 65

Multimodal Large Language Models in Medical Imaging: Current State and Future Directions
Yoojin Nam, Dong Yeong Kim, Sunggu Kyung, et al.
Korean Journal of Radiology · Aug 2025
Citations: 60

MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow
Ziyue Wang, Junde Wu, Linghan Cai, et al.
Mar 2025
Citations: 52

Citations

Total Citations: 153

Influential: 21

References: 80

HuggingFace

Downloads: 4.4K

Likes: 80

Last Modified: 11mo ago

Pipeline: text-generation

Fields of citing research

Computer Science: 97%
Medicine: 95%
Engineering: 17%
Business: 1%
Linguistics: 1%
Mathematics: 1%
Psychology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

35 · Closed

Usability (can I run it?): 43

Reproducibility (can I retrain it?): 21

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

Research Paper · Official Website · HuggingFace Model
