Microsoft Research multimodal LLM for grounded chest X-ray report generation, localizing each described finding with bounding boxes on the image.
MAIRA-2 is a radiology-specific multimodal large language model from Microsoft Research, introduced in June 2024, that generates findings sections of chest X-ray reports directly from imaging and clinical context. Its defining contribution is grounded report generation: alongside the narrative text, MAIRA-2 emits bounding boxes that localize each described finding on the frontal image, tying the language of a report to concrete spatial evidence. This addresses a long-standing trust and verifiability problem in automated radiology reporting, where free-text generators produce plausible prose without indicating where on the image a finding was observed.
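One way to picture a grounded report is as a list of sentences, each carrying zero or more boxes on the frontal image. The sketch below is illustrative only: the `GroundedSentence` class, the normalised `(x_min, y_min, x_max, y_max)` box convention, the example sentences, and the 1024-pixel image size are all assumptions made for this example, not the model's documented output schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalised to [0, 1]
# relative to the frontal image. An assumption for illustration only.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedSentence:
    text: str
    boxes: List[Box] = field(default_factory=list)  # empty when nothing is localised


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalised box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 <= x1 <= 1.0 and 0.0 <= y0 <= y1 <= 1.0):
        raise ValueError(f"box out of range or inverted: {box}")
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


# Made-up example report; the coordinates are not real model output.
report = [
    GroundedSentence("There is a right lower lobe opacity.", [(0.10, 0.55, 0.45, 0.90)]),
    GroundedSentence("No pneumothorax."),  # sentence without a box
]

for sentence in report:
    pixel_boxes = [to_pixels(b, width=1024, height=1024) for b in sentence.boxes]
    print(sentence.text, pixel_boxes)
```

Keeping boxes attached to individual sentences, rather than to the report as a whole, is what lets a reader check each claim against a specific image region.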
Most prior report-generation systems consume a single image and emit unstructured text, leaving clinicians unable to check whether a stated abnormality corresponds to a real region. MAIRA-2 instead conditions on a richer, more realistic reporting context that mirrors how radiologists actually work: the current frontal view, an optional lateral view, a prior frontal image and its report, and structured indication, technique, and comparison fields. The authors also formalize and benchmark the grounded reporting task and introduce RadFact, an LLM-based metric that scores report correctness and completeness sentence by sentence.
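The sentence-by-sentence scoring idea can be sketched as two entailment passes: generated sentences checked against the reference report (correctness) and reference sentences checked against the generated report (completeness). The code below is a minimal sketch under that reading and is not the RadFact implementation. The `Judge` interface, the function names, and the zero score for an empty report are assumptions, and `toy_judge` is a string-matching stand-in for the LLM that a real judge would use.

```python
from typing import Callable, List, Tuple

# A judge decides whether `hypothesis` follows from the `premises` sentences.
# In an LLM-based metric this would be a model call; here it is pluggable.
Judge = Callable[[List[str], str], bool]


def entailed_fraction(hypotheses: List[str], premises: List[str], judge: Judge) -> float:
    """Fraction of hypothesis sentences the judge marks as entailed by the premises."""
    if not hypotheses:
        return 0.0  # convention chosen for this sketch
    return sum(judge(premises, h) for h in hypotheses) / len(hypotheses)


def sentence_level_scores(
    generated: List[str], reference: List[str], judge: Judge
) -> Tuple[float, float]:
    precision = entailed_fraction(generated, reference, judge)  # correctness
    recall = entailed_fraction(reference, generated, judge)     # completeness
    return precision, recall


def toy_judge(premises: List[str], hypothesis: str) -> bool:
    """Stand-in judge: case-insensitive exact match, with no real reasoning."""
    return hypothesis.lower() in {p.lower() for p in premises}


# Made-up sentences for illustration.
generated = ["Small left pleural effusion.", "Cardiomegaly."]
reference = [
    "Small left pleural effusion.",
    "No pneumothorax.",
    "Lines and tubes are unchanged.",
]

precision, recall = sentence_level_scores(generated, reference, toy_judge)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scoring in both directions separates two failure modes: a report that states something unsupported lowers the first number, while a report that omits a reference finding lowers the second.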
Developed by Shruthi Bannur, Kenza Bouzid, and colleagues at Microsoft Research, MAIRA-2 builds on the earlier MAIRA-1 system and is released with open weights for research use.
MAIRA-2 couples a frozen RAD-DINO-MAIRA-2 vision encoder with a projection layer trained from scratch and a fully fine-tuned Vicuna-7B-v1.5 language backbone (roughly 7B parameters). Image embeddings are projected into the LLM token space and interleaved with the structured textual context, allowing a single autoregressive model to handle narrative generation, grounded generation, and phrase grounding. Training drew on a mix of public and private chest X-ray corpora: MIMIC-CXR (USA), PadChest (Spain), and a private USMix set, combining roughly 226,000 ungrounded examples with about 57,000 grounded examples carrying box annotations. On these data, MAIRA-2 reports state-of-the-art results on existing report-generation benchmarks (MIMIC-CXR and PadChest) and establishes baselines for the new grounded reporting task.
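The interleaving of image embeddings with textual context can be illustrated with a toy sequence builder. This is a rough sketch, not the released prompt template: the segment labels, the section ordering, the whitespace "tokenizer", and the value of `tokens_per_image` are all hypothetical. In the real model each image slot would hold a projected vision-encoder embedding and each text slot a language-model token embedding.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # ("text", content) or ("image", image_name)


def build_context(
    indication: str,
    technique: str,
    comparison: str,
    has_lateral: bool = False,
    prior_report: Optional[str] = None,
) -> List[Segment]:
    """Assemble text and image segments in a hypothetical order."""
    segments: List[Segment] = [("text", "Current frontal image:"), ("image", "frontal")]
    if has_lateral:
        segments += [("text", "Current lateral image:"), ("image", "lateral")]
    if prior_report is not None:
        # The prior frontal image and its report are supplied together.
        segments += [
            ("text", "Prior frontal image:"),
            ("image", "prior_frontal"),
            ("text", f"Prior report: {prior_report}"),
        ]
    segments += [
        ("text", f"Indication: {indication}"),
        ("text", f"Technique: {technique}"),
        ("text", f"Comparison: {comparison}"),
    ]
    return segments


def flatten(segments: List[Segment], tokens_per_image: int) -> List[Segment]:
    """Expand each image into a fixed number of slots and split text on whitespace."""
    slots: List[Segment] = []
    for kind, value in segments:
        if kind == "image":
            slots += [("image", value)] * tokens_per_image
        else:
            slots += [("text", word) for word in value.split()]
    return slots


segments = build_context(
    "Cough and fever.", "PA and lateral views.", "None.", has_lateral=True
)
slots = flatten(segments, tokens_per_image=4)
image_slots = sum(1 for kind, _ in slots if kind == "image")
print(len(slots), image_slots)
```

Because every input ends up in one flat sequence, the same autoregressive decoder can serve all three tasks; only the instruction and the expected output differ.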
MAIRA-2 targets research into trustworthy automated radiology reporting, where a draft report and its grounded localizations can be cross-checked against the underlying image. Potential use cases include assisting radiologists with report drafting, surfacing where in an image a finding originates, supporting education by linking report language to anatomy, and serving as a strong open baseline for benchmarking multimodal medical models. Microsoft restricts the release to research use only and explicitly states it is not intended for clinical practice.
By introducing grounded report generation and the RadFact evaluation framework, MAIRA-2 reframed radiology report generation from a pure text task into a spatially grounded one, raising the bar for verifiability in medical vision-language models. Its open weights, CXR-specific RAD-DINO encoder, and detailed benchmarking on MIMIC-CXR and PadChest make it a natural reference point for subsequent grounded reporting and medical multimodal work. Key limitations are its research-only license, its restriction of grounding to a single frontal image, and its reliance on a partly private training corpus, which constrain reproducibility and direct clinical deployment.
Bannur, S., et al. (2024) MAIRA-2: Grounded Radiology Report Generation. arXiv.org.
DOI: 10.48550/arXiv.2406.04449