bio.rodeo

© 2026 Pulsatance. All rights reserved.
Imaging · Language model

MAIRA-2

Microsoft Research

Microsoft Research multimodal LLM for grounded chest X-ray report generation, localizing each described finding with bounding boxes on the image.

Released: June 2024
Parameters: 7 billion

MAIRA-2 is a radiology-specific multimodal large language model from Microsoft Research, introduced in June 2024, that generates findings sections of chest X-ray reports directly from imaging and clinical context. Its defining contribution is grounded report generation: alongside the narrative text, MAIRA-2 emits bounding boxes that localize each described finding on the frontal image, tying the language of a report to concrete spatial evidence. This addresses a long-standing trust and verifiability problem in automated radiology reporting, where free-text generators produce plausible prose without indicating where on the image a finding was observed.

Most prior report-generation systems consume a single image and emit unstructured text, leaving clinicians unable to check whether a stated abnormality corresponds to a real region of the image. MAIRA-2 instead conditions on a richer context that mirrors how radiologists actually work: the current frontal view, an optional lateral view, a prior frontal image and its report, and structured indication, technique, and comparison fields. The authors also formalize and benchmark the grounded reporting task and introduce RadFact, an LLM-based metric that scores report correctness and completeness sentence by sentence.
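To make the reporting context concrete, the inputs listed above can be pictured as a single record. The following is a minimal sketch only: the class name `ReportingContext`, its field names, and the rule that a prior image and prior report travel together are all assumptions made for illustration, not MAIRA-2's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReportingContext:
    """Hypothetical container for the reporting inputs (not MAIRA-2's API)."""

    current_frontal: str                   # path to the current frontal image
    current_lateral: Optional[str] = None  # optional lateral view
    prior_frontal: Optional[str] = None    # prior frontal image, if available
    prior_report: Optional[str] = None     # report belonging to the prior study
    indication: Optional[str] = None
    technique: Optional[str] = None
    comparison: Optional[str] = None

    def validate(self) -> None:
        # Assumption of this sketch: the prior image and its report are
        # either both present or both absent.
        if (self.prior_frontal is None) != (self.prior_report is None):
            raise ValueError("prior image and prior report must be given together")

    def available_inputs(self) -> List[str]:
        """Names of the fields that are actually populated, in field order."""
        return [name for name, value in vars(self).items() if value is not None]
```

Only the current frontal image is required here; every other field defaults to `None`, which reflects the description of the lateral view as optional and treats the remaining context as something a study may or may not provide.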

Developed by Shruthi Bannur, Kenza Bouzid, and colleagues at Microsoft Research, MAIRA-2 builds on the earlier MAIRA-1 system and is released with open weights for research use.

Key Features

  • Grounded findings localization: Each described finding can be accompanied by zero or more bounding boxes on the current frontal image, making generated reports spatially verifiable.
  • Realistic reporting context: Accepts frontal and lateral views, a prior study (image plus report), and indication/technique/comparison fields, rather than a single isolated image.
  • CXR-specialized vision encoder: Uses RAD-DINO-MAIRA-2, a chest-X-ray-tuned image encoder, kept frozen and paired with a trained projection layer into the language model.
  • Multiple output modes: Produces conventional narrative reports, fully grounded reports, or phrase-grounding outputs that localize a supplied finding description.
  • Sentence-level evaluation: Accompanied by RadFact, an LLM-driven metric quantifying correctness and completeness at the level of individual report sentences.
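The idea of sentence-level evaluation can be sketched as two fractions: how many generated sentences are supported by the reference report (correctness) and how many reference sentences are supported by the generated report (completeness). This is a simplified illustration, not the RadFact implementation. Reading correctness and completeness as precision-like and recall-like fractions is an assumption of the sketch, the function names are hypothetical, and the LLM judge is replaced by an injected callable with a toy exact-match stand-in.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: True if `sentence` is supported by `evidence`.
# In an LLM-based metric this role would be played by a language model.
EntailmentJudge = Callable[[str, List[str]], bool]


def sentence_level_scores(
    generated: List[str],
    reference: List[str],
    is_entailed: EntailmentJudge,
) -> Tuple[float, float]:
    """Return (correctness, completeness) as fractions of supported sentences."""
    # Correctness: share of generated sentences supported by the reference.
    correct = sum(is_entailed(s, reference) for s in generated)
    # Completeness: share of reference sentences supported by the generation.
    covered = sum(is_entailed(s, generated) for s in reference)
    correctness = correct / len(generated) if generated else 0.0
    completeness = covered / len(reference) if reference else 0.0
    return correctness, completeness


def exact_match_judge(sentence: str, evidence: List[str]) -> bool:
    # Toy stand-in for an LLM judge: case-insensitive exact match only.
    return sentence.strip().lower() in {e.strip().lower() for e in evidence}
```

With a generated report of two sentences, one of which matches a three-sentence reference, this returns a correctness of 0.5 and a completeness of one third. A real judge would need to recognize paraphrases and clinical equivalence, which exact matching cannot do.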

Technical Details

MAIRA-2 couples a frozen RAD-DINO-MAIRA-2 vision encoder with a projection layer trained from scratch and a fully fine-tuned Vicuna-7B-v1.5 language backbone (roughly 7B parameters). Image embeddings are projected into the LLM token space and interleaved with the structured textual context, allowing a single autoregressive model to handle narrative generation, grounded generation, and phrase grounding. Training drew on a mix of public and private chest X-ray corpora: MIMIC-CXR (USA), PadChest (Spain), and a private USMix set, combining roughly 226,000 ungrounded examples with about 57,000 grounded examples carrying box annotations. On these data, MAIRA-2 reports state-of-the-art results on existing report-generation benchmarks (MIMIC-CXR and PadChest) and establishes baselines for the new grounded reporting task.
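The projection-and-interleaving step described above can be illustrated with a toy example in plain Python. This is a sketch under stated assumptions rather than the model's code: the projection is shown as a single linear map, the text embedding is a lookup table keyed by whitespace-separated tokens, and the names `project` and `interleave` are hypothetical.

```python
from typing import Dict, List, Union

Vector = List[float]


def project(patch: Vector, weight: List[Vector]) -> Vector:
    """Linear map from the vision-encoder width to the LLM embedding width.

    `weight` holds one row per output dimension. Using a single linear layer
    is an assumption of this sketch, not a claim about MAIRA-2's projector.
    """
    return [sum(w * x for w, x in zip(row, patch)) for row in weight]


def interleave(
    segments: List[Union[str, List[Vector]]],
    embed_text: Dict[str, Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Build one input sequence from text segments and image patch embeddings."""
    sequence: List[Vector] = []
    for segment in segments:
        if isinstance(segment, str):
            # Text: look up one toy embedding per whitespace-separated token.
            sequence.extend(embed_text[token] for token in segment.split())
        else:
            # Image: project each patch embedding from the frozen encoder.
            sequence.extend(project(patch, weight) for patch in segment)
    return sequence
```

In this picture only `weight` (and, in the real model, the language backbone) would be updated during training, while the patch embeddings arrive from an encoder whose parameters stay fixed.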

Applications

MAIRA-2 targets research into trustworthy automated radiology reporting, where a draft report and its grounded localizations can be cross-checked against the underlying image. Potential use cases include assisting radiologists with report drafting, surfacing where in an image a finding originates, supporting education by linking report language to anatomy, and serving as a strong open baseline for benchmarking multimodal medical models. Microsoft restricts the release to research use only and explicitly states it is not intended for clinical practice.
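As an illustration of cross-checking, a grounded report can be treated as a list of sentences, each with zero or more boxes, that are scaled to pixel coordinates for overlay on the image. The sketch below assumes boxes normalized to the range 0 to 1 in `(x_min, y_min, x_max, y_max)` order. That coordinate convention and the names `GroundedFinding`, `to_pixels`, and `review_items` are assumptions for illustration and do not describe the released model's output format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), each normalized to [0, 1].
Box = Tuple[float, float, float, float]
PixelBox = Tuple[int, int, int, int]


@dataclass
class GroundedFinding:
    text: str
    boxes: List[Box] = field(default_factory=list)  # zero or more boxes


def to_pixels(box: Box, width: int, height: int) -> PixelBox:
    """Scale a normalized box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError(f"box out of range or degenerate: {box}")
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


def review_items(
    findings: List[GroundedFinding], width: int, height: int
) -> List[Tuple[str, List[PixelBox]]]:
    """Pair each sentence with the pixel boxes a reviewer could overlay."""
    return [
        (f.text, [to_pixels(b, width, height) for b in f.boxes])
        for f in findings
    ]
```

Findings without boxes, such as negative statements, pass through with an empty list, so a reviewer can still see which sentences carry no spatial evidence.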

Impact

By introducing grounded report generation and the RadFact evaluation framework, MAIRA-2 reframed radiology report generation from a pure text task into a spatially grounded one, raising the bar for verifiability in medical vision-language models. Its open weights, CXR-specific RAD-DINO encoder, and detailed benchmarking on MIMIC-CXR and PadChest have made it a widely cited reference point for subsequent grounded reporting and medical multimodal work. Key limitations are its research-only license, single-frontal-image grounding, and reliance on a partly private training corpus, which constrain reproducibility and direct clinical deployment.

Citation

MAIRA-2: Grounded Radiology Report Generation

Preprint

Bannur, S., et al. (2024) MAIRA-2: Grounded Radiology Report Generation. arXiv.org.

DOI: 10.48550/arXiv.2406.04449

Citations

Total Citations: 139
Influential: 21
References: 80

HuggingFace

Downloads: 3.2K
Likes: 74
Last Modified: 10mo ago
Pipeline: text-generation

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 35 (Closed)
Usability (can I run it?): 43
Reproducibility (can I retrain it?): 21
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

chest_x_ray, instruction_tuning, multimodal, multimodal_transformer, phrase_grounding, radiology, radiology_report_generation, vision_transformer, visual_question_answering

Resources

Research Paper · Official Website · HuggingFace Model