Microsoft Research multimodal LLM that generates the findings section of a chest X-ray report from a single frontal image using a CXR-specific vision encoder and Vicuna-7B.
MAIRA-1 is a radiology-specific multimodal large language model from Microsoft Research, introduced in November 2023, that generates the findings section of a chest X-ray (CXR) report directly from a single frontal image. It was the first model in the MAIRA line and set out to show that pairing a domain-specialized image encoder with a general-purpose large language model could produce radiology reports whose quality approaches what radiologists expect, rather than the generic captions earlier vision-language systems tended to emit.
The core problem MAIRA-1 addresses is automated drafting of free-text radiology reports, a labor-intensive task where small errors carry real clinical weight. Prior report generators frequently relied on image encoders trained on natural images, which struggle to capture the subtle, low-contrast findings characteristic of chest radiographs. MAIRA-1 instead builds on RAD-DINO, a chest-X-ray-tuned vision transformer, and connects it to a fine-tuned Vicuna-7B language backbone, demonstrating that careful domain adaptation of the visual front end is decisive for report quality.
Developed by Stephanie L. Hyland, Shruthi Bannur, Ozan Oktay, and colleagues at Microsoft Research, MAIRA-1 established the architecture and evaluation practices that its successor, MAIRA-2, later extended to grounded, multi-image reporting.
MAIRA-1 consists of three components: a frozen RAD-DINO ViT-B image encoder, a four-layer feedforward adapter, and a fine-tuned Vicuna-7B language model. The adapter projects image embeddings from the encoder into the LLM's token-embedding space, and the language model is trained to autoregressively generate the findings section from this visual context. The model was trained and evaluated on MIMIC-CXR, a dataset of 377,110 chest radiographs, using frontal views to match its single-image input. Relative to prior baselines, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and on all lexical metrics considered. Manual review by radiologists confirmed promising fluency and accuracy, and also revealed failure modes that existing automated evaluation does not capture.
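The encoder-adapter-LLM data flow described above can be made concrete with some shape bookkeeping. The following is a minimal sketch, not the authors' implementation: the input resolution, patch size, hidden widths, and the choice to keep every adapter layer at the LLM width are illustrative assumptions, and the names `PipelineShapes`, `num_image_tokens`, `adapter_layer_shapes`, and `adapter_param_count` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineShapes:
    """Illustrative dimensions; every default here is an assumption."""
    image_size: int = 518     # assumed input resolution (pixels per side)
    patch_size: int = 14      # assumed ViT patch size
    encoder_dim: int = 768    # typical hidden width of a ViT-B encoder
    llm_dim: int = 4096       # typical hidden width of a 7B LLaMA-style LLM
    adapter_layers: int = 4   # the "four-layer" adapter from the description


def num_image_tokens(s: PipelineShapes) -> int:
    """Number of patch embeddings the encoder emits for one image."""
    per_side = s.image_size // s.patch_size
    return per_side * per_side


def adapter_layer_shapes(s: PipelineShapes) -> list[tuple[int, int]]:
    """(in_features, out_features) for each linear layer of the adapter.

    Assumes the first layer maps encoder width to LLM width and the
    remaining layers stay at LLM width.
    """
    dims = [s.encoder_dim] + [s.llm_dim] * s.adapter_layers
    return list(zip(dims[:-1], dims[1:]))


def adapter_param_count(s: PipelineShapes) -> int:
    """Weights plus biases across all adapter layers."""
    return sum(i * o + o for i, o in adapter_layer_shapes(s))


if __name__ == "__main__":
    shapes = PipelineShapes()
    print("image tokens per X-ray:", num_image_tokens(shapes))
    print("adapter layers:", adapter_layer_shapes(shapes))
    print("adapter parameters:", adapter_param_count(shapes))
```

Under these assumptions the adapter is small next to the language model, while the image contributes many tokens to the LLM's context, so sequence length rather than adapter size is the main cost the visual input adds.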
MAIRA-1 targets research into automated chest X-ray report drafting, where a generated findings paragraph can serve as a starting point for radiologist review, support reporting-workflow studies, and act as a strong baseline for benchmarking medical vision-language models. It is positioned as a research artifact rather than a clinical tool: Microsoft frames the MAIRA models as research-only and not intended for diagnostic or treatment decisions, so its primary beneficiaries are researchers studying multimodal medical AI and report-generation evaluation.
As the first MAIRA model, MAIRA-1 demonstrated that a domain-specialized CXR encoder paired with a general LLM could push radiology report generation to then-state-of-the-art quality on MIMIC-CXR, and its emphasis on radiologist-aligned evaluation helped highlight the gap between lexical metrics and clinical correctness. It established the encoder-adapter-LLM recipe and evaluation mindset that MAIRA-2 built on with grounded, multi-image reporting and the RadFact metric. Its main limitations are a single-frontal-image input with no spatial grounding, a research-only license, and the failure modes the authors surfaced that automated metrics do not capture, all of which constrain direct clinical use.
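The gap between lexical metrics and clinical correctness can be illustrated with a toy example. The sketch below uses a simple unigram-overlap F1 as a stand-in for lexical metrics in general; it is not RadCliQ, RadFact, or any metric used in the MAIRA papers, and the sentences and the helper names `tokenize` and `unigram_f1` are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (a toy lexical metric)."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "There is no pleural effusion or pneumothorax."

# Clinically opposite: dropping "no" flips the meaning of the finding.
contradiction = "There is pleural effusion or pneumothorax."

# Clinically equivalent, but worded differently.
paraphrase = "No effusion and no pneumothorax seen."

if __name__ == "__main__":
    print("contradiction:", round(unigram_f1(contradiction, reference), 3))
    print("paraphrase:   ", round(unigram_f1(paraphrase, reference), 3))
```

Here the contradictory sentence scores about 0.92 while the faithful paraphrase scores about 0.46, which is the kind of mismatch that motivates radiologist-aligned evaluation.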
Hyland, S. L., et al. (2023). MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668.
DOI: 10.48550/arXiv.2311.13668