MAIRA-1

Radiology-specific multimodal LLM that generates the findings section of a chest X-ray report from a frontal image, pairing RAD-DINO with Vicuna-7B.

Released: November 2023

Parameters: 7 billion

MAIRA-1 is a radiology-specific multimodal large language model from Microsoft Research, introduced in November 2023, that generates the findings section of a chest X-ray (CXR) report directly from a single frontal image. It was the first model in the MAIRA line and set out to show that pairing a domain-specialized image encoder with a general-purpose large language model could produce radiology reports whose quality approaches what radiologists expect, rather than the generic captions earlier vision-language systems tended to emit.

The core problem MAIRA-1 addresses is automated drafting of free-text radiology reports, a labor-intensive task where small errors carry real clinical weight. Prior report generators frequently relied on image encoders trained on natural images, which struggle to capture the subtle, low-contrast findings characteristic of chest radiographs. MAIRA-1 instead builds on RAD-DINO, a vision transformer trained on chest X-rays, and connects it to a fine-tuned Vicuna-7B language backbone, showing that domain adaptation of the visual front end is a major contributor to report quality.

Developed by Stephanie L. Hyland, Shruthi Bannur, Ozan Oktay, and colleagues at Microsoft Research, MAIRA-1 established the architecture and evaluation practices that its successor, MAIRA-2, later extended to grounded, multi-image reporting.

Key Features

CXR-specialized vision encoder: Uses a frozen RAD-DINO ViT-B encoder, trained on chest radiographs with DINOv2 self-supervision at 518-pixel resolution, helping it surface small or subtle findings such as a pneumothorax.
LLM-based report generation: Couples the image encoder to a fine-tuned Vicuna-7B language model, producing fluent narrative findings text rather than short labels or templated captions.
Lightweight adapter bridge: A four-layer feedforward adapter projects image embeddings into the language model's token space, keeping the vision encoder frozen while adapting the connection.
Text-based data augmentation: Augmentation of the report text during training improves robustness and contributes to the model's gains on radiologist-aligned metrics.
Radiologist-aligned evaluation: Reported on RadCliQ and a manual radiologist review, exposing failure modes that purely lexical metrics miss.
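
To illustrate the last point, the sketch below shows how a lexical overlap score can rank a clinically wrong sentence above a correct paraphrase. It uses a simplified ROUGE-L-style F1 (longest common subsequence over whitespace tokens), chosen here only for illustration. It is not the evaluation code used for MAIRA-1, and the example sentences and function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 on lowercased whitespace tokens (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from any dataset.
REFERENCE = "there is no pneumothorax or pleural effusion"
WRONG_BUT_SIMILAR = "there is a pneumothorax or pleural effusion"      # flips the finding
CORRECT_PARAPHRASE = "no pleural effusion or pneumothorax is seen"     # same meaning

if __name__ == "__main__":
    print("wrong but similar :", round(rouge_l_f1(REFERENCE, WRONG_BUT_SIMILAR), 3))
    print("correct paraphrase:", round(rouge_l_f1(REFERENCE, CORRECT_PARAPHRASE), 3))
```

Under this simplified metric, the sentence that wrongly asserts a pneumothorax scores about 0.86 against the reference, while the correct paraphrase scores about 0.43. Word overlap alone cannot separate the two, which is the kind of gap that radiologist-aligned metrics and manual review are meant to expose.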

Technical Details

MAIRA-1 consists of three components: a frozen RAD-DINO ViT-B image encoder, a four-layer feedforward adapter module, and a fine-tuned Vicuna-7B language model (roughly 7B parameters). The adapter projects the encoder's image embeddings into the LLM token space, and the language model is trained to generate the findings section autoregressively from this visual context. The model was trained and evaluated on MIMIC-CXR, a dataset of 377,110 chest X-ray images in DICOM format. Relative to prior baselines, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and on all lexical metrics considered. Manual review by radiologists confirmed promising fluency and accuracy, and also revealed failure modes that existing automated evaluation practices do not capture.
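
The encoder-adapter-LLM data flow can be made concrete with some shape arithmetic. The sketch below is a rough estimate under stated assumptions, not the released implementation. Only the 518-pixel input and the four adapter layers come from the description above. The 14-pixel patch size (the DINOv2 convention), the 768-wide ViT-B embedding, the 4096-wide LLM hidden state, and the choice to make every adapter layer 4096 wide are assumptions made for illustration, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterShapeSketch:
    image_size: int = 518   # input resolution, from the description above
    patch_size: int = 14    # ASSUMPTION: DINOv2-style 14-pixel patches
    vision_dim: int = 768   # ASSUMPTION: standard ViT-B embedding width
    llm_dim: int = 4096     # ASSUMPTION: hidden size of a LLaMA-7B-family model
    num_layers: int = 4     # four-layer adapter, from the description above

    def num_patch_tokens(self) -> int:
        """Number of image patch tokens handed to the adapter."""
        side = self.image_size // self.patch_size
        return side * side

    def layer_shapes(self) -> list:
        """(in_features, out_features) per adapter layer.

        ASSUMPTION: the first layer maps vision_dim to llm_dim and the
        remaining layers keep llm_dim; the real hidden widths may differ.
        """
        dims = [self.vision_dim] + [self.llm_dim] * self.num_layers
        return list(zip(dims[:-1], dims[1:]))

    def num_parameters(self) -> int:
        """Weights plus biases across all adapter layers."""
        return sum(i * o + o for i, o in self.layer_shapes())


if __name__ == "__main__":
    sketch = AdapterShapeSketch()
    print("patch tokens per image:", sketch.num_patch_tokens())
    print("adapter layer shapes  :", sketch.layer_shapes())
    print("adapter parameters    :", sketch.num_parameters())
```

With these assumed sizes, each image becomes 37 x 37 = 1,369 patch tokens, and the adapter holds roughly 53.5 million parameters, a small fraction of the roughly 7B parameters in the language model.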

Applications

MAIRA-1 targets research into automated chest X-ray report drafting, where a generated findings paragraph can serve as a starting point for radiologist review, support reporting-workflow studies, and act as a strong baseline for benchmarking medical vision-language models. It is positioned as a research artifact rather than a clinical tool: Microsoft frames the MAIRA models as research-only and not intended for diagnostic or treatment decisions, so its primary beneficiaries are researchers studying multimodal medical AI and report-generation evaluation.

Impact

As the first MAIRA model, MAIRA-1 demonstrated that a domain-specialized CXR encoder paired with a general LLM could push radiology report generation to then-state-of-the-art quality on MIMIC-CXR, and its emphasis on radiologist-aligned evaluation helped highlight the gap between lexical metrics and clinical correctness. It established the encoder-adapter-LLM recipe and evaluation mindset that MAIRA-2 built on with grounded, multi-image reporting and the RadFact metric. Its main limitations are a single-frontal-image input with no spatial grounding, a research-only license, and the failure modes the authors surfaced that automated metrics do not capture, all of which constrain direct clinical use.

Citation

MAIRA-1: A specialised large multimodal model for radiology report generation

Preprint

Hyland, S. L., et al. (2023) MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv.org.

DOI: 10.48550/arXiv.2311.13668

Recent citations

Papers that recently cited this model.

Discrete Diffusion Language Models for Interactive Radiology Report Drafting
Max Van Puyvelde, Halil Ibrahim Gulluk, W. Criekinge, et al.
Jul 2026 · Citations: 0

CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays
Geon Choi, Hangyul Yoon, Nalee Kim, et al.
Jun 2026 · Citations: 0

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
Shiying Yu, Jie Wang, Guoming Lu
May 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Hongjian Zhou, Boyang Gu, Xinyu Zou, et al.
arXiv.org · Nov 2023
Citations: 234 · Influential

MAIRA-2: Grounded Radiology Report Generation
Shruthi Bannur, Kenza Bouzid, D. C. Castro, et al.
arXiv.org · Jun 2024
Citations: 139

Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology
Nur Yildirim, Hannah Richardson, M. Wetscherek, et al.
International Conference on Human Factors in Computing Systems · Feb 2024
Citations: 107 · Influential

Application of large language models in medicine
Fenglin Liu, Hongjian Zhou, Boyang Gu, et al.
Nature Reviews Bioengineering · Apr 2025
Citations: 95

Multimodal generative AI for medical image interpretation
Vishwanatha M. Rao, Michael Hla, Michael Moor, et al.
Nature · Mar 2025
Citations: 92 · Influential

Citations

Total Citations: 92

Influential: 9

References: 48

Fields of citing research

Medicine: 98%
Computer Science: 96%
Engineering: 17%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Overall: 6 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper Official Website

