bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Imaging foundation models
Imaging · Language model

MAIRA-1

Microsoft Research

Microsoft Research multimodal LLM that generates the findings section of a chest X-ray report from a single frontal image using a CXR-specific vision encoder and Vicuna-7B.

Released: November 2023
Parameters: 7 billion

MAIRA-1 is a radiology-specific multimodal large language model from Microsoft Research, introduced in November 2023, that generates the findings section of a chest X-ray (CXR) report directly from a single frontal image. It was the first model in the MAIRA line and set out to show that pairing a domain-specialized image encoder with a general-purpose large language model could produce radiology reports whose quality approaches what radiologists expect, rather than the generic captions earlier vision-language systems tended to emit.

The core problem MAIRA-1 addresses is automated drafting of free-text radiology reports, a labor-intensive task where small errors carry real clinical weight. Prior report generators frequently relied on image encoders trained on natural images, which struggle to capture the subtle, low-contrast findings characteristic of chest radiographs. MAIRA-1 instead builds on RAD-DINO, a vision transformer trained on chest X-rays, and connects it to a fine-tuned Vicuna-7B language backbone. Its results indicate that domain adaptation of the visual front end contributes materially to report quality.

Developed by Stephanie L. Hyland, Shruthi Bannur, Ozan Oktay, and colleagues at Microsoft Research, MAIRA-1 established the architecture and evaluation practices that its successor, MAIRA-2, later extended to grounded, multi-image reporting.

#Key Features

  • CXR-specialized vision encoder: Uses a frozen RAD-DINO ViT-B encoder, trained on chest radiographs with DINOv2 self-supervision at 518-pixel resolution, helping it surface small or subtle findings such as a pneumothorax.
  • LLM-based report generation: Couples the image encoder to a fine-tuned Vicuna-7B language model, producing fluent narrative findings text rather than short labels or templated captions.
  • Lightweight adapter bridge: A four-layer feedforward adapter projects image embeddings into the language model's token space, keeping the vision encoder frozen while adapting the connection.
  • Text-based data augmentation: Augmentation of the report text during training improves robustness and contributes to the model's gains on radiologist-aligned metrics.
  • Radiologist-aligned evaluation: Reported on RadCliQ and a manual radiologist review, exposing failure modes that purely lexical metrics miss.
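
To make the resolution point in the first bullet concrete, the short sketch below counts how many patch tokens a vision transformer produces from a square image. The 14-pixel patch size is an assumption taken from the standard DINOv2 configuration and is not stated on this page; `patch_tokens` is a hypothetical helper, not part of any MAIRA or RAD-DINO API.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: 14-pixel patches (the standard DINOv2 setting).
print(patch_tokens(518, 14))  # 37 * 37 = 1369
print(patch_tokens(224, 14))  # 16 * 16 = 256
```

Under that assumption, a 518-pixel input yields roughly five times as many patch tokens as a 224-pixel input, which is one plausible reason higher resolution helps with small findings.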

#Technical Details

MAIRA-1 consists of three components: a frozen RAD-DINO ViT-B image encoder, a four-layer feedforward adapter, and a fine-tuned Vicuna-7B language model, which accounts for nearly all of the roughly 7 billion parameters. The adapter projects image embeddings from the encoder into the LLM's token embedding space, and the language model is trained to generate the findings section autoregressively from this visual context. The model was trained and evaluated on the MIMIC-CXR dataset, which contains 377,110 chest X-ray DICOM images. Relative to prior baselines, MAIRA-1 significantly improves on the radiologist-aligned RadCliQ metric and on all lexical metrics considered. Manual review by radiologists found promising fluency and accuracy, and also revealed failure modes that existing automated evaluation practices do not capture.
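
The encoder-adapter-LLM data flow described above can be sketched in plain Python. This is a toy illustration with random weights and tiny widths, not the authors' implementation: the GELU activation, the placement of image tokens before the text prompt, and all names (`make_adapter`, `adapt`, `D_IMAGE`, `D_MODEL`) are assumptions made for the example.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(d_in, d_out, rng):
    """Random weights (d_out rows of d_in) and a zero bias, standing in for trained parameters."""
    scale = 1.0 / math.sqrt(d_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out


def linear(vec, layer):
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def make_adapter(d_image, d_model, rng, num_layers=4):
    """Four feedforward layers: the first maps d_image -> d_model, the rest keep d_model."""
    dims = [d_image] + [d_model] * num_layers
    return [make_linear(dims[i], dims[i + 1], rng) for i in range(num_layers)]


def adapt(patch_embeddings, adapter):
    """Project every patch embedding into the language model's embedding width."""
    projected = []
    for vec in patch_embeddings:
        for i, layer in enumerate(adapter):
            vec = linear(vec, layer)
            if i < len(adapter) - 1:  # nonlinearity between layers, none after the last
                vec = [gelu(v) for v in vec]
        projected.append(vec)
    return projected


rng = random.Random(0)
D_IMAGE, D_MODEL = 8, 16  # toy widths; the real encoder and LLM are far wider

# Stand-ins for the frozen encoder's patch embeddings and the embedded text prompt.
patches = [[rng.gauss(0, 1) for _ in range(D_IMAGE)] for _ in range(5)]
prompt = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]

adapter = make_adapter(D_IMAGE, D_MODEL, rng)
image_tokens = adapt(patches, adapter)
llm_input = image_tokens + prompt  # sequence the language model would attend over
print(len(llm_input), len(llm_input[0]))  # 8 16
```

The point of the sketch is the shape contract: the encoder output is never modified, and only the adapter (and, in MAIRA-1, the language model) would carry trainable parameters.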

#Applications

MAIRA-1 targets research into automated chest X-ray report drafting, where a generated findings paragraph can serve as a starting point for radiologist review, support reporting-workflow studies, and act as a strong baseline for benchmarking medical vision-language models. It is positioned as a research artifact rather than a clinical tool: Microsoft frames the MAIRA models as research-only and not intended for diagnostic or treatment decisions, so its primary beneficiaries are researchers studying multimodal medical AI and report-generation evaluation.

#Impact

As the first MAIRA model, MAIRA-1 demonstrated that a domain-specialized CXR encoder paired with a general LLM could push radiology report generation to then-state-of-the-art quality on MIMIC-CXR, and its emphasis on radiologist-aligned evaluation helped highlight the gap between lexical metrics and clinical correctness. It established the encoder-adapter-LLM recipe and evaluation mindset that MAIRA-2 built on with grounded, multi-image reporting and the RadFact metric. Its main limitations are a single-frontal-image input with no spatial grounding, a research-only license, and the failure modes the authors surfaced that automated metrics do not capture, all of which constrain direct clinical use.
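
The gap between lexical metrics and clinical correctness can be illustrated with a toy example. The function below is a simple unigram-overlap F1, chosen only for illustration; it is not RadCliQ or any metric reported for MAIRA-1, and the sentences are invented.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared word counts; a crude stand-in for lexical overlap metrics."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "no pneumothorax is seen"
wrong = "pneumothorax is seen"                  # opposite clinical meaning
right = "there is no evidence of pneumothorax"  # same meaning, different wording

print(round(unigram_f1(wrong, reference), 3))  # 0.857
print(round(unigram_f1(right, reference), 3))  # 0.6
```

The clinically wrong sentence scores higher than the correct paraphrase because dropping a single negation barely changes word overlap, which is the kind of failure that radiologist-aligned evaluation is meant to expose.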

Citation

MAIRA-1: A specialised large multimodal model for radiology report generation

Preprint

Hyland, S. L., et al. (2023). MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668.

DOI: 10.48550/arXiv.2311.13668

Citations

Total citations: 87
Influential: 9
References: 48

Openness

bio.rodeo openness: Closed (score 6), low usability and reproducibility
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 0, not reproducible
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

chest_x_ray, multimodal, multimodal_transformer, radiology, radiology_report_generation, transfer_learning, vision_transformer

Resources

Research Paper · Official Website