bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Imaging foundation models
Imaging · Language model

Merlin

Stanford University

A 3D vision-language foundation model for abdominal CT that pretrains on paired scans, radiology reports, and structured EHR codes for zero-shot interpretation.

Released: January 2026

Merlin is a three-dimensional vision-language foundation model for abdominal computed tomography (CT), developed by researchers at Stanford University and published in Nature in 2026. It addresses a pressing clinical bottleneck: the volume of abdominal CT studies far outstrips radiologist capacity, yet most medical vision-language models are limited to two-dimensional images and short text, leaving the volumetric nature of CT and the rich clinical context surrounding each scan underused.

The model's central innovation is a multistage pretraining framework that learns from two complementary supervision signals already present in routine clinical data, requiring no additional manual annotation. Structured electronic health record (EHR) diagnosis codes provide weak, label-like supervision, while free-text radiology reports provide descriptive, contrastive supervision. By aligning a 3D CT encoder to both modalities, Merlin produces general-purpose volumetric embeddings that transfer across a broad spectrum of downstream tasks.

Merlin sits alongside CT-focused foundation models such as CT-CLIP and CT-FM, but distinguishes itself by jointly exploiting structured EHR codes and unstructured reports rather than reports alone, and by being released openly with code, weights, and a dataset for the research community.

Key Features

  • 3D volumetric encoding: An inflated 3D (I3D) ResNet-152 image encoder processes full CT volumes (resampled to 1.5 mm in-plane, 3 mm out-of-plane, cropped to 224×224×160), preserving cross-slice anatomical context that 2D models discard.
  • Dual supervision from clinical data: Pretraining combines a binary cross-entropy loss over 1,692 hierarchical phenotypes derived from ICD codes with an InfoNCE contrastive loss aligning images to radiology report findings.
  • Annotation-free pretraining: Both supervision signals come from existing EHR and report data, so no manual labeling is required to build the foundation model.
  • Broad task transfer: A single pretrained backbone supports zero-shot finding classification, phenotype prediction, cross-modal retrieval, chronic disease forecasting, report generation, and organ segmentation.
  • Open release: Code and model weights are released under an MIT license via GitHub, Hugging Face, and PyPI; the Merlin Abdominal CT dataset is released separately under a Stanford AIMI non-commercial research data-use agreement.
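
The dual supervision described in the list above can be sketched in plain Python. This is a minimal illustration of a binary cross-entropy term over phenotype labels plus an InfoNCE term over matched image and report embeddings, not Merlin's training code. The function names, the symmetric form of the contrastive loss, the temperature of 0.07, and the equal weighting of the two terms are all assumptions made for the example; the text here does not say whether the two losses are optimized jointly or in separate stages.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent phenotype labels."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax_at(row, index):
    """log softmax(row)[index], computed with the log-sum-exp trick."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - lse

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; the image and report at the same index are a pair.

    The temperature default is an assumption, not a value from the source.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and report j
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    text_to_image = -sum(
        log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy batch: two scans, three phenotype labels each, 2-d embeddings.
phenotype_logits = [1.2, -0.7, 0.3, -2.0, 0.9, 0.1]
phenotype_labels = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
image_embeddings = [[0.9, 0.1], [0.2, 0.8]]
report_embeddings = [[1.0, 0.0], [0.1, 0.9]]

# Equal weighting of the two terms is an assumption for illustration.
total_loss = bce_with_logits(phenotype_logits, phenotype_labels) + info_nce(
    image_embeddings, report_embeddings
)
print(f"combined toy loss: {total_loss:.4f}")
```

The contrastive term drops when each scan embedding is closer to its own report than to the other reports in the batch, which is the alignment behavior the feature list describes.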

Technical Details

Merlin pairs an I3D ResNet-152 image encoder with a Clinical Longformer text encoder that has a 4,096-token context window. The long context was chosen because roughly 21% of report findings sections exceed the 512-token limit of standard encoders.

The model was pretrained on a clinical dataset of over 6 million CT images from 15,331 scans, more than 1.8 million EHR diagnosis codes, and over 6 million tokens of radiology report text.

Evaluation spanned 752 tasks across six categories, validated on 5,137 internal CT examinations and 44,098 external scans from three independent institutions plus two public datasets. Reported results include a zero-shot findings-classification F1 of 0.741 internally and 0.647 externally, a macro-average AUROC of 0.812 across 692 phenotypes, and an AUROC of 0.757 for five-year prediction of six chronic diseases. For segmentation across 20 organs, Merlin (integrated with nnU-Net) outperforms baselines, with the largest gains on smaller and more complex structures.
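
To make the macro-average AUROC figure concrete, the sketch below computes a per-task AUROC as the probability that a randomly chosen positive case outscores a randomly chosen negative one, then takes the unweighted mean across tasks. The phenotype names and scores are invented toy values, and the pairwise loop is chosen for readability rather than speed; this is not the evaluation code used for the reported numbers.

```python
def auroc(scores, labels):
    """AUROC as P(positive score > negative score); ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def macro_auroc(scores_by_task, labels_by_task):
    """Unweighted mean of per-task AUROCs, so rare tasks count equally."""
    per_task = {
        name: auroc(scores_by_task[name], labels_by_task[name])
        for name in scores_by_task
    }
    return sum(per_task.values()) / len(per_task), per_task

# Hypothetical phenotypes with toy predictions for four scans each.
toy_scores = {
    "phenotype_a": [0.9, 0.8, 0.3, 0.1],
    "phenotype_b": [0.6, 0.4, 0.7, 0.2],
}
toy_labels = {
    "phenotype_a": [1, 1, 0, 0],
    "phenotype_b": [1, 1, 0, 0],
}

macro, per_task = macro_auroc(toy_scores, toy_labels)
print(per_task)  # {'phenotype_a': 1.0, 'phenotype_b': 0.5}
print(macro)     # 0.75
```

Because the macro average weights every phenotype equally, a handful of hard, rare phenotypes can pull the headline number down even when common ones are predicted well.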

Applications

Merlin is intended as a reusable backbone for abdominal CT analysis in both research and clinical decision support settings. Radiologists and clinical informaticians can use its embeddings for zero-shot triage of imaging findings, automated draft report generation, retrieval of similar prior studies, and opportunistic screening for chronic disease risk from scans acquired for unrelated reasons. Because the segmentation, phenotyping, and prediction heads share one pretrained encoder, groups with limited labeled data can fine-tune for new tasks at modest annotation cost, which makes Merlin especially useful for institutions building CT-based machine-learning pipelines.
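
Two of the embedding-based uses mentioned above, retrieval of similar prior studies and zero-shot triage, reduce to cosine-similarity comparisons once embeddings exist. The sketch below assumes the scan and text embeddings have already been produced by some encoder; it does not call Merlin's actual API. The study IDs, the vectors, and the "present" versus "absent" prompt pairing are hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_similar(query_emb, library, k=3):
    """Return the k (study_id, similarity) pairs closest to the query scan."""
    ranked = sorted(
        ((study_id, cosine(query_emb, emb)) for study_id, emb in library.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

def zero_shot_flag(image_emb, present_prompt_emb, absent_prompt_emb):
    """Flag a finding when the scan sits closer to the 'present' prompt."""
    return cosine(image_emb, present_prompt_emb) > cosine(image_emb, absent_prompt_emb)

# Hypothetical precomputed embeddings keyed by made-up study IDs.
prior_studies = {
    "study_001": [1.0, 0.0, 0.0],
    "study_002": [0.8, 0.6, 0.0],
    "study_003": [0.0, 0.0, 1.0],
}
query_scan = [1.0, 0.1, 0.0]

top_two = retrieve_similar(query_scan, prior_studies, k=2)
print([study_id for study_id, _ in top_two])  # ['study_001', 'study_002']

# Toy text-prompt embeddings for one finding.
finding_present = [1.0, 0.0, 0.0]
finding_absent = [0.0, 1.0, 0.0]
print(zero_shot_flag(query_scan, finding_present, finding_absent))  # True
```

In practice a retrieval library of any real size would use a vector index instead of a full sort, but the ranking criterion is the same.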

Impact

By demonstrating that structured EHR codes and free-text reports together yield a strong 3D CT representation without bespoke labeling, Merlin offers a practical template for building volumetric medical foundation models from data hospitals already collect. Its open release of code, weights, and a labeled abdominal CT dataset lowers the barrier for reproducible research in 3D medical imaging, an area historically constrained by data access. Key limitations include its focus on abdominal anatomy, its reliance on single-institution pretraining data that may affect generalization, and the need for careful clinical validation before any diagnostic deployment.

Citation

Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset

Blankemeier, L., et al. (2026) Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset. Nature.

DOI: 10.1038/s41586-026-10181-8

Citations

Total Citations: 127
Influential: 17
References: 32

GitHub

Stars: 419
Forks: 54
Open Issues: 0
Contributors: 2
Last Push: 18d ago
Language: Python
License: MIT

HuggingFace

Downloads: 12.1K
Likes: 28
Last Modified: 1mo ago
Pipeline: text-to-image

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
54 · Partial
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 7
Model Openness Framework: Unclassified (missing required components)

Tags

cnn, contrastive_learning, ct, disease_prediction, foundation_model, multimodal, radiology, report_generation, segmentation, transformer, zero_shot_classification

Resources

GitHub Repository · Research Paper · HuggingFace Model