Merlin

3D vision-language foundation model for abdominal CT, pretrained on scans, radiology reports, and EHR codes for zero-shot interpretation.

Released: January 2026

Merlin is a three-dimensional vision-language foundation model for abdominal computed tomography (CT), developed by researchers at Stanford University and published in Nature in 2026. It addresses a pressing clinical bottleneck: the volume of abdominal CT studies far outstrips radiologist capacity, yet most medical vision-language models are limited to two-dimensional images and short text, leaving the volumetric nature of CT and the rich clinical context surrounding each scan underused.

The model's central innovation is a multistage pretraining framework that learns from two complementary supervision signals already present in routine clinical data, requiring no additional manual annotation. Structured electronic health record (EHR) diagnosis codes provide weak, label-like supervision, while free-text radiology reports provide descriptive, contrastive supervision. By aligning a 3D CT encoder to both modalities, Merlin produces general-purpose volumetric embeddings that transfer across a broad spectrum of downstream tasks.
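The two supervision signals correspond to two loss terms, named elsewhere on this page as binary cross-entropy over phenotype labels and InfoNCE over image-report pairs. The following is a minimal sketch of those two terms in plain Python, not the authors' implementation. The function names, the temperature of 0.07, and the one-directional (image-to-text) form of InfoNCE are assumptions made for illustration, and how the terms are weighted or staged during pretraining is not specified here.

```python
import math

def bce_loss(logits, targets):
    """Mean binary cross-entropy over independent (multi-label) phenotype targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE: image i should match the report at index i.

    temperature=0.07 is an assumed value, not taken from the source.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    loss = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        # log-sum-exp with max subtraction for stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(imgs)

# Toy batch: two scans, two reports, 2-dimensional embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
reports = [[0.9, 0.1], [0.2, 0.8]]
contrastive_term = info_nce(images, reports)
phenotype_term = bce_loss([2.0, -1.5, 0.3], [1, 0, 1])
```

The contrastive term falls as each scan's embedding moves closer to its own report than to the other reports in the batch, while the phenotype term treats every diagnosis-derived label as an independent yes/no prediction.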

Merlin sits alongside CT-focused foundation models such as CT-CLIP and CT-FM, but distinguishes itself by jointly exploiting structured EHR codes and unstructured reports rather than reports alone, and by being released openly with code, weights, and a dataset for the research community.

Key Features

3D volumetric encoding: An inflated 3D (I3D) ResNet-152 image encoder processes full CT volumes (resampled to 1.5 mm in-plane, 3 mm out-of-plane, cropped to 224×224×160), preserving cross-slice anatomical context that 2D models discard.
Dual supervision from clinical data: Pretraining combines a binary cross-entropy loss over 1,692 hierarchical phenotypes derived from ICD codes with an InfoNCE contrastive loss aligning images to radiology report findings.
Annotation-free pretraining: Both supervision signals come from existing EHR and report data, so no manual labeling is required to build the foundation model.
Broad task transfer: A single pretrained backbone supports zero-shot finding classification, phenotype prediction, cross-modal retrieval, chronic disease forecasting, report generation, and organ segmentation.
Open release: Code and model weights are released under an MIT license via GitHub, Hugging Face, and PyPI; the Merlin Abdominal CT dataset is released separately under a Stanford AIMI non-commercial research data-use agreement.
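The volume geometry quoted in the first feature above (1.5 mm in-plane, 3 mm out-of-plane, 224×224×160 voxels) can be worked through with a short sketch. It only plans the resampled shape and the crop or pad indices per axis. Centre cropping and symmetric padding for short axes are assumptions, since the text states only the target spacing and crop size, and the function names are hypothetical.

```python
TARGET_SPACING_MM = (1.5, 1.5, 3.0)  # in-plane x, in-plane y, out-of-plane z
TARGET_SHAPE = (224, 224, 160)       # voxels per axis after cropping

def resampled_shape(shape, spacing_mm, target_spacing_mm=TARGET_SPACING_MM):
    """Voxel counts after resampling, keeping the physical extent unchanged."""
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing_mm, target_spacing_mm)
    )

def center_crop_or_pad(size, target):
    """Plan one axis: returns (crop_start, crop_stop, pad_before, pad_after).

    Centre cropping and symmetric padding are assumptions for illustration.
    """
    if size >= target:
        start = (size - target) // 2
        return start, start + target, 0, 0
    pad = target - size
    return 0, size, pad // 2, pad - pad // 2

def plan(shape, spacing_mm):
    """Return the resampled shape and the per-axis crop/pad plan."""
    new_shape = resampled_shape(shape, spacing_mm)
    axes = [center_crop_or_pad(n, t) for n, t in zip(new_shape, TARGET_SHAPE)]
    return new_shape, axes

# Hypothetical input scan: 512 x 512 x 300 voxels at 0.8 x 0.8 x 1.5 mm.
example_shape, example_axes = plan((512, 512, 300), (0.8, 0.8, 1.5))
```

For this hypothetical scan the resampled shape is (273, 273, 150): the in-plane axes are cropped to voxels 24 through 247, and the out-of-plane axis is padded by 5 slices on each side to reach 160.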

Technical Details

Merlin pairs an I3D ResNet-152 image encoder with a Clinical Longformer text encoder using a 4,096-token context window, chosen because roughly 21% of report findings sections exceed the 512-token limit of standard encoders. It was pretrained on a clinical dataset of over 6 million CT images from 15,331 scans, more than 1.8 million EHR diagnosis codes, and over 6 million tokens of radiology report text.

Evaluation spanned 752 tasks across six categories, validated on 5,137 internal CT examinations and 44,098 external scans from three independent institutions plus two public datasets. Reported results include a zero-shot findings-classification F1 of 0.741 internally and 0.647 externally, a macro-average AUROC of 0.812 across 692 phenotypes, and an AUROC of 0.757 for five-year prediction of six chronic diseases. For segmentation across 20 organs, Merlin (integrated with nnU-Net) outperforms baselines, with the largest gains on smaller and more complex structures.
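For readers unfamiliar with the phenotype metric, a macro-average AUROC is the unweighted mean of per-task AUROC values, so rare and common phenotypes count equally. The sketch below computes it from scratch using the pairwise (Mann-Whitney) definition. It is an illustration of the metric only; the function names and toy data are hypothetical, and how the paper handles phenotypes lacking positive or negative cases is not stated here.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def macro_auroc(scores_by_task, labels_by_task):
    """Unweighted mean of per-task AUROCs (one task per phenotype)."""
    values = [
        auroc(scores_by_task[name], labels_by_task[name])
        for name in scores_by_task
    ]
    return sum(values) / len(values)

# Toy data for two hypothetical phenotypes over four scans.
scores = {"phenotype_a": [0.9, 0.8, 0.3, 0.1], "phenotype_b": [0.7, 0.2, 0.6, 0.1]}
labels = {"phenotype_a": [1, 0, 1, 0], "phenotype_b": [1, 0, 1, 0]}
result = macro_auroc(scores, labels)
```

The pairwise form is quadratic in the number of scans, which is fine for illustration; a rank-based implementation would be the usual choice at evaluation scale.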

Applications

Merlin is intended as a reusable backbone for abdominal CT analysis in both research and clinical-decision-support settings. Radiologists and clinical informaticians can use its embeddings for zero-shot triage of imaging findings, automated draft report generation, retrieval of similar prior studies, and opportunistic screening for chronic disease risk from scans acquired for unrelated reasons. Because the segmentation, phenotyping, and prediction heads share one pretrained encoder, groups with limited labeled data can fine-tune for new tasks at modest annotation cost, making it especially useful for institutions building CT-based machine-learning pipelines.
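Retrieval of similar prior studies from embeddings usually reduces to a nearest-neighbour search. A minimal sketch under stated assumptions: embeddings are plain lists of floats that have already been computed, cosine similarity is the ranking measure, and the study identifiers and function names are hypothetical. It does not call the Merlin package.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=3):
    """Rank stored studies by similarity to the query embedding.

    index maps a study identifier to its embedding; the result is a list of
    (study_id, similarity) pairs, most similar first.
    """
    scored = [(study_id, cosine(query, emb)) for study_id, emb in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional embeddings for three prior studies.
prior_studies = {
    "study_a": [1.0, 0.0],
    "study_b": [0.0, 1.0],
    "study_c": [1.0, 1.0],
}
matches = top_k([1.0, 0.1], prior_studies, k=2)
```

The same ranking step covers zero-shot use if the index holds text-prompt embeddings instead of scan embeddings, assuming both live in the shared image-text space that contrastive pretraining produces.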

Impact

By demonstrating that structured EHR codes and free-text reports together yield a strong 3D CT representation without bespoke labeling, Merlin offers a practical template for building volumetric medical foundation models from data hospitals already collect. Its open release of code, weights, and a labeled abdominal CT dataset lowers the barrier for reproducible research in 3D medical imaging, an area historically constrained by data access. Key limitations include its focus on abdominal anatomy, reliance on single-institution pretraining data that may affect generalization, and the careful clinical validation still required before any diagnostic deployment.

Citation

Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset

Blankemeier, L., et al. (2026) Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset. Nature.

DOI: 10.1038/s41586-026-10181-8

Recent citations

Papers that recently cited this model.

Fine-Grained Vision-Language Pretraining with Organ-Conditioned Pattern Tokens for CT Understanding
Guoliang You, Xiaomeng Chu
Jul 2026
0 · Influential
Language-Guided Segmentation of Medical Images: A Review of Foundation Models
Saqib Qamar
Bioengineering · Jul 2026
0
Whole body CT attenuation and volume charts from routine clinical scans via LLM report filtering
Christian Wachinger, B. Renger, Christopher Späth, et al.
npj Digital Medicine · Jul 2026
0

Top citations

The most-cited papers that cite this model.

Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology
Dyke Ferber, O. E. El Nahhas, G. Wölflein, et al.
Nature Cancer · Jun 2025
122
CLIP in Medical Imaging: A Comprehensive Survey
Zihao Zhao, Yuxiao Liu, Han Wu, et al.
arXiv.org · 2023
107
AbdomenAtlas: A large-scale, detailed-annotated, & multi-center dataset for efficient transfer learning and open algorithmic benchmarking
Wenxuan Li, Chongyu Qu, Xiaoxi Chen, et al.
Medical Image Analysis · Jul 2024
85
CLIP in medical imaging: A survey.
Zihao Zhao, Yuxiao Liu, Han Wu, et al.
Medical Image Analysis · Dec 2023
82
Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation
Cheng-Yi Li, Kao-Jung Chang, Cheng-Fu Yang, et al.
Nature Communications · Jul 2024
73

Citations

Total Citations: 135

Influential: 17

References: 32

GitHub

Stars: 453

Forks: 55

Open Issues: 2

Contributors: 2

Last Push: 2 months ago

Language: Python

License: MIT

HuggingFace

Downloads: 6.9K

Likes: 34

Last Modified: 2 months ago

Pipeline: text-to-image

Fields of citing research

Medicine: 97%
Computer Science: 95%
Engineering: 28%
Biology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 54 (Partial)

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 7

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
