A 3D vision-language foundation model for abdominal CT that pretrains on paired scans, radiology reports, and structured EHR codes for zero-shot interpretation.
Merlin is a three-dimensional vision-language foundation model for abdominal computed tomography (CT), developed by researchers at Stanford University and published in Nature in 2026. It addresses a pressing clinical bottleneck: the volume of abdominal CT studies far outstrips radiologist capacity, yet most medical vision-language models are limited to two-dimensional images and short text, leaving the volumetric nature of CT and the rich clinical context surrounding each scan underused.
The model's central innovation is a multistage pretraining framework that learns from two complementary supervision signals already present in routine clinical data, requiring no additional manual annotation. Structured electronic health record (EHR) diagnosis codes provide weak, label-like supervision, while free-text radiology reports provide descriptive, contrastive supervision. By aligning a 3D CT encoder to both modalities, Merlin produces general-purpose volumetric embeddings that transfer across a broad spectrum of downstream tasks.
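The two supervision signals can be illustrated with a minimal sketch in plain Python. It assumes a multi-label binary cross-entropy loss for the EHR codes and a symmetric InfoNCE (CLIP-style) loss for scan-report pairs. The function names, the temperature of 0.07, and the toy embeddings are illustrative assumptions, not Merlin's released training code, and how the two objectives are scheduled across pretraining stages is not shown.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over multi-label EHR-code targets (0 or 1)."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_row(scores, target_index):
    # log-sum-exp with max subtraction for stability
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th scan should match the i-th report."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every scan and every report
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    img_to_txt = sum(_cross_entropy_row(sim[i], i) for i in range(n)) / n
    txt_to_img = sum(
        _cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)


# Toy example: two scans, two reports, three hypothetical diagnosis codes
scan_embs = [[1.0, 0.0], [0.0, 1.0]]
report_embs = [[0.9, 0.1], [0.2, 0.8]]
print(contrastive_loss(scan_embs, report_embs))
print(bce_with_logits([2.0, -1.0, 0.5], [1, 0, 1]))
```

The code loss treats each diagnosis code as an independent binary label, which matches the "weak, label-like" role described above, while the contrastive loss only requires that each scan be closer to its own report than to other reports in the batch.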
Merlin sits alongside CT-focused foundation models such as CT-CLIP and CT-FM, but distinguishes itself by jointly exploiting structured EHR codes and unstructured reports rather than reports alone, and by being released openly with code, weights, and a dataset for the research community.
Merlin pairs an I3D ResNet-152 image encoder with a Clinical Longformer text encoder. The text encoder's 4,096-token context window was chosen because roughly 21% of report findings sections exceed the 512-token limit of standard encoders. The model was pretrained on a clinical dataset of over 6 million CT images from 15,331 scans, more than 1.8 million EHR diagnosis codes, and over 6 million tokens of radiology report text. Evaluation spanned 752 tasks across six categories and was validated on 5,137 internal CT examinations and 44,098 external scans from three independent institutions plus two public datasets. Reported results include a zero-shot findings-classification F1 of 0.741 internally and 0.647 externally, a macro-average AUROC of 0.812 across 692 phenotypes, and an AUROC of 0.757 for five-year prediction of six chronic diseases. For segmentation across 20 organs, Merlin (integrated with nnU-Net) outperforms baselines, with the largest gains on smaller and more complex structures.
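For readers unfamiliar with the phenotyping metric, the following sketch shows what a macro-average AUROC means: AUROC is computed separately for each phenotype and the per-phenotype values are averaged with equal weight. The phenotype names and scores are made up for illustration, and the pairwise computation is a simple reference form, not the evaluation code used for the reported 0.812.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(scores_by_task, labels_by_task):
    """Unweighted mean of per-task AUROC values."""
    values = [auroc(scores_by_task[t], labels_by_task[t]) for t in scores_by_task]
    return sum(values) / len(values)


# Hypothetical phenotypes with toy model scores and ground-truth labels
scores = {
    "phenotype_a": [0.9, 0.8, 0.3, 0.1],
    "phenotype_b": [0.7, 0.2, 0.6, 0.1],
}
labels = {
    "phenotype_a": [1, 0, 1, 0],
    "phenotype_b": [1, 0, 1, 0],
}
print(macro_auroc(scores, labels))
```

Because every phenotype contributes equally, rare phenotypes influence the macro average as much as common ones.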
Merlin is intended as a reusable backbone for abdominal CT analysis in both research and clinical-decision-support settings. Radiologists and clinical informaticians can use its embeddings for zero-shot triage of imaging findings, automated draft report generation, retrieval of similar prior studies, and opportunistic screening for chronic disease risk from scans acquired for unrelated reasons. Because the segmentation, phenotyping, and prediction heads share one pretrained encoder, groups with limited labeled data can fine-tune for new tasks at modest annotation cost, making it especially useful for institutions building CT-based machine-learning pipelines.
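As a rough illustration of zero-shot triage with embeddings, the sketch below flags a finding when a scan embedding is closer to a "finding present" text embedding than to a "finding absent" one, then orders the flagged scans by that margin. The prompt-pair scheme, the function names, and the two-dimensional toy vectors are assumptions for illustration; in practice the embeddings would come from the pretrained image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def finding_margin(scan_emb, present_emb, absent_emb):
    """Positive when the scan is closer to the 'present' prompt."""
    return cosine(scan_emb, present_emb) - cosine(scan_emb, absent_emb)


def triage(scan_embs, present_emb, absent_emb):
    """Return ids of scans flagged positive, largest margin first."""
    margins = {
        scan_id: finding_margin(emb, present_emb, absent_emb)
        for scan_id, emb in scan_embs.items()
    }
    flagged = [scan_id for scan_id, m in margins.items() if m > 0]
    return sorted(flagged, key=lambda scan_id: margins[scan_id], reverse=True)


# Toy embeddings standing in for encoder outputs (hypothetical values)
present_prompt = [1.0, 0.0]  # e.g. text embedding of a "finding present" prompt
absent_prompt = [0.0, 1.0]   # e.g. text embedding of a "finding absent" prompt
scans = {
    "scan_a": [0.9, 0.1],
    "scan_b": [0.2, 0.8],
    "scan_c": [0.6, 0.4],
}
print(triage(scans, present_prompt, absent_prompt))
```

The same cosine comparison, applied between scan embeddings rather than between scans and prompts, would support retrieval of similar prior studies.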
By demonstrating that structured EHR codes and free-text reports together yield a strong 3D CT representation without bespoke labeling, Merlin offers a practical template for building volumetric medical foundation models from data hospitals already collect. Its open release of code, weights, and a labeled abdominal CT dataset lowers the barrier for reproducible research in 3D medical imaging, an area historically constrained by data access. Key limitations include its focus on abdominal anatomy, reliance on single-institution pretraining data that may affect generalization, and the careful clinical validation still required before any diagnostic deployment.
Blankemeier, L., et al. (2026) Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset. Nature.
DOI: 10.1038/s41586-026-10181-8