MUSK

Stanford University / Harvard Medical School

Vision-language foundation model for precision oncology, pretrained on 50M pathology images and 1B text tokens via unified masked modeling.

Released: January 2025

MUSK (Multimodal transformer with Unified maSK modeling) is a vision-language foundation model for computational pathology and precision oncology. Developed by Jinxi Xiang, Xiyue Wang, and senior author Ruijiang Li at Stanford University, with collaborators at Harvard Medical School, it was published in Nature on January 8, 2025. MUSK addresses a central bottleneck in pathology AI: most foundation models learn only from images, while the clinical text that accompanies whole-slide images (diagnostic reports, captions, and annotations) is abundant but rarely paired one-to-one with the corresponding histology.

The model's key innovation is a two-stage pretraining recipe. The first stage exploits large-scale unpaired image and text data through unified masked modeling, rather than requiring tightly aligned image-caption pairs. This lets MUSK absorb knowledge from far more pathology data than contrastive approaches such as CONCH or PLIP, which depend on curated image-text pairs. The second stage, contrastive alignment, then ties the two modalities together, yielding a model that handles both pure-vision tasks and joint vision-language reasoning.

MUSK sits alongside contemporaneous pathology foundation models—UNI, Virchow, CONCH, and Prov-GigaPath—but distinguishes itself by coupling a strong visual encoder with an integrated language model and demonstrating clinical outcome prediction, not just diagnostic classification.

Key Features

Unified masked modeling on unpaired data: MUSK pretrains on large pools of pathology images and text independently through masked image and masked language modeling, sidestepping the scarcity of one-to-one image-caption pairs that constrains contrastive methods.
Vision-language integration: Built on the BEiT-3 multiway transformer architecture, MUSK shares parameters across image and text streams, enabling cross-modal retrieval, visual question answering, and zero-shot classification from a single model.
Clinical outcome prediction: Beyond diagnosis, MUSK predicts melanoma relapse, pan-cancer prognosis, and immunotherapy response in lung and gastro-esophageal cancers, linking histology to patient-level decisions.
Broad benchmark coverage: The model is evaluated across 23 patch-level and slide-level benchmarks spanning retrieval, VQA, and classification tasks.
Open weights for research: Pretrained weights are released on Hugging Face under a CC-BY-NC-ND 4.0 license for non-commercial academic use.
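
The masked-modeling idea in the first feature above can be illustrated with a toy masking step. This is a minimal sketch, not MUSK's training code: the `[MASK]` placeholder, the `mask_sequence` helper, the mask ratios, and the example tokens are all assumptions made for illustration. It shows only how each modality can be corrupted on its own, which is why no image-caption pairing is needed at this stage.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder; real models use a learned mask token


def mask_sequence(tokens, mask_ratio, rng):
    """Hide a random subset of positions.

    Returns (corrupted, targets), where targets maps each masked position
    to the original token a model would be trained to reconstruct.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, targets


rng = random.Random(0)

# Text stream: word-level tokens from a made-up report snippet.
text_tokens = "invasive ductal carcinoma with lymphovascular invasion".split()
# Image stream: stand-ins for the visual tokens of a 4x4 patch grid.
image_tokens = [f"v{i}" for i in range(16)]

# The two streams are masked independently; neither needs a paired partner.
masked_text, text_targets = mask_sequence(text_tokens, 0.3, rng)
masked_image, image_targets = mask_sequence(image_tokens, 0.4, rng)

print(masked_text)
print(len(image_targets), "image tokens hidden")
```

In an actual pretraining run, the reconstruction targets would feed a prediction loss over the masked positions; that part is omitted here.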

Technical Details

MUSK uses the BEiT-3 multiway transformer backbone. As its name indicates, the musk_large_patch16_384 configuration operates on 384×384-pixel input images split into 16×16-pixel tokens. Pretraining draws on roughly 50 million pathology image patches from 11,577 patients across 33 cancer types, together with 1 billion pathology-related text tokens. The first stage applies masked modeling to images and text separately; the second performs contrastive learning to align the two embedding spaces. On evaluation, MUSK delivered strong gains across the 23 benchmarks, and in immunotherapy response prediction it correctly identified responders roughly 83% of the time, about a 12% improvement over competing vision-language models reported in the study.
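
To make the second-stage alignment concrete, the sketch below computes a generic symmetric contrastive (InfoNCE-style) loss over a batch of matched image and text embeddings. The text above states only that the second stage uses contrastive learning; the specific loss form, the temperature of 0.07, and the helper names are assumptions borrowed from common CLIP-style practice, not details confirmed for MUSK.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: row k of each list is assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text direction: each image should pick out its own text.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction uses the transposed similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy 2-D embeddings: correctly paired vs. deliberately swapped.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # True
```

The loss is small when each image embedding is closest to its own text embedding and large when the pairs are mismatched, which is the pressure that pulls the two embedding spaces together.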

Applications

MUSK is intended for researchers and clinical investigators in computational pathology and oncology. Its zero-shot and few-shot capabilities support diagnostic image classification, cross-modal retrieval (finding relevant cases or text from an image and vice versa), and pathology visual question answering. The model's outcome-prediction results point toward decision-support tools for prognosis estimation and immunotherapy patient selection, where combining histologic features with clinical text can refine risk stratification beyond what either modality offers alone.
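
Once images and text share an embedding space, zero-shot classification and cross-modal retrieval both reduce to ranking by similarity. The sketch below illustrates this with hand-written toy vectors; it does not call the MUSK model, and the function names, labels, report identifiers, and embedding values are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def zero_shot_classify(image_emb, prompt_embs):
    """Return the label whose text-prompt embedding is closest to the image."""
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    return max(scores, key=scores.get), scores


def retrieve(query_emb, candidates, k=2):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_emb, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for embeddings a vision-language encoder would produce.
prompt_embs = {"tumor": [0.9, 0.1, 0.0], "normal": [0.1, 0.9, 0.0]}
image_emb = [0.8, 0.2, 0.1]
label, scores = zero_shot_classify(image_emb, prompt_embs)
print(label)  # tumor

report_embs = {
    "report_a": [0.7, 0.3, 0.0],
    "report_b": [0.0, 0.2, 0.9],
    "report_c": [0.2, 0.8, 0.1],
}
print(retrieve(image_emb, report_embs, k=2))  # ['report_a', 'report_c']
```

Swapping the roles of query and candidates gives text-to-image retrieval with the same ranking function.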

Impact

By demonstrating that unified masked modeling on abundant unpaired data can rival or exceed contrastive image-text pretraining, MUSK offered a data-efficient path for multimodal pathology foundation models and drew notable attention as a Nature publication from the Stanford pathology and radiation oncology groups. Its emphasis on clinical outcomes—relapse, prognosis, and treatment response—helped push the field's framing from diagnostic benchmarking toward precision-oncology endpoints. The released weights have been adopted as a baseline in subsequent pathology foundation model surveys and comparisons. Practical use is constrained by its non-commercial license and the usual caveats of retrospective evaluation, which require prospective validation before clinical deployment.

Citation

A Vision-Language Foundation Model for Precision Oncology

Xiang, J., et al. (2025) A Vision-Language Foundation Model for Precision Oncology. Nature.

DOI: 10.1038/s41586-024-08378-w

Recent citations

Papers that recently cited this model.

Development and evaluation of a large language model-based, retrieval-augmented generation application for query response in early oncology clinical trials
D. Pesantez, G. Fucà, A. Magrì, et al.
ESMO Real World Data and Digital Oncology · Sep 2026 · Citations: 0

Pretraining Multiple Instance Learning Networks with Multi-Teacher Distillation from Pathology Slide Foundation Models
Mingxi Fu, Jiawen Li, Renao Yan, et al.
Jul 2026 · Citations: 0

LaGuadia: Language-Guided Adaptive Distillation from Pathology Foundation Models
Gangsu Kim, Won-Ki Jeong
Jul 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

Foundation models and intelligent decision-making: Progress, challenges, and perspectives
Jincai Huang, Yongjun Xu, Qi Wang, et al.
Innovation (Cambridge (Mass.)) · May 2025 · Citations: 83

LLM Agents Making Agent Tools
G. Wölflein, Dyke Ferber, D. Truhn, et al.
Annual Meeting of the Association for Computational Linguistics · Feb 2025 · Citations: 48

New horizons at the interface of artificial intelligence and translational cancer research
J. Yates, E. V. Van Allen
Cancer Cell · Apr 2025 · Citations: 35

AIBench: Towards trustworthy evaluation under the 45° law
Zicheng Zhang, Junying Wang, Yijin Guo, et al.
Displays (Guildford) · Oct 2025 · Citations: 34

AI-enabled virtual spatial proteomics from histopathology for interpretable biomarker discovery in lung cancer
Zhe Li, Yuchen Li, Jinxi Xiang, et al.
Nature Medicine · Jan 2026 · Citations: 28

Citations

Total Citations: 283
Influential: 21
References: 75

GitHub

Stars: 240
Forks: 29
Open Issues: 4
Contributors: 2
Last Push: 9mo ago
Language: Python

HuggingFace

Downloads: 0
Likes: 59
Last Modified: 1y ago
Pipeline: image-to-text

Fields of citing research

Medicine: 95%
Computer Science: 88%
Biology: 12%
Engineering: 12%
Materials Science: 1%
Mathematics: 1%
Environmental Science: 1%
Law: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

12 · Closed

Usability (can I run it?): 12

Reproducibility (can I retrain it?): 9

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository
Research Paper
HuggingFace Model
