Stanford University / Harvard Medical School
A vision-language foundation model for precision oncology that pretrains on 50M pathology images and 1B text tokens via unified masked modeling.
MUSK (Multimodal transformer with Unified maSK modeling) is a vision-language foundation model for computational pathology and precision oncology. Developed by Jinxi Xiang, Xiyue Wang, and senior author Ruijiang Li at Stanford University, with collaborators at Harvard Medical School, it was published in Nature on January 8, 2025. MUSK addresses a central bottleneck in pathology AI: most foundation models learn only from images, while the clinical text that accompanies whole-slide images (diagnostic reports, captions, and annotations) is abundant but rarely paired one-to-one with the corresponding histology.
The model's key innovation is a two-stage pretraining recipe. The first stage exploits large-scale unpaired image and text data through unified masked modeling, rather than requiring tightly aligned image-caption pairs. This lets MUSK absorb knowledge from far more pathology data than contrastive approaches such as CONCH or PLIP, which depend on curated image-text pairs. The second stage, contrastive alignment, then ties the two modalities together, yielding a model that handles both pure-vision tasks and joint vision-language reasoning.
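The masking step at the heart of the first stage can be illustrated with a toy sketch. It is not MUSK's training code: the `[MASK]` placeholder, the `mask_tokens` helper, the string "patch" tokens, and the masking ratios are all assumptions chosen for illustration. The point it shows is that the same corruption routine applies to an image sequence and a text sequence independently, so no image-text pairing is needed.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not MUSK's actual vocabulary


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        corrupted[pos] = MASK
    return corrupted, targets


rng = random.Random(0)

# Unpaired inputs: the text does not describe the image, and does not need to.
text_tokens = "invasive ductal carcinoma with lymphovascular invasion".split()
image_tokens = [f"patch_{i}" for i in range(16)]  # stand-ins for visual tokens

# Ratios below are illustrative assumptions, not the published settings.
masked_text, text_targets = mask_tokens(text_tokens, 0.3, rng)
masked_image, image_targets = mask_tokens(image_tokens, 0.4, rng)

print(masked_text, text_targets)
print(masked_image, image_targets)
```

A real model would be trained to recover each entry of `targets` from the corrupted sequence; here the sketch stops at constructing the training signal.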
MUSK sits alongside contemporaneous pathology foundation models—UNI, Virchow, CONCH, and Prov-GigaPath—but distinguishes itself by coupling a strong visual encoder with an integrated language model and demonstrating clinical outcome prediction, not just diagnostic classification.
MUSK uses the BEiT-3 multiway transformer backbone. As its name indicates, the musk_large_patch16_384 configuration takes 384 × 384-pixel input images and splits them into 16 × 16-pixel patches. Pretraining draws on roughly 50 million pathology image patches from 11,577 patients across 33 cancer types, together with 1 billion pathology-related text tokens. The first stage applies masked modeling to images and text separately; the second performs contrastive learning to align the two embedding spaces. Across the 23 patch-level and slide-level benchmarks reported in the study, MUSK delivered strong results. In melanoma relapse prediction it correctly identified patients who went on to relapse roughly 83% of the time, about 12% better than the competing foundation models reported in the study.
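To make the second-stage objective concrete, here is a minimal sketch of a symmetric contrastive loss of the kind popularized by CLIP. It assumes the alignment stage resembles this standard formulation; the source does not give the exact loss, so the function names and the temperature of 0.07 are assumptions. Matching image-text pairs sit on the diagonal of the similarity matrix, and the loss rewards the model when each image is most similar to its own text and vice versa.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    temperature=0.07 is an assumed value, not one taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy 2-D embeddings: pair k of images matches pair k of texts.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
# Same embeddings with the texts swapped, so every pair is mismatched.
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])

print(f"aligned loss:  {aligned:.6f}")
print(f"shuffled loss: {shuffled:.6f}")
```

The aligned batch yields a loss near zero and the shuffled batch a large one, which is the gradient signal that pulls paired image and text embeddings into a shared space.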
MUSK is intended for researchers and clinical investigators in computational pathology and oncology. Its zero-shot and few-shot capabilities support diagnostic image classification, cross-modal retrieval (finding relevant cases or text from an image and vice versa), and pathology visual question answering. The model's outcome-prediction results point toward decision-support tools for prognosis estimation and immunotherapy patient selection, where combining histologic features with clinical text can refine risk stratification beyond what either modality offers alone.
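Once images and text share an embedding space, zero-shot classification and cross-modal retrieval both reduce to nearest-neighbor search by cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins for encoder outputs; the labels, case names, helper functions, and numbers are all hypothetical, and no MUSK API is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs):
    """Return the label whose text-prompt embedding is closest to the image."""
    return max(prompt_embs, key=lambda label: cosine(image_emb, prompt_embs[label]))


def retrieve(query_emb, gallery, top_k=2):
    """Rank gallery items by similarity to the query and keep the top_k names."""
    ranked = sorted(gallery, key=lambda name: cosine(query_emb, gallery[name]), reverse=True)
    return ranked[:top_k]


# Hypothetical text-prompt embeddings, one per candidate diagnosis.
prompt_embs = {
    "lung adenocarcinoma": [0.9, 0.1, 0.0],
    "lung squamous cell carcinoma": [0.1, 0.9, 0.0],
    "normal lung tissue": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a query image patch.
image_emb = [0.8, 0.3, 0.1]
predicted = zero_shot_classify(image_emb, prompt_embs)

# Hypothetical gallery of previously embedded cases for retrieval.
gallery = {
    "case_A": [0.7, 0.2, 0.1],
    "case_B": [0.0, 0.2, 0.9],
    "case_C": [0.2, 0.9, 0.1],
}
nearest = retrieve(image_emb, gallery, top_k=2)

print(predicted)  # lung adenocarcinoma
print(nearest)    # ['case_A', 'case_C']
```

No classifier is trained here: changing the set of diagnoses only requires new text prompts, which is what makes the approach zero-shot.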
By demonstrating that unified masked modeling on abundant unpaired data can rival or exceed contrastive image-text pretraining, MUSK offered a data-efficient path for multimodal pathology foundation models and drew notable attention as a Nature publication from the Stanford pathology and radiation oncology groups. Its emphasis on clinical outcomes—relapse, prognosis, and treatment response—helped push the field's framing from diagnostic benchmarking toward precision-oncology endpoints. The released weights have been adopted as a baseline in subsequent pathology foundation model surveys and comparisons. Practical use is constrained by its non-commercial license and the usual caveats of retrospective evaluation, which require prospective validation before clinical deployment.
Xiang, J., et al. (2025) A Vision-Language Foundation Model for Precision Oncology. Nature.
DOI: 10.1038/s41586-024-08378-w