bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Pathology foundation models
Pathology · Language model

MUSK

Stanford University / Harvard Medical School

A vision-language foundation model for precision oncology that pretrains on 50M pathology images and 1B text tokens via unified masked modeling.

Released: January 2025

MUSK (Multimodal transformer with Unified maSK modeling) is a vision-language foundation model for computational pathology and precision oncology. Developed by Jinxi Xiang, Xiyue Wang, and senior author Ruijiang Li at Stanford University, with collaborators at Harvard Medical School, it was published in Nature on January 8, 2025. MUSK addresses a central bottleneck in pathology AI: most foundation models learn only from images, while the clinical text that accompanies whole-slide images (diagnostic reports, captions, and annotations) is abundant but rarely paired one-to-one with the corresponding histology.

The model's key innovation is a two-stage pretraining recipe. The first stage exploits large-scale unpaired image and text data through unified masked modeling, rather than requiring tightly aligned image-caption pairs. This lets MUSK absorb knowledge from far more pathology data than contrastive approaches such as CONCH or PLIP, which depend on curated image-text pairs. The second stage, contrastive alignment, then ties the two modalities together, yielding a model that handles both pure-vision tasks and joint vision-language reasoning.
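The alignment stage can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. This is a minimal sketch in plain Python, not the authors' implementation: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of image_embs is assumed to be paired with row i of text_embs.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings: correctly paired vs. deliberately mismatched.
aligned_loss = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled_loss = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned_loss < shuffled_loss)
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are mismatched, which is the pressure that pulls the two embedding spaces together.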

MUSK sits alongside contemporaneous pathology foundation models—UNI, Virchow, CONCH, and Prov-GigaPath—but distinguishes itself by coupling a strong visual encoder with an integrated language model and demonstrating clinical outcome prediction, not just diagnostic classification.

#Key Features

  • Unified masked modeling on unpaired data: MUSK pretrains on large pools of pathology images and text independently through masked image and masked language modeling, sidestepping the scarcity of one-to-one image-caption pairs that constrains contrastive methods.
  • Vision-language integration: Built on the BEiT-3 multiway transformer architecture, MUSK shares parameters across image and text streams, enabling cross-modal retrieval, visual question answering, and zero-shot classification from a single model.
  • Clinical outcome prediction: Beyond diagnosis, MUSK predicts melanoma relapse, pan-cancer prognosis, and immunotherapy response in lung and gastro-esophageal cancers, linking histology to patient-level decisions.
  • Broad benchmark coverage: The model is evaluated across 23 patch-level and slide-level benchmarks spanning retrieval, VQA, and classification tasks.
  • Open weights for research: Pretrained weights are released on Hugging Face under a CC-BY-NC-ND 4.0 license for non-commercial academic use.
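Zero-shot classification with a vision-language model typically works by comparing an image embedding against text embeddings of candidate class descriptions. The sketch below shows that comparison in plain Python with toy vectors standing in for real model outputs. The function names, the temperature, and the embeddings are hypothetical and are not part of the MUSK codebase.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Pick the class whose text embedding is most similar to the image.

    class_text_embs maps a label to the embedding of a text prompt for it.
    Returns the best label and a softmax distribution over all labels.
    """
    labels = list(class_text_embs)
    logits = [cosine_similarity(image_emb, class_text_embs[label]) / temperature
              for label in labels]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy 3-dimensional embeddings (illustrative stand-ins for encoder outputs).
class_text_embs = {
    "tumor": [0.9, 0.1, 0.0],
    "normal": [0.1, 0.9, 0.0],
}
image_emb = [0.8, 0.2, 0.1]

predicted, probs = zero_shot_classify(image_emb, class_text_embs)
print(predicted)
```

No labeled training images are involved: the class set is defined entirely by the text prompts, so it can be changed at inference time.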

#Technical Details

MUSK uses the BEiT-3 multiway transformer backbone. The released configuration, musk_large_patch16_384, follows the usual naming convention in which the image is split into 16×16-pixel patch tokens at a 384×384-pixel input resolution. Pretraining draws on roughly 50 million pathology image patches from 11,577 patients across 33 cancer types, together with 1 billion pathology-related text tokens. The first stage applies masked modeling to images and text separately; the second performs contrastive learning to align the two embedding spaces. On evaluation, MUSK delivered strong gains across the 23 benchmarks, and in melanoma relapse prediction it was correct roughly 83% of the time, about 12% better than the other foundation models compared in the study.
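The configuration name and the masked-modeling stage can be made concrete with a little arithmetic. The sketch below assumes the name musk_large_patch16_384 encodes a 16-pixel patch size and a 384-pixel square input, as is conventional for vision transformers, and then draws a random patch mask. The 40% mask ratio and both helper functions are illustrative assumptions, not values or code from the paper.

```python
import random

def patch_grid(image_size, patch_size):
    """Return (patches per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Boolean mask over patch positions; True marks a patch to be hidden.

    In masked image modeling the model sees the unmasked patches and is
    trained to predict the content of the masked ones.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

side, num_patches = patch_grid(image_size=384, patch_size=16)
mask = random_patch_mask(num_patches, mask_ratio=0.4)  # ratio is an assumption

print(side, num_patches, sum(mask))
```

Under these assumptions each input image becomes a 24×24 grid of 576 patch tokens, of which 230 would be hidden at a 40% mask ratio.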

#Applications

MUSK is intended for researchers and clinical investigators in computational pathology and oncology. Its zero-shot and few-shot capabilities support diagnostic image classification, cross-modal retrieval (finding relevant cases or text from an image and vice versa), and pathology visual question answering. The model's outcome-prediction results point toward decision-support tools for prognosis estimation and immunotherapy patient selection, where combining histologic features with clinical text can refine risk stratification beyond what either modality offers alone.
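Cross-modal retrieval reduces to a nearest-neighbor search in the shared embedding space, and it is commonly scored with recall@k. The following is a minimal sketch with toy vectors. The helper names and data are hypothetical, and a real pipeline would take its embeddings from the model's image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, gallery, k):
    """Indices of the k gallery items most similar to the query (cosine)."""
    q = l2_normalize(query)
    scores = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        scores.append((sum(a * b for a, b in zip(q, g)), idx))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scores[:k]]

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in top_k(q, gallery, k))
    return hits / len(queries)

# Toy data: image_queries[i] is assumed to be paired with text_gallery[i].
image_queries = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
text_gallery = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.6, 0.5]]

neighbors = top_k(image_queries[1], text_gallery, k=2)
recall1 = recall_at_k(image_queries, text_gallery, k=1)
print(neighbors, recall1)
```

Swapping the roles of the two lists gives text-to-image retrieval with the same code, since both modalities live in one embedding space.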

#Impact

By demonstrating that unified masked modeling on abundant unpaired data can rival or exceed contrastive image-text pretraining, MUSK offered a data-efficient path for multimodal pathology foundation models and drew notable attention as a Nature publication from the Stanford pathology and radiation oncology groups. Its emphasis on clinical outcomes (relapse, prognosis, and treatment response) helped shift the field's framing from diagnostic benchmarking toward precision-oncology endpoints. The released weights have been adopted as a baseline in subsequent pathology foundation model surveys and comparisons. Practical use is constrained by the non-commercial license and by the retrospective nature of the evaluations; prospective validation is needed before clinical deployment.

Citation

A Vision-Language Foundation Model for Precision Oncology

Xiang, J., et al. (2025) A Vision-Language Foundation Model for Precision Oncology. Nature.

DOI: 10.1038/s41586-024-08378-w

Citations

Total Citations: 245
Influential: 17
References: 75

GitHub

Stars: 229
Forks: 29
Open Issues: 4
Contributors: 2
Last Push: 7mo ago
Language: Python

HuggingFace

Downloads: 0
Likes: 54
Last Modified: 1y ago
Pipeline: image-to-text

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
12 · Closed
Usability (can I run it?): 12
Reproducibility (can I retrain it?): 9
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

cross_modal_retrieval, foundation_model, histology, image_classification, multimodal, oncology, outcome_prediction, self_supervised, transformer, vision_transformer, visual_question_answering, zero_shot

Resources

GitHub Repository · Research Paper · HuggingFace Model