bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 bio.rodeo. All rights reserved.
Pathology

Virchow

Paige AI

A family of self-supervised vision transformer foundation models for computational pathology, pre-trained on up to 3.1 million whole slide images and ranging from 632M to 1.9B parameters.

Released: 2023
Parameters: 632,000,000

Overview

Virchow is a family of self-supervised vision transformer foundation models for digital pathology, developed jointly by Paige AI and Microsoft Research. Named after Rudolf Virchow, the 19th-century founder of cellular pathology, the series applies large-scale self-supervised pre-training to whole slide histopathology images to produce general-purpose tile-level feature extractors. The models are designed to learn rich morphological representations from hematoxylin and eosin (H&E)-stained slides without any labeled data during pre-training.

Three major model generations span the series: the original Virchow, followed by Virchow2 and Virchow2G. Training data scales from 1.5 million to 3.1 million whole slide images sourced from Memorial Sloan Kettering Cancer Center, while model size ranges from 632 million parameters (ViT-H/14) up to 1.9 billion parameters (ViT-G). The original Virchow was posted to arXiv in September 2023 and subsequently published in Nature Medicine in 2024. Virchow2 and Virchow2G were described in a 2024 arXiv preprint, introducing mixed-magnification training and further scaling of both data and model size.

The core motivation was that no existing pathology model generalized robustly across diverse cancer types, tissue origins, and clinical tasks. By pre-training on a dataset orders of magnitude larger than those used by prior pathology models, Virchow aims to capture the full breadth of morphological variation present in clinical practice, including rare cancers that lack sufficient labeled examples for conventional supervised training.

Key Features

  • Massive pre-training corpus: Virchow was trained on 1.5 million H&E-stained whole slide images; Virchow2 and Virchow2G extended this to 3.1 million WSIs spanning diverse tissues, institutions, and staining protocols, representing one of the largest pathology pre-training datasets reported.
  • Multi-magnification training: Virchow2 introduced mixed-magnification training at 5x, 10x, 20x, and 40x simultaneously, enabling the model to learn features at multiple spatial scales within a single training run rather than being locked to a single resolution.
  • Scalable model family: The series spans 632M parameters (ViT-H/14, Virchow and Virchow2), 1.9B parameters (ViT-G, Virchow2G), and a 22M-parameter distilled variant (Virchow2G Mini), offering deployment options across compute budgets.
  • Rich patch-level embeddings: Each tile produces both a class token (1 x 1,280) and 256 patch tokens (256 x 1,280); standard downstream use concatenates the class token with mean-pooled patch tokens to yield a 2,560-dimensional embedding suitable for dense and sparse tasks alike.
  • Rare cancer generalization: Despite not being trained with cancer-type labels, Virchow achieved an AUC of 0.937 across seven rare cancer types not represented during training, demonstrating strong out-of-distribution generalization from self-supervised pre-training alone.
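The embedding recipe in the list above can be sketched with dummy tensors. This is illustrative only: numpy arrays stand in for the model's actual outputs, with shapes taken from the figures quoted in the text.

```python
import numpy as np

# Dummy stand-ins for a Virchow forward pass on one tile.
# Real outputs come from the model; shapes follow the published spec.
class_token = np.random.randn(1, 1280)     # 1 x 1,280 class token
patch_tokens = np.random.randn(256, 1280)  # 256 x 1,280 patch tokens

# Standard downstream recipe: concatenate the class token with the
# mean-pooled patch tokens to get a single 2,560-dim tile embedding.
tile_embedding = np.concatenate([class_token[0], patch_tokens.mean(axis=0)])
print(tile_embedding.shape)  # (2560,)
```

The concatenated vector is what feeds downstream classifiers; the raw patch tokens remain available for dense tasks such as segmentation.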

Technical Details

All Virchow models use a Vision Transformer (ViT) backbone pre-trained with a DINOv2-derived self-supervised objective. The original Virchow and Virchow2 share the ViT-H/14 architecture with 32 transformer blocks, an embedding dimension of 1,280, 16 attention heads, and SwiGLU activations, totaling 632 million parameters. Virchow2G uses a larger ViT-G backbone with approximately 1.9 billion parameters. A knowledge-distilled variant, Virchow2G Mini, compresses this to 22 million parameters for efficient inference.

Virchow2 introduced two refinements over the original: four register tokens to suppress artifact features in attention maps, and a modified DINOv2 training objective that replaces the KoLeo regularizer with a kernel density estimator. Input resolution is 224 x 224 pixels at 0.5 microns per pixel (equivalent to 20x magnification) for Virchow; Virchow2 extends this with mixed-magnification sampling at 2.0, 1.0, 0.5, and 0.25 microns per pixel. In benchmark evaluations, Virchow achieved an AUC of 0.949 across 17 cancer types in pan-cancer detection (Nature Medicine 2024), while Virchow2 reached state-of-the-art results on 12 tile-level pathology benchmark tasks. Models are loaded via the timm library, are compatible with PyTorch 2.0+, and support fp16 mixed-precision inference.
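The mixed-magnification settings above can be made concrete with a little arithmetic. Using the convention stated in the text (0.5 microns per pixel corresponds to 20x), the helper below — an illustrative sketch, not part of any Virchow API — relates each sampling resolution to its approximate magnification and the physical tissue area a 224-pixel tile covers.

```python
# Tile field of view at each microns-per-pixel (mpp) setting used by
# Virchow2. Convention from the text: 0.5 mpp corresponds to 20x.
TILE_PX = 224

def field_of_view_um(mpp: float, tile_px: int = TILE_PX) -> float:
    """Physical width in microns covered by one square tile."""
    return mpp * tile_px

def approx_magnification(mpp: float) -> float:
    """Approximate objective magnification under the 0.5 mpp == 20x convention."""
    return 20 * 0.5 / mpp

for mpp in (2.0, 1.0, 0.5, 0.25):
    print(f"{mpp} mpp ~ {approx_magnification(mpp):g}x, "
          f"tile covers {field_of_view_um(mpp):g} um")
```

The four resolutions recover the 5x, 10x, 20x, and 40x levels named in the Key Features list, with each halving of mpp doubling the magnification while halving the tissue area seen per tile.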

Applications

Virchow models function as general-purpose tile-level feature extractors that can be integrated into a wide range of computational pathology workflows. Pathology AI teams use the frozen embeddings as input to lightweight classifiers for cancer detection, subtype classification, and histological grading — tasks where labeled data is scarce but unlabeled slides are abundant. Biomarker prediction pipelines use H&E morphology alone to infer molecular markers such as microsatellite instability or mutation status, potentially reducing the need for expensive molecular assays. For rare cancer identification, the model's demonstrated out-of-distribution generalization makes it particularly valuable in settings where supervised training is impractical. Patch token outputs support pixel-level and region-level tasks such as tissue segmentation. Research groups integrating multi-modal data combine Virchow embeddings with genomic or clinical covariates in prognostic and treatment-response models.
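The frozen-embedding workflow described above can be sketched with synthetic data. The example below uses random vectors in place of real Virchow embeddings and a nearest-centroid rule as the "lightweight classifier" (a logistic-regression linear probe is the more typical choice in practice); the class names and data are entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen 2,560-dim tile embeddings from two
# hypothetical classes (e.g. tumor vs. benign tiles). Real workflows
# would extract these from H&E tiles with the frozen model.
dim, n = 2560, 200
tumor = rng.normal(loc=0.05, size=(n, dim))
benign = rng.normal(loc=-0.05, size=(n, dim))

# A minimal "lightweight classifier": nearest class centroid in
# embedding space.
centroids = np.stack([tumor.mean(axis=0), benign.mean(axis=0)])

def predict(x: np.ndarray) -> int:
    """Return 0 for tumor-like, 1 for benign-like embeddings."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

acc = np.mean([predict(e) == 0 for e in tumor] +
              [predict(e) == 1 for e in benign])
print(f"train accuracy: {acc:.2f}")
```

The point of the pattern is that only the small classifier head is trained; the foundation model stays frozen, which is why it works even when labeled tiles are scarce.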

Impact

Virchow's publication in Nature Medicine in 2024 established it as one of the most rigorously validated pathology foundation models, demonstrating that self-supervised pre-training on millions of clinical slides can produce embeddings competitive with or superior to supervised models on a broad range of tasks. The finding that mixed-magnification training and data diversity matter more than parameter count alone (from the Virchow2 paper) is an important empirical result for the field of digital pathology. Model weights for Virchow and Virchow2 are publicly available on HuggingFace under Apache 2.0 and CC-BY-NC-ND-4.0 licenses respectively, though access requires institutional registration and approval. Notable limitations include the H&E-centric training corpus (performance on immunohistochemistry or other staining protocols is uncharacterized), the proprietary and non-reproducible training data from Memorial Sloan Kettering Cancer Center, and the tile-level output that requires a separate aggregation strategy for slide-level inference. Virchow and Virchow2 are research tools and have not received regulatory approval for clinical diagnostic use.
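The aggregation limitation noted above can be illustrated with the simplest possible strategy: mean-pooling tile embeddings into one slide-level vector. This is a sketch on synthetic tensors, not the method used in any Virchow paper; attention-based multiple-instance learning (MIL) pooling is a common, stronger alternative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic tile embeddings for one slide: n_tiles x 2,560 (the tile
# embedding size quoted above). Real tiles come from the frozen model.
tile_embeddings = rng.normal(size=(500, 2560))

# Simplest slide-level aggregation: mean-pool the tiles into a single
# vector, which then feeds a slide-level classifier head.
slide_embedding = tile_embeddings.mean(axis=0)
print(slide_embedding.shape)  # (2560,)
```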

Citations

A foundation model for clinical-grade computational pathology and rare cancers detection

Vorontsov, E., et al. (2024) A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine.

DOI: 10.1038/s41591-024-03141-0

Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology

Preprint

Zimmermann, E., et al. (2024) Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology. arXiv.

DOI: 10.48550/arXiv.2408.00738

Metrics

Citations

Total Citations: 168
Influential: 33
References: 64

HuggingFace

Downloads: 21.4K
Likes: 72
Last Modified: 1y ago
Pipeline: image-feature-extraction

Tags

vision transformer · foundation model · self-supervised · digital pathology · histology · whole-slide imaging

Resources

Research Paper · Research Paper · Official Website · HuggingFace Model · HuggingFace Model