Vision Transformer for cell instance segmentation and classification in H&E digital pathology, extended by CellViT++ with foundation model backbones and few-shot adaptation.
CellViT is a Vision Transformer framework developed at the Institute for AI in Medicine (IKIM), University Hospital Essen, for cell instance segmentation and classification in Hematoxylin and Eosin (H&E) stained whole-slide images. Rather than treating cell detection and segmentation as separate tasks, CellViT performs simultaneous detection, boundary delineation, and cell-type classification of individual nuclei in a single forward pass over gigapixel pathology scans. The model was published in Medical Image Analysis in 2024, formalizing results from work presented at MICCAI 2023.
The core architectural choice — replacing convolutional encoders with a Vision Transformer backbone initialized from Segment Anything Model (SAM) weights — allows CellViT to capture long-range spatial context across tissue regions. This is particularly valuable in pathology, where the spatial arrangement of cells carries as much diagnostic information as individual cell morphology. The HoVer-Net-inspired output structure, combined with the expressive ViT encoder, enables the model to separate touching or overlapping nuclei more reliably than sliding-window convolutional approaches.
CellViT++ (January 2025) extends the framework by making the encoder interchangeable. Any ViT-based foundation model, including pathology-specific encoders such as UNI and Hibou-L as well as the general-purpose SAM, can be substituted as the backbone, and a lightweight few-shot adaptation module lets users define novel cell types with as few as five to ten annotated examples per class, without retraining the full model.
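The core idea behind the few-shot module can be illustrated with a minimal sketch: keep the foundation encoder frozen, extract a feature vector per detected cell, and fit a small classifier head on the handful of labeled examples. This is a conceptual approximation, not the CellViT++ implementation; the embedding dimension and training loop below are illustrative assumptions.

```python
# Conceptual sketch of few-shot cell-type adaptation on frozen features.
# The embeddings here stand in for per-cell tokens from a frozen ViT encoder.
import torch
import torch.nn as nn

def train_fewshot_head(cell_embeddings: torch.Tensor,
                       labels: torch.Tensor,
                       num_classes: int,
                       epochs: int = 100) -> nn.Module:
    """Fit a lightweight linear head on a few labeled cell embeddings."""
    head = nn.Linear(cell_embeddings.shape[1], num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(cell_embeddings), labels)
        loss.backward()
        opt.step()
    return head

# e.g. 10 annotated examples for each of 3 novel cell types, 1024-dim tokens
embeddings = torch.randn(30, 1024)  # stand-in for frozen encoder features
labels = torch.arange(3).repeat_interleave(10)
head = train_fewshot_head(embeddings, labels, num_classes=3)
```

Because only the small head is optimized, adaptation is cheap and the frozen backbone's representations are reused across tasks, which is what makes the five-to-ten-example regime workable.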
CellViT builds on the encoder-decoder paradigm introduced by HoVer-Net for nuclear instance segmentation, replacing the convolutional backbone with a Vision Transformer. Three parallel output branches are decoded from the ViT embeddings: a nuclear pixel branch producing a binary foreground map, a HoVer branch predicting horizontal and vertical distance maps from each nuclear centroid (used to separate touching cells at post-processing), and a cell-type branch generating per-pixel class probabilities. The original CellViT offers two encoder initializations: ViT-256, pretrained in-domain on histology data (HIPT), and SAM-pretrained encoders, of which ViT-H is the largest variant; models process 256x256-pixel patches at 40x magnification.
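The three-branch output structure can be sketched in a few lines of PyTorch. The layer shapes and names below are illustrative assumptions, not the actual CellViT decoder, but they show how one shared feature map yields the three prediction maps.

```python
# Minimal PyTorch sketch of CellViT's three parallel output branches.
import torch
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    """Decodes shared feature maps into the three CellViT-style outputs."""

    def __init__(self, in_channels: int, num_cell_types: int):
        super().__init__()
        # Nuclear pixel (NP) branch: binary foreground map.
        self.np_branch = nn.Conv2d(in_channels, 2, kernel_size=1)
        # HoVer (HV) branch: horizontal + vertical distances to centroid.
        self.hv_branch = nn.Conv2d(in_channels, 2, kernel_size=1)
        # Cell-type branch: per-pixel class probabilities.
        self.nt_branch = nn.Conv2d(in_channels, num_cell_types, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        return {
            "nuclei_binary_map": self.np_branch(feats),  # (B, 2, H, W)
            "hv_map": self.hv_branch(feats),             # (B, 2, H, W)
            "nuclei_type_map": self.nt_branch(feats),    # (B, C, H, W)
        }

# Example: decoded features for one 256x256 patch, 5 PanNuke classes + background.
feats = torch.randn(1, 64, 256, 256)
out = ThreeBranchHead(in_channels=64, num_cell_types=6)(feats)
print({k: tuple(v.shape) for k, v in out.items()})
```

At post-processing, gradients of the HV maps are sharp at instance boundaries, which is what allows a watershed-style step to split touching nuclei that the binary map alone would merge.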
Primary training used the PanNuke dataset, a pan-cancer H&E benchmark covering 19 tissue types and approximately 190,000 labeled nuclei across 7,904 image patches. Additional evaluation was conducted on MoNuSeg. On PanNuke, CellViT-SAM-H achieves panoptic quality (PQ) scores exceeding HoVer-Net on the majority of cell classes while running faster due to the non-overlapping tile processing enabled by ViT's fixed-patch tokenization. CellViT++ models using Hibou-L as the encoder improve further on segmentation accuracy relative to general-purpose ViT initialization, validating the benefit of domain-adapted pretraining.
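For readers unfamiliar with the metric, panoptic quality combines detection and segmentation quality: instances match when their IoU exceeds 0.5 (which guarantees one-to-one matching), and PQ = (sum of matched IoUs) / (TP + 0.5*FP + 0.5*FN). A minimal NumPy sketch of this standard definition, independent of any CellViT code:

```python
# Panoptic quality (PQ) over two instance-label maps (0 = background).
import numpy as np

def panoptic_quality(gt: np.ndarray, pred: np.ndarray) -> float:
    gt_ids = [i for i in np.unique(gt) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]
    matched_ious, matched_pred = [], set()
    for g in gt_ids:
        g_mask = gt == g
        for p in pred_ids:
            if p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union if union else 0.0
            if iou > 0.5:  # IoU > 0.5 implies the match is unique
                matched_ious.append(iou)
                matched_pred.add(p)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp
    fn = len(gt_ids) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0
```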
CellViT is designed for quantitative digital pathology workflows that depend on accurate cell-level analysis. Tumor microenvironment profiling is a primary use case: the model can quantify the density and spatial distribution of immune, tumor, and stromal cells within biopsy sections to support biomarker studies. Clinical research teams use CellViT to automate tumor-infiltrating lymphocyte (TIL) scoring, a prognostic marker in breast cancer and other tumor types. Pathologists and computational biologists can use cell composition features extracted by CellViT as inputs to downstream survival prediction or treatment response models. The few-shot adaptation module in CellViT++ makes the framework practical for research groups working with specialized or newly defined cell phenotypes that fall outside the standard five-class PanNuke taxonomy.
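As a sketch of how per-cell predictions feed such downstream models, the snippet below turns a list of classified detections into per-class counts, densities, and fractions. The detection schema (dicts with "type" and "centroid" keys) is an assumption for illustration, not CellViT's exact export format.

```python
# Hedged sketch: cell composition features from per-cell predictions.
from collections import Counter

def composition_features(detections: list, area_mm2: float) -> dict:
    """Per-class counts, densities (cells/mm^2), and fractions."""
    counts = Counter(d["type"] for d in detections)
    total = sum(counts.values())
    feats = {}
    for cell_type, n in counts.items():
        feats[f"{cell_type}_count"] = n
        feats[f"{cell_type}_density"] = n / area_mm2
        feats[f"{cell_type}_fraction"] = n / total
    return feats

detections = [
    {"type": "inflammatory", "centroid": (12.0, 40.5)},
    {"type": "neoplastic", "centroid": (33.1, 18.2)},
    {"type": "neoplastic", "centroid": (70.4, 55.9)},
]
# One 256x256 tile at 0.25 um/px covers 64 um x 64 um = 0.004096 mm^2.
print(composition_features(detections, area_mm2=0.004096))
```

Aggregating such features over all tiles of a slide yields the cell-level covariates used in TIL scoring or survival modeling.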
CellViT addresses a longstanding bottleneck in computational pathology: the difficulty of scaling accurate, cell-level annotations to whole-slide images without prohibitive manual effort. By demonstrating that SAM-initialized ViT encoders outperform or match specialized convolutional architectures on nuclear segmentation benchmarks, the work contributed to a broader shift in the field toward transformer-based pathology models. The CellViT++ extension reflects the emerging ecosystem of pathology foundation models — UNI, Hibou, CONCH, and others — and provides a practical mechanism for composing these representations with specialized downstream heads. Notable limitations include calibration to H&E staining only (IHC and immunofluorescence are not supported without retraining), sensitivity to tile resolution (models are calibrated to 20x or 40x magnification equivalents), and GPU memory requirements for ViT-H variants during whole-slide inference. Stain normalization is recommended when applying the model across institutions with different staining protocols.
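One common way to perform the recommended stain normalization is Macenko normalization via the third-party staintools library; this is an external preprocessing step, not part of CellViT itself, and the file paths below are placeholders.

```python
# Macenko stain normalization with staintools (pip install staintools),
# applied before inference to reduce cross-institution staining variation.
import staintools

# Placeholder paths: a reference tile and a tile to be normalized.
target = staintools.read_image("reference_tile.png")
source = staintools.read_image("query_tile.png")

# Standardize brightness first, as staintools recommends.
target = staintools.LuminosityStandardizer.standardize(target)
source = staintools.LuminosityStandardizer.standardize(source)

# Fit the normalizer to the reference stain and transform the query tile.
normalizer = staintools.StainNormalizer(method="macenko")
normalizer.fit(target)
normalized = normalizer.transform(source)
```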
Hörst, F., et al. (2025). CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models. arXiv preprint arXiv:2501.05269. DOI: 10.48550/arXiv.2501.05269.