Imaging

CellViT

Institute for AI in Medicine

Vision Transformer for cell instance segmentation and classification in H&E digital pathology, extended by CellViT++ with foundation model backbones and few-shot adaptation.

Released: 2023

Overview

CellViT is a Vision Transformer framework developed at the Institute for AI in Medicine (IKIM), University Hospital Essen, for cell instance segmentation and classification in Hematoxylin and Eosin (H&E) stained whole-slide images. Rather than treating cell detection and segmentation as separate tasks, CellViT performs simultaneous detection, boundary delineation, and cell-type classification of individual nuclei in a single forward pass, with an inference pipeline that scales this analysis to gigapixel pathology scans. The model was published in Medical Image Analysis in 2024, formalizing results from work presented at MICCAI 2023.

The core architectural choice — replacing convolutional encoders with a Vision Transformer backbone initialized from Segment Anything Model (SAM) weights — allows CellViT to capture long-range spatial context across tissue regions. This is particularly valuable in pathology, where the spatial arrangement of cells carries as much diagnostic information as individual cell morphology. The HoVer-Net-inspired output structure, combined with the expressive ViT encoder, enables the model to separate touching or overlapping nuclei more reliably than sliding-window convolutional approaches.

CellViT++ (January 2025) extends the framework by making the encoder interchangeable. Any ViT-based pathology foundation model — including UNI, SAM, and Hibou-L — can be substituted as the backbone, and a lightweight few-shot adaptation module allows users to define novel cell types with as few as five to ten annotated examples per class, without retraining the full model.
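
The adaptation step can be pictured as a linear probe on frozen features. The sketch below is illustrative only: the feature dimension, tensor names, and toy data are placeholder assumptions, not the CellViT++ API.

```python
# Minimal sketch of few-shot adaptation as a linear probe on frozen
# encoder features. FEAT_DIM, NUM_NEW_TYPES, and the toy data are
# illustrative placeholders, not the CellViT++ API.
import torch
import torch.nn as nn

FEAT_DIM = 768          # assumed patch-token dimension of the frozen ViT
NUM_NEW_TYPES = 3       # user-defined cell phenotypes

# Only the lightweight head is trained; gradients never reach the
# pretrained encoder, so its representations cannot be forgotten.
classifier = nn.Linear(FEAT_DIM, NUM_NEW_TYPES)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# feats: one pooled token per annotated nucleus (5-10 examples per class);
# labels: the user-defined phenotype index for each nucleus.
feats = torch.randn(30, FEAT_DIM)          # stand-in for encoder output
labels = torch.randint(0, NUM_NEW_TYPES, (30,))

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(classifier(feats), labels)
    loss.backward()
    optimizer.step()
```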

Key Features

  • Cell instance segmentation: Detects and delineates individual cell boundaries in H&E whole-slide images, producing instance masks rather than pixel-wise semantic labels alone, enabling cell-level spatial analysis.
  • Multi-class cell classification: Assigns cell-type labels (neoplastic, inflammatory, connective, epithelial, dead) simultaneously with segmentation in a single forward pass, avoiding the overhead of sequential detection and classification pipelines.
  • Swappable foundation model encoders (CellViT++): Treats the ViT encoder as a plug-in slot, allowing pathology foundation models such as UNI or Hibou-L to serve as the backbone and transfer richer domain-specific representations to the segmentation task.
  • Few-shot adaptation (CellViT++): A lightweight linear classifier trained on frozen encoder patch token features supports extension to novel, user-defined cell phenotypes without catastrophic forgetting of pretrained representations.
  • Whole-slide inference pipeline: Includes tooling for tile extraction, stitching, and post-processing to run inference at scale on gigapixel WSIs without manual patch management (a minimal tiling sketch follows this list).
  • Open weights: Multiple model variants — including ViT-256, ViT-H, and a Hibou-L backbone checkpoint — are publicly available on GitHub and HuggingFace.
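
To make the whole-slide pipeline concrete, here is a minimal tiling sketch using OpenSlide. The `run_cellvit` call is a hypothetical stand-in for per-tile inference, not a function from the repository; the tile size and pyramid level are likewise illustrative.

```python
# Sketch of non-overlapping whole-slide tiling with OpenSlide.
# `run_cellvit` is a hypothetical stand-in for per-tile inference.
import numpy as np
import openslide

TILE = 1024  # illustrative tile edge length at level 0

slide = openslide.OpenSlide("slide.svs")
width, height = slide.dimensions
detections = []

for y in range(0, height, TILE):
    for x in range(0, width, TILE):
        # read_region returns an RGBA PIL image; drop alpha for inference
        tile = np.array(slide.read_region((x, y), 0, (TILE, TILE)))[..., :3]
        for cell in run_cellvit(tile):          # hypothetical API
            # shift tile-local centroids into global slide coordinates
            cell["centroid"] = (cell["centroid"][0] + x,
                                cell["centroid"][1] + y)
            detections.append(cell)
```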

Technical Details

CellViT builds on the encoder-decoder paradigm introduced by HoVer-Net for nuclear instance segmentation, replacing the convolutional backbone with a Vision Transformer. Three parallel output branches are decoded from the ViT embeddings: a nuclear pixel branch producing a binary foreground map, a HoVer branch predicting horizontal and vertical distance maps from each nuclear centroid (used to separate touching cells at post-processing), and a cell-type branch generating per-pixel class probabilities. The original CellViT ships two encoder variants, both processing 256x256 patches at 20x magnification: CellViT-256, whose ViT-256 encoder was pretrained in-domain on histology images, and the larger CellViT-SAM-H, whose ViT-H encoder is initialized from SAM pretraining.
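
A toy sketch of the three parallel branches, assuming decoded feature maps are already at output resolution; the real decoders are U-Net-style upsampling paths with skip connections, condensed here to single 1x1 convolutions, and the channel counts are illustrative assumptions.

```python
# Toy three-branch output head in the spirit of the HoVer-Net layout.
# Real CellViT decoders are multi-stage upsampling paths; this condenses
# each branch to a single 1x1 convolution for brevity.
import torch
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    def __init__(self, in_ch: int, num_types: int = 6):  # illustrative counts
        super().__init__()
        self.np_branch = nn.Conv2d(in_ch, 2, 1)          # nuclear pixel: fg/bg
        self.hv_branch = nn.Conv2d(in_ch, 2, 1)          # horizontal/vertical maps
        self.tp_branch = nn.Conv2d(in_ch, num_types, 1)  # per-pixel cell type

    def forward(self, feats: torch.Tensor):
        return {
            "np": self.np_branch(feats),   # binary foreground logits
            "hv": self.hv_branch(feats),   # distances from nuclear centroid
            "tp": self.tp_branch(feats),   # cell-type logits
        }

head = ThreeBranchHead(in_ch=256)
out = head(torch.randn(1, 256, 64, 64))   # stand-in for decoded ViT features
print({k: tuple(v.shape) for k, v in out.items()})
```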

Primary training used the PanNuke dataset, a pan-cancer H&E benchmark covering 19 tissue types and approximately 190,000 labeled nuclei across 7,904 image patches. Additional evaluation was conducted on MoNuSeg. On PanNuke, CellViT-SAM-H achieves panoptic quality (PQ) scores exceeding HoVer-Net on the majority of cell classes while running faster due to the non-overlapping tile processing enabled by ViT's fixed-patch tokenization. CellViT++ models using Hibou-L as the encoder improve further on segmentation accuracy relative to general-purpose ViT initialization, validating the benefit of domain-adapted pretraining.
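
For reference, panoptic quality combines detection and segmentation quality in a single score (Kirillov et al., 2019):

$$
\mathrm{PQ} = \frac{\sum_{(p,\,g) \in \mathit{TP}} \mathrm{IoU}(p, g)}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
$$

where a predicted nucleus p and ground-truth nucleus g count as a true positive when IoU(p, g) > 0.5; the numerator rewards tight masks while the denominator penalizes missed and spurious detections.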

Applications

CellViT is designed for quantitative digital pathology workflows that depend on accurate cell-level analysis. Tumor microenvironment profiling is a primary use case: the model can quantify the density and spatial distribution of immune, tumor, and stromal cells within biopsy sections to support biomarker studies. Clinical research teams use CellViT to automate tumor-infiltrating lymphocyte (TIL) scoring, a prognostic marker in breast cancer and other tumor types. Pathologists and computational biologists can use cell composition features extracted by CellViT as inputs to downstream survival prediction or treatment response models. The few-shot adaptation module in CellViT++ makes the framework practical for research groups working with specialized or newly defined cell phenotypes that fall outside the standard five-class PanNuke taxonomy.
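
As an illustration of this feature-extraction step, the hypothetical helper below aggregates per-cell predictions into slide-level composition features; the `detections` records mirror the tiling sketch above and are an assumed structure, not a CellViT output format.

```python
# Hypothetical post-processing: turn per-cell predictions into
# slide-level composition features for downstream models.
from collections import Counter

def composition_features(detections, area_mm2):
    """Cell-type fractions and densities from instance predictions."""
    counts = Counter(cell["type"] for cell in detections)
    total = sum(counts.values()) or 1
    feats = {f"frac_{t}": n / total for t, n in counts.items()}
    feats.update({f"density_{t}": n / area_mm2 for t, n in counts.items()})
    return feats

# e.g. a TIL-style readout: fraction of inflammatory cells among all nuclei
feats = composition_features(
    [{"type": "inflammatory"}, {"type": "neoplastic"}, {"type": "neoplastic"}],
    area_mm2=2.5,
)
print(feats["frac_inflammatory"])  # 0.333...
```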

Impact

CellViT addresses a longstanding bottleneck in computational pathology: the difficulty of scaling accurate, cell-level annotations to whole-slide images without prohibitive manual effort. By demonstrating that SAM-initialized ViT encoders outperform or match specialized convolutional architectures on nuclear segmentation benchmarks, the work contributed to a broader shift in the field toward transformer-based pathology models. The CellViT++ extension reflects the emerging ecosystem of pathology foundation models — UNI, Hibou, CONCH, and others — and provides a practical mechanism for composing these representations with specialized downstream heads. Notable limitations include support for H&E staining only (IHC and immunofluorescence require retraining), sensitivity to tile resolution (models are calibrated to 20x or 40x magnification equivalents), and the GPU memory demands of ViT-H variants during whole-slide inference. Stain normalization is recommended when applying the model across institutions with different staining protocols.
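
One common choice for that normalization step is the Macenko method; the sketch below uses the third-party staintools package (not part of CellViT), with calls following its documented workflow as best understood here.

```python
# Macenko stain normalization via the third-party staintools package
# (assumption: staintools is not bundled with CellViT, and the API
# below follows its documentation as best understood here).
import staintools

# A representative H&E tile from the staining protocol the model expects
target = staintools.read_image("reference_tile.png")
target = staintools.LuminosityStandardizer.standardize(target)

normalizer = staintools.StainNormalizer(method="macenko")
normalizer.fit(target)

# Map a tile from another institution onto the reference stain profile
query = staintools.read_image("query_tile.png")
query = staintools.LuminosityStandardizer.standardize(query)
normalized = normalizer.transform(query)
```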

Citations

CellViT: Vision Transformers for Precise Cell Segmentation and Classification

Journal Article

Hörst, F., et al. (2024) CellViT: Vision Transformers for Precise Cell Segmentation and Classification. Medical Image Analysis.

DOI: 10.1016/j.media.2024.103143

CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models

Preprint

Hörst, F., et al. (2025) CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models. arXiv.

DOI: 10.48550/arXiv.2501.05269

Metrics

GitHub

  • Stars: 373
  • Forks: 71
  • Open Issues: 21
  • Contributors: 1
  • Last Push: 9 months ago
  • Language: Python

Citations

  • Total Citations: 26
  • Influential: 3
  • References: 0

HuggingFace

  • Downloads: 0
  • Likes: 11
  • Last Modified: 1 year ago

Tags

segmentation · transformer · vision model · foundation model · histology

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model