bio.rodeo
Imaging

CryoViT

Stanford University

Semi-supervised cryo-ET segmentation framework that adapts DINOv2 vision transformers for 3D organelle annotation using sparse 2D slice labels.

Released: 2024

Overview

CryoViT is a segmentation framework for cryogenic electron tomography (cryo-ET) developed by the Chiu Lab and the Yeung-Levy Lab at Stanford University. It addresses a persistent bottleneck in structural cell biology: the labor-intensive manual annotation required to delineate large, morphologically complex organelles, such as mitochondria, within 3D tomographic volumes. Cryo-ET captures subcellular architecture at nanometer resolution in near-native states, but its quantitative utility has been limited by the scarcity of automated segmentation tools that can handle the scale and variability of pleomorphic structures.

CryoViT reframes organelle segmentation by replacing conventional convolutional U-Net architectures with DINOv2, a self-supervised vision transformer pretrained by Meta AI on large-scale natural image datasets. The central insight is that DINOv2's dense patch-level feature representations, despite being learned on photographs, transfer meaningfully to the noisy, low-contrast domain of cryo-ET slices. A lightweight segmentation head is trained on top of these features using sparse 2D annotations drawn from individual cross-sectional slices, and predictions are assembled into coherent 3D volumetric masks across the entire tomogram.

The model was validated on a neuronal cryo-ET dataset derived from induced pluripotent stem cell (iPSC) neurons from Huntington disease (HD) patients and from cultured HD mouse model neurons, demonstrating reliable mitochondrial segmentation across complex, heterogeneous tomographic volumes even under limited-annotation conditions.

Key Features

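  • Vision Foundation Model Backbone: Uses DINOv2, a self-supervised ViT family whose largest variant has about 1.1 billion parameters, as a frozen or partially fine-tuned feature extractor. The cached-feature pipeline described under Technical Details corresponds to the frozen case. The backbone provides rich semantic representations that transfer from natural images to cryo-ET slices without task-specific pretraining.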
  • Sparse 2D Annotation Strategy: Requires only a small number of labeled 2D cross-sections per tomogram rather than full volumetric ground-truth, substantially reducing the expert annotation burden that is the primary bottleneck in cryo-ET analysis.
  • Coherent 3D Volumetric Output: Despite training on 2D slices, the model assembles predictions into continuous 3D segmentation masks over entire tomographic volumes, enabling downstream morphometric analysis.
  • Low-Data Regime Performance: Outperforms U-Net-based baselines, particularly when labeled training examples are scarce, a critical advantage given the cost of expert cryo-ET annotation.
  • Targets Large Pleomorphic Structures: Specifically designed for organelles such as mitochondria that occupy large, shape-variable regions of the tomographic field of view. Such structures resist the template-matching and particle-picking approaches developed for smaller molecular complexes.
  • Disease Biology Application: Demonstrated on HD patient-derived iPSC neurons and mouse model neurons, establishing a path toward quantitative structural phenotyping in neurodegeneration research.
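
The sparse 2D annotation strategy listed above implies that the training loss is evaluated only where an expert label exists. The following is a minimal NumPy sketch of that idea, not the package's actual loss: the `UNLABELED = -1` convention, the function name `sparse_bce`, and the toy volume are all assumptions made for illustration.

```python
import numpy as np

UNLABELED = -1  # hypothetical marker for voxels that have no expert label


def sparse_bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy computed over labeled voxels only."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    labels = np.asarray(labels)
    mask = labels != UNLABELED  # True wherever an annotation exists
    if not mask.any():
        raise ValueError("no labeled voxels in this volume")
    y = labels[mask].astype(np.float64)
    p = probs[mask]
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


# Toy volume: 8 slices of 4x4 pixels, with only slices 2 and 5 annotated.
labels = np.full((8, 4, 4), UNLABELED)
labels[2] = 0  # an all-background slice
labels[5] = 1  # an all-foreground slice

probs = np.full(labels.shape, 0.5)  # predictions on unlabeled slices are ignored
probs[2] = 0.1
probs[5] = 0.9

loss = sparse_bce(probs, labels)
```

Because unlabeled slices are masked out, changing the predictions on them leaves the loss unchanged, which is what allows a few annotated cross-sections to supervise a whole tomogram.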

Technical Details

CryoViT operates as a two-stage pipeline. In the first stage, individual 2D slices extracted from HDF5-formatted tomograms are passed through DINOv2 to generate dense patch-level feature maps. These features are precomputed and cached, avoiding repeated inference through the large backbone during training. In the second stage, a lightweight segmentation head is trained on the cached DINOv2 features using sparse 2D slice annotations. The resulting 2D predictions are then stacked to reconstruct a full 3D segmentation volume.
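
The two stages can be sketched end to end on synthetic data. This is an illustrative toy under stated assumptions, not the CryoViT implementation: `patch_features` is a stand-in for the frozen backbone (per-patch mean and standard deviation instead of DINOv2 embeddings), the head is a plain logistic regression, and all names and sizes are hypothetical.

```python
import numpy as np

PATCH = 4  # hypothetical patch size for the toy; DINOv2 uses 14-pixel patches


def patch_features(slice_2d):
    """Stand-in for the frozen backbone: per-patch mean and std.

    Assumes both image dimensions are divisible by PATCH.
    """
    h, w = slice_2d.shape
    patches = slice_2d.reshape(h // PATCH, PATCH, w // PATCH, PATCH)
    patches = patches.transpose(0, 2, 1, 3).reshape(h // PATCH, w // PATCH, -1)
    return np.stack([patches.mean(-1), patches.std(-1)], axis=-1)


def train_head(feats, targets, steps=2000, lr=0.5):
    """Fit a logistic-regression head on cached features by gradient descent."""
    x = np.concatenate([feats, np.ones((len(feats), 1))], axis=1)  # add bias
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - targets) / len(targets)
    return w


def predict_volume(cache, w):
    """Apply the head to every cached slice and stack the results in 3D."""
    z, gh, gw, c = cache.shape
    x = np.concatenate([cache.reshape(-1, c), np.ones((z * gh * gw, 1))], axis=1)
    return ((x @ w) > 0).reshape(z, gh, gw)


# Synthetic tomogram: noisy background with one bright block-shaped "organelle".
rng = np.random.default_rng(0)
volume = rng.normal(0.0, 0.1, size=(6, 16, 16))
volume[:, 4:12, 4:12] += 1.0
truth = np.zeros((6, 4, 4), dtype=bool)  # ground truth on the patch grid
truth[:, 1:3, 1:3] = True

# Stage 1: compute features once per slice and cache them.
cache = np.stack([patch_features(s) for s in volume])  # shape (6, 4, 4, 2)

# Stage 2: train the head on two annotated slices only.
labeled = [1, 4]
w = train_head(cache[labeled].reshape(-1, 2),
               truth[labeled].reshape(-1).astype(float))

# Predict every slice, stack, and upsample the patch grid to pixel resolution.
mask = predict_volume(cache, w)
mask_px = mask.repeat(PATCH, axis=1).repeat(PATCH, axis=2)  # shape (6, 16, 16)
```

Caching matters because the backbone is by far the most expensive component: once `cache` exists, the head can be retrained many times without running the backbone again.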

The training and evaluation pipeline is managed through Hydra (YAML configuration files in src/cryovit/configs), exposing train_model and eval_model entry points. Tomographic data are stored in HDF5 format with raw data under a data key and ground-truth labels under labels/<label_name> keys, supporting efficient random-access loading of individual slices. The primary validation dataset consists of cryo-ET volumes of iPSC-derived neurons from HD patients and cultured HD mouse model neurons, with mitochondria as the annotation target. Quantitative benchmarks comparing CryoViT to U-Net baselines show improved segmentation quality, with the largest gains in low-data regimes where convolutional approaches fail to generalize from scarce training examples.
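
The HDF5 layout described above can be exercised with `h5py`. The sketch below writes a toy file and reads one slice back. The file name, the array sizes, and the label name `mito` are assumptions for illustration; only the `data` key and the `labels/<label_name>` pattern come from the description above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_tomogram.hdf")

# Write a toy tomogram that follows the described key layout.
rng = np.random.default_rng(0)
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=rng.normal(size=(8, 32, 32)).astype(np.float32))
    labels = np.zeros((8, 32, 32), dtype=np.int8)
    labels[3, 8:24, 8:24] = 1  # one annotated slice with a square "organelle"
    f.create_dataset("labels/mito", data=labels)  # "mito" is an assumed label name

# Read a single slice; h5py fetches only the requested region from disk.
with h5py.File(path, "r") as f:
    z = 3
    raw_slice = f["data"][z]           # NumPy array of shape (32, 32)
    mito_slice = f["labels/mito"][z]   # matching label slice
    volume_shape = f["data"].shape
```

Indexing the dataset object, rather than loading it with `f["data"][:]`, is what gives slice-level random access without reading the full volume into memory.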

Applications

CryoViT is intended for structural cell biologists and cryo-ET practitioners who need automated segmentation of large organelles without exhaustive volumetric labeling. Primary use cases include quantifying mitochondrial morphology changes in neurodegenerative disease models (Huntington disease, Parkinson disease), where mitochondrial dysfunction is a central pathological feature; mapping 3D organelle distributions and shape statistics across experimental conditions or patient-derived cell lines; and enabling large-scale structural phenotyping pipelines that would be impractical with manual annotation. The DINOv2-based architecture is extensible: with appropriate labeled data, the segmentation head can in principle be retrained for other large pleomorphic structures beyond mitochondria, such as endoplasmic reticulum, lipid droplets, and other organelle classes.
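
As an illustration of the shape statistics mentioned above, the sketch below derives a few simple morphometrics from a binary 3D mask. It is a hedged example rather than part of CryoViT: the function name, the chosen statistics, and the 2 nm voxel size are assumptions, and the whole mask is treated as a single object.

```python
import numpy as np


def mask_morphometrics(mask, voxel_nm):
    """Volume, bounding-box extent, and fill fraction of a binary 3D mask."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return {"volume_nm3": 0.0, "extent_nm": (0.0, 0.0, 0.0), "fill_fraction": 0.0}
    coords = np.argwhere(mask)  # (n_voxels, 3) array of z, y, x indices
    extent_vox = coords.max(axis=0) - coords.min(axis=0) + 1
    n_vox = int(mask.sum())
    return {
        "volume_nm3": n_vox * voxel_nm ** 3,
        "extent_nm": tuple(float(e * voxel_nm) for e in extent_vox),
        # Fraction of the bounding box occupied by the object (1.0 for a box).
        "fill_fraction": n_vox / float(extent_vox.prod()),
    }


# Toy mask: a 4 x 10 x 5 voxel block at an assumed isotropic 2 nm voxel size.
mask = np.zeros((10, 20, 20), dtype=bool)
mask[2:6, 5:15, 5:10] = True
stats = mask_morphometrics(mask, voxel_nm=2.0)
```

Measuring individual mitochondria would additionally require splitting the mask into connected components before computing per-object statistics.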

Impact

CryoViT is a methodological contribution at the intersection of vision foundation models and structural biology imaging. By showing that features learned on natural images transfer to the low-contrast, high-noise domain of cryo-ET, it opens a broader research direction in which large pretrained vision models reduce annotation cost across biological imaging modalities. The work was released as a bioRxiv preprint in June 2024 and has not yet undergone formal peer review, a limitation users should weigh when applying the method to critical analyses. Notable technical limitations include a dependence on 2D-to-3D consistency assumptions that may degrade for highly anisotropic structures, the computational overhead of DINOv2 feature extraction, and the absence of published systematic benchmarks beyond mitochondria in neuronal cryo-ET. CryoViT is available as an open-source Python package under the Chiu Lab GitHub organization, with full documentation on ReadTheDocs.

Citation

CryoViT: Efficient Segmentation of Cryogenic Electron Tomograms with Vision Foundation Models

Preprint

Gupte, S., et al. (2024) CryoViT: Efficient Segmentation of Cryogenic Electron Tomograms with Vision Foundation Models. bioRxiv.

DOI: 10.1101/2024.06.26.600701

Metrics

GitHub

Stars: 12
Forks: 1
Open Issues: 1
Contributors: 2
Last Push: 1 month ago
Language: Python
License: MIT

Citations

Total Citations: 5
Influential: 0
References: 64

Tags

segmentation, vision transformer, foundation model, semi-supervised, cryo-ET, electron tomography

Resources

GitHub Repository, Research Paper, Documentation