Semi-supervised cryo-ET segmentation framework that adapts DINOv2 vision transformers for 3D organelle annotation using sparse 2D slice labels.
CryoViT is a segmentation framework for cryogenic electron tomography (cryo-ET) developed by the Chiu Lab and the Yeung-Levy Lab at Stanford University. It addresses one of the most persistent bottlenecks in structural cell biology: the labor-intensive manual annotation required to delineate large, morphologically complex organelles — such as mitochondria — within 3D tomographic volumes. While cryo-ET uniquely captures subcellular architecture at nanometer resolution in near-native states, its quantitative utility has been limited by the scarcity of automated segmentation tools capable of handling the scale and variability of pleomorphic structures.
CryoViT reframes organelle segmentation by replacing conventional convolutional U-Net architectures with DINOv2, a self-supervised vision transformer pretrained by Meta AI on large-scale natural image datasets. The central insight is that DINOv2's dense patch-level feature representations, despite being learned on photographs, transfer meaningfully to the noisy, low-contrast domain of cryo-ET slices. A lightweight segmentation head is trained on top of these features using sparse 2D annotations drawn from individual cross-sectional slices, and predictions are assembled into coherent 3D volumetric masks across the entire tomogram.
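The decoding idea described above can be sketched as a simple probe on frozen patch features. The snippet below is an illustrative stand-in, not CryoViT's actual head: it assumes DINOv2 ViT-S/14 conventions (14-pixel patches, 384-dimensional embeddings), uses a plain linear probe with a sigmoid, and upsamples per-patch scores back to pixel resolution by nearest-neighbour repetition. The real segmentation head and its training procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 14       # DINOv2 ViT-S/14 patch size (pixels per patch side) -- assumed
FEAT_DIM = 384   # ViT-S/14 embedding dimension -- assumed

def segment_slice(patch_feats, weights, bias):
    """Score each patch of one 2D slice and upsample to pixel resolution.

    patch_feats: (h, w, FEAT_DIM) grid of frozen backbone features.
    Returns a (h * PATCH, w * PATCH) map of foreground probabilities.
    """
    logits = patch_feats @ weights + bias              # (h, w) per-patch logits
    probs = 1.0 / (1.0 + np.exp(-logits))              # sigmoid
    # nearest-neighbour upsample: each patch becomes a PATCH x PATCH pixel block
    return np.repeat(np.repeat(probs, PATCH, axis=0), PATCH, axis=1)

# toy check: a 448x448 slice yields a 32x32 patch grid with ViT-S/14
feats = rng.standard_normal((32, 32, FEAT_DIM))
w = rng.standard_normal(FEAT_DIM) * 0.01
mask = segment_slice(feats, w, 0.0)
print(mask.shape)  # (448, 448)
```

Because the backbone stays frozen, only the probe's weights are learned from the sparse 2D annotations, which is what keeps the trainable parameter count small.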
The model was validated on a neuronal cryo-ET dataset derived from induced pluripotent stem cell (iPSC) neurons from Huntington disease (HD) patients and from cultured HD mouse model neurons, demonstrating reliable mitochondrial segmentation across complex, heterogeneous tomographic volumes even under limited-annotation conditions.
CryoViT operates as a two-stage pipeline. In the first stage, individual 2D slices extracted from HDF5-formatted tomograms are passed through DINOv2 to generate dense patch-level feature maps. These features are precomputed and cached, avoiding repeated inference through the large backbone during training. In the second stage, a lightweight segmentation head is trained on the cached DINOv2 features using sparse 2D slice annotations. The resulting 2D predictions are then stacked to reconstruct a full 3D segmentation volume.
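The two-stage flow above can be outlined in a few lines. This is a schematic sketch only: the backbone and head are replaced by stand-in functions, and the names (extract_features, cached_features, predict_mask) are illustrative rather than CryoViT API.

```python
import numpy as np

def extract_features(slice_2d):
    # stand-in for a DINOv2 forward pass producing a patch-feature grid
    h, w = slice_2d.shape[0] // 14, slice_2d.shape[1] // 14
    return np.zeros((h, w, 384))

cache = {}

def cached_features(idx, slice_2d):
    # Stage 1: compute features once per slice, reuse across training epochs
    if idx not in cache:
        cache[idx] = extract_features(slice_2d)
    return cache[idx]

def predict_mask(feats):
    # stand-in for the trained lightweight segmentation head
    return np.zeros((feats.shape[0] * 14, feats.shape[1] * 14))

# Stage 2: predict every 2D slice, then stack predictions into a 3D volume
tomogram = np.zeros((64, 448, 448))  # toy (depth, height, width) volume
masks = [predict_mask(cached_features(z, tomogram[z]))
         for z in range(tomogram.shape[0])]
volume_mask = np.stack(masks, axis=0)
print(volume_mask.shape)  # (64, 448, 448)
```

Caching pays off because the backbone forward pass dominates the cost: across many training epochs, each slice's features are computed exactly once.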
The training and evaluation pipeline is managed through Hydra (YAML configuration files in src/cryovit/configs), exposing train_model and eval_model entry points. Tomographic data are stored in HDF5 format with raw data under a data key and ground-truth labels under labels/<label_name> keys, supporting efficient random-access loading of individual slices. The primary validation dataset consists of cryo-ET volumes of iPSC-derived neurons from HD patients and cultured HD mouse model neurons, with mitochondria as the annotation target. Quantitative benchmarks comparing CryoViT to U-Net baselines show improved segmentation quality, with the largest gains in low-data regimes where convolutional approaches fail to generalize from scarce training examples.
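The HDF5 layout described above (raw volume under a data key, ground truth under labels/<label_name>) can be exercised with h5py; the label name "mito" and the array shapes below are placeholders, not values mandated by CryoViT.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a toy tomogram in the described layout: "data" plus "labels/<name>"
path = os.path.join(tempfile.mkdtemp(), "tomogram.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.zeros((64, 448, 448), dtype=np.float32))
    f.create_dataset("labels/mito", data=np.zeros((64, 448, 448), dtype=np.uint8))

# Random-access loading: h5py slices read only the requested z-plane from
# disk, so individual 2D slices can be fetched without loading the volume.
with h5py.File(path, "r") as f:
    slice_img = f["data"][10]           # one 2D slice of raw data
    slice_lbl = f["labels/mito"][10]    # matching ground-truth slice
print(slice_img.shape, slice_lbl.shape)
```

This lazy slice access is what makes sparse 2D training practical: only annotated slices ever need to be read during training.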
CryoViT is intended for structural cell biologists and cryo-ET practitioners who need automated segmentation of large organelles without exhaustive volumetric labeling. Primary use cases include quantifying mitochondrial morphology changes in neurodegenerative disease models (Huntington disease, Parkinson disease) where mitochondrial dysfunction is a central pathological feature, mapping organelle 3D distributions and shape statistics across experimental conditions or patient-derived cell lines, and enabling large-scale structural phenotyping pipelines that would be impractical with manual annotation. The DINOv2-based architecture is extensible: with appropriate labeled data, the segmentation head can in principle be retrained for other large pleomorphic structures beyond mitochondria, broadening applicability to endoplasmic reticulum, lipid droplets, and other organelle classes.
CryoViT represents an important methodological contribution to the emerging intersection of vision foundation models and structural biology imaging. By demonstrating that features learned on natural images transfer to the challenging low-contrast, high-noise domain of cryo-ET, it opens a broader research direction in which large pretrained vision models reduce the annotation cost across biological imaging modalities. The work was released as a bioRxiv preprint in June 2024 and has not yet undergone formal peer review, a limitation users should weigh when applying the method to critical analyses. Notable technical limitations include the dependence on 2D-to-3D consistency assumptions that may degrade for highly anisotropic structures, the computational overhead of DINOv2 feature extraction, and the fact that systematic benchmarking beyond mitochondria in neuronal cryo-ET has not yet been published. CryoViT is available as an open-source Python package under the Chiu Lab GitHub organization with full documentation at ReadTheDocs.
Gupte, S., et al. (2024). CryoViT: Efficient Segmentation of Cryogenic Electron Tomograms with Vision Foundation Models. bioRxiv. DOI: 10.1101/2024.06.26.600701