Overview

CryoSAM is a training-free segmentation framework developed by the Xu Lab at Carnegie Mellon University that adapts 2D vision foundation models for 3D segmentation of cryogenic electron tomography (cryo-ET) volumes. Published at MICCAI 2024, CryoSAM enables complete tomogram semantic segmentation of subcellular structures from a single user-provided point prompt, with no fine-tuning or task-specific training data required.

Cryo-ET produces 3D volumetric images of cells and molecular assemblies under near-native conditions, but extracting meaningful structural information has historically depended on laborious manual annotation. Prior computational approaches relied on supervised learning or few-shot methods that still required labeled examples. CryoSAM removes this dependency by operating in a fully zero-shot, training-free regime, making it immediately applicable to new particle types and organisms without any model retraining.

The framework bridges the domain gap between 2D natural image foundation models and the noisy, low-contrast environment of cryo-ET data through two complementary mechanisms: Cross-Plane Self-Prompting, which propagates SAM segmentations plane-by-plane through the volume, and Hierarchical Feature Matching, which uses DINOv2 features derived from the single prompted particle to locate all similar instances across the full tomogram. Together these components turn one user click into a fully segmented volume.

Key Features

Training-free zero-shot inference: No fine-tuning or labeled cryo-ET data is required; the framework operates directly from a single user point prompt using pre-trained foundation model weights.
Cross-Plane Self-Prompting: SAM segmentation masks are recursively propagated from one 2D plane to adjacent planes along the Z-axis, building coherent 3D instance segmentation without a dedicated 3D model.
Hierarchical Feature Matching: A coarse-to-fine DINOv2 feature matching pipeline locates all instances of the prompted particle type across the full tomogram, cutting runtime by approximately 95% relative to naive dense matching.
Full tomogram coverage: A single prompted instance is sufficient to drive semantic segmentation across the entire 3D volume, covering all occurrences of that particle type.
Foundation model synergy: SAM (ViT-H backbone) handles prompted 2D segmentation while DINOv2 (ViT-L/14 backbone) provides discriminative patch-level features for cross-instance retrieval, outperforming SAM encoder features alone in matching tasks.
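
The coarse-to-fine idea behind Hierarchical Feature Matching can be illustrated with a small NumPy sketch. This is not the CryoSAM implementation: the function names, the mean-pooled coarse pass, the block size, and both thresholds are assumptions chosen for illustration, and a synthetic feature map stands in for real DINOv2 patch features.

```python
import numpy as np


def cosine_similarity(features, prototype):
    """Cosine similarity between each feature vector (last axis) and the prototype."""
    features = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    prototype = prototype / (np.linalg.norm(prototype) + 1e-8)
    return features @ prototype


def coarse_to_fine_match(feature_map, prototype, block=4,
                         coarse_thresh=0.2, fine_thresh=0.8):
    """Hypothetical two-stage matcher over an (H, W, C) patch-feature map.

    Coarse pass: mean-pool features over block x block cells and discard
    cells whose pooled feature is dissimilar to the prototype.
    Fine pass: score individual positions only inside the surviving cells.
    Returns the matched (row, col) positions and the number of fine comparisons.
    """
    h, w, c = feature_map.shape
    hb, wb = h // block, w // block
    pooled = (feature_map[:hb * block, :wb * block]
              .reshape(hb, block, wb, block, c)
              .mean(axis=(1, 3)))
    coarse_sim = cosine_similarity(pooled, prototype)  # shape (hb, wb)

    matches, n_fine = [], 0
    for i, j in zip(*np.nonzero(coarse_sim >= coarse_thresh)):
        cell = feature_map[i * block:(i + 1) * block, j * block:(j + 1) * block]
        fine_sim = cosine_similarity(cell, prototype)  # shape (block, block)
        n_fine += fine_sim.size
        for di, dj in zip(*np.nonzero(fine_sim >= fine_thresh)):
            matches.append((int(i * block + di), int(j * block + dj)))
    return matches, n_fine


# Synthetic stand-in for patch features: one "background" direction
# everywhere, one "particle" direction at two small blobs.
background, particle = np.eye(8)[0], np.eye(8)[1]
feature_map = np.tile(background, (16, 16, 1))
feature_map[4:6, 8:10] = particle    # the prompted instance
feature_map[12:14, 0:2] = particle   # a second, unprompted instance

# Prototype: mean feature over the prompted instance's patches.
prototype = feature_map[4:6, 8:10].reshape(-1, 8).mean(axis=0)

matches, n_fine = coarse_to_fine_match(feature_map, prototype)
```

In this toy setup the fine pass scores 32 of the 256 positions, and both blobs are recovered from the single prompted one. The savings on real tomograms depend on how sparse the target particles are and on the thresholds chosen.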

Technical Details

CryoSAM does not train a new model but instead orchestrates two pre-trained foundation models from Meta AI. DINOv2 with a ViT-L/14 backbone serves as the feature extractor: its self-supervised training on large natural image collections yields highly discriminative patch-level representations that transfer across the domain gap to cryo-ET data, where DINO and DINOv2 features substantially outperform SAM encoder features in segmentation quality. SAM with a ViT-H backbone serves as the prompted 2D segmentation engine.

The Cross-Plane Self-Prompting pipeline works as follows: given a user point prompt on a single XY slice, SAM segments the particle in that plane; the resulting mask centroid or bounding box becomes the prompt for the adjacent slice, continuing recursively in both directions along Z to produce a 3D instance segmentation.

The Hierarchical Feature Matching step then computes a mean DINOv2 feature vector from the prompted particle's image patches, applies a coarse filtering pass to eliminate dissimilar tomogram regions, proposes point prompts in surviving high-similarity areas, and runs the plane-by-plane segmentation pipeline on each candidate.

Validated on a public ribosome dataset from cryo-ET tomograms of Mycoplasma pneumoniae, CryoSAM with a single prompt exceeded supervised and few-shot baselines operating under 10% annotation budgets, while the hierarchical matching scheme reduced runtime by roughly 95% compared to exhaustive dense matching.
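
The plane-by-plane propagation in Cross-Plane Self-Prompting can be sketched as a short loop. The code below is a minimal illustration under stated assumptions, not the CryoSAM code: `toy_segment_2d` is a hypothetical flood-fill stand-in for a promptable 2D segmenter (it is not SAM), only the centroid variant of the prompt is shown, and all function names are invented for this example.

```python
from collections import deque

import numpy as np


def toy_segment_2d(image, point, threshold=0.5):
    """Stand-in for a promptable 2D segmenter (NOT SAM): flood-fill the
    above-threshold region that is 4-connected to `point` = (row, col)."""
    mask = np.zeros(image.shape, dtype=bool)
    r0, c0 = point
    if image[r0, c0] < threshold:
        return mask  # prompt landed on background: empty mask
    mask[r0, c0] = True
    queue = deque([(r0, c0)])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if (0 <= rr < image.shape[0] and 0 <= cc < image.shape[1]
                    and not mask[rr, cc] and image[rr, cc] >= threshold):
                mask[rr, cc] = True
                queue.append((rr, cc))
    return mask


def mask_centroid(mask):
    """Integer (row, col) centroid of a non-empty boolean mask."""
    rows, cols = np.nonzero(mask)
    return int(round(rows.mean())), int(round(cols.mean()))


def propagate_along_z(volume, z0, point, segment_2d=toy_segment_2d):
    """Segment slice z0 from the user point, then reuse each mask's centroid
    as the prompt for the neighbouring slice, in both Z directions."""
    seg = np.zeros(volume.shape, dtype=bool)
    seg[z0] = segment_2d(volume[z0], point)
    if not seg[z0].any():
        return seg
    for step in (1, -1):
        z, prompt = z0 + step, mask_centroid(seg[z0])
        while 0 <= z < volume.shape[0]:
            mask = segment_2d(volume[z], prompt)
            if not mask.any():
                break  # the object ends in this direction
            seg[z] = mask
            prompt = mask_centroid(mask)
            z += step
    return seg


# Toy volume (Z, Y, X) with two separate bright blocks; only the prompted
# one should be segmented.
volume = np.zeros((7, 9, 9))
volume[2:5, 3:6, 3:6] = 1.0   # prompted object
volume[2:5, 0:2, 7:9] = 1.0   # unrelated object
seg = propagate_along_z(volume, z0=3, point=(4, 4))
```

A centroid can fall outside a non-convex mask, which is one reason a bounding-box prompt may be preferable in practice. Passing a different `segment_2d` callable is where a real promptable model would plug in.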

Applications

CryoSAM is designed for structural biology workflows centered on cryo-ET analysis. Its primary use case is particle picking: locating and segmenting macromolecular complexes such as ribosomes and protein assemblies in tomograms prior to subtomogram averaging. It is equally suited to subcellular structure segmentation, identifying organelles and membranes in cellular tomograms without ground-truth annotations. Researchers can use CryoSAM for annotation-free exploration of tomogram content during pilot experiments, and for generating instance segmentation masks that bootstrap labeled datasets for downstream supervised models. The framework is implemented within the aitom Python toolkit maintained by the Xu Lab, which provides a broader set of cryo-ET analysis utilities including subtomogram alignment and classification.

Impact

CryoSAM demonstrates that large 2D vision foundation models can be repurposed for specialized 3D biological imaging tasks without domain-specific training, a finding with broad implications for cryo-ET and related volumetric imaging modalities. Its acceptance at MICCAI 2024 positions it as a practical tool for the structural biology community, particularly for labs that lack the labeled training data required by supervised approaches. Notable limitations include sensitivity to low-contrast or very noisy tomograms, where the domain gap remains a challenge; a limit of one particle type per prompted run; and no mechanism for exploiting existing cryo-ET annotations that could improve accuracy. Quantitative benchmarks are currently reported on a single public ribosome dataset, and broader validation across diverse particle types and organisms remains future work.

Citation

Zhao, Y., et al. (2024) CryoSAM: Training-Free CryoET Tomogram Segmentation with Foundation Models. International Conference on Medical Image Computing and Computer-Assisted Intervention.

DOI: 10.1007/978-3-031-72111-3_12
