Chan Zuckerberg Initiative
Eighth-place CZII CryoET Kaggle solution; a weighted model soup of tiny, medium, and large 3D U-Nets pretrained on simulated data and fine-tuned on experimental cryo-ET tomograms.
Ensemble 3D UNet Soup is the eighth-place solution from the CZII CryoET Object Identification Kaggle competition organized by the Chan Zuckerberg Imaging Institute (CZII) from November 2024 to February 2025. The model is now hosted on the CZ Virtual Cells Platform (v1.0) as part of CZII's effort to make top-performing particle picking algorithms accessible to the structural biology community through the copick ecosystem.
The approach takes its name from the "model soup" technique, in which multiple independently trained models are combined through weighted averaging of their parameters or predictions to produce an ensemble that typically outperforms any single constituent. Applied to the 3D U-Net architecture family, this strategy trains three model variants at different scales (tiny, medium, and large), pretrains each on the available simulated cryo-ET data to establish strong baseline representations, and then fine-tunes each variant on the experimental annotated tomograms. The final ensemble combines these three models with test-time augmentation, producing a robust particle picker that benefits from the complementary inductive biases of different model capacities.
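The weighted-averaging idea behind a model soup can be sketched in a few lines. The snippet below is a minimal illustration, not the solution's actual code: it averages the parameter tensors of architecturally identical models under normalized mixing weights (across the tiny/medium/large variants, which differ in architecture, the combination would instead average predictions).

```python
import numpy as np

def weighted_soup(state_dicts, weights):
    """Weighted average of model parameters (a "model soup").

    state_dicts: list of dicts mapping parameter name -> np.ndarray,
                 all from architecturally identical models.
    weights: per-model mixing coefficients; normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    souped = {}
    for name in state_dicts[0]:
        # element-wise convex combination of the same tensor across models
        souped[name] = sum(wi * sd[name] for wi, sd in zip(w, state_dicts))
    return souped

# Toy example: three "models", each with one 2x2 parameter tensor.
models = [{"conv.weight": np.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
soup = weighted_soup(models, weights=[0.5, 0.3, 0.2])
```

The same convex-combination logic applies unchanged whether the averaged objects are weight tensors or predicted heatmaps.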
The CZII competition attracted 1,135 participants and challenged them to automatically detect six protein complexes in cryo-ET volumes: apoferritin, beta-amylase, beta-galactosidase, 80S ribosomes, thyroglobulin, and virus-like particles. The eighth-place result demonstrates that a careful application of established deep learning techniques — model soups, pretraining on simulation, and test-time augmentation — can achieve competitive performance on a specialized biological imaging task without requiring architectural novelty.
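Test-time augmentation in volumetric particle picking typically averages predictions over the eight axis-aligned flip combinations of a 3D volume. The sketch below assumes a generic `model` callable mapping a volume to a same-shaped heatmap; it is an illustration of the technique, not the solution's inference code.

```python
import itertools
import numpy as np

def tta_predict(model, volume):
    """Average predictions over all 8 axis-aligned flips of a 3D volume.

    model: hypothetical callable mapping a (D, H, W) array to a
           same-shaped heatmap.
    """
    acc = np.zeros(volume.shape, dtype=np.float64)
    for flips in itertools.product([False, True], repeat=3):
        axes = tuple(i for i, f in enumerate(flips) if f)
        flipped = np.flip(volume, axis=axes) if axes else volume
        pred = model(flipped)
        # undo the flip so all predictions align spatially before averaging
        acc += np.flip(pred, axis=axes) if axes else pred
    return acc / 8.0
```

For a flip-equivariant task like heatmap regression, this averages away orientation-dependent prediction noise at the cost of eight forward passes.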
The core architecture is the 3D U-Net, implemented with three scale variants differentiated by the number of encoder-decoder levels and feature channel counts. The "tiny" variant uses shallower encoding with fewer channels, providing fast inference at the cost of some spatial context; the "medium" variant follows standard U-Net depth; and the "large" variant uses additional encoder levels to capture coarser contextual features at the expense of memory and compute. All three variants output multi-channel 3D heatmaps, one channel per target particle class, trained with a regression loss against Gaussian-blurred ground truth coordinate maps.
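The Gaussian-blurred regression targets described above can be constructed directly: place a Gaussian bump at each annotated particle center, one channel per class. This is a minimal NumPy sketch; the actual per-class blur widths used by the solution are not documented here, so `sigma` is a hypothetical value.

```python
import numpy as np

def make_heatmap_targets(coords_per_class, shape, sigma=2.0):
    """Per-class 3D Gaussian heatmap targets for a regression loss.

    coords_per_class: one (N, 3) array of voxel coords (z, y, x) per class.
    shape: (D, H, W) of the patch.
    sigma: blur width in voxels (hypothetical; the solution's per-class
           sigmas are not specified here).
    """
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    targets = np.zeros((len(coords_per_class),) + tuple(shape), dtype=np.float32)
    for c, coords in enumerate(coords_per_class):
        for z, y, x in np.asarray(coords, dtype=float):
            d2 = (zz - z) ** 2 + (yy - y) ** 2 + (xx - x) ** 2
            # take the max so overlapping particles keep unit-height peaks
            targets[c] = np.maximum(targets[c], np.exp(-d2 / (2 * sigma**2)))
    return targets
```

At inference time, local maxima of the predicted heatmaps above a per-class threshold yield the picked coordinates.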
Training patches of 128×128×128 voxels are extracted from full tomographic volumes during both pretraining and fine-tuning, with standard augmentations including random axis-aligned flips and small intensity perturbations. The simulated pretraining phase uses all 27 simulated runs from the CZII Kaggle phantom dataset, which were generated with realistic cryo-ET noise models, missing-wedge geometry, and particle packing densities calibrated to match experimental conditions. Fine-tuning adapts the pretrained weights to the six experimental tomograms, training separate ensemble members on different preprocessed versions of these tomograms — denoised (via IsoNet or similar), CTF-deconvolved, and unprocessed — so that the ensemble spans the preprocessing variation present in real experimental data. The final model soup weights are determined by grid search on the validation set.
Ensemble 3D UNet Soup is most directly applicable to cryo-ET subtomogram averaging pipelines targeting the six protein complexes in the CZII phantom benchmark. Because the model was systematically trained across multiple preprocessing variants, it may exhibit better robustness than single-preprocessing pickers when applied to experimental datasets with variable data quality or when users are uncertain about the optimal preprocessing strategy for their instrument conditions. The model soup approach also provides a principled baseline for practitioners who want to extend the method to new targets: training a new set of U-Nets at multiple scales on a new annotated dataset and applying the same soup aggregation strategy is straightforward using the published code. The copick-compatible output format allows picked coordinates to feed directly into CZ cryo-ET Data Portal workflows for collaborative visualization and quality assessment.
The Ensemble 3D UNet Soup illustrates the effectiveness of combining classical deep learning engineering practices — model soups, pretraining on simulation, test-time augmentation — in the context of cryo-ET particle picking, achieving eighth place among 1,135 competition entrants without requiring novel architectural components. Its inclusion on the CZ Virtual Cells Platform gives users an alternative to the higher-ranked TopCUP model, providing a different ensemble strategy (multiple U-Net scales vs. EfficientNet encoder variants) that may perform better or worse depending on the specific target complex and data characteristics. As of early 2026, the approach has not been described in a standalone peer-reviewed publication; the CZII competition outcomes were summarized in an accompanying preprint covering lessons from the challenge. Users should fine-tune on their own annotated data when applying to complexes outside the six phantom benchmark targets.