Chan Zuckerberg Initiative
Eighth-place CZII CryoET Kaggle solution; a weighted model soup of tiny, medium, and large 3D U-Nets pretrained on simulated data and fine-tuned on experimental cryo-ET tomograms.
Ensemble 3D UNet Soup is the eighth-place solution from the CZII CryoET Object Identification Kaggle competition organized by the Chan Zuckerberg Imaging Institute (CZII) from November 2024 to February 2025. The model is now hosted on the CZ Virtual Cells Platform (v1.0) as part of CZII's effort to make top-performing particle picking algorithms accessible to the structural biology community through the copick ecosystem.
The approach takes its name from the "model soup" technique, in which multiple independently trained models are combined through weighted averaging of their parameters or predictions to produce an ensemble that typically outperforms any single constituent. Applied to the 3D U-Net architecture family, this strategy trains three model variants at different scales (tiny, medium, and large), pretrains each on the available simulated cryo-ET data to establish strong baseline representations, and then fine-tunes each variant on the experimental annotated tomograms. The final ensemble combines these three models with test-time augmentation, producing a robust particle picker that benefits from the complementary inductive biases of different model capacities.
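The weighted-averaging idea behind a model soup can be sketched in a few lines. The snippet below is a minimal illustration, not the solution's actual code: it averages the parameter tensors of architecturally identical models under normalized mixing weights (across the tiny/medium/large variants, which differ in architecture, the combination would instead average predictions).

```python
import numpy as np

def weighted_soup(state_dicts, weights):
    """Weighted average of model parameters (a "model soup").

    state_dicts: list of dicts mapping parameter name -> np.ndarray,
                 all from architecturally identical models.
    weights: per-model mixing coefficients; normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    souped = {}
    for name in state_dicts[0]:
        # element-wise convex combination of the same tensor across models
        souped[name] = sum(wi * sd[name] for wi, sd in zip(w, state_dicts))
    return souped

# Toy example: three "models", each with one 2x2 parameter tensor.
models = [{"conv.weight": np.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
soup = weighted_soup(models, weights=[0.5, 0.3, 0.2])
```

The same convex-combination logic applies unchanged whether the averaged objects are weight tensors or predicted heatmaps.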
The CZII competition attracted 1,135 participants and challenged them to automatically detect six protein complexes in cryo-ET volumes: apoferritin, beta-amylase, beta-galactosidase, 80S ribosomes, thyroglobulin, and virus-like particles. The eighth-place result demonstrates that a careful application of established deep learning techniques — model soups, pretraining on simulation, and test-time augmentation — can achieve competitive performance on a specialized biological imaging task without requiring architectural novelty.
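Test-time augmentation in volumetric particle picking typically averages predictions over the eight axis-aligned flip combinations of a 3D volume. The sketch below assumes a generic `model` callable mapping a volume to a same-shaped heatmap; it is an illustration of the technique, not the solution's inference code.

```python
import itertools
import numpy as np

def tta_predict(model, volume):
    """Average predictions over all 8 axis-aligned flips of a 3D volume.

    model: hypothetical callable mapping a (D, H, W) array to a
           same-shaped heatmap.
    """
    acc = np.zeros(volume.shape, dtype=np.float64)
    for flips in itertools.product([False, True], repeat=3):
        axes = tuple(i for i, f in enumerate(flips) if f)
        flipped = np.flip(volume, axis=axes) if axes else volume
        pred = model(flipped)
        # undo the flip so all predictions align spatially before averaging
        acc += np.flip(pred, axis=axes) if axes else pred
    return acc / 8.0
```

For a flip-equivariant task like heatmap regression, this averages away orientation-dependent prediction noise at the cost of eight forward passes.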
The core architecture is the 3D U-Net, implemented with three scale variants differentiated by the number of encoder-decoder levels and feature channel counts. The "tiny" variant uses shallower encoding with fewer channels, providing fast inference at the cost of some spatial context; the "medium" variant follows standard U-Net depth; and the "large" variant uses additional encoder levels to capture coarser contextual features at the expense of memory and compute. All three variants output multi-channel 3D heatmaps, one channel per target particle class, trained with a regression loss against Gaussian-blurred ground truth coordinate maps.
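The Gaussian-blurred regression targets described above can be constructed directly: place a Gaussian bump at each annotated particle center, one channel per class. This is a minimal NumPy sketch; the actual per-class blur widths used by the solution are not documented here, so `sigma` is a hypothetical value.

```python
import numpy as np

def make_heatmap_targets(coords_per_class, shape, sigma=2.0):
    """Per-class 3D Gaussian heatmap targets for a regression loss.

    coords_per_class: one (N, 3) array of voxel coords (z, y, x) per class.
    shape: (D, H, W) of the patch.
    sigma: blur width in voxels (hypothetical; the solution's per-class
           sigmas are not specified here).
    """
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    targets = np.zeros((len(coords_per_class),) + tuple(shape), dtype=np.float32)
    for c, coords in enumerate(coords_per_class):
        for z, y, x in np.asarray(coords, dtype=float):
            d2 = (zz - z) ** 2 + (yy - y) ** 2 + (xx - x) ** 2
            # take the max so overlapping particles keep unit-height peaks
            targets[c] = np.maximum(targets[c], np.exp(-d2 / (2 * sigma**2)))
    return targets
```

At inference time, local maxima of the predicted heatmaps above a per-class threshold yield the picked coordinates.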
Training patches of 128×128×128 voxels are extracted from full tomographic volumes during both pretraining and fine-tuning, with standard augmentations including random axis-aligned flips and small intensity perturbations. The simulated pretraining phase uses all 27 simulated runs from the CZII Kaggle phantom dataset, which were generated with realistic cryo-ET noise models, missing-wedge geometry, and particle packing densities calibrated to match experimental conditions. Fine-tuning adapts the pretrained weights to the six experimental tomograms, training separate ensemble members on different preprocessed versions of these tomograms — denoised (via IsoNet or similar), CTF-deconvolved, and unprocessed — so that the ensemble spans the preprocessing variation present in real experimental data. The final model soup weights are determined by grid search on the validation set.
Ensemble 3D UNet Soup is most directly applicable to cryo-ET subtomogram averaging pipelines targeting the six protein complexes in the CZII phantom benchmark. Because the model was systematically trained across multiple preprocessing variants, it may exhibit better robustness than single-preprocessing pickers when applied to experimental datasets with variable data quality or when users are uncertain about the optimal preprocessing strategy for their instrument conditions. The model soup approach also provides a principled baseline for practitioners who want to extend the method to new targets: training a new set of U-Nets at multiple scales on a new annotated dataset and applying the same soup aggregation strategy is straightforward using the published code. The copick-compatible output format allows picked coordinates to feed directly into CZ cryo-ET Data Portal workflows for collaborative visualization and quality assessment.
The Ensemble 3D UNet Soup illustrates the effectiveness of combining classical deep learning engineering practices — model soups, pretraining on simulation, test-time augmentation — in the context of cryo-ET particle picking, achieving eighth place among 1,135 competition entrants without requiring novel architectural components. Its inclusion on the CZ Virtual Cells Platform gives users an alternative to the higher-ranked TopCUP model, providing a different ensemble strategy (multiple U-Net scales vs. EfficientNet encoder variants) that may perform better or worse depending on the specific target complex and data characteristics. As of early 2026, the approach has not been described in a standalone peer-reviewed publication; the CZII competition outcomes were summarized in an accompanying preprint covering lessons from the challenge. Users should fine-tune on their own annotated data when applying to complexes outside the six phantom benchmark targets.