Chan Zuckerberg Initiative
Seventh-place CZII CryoET Kaggle solution; an ensemble of three heatmap-predicting 3D segmentation models using ResNet50d and EfficientNetV2-M backbones for particle picking.
MonjuDetectHM is the seventh-place solution from the CZII CryoET Object Identification Kaggle competition, which the Chan Zuckerberg Imaging Institute (CZII) ran from November 2024 to February 2025. It is hosted on the CZ Virtual Cells Platform (v0.1.0) as part of CZII's effort to make competitive cryo-ET particle picking algorithms broadly accessible through the copick data ecosystem.
The model's name combines "Monju," the Japanese name for Manjushri, the bodhisattva of wisdom, with "DetectHM," which indicates its heatmap-based detection strategy. MonjuDetectHM is an ensemble of three 3D segmentation models that together predict the presence and 3D positions of protein complexes in cryo-ET tomograms by generating spatial heatmaps: each voxel receives a score reflecting the likelihood that a particle center lies nearby, and peak detection then extracts discrete coordinate predictions. This soft heatmap representation is a well-established approach in keypoint detection and is more robust to overlap and partial visibility than direct bounding box regression.
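The peak-extraction step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `extract_peaks`, the score threshold, and the greedy highest-score-first suppression with a minimum separation are all assumptions chosen for illustration.

```python
import numpy as np


def extract_peaks(heatmap, threshold=0.5, min_distance=3.0):
    """Greedy peak picking on a 3D heatmap indexed as (z, y, x).

    Voxels scoring at or above `threshold` are visited from highest to lowest
    score; a voxel is kept only if it lies at least `min_distance` voxels from
    every peak already kept. Both parameters are hypothetical defaults.
    """
    coords = np.argwhere(heatmap >= threshold)    # (N, 3) candidate voxels
    scores = heatmap[tuple(coords.T)]             # (N,) scores of the candidates
    order = np.argsort(-scores, kind="stable")    # highest score first
    kept = []
    for idx in order:
        candidate = coords[idx]
        if all(np.linalg.norm(candidate - p) >= min_distance for p in kept):
            kept.append(candidate)
    return np.array(kept, dtype=int).reshape(-1, 3)


# Toy volume: two separated peaks plus a weaker voxel next to the first one.
volume = np.zeros((16, 16, 16), dtype=np.float32)
volume[4, 4, 4] = 0.9
volume[4, 4, 5] = 0.7     # within min_distance of the first peak, so suppressed
volume[10, 12, 8] = 0.8
peaks = extract_peaks(volume, threshold=0.5, min_distance=3.0)
```

Here `peaks` holds the two voxel coordinates (4, 4, 4) and (10, 12, 8). The Python loop is quadratic in the number of candidates, so full-size tomograms would need a vectorized or filter-based local-maximum search instead.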
What distinguishes MonjuDetectHM is its use of heterogeneous backbone architectures within the ensemble: it combines two ResNet50d-based models with one EfficientNetV2-M model, which adds architectural diversity on top of the usual ensemble benefit of averaging out random-seed variance. The three models were trained in two stages, first on simulated data and then fine-tuned on experimental tomograms from the CZII phantom dataset, achieving a final private leaderboard score of 0.77708 under the competition's F-beta metric (beta=4).
MonjuDetectHM uses three independently trained 3D segmentation networks. Models 1 and 3 share a ResNet50d encoder architecture, with 2D convolutions inflated to 3D and initialized from the timm ra2_in1k weights (pretrained on ImageNet-1k). Model 2 uses an EfficientNetV2-M encoder, initialized from the timm_3d in21k_ft_in1k weights (pretrained on ImageNet-21k, then fine-tuned on ImageNet-1k). All three encoders feed into a U-Net-style decoder that upsamples feature maps back to the input resolution and produces multi-channel heatmap outputs. The decoder architecture is the same across all three models, so differences in ensemble behavior arise primarily from the encoder backbone rather than from decoder design.
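The source does not say how the 2D kernels are inflated. A common scheme, assumed here purely for illustration, copies each 2D kernel along a new depth axis and divides by the depth so that activation magnitudes are roughly preserved. The function name `inflate_conv2d_weight` and the random stand-in kernel are hypothetical.

```python
import numpy as np


def inflate_conv2d_weight(weight_2d, depth):
    """Inflate a 2D conv kernel (out, in, kh, kw) to 3D (out, in, depth, kh, kw).

    The kernel is repeated `depth` times along the new axis and divided by
    `depth`, so an input that is constant along depth yields the same response
    as the original 2D filter applied to a single slice.
    """
    weight_3d = np.repeat(weight_2d[:, :, None, :, :], depth, axis=2)
    return weight_3d / depth


# Random stand-in for a pretrained 7x7 stem kernel (not real ImageNet weights).
rng = np.random.default_rng(0)
w2d = rng.standard_normal((64, 3, 7, 7)).astype(np.float32)
w3d = inflate_conv2d_weight(w2d, depth=7)
```

Summing `w3d` over its depth axis recovers `w2d`, which is the property that lets ImageNet-pretrained 2D filters serve as a reasonable starting point for volumetric input.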
The training procedure follows a two-stage curriculum. In stage 1, all three models, with pretrained encoder weights and randomly initialized decoders, are trained on 27 simulated tomographic runs provided by the CZII competition dataset. These simulated runs measure 630×630×200 voxels and contain seven labeled particle types. In stage 2, the models are fine-tuned on the experimental phantom dataset, which consists of annotated real cryo-ET tomograms. Ground truth heatmaps are generated by placing 3D Gaussian blobs centered at annotated particle positions, with Gaussian widths calibrated to the known approximate size of each target complex. The ensemble prediction is computed by averaging the three models' heatmap outputs, and particle centers are extracted from the averaged heatmap using 3D local maxima detection with a minimum distance constraint. The final private leaderboard score of 0.77708 ranks seventh among 1,135 entrants.
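The target-generation and averaging steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `sigma` value, the toy volume size, and the choice to merge overlapping blobs with a maximum rather than a sum are all hypothetical.

```python
import numpy as np


def gaussian_heatmap(shape, centers, sigma):
    """Build a 3D target with one Gaussian blob per annotated center (z, y, x)."""
    zz, yy, xx = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    heatmap = np.zeros(shape, dtype=np.float32)
    for cz, cy, cx in centers:
        sq_dist = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        blob = np.exp(-sq_dist / (2.0 * sigma ** 2)).astype(np.float32)
        # Assumption: take the maximum so overlapping blobs stay within [0, 1].
        heatmap = np.maximum(heatmap, blob)
    return heatmap


def ensemble_average(heatmaps):
    """Average per-model heatmaps of identical shape, voxel by voxel."""
    return np.mean(np.stack(heatmaps, axis=0), axis=0)


# One particle class in a small toy volume; sigma would differ per class.
target = gaussian_heatmap((16, 32, 32), centers=[(8, 10, 10), (8, 20, 22)], sigma=2.0)

# Stand-ins for three model outputs: the target plus independent noise.
rng = np.random.default_rng(0)
predictions = [target + 0.05 * rng.standard_normal(target.shape) for _ in range(3)]
averaged = ensemble_average(predictions)
```

Averaging reduces the independent noise in each stand-in prediction while leaving the shared blob structure intact, which is the basic motivation for combining the three heatmaps before peak extraction.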
MonjuDetectHM is applicable to cryo-ET particle picking workflows targeting the five particle types it was trained to detect: apoferritin, beta-galactosidase, 80S ribosome, thyroglobulin, and virus-like particles (VLP). The heterogeneous backbone ensemble may be more robust than homogeneous ensembles on datasets with unusual noise characteristics or non-standard acquisition parameters, since ResNet50d and EfficientNetV2-M encoders learn differently structured feature hierarchies that partially compensate for each other's failure modes. Because training was tuned to the recall-weighted F-beta metric (beta=4), MonjuDetectHM is well suited to subtomogram averaging workflows where maximizing particle yield is important: initial picks can be passed through downstream 2D or 3D classification to remove false positives. Copick-compatible output from the Virtual Cells Platform integration allows picked coordinates to be visualized directly in napari and shared through the CZ cryo-ET Data Portal.
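To make the recall weighting concrete, the standard F-beta score is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below computes it from raw counts for a single particle class; the counts are invented for illustration, and the competition's full scoring procedure (matching picks to ground truth and aggregating across particle types) is not reproduced here.

```python
def f_beta(tp, fp, fn, beta=4.0):
    """F-beta from true positive, false positive, and false negative counts."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical picker A: many false positives, few misses (P = 0.50, R = 0.90).
high_recall = f_beta(tp=90, fp=90, fn=10)

# Hypothetical picker B: few false positives, many misses (P ~ 0.91, R = 0.50).
high_precision = f_beta(tp=50, fp=5, fn=50)
```

With beta=4, picker A scores about 0.86 and picker B about 0.51, so a missed particle costs far more than a spurious pick. This is why over-picking followed by classification is a sensible operating point for a model tuned to this metric.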
MonjuDetectHM gives the cryo-ET community a competitive benchmark model, ranked seventh in the competition, that achieves strong performance through architectural heterogeneity, an underexplored dimension in cryo-ET model ensembles. Its public availability on the CZ Virtual Cells Platform alongside the first-place TopCUP and fifth-place BPD models gives users a range of design philosophies to compare: EfficientNet-B3 encoder (TopCUP), shallow homogeneous U-Net ensemble (BPD), and heterogeneous backbone ensemble (MonjuDetectHM). As of early 2026, MonjuDetectHM has not been the subject of a standalone peer-reviewed publication, and its performance outside the six CZII phantom benchmark complexes has not been systematically characterized. Fine-tuning on new annotated data is recommended before applying the model to novel targets.