Overview

MonjuDetectHM is the seventh-place solution from the CZII CryoET Object Identification Kaggle competition organized by the Chan Zuckerberg Imaging Institute (CZII) from November 2024 to February 2025. It is hosted on the CZ Virtual Cells Platform (v0.1.0) as part of CZII's effort to make competitive cryo-ET particle picking algorithms broadly accessible through the copick data ecosystem.

The model's name combines "Monju", a reference to Monju Bosatsu, the bodhisattva of wisdom in Japanese Buddhism, with "DetectHM", indicating its heatmap-based detection strategy. MonjuDetectHM is an ensemble of three 3D segmentation models that jointly predict the presence and 3D positions of protein complexes in cryo-ET tomograms by generating spatial heatmaps. Each voxel receives a score that increases with the likelihood that a particle center lies nearby, and a peak detection step then extracts discrete coordinate predictions. This soft heatmap representation is a well-established approach in keypoint detection and is more robust to overlap and partial visibility than direct bounding box regression.

What distinguishes MonjuDetectHM is its use of heterogeneous backbone architectures within the ensemble: two ResNet50d-based models are combined with one EfficientNetV2-M model, which adds architectural diversity on top of the usual benefit of averaging out random-seed variance. The three models were trained in two stages, first on simulated data and then fine-tuned on experimental tomograms from the CZII phantom dataset, and the ensemble achieved a final private leaderboard score of 0.77708 under the competition's F-beta metric (beta=4).

Key Features

Heterogeneous backbone ensemble: Combines two ResNet50d-based 3D models (using pretrained timm ra2_in1k weights) and one EfficientNetV2-M-based model (using pretrained timm_3d in21k_ft_in1k weights), providing architectural diversity in the ensemble beyond simple multi-seed averaging.
Heatmap-based keypoint detection: Each model predicts a continuous 3D heatmap encoding the probability that a particle center lies at each voxel location, with post-processing peak detection extracting candidate positions — a formulation robust to particle overlap and size variation.
Two-stage training: Trains first on 27 simulated tomographic runs from the CZII Kaggle dataset to learn general structural features from abundant data, then fine-tunes on experimental annotated tomograms to adapt to real cryo-ET noise and contrast characteristics.
Multi-class multi-species detection: Predicts positions of five target protein complexes simultaneously (apoferritin, beta-galactosidase, 80S ribosome, thyroglobulin, and virus-like particle), with a separate heatmap channel per class.
ImageNet and large-scale pretrained weights: Leverages both ImageNet-1k weights (ra2_in1k) and weights pretrained on ImageNet-21k and fine-tuned on ImageNet-1k (in21k_ft_in1k) for 3D-inflated backbone initialization, giving the encoders a strong starting point despite the limited cryo-ET training data available.
F-beta optimized evaluation: Trained and tuned under the competition's F-beta metric with beta=4, which weights recall more heavily than precision. This is the appropriate trade-off for particle picking, where a missed particle (false negative) is more costly for subsequent averaging than a spurious pick (false positive).
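
To make the recall weighting of the F-beta metric concrete, the following minimal sketch computes F-beta from raw true-positive, false-positive, and false-negative counts. It illustrates only the standard formula; the competition's rule for matching a pick to a ground-truth particle and any per-class weighting are not reproduced here, and the two sets of counts are hypothetical.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 4.0) -> float:
    """Standard F-beta from raw counts; beta > 1 weights recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for two pickers evaluated on 100 true particles.
high_recall = (90, 60, 10)     # many picks: recall 0.90, precision 0.60
high_precision = (70, 5, 30)   # few picks: recall 0.70, precision ~0.93

for beta in (1.0, 4.0):
    print(beta, f_beta(*high_recall, beta=beta), f_beta(*high_precision, beta=beta))
```

With beta=1 the high-precision picker scores higher, while with beta=4 the ordering flips in favor of the high-recall picker, which is the behavior the metric is meant to reward.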

Technical Details

MonjuDetectHM uses three independently trained 3D segmentation networks. Models 1 and 3 share a ResNet50d encoder architecture, with 2D convolutions inflated to 3D and initialized from timm ra2_in1k weights (pretrained on ImageNet-1k). Model 2 uses an EfficientNetV2-M encoder initialized from timm_3d in21k_ft_in1k weights (pretrained on ImageNet-21k, then fine-tuned on ImageNet-1k). All three encoders feed into a U-Net-style decoder that upsamples feature maps back to the input resolution and produces multi-channel heatmap outputs. The decoder architecture is the same across all three models, so differences in ensemble behavior arise primarily from the encoder backbone rather than from decoder design.
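
The text above says the 2D convolutions are inflated to 3D but does not say how. One common scheme, assumed here purely for illustration, repeats each 2D kernel along a new depth axis and divides by the depth, so that a volume that is constant along depth produces the same activations as the original 2D layer. The function name and array shapes below are hypothetical and are not taken from the MonjuDetectHM code.

```python
import numpy as np


def inflate_conv2d_weight(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a (out, in, kH, kW) kernel to (out, in, depth, kH, kW).

    Assumed scheme: copy the 2D kernel `depth` times along a new axis and
    rescale by 1/depth so the summed response along depth is unchanged.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a 4D (out, in, kH, kW) kernel")
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], depth, axis=2)
    return w3d / depth


# Stand-in for a pretrained 3x3 convolution with 3 input and 8 output channels.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 3, 3))
w3d = inflate_conv2d_weight(w2d, depth=3)
print(w3d.shape)  # (8, 3, 3, 3, 3)
```

Summing the inflated kernel along its depth axis recovers the original 2D kernel, which is why this initialization preserves the pretrained response on depth-constant inputs.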

The training procedure follows a two-stage curriculum. In stage 1, all three models are trained, starting from pretrained encoder weights and randomly initialized decoders, on 27 simulated tomographic runs provided by the CZII competition dataset. These simulated runs have a resolution of 630×630×200 voxels and contain seven labeled particle types. In stage 2, the models are fine-tuned on the experimental phantom dataset, which consists of annotated real cryo-ET tomograms. Ground truth heatmaps are generated by placing 3D Gaussian blobs centered at annotated particle positions, with Gaussian widths calibrated to the known approximate size of each target complex. The ensemble prediction is computed by averaging the three models' heatmap outputs, and particle centers are extracted from the averaged heatmap using 3D local maxima detection with a minimum distance constraint. The final private leaderboard score of 0.77708 ranks seventh among 1,135 entrants.
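
The heatmap pipeline described above (Gaussian targets, ensemble averaging, local maxima with a minimum distance) can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the single shared sigma, the threshold, the 26-neighbor local-maximum test, and the greedy distance-based suppression are all choices made here for clarity, and the three "model outputs" are simply rescaled copies of the target.

```python
import numpy as np


def gaussian_heatmap(shape, centers, sigma):
    """Target volume: voxel-wise max over Gaussian blobs at (z, y, x) centers."""
    zz, yy, xx = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    heat = np.zeros(shape, dtype=np.float64)
    for cz, cy, cx in centers:
        d2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        heat = np.maximum(heat, np.exp(-d2 / (2.0 * sigma ** 2)))
    return heat


def local_maxima_mask(heat):
    """True where a voxel is >= all 26 neighbors (borders padded with -inf)."""
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    d, h, w = heat.shape
    mask = np.ones(heat.shape, dtype=bool)
    for dz in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dz == 0 and dy == 0 and dx == 0:
                    continue
                shifted = padded[1 + dz:1 + dz + d, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                mask &= heat >= shifted
    return mask


def find_peaks(heat, threshold, min_dist):
    """Local maxima above threshold, kept greedily by score with spacing >= min_dist."""
    candidates = local_maxima_mask(heat) & (heat >= threshold)
    coords = np.argwhere(candidates)
    order = np.argsort(-heat[candidates])
    kept = []
    for idx in order:
        c = coords[idx]
        if all(np.linalg.norm(c - k) >= min_dist for k in kept):
            kept.append(c)
    return [tuple(int(v) for v in c) for c in kept]


# Toy example: two particles of one class, sigma chosen arbitrarily.
truth = gaussian_heatmap((16, 32, 32), [(8, 8, 8), (8, 20, 22)], sigma=2.0)
model_outputs = [truth * s for s in (0.9, 1.0, 1.1)]  # stand-ins for three models
ensemble = np.mean(np.stack(model_outputs), axis=0)
peaks = find_peaks(ensemble, threshold=0.5, min_dist=4.0)
print(sorted(peaks))
```

In a multi-class setting the same steps would run once per heatmap channel, with sigma and min_dist chosen per particle type.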

Applications

MonjuDetectHM is applicable to cryo-ET particle picking workflows targeting the five protein complexes it was trained to detect (apoferritin, beta-galactosidase, 80S ribosome, thyroglobulin, VLP). The heterogeneous backbone ensemble may offer improved robustness compared to homogeneous ensembles on datasets with unusual noise characteristics or non-standard acquisition parameters, since ResNet50d and EfficientNetV2-M encoders learn differently structured feature hierarchies that partially compensate for each other's failure modes. The recall-prioritized training under F-beta (beta=4) makes MonjuDetectHM well suited to subtomogram averaging workflows where maximizing particle yield is important, because initial picks can be subjected to downstream 2D or 3D classification to remove false positives. Copick-compatible output from the Virtual Cells Platform integration allows picked coordinates to be visualized directly in napari and shared through the CZ cryo-ET Data Portal.

Impact

MonjuDetectHM gives the cryo-ET community a competitive, seventh-place benchmark model that uses architectural heterogeneity, an underexplored dimension in cryo-ET model ensembles, to achieve strong performance. Its public availability on the CZ Virtual Cells Platform alongside the first-place TopCUP and fifth-place BPD models gives users a range of design philosophies to compare: EfficientNet-B3 encoder (TopCUP), shallow homogeneous U-Net ensemble (BPD), and heterogeneous backbone ensemble (MonjuDetectHM). As of early 2026, MonjuDetectHM has not been the subject of a standalone peer-reviewed publication, and its performance outside the CZII phantom benchmark complexes has not been systematically characterized. Fine-tuning on new annotated data is recommended before applying the model to novel targets.
