A native 3D vision transformer, trained by self-supervision on unlabeled fluorescence microscopy volumes, that generalizes to unseen object classes without retraining or voxel annotations.
SpatialDINO is a self-supervised 3D vision foundation model for fluorescence microscopy, developed in Tom Kirchhausen's laboratory at Harvard Medical School and released as a preprint in December 2025. It tackles a persistent bottleneck in volumetric bioimage analysis: deep-learning models for detecting, segmenting, and tracking subcellular structures in 3D typically require dense, manually drawn voxel annotations, which are extraordinarily labor-intensive to produce and tie a trained model to the specific object classes it was supervised on. SpatialDINO sidesteps this by learning general-purpose 3D representations from unlabeled microscopy volumes alone.
The model adapts the DINOv2 self-supervised learning framework into a native 3D vision transformer that operates directly on image volumes rather than slice-by-slice. By training on raw, unannotated fluorescence microscopy data, SpatialDINO learns features that capture the geometry and appearance of subcellular structures without ever being told what those structures are. The central claim of the work is generalization: a single pretrained model transfers to object classes it never saw during training — including plasma membranes, nuclei, and even tumors in MRI volumes — without any retraining or voxel-level labels.
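To make "native 3D" concrete, the sketch below shows how a volume can be cut into non-overlapping 3D patches so that each token covers a small cube of voxels rather than a square from a single slice. It is a minimal illustration in NumPy, not the authors' implementation: the function name `patchify_3d`, the patch size `(4, 8, 8)`, and the toy volume shape are all assumptions chosen for the example.

```python
import numpy as np


def patchify_3d(volume: np.ndarray, patch: tuple) -> np.ndarray:
    """Split a (D, H, W) volume into non-overlapping 3D patches.

    Returns an array of shape (num_patches, pd * ph * pw): one flattened
    token per patch. Hypothetical helper, for illustration only.
    """
    d, h, w = volume.shape
    pd, ph, pw = patch
    if d % pd or h % ph or w % pw:
        raise ValueError("volume shape must be divisible by the patch size")
    # Split each axis into (number of patches, voxels per patch).
    grid = volume.reshape(d // pd, pd, h // ph, ph, w // pw, pw)
    # Move the three patch-grid axes to the front, voxel axes to the back.
    grid = grid.transpose(0, 2, 4, 1, 3, 5)
    # One row per patch, one column per voxel inside that patch.
    return grid.reshape(-1, pd * ph * pw)


# Toy volume standing in for a microscopy stack (assumed shape).
volume = np.arange(8 * 16 * 16, dtype=np.float32).reshape(8, 16, 16)
tokens = patchify_3d(volume, (4, 8, 8))
print(tokens.shape)
```

In a full vision transformer, each flattened patch would then pass through a learned linear projection and receive a 3D positional encoding before entering the attention layers. A slice-by-slice model would instead tokenize every 2D plane independently and lose the context along the depth axis.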
SpatialDINO sits within the broader movement to bring self-supervised vision foundation models, which have reshaped natural-image analysis, into volumetric biological imaging. Its emphasis on native 3D processing and annotation-free transfer distinguishes it from segmentation tools that depend on task-specific supervised training.
Architecturally, SpatialDINO carries the DINOv2 self-supervised paradigm into three dimensions: the transformer processes each volume end-to-end and produces features that downstream detection, segmentation, and tracking tasks can reuse. A notable practical point is the scale of pretraining. The authors describe training on a relatively modest corpus of confocal volumes, yet report transfer to object classes, and even to an imaging modality (MRI), outside the training distribution. This data efficiency, combined with the absence of voxel-level annotation, is the model's principal contribution. Full architectural parameters and quantitative benchmark results are given in the preprint and should be confirmed against the peer-reviewed version once one is available.
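One common way to use frozen self-supervised features without retraining is to compare every patch embedding against the embedding of a single example location and keep the patches that look similar. The sketch below illustrates that idea with random stand-in features. It is an assumption about how such features might be used, not the procedure described in the preprint. The function name `cosine_similarity_map`, the feature grid shape, and the 0.9 threshold are hypothetical.

```python
import numpy as np


def cosine_similarity_map(features: np.ndarray, query: np.ndarray,
                          eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between each patch embedding and a query embedding.

    features: (nD, nH, nW, C) grid of patch embeddings from a frozen backbone.
    query:    (C,) embedding of an example patch chosen by the user.
    Returns an (nD, nH, nW) similarity map. Hypothetical helper.
    """
    f_unit = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    q_unit = query / (np.linalg.norm(query) + eps)
    return f_unit @ q_unit


# Random stand-in for backbone output; real features would come from the model.
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4, 4, 16))

# Pretend the user clicked on the patch at grid position (1, 2, 3).
query = features[1, 2, 3].copy()

sim = cosine_similarity_map(features, query)
best = np.unravel_index(np.argmax(sim), sim.shape)
candidate_mask = sim > 0.9  # assumed threshold; would need tuning per dataset
print(best, sim.shape)
```

No weights are updated anywhere in this sketch, which is the sense in which such a workflow is annotation-free: the only human input is the choice of an example location.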
SpatialDINO is aimed at cell biologists and imaging scientists who work with 3D fluorescence microscopy and need to detect, segment, or track subcellular structures without investing in extensive manual annotation. Because the model generalizes to unseen object classes, a single pretrained backbone can be reused across diverse experiments — different organelles, cell types, or even other volumetric modalities such as MRI — lowering the cost of analysis for labs that cannot produce large annotated training sets. This makes it well suited to exploratory imaging studies and high-content volumetric assays.
By demonstrating that a self-supervised 3D vision transformer trained on modest, unlabeled microscopy data can generalize across object classes and modalities, SpatialDINO points toward annotation-free foundation models for volumetric bioimaging, an area where labeling costs have long limited the reach of deep learning. Its broader influence will depend on independent evaluation and uptake by the imaging community. Availability is a limitation: the preprint does not provide public code or model weights, and the preprint itself is released under a non-commercial, no-derivatives license. The pretraining corpus is also small, so the breadth of generalization beyond the reported cases remains to be established by others.