bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Imaging

SpatialDINO

Harvard Medical School

A native 3D vision transformer self-supervised on unlabeled fluorescence microscopy volumes that generalizes to unseen object classes without retraining or voxel annotations.

Released: December 2025

SpatialDINO is a self-supervised 3D vision foundation model for fluorescence microscopy, developed in Tom Kirchhausen's laboratory at Harvard Medical School and released as a preprint in December 2025. It tackles a persistent bottleneck in volumetric bioimage analysis: deep-learning models for detecting, segmenting, and tracking subcellular structures in 3D typically require dense, manually drawn voxel annotations, which are extraordinarily labor-intensive to produce and tie a trained model to the specific object classes it was supervised on. SpatialDINO sidesteps this by learning general-purpose 3D representations from unlabeled microscopy volumes alone.

The model adapts the DINOv2 self-supervised learning framework into a native 3D vision transformer that operates directly on image volumes rather than slice-by-slice. By training on raw, unannotated fluorescence microscopy data, SpatialDINO learns features that capture the geometry and appearance of subcellular structures without ever being told what those structures are. The central claim of the work is generalization: a single pretrained model transfers to object classes it never saw during training — including plasma membranes, nuclei, and even tumors in MRI volumes — without any retraining or voxel-level labels.

SpatialDINO sits within the broader movement to bring self-supervised vision foundation models, which have reshaped natural-image analysis, into volumetric biological imaging. Its emphasis on native 3D processing and annotation-free transfer distinguishes it from segmentation tools that depend on task-specific supervised training.

#Key Features

  • Native 3D vision transformer: SpatialDINO operates directly on image volumes rather than processing 2D slices independently, preserving the volumetric context essential for accurate analysis of 3D subcellular structures.
  • Self-supervised, annotation-free pretraining: Built on the DINOv2 framework, the model learns from unlabeled fluorescence microscopy volumes, eliminating the need for the dense voxel annotations that conventional 3D segmentation models require.
  • Generalization to unseen object classes: A single pretrained model transfers to structures absent from training — plasma membranes, nuclei, and MRI tumors — without retraining, demonstrating learned features that are not tied to specific labels.
  • Multi-task downstream use: The learned representations support detection, segmentation, and tracking of subcellular structures in 3D from one shared backbone.
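
The contrast between native 3D processing and slice-by-slice processing can be illustrated with a minimal tokenization sketch. This is not the authors' implementation: the function name `patchify_3d`, the patch sizes, and the toy volume are all assumptions chosen for illustration, and the preprint should be consulted for the actual patch geometry.

```python
import numpy as np

def patchify_3d(volume, patch=(4, 8, 8)):
    """Split a (D, H, W) volume into non-overlapping patches, one token per patch.

    Hypothetical helper for illustration; patch sizes are not taken from the preprint.
    """
    d, h, w = volume.shape
    pd, ph, pw = patch
    if d % pd or h % ph or w % pw:
        raise ValueError("volume dimensions must be divisible by the patch size")
    # Expose the patch grid as separate axes, then move the grid axes to the front.
    grid = volume.reshape(d // pd, pd, h // ph, ph, w // pw, pw)
    grid = grid.transpose(0, 2, 4, 1, 3, 5)
    # One row per token, one column per voxel inside the patch.
    return grid.reshape(-1, pd * ph * pw)

# Toy volume standing in for a fluorescence stack (8 z-planes of 16 x 16 pixels).
volume = np.arange(8 * 16 * 16, dtype=np.float32).reshape(8, 16, 16)

tokens_3d = patchify_3d(volume)             # each token spans 4 z-planes
tokens_2d = patchify_3d(volume, (1, 8, 8))  # slice-by-slice: each token sees one z-plane
```

With a depth of 1 the same function reproduces slice-wise tokenization, which makes the difference concrete: a 3D token carries voxels from several z-planes, so axial context is present before any attention layer runs.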

#Technical Details

SpatialDINO extends the DINOv2 self-supervised paradigm into a native 3D vision transformer trained on unlabeled fluorescence microscopy volumes. The architecture processes volumetric data end-to-end, producing features that can be applied downstream to detection, segmentation, and tracking. The scale of pretraining is notable: the authors describe training on a relatively modest corpus of confocal volumes, yet report transfer to object classes, and even to an imaging modality (MRI), outside the training distribution. This data efficiency, combined with the absence of voxel-level annotation, is the model's principal contribution. Full architectural parameters and quantitative benchmark results are in the preprint and should be rechecked against a peer-reviewed version if one appears.
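
For readers unfamiliar with the DINO family, the self-distillation objective it builds on can be sketched as follows. This shows only the generic global DINO term (a student matching a centered, sharpened teacher distribution, with the teacher updated as a moving average of the student). DINOv2 adds further terms that are omitted here, and nothing below is taken from the SpatialDINO code: the temperatures, momentum, and function names are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student.

    Temperatures are illustrative values, not SpatialDINO's settings.
    """
    teacher_probs = np.exp(log_softmax((teacher_logits - center) / teacher_temp))
    student_logp = log_softmax(student_logits / student_temp)
    return float(-(teacher_probs * student_logp).sum(axis=-1).mean())

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return momentum * teacher + (1.0 - momentum) * student

# Toy check: a student that agrees with the teacher scores lower than one that does not.
teacher_logits = np.eye(3)
center = np.zeros(3)
aligned = np.eye(3) * 5.0
misaligned = np.roll(aligned, 1, axis=1)

loss_aligned = dino_loss(aligned, teacher_logits, center)
loss_misaligned = dino_loss(misaligned, teacher_logits, center)
new_teacher = ema_update(np.zeros(2), np.ones(2), momentum=0.9)
```

No labels appear anywhere in the objective: the only training signal is agreement between two views of the same unlabeled volume, which is what makes annotation-free pretraining possible.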

#Applications

SpatialDINO is aimed at cell biologists and imaging scientists who work with 3D fluorescence microscopy and need to detect, segment, or track subcellular structures without investing in extensive manual annotation. Because the model generalizes to unseen object classes, a single pretrained backbone can be reused across diverse experiments — different organelles, cell types, or even other volumetric modalities such as MRI — lowering the cost of analysis for labs that cannot produce large annotated training sets. This makes it well suited to exploratory imaging studies and high-content volumetric assays.
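
One generic way to reuse a frozen backbone without retraining is to compare per-patch features against a single reference patch and threshold the similarity. The sketch below illustrates that idea only. This summary does not state how SpatialDINO's downstream pipeline works, so the approach, the function name `cosine_similarity_map`, the threshold, and the synthetic feature grid (standing in for real backbone output) are all assumptions.

```python
import numpy as np

def cosine_similarity_map(features, query):
    """Cosine similarity of every patch feature to one query feature.

    features: (D, H, W, C) grid of per-patch features; query: (C,) vector.
    Returns a (D, H, W) similarity map.
    """
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    return f @ q

# Synthetic stand-in for frozen backbone output: background and object patches
# point in different feature directions. Real features would come from the model.
background = np.array([0.2, 1.0, 0, 0, 0, 0, 0, 0], dtype=np.float64)
obj = np.array([1.0, 0.2, 0, 0, 0, 0, 0, 0], dtype=np.float64)

features = np.empty((4, 6, 6, 8))
features[...] = background
features[1:3, 2:4, 2:4] = obj  # a 2 x 2 x 2 block of "object" patches

query = features[1, 2, 2]                    # one user-picked example patch
similarity = cosine_similarity_map(features, query)
mask = similarity > 0.8                      # illustrative threshold
```

A single example patch selects the whole object block here because the features separate cleanly. On real data the threshold and the choice of reference patch would need validation against the structures of interest.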

#Impact

By demonstrating that a self-supervised 3D vision transformer trained on modest, unlabeled microscopy data can generalize across object classes and modalities, SpatialDINO points toward annotation-free foundation models for volumetric bioimaging, an area where labeling costs have long limited the reach of deep learning. Its broader influence will depend on independent evaluation and uptake by the imaging community. Availability is a limitation: the preprint provides no public code or model weights and is released under a non-commercial, no-derivatives license. The pretraining corpus is also small, so the breadth of generalization beyond the reported cases remains to be established by others.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Overall: 8 (Closed)
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 10
Model Openness Framework
Unclassified
Restrictive license on core components

Tags

segmentation, object_detection, cell_tracking, vision_transformer, self_supervised, foundation_model, representation_learning, microscopy, cell_biology

Resources

Research Paper