SpatialDINO

Native 3D vision transformer self-supervised on unlabeled fluorescence microscopy volumes, segmenting subcellular structures without voxel labels.

Released: December 2025

SpatialDINO is a self-supervised 3D vision foundation model for fluorescence microscopy, developed in Tom Kirchhausen's laboratory at Harvard Medical School and released as a preprint in December 2025. It tackles a persistent bottleneck in volumetric bioimage analysis: deep-learning models for detecting, segmenting, and tracking subcellular structures in 3D typically require dense, manually drawn voxel annotations, which are extraordinarily labor-intensive to produce and tie a trained model to the specific object classes it was supervised on. SpatialDINO sidesteps this by learning general-purpose 3D representations from unlabeled microscopy volumes alone.

The model adapts the DINOv2 self-supervised learning framework into a native 3D vision transformer that operates directly on image volumes rather than slice-by-slice. By training on raw, unannotated fluorescence microscopy data, SpatialDINO learns features that capture the geometry and appearance of subcellular structures without ever being told what those structures are. The central claim of the work is generalization: a single pretrained model transfers to object classes it never saw during training — including plasma membranes, nuclei, and even tumors in MRI volumes — without any retraining or voxel-level labels.

SpatialDINO sits within the broader movement to bring self-supervised vision foundation models, which have reshaped natural-image analysis, into volumetric biological imaging. Its emphasis on native 3D processing and annotation-free transfer distinguishes it from segmentation tools that depend on task-specific supervised training.

Key Features

Native 3D vision transformer: SpatialDINO operates directly on image volumes rather than processing 2D slices independently, preserving the volumetric context essential for accurate analysis of 3D subcellular structures.
Self-supervised, annotation-free pretraining: Built on the DINOv2 framework, the model learns from unlabeled fluorescence microscopy volumes, eliminating the need for the dense voxel annotations that conventional 3D segmentation models require.
Generalization to unseen object classes: A single pretrained model transfers to structures absent from training — plasma membranes, nuclei, and MRI tumors — without retraining, demonstrating learned features that are not tied to specific labels.
Multi-task downstream use: The learned representations support detection, segmentation, and tracking of subcellular structures in 3D from one shared backbone.
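The entry does not say how the learned representations are turned into masks or detections. As a hedged illustration only, one common way to use frozen self-supervised features without labels is to compare every token's feature vector with that of a single reference location. The sketch below uses random numbers as a stand-in for backbone output; the grid size, feature dimension, threshold, and the function name `cosine_similarity_map` are all assumptions and are not taken from the preprint.

```python
import numpy as np

def cosine_similarity_map(features, query):
    """Cosine similarity between one query vector and every token feature.

    features: array of shape (D, H, W, C), one C-dim vector per 3D token.
    query:    array of shape (C,), e.g. the feature at a clicked location.
    """
    feats = features / np.linalg.norm(features, axis=-1, keepdims=True)
    q = query / np.linalg.norm(query)
    return feats @ q  # shape (D, H, W)

rng = np.random.default_rng(0)
grid = (4, 8, 8)   # hypothetical token grid (depth, height, width)
dim = 16           # hypothetical feature dimension
features = rng.normal(size=(*grid, dim))  # stand-in for frozen backbone output

query = features[1, 2, 3]                 # feature at one reference token
sim = cosine_similarity_map(features, query)
mask = sim > 0.9                          # arbitrary threshold for illustration
print(sim.shape, bool(mask[1, 2, 3]))
```

With real features, tokens belonging to the same kind of structure as the reference would be expected to score high. No retraining is involved in this step, which is the property the feature list emphasizes.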

Technical Details

SpatialDINO extends the DINOv2 self-supervised paradigm into a native 3D vision transformer trained on unlabeled fluorescence microscopy volumes. The architecture processes volumetric data end-to-end, producing features that can be applied downstream to detection, segmentation, and tracking. A notable practical point is the scale of pretraining: the authors describe training on a relatively modest corpus of confocal volumes, yet report transfer to object classes and even an imaging modality (MRI) outside the training distribution. This data efficiency, combined with the absence of voxel-level annotation, is the model's principal contribution. Full architectural parameters and quantitative benchmark results are given in the preprint and should be checked against a peer-reviewed version once one is available.
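To make "native 3D" concrete, the sketch below shows how a vision transformer can tokenize a volume into cubic patches instead of tokenizing each 2D slice separately. It is a minimal illustration under stated assumptions: the patch size, the volume shape, and the helper name `patchify_3d` are hypothetical, and the preprint's actual patch embedding may differ.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches.

    Returns an array of shape (num_tokens, patch**3), one row per patch.
    """
    d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each volume dimension must be divisible by the patch size")
    gd, gh, gw = d // patch, h // patch, w // patch
    # (gd, p, gh, p, gw, p) -> (gd, gh, gw, p, p, p) -> (tokens, p**3)
    blocks = volume.reshape(gd, patch, gh, patch, gw, patch)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(gd * gh * gw, patch ** 3)

# Hypothetical toy volume: 8 z-planes of 16 x 16 pixels.
volume = np.arange(8 * 16 * 16, dtype=np.float32).reshape(8, 16, 16)
tokens = patchify_3d(volume, patch=4)
print(tokens.shape)  # (32, 64)
```

Each token here covers a 4 x 4 x 4 neighborhood, so it carries context along the z-axis as well as in-plane. A slice-by-slice model would see each plane separately and could only recover that context afterwards.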

Applications

SpatialDINO is aimed at cell biologists and imaging scientists who work with 3D fluorescence microscopy and need to detect, segment, or track subcellular structures without investing in extensive manual annotation. Because the model generalizes to unseen object classes, a single pretrained backbone can be reused across diverse experiments — different organelles, cell types, or even other volumetric modalities such as MRI — lowering the cost of analysis for labs that cannot produce large annotated training sets. This makes it well suited to exploratory imaging studies and high-content volumetric assays.

Impact

By demonstrating that a self-supervised 3D vision transformer trained on modest, unlabeled microscopy data can generalize across object classes and modalities, SpatialDINO points toward annotation-free foundation models for volumetric bioimaging, an area where labeling costs have long limited the reach of deep learning. Its broader influence will depend on independent evaluation and uptake by the imaging community. Availability is a limitation: no public code or model weights accompany the preprint, the preprint is released under a non-commercial, no-derivatives license, and the pretraining corpus is small. The breadth of generalization beyond the reported cases therefore remains to be established by others.

Citation

SpatialDINO: A Self-Supervised 3D Vision Transformer that enables Segmentation and Tracking in Crowded Cellular Environments

Lavaee, A., et al. (2026) SpatialDINO: A Self-Supervised 3D Vision Transformer that enables Segmentation and Tracking in Crowded Cellular Environments. bioRxiv.

DOI: 10.64898/2025.12.31.697247

Recent citations

Papers that recently cited this model.

Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?
Caterina Fuster-Barceló, V. Uhlmann
arXiv.org · Feb 2026
0

Citations

Total Citations: 1

Influential: 0

References: 0

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

8 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 10

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

Research Paper
