Spine MRI is central to diagnosing back pain, spinal stenosis, trauma, and tumors, but interpreting it is slow and complex: radiologists must synthesize findings across multiple imaging sequences (T1-weighted, T2-weighted, STIR, Dixon) and anatomical levels. SpineAgent is a multi-sequence spine-MRI foundation model and accompanying multi-agent system that learns transferable visual representations from routine clinical imaging and applies them across the full spectrum of interpretation tasks, from condition classification to draft report generation.

Developed by Zhiping Xiao, Nathan M. Cross, Sheng Wang, and colleagues at the University of Washington (with collaborators at Peking University, UW–Madison, and NYU) and posted to bioRxiv in June 2026, SpineAgent is built on a self-supervised foundation model pretrained on one of the largest spine-MRI corpora reported to date: 32,047 patients, 453,683 series, and 13.4 million slices from University of Washington Medicine. Its core is a pair of DINOv3-based Vision Transformer encoders trained separately on T1- and T2-weighted data, which produce fixed patient-level embeddings that are reused across many downstream agents.

By decomposing radiology reporting into clinically grounded subtasks, each handled by a specialized agent that draws on the shared encoders, SpineAgent demonstrates that a single imaging foundation model can generalize across manufacturers and external cohorts, a recurring challenge for medical-imaging deep learning.

Key Features

Paired DINOv3 encoders: Two Vision Transformers are pretrained independently with the DINOv3 self-supervised objective (with Gram anchoring), one on T1-weighted slices and one on T2-weighted slices. A lightweight synthesizer module learns layer-wise fusion and is adapted to other sequence types (STIR, Dixon) via continual training.
17-condition classification: Patient-level embeddings drive classification across 17 spinal conditions spanning degenerative changes, alignment abnormalities, lesions/masses, trauma/compression, and canal or foraminal narrowing, with labels distilled from clinical reports under a presence/absence/ambiguity scheme.
Pathology localization: A two-phase pipeline first selects clinically relevant slices per condition, then regresses bounding boxes to localize pathology, evaluated against expert-annotated key slices on an RSNA spine-MRI benchmark.
Multimodal retrieval: Image-to-text and text-to-image retrieval agents align case-level MRI embeddings with report embeddings via CLIP-style training using a BiomedBERT text encoder, enabling case lookup and concordant-slice retrieval.
Draft report generation: Attention-pooled image tokens are concatenated with text tokens and passed to a LLaMA-3.1-8B decoder to produce a draft radiology report, integrating visual features with structured semantic signals.
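
The CLIP-style alignment behind the retrieval agents can be illustrated with a symmetric contrastive loss and a Recall@K check. The following is a minimal sketch in plain Python, not SpineAgent's implementation: the function names (`clip_loss`, `recall_at_k`), the temperature of 0.07, and the assumption that image i and text i form the matched pair are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i are assumed to match."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]
    # Image-to-text direction: each row should favor its own column.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should favor its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def recall_at_k(image_embs, text_embs, k=5):
    """Fraction of images whose matched text ranks in the top k by cosine similarity."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    hits = 0
    for i, img in enumerate(imgs):
        sims = [dot(img, t) for t in txts]
        ranked = sorted(range(len(txts)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(imgs)


# Toy case-level embeddings (hypothetical values, three cases).
cases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reports_matched = [list(v) for v in cases]
reports_shuffled = [cases[1], cases[2], cases[0]]
```

In this toy setup the matched pairing yields a much lower loss than the shuffled one, which is the signal the contrastive stage optimizes; Recall@K then measures how often the correct report lands among the top K retrieved candidates.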

Technical Details

SpineAgent pretrains its two ViT encoders with DINOv3 on roughly 4.5 million T1-weighted and 4.5 million T2-weighted slices, respectively, then aligns image and text representations through a CLIP-style stage using a BiomedBERT language encoder. For inference, slice-level embeddings from the sequence-specific encoders (or from the synthesizer for other sequences) are concatenated and aggregated by an attention-pooling projector into a fixed set of patient-level image tokens. Across the 17 classification tasks, SpineAgent improves AUROC by 10.8% over the strongest baselines (with a 13.4% AUPRC gain) when using all available sequences, and continual training of the synthesizer yields an 11.1% AUROC improvement on non-T1/T2 sequences. On retrieval, it achieves a 56.4% relative improvement in Recall@5 over the next-best method on the UW Medicine dataset. Cross-manufacturer evaluation (training on one scanner vendor, testing across all) and cross-cohort evaluation on the external RSNA LumbarDISC cohort both show consistent gains, suggesting robustness to scanner and population shift.
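
The aggregation step described above, in which a variable number of slice embeddings becomes a fixed set of patient-level tokens, can be sketched with a simple attention pool. This is an illustrative simplification rather than the SpineAgent projector: it assumes a single attention head, no key/value projections, concatenation along the slice axis, and learned query vectors that are replaced here by random placeholders. The names `attention_pool` and `patient_tokens` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(slice_embs, queries):
    """Pool any number of slice embeddings into len(queries) tokens.

    Each query attends over all slices with scaled dot-product attention
    and returns the attention-weighted average of the slice embeddings.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    tokens = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, s)) for s in slice_embs]
        weights = softmax(scores)
        token = [sum(w * s[d] for w, s in zip(weights, slice_embs))
                 for d in range(dim)]
        tokens.append(token)
    return tokens


def patient_tokens(t1_slices, t2_slices, queries):
    """Concatenate T1 and T2 slice embeddings (assumed: along the slice axis), then pool."""
    return attention_pool(t1_slices + t2_slices, queries)


# Placeholder data: random vectors stand in for encoder outputs and learned queries.
rng = random.Random(0)
DIM, NUM_TOKENS = 4, 3


def random_vectors(n):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


queries = random_vectors(NUM_TOKENS)
study_a = patient_tokens(random_vectors(5), random_vectors(7), queries)
study_b = patient_tokens(random_vectors(2), [], queries)  # study with no T2 series
```

Because the number of output tokens depends only on the number of queries, studies with different slice counts or missing sequences produce image-token sets of the same shape, which is what lets downstream agents share one input format.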

Applications

SpineAgent is aimed at radiology workflows for spine imaging: it can triage and classify spinal pathology, highlight the slices and regions most relevant to a suspected condition, retrieve similar prior cases or matching reports, and generate a structured draft report to accelerate read times. Its patient-level embeddings also serve as a reusable representation for researchers building downstream spine-MRI models without retraining an encoder from scratch. Because the encoders generalize across manufacturers and to an external cohort, the system is relevant to multi-site clinical and research settings where imaging hardware and protocols vary.

Impact

SpineAgent shows that self-supervised foundation modeling on large routine clinical corpora can unify spine-MRI interpretation tasks that were previously addressed by separate, narrowly trained models, while generalizing across the manufacturer and cohort shifts that often degrade medical imaging models. The training pipeline (DINOv3 encoders, CLIP alignment, synthesizer routing, and the report-generation stack) is released under Apache-2.0 on GitHub. However, the work is described in a June 2026 bioRxiv preprint that has not yet been peer reviewed. The underlying clinical imaging data cannot be shared for privacy reasons, and pretrained weights are not yet publicly available, which currently limits external reproducibility and direct clinical deployment.

Citation

A multi-agent system for spine MRI report generation from multi-sequence imaging
