ShanghaiTech University / Hainan University / United Imaging Intelligence
A text-guided generalist model for universal MRI synthesis that generates customized brain MR sequences and resolutions from routine scans, using imaging metadata as text prompts.
TUMSyn (Text-guided Universal MR image Synthesis) is a generalist model for cross-sequence brain MRI synthesis. Clinical MRI protocols routinely acquire only a subset of the modalities that downstream diagnosis or analysis might require, and re-scanning patients to obtain missing contrasts (for example, FLAIR, T2-weighted, or a higher spatial resolution) is costly, slow, and sometimes infeasible. TUMSyn addresses this by generating MR images with on-demand imaging characteristics from already-acquired scans, steered by natural-language descriptions of the desired output.
The model was developed by researchers at ShanghaiTech University, Hainan University, and United Imaging Intelligence, led by Dinggang Shen's medical imaging AI group, with an initial preprint in September 2024 and the peer-reviewed version published in Cell Reports Medicine in 2025. Its central novelty is using imaging metadata (modality, repetition time, echo time, voxel spacing, magnetic field strength, and similar acquisition parameters) as a text prompt, allowing a single unified model to flexibly target a continuous space of contrasts and resolutions rather than being locked to a fixed set of source-to-target sequence pairs.
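To make the idea of a metadata prompt concrete, the following is a minimal sketch of how acquisition parameters might be serialized into a text string. The function name, the field wording, and the example values are hypothetical and chosen for illustration only; the exact prompt template used by TUMSyn is not given here.

```python
def build_metadata_prompt(modality, tr_ms=None, te_ms=None,
                          voxel_mm=None, field_strength_t=None):
    """Serialize acquisition metadata into one text prompt.

    Hypothetical template: the field names and ordering are assumptions,
    not the template used by the TUMSyn authors.
    """
    parts = [f"Modality: {modality}"]
    if tr_ms is not None:
        parts.append(f"repetition time: {tr_ms:g} ms")
    if te_ms is not None:
        parts.append(f"echo time: {te_ms:g} ms")
    if voxel_mm is not None:
        spacing = " x ".join(f"{v:g}" for v in voxel_mm)
        parts.append(f"voxel spacing: {spacing} mm")
    if field_strength_t is not None:
        parts.append(f"magnetic field strength: {field_strength_t:g} T")
    return ", ".join(parts)


# Illustrative values only; they do not come from any specific protocol.
prompt = build_metadata_prompt(
    "FLAIR", tr_ms=9000, te_ms=120,
    voxel_mm=(1.0, 1.0, 1.0), field_strength_t=3.0,
)
print(prompt)
```

Because the numeric fields are written as free text rather than chosen from a fixed label set, a prompt of this kind can describe parameter combinations that fall between or outside those seen in training, which is what makes a continuous target space possible.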
By framing synthesis as text-conditioned generation across 7 structural MRI modalities and 13 acquisition centers, TUMSyn positions itself as a foundation-style tool for medical image translation, applicable in both supervised settings and zero-shot scenarios where the requested acquisition parameters were never seen during training.
TUMSyn uses a two-stage design totaling roughly 114 million parameters. Stage one pre-trains an MRI-specific text encoder via CLIP-style contrastive learning: a ViT-B/16-based image encoder (adapted for MR images) and a Transformer text encoder with byte-pair-encoding tokenization are aligned so that imaging-metadata prompts map into a shared embedding space with image features. In stage two, the frozen text encoder produces prompt features that condition image synthesis. A 24-layer ResNet-style CNN encoder (without downsampling) extracts image features, a multi-head cross-attention module fuses text and image representations, and a LIIF decoder reconstructs the target image at arbitrary resolution. Training used a brain MR database of 31,407 3D images covering 7 structural modalities from 13 centers. On held-out benchmarks the model reports strong fidelity, including 28.79 dB PSNR / 0.967 SSIM on T1w→FLAIR (ABCD) and 26.85 dB PSNR / 0.960 SSIM on FLAIR→T1w (ADNI-2), outperforming baselines such as SC-GAN by up to 2.86 dB PSNR.
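The stage-one alignment can be illustrated with a generic CLIP-style symmetric contrastive loss. This is a minimal pure-Python sketch under stated assumptions, not the authors' training code: the function names are hypothetical, the temperature of 0.07 is borrowed from the original CLIP initialization rather than from TUMSyn, and embeddings are assumed to be non-zero vectors where row i of each list forms a matched image/prompt pair.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits.

    temperature=0.07 is an assumption (CLIP's initial value).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity of image i and prompt j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each image should pick its own prompt (row-wise).
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each prompt should pick its own image (column-wise).
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy 2-D embeddings: matched pairs versus deliberately swapped pairs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing a loss of this form pulls each image embedding toward the embedding of its own metadata prompt and away from the other prompts in the batch, which is the property the frozen text encoder relies on in stage two.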
TUMSyn supports clinical and research imaging workflows where missing or low-resolution MR sequences would otherwise limit analysis. It can impute absent contrasts to complete multimodal protocols, super-resolve low-resolution acquisitions, and harmonize images across scanners and centers. The authors demonstrate clinical utility for brain disease screening in both supervised and zero-shot settings, and synthesized sequences can feed downstream segmentation, registration, and diagnostic pipelines, benefiting radiologists, neuroimaging researchers, and developers of medical image analysis tools.
By turning cross-sequence MRI synthesis into a single text-controllable model, TUMSyn moves medical image translation toward the prompt-driven, generalist paradigm increasingly common in foundation models. Pre-trained weights are publicly released via Zenodo alongside code, lowering the barrier for other groups to apply or extend the approach. The work illustrates how acquisition metadata can serve as a flexible conditioning signal for generative imaging models, and its publication in the peer-reviewed journal Cell Reports Medicine lends support to the utility of synthetic MRI for diagnostic tasks. As with all synthetic medical imaging, generated sequences carry the risk of hallucinated or over-smoothed pathology and require careful validation before any clinical use.
Wang, Y., et al. (2025) Toward general text-guided multimodal brain MRI synthesis for diagnosis and medical image analysis. Cell Reports Medicine.
DOI: 10.1016/j.xcrm.2025.102182