Brainfound

Tsinghua University / Chinese PLA General Hospital / Beijing Tiantan Hospital

Multimodal vision-text foundation model for brain CT and MRI, pretrained on roughly 10 million image-report pairs to act as a clinical copilot.

Released: January 2025

Brainfound is a multimodal vision-text foundation model for brain imaging that functions as an interactive clinical copilot across both CT and MRI. Rather than training a separate network for each clinical task, Brainfound learns shared representations spanning brain CT, brain MRI, and the radiology reports that accompany them, then applies that single foundation to a broad spectrum of work—from low-level image enhancement up to high-level report generation and free-form human-AI dialogue. It targets the gap between narrow, single-task medical imaging models and the generalist assistants clinicians increasingly want at the point of care.

The model was introduced in a January 2025 medRxiv preprint, "A Multimodal Vision-text AI Copilot for Brain Disease Diagnosis and Medical Imaging," by Guoxun Zhang, Yuchen Guo, Xin Lou, Qionghai Dai, and colleagues. The work is a collaboration led by Tsinghua University together with the Chinese PLA General Hospital and Beijing Tiantan Hospital (Capital Medical University), pairing a strong computational-imaging group with two major neuroimaging clinical centers. A peer-reviewed version was subsequently published in Cell Reports Medicine.

Brainfound's central design choice is to combine generative image modeling with image-text alignment: a diffusion-based visual module learns to model and enhance brain images, while contrastive learning aligns that visual representation with paired clinical text. This pairing lets one backbone both generate and reason over images and language, which is what enables zero-shot classification and conversational use without task-specific retraining.
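To make the "diffusion-based" half of that design concrete, here is a minimal sketch of the forward (noising) step used by standard denoising diffusion models. It is an illustration only: the source does not specify Brainfound's diffusion formulation, so the linear noise schedule, the 1000-step horizon, and the helper names (`linear_beta_schedule`, `cumulative_alphas`, `noise_image`) are all assumptions, and the five-value "slice" is toy data rather than a real scan.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noise_image(pixels, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal * x + sigma * e for x, e in zip(pixels, noise)]
    return noisy, noise


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

toy_slice = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical normalized intensities
x_early, _ = noise_image(toy_slice, 10, alpha_bars, rng)    # mostly signal
x_late, _ = noise_image(toy_slice, 999, alpha_bars, rng)    # almost pure noise
```

A generative module of this kind is trained to undo the noising, which is why the same machinery can plausibly serve enhancement and cross-modality translation tasks.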

Key Features

Vision-text copilot: Accepts flexible image and text input and produces image or text output, supporting free human-AI conversation about brain scans rather than a single fixed prediction.
CT and MRI in one model: A single foundation handles both brain CT and brain MRI, learning a shared representation across modalities and their paired reports.
Seven downstream tasks: Covers brain disease diagnosis, lesion segmentation, MRI enhancement, cross-modality translation, automatic report generation, zero-shot disease classification, and human-AI dialogue.
Diffusion plus contrastive alignment: Built on a diffusion-based generative framework with image-text contrastive learning aligning the visual and language modules.
Zero-shot capability: Image-text alignment enables zero-shot brain disease classification without additional task-specific labels.
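
The zero-shot feature above can be sketched as a nearest-prompt lookup in a shared embedding space: the scan embedding is compared with the embedding of one text prompt per candidate label, and the most similar prompt wins. This is a generic CLIP-style sketch, not Brainfound's published inference code. The label names, the 3-dimensional toy embeddings, the temperature value, and the function `zero_shot_classify` are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Return {label: probability} from cosine similarity to each text prompt."""
    image = l2_normalize(image_embedding)
    labels = list(prompt_embeddings)
    sims = [
        sum(a * b for a, b in zip(image, l2_normalize(prompt_embeddings[label])))
        for label in labels
    ]
    probs = softmax([s / temperature for s in sims])
    return dict(zip(labels, probs))


# Toy embeddings standing in for the outputs of the image and text encoders.
prompts = {
    "intracranial hemorrhage": [0.9, 0.1, 0.0],
    "ischemic stroke": [0.1, 0.9, 0.1],
    "no acute finding": [0.0, 0.1, 0.9],
}
scan = [0.8, 0.2, 0.1]

probabilities = zero_shot_classify(scan, prompts)
predicted = max(probabilities, key=probabilities.get)
```

No label-specific training is involved here; adding a new candidate condition only requires embedding a new text prompt.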

Technical Details

Brainfound was pretrained on a large multimodal corpus of over 3 million brain CT images and over 7 million brain MRI images, each with a paired clinical report (roughly 10 million image-report pairs in total). Its architecture combines a diffusion-based generative visual module with a language module; contrastive learning aligns the two so that visual features and report text share a common embedding space. The authors report state-of-the-art results across the seven evaluated tasks. In automatic report generation for brain imaging, Brainfound exceeded the prior leading model by 51.75%, and on brain-imaging multiple-choice questions it outperformed GPT-4V by 47.68%. Diagnostic performance approached that of expert physicians on the evaluated benchmarks. Three experienced radiologists from three hospitals participated in labeling and evaluation.
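
One common way to obtain a shared image-text embedding space is a symmetric contrastive (InfoNCE) objective of the kind popularized by CLIP. The sketch below shows that objective on toy vectors. It is an assumption that Brainfound uses this particular form of the loss; the source only states that contrastive learning aligns the visual and language modules. The function name `contrastive_loss`, the temperature, and the 2-dimensional embeddings are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric InfoNCE: pair i (image i, report i) is the only positive."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: row i should put its mass on column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_batch, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(image_batch, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each image sits closest to its own report (`aligned`) and large when the pairing is scrambled (`shuffled`), which is the pressure that pulls matching scans and reports together during pretraining.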

Applications

Brainfound is aimed at radiologists and clinical-imaging teams who need a single assistant that can read brain CT and MRI, draft structured reports, answer imaging questions, enhance or translate scans between contrasts, and segment lesions. Because it accepts mixed image and text input and supports dialogue, it can slot into reporting workflows as a drafting and question-answering aid and support triage and second-read scenarios. It can also classify conditions zero-shot, without additional task-specific labels, which is particularly valuable in neuroimaging settings where annotated data is scarce.

Impact

Brainfound is an early example of a generalist, conversational foundation model purpose-built for a single organ system's imaging. It demonstrates that combining diffusion-based generation with image-text contrastive learning can unify enhancement, segmentation, diagnosis, and report generation in one brain-imaging model. Its reported margins over strong baselines, including GPT-4V on brain-imaging multiple-choice questions, and its progression from medRxiv preprint to publication in Cell Reports Medicine signal growing interest in clinical copilots for radiology. As with any clinically oriented model, its real-world value still depends on prospective, multi-site validation. The training corpus and weights are not openly released, which limits independent reproduction.

Citation

A Multimodal Vision-text AI Copilot for Brain Disease Diagnosis and Medical Imaging

Preprint

Zhang, G., et al. (2025) A Multimodal Vision-text AI Copilot for Brain Disease Diagnosis and Medical Imaging. medRxiv.

DOI: 10.1101/2025.01.09.25320293

Recent citations

Papers that recently cited this model.

NeuroCLIP: Vision-Language Contrastive Learning for Medical Alzheimer’s Diseases Diagnosis from PET and Text Reports
R Saravanakumar
2026 International Conference on Emerging Smart Computing and Informatics (ESCI) · Mar 2026
Citations: 0

Brain Imaging Foundation Models, Are We There Yet? A Systematic Review of Foundation Models for Brain Imaging and Biomedical Research
Salah Ghamizi, G. Kanli, Yu Deng, et al.
arXiv.org · Jun 2025
Citations: 2

Feature Pyramid Network with Dual-Decoder Supervision for Accurate Stroke Lesion Localization in Multi-Modal Brain MRI
Satmyrza Mamikov, Zhansaya Yakhiya, Bauyrzhan Omarov, et al.
International Journal of Advanced Computer Science and Applications · 2025
Citations: 1

Top citations

The most-cited papers that cite this model.

Brain Imaging Foundation Models, Are We There Yet? A Systematic Review of Foundation Models for Brain Imaging and Biomedical Research
Salah Ghamizi, G. Kanli, Yu Deng, et al.
arXiv.org · Jun 2025
Citations: 2

Feature Pyramid Network with Dual-Decoder Supervision for Accurate Stroke Lesion Localization in Multi-Modal Brain MRI
Satmyrza Mamikov, Zhansaya Yakhiya, Bauyrzhan Omarov, et al.
International Journal of Advanced Computer Science and Applications · 2025
Citations: 1

NeuroCLIP: Vision-Language Contrastive Learning for Medical Alzheimer’s Diseases Diagnosis from PET and Text Reports
R Saravanakumar
2026 International Conference on Emerging Smart Computing and Informatics (ESCI) · Mar 2026
Citations: 0

Citations

Total Citations: 4

Influential: 0

References: 0

Fields of citing research

Share of papers citing this model.

Computer Science: 100%
Medicine: 100%
Engineering: 67%

Openness

bio.rodeo openness: Closed · low usability and reproducibility

7 (Closed)

Usability (can I run it?): 9

Reproducibility (can I retrain it?): 0, not reproducible

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
