Tsinghua University / Chinese PLA General Hospital / Beijing Tiantan Hospital
A multimodal vision-text foundation model for brain CT and MRI, pretrained on ~10M paired images and reports to act as a clinical copilot across seven imaging tasks.
Brainfound is a multimodal vision-text foundation model for brain imaging that functions as an interactive clinical copilot across both CT and MRI. Rather than training a separate network for each clinical task, Brainfound learns shared representations spanning brain CT, brain MRI, and the radiology reports that accompany them, then applies that single foundation to a broad spectrum of work—from low-level image enhancement up to high-level report generation and free-form human-AI dialogue. It targets the gap between narrow, single-task medical imaging models and the generalist assistants clinicians increasingly want at the point of care.
The model was introduced in a January 2025 medRxiv preprint, "A Multimodal Vision-text AI Copilot for Brain Disease Diagnosis and Medical Imaging," by Guoxun Zhang, Yuchen Guo, Xin Lou, Qionghai Dai, and colleagues. The work is a collaboration led by Tsinghua University together with the Chinese PLA General Hospital and Beijing Tiantan Hospital (Capital Medical University), pairing a strong computational-imaging group with two major neuroimaging clinical centers. A peer-reviewed version was subsequently published in Cell Reports Medicine.
Brainfound's central design choice is to combine generative image modeling with image-text alignment: a diffusion-based visual module learns to model and enhance brain images, while contrastive learning aligns that visual representation with paired clinical text. This pairing lets one backbone both generate and reason over images and language, which is what enables zero-shot classification and conversational use without task-specific retraining.
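To make the image-text alignment idea concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss of the kind popularized by CLIP. It is an illustration under stated assumptions, not Brainfound's training code: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all hypothetical, and the source does not specify the exact loss formulation the authors used.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch where image i is paired with text i.

    The temperature value is an illustrative assumption, not a reported setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: each image should rank its own report highest.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: each report should rank its own image highest.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch: two image embeddings and two report embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]      # report i matches image i
mismatched_reports = [[0.0, 1.0], [1.0, 0.0]]   # reports swapped

aligned_loss = contrastive_loss(images, aligned_reports)
mismatched_loss = contrastive_loss(images, mismatched_reports)
```

In this toy batch the loss is near zero when each report sits next to its own image and large when the pairs are swapped. Minimizing it pulls matched image and report embeddings together and pushes mismatched ones apart.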
Brainfound was pretrained on a large multimodal corpus—over 3 million brain CT images and over 7 million brain MRI images, each with paired clinical reports (roughly 10 million image-report pairs in total). Its architecture combines a diffusion-based generative visual module with a language module, the two aligned by contrastive learning so that visual features and report text share a common embedding space. The authors report state-of-the-art results across the seven evaluated tasks: in automatic report generation for brain imaging it exceeded the prior leading model by 51.75%, and on brain-imaging multiple-choice questions it outperformed GPT-4V by 47.68%, with diagnostic performance approaching that of expert physicians on the evaluated benchmarks. Three experienced radiologists from three hospitals participated in labeling and evaluation.
Brainfound is aimed at radiologists and clinical-imaging teams who need a single assistant that can read brain CT and MRI, draft structured reports, answer imaging questions, enhance or translate scans between contrasts, and segment lesions. Because it accepts mixed image and text input and supports dialogue, it can slot into reporting workflows as a drafting and question-answering aid, support triage and second-read scenarios, and enable zero-shot classification of conditions not seen during fine-tuning—particularly valuable in neuroimaging settings where annotated data is scarce.
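Zero-shot classification with an image-text aligned model is usually done by comparing an image embedding against embeddings of text prompts, one per candidate label. The sketch below shows that general technique. It assumes embeddings have already been produced by some encoder, and the label names, vectors, and temperature are hypothetical toy values, not Brainfound outputs or its API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs, temperature=0.07):
    """Pick the label whose text embedding is closest to the image embedding.

    label_embs maps a label name to the embedding of its text prompt.
    Returns the best label and a softmax distribution over all labels.
    """
    labels = list(label_embs)
    scores = [cosine(image_emb, label_embs[name]) / temperature for name in labels]

    # Stable softmax over the scaled similarities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}

    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical text-prompt embeddings for three candidate findings.
label_embs = {
    "hemorrhage": [0.9, 0.1, 0.0],
    "infarct": [0.1, 0.9, 0.0],
    "normal": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a single brain scan.
image_emb = [0.8, 0.2, 0.1]

prediction, probabilities = zero_shot_classify(image_emb, label_embs)
```

Because the label set is just a dictionary of prompt embeddings, a new condition can be added by embedding one more prompt, with no retraining. That is the property that makes the approach attractive where annotated data is scarce.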
Brainfound is an early example of a generalist, conversational foundation model purpose-built for a single organ system's imaging. It demonstrates that combining diffusion-based generation with image-text contrastive learning can unify enhancement, segmentation, diagnosis, and report generation in one brain-imaging model. Its reported margins over strong baselines, including GPT-4V on brain-imaging multiple-choice questions, and its progression from medRxiv preprint to publication in Cell Reports Medicine signal growing interest in clinical copilots for radiology. As a clinically oriented model, its real-world value still depends on prospective, multi-site validation. The training corpus and weights are not openly released, which limits independent reproduction.
Zhang, G., et al. (2025) A Multimodal Vision-text AI Copilot for Brain Disease Diagnosis and Medical Imaging. medRxiv.
DOI: 10.1101/2025.01.09.25320293