MedBLIP

Vision-language framework for 3D medical image diagnosis and visual question answering, bridging frozen image encoders and LLMs, shown on brain MRI.

Released: May 2023

MedBLIP is a vision-language pre-training (VLP) framework for computer-aided diagnosis (CAD) that operates on 3D medical images paired with text, such as electronic health records. Introduced in May 2023 by Qiuhui Chen, Xinyue Hu, Zirui Wang, and Yi Hong at Shanghai Jiao Tong University, it adapts the BLIP-2 "bootstrapping language-image pre-training" recipe—originally built for 2D natural images—to the volumetric data and limited annotation budgets typical of clinical imaging. The work was later published at ACCV 2024.

The central challenge MedBLIP addresses is that the strongest pre-trained image encoders and large language models are 2D- and text-native, while diagnostic scans like brain MRI are 3D. Rather than train a large multimodal model from scratch on scarce labeled medical data, MedBLIP keeps both the image encoder and the language model frozen and learns only a lightweight bridging module. This makes the system inexpensive to train while still producing a model that can answer diagnostic questions and assign disease labels in a zero-shot setting.

MedBLIP sits within the wave of medical vision-language foundation models that followed CLIP and BLIP-2, and is notable as one of the early efforts to handle genuinely 3D volumes rather than 2D slices or single radiographs, with a focus on Alzheimer's disease staging from structural brain MRI.

Key Features

MedQFormer bridging module: A query-transformer adapter that maps 3D image volumes into the embedding space of frozen 2D image encoders and language models, the only component trained during pre-training.
Frozen-backbone efficiency: Both the pre-trained vision encoder and the LLM remain frozen, dramatically reducing trainable parameters and the labeled data needed compared to end-to-end multimodal training.
Zero-shot classification: Distinguishes healthy controls, mild cognitive impairment (MCI), and Alzheimer's disease directly from scans plus subject text, without task-specific fine-tuning.
Medical visual question answering: Beyond categorical labels, the model answers free-form diagnostic questions about a scan, leveraging the language model's generative capabilities.
EHR-aware multimodal input: Combines image volumes with structured or textual patient information, reflecting how clinicians integrate scans with records.
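The zero-shot classification and EHR-aware input described above can be pictured as a multiple-choice question posed to the language model alongside the subject text. The following is a minimal sketch of that idea, not the authors' code: the prompt wording, the option letters, the subject fields (`age`, `sex`) and the helper names `build_prompt` and `parse_answer` are all hypothetical, and the image features that the real model would supply are left out.

```python
# Hypothetical label set and prompt template; MedBLIP's actual wording may differ.
LABELS = {
    "a": "healthy control",
    "b": "mild cognitive impairment",
    "c": "Alzheimer's disease",
}

def build_prompt(subject_info):
    """Format subject text and a multiple-choice diagnostic question as one prompt."""
    facts = ", ".join(f"{key}: {value}" for key, value in subject_info.items())
    options = " ".join(f"({key}) {name}" for key, name in LABELS.items())
    return (
        f"Subject information: {facts}. "
        f"Question: What is the diagnosis of this subject? "
        f"Options: {options}. Answer:"
    )

def parse_answer(generated):
    """Map free-form generated text onto one of the three labels, or None."""
    text = generated.strip().lower()
    for key, name in LABELS.items():
        # Accept either the option letter at the start or the label name anywhere.
        if text.startswith(f"({key})") or name.lower() in text:
            return name
    return None

# Illustrative subject record; the field names are assumptions.
prompt = build_prompt({"age": 74, "sex": "F"})
print(prompt)
print(parse_answer("(b) mild cognitive impairment"))
```

The parsing step is deliberately naive (a substring match), so a production system would need stricter handling of negations and ambiguous answers.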

Technical Details

MedBLIP follows the BLIP-2 design, which has three components: a frozen image encoder, a frozen language model, and a trainable Querying Transformer (here, MedQFormer) that connects them. MedQFormer uses learnable query tokens to extract a fixed set of visual features from a 3D volume and project them into the language model's input space, so the LLM can condition its text generation on imaging evidence. The authors pair the framework with off-the-shelf encoders and LLMs rather than bespoke architectures, emphasizing reuse of existing foundation models.
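To make the role of the learnable queries concrete, here is a rough NumPy sketch of query-based cross-attention over 3D patch tokens. It is a simplification under stated assumptions rather than the MedQFormer implementation: it uses a single attention head with no layer normalization or feed-forward layers, omits the frozen image encoder and the LLM entirely, and fills the weights with random stand-ins. The names `patchify_3d`, `query_cross_attention` and `encode_volume`, and all sizes, are illustrative.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches, one per row."""
    d, h, w = volume.shape
    p = patch
    assert d % p == 0 and h % p == 0 and w % p == 0, "volume must divide into patches"
    blocks = volume.reshape(d // p, p, h // p, p, w // p, p)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # move the three patch axes last
    return blocks.reshape(-1, p ** 3)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, features):
    """Single-head cross-attention: each query takes a weighted average of features."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ features.T / scale, axis=-1)  # (num_queries, num_patches)
    return weights @ features                                  # (num_queries, dim)

rng = np.random.default_rng(0)
PATCH, DIM, NUM_QUERIES, LLM_DIM = 8, 64, 32, 128  # illustrative sizes only

# Random stand-ins for learned weights; in training only these would be updated.
patch_embed = rng.normal(scale=0.02, size=(PATCH ** 3, DIM))
queries = rng.normal(scale=0.02, size=(NUM_QUERIES, DIM))
to_llm = rng.normal(scale=0.02, size=(DIM, LLM_DIM))

def encode_volume(volume):
    tokens = patchify_3d(volume, PATCH) @ patch_embed  # (num_patches, DIM)
    pooled = query_cross_attention(queries, tokens)     # (NUM_QUERIES, DIM)
    return pooled @ to_llm                               # (NUM_QUERIES, LLM_DIM)

small = encode_volume(rng.normal(size=(32, 32, 32)))  # 64 patches
large = encode_volume(rng.normal(size=(64, 64, 64)))  # 512 patches
print(small.shape, large.shape)
```

Both volumes yield the same number of output tokens even though the larger one contains eight times as many patches. That fixed-size output is what lets a language model with a fixed input interface consume scans of varying resolution.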

The model was pre-trained on more than 30,000 image volumes aggregated from five public Alzheimer's disease cohorts—ADNI, NACC, OASIS, AIBL, and MIRIAD—covering structural brain MRI across the cognitive-decline spectrum. On the three-way healthy / MCI / Alzheimer's classification task, MedBLIP reports state-of-the-art zero-shot performance relative to the baselines evaluated, and it additionally demonstrates qualitative medical VQA. Code is released under the MIT license; the repository builds on the BLIP-2 weights distributed through Salesforce LAVIS.

Applications

MedBLIP targets computer-aided diagnosis workflows where a model must reason jointly over a 3D scan and accompanying clinical text. Its most developed use case is neurodegeneration screening—staging Alzheimer's disease and detecting mild cognitive impairment from brain MRI—useful to radiologists, neurologists, and researchers running large imaging cohorts. The medical VQA capability also makes it a building block for interactive diagnostic assistants and for triage tools that surface candidate findings, while the frozen-backbone design lowers the barrier for groups without the data or compute to train large multimodal models from scratch.

Impact

MedBLIP helped demonstrate that the frozen-encoder, frozen-LLM bootstrapping paradigm transfers from natural images to volumetric medical imaging, providing a data- and compute-efficient template for 3D medical vision-language models. It is cited in several broad surveys of medical foundation and multimodal models, including those listed under Top citations, and is referenced by later work adapting general VLP recipes to radiology. Its main limitations are scope and validation: pre-training centers on brain MRI for Alzheimer's cohorts rather than broad anatomy, the released checkpoints depend on external BLIP-2 weights, and zero-shot diagnostic outputs require clinical validation before any deployment.

Citation

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

Preprint

Chen, Q., et al. (2023) MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts. Asian Conference on Computer Vision.

DOI: 10.48550/arXiv.2305.10799

Recent citations

Papers that recently cited this model.

Do Medical Vision Language Models Actually See? A Counterfactual Grounding Framework and Hard-Negative Contrastive Training for Visually-Reliant Medical VLMs
Anas Zafar, L. Murali, Siddhant Bharadwaj, et al.
Jul 2026 · 0 citations

Gen-mentor: A human-in-the-loop instructional framework for dental radiography using generative AI
Yiyun Dong, Chuanyang Peng, Yichen Wu, et al.
Computers and Education: Artificial Intelligence · Jul 2026 · 0 citations

BrainMIND-LLM: An AI-Powered Multimodal Large Language Model for Neurodegenerative Disease Analysis and Automated Clinical Report Generation
M. Chaieb, Nour Ben Ameur, Haythem Ghazouani, et al.
Expert Systems with Applications · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Towards Generalist Foundation Model for Radiology
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, et al.
arXiv.org · Aug 2023 · 228 citations

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad, Reza Azad, Sania Eskandari, et al.
arXiv.org · Oct 2023 · 125 citations

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, X. Liu, et al.
Information Fusion · May 2024 · 116 citations

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin, Yifan Zhu, Xin Mei, et al.
Information Fusion · Aug 2024 · 85 citations

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare
Junling Liu, Ziming Wang, Qichen Ye, et al.
arXiv.org · Oct 2023 · 81 citations

Citations

Total Citations: 89
Influential: 6
References: 41

GitHub

Stars: 57
Forks: 7
Open Issues: 4
Contributors: 1
Last Push: 2y ago
Language: Python
License: MIT

Fields of citing research

Computer Science: 99%
Medicine: 92%
Engineering: 18%
Biology: 1%
Environmental Science: 1%
Physics: 1%
Linguistics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 35 (Closed)
Usability (can I run it?): 43
Reproducibility (can I retrain it?): 30

Model Openness Framework: Unclassified
Restrictive license on core components

Resources

GitHub Repository
Research Paper

