A vision-language model that bootstraps pre-training from frozen image encoders and LLMs for 3D medical image diagnosis and visual question answering, demonstrated on brain MRI.
MedBLIP is a vision-language pre-training (VLP) framework for computer-aided diagnosis (CAD) that operates on 3D medical images paired with text, such as electronic health records. Introduced in May 2023 by Qiuhui Chen, Xinyue Hu, Zirui Wang, and Yi Hong at Shanghai Jiao Tong University, it adapts the BLIP-2 "bootstrapping language-image pre-training" recipe—originally built for 2D natural images—to the volumetric data and limited annotation budgets typical of clinical imaging. The work was later published at ACCV 2024.
The central challenge MedBLIP addresses is a mismatch of modalities: the strongest pre-trained image encoders are built for 2D images and large language models for text, while diagnostic scans like brain MRI are 3D. Rather than training a large multimodal model from scratch on scarce labeled medical data, MedBLIP keeps both the image encoder and the language model frozen and learns only a lightweight bridging module. This makes the system inexpensive to train while still producing a model that can answer diagnostic questions and assign disease labels in a zero-shot setting.
MedBLIP sits within the wave of medical vision-language foundation models that followed CLIP and BLIP-2, and is notable as one of the early efforts to handle genuinely 3D volumes rather than 2D slices or single radiographs, with a focus on Alzheimer's disease staging from structural brain MRI.
MedBLIP follows the three-component BLIP-2 design: a frozen image encoder, a frozen language model, and a trainable Querying Transformer (here, MedQFormer) that connects them. MedQFormer uses learnable query tokens to extract a fixed set of visual features from a 3D volume and project them into the language model's input space, so the LLM can condition its text generation on imaging evidence. The authors pair the framework with off-the-shelf encoders and LLMs rather than bespoke architectures, emphasizing reuse of existing foundation models.
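The query-token idea can be illustrated with a minimal, dependency-free sketch. This is not the MedQFormer implementation: it uses a single cross-attention step with no separate key/value projections, no multi-head attention, and no self-attention or feed-forward layers. All names (`cross_attend`, `project`, `queries`, `proj_w`) and the toy sizes are assumptions chosen for illustration. The point it shows is that a fixed number of query tokens yields a fixed number of LLM-ready vectors, however many patch features the frozen encoder produces for a volume.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Each query attends over all visual features; returns one vector per query."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) * scale for f in features])
        pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        outputs.append(pooled)
    return outputs


def project(vectors, weight):
    """Linear map from the visual width to the (hypothetical) LLM embedding width."""
    return [[dot(row, v) for row in weight] for v in vectors]


rng = random.Random(0)
DIM, LLM_DIM, NUM_QUERIES = 8, 16, 4  # toy sizes, not the paper's values


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# In a real model these two would be the only trainable parts of this sketch.
queries = rand_matrix(NUM_QUERIES, DIM)
proj_w = rand_matrix(LLM_DIM, DIM)

# Stand-ins for frozen-encoder outputs: one feature vector per 3D patch.
small_volume = rand_matrix(3 * 3 * 3, DIM)
large_volume = rand_matrix(5 * 5 * 5, DIM)

small_tokens = project(cross_attend(queries, small_volume), proj_w)
large_tokens = project(cross_attend(queries, large_volume), proj_w)

print(len(small_tokens), len(large_tokens), len(small_tokens[0]))
```

Both volumes produce `NUM_QUERIES` vectors of width `LLM_DIM`, so the language model always receives a visual prefix of the same length, and only the queries and the projection would need gradient updates.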
The model was pre-trained on more than 30,000 image volumes aggregated from five public Alzheimer's disease cohorts—ADNI, NACC, OASIS, AIBL, and MIRIAD—covering structural brain MRI across the cognitive-decline spectrum. On the three-way healthy / MCI / Alzheimer's classification task, MedBLIP reports state-of-the-art zero-shot performance relative to the baselines evaluated, and it additionally demonstrates qualitative medical VQA. Code is released under the MIT license; the repository builds on the BLIP-2 weights distributed through Salesforce LAVIS.
MedBLIP targets computer-aided diagnosis workflows where a model must reason jointly over a 3D scan and accompanying clinical text. Its most developed use case is neurodegeneration screening—staging Alzheimer's disease and detecting mild cognitive impairment from brain MRI—useful to radiologists, neurologists, and researchers running large imaging cohorts. The medical VQA capability also makes it a building block for interactive diagnostic assistants and for triage tools that surface candidate findings, while the frozen-backbone design lowers the barrier for groups without the data or compute to train large multimodal models from scratch.
MedBLIP helped demonstrate that the frozen-encoder, frozen-LLM bootstrapping paradigm transfers from natural images to volumetric medical imaging, providing a data- and compute-efficient template for 3D medical vision-language models. It is frequently cited in surveys of medical vision-language foundation models and informed subsequent work on adapting general VLP recipes to radiology. Its main limitations are scope and validation: pre-training centers on brain MRI for Alzheimer's cohorts rather than broad anatomy, the released checkpoints depend on external BLIP-2 weights, and zero-shot diagnostic outputs require clinical validation before any deployment.
Chen, Q., et al. (2023) MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts. Asian Conference on Computer Vision.
DOI: 10.48550/arXiv.2305.10799