bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Imaging · Language model

MedBLIP

Shanghai Jiao Tong University

A vision-language model that bootstraps pre-training from frozen image encoders and LLMs for 3D medical image diagnosis and visual question answering, demonstrated on brain MRI.

Released: May 2023

MedBLIP is a vision-language pre-training (VLP) framework for computer-aided diagnosis (CAD) that operates on 3D medical images paired with text, such as electronic health records. Introduced in May 2023 by Qiuhui Chen, Xinyue Hu, Zirui Wang, and Yi Hong at Shanghai Jiao Tong University, it adapts the BLIP-2 "bootstrapping language-image pre-training" recipe—originally built for 2D natural images—to the volumetric data and limited annotation budgets typical of clinical imaging. The work was later published at ACCV 2024.

The central challenge MedBLIP addresses is that the strongest pre-trained image encoders and large language models are 2D- and text-native, while diagnostic scans like brain MRI are 3D. Rather than train a large multimodal model from scratch on scarce labeled medical data, MedBLIP keeps both the image encoder and the language model frozen and learns only a lightweight bridging module. This makes the system inexpensive to train while still producing a model that can answer diagnostic questions and assign disease labels in a zero-shot setting.
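The cost argument above comes down to simple parameter accounting, which the following minimal sketch illustrates. The component names and parameter counts are hypothetical values chosen for illustration; they are not MedBLIP's actual sizes.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """A named model component with a parameter count and a frozen flag."""
    name: str
    num_params: int
    frozen: bool


def trainable_fraction(modules):
    """Return the share of all parameters that receive gradient updates."""
    total = sum(m.num_params for m in modules)
    trainable = sum(m.num_params for m in modules if not m.frozen)
    return trainable / total


# Hypothetical sizes for illustration only; not MedBLIP's real counts.
modules = [
    Module("image_encoder", 1_000_000_000, frozen=True),
    Module("bridging_module", 100_000_000, frozen=False),
    Module("language_model", 3_000_000_000, frozen=True),
]

fraction = trainable_fraction(modules)
print(f"trainable share: {fraction:.1%}")
```

Under these assumed sizes only the bridging module is updated, so the optimizer touches a small fraction of the total parameters while the frozen components are used for forward passes only.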

MedBLIP sits within the wave of medical vision-language foundation models that followed CLIP and BLIP-2, and is notable as one of the early efforts to handle genuinely 3D volumes rather than 2D slices or single radiographs, with a focus on Alzheimer's disease staging from structural brain MRI.

#Key Features

  • MedQFormer bridging module: A query-transformer adapter that maps 3D image volumes into the embedding space of frozen 2D image encoders and language models. It is the only component trained during pre-training.
  • Frozen-backbone efficiency: Both the pre-trained vision encoder and the LLM remain frozen, dramatically reducing trainable parameters and the labeled data needed compared to end-to-end multimodal training.
  • Zero-shot classification: Distinguishes healthy controls, mild cognitive impairment (MCI), and Alzheimer's disease directly from scans plus subject text, without task-specific fine-tuning.
  • Medical visual question answering: Beyond categorical labels, the model answers free-form diagnostic questions about a scan, leveraging the language model's generative capabilities.
  • EHR-aware multimodal input: Combines image volumes with structured or textual patient information, reflecting how clinicians integrate scans with records.
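As a rough illustration of the EHR-aware input and question-answering features listed above, the sketch below turns a structured patient record into a sentence and appends a diagnostic question. The field names (`age`, `sex`, `notes`) and the prompt template are assumptions made for this example, not the format used by the MedBLIP authors.

```python
def describe_subject(record):
    """Turn structured fields into one sentence, skipping missing values."""
    parts = []
    if record.get("age") is not None:
        parts.append(f"age {record['age']}")
    if record.get("sex"):
        parts.append(f"sex {record['sex']}")
    if record.get("notes"):
        parts.append(record["notes"].strip().rstrip("."))
    if not parts:
        return "No subject information is available."
    return "Subject: " + ", ".join(parts) + "."


def build_prompt(record, question):
    """Join the subject text and a diagnostic question into one text prompt."""
    return f"{describe_subject(record)} Question: {question.strip()} Answer:"


# Hypothetical record; in a full pipeline the visual tokens extracted from
# the scan would be supplied to the language model alongside this text.
record = {"age": 74, "sex": "female", "notes": "reports memory complaints."}
prompt = build_prompt(record, "What is the diagnosis?")
print(prompt)
```

Keeping the subject text and the question in one string means the same template can serve both label prediction (a fixed diagnostic question) and free-form question answering (any question).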

#Technical Details

MedBLIP follows the three-part BLIP-2 design: a frozen image encoder, a frozen language model, and a trainable Querying Transformer (here, MedQFormer) that connects them. MedQFormer uses learnable query tokens to extract a fixed-size set of visual features from a 3D volume and project them into the language model's input space, so the LLM can condition its text generation on imaging evidence. The authors pair the framework with off-the-shelf encoders and LLMs rather than bespoke architectures, emphasizing reuse of existing foundation models.
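The learnable-query mechanism can be sketched with a single cross-attention step in plain Python. This is a simplified illustration, not MedQFormer itself: it uses one attention head, treats keys and values as identical, and draws random weights where a real model would learn them. All dimensions and names (`FEATURE_DIM`, `LLM_DIM`, `NUM_QUERIES`, `encode_volume`) are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention: each query pools over all visual features.

    Queries and features share one dimension and keys equal values,
    which is a simplification for illustration.
    """
    scale = math.sqrt(len(queries[0]))
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        pooled.append([
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(len(q))
        ])
    return pooled


def random_matrix(rows, cols, rng):
    """Random stand-in for weights or features (learned or computed in practice)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
FEATURE_DIM, LLM_DIM, NUM_QUERIES = 8, 16, 4  # hypothetical sizes

queries = random_matrix(NUM_QUERIES, FEATURE_DIM, rng)  # learnable query tokens
to_llm = random_matrix(LLM_DIM, FEATURE_DIM, rng)       # learnable projection


def encode_volume(patch_features):
    """Map any number of patch features to NUM_QUERIES vectors of size LLM_DIM."""
    return [matvec(to_llm, v) for v in cross_attend(queries, patch_features)]


small_scan = random_matrix(27, FEATURE_DIM, rng)   # e.g. 3 x 3 x 3 patches
large_scan = random_matrix(125, FEATURE_DIM, rng)  # e.g. 5 x 5 x 5 patches
for scan in (small_scan, large_scan):
    tokens = encode_volume(scan)
    print(len(scan), "patches ->", len(tokens), "tokens of size", len(tokens[0]))
```

Both scans yield four tokens of size 16 regardless of how many patches they contain, which is the property that lets a language model with a fixed input format consume volumes of different sizes.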

The model was pre-trained on more than 30,000 image volumes aggregated from five public Alzheimer's disease cohorts—ADNI, NACC, OASIS, AIBL, and MIRIAD—covering structural brain MRI across the cognitive-decline spectrum. On the three-way healthy / MCI / Alzheimer's classification task, MedBLIP reports state-of-the-art zero-shot performance relative to the baselines evaluated, and it additionally demonstrates qualitative medical VQA. Code is released under the MIT license; the repository builds on the BLIP-2 weights distributed through Salesforce LAVIS.
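One way to score a generative model on the three-way task is to map each free-text answer to a class label and compare it with the reference diagnosis. The sketch below assumes a simple keyword mapping; the patterns, label codes (`HC`, `MCI`, `AD`), and sample answers are invented for illustration and are not taken from the paper or its evaluation code.

```python
import re

# Hypothetical keyword patterns, checked in order so that the more specific
# "mild cognitive impairment" wins before the broader terms.
LABEL_PATTERNS = [
    ("MCI", re.compile(r"\bmci\b|mild cognitive impairment")),
    ("AD", re.compile(r"alzheimer|dementia")),
    ("HC", re.compile(r"healthy|normal|\bcontrol\b")),
]


def answer_to_label(answer):
    """Map a free-text answer to HC, MCI, or AD; return None if nothing matches."""
    text = answer.lower()
    for label, pattern in LABEL_PATTERNS:
        if pattern.search(text):
            return label
    return None


def accuracy(answers, truths):
    """Fraction of answers whose parsed label equals the reference label.

    Answers that cannot be parsed count as errors.
    """
    hits = sum(answer_to_label(a) == t for a, t in zip(answers, truths))
    return hits / len(truths)


# Invented answers standing in for model generations.
answers = [
    "The subject is cognitively normal.",
    "Findings suggest mild cognitive impairment.",
    "Consistent with Alzheimer's disease.",
    "Unclear.",
]
truths = ["HC", "MCI", "AD", "HC"]
print(accuracy(answers, truths))  # prints 0.75
```

Keyword matching is brittle (a negated phrase such as "not normal" would be misread), so a real evaluation would need a stricter answer format or a more careful parser.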

#Applications

MedBLIP targets computer-aided diagnosis workflows where a model must reason jointly over a 3D scan and accompanying clinical text. Its most developed use case is neurodegeneration screening—staging Alzheimer's disease and detecting mild cognitive impairment from brain MRI—useful to radiologists, neurologists, and researchers running large imaging cohorts. The medical VQA capability also makes it a building block for interactive diagnostic assistants and for triage tools that surface candidate findings, while the frozen-backbone design lowers the barrier for groups without the data or compute to train large multimodal models from scratch.

#Impact

MedBLIP helped demonstrate that the frozen-encoder, frozen-LLM bootstrapping paradigm transfers from natural images to volumetric medical imaging, providing a data- and compute-efficient template for 3D medical vision-language models. It is frequently cited in surveys of medical vision-language foundation models and informed subsequent work on adapting general VLP recipes to radiology. Its main limitations are scope and validation: pre-training centers on brain MRI for Alzheimer's cohorts rather than broad anatomy, the released checkpoints depend on external BLIP-2 weights, and zero-shot diagnostic outputs require clinical validation before any deployment.

Citation

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

Preprint

Chen, Q., Hu, X., Wang, Z., and Hong, Y. (2023). MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts. arXiv preprint arXiv:2305.10799. Later published at the Asian Conference on Computer Vision (ACCV), 2024.

DOI: 10.48550/arXiv.2305.10799


Citations

Total Citations: 85
Influential: 6
References: 41

GitHub

Stars: 57
Forks: 7
Open Issues: 4
Contributors: 1
Last Push: 2y ago
Language: Python
License: MIT


Openness

bio.rodeo openness: 35, Closed (low usability and reproducibility)
Usability (can I run it?): 43
Reproducibility (can I retrain it?): 30
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

brain_mri, computer_aided_diagnosis, image_classification, multimodal, radiology, representation_learning, transformer, vision_language_model, visual_question_answering, zero_shot

Resources

GitHub Repository
Research Paper