Med-Flamingo

Stanford University / Harvard Medical School / Hospital Israelita Albert Einstein

Multimodal medical vision-language model for few-shot visual question answering, learning new imaging tasks from in-context examples at inference.

Released: July 2023

Parameters: 9 billion

Med-Flamingo is a multimodal vision-language model designed to answer questions about medical images using only a handful of in-context examples. Released in July 2023 by researchers at Stanford University, Harvard Medical School, and the Hospital Israelita Albert Einstein, it adapts the open-source OpenFlamingo architecture to the clinical domain, where labeled data is scarce and the cost of task-specific fine-tuning is high. Rather than training a dedicated model for each medical task, Med-Flamingo learns from a few image-text demonstrations supplied at inference time, mirroring the way a clinician reasons from a small number of reference cases.

The central problem Med-Flamingo addresses is the gap between general-purpose vision-language models and the specialized, knowledge-intensive nature of medical reasoning. Generic models trained on web imagery struggle with radiographs, histology, and clinical figures, while supervised medical models require large annotated datasets that rarely exist. By continuing pre-training on paired and interleaved image-text data drawn from the published medical literature and textbooks, Med-Flamingo acquires domain knowledge that transfers to new tasks in a few-shot, generative setting.

Med-Flamingo was among the first models to demonstrate multimodal medical few-shot learning, including the ability to generate free-text rationales that explain its answers. This makes it an early reference point in the rapidly growing space of medical multimodal foundation models, alongside contemporaneous systems such as LLaVA-Med and Med-PaLM M, both of which appeared within weeks of it in mid-2023.

Key Features

Few-shot medical VQA: Answers open-ended questions about medical images from only a few in-context demonstrations, without task-specific fine-tuning.
Interleaved image-text reasoning: Built on the Flamingo design, it consumes arbitrarily interleaved sequences of images and text, supporting multi-image and document-style prompts.
Rationale generation: Produces free-text explanations alongside answers, improving interpretability of its clinical reasoning.
Domain-adapted pre-training: Continues pre-training on medical publications and textbooks, injecting specialized knowledge absent from general models.
Open release with evaluation app: Model weights, code, and an interactive evaluation application are publicly available, enabling reproducible benchmarking.
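The few-shot, interleaved prompting described above can be sketched in a few lines. This is a minimal illustration, not the project's own code: the `<image>` and `<|endofchunk|>` markers follow the OpenFlamingo convention, while the `Demo` container, the `build_few_shot_prompt` helper, and the "Question: ... Answer: ..." template are assumptions made for this example.

```python
from dataclasses import dataclass

IMAGE_TOKEN = "<image>"          # marks where an image enters the sequence
END_OF_CHUNK = "<|endofchunk|>"  # closes one image-text demonstration


@dataclass
class Demo:
    """One in-context demonstration (hypothetical container)."""
    image_path: str
    question: str
    answer: str


def build_few_shot_prompt(demos, query_image, query_question):
    """Return (prompt_text, image_paths) for a few-shot VQA query.

    The i-th IMAGE_TOKEN in the text corresponds to image_paths[i].
    The Question/Answer template is an assumption for illustration.
    """
    parts = []
    images = []
    for d in demos:
        parts.append(
            f"{IMAGE_TOKEN}Question: {d.question} Answer: {d.answer}{END_OF_CHUNK}"
        )
        images.append(d.image_path)
    # The query is left open after "Answer:" so the model generates the rest.
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    images.append(query_image)
    return "".join(parts), images
```

Adding or removing demonstrations changes only the prompt, not the model weights, which is what lets the same checkpoint be pointed at a new imaging task without fine-tuning.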

Technical Details

Med-Flamingo is built on OpenFlamingo-9B (V1), which couples a frozen CLIP ViT-L/14 vision encoder with a frozen LLaMA-7B language model via trainable cross-attention layers, for roughly 9 billion parameters in total. The authors continue pre-training this backbone on paired and interleaved medical image-text data extracted from publications and textbooks, which teaches the model the visual and textual conventions of the medical literature. They evaluated the model on several visual question-answering datasets, including a new open-ended dataset of visual USMLE-style problems constructed to stress clinical reasoning across modalities. Because outputs are generative and open-ended, evaluation combined automated metrics with clinician ratings; in those ratings, Med-Flamingo improved generative medical VQA quality by up to roughly 20% over baselines.
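To make the "trainable cross-attention between frozen components" idea concrete, here is a toy sketch of a tanh-gated cross-attention update of the kind described for the original Flamingo design. It is a simplification under stated assumptions: a single head, identity projections, plain Python lists instead of tensors, and function names chosen for this example. Whether Med-Flamingo's layers match it in detail is not confirmed by the text above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_states):
    """Single-head dot-product attention from text tokens to visual tokens.

    Learned projections are omitted (identity) to keep the sketch short.
    """
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in text_states:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in visual_states]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, visual_states))
            for j in range(dim)
        ])
    return out


def gated_cross_attention(text_states, visual_states, alpha):
    """Residual update: x + tanh(alpha) * CrossAttn(x, visual).

    alpha stands in for the trainable gate; the text states would come from
    the frozen language model and the visual states from the frozen encoder.
    """
    gate = math.tanh(alpha)
    attended = cross_attention(text_states, visual_states)
    return [
        [x + gate * a for x, a in zip(x_row, a_row)]
        for x_row, a_row in zip(text_states, attended)
    ]
```

With `alpha = 0.0` the gate is closed and the text states pass through unchanged, so the combined model starts out behaving like the frozen language model; training then opens the gate to let visual information in.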

Applications

Med-Flamingo targets clinical and biomedical research settings where image interpretation must be paired with explanatory reasoning, such as answering questions about radiology, pathology, and other medical figures. Its few-shot design is especially useful for rare conditions or emerging tasks where annotated training data does not exist, allowing researchers to prototype medical VQA systems by supplying a few examples. The accompanying open-source evaluation app lets practitioners probe model behavior interactively and benchmark it against newer systems.
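When benchmarking generative answers with human raters, the per-model comparison usually reduces to averaging scores and reporting a relative gain. The sketch below shows one plausible way to do that. The tuple layout, the function names, and the numeric scale are assumptions for illustration, and the scores are made up; the text does not say how the reported improvement figure was computed.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_model(ratings):
    """Average rater scores per model.

    ratings: iterable of (model_name, problem_id, score) tuples.
    """
    by_model = defaultdict(list)
    for model, _problem, score in ratings:
        by_model[model].append(score)
    return {model: mean(scores) for model, scores in by_model.items()}


def relative_improvement(baseline, candidate):
    """Fractional gain of candidate over baseline (0.25 means 25%)."""
    if baseline == 0:
        raise ValueError("baseline mean rating must be non-zero")
    return (candidate - baseline) / baseline


# Made-up scores for illustration only; not results from the paper.
ratings = [
    ("baseline", "q1", 3), ("baseline", "q2", 5),
    ("candidate", "q1", 4), ("candidate", "q2", 6),
]
means = mean_rating_by_model(ratings)
gain = relative_improvement(means["baseline"], means["candidate"])
```

Keeping the problem identifier in each record makes it straightforward to extend the comparison to paired, per-question analyses later.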

Impact

As one of the first openly released medical multimodal few-shot learners, Med-Flamingo helped establish few-shot generative VQA and rationale generation as practical evaluation targets for clinical AI, and its public weights and code made it a common baseline in subsequent medical vision-language research. Its limitations are characteristic of early models in this space: as a research prototype built on frozen general-purpose backbones, it can hallucinate, is not validated for clinical deployment, and its training corpus of literature figures does not fully represent real-world patient imaging. Nonetheless, it remains an influential reference point for the wave of medical multimodal foundation models that followed.

Citation

Med-Flamingo: a Multimodal Medical Few-shot Learner

Preprint

Moor, M., et al. (2023) Med-Flamingo: a Multimodal Medical Few-shot Learner. ML4H@NeurIPS.

DOI: 10.48550/arXiv.2307.15189

Recent citations

Papers that recently cited this model.

Beyond textual rationales: Anatomy-grounded chain-of-thought for traceable radiology reasoning
Shengzhi Wang, Kai Wu, Jun Yang, et al.
Knowledge-Based Systems · Sep 2026
0
MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models
Hyunjae Kim, Dain Kim, Pan Xiao, et al.
Jul 2026
0
Multimodal AI in healthcare: Review of vision-language foundation models for real-world medical applications.
Taha Razzaq, Murtaza Taj, Asim Iqbal
Journal of Biomedical Informatics · Jul 2026
0

Top citations

The most-cited papers that cite this model.

A survey on multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, et al.
National Science Review · Jun 2023
1.4K
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, et al.
arXiv.org · May 2023
347
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He, Rui Mao, Qika Lin, et al.
Information Fusion · Oct 2023
328 · Influential
HealthBench: Evaluating Large Language Models Towards Improved Human Health
Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, et al.
arXiv.org · May 2025
288
Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook
Rawan AlSaad, Alaa A. Abd-alrazaq, Sabri Boughorbel, et al.
Journal of Medical Internet Research · Apr 2024
263

Citations

Total Citations: 618

Influential: 44

References: 34

GitHub

Stars: 452

Forks: 39

Open Issues: 19

Contributors: 1

Last Push: 2y ago

Language: Python

HuggingFace

Downloads: 0

Likes: 57

Last Modified: 3y ago

Fields of citing research

Computer Science: 80%
Medicine: 70%
Engineering: 10%
Linguistics: 2%
Biology: 2%
Art: 1%
Environmental Science: 1%
Psychology: 0%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Overall: 18 (Closed)

Usability (can I run it?): 30

Reproducibility (can I retrain it?): 6

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository
Research Paper
HuggingFace Model

