Stanford University / Harvard Medical School / Hospital Israelita Albert Einstein
A multimodal medical vision-language model that performs few-shot generative visual question answering over medical images and text.
Med-Flamingo is a multimodal vision-language model designed to answer questions about medical images using only a handful of in-context examples. Released in July 2023 by researchers at Stanford University, Harvard Medical School, and the Hospital Israelita Albert Einstein, it adapts the open-source OpenFlamingo architecture to the clinical domain, where labeled data is scarce and the cost of task-specific fine-tuning is high. Rather than training a dedicated model for each medical task, Med-Flamingo learns from a few image-text demonstrations supplied at inference time, mirroring the way a clinician reasons from a small number of reference cases.
The central problem Med-Flamingo addresses is the gap between general-purpose vision-language models and the specialized, knowledge-intensive nature of medical reasoning. Generic models trained on web imagery struggle with radiographs, histology, and clinical figures, while supervised medical models require large annotated datasets that rarely exist. By continuing pre-training on paired and interleaved image-text data drawn from the published medical literature and textbooks, Med-Flamingo acquires domain knowledge that transfers to new tasks in a few-shot, generative setting.
Med-Flamingo was among the first models to demonstrate multimodal few-shot learning in the medical domain, including the ability to generate free-text rationales that explain its answers. This makes it an early reference point among medical multimodal foundation models, alongside contemporaries such as LLaVA-Med and Med-PaLM M.
Med-Flamingo is built on OpenFlamingo-9B (V1), which couples a frozen CLIP ViT-L/14 vision encoder with a frozen LLaMA-7B language model via trainable cross-attention layers, for roughly 9 billion parameters in total. The authors continue pre-training this backbone on paired and interleaved medical image-text data extracted from publications and textbooks, teaching the model the visual and textual conventions of the medical literature. The team evaluated the model on several visual question-answering datasets, including VQA-RAD, PathVQA, and Visual USMLE, a novel open-ended dataset of USMLE-style problems constructed to stress clinical reasoning across modalities. Because outputs are generative and open-ended, evaluation combined automated metrics with clinician ratings; in the clinician ratings, Med-Flamingo improved generative medical VQA quality by up to roughly 20% over baselines.
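The role of the trainable cross-attention layers can be illustrated with a small, dependency-free sketch. It assumes the tanh-gated formulation used in the Flamingo family of models, which this entry does not spell out, and it omits the learned query, key, and value projections for brevity. The function name `gated_cross_attention` and the scalar gate `alpha` are illustrative, not identifiers from the Med-Flamingo code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(text_states, visual_tokens, alpha):
    """Let each text state attend over the visual tokens, then add the
    result back through a tanh gate (assumed Flamingo-style design).

    With alpha = 0 the gate is closed and the frozen language model's
    states pass through unchanged; training alpha opens the gate.
    Learned Q/K/V projections are omitted in this sketch.
    """
    gate = math.tanh(alpha)
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in text_states:
        weights = softmax([dot(query, key) * scale for key in visual_tokens])
        attended = [
            sum(w * token[i] for w, token in zip(weights, visual_tokens))
            for i in range(dim)
        ]
        outputs.append([x + gate * a for x, a in zip(query, attended)])
    return outputs
```

Under this assumption, a gate value of zero leaves the language model's behavior untouched, which is one way to see how visual conditioning can be added while both backbones stay frozen.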
Med-Flamingo targets clinical and biomedical research settings where image interpretation must be paired with explanatory reasoning, such as answering questions about radiology, pathology, and other medical figures. Its few-shot design is especially useful for rare conditions or emerging tasks where annotated training data does not exist, allowing researchers to prototype medical VQA systems by supplying a few examples. The accompanying open-source evaluation app lets practitioners probe model behavior interactively and benchmark it against newer systems.
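As a rough illustration of what supplying a few examples looks like, the sketch below assembles an interleaved few-shot prompt. It assumes the OpenFlamingo-style special tokens `<image>` and `<|endofchunk|>` and a "Question: ... Answer: ..." template; the `Shot` type, the helper names, and the file paths are hypothetical, and the exact template expected by a given checkpoint should be checked against its released code.

```python
from typing import List, NamedTuple

IMAGE_TOKEN = "<image>"          # assumed marker for where an image is attended to
END_OF_CHUNK = "<|endofchunk|>"  # assumed separator that closes one demonstration


class Shot(NamedTuple):
    """One in-context demonstration (hypothetical container)."""
    image_path: str
    question: str
    answer: str


def build_few_shot_prompt(shots: List[Shot], query_question: str) -> str:
    """Interleave demonstrations, then leave the final answer open."""
    parts = [
        f"{IMAGE_TOKEN}Question: {s.question} Answer: {s.answer}{END_OF_CHUNK}"
        for s in shots
    ]
    # The query reuses the same template but stops at "Answer:" so the
    # model generates the completion.
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    return "".join(parts)


def image_order(shots: List[Shot], query_image_path: str) -> List[str]:
    """Images must be passed in the same order as their <image> markers."""
    return [s.image_path for s in shots] + [query_image_path]


if __name__ == "__main__":
    demos = [
        Shot("demo_xray.png", "What imaging modality is shown?", "Chest radiograph."),
        Shot("demo_slide.png", "What stain is used?", "Hematoxylin and eosin."),
    ]
    print(build_few_shot_prompt(demos, "What is the most likely finding?"))
    print(image_order(demos, "query.png"))
```

The prompt text and the image list are kept separate because the text only carries placeholders; the images themselves would be preprocessed and passed to the model alongside the tokenized prompt.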
As one of the first openly released medical multimodal few-shot learners, Med-Flamingo helped establish few-shot generative VQA and rationale generation as practical evaluation targets for clinical AI. Its public weights and code made it a common baseline in subsequent medical vision-language research. Its limitations are characteristic of early models in this space: it is a research prototype built on frozen general-purpose backbones, it can hallucinate, it is not validated for clinical deployment, and its training corpus of literature figures does not fully represent real-world patient imaging. Nonetheless, it remains an influential early example of a medical multimodal foundation model.
Moor, M., et al. (2023) Med-Flamingo: a Multimodal Medical Few-shot Learner. ML4H@NeurIPS.
DOI: 10.48550/arXiv.2307.15189