bio.rodeo
Pathology, Language model

Med-Flamingo

Stanford University / Harvard Medical School / Hospital Israelita Albert Einstein

A multimodal medical vision-language model that performs few-shot generative visual question answering over medical images and text.

Released: July 2023
Parameters: 9 Billion

Med-Flamingo is a multimodal vision-language model designed to answer questions about medical images using only a handful of in-context examples. Released in July 2023 by researchers at Stanford University, Harvard Medical School, and the Hospital Israelita Albert Einstein, it adapts the open-source OpenFlamingo architecture to the clinical domain, where labeled data is scarce and the cost of task-specific fine-tuning is high. Rather than training a dedicated model for each medical task, Med-Flamingo learns from a few image-text demonstrations supplied at inference time, mirroring the way a clinician reasons from a small number of reference cases.

The central problem Med-Flamingo addresses is the gap between general-purpose vision-language models and the specialized, knowledge-intensive nature of medical reasoning. Generic models trained on web imagery struggle with radiographs, histology, and clinical figures, while supervised medical models require large annotated datasets that rarely exist. By continuing pre-training on paired and interleaved image-text data drawn from the published medical literature and textbooks, Med-Flamingo acquires domain knowledge that transfers to new tasks in a few-shot, generative setting.

Med-Flamingo was among the first models to demonstrate genuine multimodal medical few-shot learning, including the ability to generate free-text rationales that explain its answers. This positions it as an early reference point in the rapidly growing space of medical multimodal foundation models that followed, such as LLaVA-Med and Med-PaLM M.

Key Features

  • Few-shot medical VQA: Answers open-ended questions about medical images from only a few in-context demonstrations, without task-specific fine-tuning.
  • Interleaved image-text reasoning: Built on the Flamingo design, it consumes arbitrarily interleaved sequences of images and text, supporting multi-image and document-style prompts.
  • Rationale generation: Produces free-text explanations alongside answers, improving interpretability of its clinical reasoning.
  • Domain-adapted pre-training: Continues pre-training on medical publications and textbooks, injecting specialized knowledge absent from general models.
  • Open release with evaluation app: Model weights, code, and an interactive evaluation application are publicly available, enabling reproducible benchmarking.
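The few-shot, interleaved prompting described in the list above can be sketched as follows. This is a minimal illustration, not the released Med-Flamingo code: the `<image>` and `<|endofchunk|>` markers follow the OpenFlamingo prompt convention and are assumed to carry over, and the `Shot` class, the `build_few_shot_prompt` and `image_order` helpers, the file names, and the "Question: ... Answer: ..." template are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List

# Special markers in the OpenFlamingo prompt convention (assumed here; check the
# released Med-Flamingo code for the exact strings and template it expects).
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


@dataclass
class Shot:
    """One in-context demonstration: an image with its question and answer."""
    image_path: str
    question: str
    answer: str


def build_few_shot_prompt(shots: List[Shot], query_question: str) -> str:
    """Interleave image placeholders and text into a single prompt string.

    Each demonstration contributes one image placeholder followed by its
    question and answer. The query comes last and leaves the answer open,
    so the model generates it.
    """
    parts = [
        f"{IMAGE_TOKEN}Question: {shot.question} Answer: {shot.answer}{END_OF_CHUNK}"
        for shot in shots
    ]
    parts.append(f"{IMAGE_TOKEN}Question: {query_question} Answer:")
    return "".join(parts)


def image_order(shots: List[Shot], query_image_path: str) -> List[str]:
    """Images must be supplied in the same order as their placeholders."""
    return [shot.image_path for shot in shots] + [query_image_path]


# Hypothetical demonstrations; the file names and text are placeholders.
demo_shots = [
    Shot("case_01.png", "What imaging modality is shown?", "Chest radiograph."),
    Shot("case_02.png", "What imaging modality is shown?", "Abdominal CT."),
]
demo_prompt = build_few_shot_prompt(demo_shots, "What imaging modality is shown?")
demo_images = image_order(demo_shots, "query.png")
```

The prompt string and the ordered image list would then be passed to the model's processor together. Keeping them aligned is the main design constraint: the n-th image placeholder must correspond to the n-th image.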

Technical Details

Med-Flamingo is built on OpenFlamingo-9B (V1), which couples a frozen CLIP ViT-L/14 vision encoder with a frozen LLaMA-7B language model via trainable cross-attention layers, for roughly 9 billion parameters in total. The authors continue pre-training this backbone on paired and interleaved medical image-text data extracted from publications and textbooks, teaching the model the visual and textual conventions of the medical literature. The team evaluated the model on several visual question-answering datasets, including a novel open-ended dataset of visual USMLE-style problems constructed to stress clinical reasoning across modalities. Because outputs are generative and open-ended, evaluation combined automated metrics with ratings from clinicians. In the clinician ratings, Med-Flamingo improved generative medical VQA quality by up to roughly 20% over baselines.
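To make the role of the trainable cross-attention layers more concrete, here is a toy, pure-Python sketch of a gated cross-attention step. It assumes the tanh-gated residual form described for the original Flamingo design, which this page does not spell out. It also omits learned projections, multiple heads, and the feed-forward sublayer, so it illustrates the idea rather than the actual OpenFlamingo or Med-Flamingo implementation. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_state: Vector, visual_tokens: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention of one text state over visual tokens.

    Learned query/key/value projections are replaced by the identity to keep
    the sketch short, so text and visual vectors must share one dimension.
    """
    scale = math.sqrt(len(text_state))
    scores = [
        sum(q * k for q, k in zip(text_state, token)) / scale
        for token in visual_tokens
    ]
    weights = softmax(scores)
    return [
        sum(w * token[i] for w, token in zip(weights, visual_tokens))
        for i in range(len(text_state))
    ]


def gated_cross_attention(text_state: Vector,
                          visual_tokens: List[Vector],
                          gate: float) -> Vector:
    """Residual update: text_state + tanh(gate) * attention(text_state, visual).

    With gate == 0 the visual branch contributes nothing, so the frozen
    language model's hidden state passes through unchanged.
    """
    attended = cross_attend(text_state, visual_tokens)
    g = math.tanh(gate)
    return [x + g * a for x, a in zip(text_state, attended)]


# Toy inputs: one 2-dimensional text state and two visual tokens.
text = [1.0, 0.0]
visual = [[0.0, 1.0], [1.0, 1.0]]
unchanged = gated_cross_attention(text, visual, gate=0.0)  # equals `text`
mixed = gated_cross_attention(text, visual, gate=1.0)      # visual info blended in
```

The gate is what lets new layers be attached to a frozen language model without disturbing it at the start of training. As the gate moves away from zero, visual information is mixed in gradually.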

Applications

Med-Flamingo targets clinical and biomedical research settings where image interpretation must be paired with explanatory reasoning, such as answering questions about radiology, pathology, and other medical figures. Its few-shot design is especially useful for rare conditions or emerging tasks where annotated training data does not exist, allowing researchers to prototype medical VQA systems by supplying a few examples. The accompanying open-source evaluation app lets practitioners probe model behavior interactively and benchmark it against newer systems.

Impact

As one of the first openly released medical multimodal few-shot learners, Med-Flamingo helped establish few-shot generative VQA and rationale generation as practical evaluation targets for clinical AI, and its public weights and code made it a common baseline in subsequent medical vision-language research. Its limitations are characteristic of early models in this space: as a research prototype built on frozen general-purpose backbones, it can hallucinate, is not validated for clinical deployment, and its training corpus of literature figures does not fully represent real-world patient imaging. Nonetheless, it remains an influential reference point for the wave of medical multimodal foundation models that followed.

Citation

Med-Flamingo: a Multimodal Medical Few-shot Learner

Preprint

Moor, M., et al. (2023) Med-Flamingo: a Multimodal Medical Few-shot Learner. ML4H@NeurIPS.

DOI: 10.48550/arXiv.2307.15189

Citations

Total Citations: 566
Influential: 39
References: 34

GitHub

Stars: 450
Forks: 38
Open Issues: 19
Contributors: 1
Last Push: 2y ago
Language: Python

HuggingFace

Downloads: 0
Likes: 57
Last Modified: 2y ago

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Overall score: 18 (Closed)
Usability (can I run it?): 30
Reproducibility (can I retrain it?): 6
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

few_shot, in_context_learning, medical_imaging, multimodal, radiology, rationale_generation, transformer, vision_transformer, visual_question_answering

Resources

GitHub Repository, Research Paper, HuggingFace Model