BiomedGPT

Lehigh University / University of Georgia / Stanford University / Massachusetts General Hospital / University of Pennsylvania / University of Central Florida / UC Santa Cruz / UTHealth Houston / Mayo Clinic / Samsung Research America

Open-source, lightweight generalist vision-language foundation model for diverse biomedical imaging and text tasks.

Released: August 2024

Parameters: 182 Million

BiomedGPT is an open-source, lightweight vision-language foundation model designed to act as a generalist across a wide range of biomedical tasks. Rather than training a separate specialist network for each problem, BiomedGPT unifies medical image understanding and clinical text processing within a single encoder-decoder transformer, allowing one model to handle visual question answering, image captioning, image classification, text understanding, and summarization. It was developed by a multi-institutional team led by Lehigh University, with collaborators at the University of Georgia, Stanford University, Massachusetts General Hospital/Harvard Medical School, the University of Pennsylvania, the University of Central Florida, UC Santa Cruz, UTHealth Houston, Mayo Clinic, and Samsung Research America.

First released as a preprint in May 2023 (arXiv:2305.17100) and published in Nature Medicine in 2024, BiomedGPT addresses a central tension in medical AI: the most capable generalist systems, such as Med-PaLM M, are enormous, proprietary, and impractical for most institutions. BiomedGPT instead demonstrates that a compact, fully transparent model can reach state-of-the-art performance while remaining deployable on modest hardware. Its largest variant has 182 million parameters, roughly 3,000 times smaller than Med-PaLM M, lowering the barrier for under-resourced hospitals and academic labs.
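The size comparison above can be checked with a few lines of arithmetic. The sketch below assumes that Med-PaLM M's largest variant has about 562 billion parameters (a figure not stated on this page) and that weights are stored as plain fp32 or fp16 values; it ignores activations, optimizer state, and framework overhead, so the memory numbers are lower bounds for the weights alone.

```python
# Assumption: Med-PaLM M's largest variant has ~562 billion parameters.
MED_PALM_M_PARAMS = 562e9
# From the text above: BiomedGPT's largest variant has 182 million parameters.
BIOMEDGPT_BASE_PARAMS = 182e6


def size_ratio(large_params: float, small_params: float) -> float:
    """Return how many times larger the first model is than the second."""
    return large_params / small_params


def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Return the raw weight storage in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


ratio = size_ratio(MED_PALM_M_PARAMS, BIOMEDGPT_BASE_PARAMS)
fp32_gb = weight_memory_gb(BIOMEDGPT_BASE_PARAMS, 4)  # 4 bytes per fp32 weight
fp16_gb = weight_memory_gb(BIOMEDGPT_BASE_PARAMS, 2)  # 2 bytes per fp16 weight

print(f"Med-PaLM M / BiomedGPT size ratio: {ratio:.0f}x")
print(f"BiomedGPT weights: {fp32_gb:.3f} GB (fp32), {fp16_gb:.3f} GB (fp16)")
```

Under these assumptions the weights fit in well under 1 GB, which is consistent with the claim that the model can run on modest hardware.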

The model exemplifies the generalist trend in biomedical AI, where a single foundation model is pretrained across many data modalities and then fine-tuned or evaluated across heterogeneous downstream tasks, contrasting with the long-standing paradigm of narrow, single-purpose medical models.

Key Features

Generalist multimodal design: A single model spans radiology, pathology, and clinical text tasks, covering visual question answering, image captioning, classification, natural language inference, and summarization.
Lightweight and open: Three openly released variants (Small ~33M, Medium ~93M, Base ~182M parameters) make the model far cheaper to deploy than proprietary giants, with pretrained weights and code publicly available.
Strong benchmark performance: Achieves state-of-the-art results on 16 of 25 evaluated experiments despite its small scale.
Human-validated outputs: In expert assessments, it reached roughly a 3.8% error rate on visual question answering and 8.3% on radiology report generation, with summarization quality comparable to radiologists.
Unified pretraining objective: Combines masked image modeling, masked language modeling, object detection, image captioning, and image-text matching under one sequence-to-sequence framework.
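As an illustration of how a masked language modeling objective can be expressed in a sequence-to-sequence framework, the sketch below replaces one contiguous span of tokens with a single mask token and uses the original sequence as the decoder target. This is one common formulation, not necessarily the exact corruption scheme used by BiomedGPT; the `<mask>` token name, the whitespace tokenization, and the `mask_span` helper are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token name


def mask_span(tokens, span_len, rng):
    """Replace one contiguous span of `span_len` tokens with a single mask token.

    Returns (corrupted_source, target), where the target is the original
    sequence that a decoder would be trained to reconstruct.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randrange(len(tokens) - span_len + 1)
    corrupted = tokens[:start] + [MASK] + tokens[start + span_len:]
    return corrupted, list(tokens)


rng = random.Random(0)  # fixed seed so the example is repeatable
sentence = "no focal consolidation or pleural effusion is seen".split()
source, target = mask_span(sentence, span_len=3, rng=rng)
print("source:", " ".join(source))
print("target:", " ".join(target))
```

The other objectives listed above fit the same pattern: each supplies a different source sequence and target sequence to the same encoder-decoder model.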

Technical Details

BiomedGPT adapts the OFA (One-For-All) sequence-to-sequence architecture, pairing a BERT-style encoder over corrupted inputs with a GPT-style left-to-right autoregressive decoder, so that images, text, and bounding boxes are all cast into a shared token sequence.

It was pretrained on a diverse biomedical corpus comprising roughly 592,000 images, about 183 million text sentences, 271,000 image-text pairs, and 46,000 object-label pairs, drawn from sources including chest X-rays, pathology slides, clinical notes, and PubMed literature. The model is offered in Small (33M), Medium (93M), and Base (182M) configurations.

Evaluation spanned 25 datasets across five task categories, including PathVQA, VQA-RAD, and SLAKE for visual question answering; IU X-ray, MIMIC-CXR, and PEIR Gross for captioning; MedMNIST and CBIS-DDSM for classification; and MedNLI and MIMIC-III tasks for text understanding and summarization.
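To make the idea of a shared token sequence more concrete, the sketch below shows one way bounding boxes and task instructions could be turned into plain token strings. It is a simplified illustration under stated assumptions, not the project's actual preprocessing code: the 1,000-bin coordinate quantization, the `<bin_N>` token format, the prompt wording, and all function names are assumptions, and the image patches that the real model would also consume are omitted.

```python
from dataclasses import dataclass

NUM_BINS = 1000  # assumption: number of discrete location bins per axis


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate to an integer bin index in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)  # normalize and clamp to [0, 1]
    return int(round(frac * (num_bins - 1)))


def box_to_tokens(box, width, height):
    """Convert an (x0, y0, x1, y1) pixel box into four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return [f"<bin_{b}>" for b in bins]


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the encoder (alongside image patches, omitted here)
    target: str  # text the decoder is trained to generate


def make_vqa_example(question, answer):
    """Visual question answering: the question is the source, the answer the target."""
    return Seq2SeqExample(question.strip().lower(), answer.strip().lower())


def make_detection_example(label, box, width, height):
    """Object detection: the target is location tokens followed by the label."""
    prompt = "what are the objects in the image?"  # illustrative prompt wording
    target = " ".join(box_to_tokens(box, width, height) + [label])
    return Seq2SeqExample(prompt, target)


vqa = make_vqa_example("Is there a pleural effusion?", "No")
det = make_detection_example("nodule", (128, 64, 384, 448), 512, 512)
print(vqa)
print(det)
```

Because every task reduces to a source string and a target string, the same encoder-decoder weights and the same cross-entropy loss can be reused across tasks.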

Applications

BiomedGPT is intended as a flexible backbone for clinical and research workflows that involve both medical imaging and text. Radiologists can use it to draft or summarize reports and answer questions about chest X-rays, pathologists can query histology images, and informatics teams can apply it to tasks such as clinical natural language inference and treatment-suggestion summarization. Because it is small and openly licensed for academic use, it is particularly attractive to under-resourced hospitals and academic groups that cannot run or pay for proprietary biomedical models.

Impact

By showing that a 182M-parameter open model can rival far larger proprietary systems across many biomedical tasks, BiomedGPT became a widely cited reference point for efficient, transparent medical foundation models and helped popularize the generalist approach in clinical AI. Its public weights and code (with follow-on checkpoints scaling up to roughly 930M parameters) have made it a practical starting point for downstream research. Important limitations remain: the released weights inherit non-commercial restrictions from the OFA framework and are intended for academic research, evaluation focuses on benchmark and retrospective data rather than prospective clinical deployment, and like all medical AI it requires careful validation before any patient-facing use.

Citation

A generalist vision–language foundation model for diverse biomedical tasks

Zhang, K., et al. (2024) A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine.

DOI: 10.1038/s41591-024-03185-2

Recent citations

Papers that recently cited this model.

Multimodal large language model (MLLM) benchmark for intelligent construction in underground engineering
Tianhao Li, Xuri Ge, Zhaoyang Wang, et al.
Automation in Construction · Aug 2026
Citations: 0

Addressing benchmarking gaps in large language models for health and medicine with dynamic red-teaming
Jiazhen Pan, Bailiang Jian, Paul Hager, et al.
Nature Health · Jul 2026
Citations: 0

JADE-Plus: A Multimodal Agentic Retrieval-Augmented Generation Large Language Framework for Diagnostic Support in Jawbone Lesions: Development and Technical Validation Study
Soroush Baseri Saadi, J. Ver Berne, R. Fontenele, et al.
Journal of imaging informatics in medicine · Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Towards Generalist Biomedical AI
Tao Tu, Shekoofeh Azizi, Danny Driess, et al.
NEJM AI · Jul 2023
Citations: 507

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
Jindong Gu, Zhen Han, Shuo Chen, et al.
arXiv.org · 2023
Citations: 233

Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, et al.
Nature Communications · Aug 2025
Citations: 228

MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Jiazhen Pan, Che Liu, Junde Wu, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Feb 2025
Citations: 171

Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI
Mahyar Abbasian, Elahe Khatibi, Iman Azimi, et al.
npj Digital Medicine · Sep 2023
Citations: 167

Citations

Total Citations: 404

Influential: 25

References: 77

GitHub

Stars: 708

Forks: 81

Open Issues: 26

Contributors: 4

Last Push: 1y ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 2

Likes: 1

Last Modified: 1y ago

Fields of citing research

Computer Science: 96%
Medicine: 86%
Engineering: 16%
Biology: 7%
Linguistics: 3%
Environmental Science: 2%
Chemistry: 1%
Psychology: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Overall score: 33 (Closed)

Usability (can I run it?): 34

Reproducibility (can I retrain it?): 16

Model Openness Framework

Unclassified

Restrictive license on core components

