Multimodal biomedical foundation model trained on 15M PubMed Central figure-caption pairs via contrastive learning, achieving state-of-the-art zero-shot performance across imaging modalities.
BiomedCLIP is a multimodal biomedical foundation model developed by Microsoft Research that learns joint representations of biomedical images and text through contrastive learning. It was pretrained on PMC-15M, a curated dataset of 15.28 million figure-caption pairs extracted from 4.4 million open-access PubMed Central articles — a scale roughly two orders of magnitude larger than prior biomedical image-text datasets such as MIMIC-CXR. The model was first released as a preprint in March 2023 and subsequently published in NEJM AI in 2024.
The central challenge BiomedCLIP addresses is the fragmentation of biomedical imaging AI: historically, models specialized in a single modality (chest X-rays, histopathology slides, microscopy images) and required domain-specific fine-tuning. By training at scale across approximately 30 biomedical image subcategories, BiomedCLIP learns visual representations that transfer across radiology, digital pathology, microscopy, and other modalities without additional adaptation. At the time of publication, BiomedCLIP achieved state-of-the-art zero-shot accuracy on a broad suite of classification, retrieval, and visual question answering (VQA) benchmarks, including outperforming radiology-specialized models on radiology-specific tasks.
The pretrained weights are publicly available on the Hugging Face Hub as microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224, enabling downstream research and fine-tuning for new biomedical imaging tasks.

BiomedCLIP follows the dual-encoder contrastive learning framework of CLIP, adapted for the biomedical domain. The image encoder is a ViT-B/16 Vision Transformer initialized from ImageNet-pretrained weights, processing images at 224x224 pixel resolution as 196 patch tokens plus one [CLS] token. The text encoder is PubMedBERT, a BERT-based model pretrained on PubMed abstracts and full-text biomedical articles. Contrastive training uses an InfoNCE loss with a learned temperature parameter to maximize cosine similarity between matched image-text pairs while pushing apart mismatched pairs within each batch. Training ran for 32 epochs with a batch size of 4,000, a cosine learning rate schedule, and a 2,000-step linear warmup.
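To make the training objective concrete, the sketch below is a minimal PyTorch implementation of the symmetric InfoNCE loss over a batch of matched pairs. The function name and argument conventions are illustrative, not taken from the BiomedCLIP codebase.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) L2-normalized embeddings.
    logit_scale: the exponentiated learned temperature (a scalar), as in CLIP.
    """
    # Cosine similarity between every image and every caption in the batch.
    logits_per_image = logit_scale * image_emb @ text_emb.t()  # (batch, batch)
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th caption; all other pairings are negatives.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_img = F.cross_entropy(logits_per_image, targets)
    loss_txt = F.cross_entropy(logits_per_text, targets)
    return (loss_img + loss_txt) / 2
```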
On zero-shot image classification benchmarks, BiomedCLIP achieves 78.95% accuracy on RSNA Pneumonia Detection (surpassing the radiology-specialized BioViL model) and 73.41% accuracy on PCam patch-level cancer detection. On cross-modal retrieval from a held-out PMC-15M validation set, image-to-text Recall@1 reaches 82.90% compared to approximately 11% for general-purpose CLIP, illustrating how domain-specific pretraining closes the biomedical domain gap. On the SLAKE VQA benchmark, BiomedCLIP reaches accuracy comparable to Med-PaLM M despite containing far fewer parameters.
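Zero-shot classification with the released checkpoint follows the standard open_clip recipe shown on the Hugging Face model card: candidate labels are phrased as captions and scored by cosine similarity against the image embedding. In the sketch below, the prompt wording and the example file name are illustrative.

```python
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the released checkpoint through open_clip (hub path as published by Microsoft).
hub_id = 'hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224'
model, preprocess = create_model_from_pretrained(hub_id)
tokenizer = get_tokenizer(hub_id)
model.eval()

# Candidate labels phrased as captions; prompts and file name are illustrative.
labels = ['chest X-ray showing pneumonia', 'normal chest X-ray']
texts = tokenizer(labels)
image = preprocess(Image.open('example_cxr.png')).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Softmax over label similarities gives zero-shot class probabilities.
    probs = (model.logit_scale.exp() * image_feat @ text_feat.t()).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f'{label}: {p:.3f}')
```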
BiomedCLIP is well-suited for researchers and clinicians working across diverse biomedical imaging workflows. Its zero-shot classification capability supports rapid image triage and dataset exploration without labeled training data. Cross-modal retrieval enables text-based search of large figure collections and radiology PACS archives (see the retrieval sketch below), as well as literature mining by image similarity. The model's dense embeddings can also serve as initialization for supervised fine-tuning in low-data regimes, such as rare disease classification or adaptation to novel imaging modalities. VQA capabilities support clinical education tools and decision-support prototypes when paired with additional decoder components.
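As an illustration of the retrieval use case, the sketch below builds an in-memory index of normalized image embeddings and ranks it against a free-text query, reusing the model, preprocess, and tokenizer objects from the zero-shot example above. The helper names and the brute-force index are assumptions for illustration, not part of the released tooling.

```python
import torch
from PIL import Image

def embed_archive(model, preprocess, image_paths, device='cpu'):
    """Precompute L2-normalized image embeddings for a figure archive (illustrative helper)."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path)).unsqueeze(0).to(device)
            f = model.encode_image(img)
            feats.append(f / f.norm(dim=-1, keepdim=True))
    return torch.cat(feats, dim=0)  # (num_images, dim)

def search_by_text(model, tokenizer, query, archive_feats, image_paths, top_k=5, device='cpu'):
    """Rank archive images by cosine similarity to a free-text query."""
    with torch.no_grad():
        t = model.encode_text(tokenizer([query]).to(device))
        t = t / t.norm(dim=-1, keepdim=True)
    scores = archive_feats @ t.squeeze(0)  # (num_images,)
    top = scores.topk(min(top_k, len(image_paths)))
    return [(image_paths[i], top.values[j].item())
            for j, i in enumerate(top.indices.tolist())]

# Example usage with the model/preprocess/tokenizer loaded above (paths are illustrative):
# paths = ['fig1.png', 'fig2.png', 'fig3.png']
# archive = embed_archive(model, preprocess, paths)
# hits = search_by_text(model, tokenizer, 'hematoxylin and eosin stained tumor tissue', archive, paths)
```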
BiomedCLIP established contrastive pretraining at PMC scale as a viable path to general-purpose biomedical vision-language models, and its PMC-15M dataset has become a reference benchmark for subsequent multimodal biomedical AI work. The publicly released weights have facilitated downstream research across pathology, radiology, and microscopy communities. Key limitations include a fixed 224x224 input resolution that constrains applicability to tasks requiring high spatial detail, English-only text encoding due to the PubMedBERT backbone, and a training corpus skewed toward academic figure types (including charts and diagrams) rather than purely clinical imaging workflows. As a discriminative model, BiomedCLIP does not natively generate free-text responses and requires additional decoder components for open-ended generation tasks.