bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Imaging

BiomedCLIP

Microsoft Research

Multimodal biomedical foundation model trained on 15M PubMed Central figure-caption pairs via contrastive learning, achieving state-of-the-art zero-shot performance across imaging modalities.

Released: 2023

Overview

BiomedCLIP is a multimodal biomedical foundation model developed by Microsoft Research that learns joint representations of biomedical images and text through contrastive learning. It was pretrained on PMC-15M, a curated dataset of 15.28 million figure-caption pairs extracted from 4.4 million open-access PubMed Central articles — a scale roughly two orders of magnitude larger than prior biomedical image-text datasets such as MIMIC-CXR. The model was first released as a preprint in March 2023 and subsequently published in NEJM AI in 2024.

The central challenge BiomedCLIP addresses is the fragmentation of biomedical imaging AI: historically, models specialized in a single modality (chest X-rays, histopathology slides, microscopy images) and required domain-specific fine-tuning. By training at scale across approximately 30 biomedical image subcategories, BiomedCLIP learns visual representations that transfer across radiology, digital pathology, microscopy, and other modalities without additional adaptation. At the time of publication, BiomedCLIP achieved state-of-the-art zero-shot accuracy on a broad suite of classification, retrieval, and visual question answering (VQA) benchmarks, including outperforming radiology-specialized models on radiology-specific tasks.

Key Features

  • Cross-modality generalization: A single model covers radiology (X-ray, CT, MRI), digital pathology, fluorescence microscopy, and other biomedical imaging types without domain-specific fine-tuning, making it broadly applicable across research contexts.
  • Large-scale biomedical pretraining corpus: PMC-15M spans approximately 30 biomedical image categories drawn from peer-reviewed publications, providing high-quality paired text from domain experts rather than crowd-sourced annotation.
  • Domain-adapted text encoder: Uses PubMedBERT as the text encoder with an extended context length of 256 tokens (versus 77 in standard CLIP), covering approximately 90% of PubMed Central figure captions in full.
  • Strong zero-shot performance: Outperforms prior specialized models including BioViL, MedCLIP, and PubMedCLIP on zero-shot classification and cross-modal retrieval tasks across multiple biomedical imaging benchmarks.
  • Open model weights: Publicly released on HuggingFace under microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224, enabling downstream research and fine-tuning for new biomedical imaging tasks.
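The zero-shot mechanics behind these features can be sketched without the real model: both encoders map their input to a shared embedding space, and classification reduces to comparing an image embedding against one text-prompt embedding per class. The vectors and prompt texts below are toy stand-ins for actual encoder outputs, not BiomedCLIP's real embeddings.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Score an image embedding against one text embedding per class.

    Embeddings are L2-normalized so the dot product is cosine similarity;
    a temperature-scaled softmax over similarities gives class probabilities.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy 4-d embeddings standing in for real encoder outputs.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "this is a chest X-ray showing pneumonia"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "this is a normal chest X-ray"
])
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # → 0 (first prompt is the closest match)
```

Because class "training" is just writing prompts, new categories can be added at inference time without touching the model weights.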

Technical Details

BiomedCLIP follows the dual-encoder contrastive learning framework of CLIP, adapted for the biomedical domain. The image encoder is a Vision Transformer ViT-B/16 initialized from ImageNet-pretrained weights, processing images at 224x224 pixel resolution with 196 patch tokens plus one [CLS] token. The text encoder is PubMedBERT, a BERT-based model pretrained on PubMed abstracts and full-text biomedical articles. Contrastive training uses an InfoNCE loss with a learned temperature parameter to maximize cosine similarity between matched image-text pairs across batches of 4,000 samples over 32 training epochs with a cosine learning rate schedule and 2,000-step linear warmup.
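The symmetric InfoNCE objective described above can be sketched in a few lines: within a batch, matched image-caption pairs are positives and every other pairing is a negative, and the loss averages the image-to-text and text-to-image cross-entropy directions. This is a minimal numpy illustration of the objective, not Microsoft's training code.

```python
import numpy as np

def info_nce_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Row i of each matrix embeds item i of pair i; matched pairs are
    positives, and all other in-batch pairings serve as negatives.
    """
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))             # image i matches caption i

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero; shuffling the captions against the images drives it up, which is the signal that pulls matched pairs together during training.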

On zero-shot image classification benchmarks, BiomedCLIP achieves 78.95% accuracy on RSNA Pneumonia Detection (surpassing the radiology-specialized BioViL model) and 73.41% accuracy on PCam patch-level cancer detection. On cross-modal retrieval from a held-out PMC-15M validation set, Image-to-text Recall@1 reaches 82.90% compared to approximately 11% for general-purpose CLIP, illustrating how domain-specific pretraining closes the biomedical domain gap. On the SLAKE VQA benchmark, BiomedCLIP reaches accuracy comparable to Med-PaLM M despite containing far fewer parameters.
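The Recall@1 metric quoted above measures, for each image, whether its true caption ranks first among all candidate texts by embedding similarity. A minimal sketch of the computation (toy embeddings, normalized cosine similarity):

```python
import numpy as np

def recall_at_k(img_embs, txt_embs, k=1):
    """Image-to-text Recall@K: the fraction of images whose matching
    caption (same row index) appears among the K most similar texts."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sims = img @ txt.T                       # (N_img, N_txt) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of top-K texts per image
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()
```

On BiomedCLIP's held-out validation set this retrieval is run over the full caption pool, so the roughly 83% vs. 11% gap against general-purpose CLIP reflects how much of the biomedical domain vocabulary the general model simply cannot rank.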

Applications

BiomedCLIP is well-suited for researchers and clinicians working across diverse biomedical imaging workflows. Its zero-shot classification capability supports rapid image triage and dataset exploration without labeled training data. Cross-modal retrieval enables text-based search of large figure archives and radiology PACS systems, as well as literature mining by image similarity. The model's dense embeddings can also serve as initialization for supervised fine-tuning in low-data regimes, such as rare disease classification or novel imaging modality adaptation. VQA capabilities support clinical education tools and decision-support prototypes when paired with additional decoder components.
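One common low-data pattern hinted at above is to freeze the encoder and fit a lightweight classifier on its embeddings. The nearest-centroid probe below is a minimal sketch of that idea on toy 2-d vectors; in practice the inputs would be BiomedCLIP image embeddings and the classifier could equally be a logistic-regression head.

```python
import numpy as np

def fit_centroids(embs, labels):
    """Few-shot probe on frozen embeddings: one mean vector per class,
    L2-normalized so prediction is a cosine-similarity argmax."""
    classes = np.unique(labels)
    cents = np.stack([embs[labels == c].mean(axis=0) for c in classes])
    return classes, cents / np.linalg.norm(cents, axis=1, keepdims=True)

def predict(embs, classes, cents):
    """Assign each embedding to the class of its nearest centroid."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return classes[(embs @ cents.T).argmax(axis=1)]

# Toy embeddings: two well-separated classes with two examples each.
train_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
classes, cents = fit_centroids(train_embs, train_labels)
preds = predict(np.array([[0.8, 0.2], [0.2, 0.8]]), classes, cents)
print(preds)  # → [0 1]
```

Because only the centroids (or a small linear head) are learned, this works with a handful of labeled examples per class, which is exactly the rare-disease and novel-modality regime the paragraph describes.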

Impact

BiomedCLIP established contrastive pretraining at PMC scale as a viable path to general-purpose biomedical vision-language models, and its PMC-15M dataset has become a reference benchmark for subsequent multimodal biomedical AI work. The publicly released weights have facilitated downstream research across pathology, radiology, and microscopy communities. Key limitations include a fixed 224x224 input resolution that constrains applicability to tasks requiring high spatial detail, English-only text encoding due to the PubMedBERT backbone, and a training corpus skewed toward academic figure types (including charts and diagrams) rather than purely clinical imaging workflows. As a discriminative model, BiomedCLIP does not natively generate free-text responses and requires additional decoder components for open-ended generation tasks.

Citations

DOI: 10.1056/AIoa2300114

Preprint

DOI: 10.48550/arXiv.2303.00915

Metrics

HuggingFace

Downloads: 874.6K
Likes: 396
Last modified: 1 year ago
Pipeline: zero-shot-image-classification

Tags

image analysis, vision transformer, contrastive learning, foundation model, multimodal, zero-shot, radiology

Resources

Research Paper · HuggingFace Model