bio.rodeo
The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Endo-FM

Chinese University of Hong Kong / Shanghai Jiao Tong University

A foundation model for endoscopy video analysis, self-supervised pretrained on 33,000+ clips to transfer to classification, segmentation, and detection.

Released: June 2023

Endo-FM is a foundation model for endoscopy video analysis, developed by Qi Dou's lab at the Chinese University of Hong Kong with collaborators at Shanghai Jiao Tong University and presented at MICCAI 2023. Endoscopy is the primary clinical tool for examining the gastrointestinal tract, but building reliable AI assistants for tasks such as polyp diagnosis, lesion segmentation, and disease detection has historically required large, expensively annotated datasets for each individual task. Endo-FM addresses this bottleneck by learning general-purpose spatial-temporal representations from raw, unlabeled endoscopy video that can then be fine-tuned for many downstream applications.

Unlike image-level medical models that treat each frame independently, Endo-FM is designed around video. Endoscopy footage contains rich temporal cues—camera motion, tissue deformation, and the transient appearance of lesions—that single-frame models discard. By pretraining a video transformer to capture both local detail and global long-range dependencies across space and time, Endo-FM produces representations that remain robust to the motion blur, lighting changes, and viewpoint variation typical of real endoscopic procedures.

Released in mid-2023 with open code and weights, Endo-FM was among the first self-supervised video foundation models built specifically for endoscopy. It belongs to a broader wave of medical foundation models, here adapted to the spatial-temporal structure of clinical video.

#Key Features

  • Video-native representations: A video transformer captures spatial-temporal dependencies across frames, modeling motion and temporal context that frame-only models ignore.
  • Large-scale self-supervised pretraining: Trained without labels on a corpus aggregated from 10 datasets, removing the need for task-specific annotation during pretraining.
  • Global-local pretext task: A teacher-student scheme matches predictions across spatially and temporally augmented global and local views, encouraging invariant and transferable features.
  • Multi-task transfer: A single pretrained checkpoint fine-tunes to classification (disease diagnosis), segmentation (polyps), and detection (lesions).
  • Open release: Code, pretrained weights, and downstream fine-tuned checkpoints are released under the Apache 2.0 license.
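To make the global-local idea in the list above concrete, here is a minimal sketch of how spatially and temporally augmented views of one clip might be sampled. It is an illustration under stated assumptions, not the released Endo-FM pipeline: the function names (`sample_view`, `sample_views`), the frame counts (8 and 4), the crop scales, and the number of views are all hypothetical choices made for this example.

```python
import random

def sample_view(num_frames, height, width, n_frames, scale, rng):
    """Return frame indices and a crop box (top, left, h, w) for one clip view.

    Assumes num_frames >= n_frames. All parameters are illustrative.
    """
    # Temporal augmentation: cover a random span with evenly spaced frames.
    span = rng.randint(n_frames, num_frames)
    start = rng.randint(0, num_frames - span)
    frames = [start + (i * span) // n_frames for i in range(n_frames)]
    # Spatial augmentation: random crop covering `scale` of each side.
    crop_h = max(1, int(height * scale))
    crop_w = max(1, int(width * scale))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return {"frames": frames, "box": (top, left, crop_h, crop_w)}

def sample_views(num_frames, height, width, rng, n_global=2, n_local=4):
    """Global views: more frames, larger crops. Local views: fewer frames, smaller crops."""
    global_views = [
        sample_view(num_frames, height, width, 8, rng.uniform(0.6, 1.0), rng)
        for _ in range(n_global)
    ]
    local_views = [
        sample_view(num_frames, height, width, 4, rng.uniform(0.2, 0.5), rng)
        for _ in range(n_local)
    ]
    return global_views, local_views

# Example: one 64-frame clip at 224x224 resolution (hypothetical sizes).
rng = random.Random(0)
global_views, local_views = sample_views(num_frames=64, height=224, width=224, rng=rng)
```

Each view pairs a set of frame indices with a crop box, so two views of the same clip can differ in both which moments they see and which region of the frame they cover.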

#Technical Details

Endo-FM uses a video transformer backbone pretrained with a self-supervised teacher-student objective: the student is trained to predict the teacher's representations of differently augmented global and local spatial-temporal views. The pretraining corpus is assembled from 10 datasets (9 public, plus a private set from the Baoshan Branch of Renji Hospital, Shanghai) and comprises more than 33,000 video clips and up to 5 million frames spanning multiple protocols, target organs, and disease types. On three downstream benchmarks, PolypDiag (classification), CVC-12k (segmentation), and KUMC (detection), Endo-FM outperforms prior state-of-the-art transfer approaches. Reported gains over the VCL baseline are 3.1% F1, 4.8% Dice, and 5.5% F1, respectively, with larger margins over the ST-Adapter baseline.
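A teacher-student objective of this kind can be sketched as a cross-entropy between teacher and student outputs on different views of the same clip. The sketch below is a generic self-distillation loss in the style of DINO, not code from the Endo-FM repository: the softmax temperatures, the exponential-moving-average teacher update, and the names `cross_view_loss` and `ema_update` are assumptions introduced for illustration.

```python
import math

def softmax(logits, temperature):
    """Numerically stable softmax with a temperature."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    """Numerically stable log-softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def cross_view_loss(teacher_global, student_views, t_teacher=0.04, t_student=0.1):
    """Average cross-entropy between teacher and student over pairs of different views.

    teacher_global: teacher outputs for the global views.
    student_views: student outputs for the same global views (same order),
                   followed by student outputs for the local views.
    Temperatures are assumed values, not taken from the source.
    """
    total, pairs = 0.0, 0
    for t_idx, t_logits in enumerate(teacher_global):
        target = softmax(t_logits, t_teacher)
        for s_idx, s_logits in enumerate(student_views):
            if s_idx == t_idx:
                continue  # skip the pair where both networks see the same view
            log_pred = log_softmax(s_logits, t_student)
            total -= sum(p * lq for p, lq in zip(target, log_pred))
            pairs += 1
    return total / pairs

def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Assumed teacher update: exponential moving average of student weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# Toy outputs: two global views from the teacher, three views from the student.
teacher = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5]]
agreeing = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5], [1.8, 0.1, -0.9]]
disagreeing = [[-1.0, 0.0, 2.0], [-0.5, 0.2, 1.5], [-0.9, 0.1, 1.8]]
loss_agree = cross_view_loss(teacher, agreeing)
loss_disagree = cross_view_loss(teacher, disagreeing)
```

In this toy setup the loss is lower when the student's outputs agree with the teacher's across views, which is the pressure that pushes the backbone toward features invariant to the spatial and temporal augmentation.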

#Applications

Endo-FM serves as a pretrained backbone for computer-aided endoscopy systems. By fine-tuning the shared checkpoint, researchers and clinical AI developers can build GI disease classifiers, polyp segmentation models for colonoscopy, and lesion detectors—often with less labeled data than training from scratch would require. This is particularly valuable in medical imaging, where expert annotation is costly and many specialized datasets are small. The released downstream weights also provide ready-to-use baselines for groups developing endoscopy decision-support tools.

#Impact

Endo-FM demonstrated that self-supervised pretraining on large unlabeled endoscopy video yields representations that transfer across heterogeneous tasks, helping establish video foundation models as a practical paradigm for endoscopic AI. Its open code and weights have made it a common starting point and benchmark for subsequent endoscopy and surgical-video models. Limitations include reliance on a partially private pretraining corpus that constrains exact reproducibility, evaluation focused on a small set of GI benchmarks, and the usual caveat that downstream clinical use requires prospective validation beyond the reported retrospective metrics.

Citation

Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train

Preprint

Wang, Z., et al. (2023) Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train. International Conference on Medical Image Computing and Computer-Assisted Intervention.

DOI: 10.48550/arXiv.2306.16741

Citations

Total Citations: 126
Influential: 8
References: 37

GitHub

Stars: 229
Forks: 30
Open Issues: 14
Contributors: 2
Last Push: 1mo ago
Language: Python
License: Apache-2.0

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 76 (Open)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 57
Model Openness Framework: Unclassified (missing required components)

Tags

detection, endoscopy, foundation_model, gastrointestinal, segmentation, self_supervised, video_classification, video_transformer

Resources

GitHub Repository · Research Paper