Endo-FM

Chinese University of Hong Kong / Shanghai Jiao Tong University

Endoscopy video foundation model that learns spatial-temporal representations from unlabeled clips for classification, segmentation, and detection.

Released: June 2023

Endo-FM is a foundation model for endoscopy video analysis, developed by Qi Dou's lab at the Chinese University of Hong Kong with collaborators at Shanghai Jiao Tong University and presented at MICCAI 2023. Endoscopy is the primary clinical tool for examining the gastrointestinal tract, but building reliable AI assistants for tasks such as polyp diagnosis, lesion segmentation, and disease detection has historically required large, expensively annotated datasets for each individual task. Endo-FM addresses this bottleneck by learning general-purpose spatial-temporal representations from raw, unlabeled endoscopy video that can then be fine-tuned for many downstream applications.

Unlike image-level medical models that treat each frame independently, Endo-FM is designed around video. Endoscopy footage contains rich temporal cues—camera motion, tissue deformation, and the transient appearance of lesions—that single-frame models discard. By pretraining a video transformer to capture both local detail and global long-range dependencies across space and time, Endo-FM produces representations that remain robust to the motion blur, lighting changes, and viewpoint variation typical of real endoscopic procedures.

Released in mid-2023 with open code and weights, Endo-FM was among the first self-supervised video foundation models built specifically for endoscopy. It arrived alongside a broader wave of medical foundation models and adapts that approach to the spatial-temporal structure of clinical video.

Key Features

Video-native representations: A video transformer captures spatial-temporal dependencies across frames, modeling motion and temporal context that frame-only models ignore.
Large-scale self-supervised pretraining: Trained without labels on a corpus aggregated from 10 datasets, removing the need for task-specific annotation during pretraining.
Global-local pretext task: A teacher-student scheme matches predictions across spatially and temporally augmented global and local views, encouraging invariant and transferable features.
Multi-task transfer: A single pretrained checkpoint fine-tunes to classification (disease diagnosis), segmentation (polyps), and detection (lesions).
Open release: Code, pretrained weights, and downstream fine-tuned checkpoints are released under the Apache 2.0 license.
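
The global-local pretext task listed above can be pictured as drawing several views from one clip, each with its own frame window and spatial crop. The following is a minimal sketch under assumed settings: the `View` record, `sample_view`, `make_views`, the view counts (2 global, 8 local), the frames per view, and the crop scales are all hypothetical choices for illustration and are not taken from the Endo-FM release.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    kind: str             # "global" or "local"
    frame_indices: tuple  # frames of the source clip used by this view
    crop: tuple           # (top, left, height, width) in pixels


def sample_view(num_frames, height, width, kind, rng):
    """Sample one spatial-temporal view; all sizes here are assumptions."""
    if kind == "global":
        n_frames, scale = 8, rng.uniform(0.6, 1.0)  # longer, larger view
    else:
        n_frames, scale = 4, rng.uniform(0.2, 0.5)  # shorter, smaller view

    # Temporal augmentation: a random window sampled with a random stride.
    n_frames = min(n_frames, num_frames)
    max_stride = max(1, (num_frames - 1) // max(1, n_frames - 1))
    stride = rng.randint(1, max_stride)
    span = (n_frames - 1) * stride
    start = rng.randint(0, num_frames - 1 - span)
    frames = tuple(start + i * stride for i in range(n_frames))

    # Spatial augmentation: a random crop covering `scale` of each side.
    crop_h = max(1, int(height * scale))
    crop_w = max(1, int(width * scale))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return View(kind, frames, (top, left, crop_h, crop_w))


def make_views(num_frames, height, width, n_global=2, n_local=8, seed=0):
    """Return the global views first, followed by the local views."""
    rng = random.Random(seed)
    kinds = ["global"] * n_global + ["local"] * n_local
    return [sample_view(num_frames, height, width, k, rng) for k in kinds]


if __name__ == "__main__":
    for view in make_views(num_frames=32, height=224, width=224):
        print(view.kind, view.frame_indices, view.crop)
```

In a setup like this, the teacher would typically see only the global views while the student sees every view, which pushes the student to infer clip-level context from partial spatial and temporal evidence.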

Technical Details

Endo-FM uses a video transformer backbone pretrained with a self-supervised teacher-student objective: the student is trained to match the teacher's predictions for differently augmented global and local spatial-temporal views. Pretraining draws on a corpus assembled from 10 datasets (9 public plus a private set from the Baoshan Branch of Renji Hospital, Shanghai), comprising more than 33,000 video clips and up to roughly 5 million frames that span multiple protocols, target organs, and disease types. The model is evaluated on three downstream benchmarks: PolypDiag (classification), CVC-12k (segmentation), and KUMC (detection). On all three it outperforms prior state-of-the-art transfer approaches. Reported gains over the VCL baseline are 3.1% F1 on classification, 4.8% Dice on segmentation, and 5.5% F1 on detection, with larger margins over the ST-Adapter baseline.
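
To make the teacher-student objective more concrete, here is a minimal sketch of one common way such an objective is written: a cross-entropy between temperature-sharpened teacher and student distributions, with the teacher updated as an exponential moving average (EMA) of the student. The source only states that the student matches the teacher across views, so the EMA update, the temperatures, the momentum, and the function names (`view_loss`, `cross_view_loss`, `ema_update`) are assumptions borrowed from DINO-style self-distillation rather than confirmed details of Endo-FM. Teacher-output centering, which such methods often add, is omitted.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def view_loss(student_logits, teacher_logits,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions (assumed temps)."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def cross_view_loss(student_outputs, teacher_outputs):
    """Average view_loss over every (teacher view, student view) pair.

    Assumes student_outputs lists the global views first, in the same order
    as teacher_outputs, so equal indices are the same view and are skipped.
    """
    total, pairs = 0.0, 0
    for t_idx, teacher_logits in enumerate(teacher_outputs):
        for s_idx, student_logits in enumerate(student_outputs):
            if s_idx == t_idx:
                continue
            total += view_loss(student_logits, teacher_logits)
            pairs += 1
    return total / pairs if pairs else 0.0


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


if __name__ == "__main__":
    teacher_out = [[2.0, 0.0, 0.0], [1.8, 0.1, 0.0]]                # two global views
    student_out = teacher_out + [[1.5, 0.2, 0.1], [0.0, 2.0, 0.0]]  # plus two local views
    print(cross_view_loss(student_out, teacher_out))
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

In a scheme of this kind only the student receives gradients; the teacher changes solely through the EMA step, which keeps its targets stable while the student learns.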

Applications

Endo-FM serves as a pretrained backbone for computer-aided endoscopy systems. By fine-tuning the shared checkpoint, researchers and clinical AI developers can build GI disease classifiers, polyp segmentation models for colonoscopy, and lesion detectors—often with less labeled data than training from scratch would require. This is particularly valuable in medical imaging, where expert annotation is costly and many specialized datasets are small. The released downstream weights also provide ready-to-use baselines for groups developing endoscopy decision-support tools.
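
Fine-tuned classifiers and segmentation models of the kind described here are usually compared with F1 and Dice, the metrics used for the reported downstream gains. The sketch below shows how the two scores are computed for binary labels and binary masks. The function names and the empty-mask convention are illustrative assumptions, and detection F1 additionally needs a box-matching step (for example an IoU threshold) that is not covered.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dice_score(mask_true, mask_pred):
    """Dice overlap for two binary masks given as nested lists of 0/1."""
    flat_true = [v for row in mask_true for v in row]
    flat_pred = [v for row in mask_pred for v in row]
    intersection = sum(t * p for t, p in zip(flat_true, flat_pred))
    total = sum(flat_true) + sum(flat_pred)
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0


if __name__ == "__main__":
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
    print(dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

For binary masks, Dice is the same quantity as pixel-level F1, which is why the two metrics are often reported side by side for classification and segmentation.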

Impact

Endo-FM demonstrated that self-supervised pretraining on large unlabeled endoscopy video yields representations that transfer across heterogeneous tasks, helping establish video foundation models as a practical paradigm for endoscopic AI. Its open code and weights have made it a common starting point and benchmark for subsequent endoscopy and surgical-video models. Limitations include reliance on a partially private pretraining corpus that constrains exact reproducibility, evaluation focused on a small set of GI benchmarks, and the usual caveat that downstream clinical use requires prospective validation beyond the reported retrospective metrics.

Citation

Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train

Preprint

Wang, Z., et al. (2023) Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train. International Conference on Medical Image Computing and Computer-Assisted Intervention.

DOI: 10.48550/arXiv.2306.16741

Recent citations

Papers that recently cited this model.

Joint color-spatial iterative interaction and metric-based motion filtering for unsupervised polyp segmentation in endoscopic videos.
Wenlong Song, Yiwen Jia, Jie Chen, et al.
Neural Networks · Jul 2026 · Cited by 0

HyperVLP: Enhancing Hierarchical Surgical Video-Language Pre-training in Hyperbolic Space
Yaojun Hu, Kun Yuan, N. Navab, et al.
Jun 2026 · Cited by 0

APRIL-MedSeg: A Modular Medical Image Segmentation Toolbox Embracing Modern Paradigms
Juntao Jiang, Jin-Feng Bai, Linxuan Fan, et al.
Jun 2026 · Cited by 0

Top citations

The most-cited papers that cite this model.

Medical Image Analysis
Zongwei Zhou, V. Sodha, Jiaxuan Pang, et al.
Cited by 458

On the Challenges and Perspectives of Foundation Models for Medical Image Analysis
Shaoting Zhang, Dimitris N. Metaxas
Medical Image Anal. · Jun 2023 · Cited by 304

Large AI Models in Health Informatics: Applications, Challenges, and the Future
Jianing Qiu, Lin Li, Jiankai Sun, et al.
IEEE journal of biomedical and health informatics · Mar 2023 · Cited by 213

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad, Reza Azad, Sania Eskandari, et al.
arXiv.org · Oct 2023 · Cited by 125

A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan, Seowung Leem, Kyle B. See, et al.
IEEE Reviews in Biomedical Engineering · Jun 2024 · Cited by 115

Citations

Total Citations: 135
Influential: 8
References: 37

GitHub

Stars: 230
Forks: 30
Open Issues: 14
Contributors: 2
Last Push: 3mo ago
Language: Python
License: Apache-2.0

Fields of citing research

Computer Science: 97%
Medicine: 94%
Engineering: 40%
Environmental Science: 2%
Biology: 2%
Art: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 76 (Open)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 57

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository
Research Paper
