Chinese University of Hong Kong / Shanghai Jiao Tong University
A foundation model for endoscopy video analysis, self-supervised pretrained on 33,000+ clips to transfer to classification, segmentation, and detection.
Endo-FM is a foundation model for endoscopy video analysis, developed by Qi Dou's lab at the Chinese University of Hong Kong with collaborators at Shanghai Jiao Tong University and presented at MICCAI 2023. Endoscopy is the primary clinical tool for examining the gastrointestinal tract, but building reliable AI assistants for tasks such as polyp diagnosis, lesion segmentation, and disease detection has historically required large, expensively annotated datasets for each individual task. Endo-FM addresses this bottleneck by learning general-purpose spatial-temporal representations from raw, unlabeled endoscopy video that can then be fine-tuned for many downstream applications.
Unlike image-level medical models that treat each frame independently, Endo-FM is designed around video. Endoscopy footage contains rich temporal cues—camera motion, tissue deformation, and the transient appearance of lesions—that single-frame models discard. By pretraining a video transformer to capture both local detail and global long-range dependencies across space and time, Endo-FM produces representations that remain robust to the motion blur, lighting changes, and viewpoint variation typical of real endoscopic procedures.
Released in mid-2023 with open code and weights, Endo-FM was among the first self-supervised video foundation models built specifically for endoscopy. It is part of the broader wave of medical foundation models, here adapted to the spatial-temporal structure of clinical video.
Endo-FM uses a video transformer backbone pretrained with a self-supervised, teacher-student objective in which the student is trained to predict the teacher's representations of differently augmented global and local spatial-temporal views. Pretraining draws on a corpus assembled from 10 datasets (9 public plus a private set from the Baoshan Branch of Renji Hospital, Shanghai), comprising more than 33,000 video clips and up to roughly 5 million frames spanning multiple protocols, target organs, and disease types. On three downstream benchmarks, PolypDiag (classification), CVC-12k (segmentation), and KUMC (detection), Endo-FM outperforms prior state-of-the-art transfer approaches. Reported gains over the VCL baseline are 3.1% F1 on classification, 4.8% Dice on segmentation, and 5.5% F1 on detection, with larger margins over the ST-Adapter baseline.
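The teacher-student objective described above can be illustrated with a minimal, dependency-free sketch. It assumes a DINO-style formulation: the teacher sees only global views, the student sees all views, and the teacher is an exponential moving average of the student. The paragraph above does not state these details, so this is not the authors' exact loss. The function names, temperatures, and momentum value are hypothetical, and refinements such as output centering are omitted.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's and the student's output distributions.

    The temperatures are illustrative values, not taken from the Endo-FM paper.
    The lower teacher temperature gives a sharper target for the student.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def multi_view_loss(student_outputs, teacher_outputs):
    """Average the loss over (teacher view, student view) pairs.

    Assumption: teacher_outputs holds logits for the global views only, and
    student_outputs holds logits for all views with the global views first,
    in the same order. A view is never matched against itself.
    """
    total, pairs = 0.0, 0
    for t_idx, t_logits in enumerate(teacher_outputs):
        for s_idx, s_logits in enumerate(student_outputs):
            if s_idx == t_idx:
                continue  # skip the pair where both networks saw the same view
            total += distillation_loss(s_logits, t_logits)
            pairs += 1
    return total / pairs


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student's value."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In this sketch only the student would receive gradients. The teacher changes solely through `ema_update`, which is one common way to keep the prediction targets stable during self-supervised pretraining.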
Endo-FM serves as a pretrained backbone for computer-aided endoscopy systems. By fine-tuning the shared checkpoint, researchers and clinical AI developers can build GI disease classifiers, polyp segmentation models for colonoscopy, and lesion detectors—often with less labeled data than training from scratch would require. This is particularly valuable in medical imaging, where expert annotation is costly and many specialized datasets are small. The released downstream weights also provide ready-to-use baselines for groups developing endoscopy decision-support tools.
Endo-FM demonstrated that self-supervised pretraining on large unlabeled endoscopy video yields representations that transfer across heterogeneous tasks, helping establish video foundation models as a practical paradigm for endoscopic AI. Its open code and weights have made it a common starting point and benchmark for subsequent endoscopy and surgical-video models. Limitations include reliance on a partially private pretraining corpus that constrains exact reproducibility, evaluation focused on a small set of GI benchmarks, and the usual caveat that downstream clinical use requires prospective validation beyond the reported retrospective metrics.
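For readers interpreting the reported metrics, the short sketch below shows how F1 and Dice are commonly computed for binary labels and flattened binary masks. It is illustrative only and is not the evaluation code used for Endo-FM. The function names are hypothetical, the empty-case return values are conventions chosen here, and detection F1 additionally requires matching predicted boxes to ground truth, which is not shown.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumption: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def dice_score(mask_true, mask_pred):
    """Dice for flattened binary masks: 2*|A ∩ B| / (|A| + |B|)."""
    intersection = sum(1 for t, p in zip(mask_true, mask_pred) if t == 1 and p == 1)
    total = sum(mask_true) + sum(mask_pred)
    # Assumption: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0


# One true positive, one false positive, one false negative.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))    # 0.5
# Two overlapping pixels; mask sizes are 2 and 3.
print(dice_score([1, 1, 0, 0], [1, 1, 1, 0]))  # 0.8
```

On binary inputs the two formulas coincide. The different names mainly reflect whether the unit being scored is a sample or detection (F1) or a pixel (Dice).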
Wang, Z., et al. (2023) Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train. International Conference on Medical Image Computing and Computer-Assisted Intervention.
DOI: 10.48550/arXiv.2306.16741