bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Imaging · Language model

EndoChat

Chinese University of Hong Kong / Huawei / Technical University of Munich / University of Strasbourg / Shandong University / Chinese Academy of Sciences

Grounded multimodal large language model for endoscopic surgery, supporting visual dialogue, region-based question answering, and bounding-box grounding across surgical scene understanding tasks.

Released: January 2025

EndoChat is a grounded multimodal large language model (MLLM) purpose-built for endoscopic and robot-assisted surgery. General-purpose vision-language models have become capable at everyday image understanding, but they struggle with the specialized visual content, fine spatial grounding, and domain vocabulary of the surgical scene. EndoChat targets this gap by coupling a surgical vision encoder with a large language backbone so that surgeons and trainees can interrogate live endoscopic imagery through natural-language dialogue, including questions that demand precise spatial localization of instruments and tissue.

The model was introduced in January 2025 by a multi-institutional team led by the Department of Electronic Engineering at the Chinese University of Hong Kong, with collaborators at Huawei's Theory Lab, the Technical University of Munich, the University of Strasbourg (CNRS, INSERM, ICube & IHU Strasbourg), Qilu Hospital of Shandong University, and the Centre for Artificial Intelligence and Robotics under the Chinese Academy of Sciences. It is positioned alongside other surgical-vision foundation models such as Endo-FM, but distinguishes itself as a full conversational, grounding-capable MLLM rather than a single-task perception model.

To train and evaluate the system, the authors assembled Surg-396K, a large multimodal instruction dataset of roughly 396,000 image-instruction pairs derived from existing large-scale endoscopic surgery datasets through an automated annotation pipeline. This dataset underpins the model's broad coverage of dialogue styles and scene-understanding tasks.

#Key Features

  • Grounded visual dialogue: Supports five dialogue paradigms — single-phrase QA, detailed description, visual QA, region-based QA (questions referencing a bounding box), and grounding QA (answers returned as a bounding box) — enabling spatially precise surgical conversation.
  • Broad surgical scene understanding: Handles eight scene tasks including instrument counting, category identification, motion and direction recognition, object position, and instrument/tissue detection.
  • Mixed Visual Token Engine: A multi-scale visual fusion module combining DINOv2 and OpenCLIP towers to capture both global context and fine-grained surgical detail.
  • Hallucination mitigation: A visual contrast-based reasoning mechanism reduces object hallucination, a common failure mode of MLLMs in high-stakes clinical imagery.
  • Open dataset and weights: Surg-396K and model checkpoints are publicly released for the surgical-AI research community.
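The five dialogue paradigms differ mainly in where a bounding box appears: in the question (region-based QA), in the answer (grounding QA), or not at all. A minimal sketch of one way to represent such samples follows. The field names, paradigm identifiers, and the `(x_min, y_min, x_max, y_max)` box convention are assumptions made for illustration and are not taken from the Surg-396K release.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical schema: names and box convention are assumptions,
# not the actual Surg-396K file format.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

PARADIGMS = (
    "single_phrase_qa",
    "detailed_description",
    "visual_qa",
    "region_based_qa",   # the question refers to a box
    "grounding_qa",      # the answer is a box
)


@dataclass
class DialogueSample:
    image_path: str
    paradigm: str
    question: str
    answer_text: Optional[str] = None
    question_box: Optional[BBox] = None  # region the question points at
    answer_box: Optional[BBox] = None    # region returned as the answer

    def __post_init__(self) -> None:
        # Reject samples whose boxes do not match their paradigm.
        if self.paradigm not in PARADIGMS:
            raise ValueError(f"unknown paradigm: {self.paradigm}")
        if self.paradigm == "region_based_qa" and self.question_box is None:
            raise ValueError("region_based_qa needs question_box")
        if self.paradigm == "grounding_qa" and self.answer_box is None:
            raise ValueError("grounding_qa needs answer_box")


sample = DialogueSample(
    image_path="frame_0001.png",  # placeholder path
    paradigm="grounding_qa",
    question="Where is the grasper?",
    answer_box=(10.0, 20.0, 50.0, 80.0),
)
```

Keeping the question box and answer box as separate optional fields makes the difference between region-based QA and grounding QA explicit at the data level.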

#Technical Details

EndoChat is built on the LLaMA2-13B language backbone (via the LLaMA2-Accessory framework) paired with a Mixed Visual Token Engine that fuses DINOv2 and OpenCLIP visual features at multiple scales, rather than relying on a single pretrained vision transformer. The model is instruction-tuned on the Surg-396K dataset of ~396K image-instruction pairs spanning the five dialogue paradigms and eight surgical scene understanding tasks. Across these tasks the authors report state-of-the-art performance relative to prior surgical and general MLLMs. In a human evaluation, professional surgeons rated the majority of EndoChat's generated conversations positively.
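To make the idea of multi-scale fusion across two vision towers concrete, here is a toy NumPy sketch. It is not the authors' Mixed Visual Token Engine: the pooling scheme, the linear projections, the grid size, and all dimensions are assumptions, and random arrays stand in for real DINOv2 and OpenCLIP patch features.

```python
import numpy as np


def avg_pool_grid(feats: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool an (H, W, C) feature grid by an integer factor."""
    h, w, c = feats.shape
    assert h % factor == 0 and w % factor == 0, "grid must divide evenly"
    blocks = feats.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def fuse_tokens(grids, proj_mats, factors=(1, 2)):
    """Project each encoder's grid at each scale to a shared width and concatenate."""
    tokens = []
    for grid, proj in zip(grids, proj_mats):
        for f in factors:
            pooled = avg_pool_grid(grid, f)               # (H/f, W/f, C)
            flat = pooled.reshape(-1, pooled.shape[-1])   # (N, C)
            tokens.append(flat @ proj)                    # (N, D)
    return np.concatenate(tokens, axis=0)


rng = np.random.default_rng(0)
# Stand-ins for patch features; channel widths are placeholders, not real model sizes.
dino_grid = rng.normal(size=(16, 16, 32))
clip_grid = rng.normal(size=(16, 16, 24))
d_model = 16  # placeholder shared token width
projs = [rng.normal(size=(32, d_model)), rng.normal(size=(24, d_model))]

fused = fuse_tokens([dino_grid, clip_grid], projs)
print(fused.shape)  # (640, 16): (256 + 64) tokens per encoder, two encoders
```

The fine scale keeps every patch token for detail, while the pooled scale contributes a shorter sequence summarizing wider context. In a real system the projections would be learned rather than random.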

#Applications

EndoChat is intended for surgical training, intraoperative guidance, and scene comprehension in robot-assisted endoscopic procedures. A trainee can ask the model to describe a scene, count or identify instruments, or localize a structure with a bounding box, while researchers can use it as a grounded baseline for surgical vision-language tasks. The grounding capability makes it suitable for building explainable assistance tools where answers must be tied to specific image regions rather than offered as ungrounded text.
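When answers are tied to image regions, a predicted box can be compared against a reference box. Intersection over union (IoU) is a common choice for this. The page does not state which metric the authors use, so the sketch below is only a generic illustration, and the `(x_min, y_min, x_max, y_max)` box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 10.0, 10.0)   # hypothetical model output
reference = (5.0, 5.0, 15.0, 15.0)   # hypothetical annotation
print(round(iou(predicted, reference), 4))  # 0.1429
```

A threshold on this score (0.5 is a common convention in detection work) can turn the comparison into a hit-or-miss decision per question.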

#Impact

EndoChat extends the reach of multimodal foundation models into the demanding domain of surgical endoscopy, where spatial precision and reliability are essential. By releasing both the Surg-396K dataset and model weights, the authors lower the barrier to research on conversational surgical assistants and provide a reusable benchmark for grounded surgical understanding. Its emphasis on hallucination reduction and bounding-box grounding addresses two of the most pressing obstacles to clinical adoption of MLLMs, though, as with all such systems, real-world deployment will require rigorous prospective validation before any clinical use.

Citation

EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery

Preprint

Wang, G., et al. (2025). EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery. Medical Image Anal.

DOI: 10.48550/arXiv.2501.11347

Citations

Total Citations: 42
Influential: 2
References: 56

GitHub

Stars: 50
Forks: 1
Open Issues: 2
Contributors: 2
Last Push: 3mo ago
Language: Python

HuggingFace

Downloads: 20
Likes: 0
Last Modified: 7mo ago

Openness

bio.rodeo openness: Closed · low usability and reproducibility
21 (Closed)
Usability (can I run it?): 16
Reproducibility (can I retrain it?): 17
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

endoscopy, grounded_dialogue, instruction_tuning, multimodal, surgical_scene_understanding, surgical_video, transformer, vision_language_model, visual_question_answering

Resources

GitHub Repository · Research Paper · HuggingFace Model