EndoChat

Chinese University of Hong Kong / Huawei / Technical University of Munich / University of Strasbourg / Shandong University / Chinese Academy of Sciences

Grounded multimodal language model for endoscopic surgery, supporting visual dialogue, region-based question answering, and bounding-box grounding.

Released: January 2025

EndoChat is a grounded multimodal large language model (MLLM) purpose-built for endoscopic and robot-assisted surgery. General-purpose vision-language models have become capable of everyday image understanding, but they struggle with the specialized visual content, fine spatial grounding, and domain vocabulary of the surgical scene. EndoChat targets this gap by coupling a surgical vision encoder with a large language backbone so that surgeons and trainees can interrogate live endoscopic imagery through natural-language dialogue, including questions that demand precise spatial localization of instruments and tissue.

The model was introduced in January 2025 by a multi-institutional team led by the Department of Electronic Engineering at the Chinese University of Hong Kong, with collaborators at Huawei's Theory Lab, the Technical University of Munich, the University of Strasbourg (CNRS, INSERM, ICube & IHU Strasbourg), Qilu Hospital of Shandong University, and the Centre for Artificial Intelligence and Robotics under the Chinese Academy of Sciences. It is positioned alongside other surgical-vision foundation models such as Endo-FM, but distinguishes itself as a full conversational, grounding-capable MLLM rather than a single-task perception model.

To train and evaluate the system, the authors assembled Surg-396K, a large multimodal instruction dataset of roughly 396,000 image-instruction pairs derived from existing large-scale endoscopic surgery datasets through an automated annotation pipeline. This dataset underpins the model's broad coverage of dialogue styles and scene-understanding tasks.

Key Features

Grounded visual dialogue: Supports five dialogue paradigms: single-phrase QA, detailed description, visual QA, region-based QA (questions referencing a bounding box), and grounding QA (answers returned as a bounding box). Together these enable spatially precise surgical conversation.
Broad surgical scene understanding: Handles eight scene tasks including instrument counting, category identification, motion and direction recognition, object position, and instrument/tissue detection.
Mixed Visual Token Engine: A multi-scale visual fusion module combining DINOv2 and OpenCLIP towers to capture both global context and fine-grained surgical detail.
Hallucination mitigation: A visual contrast-based reasoning mechanism reduces object hallucination, a common failure mode of MLLMs in high-stakes clinical imagery.
Open dataset and weights: Surg-396K and model checkpoints are publicly released for the surgical-AI research community.
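The five dialogue paradigms listed above differ mainly in where a bounding box appears: in the question (region-based QA), in the answer (grounding QA), or not at all. A minimal sketch of how such samples could be represented follows. The record layout, field names, box convention, and file name are hypothetical illustrations and are not taken from the Surg-396K schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels; this box convention is an assumption.
Box = Tuple[int, int, int, int]


class Paradigm(Enum):
    SINGLE_PHRASE_QA = "single_phrase_qa"
    DETAILED_DESCRIPTION = "detailed_description"
    VISUAL_QA = "visual_qa"
    REGION_BASED_QA = "region_based_qa"  # the question refers to a box
    GROUNDING_QA = "grounding_qa"        # the answer is a box


@dataclass
class DialogueSample:
    """Hypothetical record layout, not the actual Surg-396K format."""
    image_path: str
    paradigm: Paradigm
    question: str
    answer_text: Optional[str] = None
    question_box: Optional[Box] = None
    answer_box: Optional[Box] = None

    def validate(self) -> None:
        """Check that the boxes required by the paradigm are present and non-empty."""
        if self.paradigm is Paradigm.REGION_BASED_QA and self.question_box is None:
            raise ValueError("region-based QA needs a box in the question")
        if self.paradigm is Paradigm.GROUNDING_QA and self.answer_box is None:
            raise ValueError("grounding QA needs a box in the answer")
        for box in (self.question_box, self.answer_box):
            if box is not None and not (box[0] < box[2] and box[1] < box[3]):
                raise ValueError(f"degenerate box: {box}")


sample = DialogueSample(
    image_path="frame_0001.png",  # placeholder file name
    paradigm=Paradigm.GROUNDING_QA,
    question="Where is the grasper?",
    answer_box=(120, 80, 260, 210),
)
sample.validate()  # passes: a grounding answer carries a well-formed box
```

Keeping the question box and answer box as separate optional fields makes the difference between region-based QA and grounding QA explicit in the data rather than implicit in the prompt text.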

Technical Details

EndoChat is built on the LLaMA2-13B language backbone (via the LLaMA2-Accessory framework) paired with a Mixed Visual Token Engine that fuses DINOv2 and OpenCLIP visual features at multiple scales, rather than relying on a single pretrained vision transformer. The model is instruction-tuned on the Surg-396K dataset of ~396K image-instruction pairs spanning the five dialogue paradigms and eight surgical scene-understanding tasks. On these tasks the authors report state-of-the-art performance relative to prior surgical and general MLLMs. In a human evaluation, professional surgeons rated the majority of EndoChat's generated conversations positively.
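The description above does not say how the two visual towers are combined. One common way to fuse features from two encoders is to concatenate them token by token along the channel dimension and project the result to the language model's embedding width. The sketch below illustrates only that generic idea in plain Python. The sizes are toy values, the random vectors stand in for real DINOv2 and OpenCLIP outputs, the multi-scale aspect is not modelled, and none of it should be read as EndoChat's actual implementation.

```python
import random
from typing import List

Vector = List[float]


def concat_tokens(a: List[Vector], b: List[Vector]) -> List[Vector]:
    """Concatenate two token sequences channel-wise, token by token."""
    if len(a) != len(b):
        raise ValueError("both encoders must yield the same number of tokens")
    return [tok_a + tok_b for tok_a, tok_b in zip(a, b)]


def project(tokens: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Apply a linear map to every token; weight has shape [out_dim][in_dim]."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


rng = random.Random(0)
num_tokens, dim_a, dim_b, llm_dim = 4, 6, 8, 5  # toy sizes, not the real ones

# Random stand-ins for per-patch features from the two encoders.
tokens_a = [[rng.gauss(0, 1) for _ in range(dim_a)] for _ in range(num_tokens)]
tokens_b = [[rng.gauss(0, 1) for _ in range(dim_b)] for _ in range(num_tokens)]

# Hypothetical projection from the fused width to the language model width.
weight = [[rng.gauss(0, 0.1) for _ in range(dim_a + dim_b)] for _ in range(llm_dim)]

fused = concat_tokens(tokens_a, tokens_b)   # 4 tokens of width 6 + 8 = 14
visual_tokens = project(fused, weight)      # 4 tokens of width 5
print(len(visual_tokens), len(visual_tokens[0]))
```

Channel-wise concatenation keeps the number of visual tokens fixed while letting each token carry information from both encoders, which is one reason this pattern is popular when the language model's context budget is limited.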

Applications

EndoChat is intended for surgical training, intraoperative guidance, and scene comprehension in robot-assisted endoscopic procedures. A trainee can ask the model to describe a scene, count or identify instruments, or localize a structure with a bounding box, while researchers can use it as a grounded baseline for surgical vision-language tasks. The grounding capability makes it suitable for building explainable assistance tools where answers must be tied to specific image regions rather than offered as ungrounded text.
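When an answer is tied to an image region, a tool built on top of the model needs some way to compare a returned box against a reference annotation. Intersection over union (IoU) is a standard measure for this. The sketch below is a generic illustration: the box convention and the example coordinates are assumptions, and the source does not state which grounding metric the authors used.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max); this corner convention is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes for illustration only.
predicted = (100.0, 100.0, 200.0, 200.0)
reference = (150.0, 150.0, 250.0, 250.0)
print(iou(predicted, reference))
```

An application would typically accept a grounded answer only when its IoU with the reference exceeds a chosen threshold; the threshold itself is a design decision not specified here.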

Impact

EndoChat extends the reach of multimodal foundation models into the demanding domain of surgical endoscopy, where spatial precision and reliability are essential. By releasing both the Surg-396K dataset and model weights, the authors lower the barrier to research on conversational surgical assistants and provide a reusable benchmark for grounded surgical understanding. Its emphasis on hallucination reduction and bounding-box grounding addresses two of the most pressing obstacles to clinical adoption of MLLMs, though, as with all such systems, real-world deployment will require rigorous prospective validation before any clinical use.

Citation

EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery

Preprint

Wang, G., et al. (2025) EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery. Medical Image Anal.

DOI: 10.48550/arXiv.2501.11347

Recent citations

Papers that recently cited this model.

SurgAtlas: A Large-Scale Surgical Video-Language Dataset with 2,391 Hours of Open and Minimally Invasive Surgery
Filippos Bellos, Andre S. Gala-Garza, Miaowei Wang, et al.
Jun 2026
Citations: 0

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery
Yiping Li, Ronald L. P. D. de Jong, Romy C. van Jaarsveld, et al.
Jun 2026
Citations: 0 · Influential

Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Jiahao Meng, Yue Tan, Qi Xu, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence
Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, et al.
arXiv.org · Jun 2025
Citations: 28

SurgRAW: Multi-Agent Workflow With Chain of Thought Reasoning for Robotic Surgical Video Analysis
Chang Han Low, Ziyue Wang, Tianyi Zhang, et al.
IEEE Robotics and Automation Letters · Mar 2025
Citations: 17

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Shengyuan Liu, Boyun Zheng, Wenting Chen, et al.
May 2025
Citations: 15

SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
Guan-Feng Wang, Wenjin Mo, Junyi Wang, et al.
arXiv.org · Jun 2025
Citations: 12

SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis
Jianhui Wei, Zikai Xiao, Danyu Sun, et al.
arXiv.org · Jun 2025
Citations: 9

Citations

Total Citations: 44
Influential: 3
References: 56

GitHub

Stars: 51
Forks: 1
Open Issues: 2
Contributors: 2
Last Push: 5mo ago
Language: Python

HuggingFace

Downloads: 21
Likes: 0
Last Modified: 9mo ago

Fields of citing research

Computer Science: 95%
Medicine: 93%
Engineering: 52%
Biology: 2%
Linguistics: 2%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

21 · Closed

Usability (can I run it?): 16

Reproducibility (can I retrain it?): 17

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model



