Chinese University of Hong Kong / Huawei / Technical University of Munich / University of Strasbourg / Shandong University / Chinese Academy of Sciences
Grounded multimodal large language model for endoscopic surgery, supporting visual dialogue, region-based question answering, and bounding-box grounding across surgical scene understanding tasks.
EndoChat is a grounded multimodal large language model (MLLM) purpose-built for endoscopic and robot-assisted surgery. While general-purpose vision-language models have become capable at everyday image understanding, they struggle with the specialized visual content, fine spatial grounding, and domain vocabulary of surgical scenes. EndoChat targets this gap by coupling a surgical vision encoder with a large language backbone so that surgeons and trainees can interrogate live endoscopic imagery through natural-language dialogue, including questions that demand precise spatial localization of instruments and tissue.
The model was introduced in January 2025 by a multi-institutional team led by the Department of Electronic Engineering at the Chinese University of Hong Kong, with collaborators at Huawei's Theory Lab, the Technical University of Munich, the University of Strasbourg (CNRS, INSERM, ICube & IHU Strasbourg), Qilu Hospital of Shandong University, and the Centre for Artificial Intelligence and Robotics under the Chinese Academy of Sciences. It is positioned alongside other surgical-vision foundation models such as Endo-FM, but distinguishes itself as a full conversational, grounding-capable MLLM rather than a single-task perception model.
To train and evaluate the system, the authors assembled Surg-396K, a large multimodal instruction dataset of roughly 396,000 image-instruction pairs derived from existing large-scale endoscopic surgery datasets through an automated annotation pipeline. This dataset underpins the model's broad coverage of dialogue styles and scene-understanding tasks.
EndoChat is built on the LLaMA2-13B language backbone (via the LLaMA2-Accessory framework) paired with a Mixed Visual Token Engine that fuses DINOv2 and OpenCLIP visual features at multiple scales instead of relying on a single pretrained vision transformer. The model is instruction-tuned on Surg-396K, which spans five dialogue paradigms and eight surgical scene-understanding tasks. On these benchmarks the authors report state-of-the-art performance relative to prior surgical and general-purpose MLLMs, and in a human evaluation professional surgeons rated the majority of EndoChat's generated conversations positively.
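The idea of fusing two encoders' features at several scales can be illustrated with a small shape-bookkeeping sketch. It is not the authors' implementation: it assumes, purely for illustration, that the two encoders' tokens are concatenated along the feature axis at each scale and that the scales are then joined along the sequence axis. The names `fake_encoder`, `fuse_channels`, and `mixed_visual_tokens` are hypothetical, and constant vectors stand in for real DINOv2 and OpenCLIP outputs.

```python
from typing import List

Token = List[float]


def fake_encoder(num_tokens: int, dim: int, fill: float) -> List[Token]:
    """Hypothetical stand-in for a vision encoder: num_tokens vectors of size dim."""
    return [[fill] * dim for _ in range(num_tokens)]


def fuse_channels(a: List[Token], b: List[Token]) -> List[Token]:
    """Concatenate two token sequences along the feature (channel) axis."""
    if len(a) != len(b):
        raise ValueError("both encoders must yield the same number of tokens")
    return [ta + tb for ta, tb in zip(a, b)]


def mixed_visual_tokens(tokens_per_scale: List[int], dim_a: int, dim_b: int) -> List[Token]:
    """Fuse the two encoders at each scale, then join scales along the sequence axis."""
    mixed: List[Token] = []
    for n in tokens_per_scale:
        feats_a = fake_encoder(n, dim_a, fill=1.0)  # placeholder for DINOv2-style features
        feats_b = fake_encoder(n, dim_b, fill=2.0)  # placeholder for OpenCLIP-style features
        mixed.extend(fuse_channels(feats_a, feats_b))
    return mixed


tokens = mixed_visual_tokens(tokens_per_scale=[4, 16], dim_a=3, dim_b=5)
print(len(tokens), len(tokens[0]))  # 20 8
```

Under these assumptions the language model would receive one sequence whose length is the sum of the per-scale token counts and whose feature width is the sum of the two encoder widths. A real system would typically add a learned projection into the language model's embedding space, which is omitted here.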
EndoChat is intended for surgical training, intraoperative guidance, and scene comprehension in robot-assisted endoscopic procedures. A trainee can ask the model to describe a scene, count or identify instruments, or localize a structure with a bounding box, while researchers can use it as a grounded baseline for surgical vision-language tasks. The grounding capability makes it suitable for building explainable assistance tools where answers must be tied to specific image regions rather than offered as ungrounded text.
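To make the notion of a grounded answer concrete, the sketch below parses a bounding box out of a text reply and scores it against a reference box with intersection-over-union (IoU), a standard localization metric. The reply format is an assumption made for illustration (four coordinates in square brackets, normalized to the 0 to 1 range); EndoChat's actual output format may differ, and `parse_box` and `iou` are hypothetical helpers, not part of any released API.

```python
import re
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Assumed reply format: "[x1, y1, x2, y2]" with normalized coordinates.
_NUM = r"(\d*\.?\d+)"
_BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def parse_box(answer: str, width: int, height: int) -> Optional[Box]:
    """Extract the first bracketed box from a reply and scale it to pixels."""
    match = _BOX_PATTERN.search(answer)
    if match is None:
        return None  # the reply contained no box in the assumed format
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    return (x1 * width, y1 * height, x2 * width, y2 * height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


reply = "The grasper is located at [0.25, 0.25, 0.5, 0.5]."  # made-up example reply
predicted = parse_box(reply, width=200, height=200)
reference = (50.0, 50.0, 100.0, 150.0)  # made-up reference annotation
print(predicted)                  # (50.0, 50.0, 100.0, 100.0)
print(iou(predicted, reference))  # 0.5
```

Tying each answer to a box in this way lets a tool overlay the region on the endoscopic frame, so a user can check whether the text and the image evidence agree.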
EndoChat extends the reach of multimodal foundation models into the demanding domain of surgical endoscopy, where spatial precision and reliability are essential. By releasing both the Surg-396K dataset and model weights, the authors lower the barrier to research on conversational surgical assistants and provide a reusable benchmark for grounded surgical understanding. Its emphasis on hallucination reduction and bounding-box grounding addresses two of the most pressing obstacles to clinical adoption of MLLMs, though, as with all such systems, real-world deployment will require rigorous prospective validation before any clinical use.
Wang, G., et al. (2025). EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery. Medical Image Anal.
DOI: 10.48550/arXiv.2501.11347