China University of Geosciences / Beijing Normal University / Southern University of Science and Technology
A multimodal ECG-language model that aligns 12-lead ECG waveforms with clinical text for conversational cardiac diagnosis and automated report generation.
ECG-Chat is a multimodal large language model built to bring conversational AI to the electrocardiogram (ECG), the most common cardiac diagnostic test. While general-purpose multimodal LLMs handle natural images and text well, they struggle with the time-series structure of physiological waveforms and with the specialized vocabulary of cardiology. ECG-Chat addresses this gap by jointly modeling raw 12-lead ECG signals and clinical report text, enabling multi-turn dialogue about a patient's recording, automated report drafting, and cardiac disease classification within a single system.
The model was introduced by Yubao Zhao and colleagues at China University of Geosciences, with collaborators at Beijing Normal University, Southern University of Science and Technology, ESIGELEC, and the University of Liverpool. The work was first released as a preprint in August 2024 and subsequently accepted to the 2025 IEEE International Conference on Multimedia and Expo (ICME).
ECG-Chat combines two ideas that have driven recent progress in medical AI: contrastive signal-text pretraining (in the spirit of CLIP) to learn aligned ECG and report representations, and a LLaVA-style instruction-tuned LLM that grounds a language model in those learned signal features. The result is a system that can both retrieve relevant reports in a zero-shot setting and generate free-text diagnostic narratives.
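The contrastive half of this recipe can be illustrated with a minimal sketch of a CLIP-style symmetric loss. This is not the authors' training code: the function names, the temperature value of 0.07, and the toy 2-dimensional embeddings are assumptions chosen for illustration, and the sketch uses plain Python lists rather than a deep learning framework. It also covers only the contrastive term, not any captioning objective a CoCa-style model would add.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def symmetric_contrastive_loss(ecg_embs, text_embs, temperature=0.07):
    """CLIP-style loss where the i-th ECG and the i-th report are the positive pair.

    The temperature default is an illustrative assumption, not a value from the paper.
    """
    ecg = [l2_normalize(v) for v in ecg_embs]
    txt = [l2_normalize(v) for v in text_embs]
    n = len(ecg)
    # logits[i][j]: cosine similarity of ECG i and report j, divided by temperature
    logits = [[sum(a * b for a, b in zip(e, t)) / temperature for t in txt] for e in ecg]
    # ECG-to-report direction: softmax over each row
    loss_e2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Report-to-ECG direction: softmax over each column
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2e = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_e2t + loss_t2e)

# Toy batch: matched pairs give a near-zero loss, mismatched pairs a large one.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each ECG embedding toward its own report and away from the other reports in the batch, which is what later makes zero-shot retrieval possible.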
ECG-Chat is built in two stages. First, the ECG-CoCa encoder is contrastively pretrained on paired ECG-report data drawn from five public datasets: MIMIC-IV-ECG, the Chapman-Shaoxing-Ningbo collection, Shandong Provincial Hospital (SPH), PTB-XL, and CPSC2018, with augmentation via the torch_ecg library. Second, the learned ECG features are fed, LLaVA-style, into a Vicuna-13B LLM that is instruction-tuned (with LoRA) on the GPT-4o-curated ECG-Instruct corpus. On zero-shot ECG-report retrieval over the PTB-XL test set (2K samples), the CoCa encoder with waveform-driven embedding reaches Recall@1 of 64.7% (ECG-to-report) and 71.6% (report-to-ECG). On report generation, the model scores BLEU-4 of 11.19 and ROUGE-L of 29.93, with disease and rhythm F1 of 22.33 and 43.39. Zero-shot classification F1 reaches 52.8% on PTB-XL and 80.1% on CPSC2018.
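For readers unfamiliar with the retrieval metric, the following is a small sketch of how Recall@K can be computed from a similarity matrix. It assumes each ECG has exactly one matching report stored at the same index, which may differ from the paper's evaluation protocol; the function names and the toy similarity values are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to query i
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

def transpose(matrix):
    """Swap rows and columns so reports become the queries."""
    return [list(col) for col in zip(*matrix)]

# Toy similarity matrix: sim[i][j] = similarity of ECG i and report j
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.9],
]
ecg_to_report_r1 = recall_at_k(sim, k=1)             # ECGs as queries
report_to_ecg_r1 = recall_at_k(transpose(sim), k=1)  # reports as queries
```

In this toy matrix the two directions give different values, which illustrates why ECG-to-report and report-to-ECG Recall@1 are reported separately.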
ECG-Chat targets clinical and research workflows where ECG interpretation is a bottleneck. Cardiologists and general clinicians could use it as a drafting assistant that generates a first-pass report and answers follow-up questions about rhythm, conduction, and disease findings, while researchers can use the ECG-CoCa embeddings for zero-shot screening, cohort retrieval, and transfer learning across ECG datasets. Its conversational interface also makes it a candidate for medical education and for triage settings where a structured, explainable summary is more useful than a single label.
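Zero-shot screening with aligned embeddings can be sketched as a nearest-label lookup: embed one text prompt per candidate diagnosis, then pick the label whose embedding is closest to the ECG embedding. The sketch below assumes the embeddings have already been computed; the 3-dimensional vectors, the label names, and the function names are hypothetical stand-ins, not outputs of the released encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(ecg_embedding, label_embeddings):
    """Return (best_label, scores) by cosine similarity to each label's text embedding."""
    scores = {label: cosine(ecg_embedding, emb) for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy vectors standing in for text-encoder outputs of label prompts
label_embeddings = {
    "atrial fibrillation": [0.9, 0.1, 0.0],
    "normal sinus rhythm": [0.0, 1.0, 0.1],
    "left bundle branch block": [0.1, 0.0, 1.0],
}
# Hypothetical toy vector standing in for an ECG-encoder output
ecg_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(ecg_embedding, label_embeddings)
```

Because no label-specific classifier is trained, adding a new diagnosis only requires embedding a new prompt, which is what makes this approach attractive for screening and cohort retrieval.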
ECG-Chat is part of a wave of waveform-language foundation models extending the vision-language recipe to physiological signals, and it demonstrates that contrastive signal-text pretraining plus instruction tuning can support both retrieval and generative diagnosis from raw ECGs. Its acceptance at ICME 2025 and open release of code and pretrained checkpoints lower the barrier for follow-up work. Important caveats remain: the model is a research prototype validated on retrospective public benchmarks rather than in prospective clinical trials, report-generation BLEU/ROUGE scores indicate meaningful room for improvement, and the released weights are distributed via Google Drive without a stated software license, which may limit reuse.
Zhao, Y., et al. (2024) ECG-Chat: A Large ECG-Language Model for Cardiac Disease Diagnosis. IEEE International Conference on Multimedia and Expo. DOI: 10.1109/ICME59968.2025.11209476
Zhao, Y., et al. (2024) ECG-Chat: A Large ECG-Language Model for Cardiac Disease Diagnosis. IEEE International Conference on Multimedia and Expo. DOI: 10.48550/arXiv.2408.08849