Multi-scale ECG-language pretraining model that aligns 12-lead ECG signals with clinical text at token, beat, and rhythm levels for zero-shot cardiac diagnosis.
MELP (Multi-scale ECG-Language Pretraining) is a multimodal foundation model that learns transferable representations of 12-lead electrocardiograms by aligning ECG signals with their paired free-text clinical reports. It was developed by Fuying Wang, Jiacheng Xu, and Lequan Yu at the HKU-MedAI group at The University of Hong Kong, and presented at ICML 2025. The work targets a persistent gap in ECG self-supervised learning: most prior contrastive and masked-modeling approaches treat the ECG as a single flat sequence and therefore fail to capture the signal's inherently multi-scale structure, where clinically meaningful patterns span everything from individual waveform deflections to the rhythm of an entire recording.
MELP's central idea is hierarchical cross-modal supervision. Rather than computing a single global similarity between an ECG and its report, the model aligns the two modalities at three nested scales — token, beat, and rhythm — so that fine-grained morphology and global rhythm context are each grounded in language. This mirrors how cardiologists read ECGs, reasoning simultaneously about local wave shapes (P, QRS, T) and the overall rhythm.
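To make the three nested scales concrete, the following minimal sketch builds token, beat, and rhythm representations from a sequence of per-token ECG embeddings. It is illustrative only. This summary does not describe how MELP groups tokens into beats or pools them, so the fixed-size windows, the mean pooling, and the names `mean_pool`, `build_hierarchy`, and `tokens_per_beat` are all assumptions.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_hierarchy(token_embeddings, tokens_per_beat):
    """Derive beat- and rhythm-level vectors from token-level ones.

    Assumption: a beat is a fixed-size window of consecutive tokens, and
    the rhythm vector is the mean over all beats. Both choices are
    hypothetical stand-ins, not MELP's actual grouping.
    """
    beats = [
        mean_pool(token_embeddings[i:i + tokens_per_beat])
        for i in range(0, len(token_embeddings), tokens_per_beat)
    ]
    rhythm = mean_pool(beats)
    return {"token": token_embeddings, "beat": beats, "rhythm": rhythm}


# Toy 2-dimensional token embeddings standing in for encoder outputs.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
levels = build_hierarchy(tokens, tokens_per_beat=2)
print(levels["beat"])    # one vector per beat
print(levels["rhythm"])  # one vector for the whole recording
```

Each level can then be compared against text features at a matching granularity, which is the sense in which the alignment is hierarchical rather than a single global similarity.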
By learning directly from the language of clinical reports, MELP produces an ECG encoder that can perform open-vocabulary, zero-shot classification of cardiac conditions without any task-specific labels, and that transfers efficiently to labeled downstream tasks via linear probing and fine-tuning.
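Zero-shot classification with a language-aligned encoder typically reduces to comparing one ECG embedding against the embeddings of candidate text prompts. The sketch below shows that general CLIP-style procedure with toy vectors. It does not call MELP's released code: the function names, the three-dimensional embeddings, the prompt wording, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(ecg_embedding, prompt_embeddings, temperature=0.07):
    """Score an ECG embedding against text-prompt embeddings.

    prompt_embeddings maps a label to the embedding of its text prompt.
    Returns the best label and a softmax distribution over all labels.
    """
    labels = list(prompt_embeddings)
    logits = [
        cosine(ecg_embedding, prompt_embeddings[label]) / temperature
        for label in labels
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Hypothetical embeddings; real ones would come from the two encoders.
prompts = {
    "atrial fibrillation": [0.9, 0.1, 0.0],
    "normal sinus rhythm": [0.0, 1.0, 0.1],
    "left bundle branch block": [0.1, 0.0, 1.0],
}
ecg = [0.8, 0.2, 0.1]
label, probs = zero_shot_classify(ecg, prompts)
print(label)
```

Adding a new diagnostic category only requires adding another prompt to the dictionary, which is what makes the approach open-vocabulary.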
MELP couples a transformer-based ECG encoder (built on an ECGFM-style backbone) with a biomedical text encoder derived from MedCPT-Query-Encoder. The two are trained jointly with three objectives: a global CLIP-style contrastive loss, a captioning objective, and a local alignment loss that operates over the token/beat/rhythm hierarchy, with reported loss weights of 1.0, 2.0, and 0.2 respectively. The released encoder comprises roughly 65.6M parameters and is distributed in BF16. Pretraining draws on large-scale paired 12-lead ECG and clinical-report data from MIMIC-IV-ECG. The authors evaluate on three public ECG datasets (PTB-XL, CPSC 2018, and CSN/Chapman-Shaoxing) across zero-shot classification, linear probing, and transfer-learning protocols, where MELP consistently improves over existing self-supervised baselines.
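The way the three objectives combine can be sketched as a weighted sum using the reported weights of 1.0, 2.0, and 0.2. The contrastive term below is a generic symmetric InfoNCE loss in the CLIP style, written from the standard formulation rather than taken from MELP's code. The function names, the temperature of 0.07, and the placeholder values for the captioning and local losses are assumptions.

```python
import math

# Reported weights for the contrastive, captioning, and local alignment terms.
LOSS_WEIGHTS = {"contrastive": 1.0, "caption": 2.0, "local": 0.2}


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix.

    sim[i][j] is the similarity of ECG i with report j; matched pairs
    sit on the diagonal. The temperature value is an assumption.
    """
    n = len(sim)

    def cross_entropy_rows(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [x / temperature for x in row]
            peak = max(logits)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_z - logits[i]  # negative log-prob of the true pair
        return total / n

    transposed = [list(column) for column in zip(*sim)]
    # Average the ECG-to-text and text-to-ECG directions.
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))


def total_loss(contrastive, caption, local, weights=LOSS_WEIGHTS):
    """Weighted sum of the three training objectives."""
    return (weights["contrastive"] * contrastive
            + weights["caption"] * caption
            + weights["local"] * local)


aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # matched pairs score lowest
contrastive = clip_contrastive_loss(aligned)
# The captioning and local values are placeholders, not measured losses.
print(total_loss(contrastive, caption=1.5, local=0.8))
```

With these weights the captioning objective contributes twice as much per unit of loss as the global contrastive term, while the local alignment term acts as a lighter auxiliary signal.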
MELP is aimed at automated ECG interpretation and cardiac screening, where labeled data is scarce but reports are abundant. Its zero-shot capability lets researchers query for new diagnostic categories described in plain text without curating labeled training sets, while its label-efficient representations support building classifiers for arrhythmia detection, conduction abnormalities, and other conditions from modest annotated cohorts. The pretrained encoder serves as a reusable backbone for hospitals, biomedical ML researchers, and developers building ECG analysis tools.
MELP advances the small but growing field of ECG-language foundation models by demonstrating that explicitly modeling the multi-scale structure of cardiac signals, rather than treating an ECG as a monolithic sequence, yields measurably better cross-modal alignment and stronger zero-shot and transfer performance. By open-sourcing both code and weights, the HKU-MedAI team lowers the barrier to building language-grounded ECG models. Its main limitations stem from its pretraining source: reliance on MIMIC-IV-ECG report style and population may limit generalization, and the public model card provides only sparse documentation of training data and evaluation specifics.
Wang, F., et al. (2025) From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining. International Conference on Machine Learning.
DOI: 10.48550/arXiv.2506.21803