MELP

Multi-scale ECG-language model that aligns 12-lead ECG signals with clinical text at token, beat, and rhythm levels for zero-shot cardiac diagnosis.

Released: June 2025

Parameters: 65.6 Million

MELP (Multi-scale ECG-Language Pretraining) is a multimodal foundation model that learns transferable representations of 12-lead electrocardiograms by aligning ECG signals with their paired free-text clinical reports. It was developed by Fuying Wang, Jiacheng Xu, and Lequan Yu at the HKU-MedAI group at The University of Hong Kong, and presented at ICML 2025. The work targets a persistent gap in ECG self-supervised learning: most prior contrastive and masked-modeling approaches treat the ECG as a single flat sequence and therefore fail to capture the signal's inherently multi-scale structure, where clinically meaningful patterns span everything from individual waveform deflections to the rhythm of an entire recording.

MELP's central idea is hierarchical cross-modal supervision. Rather than computing a single global similarity between an ECG and its report, the model aligns the two modalities at three nested scales — token, beat, and rhythm — so that fine-grained morphology and global rhythm context are each grounded in language. This mirrors how cardiologists read ECGs, reasoning simultaneously about local wave shapes (P, QRS, T) and the overall rhythm.
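To make the three scales concrete, the sketch below shows one way per-token ECG features could be pooled into beat-level and rhythm-level vectors. It is an illustration only: the mean pooling, the fixed `tokens_per_beat` grouping, and the function names are assumptions made for this example and are not taken from the MELP code.

```python
from statistics import fmean


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [fmean(column) for column in zip(*vectors)]


def build_hierarchy(token_features, tokens_per_beat):
    """Group token vectors into beats, then pool the beats into one rhythm vector.

    token_features: list of per-token feature vectors (finest scale).
    tokens_per_beat: hypothetical fixed number of tokens per heartbeat.
    """
    beats = [
        mean_pool(token_features[i:i + tokens_per_beat])
        for i in range(0, len(token_features), tokens_per_beat)
    ]
    rhythm = mean_pool(beats)
    return {"token": token_features, "beat": beats, "rhythm": rhythm}


# Toy input: 6 tokens with 2-dimensional features, 3 tokens per beat.
tokens = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 5.0], [0.0, 6.0]]
levels = build_hierarchy(tokens, tokens_per_beat=3)
print(len(levels["token"]), len(levels["beat"]), levels["rhythm"])
```

Each level of the returned dictionary could then be matched against text at a corresponding granularity, which is the general idea of hierarchical supervision described above.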

By learning directly from the language of clinical reports, MELP produces an ECG encoder that can perform open-vocabulary, zero-shot classification of cardiac conditions without any task-specific labels, and that transfers efficiently to labeled downstream tasks via linear probing and fine-tuning.
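Zero-shot classification with a language-aligned encoder usually reduces to comparing one ECG embedding against the embeddings of candidate text prompts. The following minimal sketch shows that comparison with cosine similarity and a softmax. The vectors are placeholders standing in for encoder outputs, and the temperature of 0.07 and the function names are assumptions, not values or APIs from the MELP release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(ecg_embedding, prompt_embeddings, temperature=0.07):
    """Return (best_label, probabilities) from a softmax over scaled similarities."""
    labels = list(prompt_embeddings)
    logits = [cosine(ecg_embedding, prompt_embeddings[name]) / temperature for name in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Placeholder vectors; in practice these would come from the text and ECG encoders.
prompts = {
    "atrial fibrillation": [0.9, 0.1, 0.0],
    "normal sinus rhythm": [0.0, 0.2, 0.9],
    "left bundle branch block": [0.1, 0.9, 0.1],
}
ecg = [0.8, 0.2, 0.1]
label, probs = zero_shot_classify(ecg, prompts)
print(label)
```

Adding a new diagnostic category in this setup only requires adding another prompt embedding to the dictionary. A softmax assumes exactly one label per recording, so multi-label settings would need per-prompt thresholds instead.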

Key Features

Hierarchical multi-scale alignment: Cross-modal supervision is applied at the token, beat, and rhythm levels, jointly capturing local waveform morphology and global rhythm rather than a single coarse ECG-report match.
Zero-shot ECG classification: Because the encoder is grounded in clinical text, it classifies arbitrary cardiac conditions described in natural language without retraining, outperforming prior ECG SSL methods across three public benchmarks.
Strong label efficiency: Linear probing and transfer-learning evaluations show that the learned representations adapt to new datasets with limited labeled data.
Clinically grounded text encoder: MELP pairs an ECG encoder with a biomedical language model (MedCPT-Query-Encoder) to embed report text into a shared representation space.
Open release: Code is released under the MIT License and pretrained encoder weights are available on HuggingFace under Apache-2.0.

Technical Details

MELP couples a transformer-based ECG encoder (built on an ECGFM-style backbone) with a biomedical text encoder derived from MedCPT-Query-Encoder. The two are trained jointly with three objectives: a global CLIP-style contrastive loss, a captioning objective, and a local alignment loss that operates over the token/beat/rhythm hierarchy, with reported loss weights of 1.0, 2.0, and 0.2 respectively. The released encoder comprises roughly 65.6M parameters and is distributed in BF16. Pretraining draws on large-scale paired 12-lead ECG and clinical-report data from MIMIC-IV-ECG. The authors evaluate on three public ECG datasets (PTB-XL, CPSC 2018, and CSN/Chapman-Shaoxing) under zero-shot classification, linear probing, and transfer-learning protocols, where MELP is reported to improve consistently over existing self-supervised baselines.
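The way the three objectives combine can be sketched as a weighted sum, with the contrastive term written as a symmetric InfoNCE loss over an ECG-by-report similarity matrix. This is a generic CLIP-style formulation under stated assumptions, not the authors' implementation: the temperature of 0.07, the toy similarity values, and the placeholder captioning and local-alignment numbers are all hypothetical. Only the weights 1.0, 2.0, and 0.2 come from the description above.

```python
import math


def cross_entropy_diagonal(sim, temperature):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(sim)


def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE: average of ECG-to-text and text-to-ECG cross-entropy."""
    transposed = [list(column) for column in zip(*sim)]
    ecg_to_text = cross_entropy_diagonal(sim, temperature)
    text_to_ecg = cross_entropy_diagonal(transposed, temperature)
    return 0.5 * (ecg_to_text + text_to_ecg)


def total_loss(contrastive, captioning, local_alignment, weights=(1.0, 2.0, 0.2)):
    """Weighted sum of the three objectives using the reported loss weights."""
    w_con, w_cap, w_loc = weights
    return w_con * contrastive + w_cap * captioning + w_loc * local_alignment


# Toy batch of 3 ECG-report pairs; matched pairs sit on the diagonal.
similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.7],
]
contrastive = clip_contrastive_loss(similarity)
# The captioning and local-alignment values are placeholders for illustration.
print(total_loss(contrastive, captioning=1.5, local_alignment=0.4))
```

With these weights the captioning term contributes twice as much per unit of loss as the global contrastive term, while the local alignment term acts as a comparatively light regularizer.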

Applications

MELP is aimed at automated ECG interpretation and cardiac screening, where labeled data is scarce but reports are abundant. Its zero-shot capability lets researchers query for new diagnostic categories described in plain text without curating labeled training sets, while its label-efficient representations support building classifiers for arrhythmia detection, conduction abnormalities, and other conditions from modest annotated cohorts. The pretrained encoder serves as a reusable backbone for hospitals, biomedical ML researchers, and developers building ECG analysis tools.
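Label-efficient adaptation is commonly evaluated by linear probing: the encoder stays frozen and only a linear classifier is trained on its embeddings. The sketch below fits a logistic-regression probe with plain gradient descent. The two-dimensional embeddings, labels, learning rate, and epoch count are placeholder assumptions chosen so the example runs without any dependencies; they do not reflect MELP's actual embedding size or evaluation setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on frozen features with batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Placeholder embeddings standing in for frozen encoder outputs (1 = abnormal).
train_x = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
train_y = [1, 1, 1, 0, 0, 0]
w, b = train_linear_probe(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(train_x, train_y)) / len(train_y)
print(accuracy)
```

Because only the probe's weights and bias are learned, the number of trainable parameters scales with the embedding dimension rather than with the size of the encoder, which is why this protocol suits small annotated cohorts.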

Impact

MELP advances the small but growing field of ECG-language foundation models by demonstrating that explicitly modeling the multi-scale structure of cardiac signals, rather than treating an ECG as a monolithic sequence, yields measurably better cross-modal alignment and stronger zero-shot and transfer performance. By open-sourcing both code and weights, the HKU-MedAI team lowers the barrier to building language-grounded ECG models. MELP's main limitations stem from its pretraining source: reliance on MIMIC-IV-ECG report style and population may limit generalization, and the public model card provides only sparse documentation of training data and evaluation specifics.

Citation

From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining

Preprint

Wang, F., et al. (2025) From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining. International Conference on Machine Learning.

DOI: 10.48550/arXiv.2506.21803

Recent citations

Papers that recently cited this model.

Signal or Noise? Understanding Generative Models for Real-World Sensor Time Series
Zitao Shuai, Zongzhe Xu, Yuntian Wu, et al.
Jul 2026
Citations: 0

Learning Cardiac Latent Representations in Vectorcardiogram Space
Bosong Huang, Panzhen Zhao, Zengxiang Li, et al.
May 2026
Citations: 0

Information-theoretic Multimodal Representation Learning for Electrocardiogram Signals
Phu X. Nguyen, Konstantinos Kontras, Wei Dai, et al.
May 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

The Rlign algorithm for enhanced electrocardiogram analysis through heart rate–corrected ECG alignment for explainable classification and clustering
L. Plagwitz, Lucas Bickmann, Michael Fujarski, et al.
European Heart Journal - Digital Health · Jul 2024
Citations: 4

Toward robust automated cardiovascular arrhythmia detection using self-supervised learning and 1-dimensional vision transformers
Mitchell Chatterjee, Adrian D. C. Chan, Majid Komeili
Scientific Reports · Mar 2026
Citations: 2

AnyPPG: An ECG-Guided PPG Foundation Model Trained on Over 100,000 Hours of
Guangkun Nie, G. Tang, Yujie Xiao, et al.
Citations: 1

PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL
Sheng Wong, Ravi Shankar, B. Albert, et al.
Apr 2026
Citations: 0

PanLUNA: An Efficient and Robust Query-Unified Multimodal Model for Edge Biosignal Intelligence
M. Zelic, Anna Tegon, Yawei Li, et al.
Apr 2026
Citations: 0

Citations

Total Citations: 18
Influential: 2
References: 48

GitHub

Stars: 31
Forks: 5
Open Issues: 3
Contributors: 1
Last Push: 4 months ago
Language: Python
License: MIT

HuggingFace

Downloads: 18
Likes: 1
Last Modified: 10 months ago
Pipeline: audio-text-to-text

Fields of citing research

Computer Science: 94%
Medicine: 82%
Engineering: 35%
Biology: 6%
Linguistics: 6%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
68 (Partial)
Usability (can I run it?): 78
Reproducibility (can I retrain it?): 55

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository, Research Paper, HuggingFace Model
