Self-supervised ECG foundation model that treats heartbeats as words and rhythms as sentences, using a QRS-Tokenizer and dual-level pretraining on MIMIC-IV-ECG.
HeartLang is a self-supervised foundation model for the electrocardiogram (ECG) that reframes signal modeling as a language-modeling problem. Rather than carving the waveform into fixed-length time windows—the dominant practice in deep ECG models—it treats individual heartbeats as "words" and the sequence of beats that forms a rhythm strip as a "sentence." This semantic segmentation is designed to respect the natural structure of cardiac signals, where the clinically meaningful unit is the heartbeat and its morphology, not an arbitrary slice of time.
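The contrast between fixed time windows and beat-centred "words" can be illustrated with a minimal sketch. It assumes R-peak positions are already available from some QRS detector and cuts one window around each peak. The function name `segment_beats`, the `half_width` parameter, and the zero-padding rule are illustrative assumptions, not the QRS-Tokenizer's actual interface or settings.

```python
from typing import List, Sequence


def segment_beats(
    signal: Sequence[float], r_peaks: Sequence[int], half_width: int
) -> List[List[float]]:
    """Cut one window of length 2 * half_width around each R-peak.

    Samples that fall outside the recording are zero-padded (an assumption
    made for this sketch, not a documented HeartLang behaviour).
    """
    n = len(signal)
    beats = []
    for peak in r_peaks:
        start, stop = peak - half_width, peak + half_width
        window = [signal[i] if 0 <= i < n else 0.0 for i in range(start, stop)]
        beats.append(window)
    return beats


# Toy single-lead trace: flat baseline with three spikes standing in for R-peaks.
signal = [0.0] * 30
r_peaks = [3, 14, 26]  # hypothetical detector output
for p in r_peaks:
    signal[p] = 1.0

beats = segment_beats(signal, r_peaks, half_width=4)
# Each "word" has the same length and is aligned so the R-peak sits at index half_width.
```

Because every window is aligned on the R-peak, beats with irregular spacing still yield comparable "words", which a fixed-stride slicing of the same trace would not guarantee.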
The model was introduced in "Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model" by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, and Shenda Hong from the PKUDigitalHealth group at Peking University, and accepted to ICLR 2025. It sits within the broader wave of biosignal foundation models that adapt masked-prediction pretraining—popularized by language and vision models—to physiological time series, and is distinguished by its explicitly linguistic, heartbeat-centric tokenization.
By pretraining on a large corpus of unlabeled ECGs and transferring to downstream diagnostic tasks, HeartLang aims to reduce the heavy annotation burden that has limited supervised ECG deep learning, while producing representations that capture both single-beat form and longer-range rhythm context.
HeartLang uses a transformer backbone trained in two stages. First, a vector-quantized heartbeat reconstruction stage (VQ-HBR) encodes each heartbeat produced by the QRS-Tokenizer into a discrete code drawn from an 8192-entry codebook, establishing the ECG "vocabulary." Second, a masked ECG sentence pretraining stage learns rhythm-level representations by masking heartbeat tokens in a sequence and predicting them from the remaining context. Pretraining uses the MIMIC-IV-ECG dataset from PhysioNet, a large collection of 12-lead clinical recordings, and runs for roughly 200 epochs with learning-rate scheduling; the reference implementation trains VQ-HBR on 8 NVIDIA RTX 4090 GPUs. The model is evaluated on six public ECG datasets, including diagnostic subsets of PTB-XL, CPSC2018, and the Chapman-Shaoxing-Ningbo (CSN) arrhythmia dataset, where the authors report improved downstream classification over prior self-supervised ECG baselines. Code and pretrained checkpoints (the pretraining and VQ-HBR weights) are released under an MIT license.
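The two stages can be sketched in a few lines: a nearest-neighbour codebook lookup that turns a beat embedding into a discrete token, followed by random masking of the resulting token sequence. This is a toy illustration under stated assumptions. The 4-entry codebook stands in for the 8192-entry one, and the squared Euclidean distance, the `MASK_ID` placeholder, and the 40% mask ratio are all hypothetical choices rather than values taken from the HeartLang code.

```python
import random
from typing import List, Sequence, Tuple

MASK_ID = -1  # hypothetical placeholder id for a masked heartbeat token


def quantize(embedding: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to the embedding."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(embedding, codebook[k]))


def mask_tokens(
    tokens: Sequence[int], mask_ratio: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Replace a random subset of tokens with MASK_ID; return the result and the positions."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_ID
    return masked, positions


# Toy 2-D codebook with 4 entries (the real codebook has 8192 entries).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Stand-ins for per-beat encoder outputs of one ECG "sentence".
beat_embeddings = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9], [0.0, 0.1]]

tokens = [quantize(e, codebook) for e in beat_embeddings]
masked, positions = mask_tokens(tokens, mask_ratio=0.4, rng=random.Random(0))
# Prediction targets are the original token ids at the masked positions.
targets = [tokens[p] for p in positions]
```

In a real training loop the targets would supervise a transformer that sees only the unmasked context; the sketch stops at constructing the inputs and targets.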
HeartLang targets automated ECG interpretation tasks such as multi-label diagnostic classification and arrhythmia detection. Because it is pretrained self-supervised on unlabeled recordings, it is particularly useful in settings where labeled ECGs are limited: a hospital or research group can fine-tune the released checkpoints on a modest annotated dataset rather than training from scratch. Beyond classification, its heartbeat-level discrete vocabulary and learned embeddings can serve as reusable features for downstream cardiac analysis, and the framework offers a template for applying language-model-style pretraining to other quasi-periodic physiological signals.
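Multi-label diagnosis differs from ordinary classification in that several labels can be present in one recording, so a fine-tuning head typically applies an independent sigmoid per label rather than a softmax across labels. The sketch below shows that head and its binary cross-entropy loss on a stand-in feature vector. The weights, the three-label setup, and the 0.5 decision threshold are illustrative assumptions, and nothing here reflects the released fine-tuning scripts.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_probs(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """One independent sigmoid per label, since labels are not mutually exclusive."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


def bce_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Mean binary cross-entropy over labels."""
    eps = 1e-12  # guards against log(0)
    total = sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    )
    return -total / len(probs)


# Stand-in for a pooled embedding produced by a pretrained encoder (hypothetical values).
features = [1.0, -2.0, 0.5]
# Hypothetical linear head for three diagnostic labels.
weights = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
biases = [0.0, 0.0, 0.0]

probs = multilabel_probs(features, weights, biases)
predicted = [p >= 0.5 for p in probs]  # more than one label may be True
loss = bce_loss(probs, [1, 0, 1])
```

During fine-tuning, only this head (and optionally the encoder) would be updated to reduce the loss on the annotated recordings.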
By recasting ECG modeling as learning "words" and "sentences," HeartLang contributes a distinctive, biologically motivated tokenization strategy to the rapidly growing space of biosignal foundation models, and its acceptance at ICLR 2025 reflects interest in structure-aware self-supervised approaches. The public release of code, an 8192-entry heartbeat codebook, and pretrained weights lowers the barrier for downstream ECG research. Key limitations include dependence on accurate QRS detection for tokenization—noisy or abnormal beats may be mis-segmented—and evaluation centered on standard public benchmarks, so prospective clinical validation and robustness across diverse populations and devices remain open questions.
Jin, J., et al. (2025) Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model. International Conference on Learning Representations.
DOI: 10.48550/arXiv.2502.10707