Singapore Management University / Eindhoven University of Technology
Contrastive masked ECG-text auto-encoder pretrained on paired electrocardiograms and clinical reports, enabling label-efficient and zero-shot cardiac diagnosis.
D-BETA (Discriminative masked ECG-Text Auto-Encoder) is a self-supervised foundation model for the 12-lead electrocardiogram (ECG) that learns from raw waveforms paired with their free-text clinical reports. ECG interpretation is a cornerstone of cardiac care, but supervised deep-learning models typically require large, expensively labeled datasets and generalize poorly across the heterogeneous acquisition protocols and patient populations found in different hospitals. D-BETA addresses this by pretraining on signal-report pairs without manual diagnostic labels, producing transferable representations that perform well even when only a tiny fraction of labeled data is available downstream.
The model's central idea is to combine the complementary strengths of generative and discriminative self-supervision. Most prior ECG-text models rely on either masked reconstruction (generative) or cross-modal contrastive alignment (discriminative) alone. D-BETA unifies both in a contrastive masked auto-encoder, reconstructing masked ECG segments while simultaneously aligning ECG and text embeddings, and "boosts" the discriminative signal with a tailored negative-sampling strategy and dedicated loss functions for cross-modal matching.
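The way a reconstruction term and a contrastive term can be combined is sketched below in plain Python. This is a generic stand-in, not D-BETA's published loss: it uses mean squared error on masked samples and a standard InfoNCE term, whereas the model's tailored negative sampling and dedicated matching losses are not reproduced here. The function names, the `alpha` weight, and the temperature value are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]

def masked_mse(original, reconstructed, mask):
    """Generative term: squared error averaged over masked positions only."""
    errors = [(o - r) ** 2 for o, r, m in zip(original, reconstructed, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0

def info_nce(ecg_embs, text_embs, temperature=0.07):
    """Discriminative term: the i-th ECG should match the i-th report.

    Other reports in the batch act as negatives (plain in-batch negatives,
    not the paper's tailored sampling strategy).
    """
    ecg = [normalize(e) for e in ecg_embs]
    txt = [normalize(t) for t in text_embs]
    total = 0.0
    for i, e in enumerate(ecg):
        logits = [dot(e, t) / temperature for t in txt]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(ecg)

def combined_loss(original, reconstructed, mask, ecg_embs, text_embs, alpha=1.0):
    """Sum of both objectives; alpha is a hypothetical weighting factor."""
    return masked_mse(original, reconstructed, mask) + alpha * info_nce(ecg_embs, text_embs)
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when ECGs and reports are mismatched, so minimizing the sum pushes the encoder toward both faithful reconstruction and cross-modal alignment.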
D-BETA was developed by Hung Manh Pham and Dong Ma at Singapore Management University with Aaqib Saeed at Eindhoven University of Technology, and was accepted at ICML 2025 (first released as a preprint in October 2024).
D-BETA pairs a transformer-based ECG encoder with a text encoder. The ECG branch uses eight transformer encoder layers with multi-head self-attention operating over 12-lead waveforms, while the text branch is built on a Flan-T5-base encoder that produces 768-dimensional embeddings; the model's cross-modal output features are likewise 768-dimensional. Pretraining uses the MIMIC-IV-ECG v1.0 dataset, which comprises 800,035 ECG-report pairs from 161,352 unique subjects (779,891 samples after processing). The model is evaluated on five public benchmarks spanning diverse downstream tasks and populations: PhysioNet 2021, PTB-XL, CSN (Chapman-Shaoxing-Ningbo), CPSC2018, and CODE-test. Across these, it reports an average AUC improvement of about 15% in linear probing with only 1% of the labeled training data and about 2% in zero-shot evaluation, relative to prior ECG-text models. Pretrained weights are released on Hugging Face and load via the transformers AutoModel API with trust_remote_code=True (license listed as "Other").
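A minimal loading sketch follows, assuming the checkpoint is a PyTorch model exposed through the standard `AutoModel.from_pretrained` call. The helper name `load_dbeta` and its parameters are hypothetical, and the repository ID is deliberately left as an argument because it is not given here. The expected input format and forward signature are defined by the repository's own code and are not shown.

```python
def load_dbeta(repo_id, device="cpu"):
    """Load a D-BETA checkpoint from the Hugging Face Hub (hypothetical helper).

    repo_id: the model repository ID, taken from the model's Hugging Face page.
    device:  target device string such as "cpu" or "cuda".
    """
    # Imported lazily so this file can be read or imported without transformers.
    from transformers import AutoModel

    # trust_remote_code=True runs Python code shipped in the repository,
    # so review that code (and the "Other" license) before loading.
    model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
    model.eval()  # inference mode: disables dropout
    return model.to(device)

# Usage (substitute the real repository ID from the model page):
# model = load_dbeta("<repository-id>")
```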
D-BETA targets ECG-based cardiac screening and diagnosis in settings where labeled data are scarce. Its label-efficient linear probing makes it well suited to adapting a single pretrained backbone to new arrhythmia or cardiac-condition classification tasks with minimal annotation, while its zero-shot mode allows practitioners to query for conditions described in natural language without any fine-tuning. Researchers can use it as a general-purpose ECG feature extractor for downstream clinical machine-learning pipelines, and its cross-modal design supports report-aware retrieval and analysis of paired signal-text records.
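Zero-shot querying of this kind is commonly implemented by embedding each candidate condition as a text prompt and ranking the prompts by cosine similarity to the ECG embedding. The sketch below illustrates that general recipe with toy 3-dimensional vectors in place of the model's 768-dimensional features. The function names, prompts, and numbers are made up for illustration, and the model's actual scoring head may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def zero_shot_rank(ecg_embedding, prompt_embeddings):
    """Rank candidate conditions by similarity to one ECG embedding.

    prompt_embeddings maps a condition label to the embedding of its
    natural-language prompt. Returns (label, score) pairs, best first.
    """
    scores = {label: cosine(ecg_embedding, emb)
              for label, emb in prompt_embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy stand-ins for encoder outputs (real features would be 768-dimensional).
toy_prompts = {
    "atrial fibrillation": [1.0, 0.0, 0.0],
    "sinus rhythm": [0.0, 1.0, 0.0],
    "left bundle branch block": [0.0, 0.0, 1.0],
}
toy_ecg = [0.9, 0.1, 0.0]

for label, score in zero_shot_rank(toy_ecg, toy_prompts):
    print(f"{label}: {score:.3f}")
```

Because no classifier is trained, adding a new condition only requires embedding one more prompt.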
D-BETA contributes to a growing body of multimodal ECG foundation models that pair physiological signals with clinical text, demonstrating that unifying generative reconstruction with boosted contrastive learning yields markedly more label-efficient representations than either objective alone. Its ICML 2025 acceptance, public code, and openly released pretrained checkpoints lower the barrier for reproducible ECG representation-learning research. Key limitations to note are that pretraining draws on a single source (MIMIC-IV-ECG), which may constrain generalization to acquisition setups and populations underrepresented in that corpus, and that the released weights carry a non-standard "Other" license requiring users to verify usage terms.
Pham, H. M., et al. (2024). Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners. arXiv preprint; accepted at the International Conference on Machine Learning (ICML), 2025.
DOI: 10.48550/arXiv.2410.02131