A foundation model for 12-lead ECG that learns signal representations via multimodal contrastive pretraining against LLM-generated cardiological text.
ECG Semantic Integrator (ESI) is a foundation model for the electrocardiogram (ECG) that learns signal representations by aligning 12-lead waveforms with rich, machine-generated cardiological text. Most ECG self-supervised methods rely on signal-only objectives (masking or signal augmentations) that capture morphology but miss the clinical semantics a cardiologist would read off a trace. ESI addresses this gap by pairing each recording with detailed natural language descriptions and training the encoder to bring the two modalities into a shared embedding space.
The method has two parts. The Cardio Query Assistant (CQA) is a retrieval-augmented generation (RAG) pipeline that prompts a large language model to write per-recording descriptions, grounding the text in retrieved cardiological knowledge and conditioning on demographic and waveform-derived information so the captions reflect the specific signal rather than generic boilerplate. The ESI stage then pretrains a 1D ECG encoder against these captions using a combination of contrastive and captioning objectives.
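As a rough illustration of how contrastive and captioning objectives can be combined, the sketch below computes a CLIP-style symmetric InfoNCE loss over a batch of paired ECG and caption embeddings and adds a mean token negative log-likelihood for captioning. The source says only that the two objectives are combined, so the symmetric formulation, the temperature of 0.07, the `caption_weight` parameter, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(ecg_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE; row i of each input is a matched pair.

    The temperature value is an assumed default, not taken from the paper.
    """
    ecg = [l2_normalize(v) for v in ecg_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(ecg)
    # logits[i][j]: scaled cosine similarity between ECG i and caption j
    logits = [[sum(a * b for a, b in zip(e, t)) / temperature for t in txt]
              for e in ecg]
    # ECG -> text direction: the correct caption for row i is column i
    loss_e2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # text -> ECG direction: same computation on the transposed matrix
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2e = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_e2t + loss_t2e)


def captioning_loss(token_log_probs):
    """Mean negative log-likelihood of the reference caption tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def total_loss(contrastive, captioning, caption_weight=1.0):
    """Weighted sum of the two objectives; the weight is hypothetical."""
    return contrastive + caption_weight * captioning
```

With well-aligned pairs the contrastive term approaches zero, while mismatched pairs are penalized heavily, which is the pressure that pulls the two modalities into a shared embedding space.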
ESI was developed by Han Yu, Peikun Guo, and Akane Sano in the Computational Wellbeing lab at Rice University. It was released as an arXiv preprint in May 2024 and published in Transactions on Machine Learning Research (TMLR) later in 2024.
ESI couples a modified 1D ConvNeXt-V2 ECG encoder with a BioLinkBERT text encoder (michiyasunaga/BioLinkBERT-base) and trains them jointly with contrastive and captioning losses. The CQA component builds the text corpus through retrieval-augmented generation over cardiological references, using demographic and waveform information to tailor each description. Pretraining was run on the MIMIC-IV-ECG database on 4 NVIDIA A100 GPUs, using AdamW with a 5-epoch warm-up followed by a step-decay schedule. On the two downstream evaluations, arrhythmia detection and ECG-based subject identification, the authors report substantial improvements over strong baselines (supervised training, signal-only self-supervised methods, and prior multimodal ECG approaches).
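The warm-up plus step-decay schedule mentioned above can be sketched as a small function of the epoch index. Only the 5-epoch warm-up length comes from the description; the linear warm-up shape, `base_lr`, `step_size`, and `gamma` are placeholder assumptions, and the values actually used for ESI may differ.

```python
def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5, step_size=10, gamma=0.1):
    """Learning rate for a zero-indexed epoch.

    Linear warm-up over `warmup_epochs`, then the rate is multiplied by
    `gamma` every `step_size` epochs. All defaults except the 5-epoch
    warm-up are hypothetical.
    """
    if epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # Count how many full decay steps have elapsed since warm-up ended
    num_decays = (epoch - warmup_epochs) // step_size
    return base_lr * gamma ** num_decays


if __name__ == "__main__":
    for epoch in (0, 4, 5, 15, 25):
        print(epoch, learning_rate(epoch))
```

In a real training loop the returned value would typically be written into the optimizer's parameter groups at the start of each epoch.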
ESI targets researchers building ECG analysis systems who want representations that transfer across tasks with limited labeled data. The pretrained encoder can be fine-tuned or linearly probed for arrhythmia classification, and its embeddings support biometric subject identification from ECG. More broadly, the CQA-plus-contrastive recipe is a template for injecting clinical text knowledge into other biosignal encoders, benefiting groups that have abundant raw physiological recordings but sparse expert annotations.
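One generic way embeddings can support subject identification is nearest-template matching: average each enrolled subject's embeddings into a template, then assign a query recording to the subject whose template is most similar. The sketch below assumes embeddings have already been extracted by some encoder and uses cosine similarity; it illustrates the general idea only and does not reflect the evaluation protocol the authors used. The names `enroll` and `identify` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enroll(embeddings_by_subject):
    """Average each subject's embeddings into a single template vector."""
    templates = {}
    for subject, vectors in embeddings_by_subject.items():
        dim = len(vectors[0])
        templates[subject] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return templates


def identify(query, templates):
    """Return the enrolled subject whose template is closest to the query."""
    return max(templates, key=lambda subject: cosine(query, templates[subject]))
```

A usage example with toy two-dimensional vectors standing in for encoder outputs:

```python
templates = enroll({
    "subject_a": [[1.0, 0.0], [0.9, 0.1]],
    "subject_b": [[0.0, 1.0], [0.1, 0.9]],
})
print(identify([0.8, 0.2], templates))
```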
ESI demonstrates that LLM-generated, retrieval-grounded clinical text can serve as an effective supervisory signal for physiological time series, extending the text-image contrastive paradigm into the biosignal domain. The work is published in TMLR with code released under GPL-3.0. A practical caveat for adopters: pretraining depends on MIMIC-IV-ECG, which is available only through credentialed access on PhysioNet, so reproducing the full pipeline requires users to obtain their own data access, even though the authors share a pretrained checkpoint. This dependence on a single restricted-access corpus, together with evaluation on only two downstream tasks, is the main limitation to weigh when assessing generalization.
Yu, H., et al. (2024). ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text. Trans. Mach. Learn. Res.
DOI: 10.48550/arXiv.2405.19366