National University of Singapore / Peking University
Multimodal LLM unifying 12-lead ECG time series, ECG images, and text for grounded, clinician-aligned electrocardiogram interpretation.
GEM (Grounded ECG understanding with Multimodal LLM) is a multimodal large language model for electrocardiogram (ECG) interpretation that jointly reasons over three modalities: raw 12-lead ECG time series, rendered ECG images, and natural-language text. Most prior ECG language models consume only a single modality—typically either the waveform signal or a scanned image—and produce free-text diagnoses that are difficult to verify against the underlying signal. GEM is presented as the first MLLM to unify time series, images, and text for grounded ECG analysis, meaning its diagnostic statements are tied back to specific, measurable waveform features rather than emitted as unsupported conclusions.
The model was developed by researchers at the National University of Singapore and Peking University and introduced in a March 2025 preprint, with a camera-ready version accepted to NeurIPS 2025. Beyond the model itself, the authors contribute the ECG-Grounding dataset and a "Grounded ECG Understanding" evaluation task designed to measure whether a model's reasoning aligns with clinical practice.
GEM targets a central problem in clinical decision support: a diagnosis is only trustworthy if a clinician can trace it to the evidence. By anchoring its outputs to physiological measurements such as heart rate and PR/QRS intervals, GEM aims to make automated ECG interpretation auditable and clinician-aligned.
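To make "anchoring outputs to physiological measurements" concrete, the following minimal sketch pairs each diagnostic statement with the measurement that supports it. It is not GEM's method. The function names (`heart_rate_bpm`, `grounded_findings`) are hypothetical, and the thresholds are conventional adult textbook ranges assumed for illustration rather than values taken from the paper.

```python
from statistics import mean

# Conventional adult reference ranges, assumed here for illustration only.
# They are textbook values, not thresholds taken from GEM.
NORMAL_HR_BPM = (60, 100)   # beats per minute
NORMAL_PR_MAX_MS = 200      # PR interval above this is considered prolonged
WIDE_QRS_MS = 120           # QRS duration at or above this is considered wide


def heart_rate_bpm(rr_intervals_ms):
    """Mean heart rate from RR intervals given in milliseconds."""
    return 60_000 / mean(rr_intervals_ms)


def grounded_findings(rr_intervals_ms, pr_ms, qrs_ms):
    """Return (statement, evidence) pairs so each claim cites a measurement."""
    hr = heart_rate_bpm(rr_intervals_ms)
    findings = []

    low, high = NORMAL_HR_BPM
    if hr < low:
        findings.append(("bradycardia", f"heart rate {hr:.0f} bpm < {low} bpm"))
    elif hr > high:
        findings.append(("tachycardia", f"heart rate {hr:.0f} bpm > {high} bpm"))
    else:
        findings.append(("normal rate", f"heart rate {hr:.0f} bpm within {low}-{high} bpm"))

    if pr_ms > NORMAL_PR_MAX_MS:
        findings.append(("prolonged PR interval", f"PR {pr_ms} ms > {NORMAL_PR_MAX_MS} ms"))
    if qrs_ms >= WIDE_QRS_MS:
        findings.append(("wide QRS complex", f"QRS {qrs_ms} ms >= {WIDE_QRS_MS} ms"))

    return findings


if __name__ == "__main__":
    # Hypothetical measurements for a single recording.
    for statement, evidence in grounded_findings([1200, 1180, 1220], pr_ms=220, qrs_ms=90):
        print(f"{statement}: {evidence}")
```

A reviewer can check each statement against its evidence string, which is the kind of auditability a grounded interpretation is meant to provide. A real system would derive these measurements from the waveform itself rather than receive them as arguments.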
GEM is a 7-billion-parameter multimodal LLM built on the LLaVA (v1.6-vicuna-7b) framework, with PULSE-7B supported as an alternative base MLLM. A dedicated ECG-CoCa encoder handles the signal and image modalities and feeds a cross-modal alignment stage ahead of the language backbone. Training draws on a broad collection of public ECG corpora (including MIMIC-IV, PTB-XL, Code-15%, CPSC 2018, CSN, and G12E) together with the purpose-built ECG-Grounding dataset, which contains roughly 30,000 instruction pairs annotated with heartbeat-level physiological features; the dataset totals about 43,600 rows across its train and test splits. On the authors' Grounded ECG Understanding evaluation, GEM reports a 7.4% improvement on the CSN benchmark, a 22.7% improvement in explainability, and a 24.8% improvement in grounding relative to baselines. The released weights are distributed in Safetensors format at BF16 precision on Hugging Face (LANSG/GEM).
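As a rough sizing aid, the weight footprint implied by a 7-billion-parameter model stored at BF16 can be estimated with simple arithmetic. The sketch below assumes a nominal count of exactly 7 billion parameters; the true count, including the ECG encoder, will differ, and the figure covers weights only, not activations or the attention cache.

```python
BF16_BYTES_PER_PARAMETER = 2  # bfloat16 stores each value in 16 bits


def weight_memory_bytes(num_parameters, bytes_per_parameter):
    """Bytes needed to hold the weights alone at the given precision."""
    return num_parameters * bytes_per_parameter


# Nominal "7B" figure, assumed for illustration; not an exact parameter count.
nominal_parameters = 7_000_000_000
total_bytes = weight_memory_bytes(nominal_parameters, BF16_BYTES_PER_PARAMETER)

print(f"{total_bytes / 1e9:.1f} GB")      # decimal gigabytes
print(f"{total_bytes / 2**30:.1f} GiB")   # binary gibibytes
```

Under these assumptions the weights occupy about 14 GB (roughly 13 GiB), which is a useful lower bound when choosing hardware for inference or fine-tuning.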
GEM is aimed at clinical and research workflows where automated ECG reading must be both accurate and explainable. Cardiologists and emergency clinicians can use grounded interpretations to quickly review which waveform features support a given diagnosis, which helps in triage and second-opinion scenarios. For ML researchers, the open ECG-Grounding dataset and the Grounded ECG Understanding task provide a reproducible benchmark for evidence-based cardiac diagnosis, and the released checkpoints offer a strong starting point for fine-tuning on institution-specific ECG corpora.
By framing ECG interpretation as a grounded, multimodal task, GEM moves automated cardiac diagnosis toward the verifiability that clinical adoption requires. Its NeurIPS 2025 acceptance, its openly released weights, encoder, and grounding dataset, and its Apache-2.0 codebase (188 GitHub stars) lower the barrier for follow-on work on explainable biosignal models. The accompanying benchmark gives the community a shared yardstick for measuring whether ECG language models reason from the signal rather than around it. Because GEM is an academic research model evaluated primarily on public datasets, broad clinical generalization and prospective validation remain open questions, but it establishes a concrete template for evidence-grounded ECG understanding.
Lan, X., et al. (2025) GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images. arXiv.org.
DOI: 10.48550/arXiv.2503.06073