GEM (Grounded ECG understanding with Multimodal LLM)

National University of Singapore / Peking University

Multimodal LLM unifying 12-lead ECG time series, ECG images, and text for grounded, clinician-aligned electrocardiogram interpretation.

Released: March 2025

Parameters: 7 Billion

GEM (Grounded ECG understanding with Multimodal LLM) is a multimodal large language model for electrocardiogram (ECG) interpretation that jointly reasons over three modalities: raw 12-lead ECG time series, rendered ECG images, and natural-language text. Most prior ECG language models consume only a single modality—typically either the waveform signal or a scanned image—and produce free-text diagnoses that are difficult to verify against the underlying signal. GEM is presented as the first MLLM to unify time series, images, and text for grounded ECG analysis, meaning its diagnostic statements are tied back to specific, measurable waveform features rather than emitted as unsupported conclusions.

The model was developed by researchers at the National University of Singapore and Peking University and introduced in a March 2025 preprint, with a camera-ready version accepted to NeurIPS 2025. Beyond the model itself, the authors contribute the ECG-Grounding dataset and a "Grounded ECG Understanding" evaluation task designed to measure whether a model's reasoning aligns with clinical practice.

GEM targets a central problem in clinical decision support: a diagnosis is only trustworthy if a clinician can trace it to the evidence. By anchoring its outputs to physiological measurements such as heart rate and PR/QRS intervals, GEM aims to make automated ECG interpretation auditable and clinician-aligned.
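To make the idea of anchoring a statement to a measurement concrete, here is a minimal sketch of a rule-based check that reports each finding together with the value behind it. It is not GEM's method, which relies on a learned model. The `Measurements` class, the function names, and the example values are hypothetical, and the thresholds are conventional adult resting reference ranges rather than figures from the GEM paper.

```python
from dataclasses import dataclass
from statistics import mean

# Conventional adult resting reference ranges (textbook values, not from GEM).
NORMAL_HR_BPM = (60.0, 100.0)
NORMAL_PR_MS = (120.0, 200.0)
MAX_NORMAL_QRS_MS = 120.0


@dataclass
class Measurements:
    """Hypothetical container for a few per-recording ECG measurements."""
    rr_intervals_ms: list[float]  # time between successive R peaks
    pr_interval_ms: float
    qrs_duration_ms: float


def heart_rate_bpm(rr_intervals_ms: list[float]) -> float:
    """Mean heart rate: 60,000 ms per minute divided by the mean RR interval."""
    return 60_000.0 / mean(rr_intervals_ms)


def grounded_findings(m: Measurements) -> list[str]:
    """Return findings, each stated together with the measurement behind it."""
    findings = []
    hr = heart_rate_bpm(m.rr_intervals_ms)
    if hr > NORMAL_HR_BPM[1]:
        findings.append(f"tachycardia (heart rate {hr:.0f} bpm > 100)")
    elif hr < NORMAL_HR_BPM[0]:
        findings.append(f"bradycardia (heart rate {hr:.0f} bpm < 60)")
    if m.pr_interval_ms > NORMAL_PR_MS[1]:
        findings.append(f"prolonged PR interval ({m.pr_interval_ms:.0f} ms > 200)")
    if m.qrs_duration_ms >= MAX_NORMAL_QRS_MS:
        findings.append(f"wide QRS complex ({m.qrs_duration_ms:.0f} ms >= 120)")
    return findings


# Illustrative values only: RR of 500 ms corresponds to 120 bpm.
example = Measurements(
    rr_intervals_ms=[500.0, 500.0, 500.0],
    pr_interval_ms=220.0,
    qrs_duration_ms=90.0,
)
print(grounded_findings(example))
```

Each returned string carries its own evidence, so a reviewer can check the number against the tracing instead of taking the label on trust.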

Key Features

Tri-modal input: Jointly processes 12-lead ECG time series, ECG images, and text, allowing the model to exploit the high temporal fidelity of raw signals alongside the spatial layout of standard clinical ECG printouts.
Dual-encoder architecture: Separate encoders extract complementary features from the time-series and image modalities, which are then combined through a cross-modal alignment mechanism so signal- and image-derived evidence inform a single interpretation.
Feature grounding: Diagnoses are linked to measurable ECG parameters (e.g., intervals and rates), supporting evidence-driven reasoning rather than opaque end-to-end labels.
Knowledge-guided instruction data: A knowledge-guided generation pipeline produces granular grounding annotations that connect diagnoses to physiological features, enabling instruction tuning toward clinically meaningful explanations.
Open weights and data: GEM-7B checkpoints, the ECG-CoCa encoder, and the ECG-Grounding dataset are publicly released under an Apache-2.0 codebase.
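The tri-modal input described above can be pictured as one record holding a 12-lead signal, a rendered image, and a text prompt. The sketch below is a hypothetical container with basic shape checks. The class name, field names, file name, and the 500 Hz / 10 s example are illustrative assumptions, not the data format used by the GEM codebase.

```python
from dataclasses import dataclass

NUM_LEADS = 12  # standard 12-lead ECG


@dataclass
class TriModalECGSample:
    """Hypothetical container for one example; not GEM's actual data format."""
    signal: list[list[float]]  # one list of samples per lead
    image_path: str            # rendered ECG printout
    prompt: str                # natural-language question or instruction

    def validate(self) -> None:
        """Check lead count, equal lead lengths, and a non-empty prompt."""
        if len(self.signal) != NUM_LEADS:
            raise ValueError(f"expected {NUM_LEADS} leads, got {len(self.signal)}")
        lengths = {len(lead) for lead in self.signal}
        if len(lengths) != 1 or 0 in lengths:
            raise ValueError("all leads must have the same non-zero number of samples")
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")

    def duration_s(self, sampling_rate_hz: float) -> float:
        """Recording length in seconds for a given sampling rate."""
        return len(self.signal[0]) / sampling_rate_hz


# Assumed example: 5000 samples per lead at 500 Hz, i.e. a 10-second recording.
sample = TriModalECGSample(
    signal=[[0.0] * 5000 for _ in range(NUM_LEADS)],
    image_path="ecg_0001.png",
    prompt="Interpret this ECG and cite the supporting waveform features.",
)
sample.validate()
print(sample.duration_s(sampling_rate_hz=500.0))  # 10.0
```

Keeping the three modalities in one validated record makes it harder to pair a signal with the wrong image or prompt before the data reaches any encoder.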

Technical Details

GEM is a 7-billion-parameter multimodal LLM built on the LLaVA (v1.6-vicuna-7b) framework, with PULSE-7B supported as an alternative base MLLM. A dedicated ECG-CoCa encoder handles the signal/image modalities and feeds a cross-modal alignment stage before the language backbone. Training draws on a broad collection of public ECG corpora—including MIMIC-IV, PTB-XL, Code-15%, CPSC 2018, CSN, and G12E—together with the purpose-built ECG-Grounding dataset of roughly 30,000 instruction pairs annotated with heartbeat-level physiological features (about 43,600 total rows across train and test splits). On the authors' Grounded ECG Understanding evaluation, GEM reports a 7.4% improvement on the CSN benchmark, a 22.7% improvement in explainability, and a 24.8% improvement in grounding relative to baselines. Released weights are distributed in Safetensors format at BF16 precision via HuggingFace (LANSG/GEM).
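As a rough guide to what BF16 weights at this scale imply for hardware, the arithmetic below estimates the memory needed just to hold the parameters. It assumes exactly 7 billion parameters, which is the rounded figure quoted above, and it ignores activations, the KV cache, and any encoder weights not counted in that figure.

```python
PARAMS = 7_000_000_000  # "7 billion" as stated above; the exact count may differ
BYTES_PER_BF16 = 2      # bfloat16 is a 16-bit format


def weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Bytes needed to store the parameters alone."""
    return n_params * bytes_per_param


total = weight_bytes(PARAMS, BYTES_PER_BF16)
print(f"{total / 1e9:.1f} GB")     # 14.0 GB
print(f"{total / 2**30:.1f} GiB")  # 13.0 GiB
```

Under these assumptions the weights alone need roughly 14 GB, so inference memory in practice will be somewhat higher.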

Applications

GEM is aimed at clinical and research workflows where automated ECG reading must be both accurate and explainable. Cardiologists and emergency clinicians can use grounded interpretations to quickly review which waveform features support a given diagnosis, which helps in triage and second-opinion scenarios. For ML researchers, the open ECG-Grounding dataset and the Grounded ECG Understanding task provide a reproducible benchmark for evidence-based cardiac diagnosis, and the released checkpoints offer a starting point for fine-tuning on institution-specific ECG corpora.

Impact

By framing ECG interpretation as a grounded, multimodal task, GEM moves automated cardiac diagnosis toward the verifiability that clinical adoption requires. Its acceptance at NeurIPS 2025, together with openly released weights, encoder, and grounding dataset and an Apache-2.0 codebase (191 GitHub stars), lowers the barrier for follow-on work on explainable biosignal models. The accompanying benchmark gives the community a shared yardstick for measuring whether ECG language models reason from the signal rather than around it. Because GEM is an academic-scale model evaluated primarily on public datasets, broad clinical generalization and prospective validation remain open questions, but it establishes a concrete template for evidence-grounded ECG understanding.

Citation

GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

Preprint

Lan, X., et al. (2025) GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images. arXiv.org.

DOI: 10.48550/arXiv.2503.06073

Recent citations

Papers that recently cited this model.

VLT: A Vision-Language-Time Series Multimodal Foundation Model for Industrial Intelligence
Haiteng Wang, Jing Yan, Xiaokang Wang, et al.
Jul 2026
Citations: 0
ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents
Dong-gyun Hong, Kyuhwan Lee, J. Kwon, et al.
Jun 2026
Citations: 0
TiWeaver: Unified Temporal Dynamics Modeling via Contextual Patching
Zhe Li, Jindong Tian, Hao Miao, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-Wise Group Relative Policy Optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao, et al.
IEEE International Conference on Computer Vision · Mar 2025
Citations: 294
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Huanjin Yao, Qixiang Yin, Jingyi Zhang, et al.
arXiv.org · May 2025
Citations: 46
QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training
Wei Dai, Peilin Chen, C. Ekbote, et al.
arXiv.org · May 2025
Citations: 32
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan, Zijie Meng, Dianqi Li, et al.
arXiv.org · Sep 2025
Citations: 20
A Survey on Agentic Multimodal Large Language Models
Huanjin Yao, Ruifei Zhang, Jiaxing Huang, et al.
arXiv.org · Oct 2025
Citations: 16

Citations

Total Citations: 41

Influential: 5

References: 68

GitHub

Stars: 191

Forks: 18

Open Issues: 7

Contributors: 1

Last Push: 4mo ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 304

Likes: 7

Last Modified: 1y ago

Pipeline: image-text-to-text

Fields of citing research

Computer Science: 100%
Medicine: 59%
Engineering: 28%
Linguistics: 10%
Biology: 5%
Psychology: 3%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall: 79 (Open)

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 62

Model Openness Framework: Class II (Open Tooling)

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset
