Icahn School of Medicine at Mount Sinai
A BEiT vision transformer pretrained on 8.5M 12-lead ECG images via masked image modeling, excelling at low-data-regime cardiac diagnosis.
HeartBEiT is a domain-specific vision transformer for electrocardiogram (ECG) analysis, developed by researchers at the Icahn School of Medicine at Mount Sinai and published in npj Digital Medicine in June 2023. Rather than treating the ECG as a multichannel time series, HeartBEiT treats the standard printed 12-lead ECG as an image and applies a transformer originally designed for computer vision. This reframing lets the model exploit the same visual layout that clinicians read, while inheriting the scalable self-supervised pretraining recipe of modern vision foundation models.
The central problem HeartBEiT addresses is data efficiency. Convolutional neural networks for ECG diagnosis typically require very large labeled datasets to reach clinical-grade accuracy, and representations learned by natural-image models (e.g., ImageNet-pretrained CNNs) carry over poorly to biomedical signals. By pretraining directly on millions of unlabeled ECG images from a single health system, HeartBEiT learns ECG-specific visual representations that fine-tune effectively even when only a small number of labeled examples are available.
HeartBEiT was among the early demonstrations that the BEiT-style masked image modeling paradigm could be ported from general computer vision to a clinical biosignal, and it remains a notable reference point for image-based approaches to ECG interpretation that contrast with the more common waveform/time-series foundation models in cardiology.
HeartBEiT is built on the BEiT-base architecture, a vision transformer with roughly 86 million parameters. It was pretrained via masked image modeling on approximately 8.5 million 12-lead ECG images drawn from about 2.1 million patients in the Mount Sinai Health System. Pretraining is fully self-supervised: patches of each image are masked and the model learns to predict the visual tokens of the masked patches, acquiring ECG-specific features before any diagnostic labels are introduced. The pretrained backbone is then fine-tuned on each downstream classification task. On three such tasks, diagnosis of hypertrophic cardiomyopathy (HCM), low left ventricular ejection fraction (LVEF), and ST-elevation myocardial infarction (STEMI), the authors compared HeartBEiT against standard CNN architectures (such as EfficientNet and ResNet variants) at progressively smaller training sample sizes and on independent validation datasets, reporting that HeartBEiT's advantage grows as labeled data becomes scarcer.
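The masked image modeling objective described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline. The image size (224), patch size (16), mask ratio (0.4), and token vocabulary size (8192) are assumed values typical of BEiT-base configurations and are not stated in this description. The sketch also uses uniform random masking and random stand-in token ids, where a real BEiT setup would use blockwise masking and a learned image tokenizer. All function names are hypothetical.

```python
import random

# Assumed values for illustration only; none are taken from the HeartBEiT description.
IMAGE_SIZE = 224    # pixels per side of the (square) input ECG image
PATCH_SIZE = 16     # pixels per side of each patch
MASK_RATIO = 0.4    # fraction of patches hidden during pretraining
VOCAB_SIZE = 8192   # number of distinct visual token ids


def patches_per_side(image_size, patch_size):
    """Return how many non-overlapping patches fit along one side of the image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Choose patches to hide uniformly at random; True marks a masked patch.

    This is a simplification: it does not reproduce blockwise masking.
    """
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_token_targets(visual_tokens, mask):
    """Map each masked patch index to the token id the model must predict."""
    return {i: tok for i, (tok, hidden) in enumerate(zip(visual_tokens, mask)) if hidden}


side = patches_per_side(IMAGE_SIZE, PATCH_SIZE)
num_patches = side * side

mask = random_patch_mask(num_patches, MASK_RATIO)

# Stand-in for the output of an image tokenizer: one token id per patch.
token_rng = random.Random(1)
visual_tokens = [token_rng.randrange(VOCAB_SIZE) for _ in range(num_patches)]

targets = masked_token_targets(visual_tokens, mask)

print(f"{side} x {side} grid = {num_patches} patches")
print(f"{len(targets)} masked patches serve as prediction targets")
```

Under these assumptions, the loss is computed only at the masked positions, so the transformer has to infer the hidden parts of the tracing from the visible context. No diagnostic labels are involved at this stage.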
HeartBEiT is aimed at clinical and translational cardiology settings where labeled ECG data is limited. Because it fine-tunes effectively from few examples, it is well suited to detecting conditions that are difficult or impossible to read directly from the ECG (such as low ejection fraction or hypertrophic cardiomyopathy), to building diagnostic models for rare presentations, and to institutions without millions of labeled tracings. Its image-based explanations also support clinician review and auditing of model predictions, which is valuable for deployment in decision-support workflows.
HeartBEiT helped establish image-based, self-supervised transformers as a viable direction for ECG analysis, demonstrating that domain-specific pretraining can outperform both ImageNet transfer and conventional CNNs, especially in very low-data regimes, while improving interpretability. A practical limitation for external adoption is access to the weights: the fine-tuning and checkpoint-loading code is openly available on GitHub, but the pretrained model weights are distributed only through a Mount Sinai data-sharing agreement rather than as a freely downloadable artifact, which constrains fully open reuse and reproducibility despite the public codebase.
Vaid, A., et al. (2023). A foundational vision transformer improves diagnostic performance for electrocardiograms. npj Digital Medicine.
DOI: 10.1038/s41746-023-00840-9