Masked-autoencoder foundation models pretrained on millions of unlabeled digital-stethoscope PCG and ECG recordings, fine-tuned for cardiovascular disease detection.
Digital stethoscopes capture two synchronized biosignals at the point of care: the phonocardiogram (PCG), an acoustic recording of heart sounds, and a single-lead electrocardiogram (ECG). While these signals carry rich information about cardiovascular disease, building accurate detection algorithms has historically been limited by the scarcity of expertly annotated recordings. This work, published by Eko Health in npj Cardiovascular Health in October 2024, addresses that bottleneck by pretraining transformer-based foundation models on large volumes of unlabeled stethoscope data and then fine-tuning them for specific clinical tasks.
The authors adapt the masked autoencoder (MAE) self-supervised framework, originally developed for images, to single- and multi-signal stethoscope data. PCG recordings are converted to mel-spectrograms and split into patches, ECG signals are split into temporal segments, and the model learns to reconstruct masked portions of each. By learning general-purpose representations from recordings collected during routine clinical practice, the resulting encoders transfer effectively to downstream detection problems where labeled data is limited.
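The masking step described above can be illustrated with a minimal sketch. It covers only the index bookkeeping: how a spectrogram is divided into a patch grid and how patches are split at random into a visible set (fed to the encoder) and a masked set (to be reconstructed). The 16x16 patch size, the 128x256 spectrogram shape, the 75% mask ratio, and the function names `patch_grid` and `random_mask` are illustrative assumptions and are not taken from the source.

```python
import random


def patch_grid(n_mels, n_frames, patch=16):
    """Return (rows, cols) of non-overlapping square patches in a spectrogram."""
    if n_mels % patch or n_frames % patch:
        raise ValueError("spectrogram dimensions must be multiples of the patch size")
    return n_mels // patch, n_frames // patch


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into visible and masked sets.

    mask_ratio is a hypothetical value; the source does not state the ratio used.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])  # patches the encoder sees
    masked = sorted(order[num_keep:])   # patches the decoder must reconstruct
    return visible, masked


# Hypothetical 128-bin mel-spectrogram with 256 time frames.
rows, cols = patch_grid(n_mels=128, n_frames=256)
visible, masked = random_mask(rows * cols, mask_ratio=0.75, seed=0)
print(rows * cols, len(visible), len(masked))
```

In an MAE, only the visible patches pass through the encoder, which keeps pretraining cheap; the decoder then receives the encoded visible patches plus placeholder tokens for the masked positions. The same index split would apply to ECG temporal segments, with a one-dimensional grid in place of the two-dimensional one.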
This is, to the authors' knowledge, the first foundation-model approach built specifically for synchronously captured PCG and ECG from digital stethoscopes, and it demonstrates strong performance across structural murmur, atrial fibrillation, and reduced ejection fraction detection.
The architecture is a "base"-scale vision-transformer-style MAE. The encoder comprises 12 transformer layers with a 768-dimensional embedding, 3072-dimensional feed-forward blocks, and 12 attention heads, totaling 85,254,144 trainable parameters; a lighter decoder (4 layers, 384-dimensional embedding, 6 heads, 7,492,864 parameters) is used only during pretraining, for a combined 92.7M parameters. Pretraining corpora span 1,890,304 PCG recordings, 241,664 ECG recordings, and 221,184 paired PCG-ECG recordings. After fine-tuning, the models reach an AUROC of 98.3% (99.0% on real-world evidence) for structural-heart-disease murmur detection, 98.0% (97.9% on a held-out test set) for atrial fibrillation, and 84.5% for low ejection fraction detection, with self-supervised pretraining consistently improving over training from scratch on the same labeled data.
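The reported parameter counts can be checked with simple arithmetic. The sketch below assumes a standard ViT/MAE layout that the source does not spell out: single-channel 16x16 input patches, a learned class token, fixed (non-trainable) positional embeddings, biases on all linear layers, a final layer norm in both encoder and decoder, and a decoder feed-forward width of 4 x 384 = 1536. Under those assumptions the totals match the figures quoted above; the function names are hypothetical.

```python
def block_params(dim, mlp_dim):
    """Trainable parameters in one pre-norm transformer block.

    The number of attention heads does not affect the count.
    """
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)      # QKV + output projection
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)   # two feed-forward layers
    norms = 2 * (2 * dim)                                     # two layer norms (scale + shift)
    return attn + mlp + norms


def encoder_params(dim=768, mlp_dim=3072, depth=12, patch_pixels=16 * 16):
    patch_embed = patch_pixels * dim + dim  # linear patch projection (assumed 16x16x1)
    cls_token = dim                         # assumed learned class token
    final_norm = 2 * dim
    return patch_embed + cls_token + depth * block_params(dim, mlp_dim) + final_norm


def decoder_params(enc_dim=768, dim=384, mlp_dim=1536, depth=4, patch_pixels=16 * 16):
    embed = enc_dim * dim + dim                   # project encoder output to decoder width
    mask_token = dim                              # learned placeholder for masked patches
    final_norm = 2 * dim
    predict = dim * patch_pixels + patch_pixels   # reconstruct the values of each patch
    return embed + mask_token + depth * block_params(dim, mlp_dim) + final_norm + predict


enc = encoder_params()
dec = decoder_params()
print(enc, dec, enc + dec)  # 85254144 7492864 92747008
```

The combined 92,747,008 parameters round to the 92.7M stated above. Because the decoder is discarded after pretraining, only the roughly 85M encoder parameters are carried into fine-tuning.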
The models target point-of-care cardiovascular screening using hardware already deployed in clinics. Fine-tuned detectors for structural murmurs, atrial fibrillation, and reduced ejection fraction could help primary-care clinicians flag patients for echocardiography or specialist referral during routine exams, including in resource-limited settings where access to cardiology is scarce. More broadly, the pretrained encoders provide a reusable starting point for developing additional stethoscope-based biosignal classifiers without assembling large labeled datasets for each new condition.
The work demonstrates that self-supervised foundation models, well established in imaging and language, transfer effectively to synchronized cardiac biosignals, with pretraining on unlabeled clinical recordings delivering measurable gains on label-scarce detection tasks. It is an industry-led example of leveraging proprietary device data at scale for medical AI. A notable limitation for the open-research community is reproducibility: neither the code nor the model weights are publicly released, and the authors state the implementation code "will be made available upon request," so independent verification and reuse are constrained.
Mathew, G., et al. (2024) Foundation models for cardiovascular disease detection via biosignals from digital stethoscopes. npj Cardiovascular Health.
DOI: 10.1038/s44325-024-00027-5
Mathew, G., et al. (2024) Foundation Models for Cardiovascular Disease Detection via BioSignals from Digital Stethoscopes. Springer Science and Business Media LLC.
DOI: 10.21203/rs.3.rs-4732737/v1