A multi-modal contrastive foundation model for sleep analysis, learning joint representations across brain activity, ECG, and respiratory polysomnography signals.
SleepFM is a multi-modal foundation model for sleep analysis that learns joint representations across the three signal families recorded during overnight polysomnography (PSG): brain activity (EEG/EOG/EMG sleep-stage channels), cardiac activity (ECG), and respiratory signals. Sleep medicine relies on PSG, but the manual scoring of these recordings is labor-intensive and clinical models are typically narrow, trained from scratch for a single task such as sleep-stage classification. SleepFM instead pretrains a single set of encoders on large volumes of unlabeled PSG so the learned embeddings transfer across many downstream sleep tasks.
Developed by researchers at Stanford University and released as a preprint in May 2024, the model was trained on a large clinical cohort: more than 100,000 hours of multi-modal sleep recordings from over 14,000 participants. Its central contribution is a contrastive learning scheme that aligns the three physiological modalities in a shared embedding space, enabling both strong downstream classification and cross-modal retrieval between brain, cardiac, and respiratory signals.
SleepFM sits within the emerging class of biosignal foundation models, applying the self-supervised, pretrain-then-adapt paradigm—well established for protein and language models—to clinical sleep physiology, where labeled data is scarce but raw recordings are abundant.
SleepFM uses 1D convolutional neural network (CNN) encoders, one per modality, operating on raw PSG time series rather than hand-engineered features. The encoders are trained jointly with a contrastive objective using a leave-one-out strategy, in which each modality's embedding is pulled toward the aggregated embedding of the remaining modalities for the same recording window and pushed away from the aggregated embeddings of mismatched windows. Pretraining draws on more than 100,000 hours of recordings from over 14,000 participants. After pretraining, the encoders are frozen and lightweight linear probes are fit for specific tasks. On held-out evaluation, embedding-based classifiers reach a macro AUROC of 0.88 for multi-class sleep staging and 0.85 for sleep-disordered breathing detection, substantially exceeding supervised CNN baselines (0.72 and 0.69, respectively), while the same embeddings support cross-modal retrieval at 48% top-1 accuracy among 90,000 candidates. The released checkpoint is a relatively small model; the authors note larger, architecturally improved versions as future work.
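The leave-one-out objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released SleepFM training code: the use of the mean as the aggregation, cosine similarity, the InfoNCE-style softmax over the batch, the temperature value, and the names `leave_one_out_contrastive_loss`, `bas`, `ecg`, and `resp` are all choices made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def leave_one_out_contrastive_loss(embeddings, temperature=0.1):
    """Hypothetical leave-one-out contrastive loss.

    embeddings: dict mapping modality name -> array of shape (batch, dim),
                where row k of every array comes from the same recording window.
    temperature: softmax temperature (assumed value, not taken from the source).
    """
    names = list(embeddings)
    z = {n: l2_normalize(np.asarray(embeddings[n], dtype=np.float64)) for n in names}

    losses = []
    for held_out in names:
        # Aggregate the remaining modalities (assumption: a simple mean).
        others = [z[n] for n in names if n != held_out]
        target = l2_normalize(np.mean(others, axis=0))

        # Similarity of every held-out embedding to every aggregated target.
        logits = z[held_out] @ target.T / temperature  # shape (batch, batch)

        # Numerically stable row-wise log-softmax.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

        # The matching window sits on the diagonal; all other windows are negatives.
        losses.append(-np.mean(np.diag(log_probs)))

    return float(np.mean(losses))


# Toy data: three modalities that share structure vs. three unrelated ones.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
aligned = {m: shared + 0.01 * rng.normal(size=shared.shape) for m in ("bas", "ecg", "resp")}
unrelated = {m: rng.normal(size=(8, 16)) for m in ("bas", "ecg", "resp")}

aligned_loss = leave_one_out_contrastive_loss(aligned)
random_loss = leave_one_out_contrastive_loss(unrelated)
```

In this toy setup the aligned embeddings should yield a lower loss than the unrelated ones, which is the behavior the objective rewards during pretraining. In an actual training loop the same computation would be written in an autodiff framework so that gradients flow back into the three encoders.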
SleepFM is aimed at researchers and clinicians working with polysomnography who need scalable, label-efficient analysis of overnight sleep studies. Because the pretrained embeddings transfer to multiple tasks, the model can support automated sleep-stage scoring, screening for sleep-disordered breathing such as apnea, and exploratory analyses that link cardiac and respiratory dynamics to brain-derived sleep architecture. Its label efficiency is particularly valuable in sleep medicine, where expert annotation of full-night recordings is costly, and the cross-modal retrieval capability offers a route to imputing or quality-checking signals when one modality is degraded or missing.
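As a rough sketch of how cross-modal retrieval might be scored, the snippet below ranks candidate windows from one modality by cosine similarity to a query from another modality and reports top-k accuracy. The function name `top_k_retrieval_accuracy`, the use of cosine similarity, and the synthetic `ecg_emb` and `resp_emb` arrays are assumptions for illustration; the source does not specify the exact retrieval procedure.

```python
import numpy as np


def top_k_retrieval_accuracy(query_emb, candidate_emb, k=1):
    """Fraction of queries whose true match is among the k most similar candidates.

    Row i of query_emb and row i of candidate_emb are assumed to come from
    the same recording window (for example, ECG and respiratory embeddings).
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)

    sims = q @ c.T  # cosine similarity, shape (n_queries, n_candidates)

    # Indices of the k most similar candidates for each query.
    top_k = np.argsort(-sims, axis=1)[:, :k]

    # A query counts as correct if its own index appears in its top-k list.
    truth = np.arange(len(q))[:, None]
    return float(np.mean((top_k == truth).any(axis=1)))


# Synthetic stand-ins for pretrained embeddings of the same 200 windows.
rng = np.random.default_rng(0)
ecg_emb = rng.normal(size=(200, 32))
resp_emb = ecg_emb + 0.1 * rng.normal(size=ecg_emb.shape)

matched_acc = top_k_retrieval_accuracy(ecg_emb, resp_emb, k=1)

# Shifting every candidate by one row breaks the index alignment on purpose.
shifted_acc = top_k_retrieval_accuracy(ecg_emb, np.roll(resp_emb, 1, axis=0), k=1)
```

The full similarity matrix grows quadratically with the number of windows, so for a pool on the order of the 90,000 candidates mentioned above, a practical version would likely process queries in chunks rather than materializing the whole matrix at once.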
SleepFM is among the first foundation models to bring multi-modal contrastive self-supervision to clinical sleep physiology, demonstrating that a single pretrained model can outperform task-specific supervised networks across sleep-stage classification and disordered-breathing detection while also enabling cross-modal retrieval. By releasing code and a checkpoint, the work provides a reusable starting point for biosignal representation learning and contributes to the broader trend of applying the pretrain-then-adapt paradigm to clinical waveform data. Its main limitations are the modest size of the released checkpoint and its reliance on a single clinical cohort, leaving external validation across diverse populations and devices as important future work.
Thapa, R., et al. (2024) SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals. International Conference on Machine Learning.
DOI: 10.48550/arXiv.2405.17766