SleepFM

Multi-modal foundation model for sleep analysis, learning joint representations across brain, cardiac, and respiratory polysomnography signals.

Released: May 2024

SleepFM is a multi-modal foundation model for sleep analysis that learns joint representations across the three signal families recorded during overnight polysomnography (PSG): brain activity (EEG/EOG/EMG sleep-stage channels), cardiac activity (ECG), and respiratory signals. Sleep medicine relies on PSG, but the manual scoring of these recordings is labor-intensive and clinical models are typically narrow, trained from scratch for a single task such as sleep-stage classification. SleepFM instead pretrains a single set of encoders on large volumes of unlabeled PSG so the learned embeddings transfer across many downstream sleep tasks.

Developed by researchers at Stanford University and released as a preprint in May 2024, the model was trained on a large clinical cohort of over 14,000 participants, who contributed more than 100,000 hours of multi-modal sleep recordings. Its central contribution is a contrastive learning scheme that aligns the three physiological modalities in a shared embedding space, enabling both strong downstream classification and cross-modal retrieval between brain, cardiac, and respiratory signals.

SleepFM sits within the emerging class of biosignal foundation models. It applies the self-supervised, pretrain-then-adapt paradigm, already well established for protein and language models, to clinical sleep physiology, where labeled data is scarce but raw recordings are abundant.

Key Features

Tri-modal contrastive pretraining: Separate encoders for brain, ECG, and respiratory channels are trained to align corresponding clips from the same recording in a shared embedding space, learning transferable representations without task labels.
Leave-one-out contrastive objective: A novel leave-one-out formulation contrasts each modality's embedding against the aggregated embedding of the remaining modalities, which the authors report outperforms standard pairwise contrastive learning for this multi-modal setting.
Strong downstream transfer: Simple classifiers (e.g., logistic regression) trained on the frozen embeddings outperform end-to-end CNNs, reaching macro AUROC 0.88 for sleep-stage classification (versus 0.72 for the CNN baseline) and AUROC 0.85 for sleep-disordered-breathing detection (versus 0.69).
Cross-modal retrieval: The aligned embeddings retrieve the matching clip in another modality with an average top-1 accuracy of 48% against 90,000 candidates, demonstrating coherent multi-modal alignment.
Open code and checkpoint: The implementation and a pretrained checkpoint are released under an MIT license for research use.
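As a rough illustration of the cross-modal retrieval metric listed above, the sketch below computes top-1 accuracy between two sets of clip embeddings. It is not the authors' evaluation code: the function names are hypothetical, cosine similarity is assumed as the ranking score, and queries and candidates are assumed to be index-aligned (entry i in each list comes from the same clip). Plain Python is used only to keep the example dependency-free. For scale, uniform random guessing among 90,000 candidates would give a top-1 accuracy of about 1/90,000, roughly 0.001%.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top1_retrieval_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate has the same index.

    Assumes queries[i] and candidates[i] are embeddings of the same clip
    in two different modalities (an assumption of this sketch).
    """
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(candidates)),
                   key=lambda j: cosine(query, candidates[j]))
        hits += best == i
    return hits / len(queries)


# Toy embeddings (made up for illustration, not real SleepFM outputs).
ecg_clips = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
brain_clips = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Swapping the last two candidates breaks the alignment for two of three clips.
brain_swapped = [brain_clips[0], brain_clips[2], brain_clips[1]]

aligned_accuracy = top1_retrieval_accuracy(ecg_clips, brain_clips)
swapped_accuracy = top1_retrieval_accuracy(ecg_clips, brain_swapped)
```

In this toy setup the aligned candidates are all retrieved correctly, while the swapped set only matches the first clip.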

Technical Details

SleepFM uses one 1D convolutional neural network (CNN) encoder per modality, each operating on raw PSG time series rather than hand-engineered features. The encoders are trained jointly with a leave-one-out contrastive objective: each modality's embedding is pulled toward the aggregated embedding of the remaining modalities for the same recording window and pushed away from the embeddings of mismatched windows. Pretraining draws on more than 100,000 hours of recordings from over 14,000 participants. After pretraining, the encoders are frozen and lightweight linear probes are fit for specific tasks.

On held-out evaluation, embedding-based classifiers reach macro AUROC 0.88 for multi-class sleep staging and AUROC 0.85 for sleep-disordered-breathing detection, substantially exceeding supervised CNN baselines (0.72 and 0.69, respectively). The same embeddings support cross-modal retrieval at 48% top-1 accuracy from 90,000 candidates. The released checkpoint is a relatively small model, and the authors note larger, architecturally improved versions as future work.
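The leave-one-out objective described above can be sketched as an InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the released implementation: the aggregation is assumed to be a mean of unit-normalized embeddings, the similarity is assumed to be a temperature-scaled dot product, and the function names and the temperature value are hypothetical. Plain Python is used for portability; a real training loop would use a tensor library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leave_one_out_loss(embeddings, temperature=0.1):
    """Average leave-one-out contrastive loss over modalities and windows.

    embeddings maps a modality name to a list of per-window vectors; index k
    in every list must refer to the same recording window.
    """
    unit = {m: [l2_normalize(v) for v in vecs] for m, vecs in embeddings.items()}
    names = list(unit)
    n = len(unit[names[0]])
    total, terms = 0.0, 0
    for held_out in names:
        others = [m for m in names if m != held_out]
        # Aggregate the remaining modalities per window (mean is an assumption).
        targets = []
        for k in range(n):
            dim = len(unit[others[0]][k])
            mean = [sum(unit[m][k][d] for m in others) / len(others)
                    for d in range(dim)]
            targets.append(l2_normalize(mean))
        for i in range(n):
            logits = [dot(unit[held_out][i], t) / temperature for t in targets]
            # Cross-entropy with window i as the positive, computed stably.
            peak = max(logits)
            log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
            total += log_denominator - logits[i]
            terms += 1
    return total / terms


# Toy data (made up): two windows, three modalities, 2-D embeddings.
aligned = {
    "brain": [[1.0, 0.0], [0.0, 1.0]],
    "ecg": [[0.9, 0.1], [0.1, 0.9]],
    "resp": [[0.8, 0.2], [0.2, 0.8]],
}
# Reversing two modalities pairs each brain window with the wrong clips.
mismatched = dict(aligned, ecg=aligned["ecg"][::-1], resp=aligned["resp"][::-1])

aligned_loss = leave_one_out_loss(aligned)
mismatched_loss = leave_one_out_loss(mismatched)
```

On this toy data the loss is lower when all three modalities agree on which window is which, which is the behavior the pretraining objective rewards.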

Applications

SleepFM is aimed at researchers and clinicians working with polysomnography who need scalable, label-efficient analysis of overnight sleep studies. Because the pretrained embeddings transfer to multiple tasks, the model can support automated sleep-stage scoring, screening for sleep-disordered breathing such as apnea, and exploratory analyses that link cardiac and respiratory dynamics to brain-derived sleep architecture. Its label efficiency is particularly valuable in sleep medicine, where expert annotation of full-night recordings is costly, and the cross-modal retrieval capability offers a route to imputing or quality-checking signals when one modality is degraded or missing.
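For readers evaluating their own sleep-stage classifiers against the macro AUROC figures quoted earlier, the sketch below shows one common way to compute that metric. It assumes the one-vs-rest definition with an unweighted mean over classes; the source says only "macro", so the exact averaging used by the authors is an assumption here, and the function names and toy scores are hypothetical.

```python
def binary_auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


def macro_auroc(true_classes, class_scores):
    """Unweighted mean of one-vs-rest AUROCs.

    true_classes holds integer class indices; class_scores[i][c] is the
    classifier's score for example i and class c.
    """
    per_class = []
    for c in sorted(set(true_classes)):
        labels = [1 if y == c else 0 for y in true_classes]
        scores = [row[c] for row in class_scores]
        per_class.append(binary_auroc(labels, scores))
    return sum(per_class) / len(per_class)


# Toy example (made up): six clips, three classes.
true_stages = [0, 0, 1, 1, 2, 2]
stage_scores = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
toy_macro_auroc = macro_auroc(true_stages, stage_scores)
```

Because each class contributes equally to the mean, rare sleep stages weigh as much as common ones in this metric.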

Impact

SleepFM is among the first foundation models to bring multi-modal contrastive self-supervision to clinical sleep physiology, demonstrating that a single pretrained model can outperform task-specific supervised networks across sleep-stage classification and disordered-breathing detection while also enabling cross-modal retrieval. By releasing code and a checkpoint, the work provides a reusable starting point for biosignal representation learning and contributes to the broader trend of applying the pretrain-then-adapt paradigm to clinical waveform data. Its main limitations are the modest size of the released checkpoint and its reliance on a single clinical cohort, leaving external validation across diverse populations and devices as important future work.

Citation

SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals

Preprint

Thapa, R., et al. (2024) SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals. International Conference on Machine Learning.

DOI: 10.48550/arXiv.2405.17766

Recent citations

Papers that recently cited this model.

Hallucinations in LLMs: a lifecycle-based survey of causes, detection, mitigation, and prevention
Naveen Lamba, Sanju Tiwari, Manas Gaur
International Journal of Data Science and Analysis · Jul 2026
Citations: 0

From Objectives to Applications: Aligning Architectural Biases in Audio Self-Supervised Learning
Kele Xu, Yu Fang, Boda Zhou, et al.
Jul 2026
Citations: 0

SPOTR: Spatio-temporal Pooling One-Token Reconstruction for Universal Physiological Signal Self-supervised Learning
Yiyu Gui, Mingzhi Chen, Yuesheng Zhu, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Scaling Wearable Foundation Models
Girish Narayanswamy, Xin Liu, Kumar Ayush, et al.
International Conference on Learning Representations · Oct 2024
Citations: 58

SensorLM: Learning the Language of Wearable Sensors
Yuwei Zhang, Kumar Ayush, Siyuan Qiao, et al.
arXiv.org · Jun 2025
Citations: 47

A multimodal sleep foundation model for disease prediction
R. Thapa, M. R. Kjaer, Bryan He, et al.
Nature Medicine · Jan 2026
Citations: 24 · Influential

HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series
Simon A. Lee, Cyrus Tanade, Hao Zhou, et al.
arXiv.org · Oct 2025
Citations: 19

A foundational transformer leveraging full night, multichannel sleep study data accurately classifies sleep stages
Benjamin Fox, Joy Jiang, S. Wickramaratne, et al.
medRxiv · Aug 2024
Citations: 19

Citations

Total Citations: 56
Influential: 6
References: 49

GitHub

Stars: 175
Forks: 26
Open Issues: 2
Contributors: 1
Last Push: 2y ago
Language: Python
License: MIT

Fields of citing research

Computer Science: 89%
Medicine: 85%
Engineering: 35%
Biology: 6%
Psychology: 4%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 76 · Open
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 56

Model Openness Framework: Class III · Open Model

Resources

GitHub Repository · Research Paper

