Google Research / Google DeepMind / University of Cambridge
A family of sensor-language foundation models from Google that aligns wearable biosignals with natural language for zero-shot recognition, retrieval, and captioning.
SensorLM is a family of sensor-language foundation models developed by Google Research, Google DeepMind, and the University of Cambridge to make the continuous biosignals collected by consumer wearables interpretable through natural language. Wearable devices such as Fitbit and Pixel Watch produce dense, multi-channel physiological streams—heart rate, motion, skin temperature, electrodermal activity—that are difficult to label and have historically required task-specific models for each downstream question. SensorLM addresses this gap by jointly pretraining a sensor encoder and a text model so that sensor segments and their textual descriptions live in a shared representation space.
The central obstacle to building such a model is the near-total absence of paired sensor–text data: raw wearable streams almost never come with descriptive captions. The authors overcome this with a hierarchical caption-generation pipeline that algorithmically derives statistical, structural, and semantic descriptions from the signals themselves, allowing them to assemble what they report as the largest sensor-language dataset to date. First posted to arXiv in June 2025 and presented at NeurIPS 2025, SensorLM extends established multimodal recipes (CLIP and CoCa) from images to physiological time series.
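To make the idea of algorithmically derived captions concrete, here is a minimal sketch of the statistical and structural levels for a single channel. The function names, caption templates, trend rule, and sample values are all hypothetical illustrations, not the authors' pipeline. The semantic level is left out because it would need event or activity annotations that this sketch does not model.

```python
from statistics import mean

def statistical_caption(name, unit, values):
    """Describe one channel with simple summary statistics (hypothetical template)."""
    return (f"{name}: mean {mean(values):.1f} {unit}, "
            f"min {min(values):.1f} {unit}, max {max(values):.1f} {unit}.")

def structural_caption(name, values, tol=1.0):
    """Describe the overall trend by comparing first-half and second-half means.

    The halving rule and the tolerance are assumptions made for illustration.
    """
    half = len(values) // 2
    delta = mean(values[half:]) - mean(values[:half])
    if delta > tol:
        trend = "rises"
    elif delta < -tol:
        trend = "falls"
    else:
        trend = "stays roughly flat"
    return f"{name} {trend} over the window."

# Made-up heart-rate samples in beats per minute.
heart_rate = [62, 64, 63, 90, 118, 121, 119, 95]

caption = " ".join([
    statistical_caption("Heart rate", "bpm", heart_rate),
    structural_caption("Heart rate", heart_rate),
])
print(caption)
```

Because captions like these are computed from the signal itself, every sensor segment can be paired with text without manual annotation, which is what makes a large paired corpus feasible.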
By learning the "language" of wearable sensors, SensorLM enables zero-shot and few-shot interpretation of health and activity signals without bespoke labeled datasets, positioning it alongside vision-language models as a multimodal foundation model for the biosignal domain.
SensorLM encodes a 26-feature by 1440-minute daily sensor matrix, derived from five modalities (PPG, accelerometer, electrodermal activity, skin temperature, and altimeter), using a ViT-2D backbone with a (2, 10) patch size that yields 1,872 tokens (13 × 144 patches). This sensor encoder is paired with a text encoder and a multimodal decoder and trained under both contrastive and generative losses. The family spans four variants: SensorLM-S (3M-parameter sensor encoder), SensorLM-B (114M), SensorLM-L (404M), and SensorLM-XL (1.27B). Pretraining used 59.7 million hours of de-identified, consented data from 103,643 individuals across 127 countries, collected over a two-month window in 2024. Across zero-shot recognition, few-shot learning, and cross-modal retrieval benchmarks, SensorLM outperforms strong baselines and exhibits the scaling, label efficiency, and zero-shot generalization to unseen tasks characteristic of large multimodal foundation models.
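The token count follows directly from the input and patch dimensions: a 26 × 1440 matrix tiled with non-overlapping (2, 10) patches gives 13 × 144 = 1,872 patches. The sketch below checks that arithmetic and also illustrates a generic CLIP-style symmetric contrastive loss on toy embeddings. The helper names, the temperature value, and the embeddings are assumptions chosen for illustration. This is not the SensorLM training code, and the generative (captioning) loss is not shown.

```python
import math

def num_patch_tokens(height, width, patch_h, patch_w):
    """Number of non-overlapping patches that tile a height x width input."""
    assert height % patch_h == 0 and width % patch_w == 0
    return (height // patch_h) * (width // patch_w)

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """CLIP-style loss: average of sensor-to-text and text-to-sensor cross-entropy.

    Row i of sensor_emb is assumed to be paired with row i of text_emb.
    The temperature is an illustrative default, not a value from the source.
    """
    s = [_normalize(v) for v in sensor_emb]
    t = [_normalize(v) for v in text_emb]
    logits = [[_dot(si, tj) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

tokens = num_patch_tokens(26, 1440, 2, 10)
print(tokens)  # 13 * 144 = 1872

# Toy 2-d embeddings: correctly paired rows versus deliberately swapped rows.
sensor = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(sensor, text_aligned)
loss_swapped = symmetric_contrastive_loss(sensor, text_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each sensor embedding is closest to its own caption and high when the pairs are mismatched, which is the pressure that pulls matching sensor segments and descriptions together in the shared space.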
SensorLM targets consumer and clinical health analytics where wearable data must be summarized, queried, or classified at scale. Potential uses include zero-shot recognition of physical activities and physiological states, natural-language search over sensor archives (retrieving segments that match a textual description or vice versa), automatic captioning of a day's signals for clinicians or users, and rapid bootstrapping of new health classifiers from few labeled examples. Researchers studying digital biomarkers and population health benefit from a model that generalizes across diverse devices, demographics, and geographies without per-task annotation.
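With a shared embedding space, zero-shot recognition and cross-modal retrieval both reduce to nearest-neighbor search under a similarity measure. The following sketch uses cosine similarity over hand-written toy vectors. The embeddings, label prompts, and segment ids are invented for illustration, and a real system would obtain the vectors from the trained sensor and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(sensor_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the sensor embedding."""
    return max(label_embeddings,
               key=lambda label: cosine(sensor_embedding, label_embeddings[label]))

def retrieve(query_embedding, archive, k=1):
    """Return the ids of the k archive entries most similar to the query."""
    ranked = sorted(archive,
                    key=lambda seg_id: cosine(query_embedding, archive[seg_id]),
                    reverse=True)
    return ranked[:k]

# Hypothetical text embeddings for candidate label prompts.
label_embeddings = {
    "a person running": [0.9, 0.1, 0.0],
    "a person sleeping": [0.0, 0.2, 0.9],
    "a person cycling": [0.6, 0.7, 0.1],
}

# Hypothetical sensor embedding for one unlabeled segment.
segment = [0.8, 0.2, 0.1]
predicted = zero_shot_classify(segment, label_embeddings)
print(predicted)

# Text-to-sensor retrieval over a hypothetical archive of segment embeddings.
archive = {
    "seg-001": [0.1, 0.1, 0.95],
    "seg-002": [0.85, 0.15, 0.05],
    "seg-003": [0.5, 0.8, 0.1],
}
hits = retrieve(label_embeddings["a person sleeping"], archive, k=1)
print(hits)
```

Adding a new class in this setup only requires embedding a new text prompt, which is why no per-task labeled dataset is needed for recognition.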
SensorLM demonstrates that the multimodal alignment paradigm underpinning vision-language models transfers to physiological time series, providing a template for connecting raw biosignals to interpretable language. By reporting benchmark results from a 59.7-million-hour corpus together with a documented model-scaling study, it establishes a reference point for the emerging field of wearable foundation models and lowers the barrier to label-efficient health analytics. Its principal limitations are that the underlying training corpus and model weights are not publicly released, which constrains external reproduction, and that performance on clinically actionable endpoints beyond activity and physiological-state recognition remains to be validated prospectively.
Zhang, Y., et al. (2025) SensorLM: Learning the Language of Wearable Sensors. arXiv.org.
DOI: 10.48550/arXiv.2506.09108