SensorLM

Google Research / Google DeepMind / University of Cambridge

Sensor-language foundation models aligning wearable biosignals with text for zero-shot activity recognition, retrieval, and sensor captioning.

Released: June 2025

Parameters: 1.3 Billion

SensorLM is a family of sensor-language foundation models developed by Google Research, Google DeepMind, and the University of Cambridge to make the continuous biosignals collected by consumer wearables interpretable through natural language. Wearable devices such as Fitbit and Pixel Watch produce dense, multi-channel physiological streams—heart rate, motion, skin temperature, electrodermal activity—that are difficult to label and have historically required task-specific models for each downstream question. SensorLM addresses this gap by jointly pretraining a sensor encoder and a text model so that sensor segments and their textual descriptions live in a shared representation space.

The central obstacle to building such a model is the near-total absence of paired sensor–text data: raw wearable streams almost never come with descriptive captions. The authors overcome this with a hierarchical caption-generation pipeline that algorithmically derives statistical, structural, and semantic descriptions from the signals themselves, allowing them to assemble what they report as the largest sensor-language dataset to date. First posted to arXiv in June 2025 and presented at NeurIPS 2025, SensorLM extends established multimodal recipes (CLIP and CoCa) from images to physiological time series.
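To make the three caption levels concrete, here is a minimal sketch of how statistical, structural, and semantic descriptions could be derived from a single channel. It is not the authors' pipeline: the function names, the half-window trend heuristic, the 5% tolerance, and the `(label, start_minute, end_minute)` event format are all assumptions chosen for illustration.

```python
from statistics import mean

def statistical_caption(name, values, unit):
    """Level 1 (assumed form): summarize a channel with descriptive statistics."""
    return (f"{name}: mean {mean(values):.1f} {unit}, "
            f"min {min(values):.1f}, max {max(values):.1f}.")

def structural_caption(name, values, rel_tol=0.05):
    """Level 2 (assumed heuristic): compare the means of the two window halves."""
    half = len(values) // 2
    first, second = mean(values[:half]), mean(values[half:])
    delta = second - first
    threshold = rel_tol * abs(first)  # hypothetical 5% tolerance
    if delta > threshold:
        trend = "rises"
    elif delta < -threshold:
        trend = "falls"
    else:
        trend = "stays roughly flat"
    return f"{name} {trend} over the window."

def semantic_caption(events):
    """Level 3 (assumed form): verbalize (label, start_minute, end_minute) events."""
    if not events:
        return "No labeled events in this window."
    parts = [f"{label} from minute {start} to {end}" for label, start, end in events]
    return "Events: " + "; ".join(parts) + "."

def hierarchical_caption(name, values, unit, events):
    """Concatenate the three levels into one caption string."""
    return " ".join([
        statistical_caption(name, values, unit),
        structural_caption(name, values),
        semantic_caption(events),
    ])

# Toy eight-minute heart-rate window; the values are made up for the example.
heart_rate = [62, 64, 63, 65, 110, 118, 121, 117]
caption = hierarchical_caption("Heart rate", heart_rate, "bpm", [("running", 4, 8)])
print(caption)
```

Because each level is a pure function of the signal (plus any event labels already attached to it), captions of this kind can be generated for arbitrarily large archives without human annotation, which is the property the paragraph above relies on.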

By learning the "language" of wearable sensors, SensorLM enables zero-shot and few-shot interpretation of health and activity signals without bespoke labeled datasets, positioning it alongside vision-language models as a multimodal foundation model for the biosignal domain.

Key Features

Paired sensor–language pretraining: The model aligns minute-resolution wearable signals with text via combined contrastive and generative objectives, mirroring CLIP-style matching and CoCa-style caption generation in a single framework.
Hierarchical caption pipeline: An automated pipeline produces statistical, structural, and semantic captions directly from sensor streams, addressing the scarcity of human-written sensor descriptions and enabling dataset curation at scale.
Zero- and few-shot capability: SensorLM performs activity recognition, cross-modal retrieval, and sensor captioning without task-specific fine-tuning, and adapts quickly from a handful of labeled examples.
Generative captioning: The model can describe a sensor segment in coherent text, reportedly with greater factual accuracy than generic language models prompted with raw signals.
Model family with clear scaling: Four sizes (3M to 1.27B parameters) let practitioners trade compute for accuracy, with the authors documenting consistent scaling behavior across the family.
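The combined contrastive and generative objective named in the first feature can be sketched as a symmetric InfoNCE term plus a next-token cross-entropy term. This is a generic CLIP/CoCa-style formulation in dependency-free Python, not the authors' training code; the temperature of 0.07, the equal loss weights `w_con` and `w_cap`, and all function names are assumptions.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of each batch is assumed to be a matched pair."""
    s = [normalize(v) for v in sensor_emb]
    t = [normalize(v) for v in text_emb]
    n = len(s)
    logits = [[dot(s[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Sensor-to-text direction: each row should pick its own column.
    s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-sensor direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (s2t + t2s)

def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over one caption."""
    nll = [-log_softmax(logits)[tok] for logits, tok in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def combined_loss(sensor_emb, text_emb, token_logits, target_ids, w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights here are placeholders."""
    return (w_con * contrastive_loss(sensor_emb, text_emb)
            + w_cap * captioning_loss(token_logits, target_ids))

# Toy check: correctly paired embeddings should score a lower loss than swapped ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The contrastive term is what makes retrieval and zero-shot classification possible, while the captioning term trains the decoder to produce text; optimizing both in one framework is the design the feature list describes.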

Technical Details

SensorLM encodes each day of data as a 26-feature by 1,440-minute sensor matrix derived from five modalities (PPG, accelerometer, electrodermal activity, skin temperature, and altimeter). A ViT-2D backbone with a (2, 10) patch size splits this matrix into 13 × 144 = 1,872 tokens. The sensor encoder is paired with a text encoder and a multimodal decoder, and the components are trained under both contrastive and generative losses. The family spans four variants: SensorLM-S (3M-parameter sensor encoder), SensorLM-B (114M), SensorLM-L (404M), and SensorLM-XL (1.27B).

Pretraining used 59.7 million hours of de-identified, consented data from 103,643 individuals across 127 countries, collected over a two-month window in 2024. Across zero-shot recognition, few-shot learning, and cross-modal retrieval benchmarks, the authors report that SensorLM outperforms the baselines they compare against and exhibits the scaling, label efficiency, and zero-shot generalization to unseen tasks characteristic of large multimodal foundation models.
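The token count follows directly from the input shape and patch size: 26 / 2 = 13 patches along the feature axis and 1,440 / 10 = 144 along the time axis, giving 13 × 144 = 1,872. The sketch below checks that arithmetic with a plain-Python patchify routine. It assumes non-overlapping patches with the patch's first dimension on the feature axis (the only assignment that divides evenly); `num_patches` and `patchify` are illustrative names, not part of any released code.

```python
def num_patches(shape, patch):
    """Number of non-overlapping patches for a 2D input of the given shape."""
    rows, cols = shape
    ph, pw = patch
    if rows % ph or cols % pw:
        raise ValueError("patch size must divide the input shape evenly")
    return (rows // ph) * (cols // pw)

def patchify(matrix, patch):
    """Split a 2D list into flattened non-overlapping patches, row-major order."""
    ph, pw = patch
    rows, cols = len(matrix), len(matrix[0])
    num_patches((rows, cols), patch)  # raises if the patch does not fit
    patches = []
    for r in range(0, rows, ph):
        for c in range(0, cols, pw):
            patches.append([matrix[r + i][c + j] for i in range(ph) for j in range(pw)])
    return patches

# One day of data: 26 features x 1440 minutes (zeros stand in for real signals).
day = [[0.0] * 1440 for _ in range(26)]
tokens = patchify(day, (2, 10))
print(len(tokens), len(tokens[0]))  # 1872 patches, each holding 2 * 10 = 20 values
```

In a ViT-style encoder each flattened 20-value patch would then be linearly projected to the model width before entering the transformer; that projection is omitted here.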

Applications

SensorLM targets consumer and clinical health analytics where wearable data must be summarized, queried, or classified at scale. Potential uses include zero-shot recognition of physical activities and physiological states, natural-language search over sensor archives (retrieving segments that match a textual description or vice versa), automatic captioning of a day's signals for clinicians or users, and rapid bootstrapping of new health classifiers from few labeled examples. Researchers studying digital biomarkers and population health may benefit from a model pretrained on data spanning many demographics and geographies, since it reduces the need for per-task annotation.
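Zero-shot recognition and natural-language search both reduce to nearest-neighbor lookups in the shared embedding space. The sketch below shows that mechanism with cosine similarity over toy three-dimensional vectors. The vectors stand in for encoder outputs (no SensorLM weights or API are publicly available), and the label prompts, segment IDs, and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(sensor_embedding, label_embeddings):
    """Pick the label whose text embedding is closest to the sensor embedding."""
    scores = {label: cosine(sensor_embedding, emb) for label, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores

def retrieve(query_embedding, archive, k=2):
    """Return the IDs of the k archived segments most similar to a text query."""
    ranked = sorted(archive, key=lambda seg_id: cosine(query_embedding, archive[seg_id]),
                    reverse=True)
    return ranked[:k]

# Toy text embeddings for candidate labels (stand-ins for a text encoder's output).
label_embeddings = {
    "a period of running": [0.9, 0.1, 0.0],
    "a period of sleep": [0.0, 0.2, 0.9],
    "a period of cycling": [0.6, 0.7, 0.1],
}

# Toy sensor embedding for one segment (stand-in for the sensor encoder's output).
segment = [0.8, 0.2, 0.1]
predicted, scores = zero_shot_classify(segment, label_embeddings)
print(predicted)

# Text-to-sensor retrieval over a small archive of segment embeddings.
archive = {
    "seg-001": [0.1, 0.1, 0.9],
    "seg-002": [0.85, 0.15, 0.05],
    "seg-003": [0.5, 0.8, 0.1],
}
top = retrieve(label_embeddings["a period of running"], archive, k=2)
print(top)
```

Adding a new activity class in this setup only requires embedding a new text prompt, which is why no task-specific fine-tuning is needed for the zero-shot use cases listed above.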

Impact

SensorLM demonstrates that the multimodal alignment paradigm underpinning vision-language models transfers to physiological time series, providing a template for connecting raw biosignals to interpretable language. By reporting benchmark results for models pretrained on a 59.7-million-hour corpus, which the authors describe as the largest sensor-language dataset to date, together with a documented model-scaling study, it establishes a reference point for the emerging field of wearable foundation models and lowers the barrier to label-efficient health analytics. Its principal limitations are that neither the training corpus nor the model weights are publicly released, which constrains external reproduction, and that performance on clinically actionable endpoints beyond activity and physiological-state recognition has yet to be validated prospectively.

Citation

Zhang, Y., et al. (2025). SensorLM: Learning the Language of Wearable Sensors. arXiv preprint.

DOI: 10.48550/arXiv.2506.09108

Recent citations

Papers that recently cited this model.

PFHAR: Practically Adopting Multi-Modal Foundation Model for Human Activity Recognition Through Edge-Cloud Collaborative Learning
Zhengyuan Zhang, Dong Zhao, Guanzhou Zhu, et al.
IEEE Transactions on Mobile Computing · Aug 2026 · 0 citations

Toward Wearable Sensor-Based Human Activity Recognition: A Survey
Hailin Zou, Zijie Chen, Yuanyuan Pan, et al.
IEEE Internet of Things Journal · Jul 2026 · 0 citations

Generative AI at the Edge: A Comprehensive Survey of Architectures, Hardware and Applications
Mozhgan Navardi, Yuzhe Fu, Yueqian Lin, et al.
ACM Computing Surveys · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

LLaSA: A Sensor-Aware LLM for Natural Language Reasoning of Human Activity from IMU Data
Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, et al.
UbiComp Companion · Jun 2024 · 16 citations

ZARA: Training-Free Motion Time-Series Reasoning via Evidence-Grounded LLM Agents
Zechen Li, Baiyu Chen, Hao Xue, et al.
Aug 2025 · 13 citations

Towards Generalizable Human Activity Recognition: A Survey
Yize Cai, Baoshen Guo, Flora D. Salim, et al.
arXiv.org · Aug 2025 · 11 citations

AnyPPG: An ECG-Guided PPG Foundation Model Trained on Over 100,000 Hours of Recordings for Holistic Health Profiling
Guangkun Nie, Xiaocheng Fang, G. Tang, et al.
Nov 2025 · 8 citations

Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition
Ilker Demirel, Karan Thakkar, Benjamin Elizalde, et al.
arXiv.org · Sep 2025 · 8 citations

Citations

Total Citations: 58

Influential: 3

References: 50

Fields of citing research

Computer Science: 95%
Medicine: 58%
Engineering: 49%
Psychology: 5%
Biology: 5%
Environmental Science: 5%
Physics: 4%
Education: 2%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

41 · Partial

Usability (can I run it?): 35

Reproducibility (can I retrain it?): 39

Model Openness Framework: Unclassified (missing required components)

