
SensorLM

Google Research / Google DeepMind / University of Cambridge

A family of sensor-language foundation models from Google that aligns wearable biosignals with natural language for zero-shot recognition, retrieval, and captioning.

Released: June 2025
Parameters: up to 1.27 billion (SensorLM-XL)

SensorLM is a family of sensor-language foundation models developed by Google Research, Google DeepMind, and the University of Cambridge to make the continuous biosignals collected by consumer wearables interpretable through natural language. Wearable devices such as Fitbit and Pixel Watch produce dense, multi-channel physiological streams (heart rate, motion, skin temperature, electrodermal activity) that are difficult to label and have historically required a separate task-specific model for each downstream question. SensorLM addresses this gap by jointly pretraining a sensor encoder and a text model so that sensor segments and their textual descriptions share one representation space.

The central obstacle to building such a model is the near-total absence of paired sensor–text data: raw wearable streams almost never come with descriptive captions. The authors overcome this with a hierarchical caption-generation pipeline that algorithmically derives statistical, structural, and semantic descriptions from the signals themselves, allowing them to assemble what they report as the largest sensor-language dataset to date. First posted to arXiv in June 2025 and presented at NeurIPS 2025, SensorLM extends established multimodal recipes (CLIP and CoCa) from images to physiological time series.
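The three caption levels can be illustrated with a toy example. The sketch below is not the authors' pipeline: the function names, the 5-unit trend margin, the 100 bpm threshold, and the caption templates are all hypothetical, chosen only to show how statistical, structural, and semantic descriptions might be derived from a single synthetic heart-rate channel.

```python
from statistics import mean

def statistical_caption(name, values):
    # Level 1: summary statistics over the whole window.
    return (f"{name}: mean {mean(values):.1f}, "
            f"min {min(values):.1f}, max {max(values):.1f}")

def structural_caption(name, values, margin=5.0):
    # Level 2: coarse trend, comparing the two halves of the window.
    # The margin is an illustrative assumption.
    half = len(values) // 2
    delta = mean(values[half:]) - mean(values[:half])
    if delta > margin:
        trend = "rises"
    elif delta < -margin:
        trend = "falls"
    else:
        trend = "stays roughly flat"
    return f"{name} {trend} over the window"

def semantic_caption(values, threshold=100.0):
    # Level 3: a high-level event label. The threshold is an illustrative
    # assumption, not a value from the paper.
    elevated = sum(1 for v in values if v >= threshold)
    if elevated >= len(values) // 2:
        return "likely sustained physical activity"
    return "likely rest or light activity"

def hierarchical_caption(name, values):
    # Concatenate the three levels into one caption string.
    return "; ".join([
        statistical_caption(name, values),
        structural_caption(name, values),
        semantic_caption(values),
    ])

heart_rate = [62, 64, 63, 70, 95, 110, 118, 121, 119, 115]  # synthetic bpm
caption = hierarchical_caption("heart rate", heart_rate)
print(caption)
```

Because every caption is computed from the signal itself, a pipeline of this kind needs no human annotation, which is what makes large-scale pairing of sensor data and text feasible.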

By learning the "language" of wearable sensors, SensorLM enables zero-shot and few-shot interpretation of health and activity signals without bespoke labeled datasets, positioning it alongside vision-language models as a multimodal foundation model for the biosignal domain.

#Key Features

  • Paired sensor–language pretraining: The model aligns minute-resolution wearable signals with text via combined contrastive and generative objectives, mirroring CLIP-style matching and CoCa-style caption generation in a single framework.
  • Hierarchical caption pipeline: An automated pipeline produces statistical, structural, and semantic captions directly from sensor streams, working around the scarcity of human-written sensor descriptions and enabling dataset curation at scale.
  • Zero- and few-shot capability: SensorLM performs activity recognition, cross-modal retrieval, and sensor captioning without task-specific fine-tuning, and adapts quickly from a handful of labeled examples.
  • Generative captioning: The model can describe a sensor segment in coherent text, reportedly with greater factual accuracy than generic language models prompted with raw signals.
  • Model family with clear scaling: Four sizes (3M to 1.27B parameters) let practitioners trade compute for accuracy, with the authors documenting consistent scaling behavior across the family.
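To make the combined objective in the first bullet concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss, plus a weighted sum with a captioning loss. It is an illustration under stated assumptions, not the SensorLM training code: the temperature of 0.07, the `caption_weight` parameter, and the two-dimensional toy embeddings are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: item i in each list is assumed to be a matched pair.
    s = [normalize(v) for v in sensor_emb]
    t = [normalize(v) for v in text_emb]
    n = len(s)
    logits = [[dot(s[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    sensor_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_sensor = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (sensor_to_text + text_to_sensor)

def combined_loss(contrastive, captioning, caption_weight=1.0):
    # caption_weight is a hypothetical knob; the source does not state weights.
    return contrastive + caption_weight * captioning

sensor = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # each caption matches its sensor row
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong row
loss_aligned = contrastive_loss(sensor, aligned_text)
loss_swapped = contrastive_loss(sensor, swapped_text)
print(loss_aligned < loss_swapped)
```

The loss is low when each sensor embedding is closest to its own caption and high when the pairs are swapped, which is the behavior the contrastive term is meant to reward. In a CoCa-style setup the captioning term would typically be a token-level cross-entropy from the decoder; here it is passed in as a plain number.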

#Technical Details

SensorLM encodes each day of data as a 26-feature by 1,440-minute sensor matrix derived from five modalities (PPG, accelerometer, electrodermal activity, skin temperature, and altimeter). A ViT-2D backbone with a (2, 10) patch size splits this matrix into 13 × 144 = 1,872 patch tokens. The sensor encoder is paired with a text encoder and a multimodal decoder, and the model is trained under both contrastive and generative losses. The family spans four variants: SensorLM-S (3M-parameter sensor encoder), SensorLM-B (114M), SensorLM-L (404M), and SensorLM-XL (1.27B). Pretraining used 59.7 million hours of de-identified, consented data from 103,643 individuals across 127 countries, collected over a two-month window in 2024. Across zero-shot recognition, few-shot learning, and cross-modal retrieval benchmarks, the authors report that SensorLM outperforms strong baselines and shows the scaling behavior, label efficiency, and zero-shot generalization to unseen tasks characteristic of large multimodal foundation models.
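The token count and the scale of the corpus follow from simple arithmetic on the figures above. The short sketch below checks both; it assumes non-overlapping patches and no extra special tokens, and the helper name `num_patches` is purely illustrative.

```python
def num_patches(height, width, patch_h, patch_w):
    # Non-overlapping patches; assumes both dimensions divide evenly.
    if height % patch_h or width % patch_w:
        raise ValueError("input shape must be divisible by the patch size")
    return (height // patch_h) * (width // patch_w)

features, minutes = 26, 1440     # daily sensor matrix, as stated in the text
patch = (2, 10)                  # (feature, time) patch size, as stated
tokens = num_patches(features, minutes, *patch)
print(tokens)                    # 13 * 144 = 1872

total_hours = 59.7e6             # pretraining hours, as stated
participants = 103_643           # individuals, as stated
days_per_person = total_hours / participants / 24
print(round(days_per_person, 1)) # average days of data per participant
```

The second figure works out to roughly 24 days of data per participant on average, which is consistent with a collection window of about two months.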

#Applications

SensorLM targets consumer and clinical health analytics where wearable data must be summarized, queried, or classified at scale. Potential uses include zero-shot recognition of physical activities and physiological states, natural-language search over sensor archives (retrieving segments that match a textual description, or descriptions that match a segment), automatic captioning of a day's signals for clinicians or users, and rapid bootstrapping of new health classifiers from a few labeled examples. Researchers studying digital biomarkers and population health may benefit from a model intended to generalize across diverse devices, demographics, and geographies without per-task annotation.
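Zero-shot recognition and natural-language search both reduce to ranking by similarity in the shared embedding space. The following sketch shows the idea with cosine similarity. Since the model weights are not public, the vectors are hand-written placeholders rather than real encoder outputs, and `rank_by_similarity` is a hypothetical helper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    # Returns candidate keys sorted from most to least similar to the query.
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]),
                  reverse=True)

# Placeholder vectors standing in for text-encoder outputs of label prompts.
prompt_embeddings = {
    "running": [0.9, 0.1, 0.0],
    "sleeping": [0.0, 0.2, 0.9],
    "cycling": [0.6, 0.7, 0.1],
}
# Placeholder vector standing in for the sensor encoder's output on one segment.
segment_embedding = [0.8, 0.3, 0.05]

ranking = rank_by_similarity(segment_embedding, prompt_embeddings)
print(ranking[0])  # top-ranked label for this segment
```

Swapping the roles (a text embedding as the query and a dictionary of segment embeddings as candidates) gives text-to-sensor retrieval with the same function.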

#Impact

SensorLM demonstrates that the multimodal alignment paradigm underpinning vision-language models transfers to physiological time series, providing a template for connecting raw biosignals to interpretable language. By reporting benchmark results on a 59.7-million-hour corpus together with a documented model-scaling study, it establishes a reference point for the emerging field of wearable foundation models and lowers the barrier to label-efficient health analytics. It has two principal limitations. First, neither the training corpus nor the model weights are publicly released, which constrains external reproduction. Second, performance on clinically actionable endpoints beyond activity and physiological-state recognition has yet to be validated prospectively.

Citation

Zhang, Y., et al. (2025). SensorLM: Learning the Language of Wearable Sensors. arXiv preprint arXiv:2506.09108.

DOI: 10.48550/arXiv.2506.09108

Citations

Total citations: 47
Influential: 2
References: 50

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
41 (Partial)
Usability (can I run it?): 35
Reproducibility (can I retrain it?): 39
Model Openness Framework: Unclassified (missing required components)

Tags

activity_recognition, contrastive_learning, cross_modal_retrieval, foundation_model, multimodal, physiology, sensor_captioning, transformer, vision_transformer, wearable_sensors, zero_shot
