Open respiratory acoustic foundation models pretrained on ~136K curated cough and breathing recordings for health tasks such as disease detection and lung function estimation.
Respiratory sounds — coughs, breaths, and exhalations — carry clinically useful information about lung and airway health, and the proliferation of smartphones and wearables has made such audio cheap to collect at scale. Yet most respiratory audio models are trained from scratch on small, narrowly labelled datasets for a single task (for example COVID-19 screening), which limits their accuracy and generalisability. OPERA (OPEn Respiratory Acoustic foundation models) tackles this by pretraining general-purpose encoders on large volumes of unlabelled respiratory audio, producing reusable representations that can be adapted to many downstream health tasks.
OPERA was developed by Yuwei Zhang, Tong Xia, Jing Han, Cecilia Mascolo and colleagues in the Mobile Systems group at the University of Cambridge, and presented at NeurIPS 2024 (preprint June 2024). Beyond releasing models, the authors contribute an open framework: a curated pretraining corpus aggregated from public respiratory-audio sources, three pretrained foundation models, and a benchmark of 19 downstream health tasks for standardised evaluation.
The project is deliberately open — code, curated data pipelines, pretrained checkpoints, and the evaluation suite are all released — to give the respiratory health community a common, reproducible starting point rather than a collection of isolated, task-specific models.
OPERA curates roughly 136,000 respiratory audio samples totalling about 440 hours of cough and breathing recordings, drawn from public sources including COVID-19 Sounds, UK COVID-19, CoughVID, ICBHI, HF Lung, Coswara, KAUH, and others. Three encoders are pretrained with self-supervised objectives: OPERA-CT and OPERA-CE use contrastive learning (transformer and efficient-CNN backbones, respectively), while OPERA-GT is a generative transformer autoencoder. The encoders operate on spectrogram inputs of fixed-length audio segments (around 8 seconds) and produce 768-dimensional feature embeddings used for downstream linear probing and fine-tuning. Across the 19-task benchmark, the OPERA models surpass general-audio foundation models (such as those pretrained on AudioSet) on 16 of the 19 tasks, with contrastive and generative variants showing complementary strengths across classification and regression endpoints. Parameter counts for the individual encoders are not stated in the paper.
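To make the fixed-length input idea concrete, here is a minimal sketch of how a recording might be cropped, zero-padded, or split into segments of about 8 seconds before a spectrogram is computed. It is an illustration only, not OPERA's actual preprocessing: the 16 kHz sampling rate, the zero-padding strategy, and the helper names `fix_length` and `split_into_segments` are all assumptions, and the spectrogram step itself is left out.

```python
SAMPLE_RATE = 16_000     # assumed sampling rate in Hz (not stated in the text)
SEGMENT_SECONDS = 8.0    # approximate segment length mentioned above


def fix_length(samples, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Crop or zero-pad a mono waveform to exactly `seconds` of audio."""
    target = int(round(sample_rate * seconds))
    cropped = list(samples[:target])
    # Zero-padding short clips is an assumption; repeating the clip is another option.
    return cropped + [0.0] * (target - len(cropped))


def split_into_segments(samples, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Split a long recording into consecutive fixed-length segments."""
    target = int(round(sample_rate * seconds))
    segments = []
    for start in range(0, len(samples), target):
        chunk = samples[start:start + target]
        segments.append(fix_length(chunk, sample_rate, seconds))
    return segments


short_clip = [0.1] * (3 * SAMPLE_RATE)     # 3 s of dummy audio
long_clip = [0.1] * (20 * SAMPLE_RATE)     # 20 s of dummy audio

print(len(fix_length(short_clip)))           # 128000 samples = 8 s at 16 kHz
print(len(split_into_segments(long_clip)))   # 3 segments (8 s, 8 s, padded 4 s)
```

Every segment then has the same number of samples, so a batch of spectrograms shares one shape regardless of the original recording length.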
OPERA targets researchers and developers building respiratory health tools from acoustic data, particularly in mobile and remote-monitoring settings where audio can be captured passively on consumer devices. The pretrained encoders can be adapted — typically via lightweight linear probing or fine-tuning — to tasks such as COVID-19 and COPD detection, smoker classification, and lung-function estimation, lowering the data and compute barrier for groups without large labelled cohorts. The accompanying benchmark also serves as a shared yardstick for comparing new respiratory-audio methods.
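Linear probing, as mentioned above, means keeping the encoder frozen and training only a single linear layer on its embeddings. The sketch below shows the idea with a small logistic-regression probe written against the standard library. The embeddings are synthetic stand-ins generated by a hypothetical `fake_embeddings` helper, not real OPERA outputs, and no OPERA loading API is shown. In practice one would more likely use scikit-learn or PyTorch for this step.

```python
import math
import random

EMBED_DIM = 768  # embedding size quoted in the text


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fake_embeddings(n, rng, shift=0.5):
    """Synthetic stand-in for frozen encoder outputs: two classes with shifted means."""
    feats, labels = [], []
    for i in range(n):
        label = i % 2
        mean = shift if label == 1 else -shift
        feats.append([rng.gauss(mean, 1.0) for _ in range(EMBED_DIM)])
        labels.append(label)
    return feats, labels


def train_linear_probe(feats, labels, epochs=20, lr=0.1):
    """Fit one linear layer (logistic regression) by batch gradient descent."""
    dim, n = len(feats[0]), len(feats)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one embedding."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias >= 0 else 0


rng = random.Random(0)
train_x, train_y = fake_embeddings(40, rng)
test_x, test_y = fake_embeddings(20, rng)
weights, bias = train_linear_probe(train_x, train_y)
correct = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Because only `weights` and `bias` are trained, the probe needs far less labelled data and compute than fine-tuning the whole encoder, which is the trade-off the paragraph above describes.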
OPERA is among the first openly released foundation-model efforts dedicated to respiratory acoustics, and it establishes both reusable encoders and a common benchmark for a field that had been fragmented across bespoke, single-task models. By demonstrating that domain-specific pretraining beats general-audio models on most health tasks and generalises to unseen data, it strengthens the case for specialised audio foundation models in health. A practical caveat is the licensing split: the code is released under the permissive MIT licence, but the pretrained weights on Hugging Face are CC-BY-NC-4.0, restricting their use to non-commercial purposes. The evaluation is also observational, so prospective clinical validation is still required before deployment.
Zhang, Y., et al. (2024) Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking. Neural Information Processing Systems.
DOI: 10.48550/arXiv.2406.16148