A 116M-parameter self-supervised transformer pretrained on millions of tandem mass spectra that produces general-purpose molecular embeddings for spectral annotation and property prediction.
DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) is a self-supervised transformer that learns general-purpose molecular representations directly from tandem mass spectrometry (MS/MS) data. Untargeted metabolomics experiments routinely generate enormous numbers of fragmentation spectra, yet the vast majority remain unannotated because reference libraries cover only a small fraction of the chemical universe. DreaMS reframes this problem the way language models reframed natural language: rather than relying on hand-built rules or scarce labeled examples, it pretrains on millions of unannotated spectra to build a representation that transfers across many downstream metabolomics tasks.
The model was developed by Roman Bushuiev and colleagues in the Pluskal lab at the Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences (IOCB Prague), together with collaborators at MIT, and published in Nature Biotechnology in May 2025. It is the first mass-spectrometry foundation model to demonstrate that a single pretrained backbone can deliver state-of-the-art results across spectral annotation, molecular fingerprint prediction, and chemical property prediction without task-specific architectures.
By treating MS/MS spectra as the native input modality—rather than first projecting them onto a molecular structure—DreaMS provides a reusable embedding layer for the metabolomics community, analogous to the role ESM embeddings play for proteins.
DreaMS is a transformer with approximately 116 million parameters that operates directly on peak lists from MS/MS spectra. It was pretrained on the GeMS (GNPS Experimental Mass Spectra) dataset, a corpus of unannotated spectra mined from the MassIVE/GNPS repository, with roughly 24 million curated high-quality spectra used for the core pretraining. The self-supervised objective combines masked-peak prediction—analogous to masked language modeling—with prediction of chromatographic retention orders, yielding 1024-dimensional embeddings. For downstream tasks the backbone is fine-tuned on smaller labeled sets (for example, spectra paired with molecular structures from MoNA). Across spectral annotation, molecular fingerprint prediction, and chemical property prediction benchmarks, DreaMS achieved state-of-the-art performance, and its embeddings improved spectral similarity and library matching over established fragmentation-based metrics.
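The masked-peak part of the objective described above can be illustrated with a small input-preparation sketch. This is not the DreaMS implementation: the function name `mask_peaks`, the 30% mask fraction, the uniform sampling of peaks, and the `-1.0` sentinel are all assumptions chosen for illustration. It only shows the general idea of hiding some m/z values and keeping them as prediction targets.

```python
import numpy as np


def mask_peaks(peaks, mask_fraction=0.3, mask_value=-1.0, seed=0):
    """Hide the m/z of a random subset of peaks (illustrative only).

    peaks: array of shape (n_peaks, 2) with columns [m/z, intensity].
    Returns the masked copy, the masked row indices, and the hidden m/z
    values that a model would be trained to predict.
    """
    rng = np.random.default_rng(seed)
    peaks = np.asarray(peaks, dtype=float)
    n_peaks = peaks.shape[0]

    # Assumption: mask a fixed fraction of peaks, sampled uniformly.
    n_masked = max(1, int(round(mask_fraction * n_peaks)))
    masked_idx = np.sort(rng.choice(n_peaks, size=n_masked, replace=False))

    masked = peaks.copy()
    targets = peaks[masked_idx, 0].copy()  # true m/z values to recover
    masked[masked_idx, 0] = mask_value     # sentinel replaces the m/z
    return masked, masked_idx, targets


# Toy spectrum with five peaks (m/z, relative intensity); values are made up.
spectrum = np.array([
    [55.05, 0.10],
    [83.09, 0.45],
    [111.12, 1.00],
    [139.15, 0.30],
    [167.18, 0.05],
])
masked, masked_idx, targets = mask_peaks(spectrum, mask_fraction=0.4)
print(masked)
print(masked_idx, targets)
```

In a real pretraining setup the masked peak list would be fed to the transformer and the loss computed only on the hidden positions, in the same way masked language modeling scores only the masked tokens.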
DreaMS targets researchers in untargeted metabolomics, natural product discovery, environmental and exposomics analysis, and any field that relies on interpreting large volumes of MS/MS data. Its embeddings can be dropped into existing workflows to improve spectral library matching, prioritize candidate structures, predict molecular fingerprints and properties, and build molecular networks that cluster related compounds. The released DreaMS Atlas lets analysts place their own spectra in the context of hundreds of millions of community spectra, surfacing potential novel metabolites that conventional library search would miss.
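As a rough illustration of embedding-based library matching, the sketch below ranks library spectra by cosine similarity to each query. The embeddings here are random placeholders standing in for 1024-dimensional DreaMS vectors; how the real embeddings are computed is not shown, and the helper names `l2_normalize` and `top_k_matches` are hypothetical, not part of any DreaMS API.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosines."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_matches(query_emb, library_emb, k=3):
    """Return indices and cosine scores of the k nearest library entries."""
    sims = l2_normalize(query_emb) @ l2_normalize(library_emb).T
    k = min(k, sims.shape[1])
    order = np.argsort(-sims, axis=1)[:, :k]          # best match first
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores


# Placeholder data: 100 library embeddings and 2 queries that are noisy,
# rescaled copies of library entries 7 and 42 (assumed values, not real spectra).
rng = np.random.default_rng(0)
library = rng.normal(size=(100, 1024))
queries = 2.5 * library[[7, 42]] + 0.01 * rng.normal(size=(2, 1024))

indices, scores = top_k_matches(queries, library, k=3)
print(indices[:, 0])  # best-matching library index for each query
print(scores[:, 0])   # corresponding cosine similarity
```

The same similarity matrix can be thresholded or restricted to each node's k nearest neighbours to build a molecular network, with spectra as nodes and high-similarity pairs as edges.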
DreaMS demonstrates that the foundation-model paradigm transfers cleanly to mass spectrometry, providing the metabolomics field with a reusable, pretrained representation rather than a collection of bespoke task-specific models. Its fully open release (code, weights, documentation, the GeMS training data, and the 201-million-spectrum DreaMS Atlas) lowers the barrier for downstream method development and reproducibility. As a Nature Biotechnology publication with broad community resources, it is positioned to become a standard embedding backbone for spectral annotation, much as protein language models became standard for sequence analysis. Key limitations include reliance on GNPS-derived spectra, which may bias coverage toward well-studied compound classes, and the need for task-specific fine-tuning to reach peak performance on specialized prediction problems.
Bushuiev, R., et al. (2025) Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS. Nature Biotechnology.
DOI: 10.1038/s41587-025-02663-3