DreaMS

A 116M-parameter self-supervised transformer pretrained on millions of tandem mass spectra that produces general-purpose molecular embeddings for spectral annotation and property prediction.

Released: May 2025

Parameters: 116 Million

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) is a self-supervised transformer that learns general-purpose molecular representations directly from tandem mass spectrometry (MS/MS) data. Untargeted metabolomics experiments routinely generate enormous numbers of fragmentation spectra, yet the vast majority remain unannotated because reference libraries cover only a small fraction of the chemical universe. DreaMS reframes this problem the way language models reframed natural language: rather than relying on hand-built rules or scarce labeled examples, it pretrains on millions of unannotated spectra to build a representation that transfers across many downstream metabolomics tasks.

The model was developed by Roman Bushuiev and colleagues in the Pluskal lab at the Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences (IOCB Prague), together with collaborators at MIT, and published in Nature Biotechnology in May 2025. It is the first mass-spectrometry foundation model to demonstrate that a single pretrained backbone can deliver state-of-the-art results across spectral annotation, molecular fingerprint prediction, and chemical property prediction without task-specific architectures.

By treating MS/MS spectra as the native input modality, rather than first projecting them onto a molecular structure, DreaMS provides a reusable embedding layer for the metabolomics community, analogous to the role ESM embeddings play for proteins.

Key Features

Self-supervised pretraining on unlabeled spectra: DreaMS learns from millions of unannotated MS/MS spectra, removing the dependence on costly, library-limited structural annotations that constrains supervised approaches.
Dual pretraining objectives: The model is trained to predict masked spectral peaks and to recover chromatographic retention orders, two complementary signals that force it to capture chemically meaningful structure.
General-purpose 1024-dimensional embeddings: A single frozen or fine-tuned representation supports spectral similarity, fingerprint prediction, property prediction (including fluorine detection), and molecular networking.
DreaMS Atlas: The authors released a molecular network of roughly 201 million MS/MS spectra annotated with DreaMS representations, enabling exploration of the unannotated metabolome at scale.
Fully open release: Code, pretrained weights, documentation, and the underlying GeMS dataset are publicly available under an MIT license.
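
The masked-peak objective listed above can be illustrated with a small data-preparation sketch. This is not the DreaMS implementation: the function name `mask_peaks`, the sentinel value `-1.0`, the choice to hide only the m/z while keeping the intensity, and the masking fraction are all assumptions made for illustration. It only shows the general idea of hiding some peaks and keeping their true values as prediction targets.

```python
import random

def mask_peaks(peaks, mask_fraction=0.3, mask_value=-1.0, seed=None):
    """Hide the m/z of a random subset of peaks.

    peaks: list of (mz, intensity) tuples for one MS/MS spectrum.
    Returns (masked_input, targets), where targets maps the index of each
    masked peak to its original m/z (what a model would be asked to predict).
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Mask at least one peak, but never more peaks than the spectrum has.
    n_mask = min(len(peaks), max(1, round(len(peaks) * mask_fraction)))
    masked_idx = set(rng.sample(range(len(peaks)), n_mask))

    masked_input, targets = [], {}
    for i, (mz, intensity) in enumerate(peaks):
        if i in masked_idx:
            masked_input.append((mask_value, intensity))  # m/z hidden
            targets[i] = mz                               # ground truth kept aside
        else:
            masked_input.append((mz, intensity))
    return masked_input, targets

# Toy spectrum with made-up peaks, for demonstration only.
spectrum = [(85.03, 0.12), (127.04, 0.55), (145.05, 1.00), (163.06, 0.31), (181.07, 0.08)]
masked, targets = mask_peaks(spectrum, mask_fraction=0.4, seed=0)
```

A model trained on such pairs never needs a structural annotation: the spectrum itself supplies the labels, which is what makes pretraining on unannotated data possible.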

Technical Details

DreaMS is a transformer with approximately 116 million parameters that operates directly on MS/MS peak lists. It was pretrained on GeMS (GNPS Experimental Mass Spectra), a corpus of unannotated spectra mined from the MassIVE/GNPS repository; roughly 24 million curated high-quality spectra were used for the core pretraining. The self-supervised objective combines masked-peak prediction, analogous to masked language modeling, with prediction of chromatographic retention order, and the pretrained model outputs 1024-dimensional embeddings. For downstream tasks, the backbone is fine-tuned on smaller labeled sets (for example, spectra paired with molecular structures from MoNA). DreaMS achieved state-of-the-art performance on spectral annotation, molecular fingerprint prediction, and chemical property prediction benchmarks, and its embeddings improved spectral similarity and library matching over established fragmentation-based metrics.
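
The retention-order signal can be sketched in the same spirit. The text above says only that the model predicts chromatographic retention order; framing this as a pairwise binary label ("does spectrum A elute before spectrum B in the same run?") is an assumption made for this example, and `retention_order_pairs` is a hypothetical helper, not part of the DreaMS codebase.

```python
from itertools import combinations

def retention_order_pairs(run):
    """Build pairwise retention-order labels from one LC-MS run.

    run: list of (spectrum_id, retention_time_seconds) tuples.
    Returns a list of (id_a, id_b, label), where label is 1 if spectrum a
    elutes before spectrum b and 0 otherwise. Pairs with identical
    retention times carry no order information and are skipped.
    """
    pairs = []
    for (id_a, rt_a), (id_b, rt_b) in combinations(run, 2):
        if rt_a == rt_b:
            continue
        pairs.append((id_a, id_b, 1 if rt_a < rt_b else 0))
    return pairs

# Made-up retention times, for demonstration only.
run = [("s1", 62.4), ("s2", 310.9), ("s3", 188.0)]
pairs = retention_order_pairs(run)
```

As with peak masking, the labels come for free from the acquisition metadata, so no structural annotation is required.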

Applications

DreaMS targets researchers in untargeted metabolomics, natural product discovery, environmental and exposomics analysis, and any field that relies on interpreting large volumes of MS/MS data. Its embeddings can be dropped into existing workflows to improve spectral library matching, prioritize candidate structures, predict molecular fingerprints and properties, and build molecular networks that cluster related compounds. The released DreaMS Atlas lets analysts place their own spectra in the context of hundreds of millions of community spectra, surfacing potential novel metabolites that conventional library search would miss.
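
Embedding-based library matching reduces to a nearest-neighbor search. The sketch below ranks reference entries by cosine similarity to a query embedding. The 4-dimensional vectors and the names `ref_A`, `ref_B`, and `ref_C` are made-up placeholders standing in for 1024-dimensional DreaMS embeddings, and cosine similarity is assumed here as the comparison metric; how embeddings are actually computed is left to the released code and documentation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_matches(query, library, k=2):
    """Return the k library entries most similar to the query, best first."""
    scored = [(name, cosine(query, emb)) for name, emb in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy reference library: placeholder vectors, not real embeddings.
library = {
    "ref_A": [0.9, 0.1, 0.0, 0.2],
    "ref_B": [0.8, 0.2, 0.1, 0.3],
    "ref_C": [0.0, 0.9, 0.4, 0.0],
}
query = [0.9, 0.1, 0.0, 0.25]
matches = top_k_matches(query, library, k=2)
```

The same similarity scores can serve as edge weights when building a molecular network: connecting each spectrum to its top-k neighbors yields a k-nearest-neighbor graph in which related compounds cluster together.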

Impact

DreaMS demonstrates that the foundation-model paradigm transfers to mass spectrometry, giving the metabolomics field a reusable, pretrained representation rather than a collection of bespoke task-specific models. Its fully open release (code, weights, documentation, the GeMS training data, and the DreaMS Atlas of roughly 201 million spectra) lowers the barrier to downstream method development and reproducibility. As a Nature Biotechnology publication with broad community resources, it is positioned to become a standard embedding backbone for spectral annotation, much as protein language models became standard for sequence analysis. Key limitations include reliance on GNPS-derived spectra, which may bias coverage toward well-studied compound classes, and the usual need for task-specific fine-tuning to reach peak performance on specialized prediction problems.

Citation

Bushuiev, R., et al. (2025) Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS. Nature Biotechnology.

DOI: 10.1038/s41587-025-02663-3

Recent citations

Papers that recently cited this model.

Benchmarking MS/MS Featurization Strategies for Machine Learning-Driven Metabolite Structure Annotation.
Roger Giné, Iván Pérez-López, Josep M. Badia, et al.
Journal of the American Society for Mass Spectrometry · Jun 2026
Citations: 0

Why machine learning fails at mass spectrometry for small molecules.
Ling Min Serena Khoo, R. Barzilay
Nature Metabolism · Jun 2026
Citations: 0

Metabolism as the biochemical language for mechanistic biomedical AI.
Li Li, Yujue Wang, Chengju Luo, et al.
Nature Metabolism · Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Advancing materials discovery through artificial intelligence
Martin Otyepka, M. Pykal, Michal Otyepka
Applied Materials Today · Dec 2025
Citations: 20

A Perspective on Unintentional Fragments and Their Impact on the Dark Metabolome, Untargeted Profiling, Molecular Networking, Public Data, and Repository Scale Analysis
Yasin El Abiead, Ipsita Mohanty, Shipei Xing, et al.
JACS Au · Dec 2025
Citations: 11

Illuminating the universe of enzyme catalysis in the era of artificial intelligence.
Jason Yang, Francesca-Zhoufan Li, Yueming Long, et al.
Cell Systems · Aug 2025
Citations: 9

A versatile toolkit for drug metabolism studies with GNPS2: from drug development to clinical monitoring.
J. Yu, Young Beom Kwak, Kyung Hwa Kee, et al.
Nature Protocols · Sep 2025
Citations: 7

An evaluation methodology for machine learning-based tandem mass spectra similarity prediction
Michael Strobel, Alberto Gil-de-la-Fuente, Mohammad Reza Zare Shahneh, et al.
BMC Bioinformatics · Jul 2025
Citations: 7

Citations

Total Citations: 57
Influential: 3
References: 64

GitHub

Stars: 187
Forks: 46
Open Issues: 8
Contributors: 7
Last Push: 19d ago
Language: Jupyter Notebook
License: MIT

Fields of citing research

Chemistry: 77%
Medicine: 72%
Computer Science: 65%
Biology: 35%
Environmental Science: 18%
Materials Science: 5%
Engineering: 4%
Physics: 4%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 98 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 92
Model Openness Framework: Class II (Open Tooling)

Resources

GitHub Repository, Research Paper, Documentation, Dataset
