bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Metabolomics · Small molecule

DreaMS

IOCB Prague / MIT

A 116M-parameter self-supervised transformer, pretrained on millions of tandem mass spectra, that produces general-purpose molecular embeddings for spectral annotation and property prediction.

Released: May 2025
Parameters: 116 Million

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) is a self-supervised transformer that learns general-purpose molecular representations directly from tandem mass spectrometry (MS/MS) data. Untargeted metabolomics experiments routinely generate enormous numbers of fragmentation spectra, yet the vast majority remain unannotated because reference libraries cover only a small fraction of the chemical universe. DreaMS reframes this problem the way language models reframed natural language: rather than relying on hand-built rules or scarce labeled examples, it pretrains on millions of unannotated spectra to build a representation that transfers across many downstream metabolomics tasks.

The model was developed by Roman Bushuiev and colleagues in the Pluskal lab at the Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences (IOCB Prague), together with collaborators at MIT, and published in Nature Biotechnology in May 2025. It is the first mass-spectrometry foundation model to demonstrate that a single pretrained backbone can deliver state-of-the-art results across spectral annotation, molecular fingerprint prediction, and chemical property prediction without task-specific architectures.

By treating MS/MS spectra as the native input modality—rather than first projecting them onto a molecular structure—DreaMS provides a reusable embedding layer for the metabolomics community, analogous to the role ESM embeddings play for proteins.

Key Features

  • Self-supervised pretraining on unlabeled spectra: DreaMS learns from millions of unannotated MS/MS spectra, removing the dependence on costly, library-limited structural annotations that constrains supervised approaches.
  • Dual pretraining objectives: The model is trained to predict masked spectral peaks and to recover chromatographic retention orders, two complementary signals that force it to capture chemically meaningful structure.
  • General-purpose 1024-dimensional embeddings: A single frozen or fine-tuned representation supports spectral similarity, fingerprint prediction, property prediction (including fluorine detection), and molecular networking.
  • DreaMS Atlas: The authors released a molecular network of roughly 201 million MS/MS spectra annotated with DreaMS representations, enabling exploration of the unannotated metabolome at scale.
  • Fully open release: Code, pretrained weights, documentation, and the underlying GeMS dataset are publicly available under an MIT license.
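
The retention-order objective listed above can be made concrete with a small sketch. It assumes one plausible reading: spectra from the same LC-MS run are paired, and the label records which of the two elutes first. The function name `retention_order_pairs`, the spectrum IDs, and the retention times are hypothetical, and the authors' actual formulation may differ.

```python
from itertools import combinations

def retention_order_pairs(run):
    """Build (id_a, id_b, label) training examples from one LC-MS run.

    run: list of (spectrum_id, retention_time_seconds) tuples.
    label is 1 if spectrum a elutes before spectrum b, else 0.
    Pairs with identical retention times are skipped as uninformative.
    """
    examples = []
    for (id_a, rt_a), (id_b, rt_b) in combinations(run, 2):
        if rt_a == rt_b:
            continue
        examples.append((id_a, id_b, 1 if rt_a < rt_b else 0))
    return examples

# Hypothetical run with three spectra and made-up retention times.
run = [("s1", 62.0), ("s2", 310.5), ("s3", 145.2)]
pairs = retention_order_pairs(run)
```

The labels come for free from acquisition metadata, which is why an objective of this kind needs no structural annotation.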

Technical Details

DreaMS is a transformer with approximately 116 million parameters that operates directly on peak lists from MS/MS spectra. It was pretrained on the GeMS (GNPS Experimental Mass Spectra) dataset, a corpus of unannotated spectra mined from the MassIVE/GNPS repository, with roughly 24 million curated high-quality spectra used for the core pretraining. The self-supervised objective combines masked-peak prediction, analogous to masked language modeling, with prediction of chromatographic retention orders. The pretrained model outputs a 1024-dimensional embedding for each spectrum.

For downstream tasks the backbone is fine-tuned on smaller labeled sets (for example, spectra paired with molecular structures from MoNA). Across spectral annotation, molecular fingerprint prediction, and chemical property prediction benchmarks, DreaMS achieved state-of-the-art performance, and its embeddings improved spectral similarity and library matching over established fragmentation-based metrics.
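
To illustrate the masked-peak idea, here is a minimal sketch of how a peak list might be corrupted before being shown to a model. It is not the authors' implementation: the 0.3 mask fraction, uniform sampling of peaks, the `MASK_MZ` sentinel, and the toy spectrum are all assumptions chosen for illustration.

```python
import random

MASK_MZ = -1.0  # hypothetical sentinel standing in for a hidden m/z value

def mask_peaks(peaks, mask_fraction=0.3, seed=0):
    """Hide the m/z of a random subset of peaks.

    peaks: list of (mz, intensity) tuples.
    Returns (corrupted_peaks, targets), where targets maps the index of each
    masked peak to its true m/z, i.e. the value a model would have to predict.
    """
    if not peaks:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(mask_fraction * len(peaks)))
    masked_idx = set(rng.sample(range(len(peaks)), n_mask))
    corrupted, targets = [], {}
    for i, (mz, intensity) in enumerate(peaks):
        if i in masked_idx:
            corrupted.append((MASK_MZ, intensity))  # keep intensity, hide m/z
            targets[i] = mz
        else:
            corrupted.append((mz, intensity))
    return corrupted, targets

# Toy spectrum with made-up peaks (m/z, relative intensity).
spectrum = [(91.05, 0.40), (119.05, 1.00), (147.04, 0.25), (165.05, 0.10), (193.05, 0.65)]
corrupted, targets = mask_peaks(spectrum)
```

As in masked language modeling, the training signal is the gap between the model's prediction at each masked position and the value stored in `targets`.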

Applications

DreaMS targets researchers in untargeted metabolomics, natural product discovery, environmental and exposomics analysis, and any field that relies on interpreting large volumes of MS/MS data. Its embeddings can be dropped into existing workflows to improve spectral library matching, prioritize candidate structures, predict molecular fingerprints and properties, and build molecular networks that cluster related compounds. The released DreaMS Atlas lets analysts place their own spectra in the context of roughly 201 million community spectra, surfacing potential novel metabolites that conventional library search would miss.
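
Library matching and molecular networking both reduce to comparing embedding vectors. The sketch below uses cosine similarity and a thresholded nearest-neighbour graph, which is one common construction and not necessarily how the DreaMS Atlas was built. The 4-dimensional vectors stand in for real 1024-dimensional embeddings; the names `ref_A`, `ref_B`, `ref_C` and both helper functions are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_library_match(query, library):
    """Return (name, similarity) of the library embedding closest to query."""
    scored = ((name, cosine(query, emb)) for name, emb in library.items())
    return max(scored, key=lambda item: item[1])

def knn_edges(embeddings, k=1, min_sim=0.7):
    """Link each spectrum to its k most similar neighbours above min_sim."""
    edges = set()
    for a, emb_a in embeddings.items():
        neighbours = sorted(
            ((cosine(emb_a, emb_b), b) for b, emb_b in embeddings.items() if b != a),
            reverse=True,
        )
        for sim, b in neighbours[:k]:
            if sim >= min_sim:
                edges.add(tuple(sorted((a, b))))  # undirected edge
    return sorted(edges)

# Toy stand-ins for embeddings of three reference spectra and one query.
library = {
    "ref_A": [1.0, 0.0, 0.0, 0.0],
    "ref_B": [0.0, 1.0, 0.0, 0.0],
    "ref_C": [0.6, 0.8, 0.0, 0.0],
}
query = [0.9, 0.1, 0.0, 0.0]

match_name, match_score = best_library_match(query, library)
network = knn_edges(library, k=1, min_sim=0.7)
```

With real embeddings the brute-force loops would normally be replaced by an approximate nearest-neighbour index, since all-pairs comparison does not scale to millions of spectra.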

Impact

DreaMS demonstrates that the foundation-model paradigm transfers to mass spectrometry, providing the metabolomics field with a reusable, pretrained representation rather than a collection of bespoke task-specific models. Its fully open release (code, weights, documentation, the GeMS training data, and the 201-million-spectrum DreaMS Atlas) lowers the barrier for downstream method development and reproducibility. As a Nature Biotechnology publication with broad community resources, it is positioned to become a standard embedding backbone for spectral annotation, much as protein language models became standard for sequence analysis. Key limitations include reliance on GNPS-derived spectra, which may bias coverage toward well-studied compound classes, and the need for task-specific fine-tuning to reach peak performance on specialized prediction problems.

Citation

Bushuiev, R., et al. (2025) Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS. Nature Biotechnology.

DOI: 10.1038/s41587-025-02663-3

Recent citations

Papers that recently cited this model.

  • Benchmarking MS/MS Featurization Strategies for Machine Learning-Driven Metabolite Structure Annotation.
    Roger Giné, Iván Pérez-López, Josep M. Badia, et al.
    Journal of the American Society for Mass Spectrometry · Jun 2026 · 0 citations
  • Why machine learning fails at mass spectrometry for small molecules.
    Ling Min Serena Khoo, R. Barzilay
    Nature Metabolism · Jun 2026 · 0 citations
  • Metabolism as the biochemical language for mechanistic biomedical AI.
    Li Li, Yujue Wang, Chengju Luo, et al.
    Nature Metabolism · Jun 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

  • Advancing materials discovery through artificial intelligence
    Martin Otyepka, M. Pykal, Michal Otyepka
    Applied Materials Today · Dec 2025 · 20 citations
  • A Perspective on Unintentional Fragments and Their Impact on the Dark Metabolome, Untargeted Profiling, Molecular Networking, Public Data, and Repository Scale Analysis
    Yasin El Abiead, Ipsita Mohanty, Shipei Xing, et al.
    JACS Au · Dec 2025 · 11 citations
  • Illuminating the universe of enzyme catalysis in the era of artificial intelligence.
    Jason Yang, Francesca-Zhoufan Li, Yueming Long, et al.
    Cell Systems · Aug 2025 · 9 citations
  • A versatile toolkit for drug metabolism studies with GNPS2: from drug development to clinical monitoring.
    J. Yu, Young Beom Kwak, Kyung Hwa Kee, et al.
    Nature Protocols · Sep 2025 · 7 citations
  • An evaluation methodology for machine learning-based tandem mass spectra similarity prediction
    Michael Strobel, Alberto Gil-de-la-Fuente, Mohammad Reza Zare Shahneh, et al.
    BMC Bioinformatics · Jul 2025 · 7 citations

Citations

Total Citations: 57
Influential: 3
References: 64

GitHub

Stars: 187
Forks: 46
Open Issues: 8
Contributors: 7
Last Push: 19d ago
Language: Jupyter Notebook
License: MIT

Fields of citing research

  • Chemistry: 77%
  • Medicine: 72%
  • Computer Science: 65%
  • Biology: 35%
  • Environmental Science: 18%
  • Materials Science: 5%
  • Engineering: 4%
  • Physics: 4%

Share of citing papers in each field. A paper can belong to several fields, so the percentages sum to more than 100%.

Openness

bio.rodeo openness: Fully open · usable and reproducible
98 · Open
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 92
Model Openness Framework: Class II (Open Tooling)

Resources

GitHub Repository · Research Paper · Documentation · Dataset