A transformer foundation model pretrained on millions of MS/MS spectra that improves identification of challenging isomers and produces embeddings for direct biological interpretation.
LSM-MS2 is a transformer-based foundation model for tandem mass spectrometry (MS/MS) developed by Matterworks, Inc. (Somerville, MA) and described in an arXiv preprint posted in October 2025. It builds on the company's earlier self-supervised model, LSM1-MS2, which appeared as a ChemRxiv preprint in February 2024. The model targets one of the central bottlenecks in metabolomics and small-molecule analysis: the vast majority of fragmentation spectra collected in untargeted experiments cannot be confidently matched to a chemical structure, leaving most signals "dark."
Rather than treating spectrum-to-structure matching as a lookup against a reference library, LSM-MS2 learns a continuous "semantic chemical space" by pretraining on millions of MS/MS spectra. In this learned embedding space, spectra from chemically related molecules sit near one another, which helps the model resolve compounds that are notoriously difficult to distinguish — most notably isomers that share a molecular formula but differ in structure and produce highly similar fragmentation patterns. The same embeddings double as a general-purpose representation of a sample's chemical state, enabling biological and disease-state interpretation without a separate, task-specific model.
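The idea that nearby embeddings correspond to chemically related spectra can be illustrated with a small similarity search. What follows is a minimal sketch, not the LSM-MS2 method: the preprint does not disclose how embeddings are computed or compared, so cosine similarity is an assumed metric, the three-dimensional vectors are invented toy values rather than model outputs, and the names `cosine_similarity`, `rank_library`, `toy_library`, and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_library(query, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy embeddings (NOT model outputs). Leucine and isoleucine are
# structural isomers, so their toy vectors are deliberately placed close together.
toy_library = {
    "leucine": [0.9, 0.1, 0.3],
    "isoleucine": [0.8, 0.3, 0.3],
    "glucose": [0.1, 0.9, 0.2],
}
toy_query = [0.88, 0.12, 0.31]  # hypothetical embedding of an unknown spectrum

ranking = rank_library(toy_query, toy_library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two isomers both score far above the unrelated compound, and the small gap between them is what decides the identification. A real embedding would have many more dimensions, and its ability to separate isomers depends entirely on how the model was trained.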
LSM-MS2 sits at the intersection of metabolomics and small-molecule cheminformatics, extending the pretrain-then-apply foundation-model paradigm (now common in protein and single-cell biology) to raw mass-spectral data. It is a commercial model: the authors are Matterworks employees, and the work involves patented technology, with inference offered through the company's Pyxis platform rather than as open code or weights.
LSM-MS2 is described as a transformer-based foundation model pretrained with self-supervision on millions of MS/MS spectra to produce a chemically meaningful embedding representation; the preprint does not disclose the precise architecture, the tokenization of spectra, the pretraining loss, or the parameter count. Evaluation draws on a reference library of roughly 1.8 million spectra. Reported results include an improvement of approximately 30% in identifying challenging isomers and a 42% increase in correct identifications in complex biological samples. Downstream biological tasks demonstrated from the embeddings include antipsychotic-overdose classification in mice, septic-shock prediction in emergency-department patients (macro F1 of 0.80), and cystic-fibrosis detection via unsupervised clustering. No public code or pretrained weights have been released; inference is available only through the commercial Pyxis platform.
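For readers unfamiliar with the macro F1 metric quoted for the septic-shock task, the sketch below shows how it is computed: F1 is calculated separately for each class and the per-class values are averaged with equal weight, so a rare class counts as much as a common one. The labels and predictions are invented for illustration and have no connection to the study's data; `macro_f1`, `toy_true`, and `toy_pred` are hypothetical names.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented toy labels (NOT data from the preprint).
toy_true = ["shock", "shock", "no_shock", "no_shock", "no_shock", "no_shock"]
toy_pred = ["shock", "no_shock", "no_shock", "no_shock", "no_shock", "shock"]

score = macro_f1(toy_true, toy_pred)
print(f"macro F1 = {score:.3f}")  # per-class F1: shock 0.5, no_shock 0.75
```

Here plain accuracy would be 4/6, about 0.67, while the macro F1 is 0.625, because the minority "shock" class is predicted less reliably and is weighted equally in the average.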
LSM-MS2 is aimed at metabolomics, clinical mass spectrometry, and small-molecule analytics, where untargeted MS/MS experiments routinely generate far more spectra than can be annotated. By improving isomer discrimination and identification in complex matrices, it can increase the fraction of usable signal in metabolomic surveys, biomarker discovery, drug-metabolism studies, and toxicology. Because the same embeddings feed disease-state classification, the model is positioned for translational workflows — for example, stratifying patients or predicting clinical outcomes from a sample's spectral fingerprint with minimal task-specific labeling. Access is through Matterworks' Pyxis platform, so the primary beneficiaries are laboratories adopting that commercial pipeline.
LSM-MS2 demonstrates that the foundation-model paradigm can be applied to raw tandem mass spectra, learning a transferable chemical embedding that improves hard identification problems and doubles as a substrate for biological interpretation. This is a meaningful direction for a field where most collected spectra remain unannotated, and the reported isomer and complex-matrix gains, if they hold up under independent evaluation, would address a long-standing pain point. Its significance is tempered by openness and maturity caveats: it is a preprint that has not been peer-reviewed; the architecture, exact pretraining-data volume, and parameter count are undisclosed; the work covers patented technology; and neither code nor weights are public, with use gated behind the commercial Pyxis platform. Broader scientific adoption will depend on independent benchmarking and on how accessible the model becomes outside that platform.
Asher, G., et al. (2025) LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation. arXiv.org.
DOI: 10.48550/arXiv.2510.26715
Asher, G., et al. (2024) LSM1-MS2: A Self-Supervised Foundation Model for Tandem Mass Spectrometry Applications, Encompassing Extensive Chemical Property Predictions and Spectral Matching. American Chemical Society (ACS).
DOI: 10.26434/chemrxiv-2024-k06gb