LSM-MS2 is a transformer-based foundation model for tandem mass spectrometry (MS/MS) developed by Matterworks, Inc. (Somerville, MA) and described in an arXiv preprint posted in October 2025. It builds on the company's earlier self-supervised model, LSM1-MS2, which appeared as a ChemRxiv preprint in February 2024. The model targets one of the central bottlenecks in metabolomics and small-molecule analysis: the vast majority of fragmentation spectra collected in untargeted experiments cannot be confidently matched to a chemical structure, leaving most signals "dark."

Rather than treating spectrum-to-structure matching as a lookup against a reference library, LSM-MS2 learns a continuous "semantic chemical space" by pretraining on millions of MS/MS spectra. In this learned embedding space, spectra from chemically related molecules sit near one another, which helps the model resolve compounds that are notoriously difficult to distinguish — most notably isomers that share a molecular formula but differ in structure and produce highly similar fragmentation patterns. The same embeddings double as a general-purpose representation of a sample's chemical state, enabling biological and disease-state interpretation without a separate, task-specific model.
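To make the idea of matching in an embedding space concrete, here is a minimal sketch. It is not Matterworks' implementation, which is not public. It assumes each spectrum has already been mapped to a fixed-length vector. The three-dimensional vectors, the compound labels, and the `nearest_reference` helper are all hypothetical and chosen only for illustration. Under those assumptions, identification reduces to finding the reference embedding with the highest cosine similarity to the query.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_reference(query, library):
    """Return (name, similarity) for the library embedding closest to the query."""
    return max(
        ((name, cosine(query, emb)) for name, emb in library.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical embeddings. Real ones would be far higher-dimensional and would
# come from the pretrained model, which is not publicly available.
library = {
    "leucine":    [0.90, 0.10, 0.30],
    "isoleucine": [0.70, 0.60, 0.20],  # isomer of leucine
    "glucose":    [0.05, 0.20, 0.95],
}
query = [0.72, 0.55, 0.25]

name, score = nearest_reference(query, library)
print(name, round(score, 3))
```

In this toy setup the two isomers get clearly different similarity scores because their embeddings were placed apart by hand; the claim in the preprint is that pretraining produces such separation from the spectra themselves.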

LSM-MS2 sits at the intersection of metabolomics and small-molecule cheminformatics, extending the pretrain-then-apply foundation-model paradigm (now common in protein and single-cell biology) to raw mass-spectral data. It is a commercial model: the authors are Matterworks employees, and the work involves patented technology, with inference offered through the company's Pyxis platform rather than as open code or weights.

Key Features

Semantic spectral embeddings: Pretraining maximizes separation in spectral space so that chemically related spectra cluster together, yielding embeddings that support both identification and downstream interpretation.
Improved isomer resolution: The authors report roughly a 30% improvement in correctly identifying challenging isomeric compounds relative to conventional spectral-matching approaches.
Gains in complex matrices: In complex biological samples, the authors report a 42% increase in correct identifications, with performance maintained at low analyte concentrations.
Direct biological interpretation: Spectral embeddings are used directly for disease-state differentiation and clinical-outcome tasks, reducing the labeled data needed for each new question.
Foundation-model reuse: A single pretrained backbone serves both annotation and biological-readout tasks, mirroring the foundation-model approach used in other areas of computational biology.
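For context on the baseline mentioned in the isomer-resolution item, conventional spectral matching is commonly a cosine score between binned peak lists. The sketch below assumes unit-width m/z bins and uses made-up peak lists; the function names and the spectra are illustrative assumptions, and the preprint's actual baseline settings are not given in this summary. It shows why isomers are hard for this kind of scoring: two candidates with near-identical fragments receive near-identical scores.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def cosine_score(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up (m/z, intensity) peak lists for illustration only.
query     = [(86.1, 100.0), (69.1, 20.0), (44.0, 35.0)]
isomer_a  = [(86.1, 100.0), (69.1, 18.0), (44.0, 30.0)]
isomer_b  = [(86.1, 100.0), (69.1, 25.0), (44.0, 33.0)]
unrelated = [(163.1, 100.0), (85.0, 60.0), (59.0, 40.0)]

score_a = cosine_score(query, isomer_a)
score_b = cosine_score(query, isomer_b)
score_u = cosine_score(query, unrelated)
print(score_a, score_b, score_u)
```

Both isomer candidates score above 0.99 against the query and differ from each other only in the fourth decimal place, while the unrelated spectrum shares no bins and scores zero. A ranking based on such scores can separate unrelated compounds easily but has little margin for telling isomers apart.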

Technical Details

LSM-MS2 is described as a transformer-based foundation model pretrained with self-supervision on millions of MS/MS spectra to produce a chemically meaningful embedding; the preprint does not disclose the precise architecture, the tokenization of spectra, the pretraining loss, or the parameter count. Evaluation draws on a reference library of roughly 1.8 million spectra. Reported results include an approximately 30% improvement in identifying challenging isomers and a 42% increase in correct identifications in complex biological samples. Downstream biological tasks demonstrated from the embeddings include antipsychotic-overdose classification in mice, septic-shock prediction in emergency-department patients (macro F1 of 0.80), and cystic-fibrosis detection via unsupervised clustering. No public code or pretrained weights are released; inference is available only through the commercial Pyxis platform.
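The septic-shock result is given as a macro F1. For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so a minority class counts as much as the majority class. The sketch below uses made-up labels, not the study's data, and the `macro_f1` helper is written here only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for eight patients.
y_true = ["shock", "shock", "shock",
          "no_shock", "no_shock", "no_shock", "no_shock", "no_shock"]
y_pred = ["shock", "shock", "no_shock",
          "no_shock", "no_shock", "no_shock", "no_shock", "shock"]

result = macro_f1(y_true, y_pred)
print(round(result, 3))
```

Here the "shock" class has an F1 of 2/3 and the "no_shock" class an F1 of 0.8, so the macro F1 is their plain average, about 0.733.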

Applications

LSM-MS2 is aimed at metabolomics, clinical mass spectrometry, and small-molecule analytics, where untargeted MS/MS experiments routinely generate far more spectra than can be annotated. By improving isomer discrimination and identification in complex matrices, it can increase the fraction of usable signal in metabolomic surveys, biomarker discovery, drug-metabolism studies, and toxicology. Because the same embeddings feed disease-state classification, the model is positioned for translational workflows — for example, stratifying patients or predicting clinical outcomes from a sample's spectral fingerprint with minimal task-specific labeling. Access is through Matterworks' Pyxis platform, so the primary beneficiaries are laboratories adopting that commercial pipeline.

Impact

LSM-MS2 demonstrates that the foundation-model paradigm can be applied to raw tandem mass spectra, learning a transferable chemical embedding that improves hard identification problems and doubles as a substrate for biological interpretation. This is a meaningful direction for a field where most collected spectra remain unannotated, and the reported isomer and complex-matrix gains, if they hold up under independent evaluation, would address a long-standing pain point. Its significance is tempered by openness and maturity caveats: it is a preprint that has not been peer-reviewed; the architecture, exact pretraining-data volume, and parameter count are undisclosed; the work covers patented technology; and neither code nor weights are public, with use gated behind the commercial Pyxis platform. Broader scientific adoption will depend on independent benchmarking and on how accessible the model becomes outside that platform.

Citations

LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation

Preprint

Asher, G., et al. (2025) LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation. arXiv.org.

DOI: 10.48550/arXiv.2510.26715

LSM1-MS2: A Self-Supervised Foundation Model for Tandem Mass Spectrometry Applications, Encompassing Extensive Chemical Property Predictions and Spectral Matching

Asher, G., et al. (2024) LSM1-MS2: A Self-Supervised Foundation Model for Tandem Mass Spectrometry Applications, Encompassing Extensive Chemical Property Predictions and Spectral Matching. American Chemical Society (ACS).

DOI: 10.26434/chemrxiv-2024-k06gb

