LLM4MS is a method for compound identification in mass spectrometry that repurposes a general-purpose large language model to generate discriminative embeddings of tandem mass (MS/MS) spectra. Identifying an unknown compound from its fragmentation spectrum is a core task in metabolomics and small-molecule analysis, and the dominant strategy is spectral library search: encode a query spectrum as a vector and rank candidate reference spectra by similarity. The quality of that vector representation largely determines identification accuracy, and prior embedding methods such as Spec2Vec were trained from scratch on spectral data alone.

The central idea of LLM4MS is that the latent knowledge already captured by a pretrained LLM can be transferred to the spectral domain. Mass spectra are first textualized into a token sequence the model can read, and the LLM2Vec recipe is used to convert the causal, decoder-style LLM into an encoder that produces a single embedding per spectrum. The model is then refined so that spectra from structurally similar compounds map to nearby points in embedding space, enabling nearest-neighbor retrieval against a reference library.
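As a rough illustration of the textualization step, the sketch below renders a peak list as a plain string that a text tokenizer could consume. The output format (space-separated `m/z:intensity` tokens, intensities rescaled so the base peak is 100), the function name `textualize_spectrum`, and the `top_k` and `mz_decimals` parameters are all assumptions made for this example; the source does not specify the exact serialization LLM4MS uses.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def textualize_spectrum(peaks: List[Peak], top_k: int = 50, mz_decimals: int = 1) -> str:
    """Render a peak list as text (hypothetical format, not taken from the paper)."""
    if not peaks:
        return ""
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        return ""
    # Keep the top_k most intense peaks, then restore ascending m/z order.
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_k]
    strongest.sort(key=lambda p: p[0])
    # Rescale so the base peak is 100 and round intensities to integers.
    tokens = [
        f"{mz:.{mz_decimals}f}:{round(100 * intensity / base)}"
        for mz, intensity in strongest
    ]
    return " ".join(tokens)


# Toy peak list (illustrative values only).
example = [(41.04, 120.0), (91.06, 999.0), (65.04, 210.0), (39.02, 45.0)]
text = textualize_spectrum(example, top_k=3)
print(text)  # 41.0:12 65.0:21 91.1:100
```

A string of this kind would then be tokenized and passed through the adapted encoder to obtain one embedding per spectrum.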

The work was developed by Yang Xu, Yixiao Ma, Weijie Xu, Zuliang Yang, and Kai Ming Ting at the School of Artificial Intelligence, Nanjing University, and published in Communications Chemistry on November 4, 2025. It sits at the boundary between general-purpose language modeling and analytical chemistry, demonstrating that an off-the-shelf LLM backbone can be adapted into a fast, accurate spectral encoder.

Key Features

LLM-derived spectral embeddings: Applies the LLM2Vec transformation to a pretrained large language model, turning a text-oriented decoder into an encoder that emits a fixed-length embedding for each textualized mass spectrum.
Structure-aware refinement: Tanimoto similarity between the molecular structures behind spectrum pairs supplies the feedback signal used to fine-tune the embeddings, so that representation distance reflects chemical similarity.
State-of-the-art retrieval accuracy: Reaches 66.3% Recall@1 and 92.7% Recall@10 on the NIST23 test set, a 13.7 percentage-point Recall@1 improvement over the Spec2Vec baseline.
Ultra-fast matching: Once spectra are embedded, library search reduces to vector similarity, sustaining roughly 15,000 queries per second and making large-scale screening practical.
Transfer from a fixed backbone: Leverages knowledge already present in the general LLM rather than training a spectral model from scratch, illustrating a reuse-oriented path to domain-specific encoders.
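
The retrieval step and the Recall@k metric mentioned above can be sketched in a few lines. This is a minimal brute-force version on toy two-dimensional vectors, assuming cosine similarity as the matching score; the helper names `cosine`, `search`, and `recall_at_k` and the toy data are hypothetical, and the code does not attempt to reproduce the reported throughput, which would need vectorized or indexed search.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: Sequence[float], library: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k library entries most similar to the query."""
    ranked = sorted(library, key=lambda cid: cosine(query, library[cid]), reverse=True)
    return ranked[:k]


def recall_at_k(
    queries: List[Tuple[Sequence[float], str]],
    library: Dict[str, Sequence[float]],
    k: int,
) -> float:
    """Fraction of queries whose true compound id appears in the top-k results."""
    hits = sum(1 for vec, truth in queries if truth in search(vec, library, k))
    return hits / len(queries)


# Toy embeddings (illustrative only, not real spectra).
library = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
queries = [([0.9, 0.1], "A"), ([0.6, 0.5], "B")]
r1 = recall_at_k(queries, library, 1)
r3 = recall_at_k(queries, library, 3)
print(r1, r3)  # 0.5 1.0
```

In the toy data the second query ranks "C" first and its true compound "B" last, so it counts as a miss at k=1 and a hit at k=3.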

Technical Details

LLM4MS adapts a pretrained, transformer-based large language model with LLM2Vec, a procedure that converts a decoder-only LLM into a text encoder. Each mass spectrum is first textualized into a token sequence that the model can read, and the resulting encoder is then specialized to spectral data. Refinement is supervised by Tanimoto scores computed between the molecular structures of spectrum pairs, which pulls the embeddings of structurally similar compounds together. Evaluation searched a million-scale, open-source in-silico reference library, with NIST23 as the held-out test set. In that setting LLM4MS achieved 66.3% Recall@1 and 92.7% Recall@10, a 13.7 percentage-point Recall@1 gain over Spec2Vec, the strongest prior method, at a throughput of nearly 15,000 queries per second. The published version of record is released under a CC BY-NC-ND 4.0 license.
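
To make the Tanimoto supervision concrete, the sketch below computes the Tanimoto coefficient between two molecular fingerprints, represented here as sets of "on" bit indices, and uses it as a regression target for the cosine similarity of two spectrum embeddings. The squared-error objective, the function names, and the toy fingerprints are assumptions for illustration; the source states only that Tanimoto scores supply the supervision and does not give the loss.

```python
import math
from typing import Sequence, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| over sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_loss(emb_a: Sequence[float], emb_b: Sequence[float], target: float) -> float:
    """Assumed objective: squared gap between embedding similarity and Tanimoto."""
    return (cosine(emb_a, emb_b) - target) ** 2


# Toy fingerprints sharing 2 of 6 distinct bits (illustrative only).
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
target = tanimoto(fp_a, fp_b)  # 2 / 6

# Orthogonal toy embeddings have cosine 0, so the loss equals target ** 2.
loss = pair_loss([1.0, 0.0], [0.0, 1.0], target)
print(round(target, 3), round(loss, 3))  # 0.333 0.111
```

In practice the fingerprints would come from a cheminformatics toolkit and the loss would be minimized by gradient descent over many spectrum pairs; both are outside the scope of this sketch.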

Applications

LLM4MS targets compound and metabolite identification workflows that rely on spectral library search, including untargeted metabolomics, natural-product discovery, and small-molecule profiling. Analysts and computational chemists benefit from higher top-1 retrieval accuracy without sacrificing speed: the embedding-plus-nearest-neighbor design scales to very large reference libraries, making it suitable for high-throughput screening pipelines where thousands of query spectra must be annotated quickly.

Impact

LLM4MS provides concrete evidence that general-purpose large language models can be transferred into analytical-chemistry tasks, recasting spectral identification as a representation-learning problem solved with a reused LLM backbone rather than a bespoke from-scratch model. Its accuracy gains over Spec2Vec and its high query throughput make it an appealing baseline for future spectral-embedding research. Notable limitations should temper expectations: at the time of review no public code repository, model card, or data card was located, and the CC BY-NC-ND license restricts commercial use and derivative works. Because the approach adapts an existing general LLM as a largely fixed backbone applied to new spectral inputs, its status as a standalone "foundation model" is borderline; it is better understood as a transfer-learning method built on top of one.

Citation

A large language model for deriving spectral embeddings for accurate compound identification in mass spectrometry

Xu, Y., Ma, Y., Xu, W., Yang, Z., & Ting, K. M. (2025). A large language model for deriving spectral embeddings for accurate compound identification in mass spectrometry. Communications Chemistry.

DOI: 10.1038/s42004-025-01708-7
