Adapts a pretrained large language model via LLM2Vec to derive discriminative spectral embeddings for compound identification in mass spectrometry, reaching 66.3% Recall@1 on NIST23.
LLM4MS is a method for compound identification in mass spectrometry that repurposes a general-purpose large language model to generate discriminative embeddings of tandem mass (MS/MS) spectra. Identifying an unknown compound from its fragmentation spectrum is a core task in metabolomics and small-molecule analysis, and the dominant strategy is spectral library search: encode a query spectrum as a vector and rank candidate reference spectra by similarity. The quality of that vector representation largely determines identification accuracy, and prior embedding methods such as Spec2Vec were trained from scratch on spectral data alone.
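The library-search step described above can be sketched in a few lines. This is a minimal illustration, not the LLM4MS implementation: it assumes cosine similarity as the ranking score, and the function names, compound names, and three-dimensional toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def rank_library(query, library, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine(query, vec)) for name, vec in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy reference library: hypothetical compounds with made-up embeddings.
library = {
    "compound_A": [1.0, 0.0, 0.0],
    "compound_B": [0.0, 1.0, 0.0],
    "compound_C": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

hits = rank_library(query, library, top_k=2)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

In this toy case the query lies closest to `compound_A`, so it is ranked first. At real library sizes the exhaustive loop would normally be replaced by a vectorized or approximate nearest-neighbor index.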
The central idea of LLM4MS is that the latent knowledge already captured by a pretrained LLM can be transferred to the spectral domain. Mass spectra are first textualized into a token sequence the model can read, and the LLM2Vec recipe is used to convert the causal, decoder-style LLM into an encoder that produces a single embedding per spectrum. The model is then refined so that spectra from structurally similar compounds map to nearby points in embedding space, enabling nearest-neighbor retrieval against a reference library.
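To make the textualization step concrete, the sketch below turns a peak list into a plain string that a language model tokenizer could consume. The exact text format used by LLM4MS is not given here, so everything in this example is an assumption: the `mz:intensity` token layout, the choice to keep only the strongest peaks, the rescaling to a base peak of 100, and the function name `textualize_spectrum`.

```python
def textualize_spectrum(peaks, top_n=5, decimals=1):
    """Render (m/z, intensity) peaks as a space-separated string.

    Hypothetical format: keep the top_n most intense peaks, rescale
    intensities so the base peak is 100, and list peaks by ascending m/z.
    """
    if not peaks:
        return ""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    base = max(intensity for _, intensity in strongest)
    if base <= 0:
        return ""  # no usable signal
    ordered = sorted(strongest, key=lambda p: p[0])
    return " ".join(
        f"{mz:.{decimals}f}:{round(100 * intensity / base)}"
        for mz, intensity in ordered
    )

# Made-up peak list for illustration only.
peaks = [(41.0, 120.0), (55.1, 300.0), (69.1, 1000.0), (83.1, 50.0)]
text = textualize_spectrum(peaks, top_n=3)
print(text)  # 41.0:12 55.1:30 69.1:100
```

The resulting string would then be tokenized and passed through the adapted encoder, which produces one embedding for the whole spectrum.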
The work was developed by Yang Xu, Yixiao Ma, Weijie Xu, Zuliang Yang, and Kai Ming Ting at the School of Artificial Intelligence, Nanjing University, and published in Communications Chemistry on November 4, 2025. It sits at the boundary between general-purpose language modeling and analytical chemistry, demonstrating that an off-the-shelf LLM backbone can be adapted into a fast, accurate spectral encoder.
LLM4MS adapts a pretrained transformer-based large language model using LLM2Vec, a procedure that converts a decoder-only LLM into a text encoder. Mass spectra are textualized into token sequences so the model can interpret their syntax and semantic content, and the model is then specialized to spectral data. Refinement uses Tanimoto scores computed between the molecular structures of spectrum pairs as supervision, pulling embeddings of structurally similar compounds together. Evaluation was performed against a million-scale open-source in-silico library with NIST23 as the held-out test set. On this benchmark LLM4MS achieved 66.3% Recall@1 and 92.7% Recall@10, outperforming Spec2Vec, the previous best method, while sustaining a throughput of nearly 15,000 queries per second. The published version of record is released under a CC BY-NC-ND 4.0 license.
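Two quantities named above, the Tanimoto score and Recall@k, have simple definitions that a short sketch can illustrate. The Tanimoto score is the size of the intersection of two feature sets divided by the size of their union; representing each molecule as a set of "on" fingerprint bits is an assumption made here for illustration, as are the function names and toy data. Recall@k is the fraction of queries whose correct compound appears among the top k ranked candidates.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

def recall_at_k(ranked_lists, truths, k):
    """Fraction of queries whose true label is in the top k of its ranking."""
    if not truths:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k]
    )
    return hits / len(truths)

# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits.
score = tanimoto({1, 4, 7, 9}, {1, 4, 8})
print(score)  # 0.4

# Three toy queries, each with true label "A" and a ranked candidate list.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truths = ["A", "A", "A"]
r1 = recall_at_k(rankings, truths, k=1)
r2 = recall_at_k(rankings, truths, k=2)
print(r1, r2)
```

Here only the first query has the correct answer at rank 1, so Recall@1 is one third, while two of the three queries succeed within the top 2.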
LLM4MS targets compound and metabolite identification workflows that rely on spectral library search, including untargeted metabolomics, natural-product discovery, and small-molecule profiling. Analysts and computational chemists benefit from higher top-1 retrieval accuracy without sacrificing speed: the embedding-plus-nearest-neighbor design scales to very large reference libraries, making it suitable for high-throughput screening pipelines where thousands of query spectra must be annotated quickly.
LLM4MS provides concrete evidence that general-purpose large language models can be transferred into analytical-chemistry tasks, recasting spectral identification as a representation-learning problem solved with a reused LLM backbone rather than a bespoke from-scratch model. Its accuracy gains over Spec2Vec and its high query throughput make it an appealing baseline for future spectral-embedding research. Notable limitations should temper expectations: at the time of review no public code repository, model card, or data card was located, and the CC BY-NC-ND license restricts commercial use and derivative works. Because the approach adapts an existing general LLM as a largely fixed backbone applied to new spectral inputs, its status as a standalone "foundation model" is borderline; it is better understood as a transfer-learning method built on top of one.
Xu, Y., et al. (2025) A large language model for deriving spectral embeddings for accurate compound identification in mass spectrometry. Communications Chemistry.
DOI: 10.1038/s42004-025-01708-7