Fudan University / Shanghai AI Laboratory / Tsinghua University / Westlake University / Tongji University / Shanghai Innovation Institute / University of British Columbia / Zhejiang University / Stony Brook University
A unified, pretrained transformer for reference-free de novo sequencing of unmodified and modified peptides directly from tandem mass spectra.
OmniNovo is a pretrained, unified deep learning model for de novo peptide sequencing, the task of reading amino acid sequences directly from tandem mass (MS/MS) spectra without matching against a reference protein database. Its central contribution is handling post-translational modifications (PTMs) within a single model. Most prior de novo sequencers are restricted to unmodified peptides or to a single modification type (for example, a dedicated phosphorylation model), which forces practitioners to swap tools whenever the modification of interest changes. OmniNovo instead learns the general fragmentation rules that govern how modified and unmodified peptides break apart in the mass spectrometer, so one model can decode diverse PTMs.
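To make the idea of shared fragmentation rules concrete, the following minimal sketch computes singly charged b- and y-ion m/z values for a short peptide with and without a phosphorylated serine. It is an illustration of standard fragment-mass arithmetic, not code from OmniNovo; the function name `fragment_ions` and the tuple-based peptide representation are assumptions made for this example, and only a handful of residues are included.

```python
# Monoisotopic residue masses in daltons (standard values, subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.010565    # H2O, carried by every y ion
PROTON = 1.007276    # charge carrier for a 1+ ion
PHOSPHO = 79.96633   # mass shift of phosphorylation (HPO3)


def fragment_ions(residues):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    `residues` is a list of (amino_acid, mod_mass_shift) tuples; this
    representation is hypothetical and chosen only for the example.
    """
    masses = [RESIDUE_MASS[aa] + shift for aa, shift in residues]

    # b ions: N-terminal prefixes b1 .. b(n-1).
    b_ions, prefix = [], 0.0
    for m in masses[:-1]:
        prefix += m
        b_ions.append(prefix + PROTON)

    # y ions: C-terminal suffixes y1 .. y(n-1), each with one water.
    y_ions, suffix = [], 0.0
    for m in reversed(masses[1:]):
        suffix += m
        y_ions.append(suffix + WATER + PROTON)

    return b_ions, y_ions


plain = [("A", 0.0), ("S", 0.0), ("K", 0.0)]
phospho = [("A", 0.0), ("S", PHOSPHO), ("K", 0.0)]

b_plain, y_plain = fragment_ions(plain)
b_mod, y_mod = fragment_ions(phospho)
```

Fragments that do not contain the modified serine (b1 and y1 here) have identical m/z in both peptides, while fragments that contain it (b2 and y2) are shifted by exactly the modification mass. The same rule applies to any modification, which is what allows a single model to treat a PTM as a mass-shifted residue.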
The model was introduced in December 2025 in the preprint "Accurate de novo sequencing of the modified proteome with OmniNovo" (arXiv:2512.12272). It was developed by a collaboration led by groups at Fudan University and Shanghai Artificial Intelligence Laboratory, with contributors from Tsinghua University, Westlake University, Tongji University, the Shanghai Innovation Institute, the University of British Columbia, Zhejiang University, and Stony Brook University. The corresponding authors include Siqi Sun (Fudan University) and senior proteomics and AI researchers across the consortium.
Within the de novo sequencing landscape, which includes tools such as Casanovo, InstaNovo, and π-PrimeNovo, OmniNovo positions itself as a single model that spans the unmodified and modified proteome, paired with rigorous false discovery rate (FDR) control so that its predictions can be trusted in real proteomics workflows.
OmniNovo is a non-autoregressive transformer that translates a tandem mass spectrum into an amino acid sequence in a single forward pass. The architecture comprises a 12-layer transformer encoder/decoder with a hidden dimension of 256, 16 attention heads, a feed-forward dimension of 768, and a vocabulary of 31 tokens (20 amino acids plus 11 PTM tokens), totaling roughly 35 million parameters. Training drew on approximately 51.8 million peptide-spectrum matches covering about 4.7 million unique precursors and eleven PTM types, assembled from public repositories including MassIVE-KB, PRIDE, and iProX. On the standard nine-species benchmark, OmniNovo reported 69.0% peptide recall (77.2% on a revised version of the benchmark), outperforming InstaNovo-P, Casanovo V2, and π-PrimeNovo, and it improved average performance on PTMBench by 27–29% over the modification-aware π-PrimeNovo baseline. On a real FGFR2 dataset, it identified 51% more peptides than a standard database approach at 1% FDR.
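The stated hyperparameters can be collected in a small configuration object, shown here as a sketch. No official code has been released, so the class name `OmniNovoConfigSketch`, the field names, and the placeholder PTM token names are all hypothetical; only the numeric values come from the description above. Whether the 31-token vocabulary also counts special tokens such as padding is not stated, so the sketch takes the 20 + 11 breakdown at face value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniNovoConfigSketch:
    """Hypothetical container for the reported hyperparameters."""
    num_layers: int = 12       # transformer layers
    d_model: int = 256         # hidden dimension
    num_heads: int = 16        # attention heads
    d_ff: int = 768            # feed-forward dimension
    num_amino_acids: int = 20  # canonical residues
    num_ptm_tokens: int = 11   # modification tokens

    @property
    def vocab_size(self) -> int:
        return self.num_amino_acids + self.num_ptm_tokens

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden dimension evenly.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads


# The 20 canonical amino acids in one-letter code.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
# Placeholder names: the source does not list the actual PTM tokens.
PTM_TOKENS = [f"<ptm_{i}>" for i in range(11)]
VOCAB = AMINO_ACIDS + PTM_TOKENS

cfg = OmniNovoConfigSketch()
```

With these values each attention head operates on 256 / 16 = 16 dimensions, and the output layer predicts over 31 tokens per sequence position.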
OmniNovo targets bottom-up proteomics workflows where post-translational modifications matter, such as profiling phosphorylation in signaling studies, characterizing antibody and protein therapeutics, and analyzing proteoforms that are difficult to capture with database search. Because it is reference-free, it is also useful for immunopeptidomics, metaproteomics, and other settings where the relevant sequences may be missing from reference databases. Its built-in FDR control makes the predictions usable for researchers who need confidence estimates rather than raw guesses.
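As an illustration of what filtering at a fixed FDR means in practice, the sketch below picks a score cutoff from a list of scored identifications using the generic target-decoy estimate (decoys divided by targets above the cutoff). The source does not describe how OmniNovo estimates FDR, so this is not its procedure; the function name `score_threshold_at_fdr` and the `(score, is_decoy)` input format are assumptions for this example, and tied scores are not given special handling.

```python
def score_threshold_at_fdr(scored, max_fdr=0.01):
    """Return the lowest target score whose estimated FDR is <= max_fdr.

    `scored` is an iterable of (score, is_decoy) pairs. The FDR at each
    rank is estimated as decoys / targets among all entries scoring at
    least as high. Returns None if no cutoff satisfies the bound.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        if decoys / targets <= max_fdr:
            best = score  # keep extending the cutoff downward
    return best


example = [(0.9, False), (0.8, False), (0.7, True), (0.6, False)]
loose_cutoff = score_threshold_at_fdr(example, max_fdr=0.4)
strict_cutoff = score_threshold_at_fdr(example, max_fdr=0.1)
```

In this toy input the strict bound stops at 0.8, before the first decoy, while the looser bound extends to 0.6 because one decoy among three targets is an estimated FDR of about 0.33. Identifications scoring at or above the returned cutoff are the ones reported at that FDR level.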
De novo sequencing of modified peptides has historically been fragmented across modification-specific tools, limiting its routine use. By unifying unmodified and modified sequencing in one pretrained model with calibrated FDR control, OmniNovo offers a path toward general-purpose interpretation of the modified proteome. As a December 2025 preprint under review, its long-term adoption remains to be established, and at the time of writing the authors had not released a public code or model-weights repository, which currently limits independent reproduction and deployment.