OmniNovo

Fudan University / Shanghai AI Laboratory / Tsinghua University / Westlake University / Tongji University / Shanghai Innovation Institute / University of British Columbia / Zhejiang University / Stony Brook University

De novo peptide sequencing transformer that reads modified and unmodified peptides directly from tandem mass spectra without a reference database.

Released: December 2025

Parameters: 35 Million

OmniNovo is a pretrained, unified deep learning model for de novo peptide sequencing, the task of reading amino acid sequences directly from tandem mass (MS/MS) spectra without matching against a reference protein database. Its central contribution is handling post-translational modifications (PTMs) within a single model. Most prior de novo sequencers are restricted to unmodified peptides or to a single modification type (for example, a dedicated phosphorylation model), forcing practitioners to swap tools when the modification of interest changes. OmniNovo is instead trained to capture fragmentation patterns shared by modified and unmodified peptides, so that one model can decode diverse PTMs.

The model was introduced in December 2025 in the preprint "Accurate de novo sequencing of the modified proteome with OmniNovo" (arXiv:2512.12272). It was developed by a collaboration led by groups at Fudan University and Shanghai Artificial Intelligence Laboratory, with contributors from Tsinghua University, Westlake University, Tongji University, the Shanghai Innovation Institute, the University of British Columbia, Zhejiang University, and Stony Brook University. Corresponding authors include Siqi Sun (Fudan University), along with other senior proteomics and AI researchers from the consortium.

Within the de novo sequencing landscape — which includes tools such as Casanovo, InstaNovo, and π-PrimeNovo — OmniNovo positions itself as a single model that spans the unmodified and modified proteome, paired with rigorous false discovery rate (FDR) control so that its predictions can be trusted in real proteomics workflows.

Key Features

Universal PTM coverage: A single model decodes peptides carrying eleven common modification types — including phosphorylation, acetylation, methylation, dimethylation, ubiquitination, oxidation, carbamidomethylation, deamidation, and carbamylation — rather than requiring a separate model per modification.
Reference-free sequencing: OmniNovo reads sequences directly from spectra without a protein database, making it suitable for samples where the relevant sequences are absent from or poorly represented in references.
Zero-shot generalization: The model extends to unseen modification contexts, improving phosphorylation recall in a zero-shot setting across seven external datasets without modification-specific retraining.
Mass-constrained decoding with FDR control: A mass-constrained decoding algorithm is combined with FDR estimation; under entrapment testing at a 1% nominal FDR, the estimated false discovery proportion stayed well below the nominal level (about 0.019% at 1× and 0.03% at 100× entrapment).
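
To make the entrapment figures above concrete, the sketch below computes a false discovery proportion (FDP) from counts of accepted identifications. It uses the widely cited "combined" entrapment estimator, FDP = N_E (1 + 1/r) / (N_T + N_E), where r is the entrapment-to-target size ratio. The source does not say which estimator the authors used, so this is an assumption, and the function name and all counts are hypothetical.

```python
def entrapment_fdp(n_target, n_entrapment, r=1.0):
    """Combined entrapment estimate of the false discovery proportion.

    n_target:     accepted identifications that match target sequences
    n_entrapment: accepted identifications that match entrapment sequences
    r:            entrapment-to-target size ratio (1 for 1x, 100 for 100x)
    """
    accepted = n_target + n_entrapment
    if accepted == 0:
        return 0.0
    # Each observed entrapment hit implies roughly 1/r undetected false hits
    # among the targets, hence the (1 + 1/r) correction factor.
    return n_entrapment * (1.0 + 1.0 / r) / accepted


# Hypothetical counts, chosen only to illustrate the arithmetic.
for label, n_t, n_e, ratio in [("1x", 99_990, 10, 1.0), ("100x", 99_970, 30, 100.0)]:
    fdp = entrapment_fdp(n_t, n_e, ratio)
    print(f"{label} entrapment: estimated FDP = {fdp:.4%}")
```

An estimated FDP that stays below the nominal threshold (1% here) is the evidence that the reported FDR is not understated.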

Technical Details

OmniNovo is a non-autoregressive transformer that translates a tandem mass spectrum into an amino acid sequence in a single forward pass. The architecture comprises a 12-layer transformer encoder/decoder with a hidden dimension of 256, 16 attention heads, a feed-forward dimension of 768, and a vocabulary of 31 tokens (20 amino acids plus 11 PTM tokens), totaling roughly 35 million parameters.

Training drew on approximately 51.8 million peptide-spectrum matches covering about 4.7 million unique precursors spanning eleven PTM types, assembled from public repositories including MassIVE-KB, PRIDE, and iProX.

On the standard nine-species benchmark OmniNovo reported 69.0% peptide recall (77.2% on a revised version of the benchmark), outperforming InstaNovo-P, Casanovo V2, and π-PrimeNovo, and it improved average performance on PTMBench by 27–29% over the modification-aware π-PrimeNovo baseline. On a real FGFR2 dataset it identified 51% more peptides than a standard database approach at 1% FDR.
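
The mass constraint that any de novo prediction must satisfy can be illustrated with a short check: the summed residue and modification masses of a candidate sequence have to agree with the measured precursor mass. This is a minimal sketch of that check applied after prediction, not the authors' decoding algorithm. The token representation, the function names, the four-modification subset, and the 20 ppm tolerance are all assumptions made for illustration.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
# Monoisotopic mass shifts (Da) for an illustrative subset of modifications.
MOD_DELTA = {
    "Phospho": 79.96633, "Oxidation": 15.99491,
    "Carbamidomethyl": 57.02146, "Acetyl": 42.01057,
}
WATER = 18.010565   # H2O added for the peptide termini
PROTON = 1.007276   # mass of one charge carrier


def peptide_mass(tokens):
    """Neutral monoisotopic mass of a list of (residue, modification-or-None)."""
    total = WATER
    for residue, mod in tokens:
        total += RESIDUE_MASS[residue]
        if mod is not None:
            total += MOD_DELTA[mod]
    return total


def within_tolerance(tokens, precursor_mz, charge, ppm=20.0):
    """True if the candidate's mass matches the precursor within `ppm`."""
    observed = (precursor_mz - PROTON) * charge  # neutral precursor mass
    theoretical = peptide_mass(tokens)
    return abs(theoretical - observed) / theoretical * 1e6 <= ppm


unmodified = [(aa, None) for aa in "PEPTIDE"]
phospho = [(aa, "Phospho" if aa == "T" else None) for aa in "PEPTIDE"]
mz = (peptide_mass(unmodified) + 2 * PROTON) / 2  # simulated 2+ precursor

print(within_tolerance(unmodified, mz, charge=2))  # candidate mass agrees
print(within_tolerance(phospho, mz, charge=2))     # off by the phospho shift
```

Because every modification changes the total mass by a known amount, the precursor mass rules out most wrong combinations of residues and PTMs, which is why it is a useful constraint during or after decoding.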

Applications

OmniNovo targets bottom-up proteomics workflows where post-translational modifications matter, such as profiling phosphorylation in signaling studies, characterizing antibody and protein therapeutics, and analyzing proteoforms that are difficult to capture with database search. Because it is reference-free, it is also useful for immunopeptidomics, metaproteomics, and other settings where the relevant sequences may be missing from reference databases. Its built-in FDR control makes the predictions usable for researchers who need confidence estimates rather than raw guesses.

Impact

De novo sequencing of modified peptides has historically been fragmented across modification-specific tools, limiting its routine use. By unifying unmodified and modified sequencing in one pretrained model with calibrated FDR control, OmniNovo offers a path toward general-purpose interpretation of the modified proteome. As a December 2025 preprint under review, its long-term adoption remains to be established, and at the time of writing the authors had not released a public code or model-weights repository, which currently limits independent reproduction and deployment.

Citation

Chen, Y., et al. (2025). Accurate de novo sequencing of the modified proteome with OmniNovo. arXiv preprint arXiv:2512.12272.

DOI: 10.48550/arXiv.2512.12272

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 14 (Closed)

Usability (can I run it?): 9

Reproducibility (can I retrain it?): 14

Model Openness Framework: Unclassified (missing required components)
