Casanovo

Transformer model for de novo peptide sequencing that reads amino acid sequences directly from tandem mass spectra, with no protein sequence database.

Released: July 2022

Casanovo is a transformer-based model for de novo peptide sequencing — the task of reading an amino acid sequence directly from a tandem mass spectrum (MS/MS) without consulting a protein sequence database. Developed by William Stafford Noble's lab at the University of Washington, it reframes de novo sequencing as a sequence-to-sequence translation problem, applying the same attention-based architecture that powers modern natural language translation to the domain of proteomics.

Traditional database-dependent approaches to peptide identification can only identify peptides whose sequences appear in a reference database, making them blind to novel sequences, post-translational modifications outside the database, or peptides from organisms with poorly annotated genomes. Casanovo sidesteps this limitation entirely by treating each mass spectrum as an input sequence and the corresponding peptide as the output sequence to be decoded. This framing unlocks analysis of immunopeptidomics, metaproteomics, venomics, and other applications where the peptide space cannot be fully enumerated in advance.

The model was first presented at ICML 2022 and later substantially extended in a Nature Communications paper published in July 2024, which demonstrated improved benchmark performance, fine-tuning for non-enzymatic digestion, and applications to the "dark proteome" — the large fraction of detected spectra that database search methods fail to identify.

Key Features

Database-free identification: Decodes peptide sequences directly from MS/MS spectra without requiring a protein sequence database, enabling identification of novel peptides, unexpected modifications, and sequences from poorly annotated organisms.
Transformer encoder-decoder architecture: Models de novo sequencing as machine translation, using sinusoidal m/z embeddings that span wavelengths from 0.001 to 10,000 m/z, preserving both isotope-level detail and long-range relationships without discretizing the m/z axis.
Beam search decoding with precursor mass filtering: Replaces the combinatorial dynamic programming of earlier methods with efficient beam search, then filters predictions using precursor mass tolerance to eliminate implausible sequences — yielding 97% of predictions within 30 ppm tolerance.
Fine-tuning for specialized applications: The base model can be adapted with modest additional training data; a non-enzymatic fine-tuned variant (Casanovo_ne) achieves 0.83 average precision on immunopeptidomics data versus 0.60 for the base model.
Scalable open-source deployment: Pre-trained weights and a command-line interface are distributed under the Apache 2.0 license, and model weights are versioned through GitHub releases.
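
The precursor mass filtering step described above can be illustrated with a minimal sketch. It is not Casanovo's own code: the function names are hypothetical, modifications and isotope errors are ignored, and the 30 ppm default simply mirrors the tolerance quoted in the feature list. Residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum
PROTON = 1.007276   # mass of one proton (charge carrier)


def theoretical_mz(peptide, charge):
    """Theoretical precursor m/z of an unmodified peptide at a given charge."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


def ppm_error(observed_mz, peptide, charge):
    """Signed difference between observed and theoretical m/z, in ppm."""
    theo = theoretical_mz(peptide, charge)
    return (observed_mz - theo) / theo * 1e6


def filter_by_precursor(candidates, observed_mz, charge, tol_ppm=30.0):
    """Keep candidate sequences whose theoretical m/z is within tol_ppm."""
    return [
        pep for pep in candidates
        if abs(ppm_error(observed_mz, pep, charge)) <= tol_ppm
    ]


if __name__ == "__main__":
    observed = theoretical_mz("PEPTIDE", 2)
    beam = ["PEPTIDE", "PEPTLDE", "PEPTIDK"]  # hypothetical beam search output
    print(filter_by_precursor(beam, observed, charge=2))
```

In this toy example `PEPTIDK` is rejected because its mass differs by about 13 Da, while `PEPTLDE` survives: leucine and isoleucine are isobaric, so a precursor mass check alone cannot separate them.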

Technical Details

Casanovo uses an encoder-decoder transformer with approximately 47 million parameters: 9 encoder layers and 9 decoder layers, an embedding dimension of 512, and 8 attention heads. The vocabulary covers 28 tokens, including the 20 canonical amino acids, common modification variants, PTM annotations, and a stop token. Spectra are encoded without discretization — m/z values are embedded with sinusoidal functions tuned to the relevant frequency range, and intensity values are projected through a learned linear layer. This design allows the model to process raw peak lists directly and handle variable-length spectra without binning.
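
As a rough illustration of the m/z encoding described above, the sketch below maps a single m/z value to a 512-dimensional vector of sines and cosines whose wavelengths are spaced geometrically between 0.001 and 10,000. The function name, the geometric spacing, and the sine-then-cosine layout are assumptions made for illustration; the released implementation may differ in these details, and the learned intensity projection is not shown.

```python
import math


def mz_embedding(mz, d_model=512, min_wavelength=0.001, max_wavelength=10000.0):
    """Encode one m/z value as d_model sinusoidal features (illustrative sketch).

    Half of the features are sines and half are cosines. Wavelengths grow
    geometrically from min_wavelength to max_wavelength, so short wavelengths
    resolve fine mass differences and long ones capture coarse position.
    Requires an even d_model of at least 4.
    """
    n_freq = d_model // 2
    ratio = max_wavelength / min_wavelength
    sines, cosines = [], []
    for i in range(n_freq):
        # Geometric interpolation between the two wavelength limits.
        wavelength = min_wavelength * ratio ** (i / (n_freq - 1))
        angle = 2.0 * math.pi * mz / wavelength
        sines.append(math.sin(angle))
        cosines.append(math.cos(angle))
    return sines + cosines


if __name__ == "__main__":
    vec = mz_embedding(400.6872)
    print(len(vec))
```

Because the m/z value enters the functions as a real number, no binning step is needed and peak lists of any length can be embedded one peak at a time.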

The model was trained from scratch for a single epoch on 30 million high-confidence peptide-spectrum matches drawn from 227 public proteomics datasets in the MassIVE-KB spectral library, using 4 NVIDIA RTX 2080 Ti GPUs over approximately 8 days. On the revised nine-species cross-species benchmark — where training and test sets are drawn from different species to assess generalization — Casanovo achieves 0.95 average precision at the peptide level and 0.98 at the amino acid level, substantially outperforming earlier deep learning methods such as DeepNovo (0.70) and PointNovo (0.74), as well as the rule-based tool Novor (0.58).
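
For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision from a precision-coverage curve: predictions are ranked by confidence, precision is measured over the top-k predictions for every k, and the values are averaged. This is an assumed, simplified definition with a hypothetical function name; the benchmark's exact evaluation protocol (for example, its peptide and amino acid matching rules) is not reproduced here.

```python
def average_precision(scores, is_correct):
    """Mean precision over all coverage levels (illustrative definition).

    scores     -- confidence score for each prediction
    is_correct -- whether each prediction matches the ground truth
    """
    if len(scores) != len(is_correct):
        raise ValueError("scores and is_correct must have the same length")
    if not scores:
        raise ValueError("at least one prediction is required")

    # Rank predictions from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    n_correct = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        n_correct += bool(is_correct[idx])
        # Precision when only the top `rank` predictions are accepted.
        precisions.append(n_correct / rank)

    # Coverage increases in equal steps, so the area under the curve
    # reduces to a simple mean of the precision values.
    return sum(precisions) / len(precisions)


if __name__ == "__main__":
    print(average_precision([0.6, 0.9, 0.7, 0.8], [True, True, False, True]))
```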

Applications

Casanovo is particularly valuable in settings where database search is fundamentally limited. In immunopeptidomics, MHC-bound peptides arise from proteolytic cleavage patterns that diverge from standard tryptic digestion, and Casanovo's fine-tuned non-enzymatic variant identifies 87% of its unique predictions as likely MHC binders. In metaproteomics applied to ocean samples, Casanovo detected 44–47% more peptides than database search against environmental reference databases. The model also contributes to dark proteome analysis, assigning sequences to nearly 197,000 spectra from a set of 3.4 million that database methods left unidentified, with 83% of predicted variants corresponding to plausible single-nucleotide substitutions. These capabilities extend Casanovo's reach to palaeoproteomics, venomics, and any proteomics context where peptide sequences cannot be fully enumerated in a reference.
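
The notion of a "plausible single-nucleotide substitution" can be made concrete with a short check against the standard genetic code. The sketch below is an assumption about how such plausibility might be tested, not the procedure used in the paper, and the function names are hypothetical: an amino acid change is called plausible if some codon for the reference residue differs from some codon for the variant residue at exactly one base.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that encode the given amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def is_single_nucleotide_substitution(ref_aa, alt_aa):
    """True if one base change can turn a ref_aa codon into an alt_aa codon."""
    if ref_aa == alt_aa:
        return False
    return any(
        sum(a != b for a, b in zip(ref_codon, alt_codon)) == 1
        for ref_codon in codons_for(ref_aa)
        for alt_codon in codons_for(alt_aa)
    )


if __name__ == "__main__":
    for ref, alt in [("A", "V"), ("K", "R"), ("W", "D")]:
        print(ref, alt, is_single_nucleotide_substitution(ref, alt))
```

Here alanine to valine and lysine to arginine are reachable with one base change, whereas tryptophan to aspartate would require changing all three bases of the codon.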

Impact

Casanovo has established a clear benchmark for transformer-based de novo peptide sequencing and demonstrated that framing mass spectrometry analysis as sequence-to-sequence translation is both principled and practically effective. Its publication trajectory, from an ICML conference paper to a Nature Communications article with extended applications, reflects growing community interest in database-free proteomics. A 2025 follow-up (Sanders, Yilmaz, Noble et al.) further showed that the spectrum encoder pre-trained for de novo sequencing functions as a multi-task foundation model for proteomics: its learned spectrum representations transfer to downstream tasks such as spectrum quality prediction, chimericity detection, and prediction of phosphorylation and glycosylation status, with the largest gains where labeled training data is scarce. The release of pre-trained weights and a command-line tool has lowered the barrier to adoption. Two limitations remain: performance degrades on peptides with uncommon modifications that are not well represented in MassIVE-KB, and inference on very large datasets is still computationally intensive compared with fast database search tools. Nonetheless, Casanovo has advanced the state of the art for open proteomics and enables scientific questions that were previously inaccessible with database-dependent methods.

Citations

Sequence-to-sequence translation from mass spectra to peptides with a transformer model

Yilmaz, M., et al. (2024) Sequence-to-sequence translation from mass spectra to peptides with a transformer model. Nature Communications.

DOI: 10.1038/s41467-024-49731-x

Foundation model for mass spectrometry proteomics

Preprint

Sanders, J. A., et al. (2025) Foundation model for mass spectrometry proteomics. arXiv.org.

DOI: 10.48550/arXiv.2505.10848

Recent citations

Papers that recently cited this model.

A large-scale unified deep learning model for peptide mass spectrum interpretation trained on multimodal data
Jiale Zhao, Pengzhi Mao, Kaifei Wang, et al.
Nature Machine Intelligence · May 2026 · 0 citations

Robotic perturbation proteomics and AI agents enable scalable drug mechanism discovery
Yuming Jiang, Cameron S. Movassaghi, Jesús Muñoz-Estrada, et al.
bioRxiv · May 2026 · 0 citations

PepSpecBench: A Unified Evaluation Benchmark for Peptide Tandem Mass Spectrometry Prediction
Zhiwen Yang, Panlong Liu, Yifan Li, et al.
May 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

MSBooster: improving peptide identification rates using deep learning-based features
Kevin L. Yang, Fengchao Yu, G. Teo, et al.
bioRxiv · Oct 2022 · 176 citations

Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS
Roman Bushuiev, Anton Bushuiev, Raman Samusevich, et al.
Nature Biotechnology · May 2025 · 53 citations

Predicting glycan structure from tandem mass spectrometry via deep learning
James Urban, C. Jin, K. Thomsson, et al.
bioRxiv · Jun 2023 · 51 citations

The microbiologist's guide to metaproteomics
T. Van Den Bossche, J. Armengaud, D. Benndorf, et al.
iMeta · May 2025 · 38 citations

Recent Advances in Mass Spectrometry-Based Bottom-Up Proteomics
Cameron S. Movassaghi, Jie Sun, Yuming Jiang, et al.
Analytical Chemistry · Feb 2025 · 33 citations

Citations

Total Citations: 4
Influential: 0
References: 48

GitHub

Stars: 194
Forks: 77
Open Issues: 27
Contributors: 21
Last Push: 16d ago
Language: Python
License: Apache-2.0

Fields of citing research

Computer Science: 79%
Biology: 79%
Medicine: 60%
Chemistry: 42%
Environmental Science: 10%
Materials Science: 3%
Engineering: 2%
Agricultural and Food Sciences: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall score: 91 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 85
Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository, Research Paper, Research Paper, Documentation, Dataset
