bio.rodeo

Protein

Casanovo

Noble Lab

A transformer model for de novo peptide sequencing from tandem mass spectra, trained on 30 million labeled spectra from public proteomics datasets.

Released: 2022

Overview

Casanovo is a transformer-based model for de novo peptide sequencing — the task of reading an amino acid sequence directly from a tandem mass spectrum (MS/MS) without consulting a protein sequence database. Developed by William Stafford Noble's lab at the University of Washington, it reframes de novo sequencing as a sequence-to-sequence translation problem, applying the same attention-based architecture that powers modern natural language translation to the domain of proteomics.

Traditional database search can only identify peptides whose sequences appear in a reference database, which makes it blind to novel sequences, to modifications the search was not configured to consider, and to peptides from organisms with poorly annotated genomes. Casanovo avoids this limitation by treating each mass spectrum as an input sequence and the corresponding peptide as the output sequence to be decoded. This framing supports immunopeptidomics, metaproteomics, venomics, and other applications where the peptide space cannot be fully enumerated in advance.

The model was first presented at ICML 2022 and later substantially extended in a Nature Communications paper published in July 2024, which demonstrated improved benchmark performance, fine-tuning for non-enzymatic digestion, and applications to the "dark proteome" — the large fraction of detected spectra that database search methods fail to identify.

Key Features

  • Database-free identification: Decodes peptide sequences directly from MS/MS spectra without requiring a protein sequence database, enabling identification of novel peptides, unexpected modifications, and sequences from poorly annotated organisms.
  • Transformer encoder-decoder architecture: Models de novo sequencing as machine translation, using sinusoidal m/z embeddings that span wavelengths from 0.001 to 10,000 m/z, preserving both isotope-level detail and long-range relationships without discretizing the m/z axis.
  • Beam search decoding with precursor mass filtering: Replaces the combinatorial dynamic programming of earlier methods with efficient beam search, then filters predictions using precursor mass tolerance to eliminate implausible sequences — yielding 97% of predictions within 30 ppm tolerance.
  • Fine-tuning for specialized applications: The base model can be adapted with modest additional training data; a non-enzymatic fine-tuned variant (Casanovo_ne) achieves 0.83 average precision on immunopeptidomics data versus 0.60 for the base model.
  • Scalable open-source deployment: Pre-trained weights and a command-line interface are distributed under the Apache 2.0 license, and model weights are versioned through GitHub releases.
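The precursor mass filter mentioned in the list above can be illustrated with a short sketch. It assumes a simple rule (keep a candidate only if its theoretical precursor m/z lies within a ppm tolerance of the observed value), and the function names and the small residue-mass table are illustrative rather than taken from the Casanovo code base.

```python
# Monoisotopic residue masses in Da for a handful of amino acids (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of one proton per unit of charge


def theoretical_mz(peptide, charge):
    """Theoretical precursor m/z of an unmodified peptide at a given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def ppm_error(observed_mz, theoretical):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical) / theoretical * 1e6


def filter_by_precursor(candidates, observed_mz, charge, tol_ppm=30.0):
    """Keep (peptide, score) candidates whose precursor m/z is within tol_ppm."""
    kept = []
    for peptide, score in candidates:
        error = ppm_error(observed_mz, theoretical_mz(peptide, charge))
        if abs(error) <= tol_ppm:
            kept.append((peptide, score))
    return kept


# Hypothetical beam-search output for one spectrum.
observed = theoretical_mz("GASK", 2)
beam = [("GASK", 0.9), ("GAVK", 0.8), ("AGSK", 0.5)]
survivors = filter_by_precursor(beam, observed, charge=2)
```

In this example `GAVK` is rejected because it is about 12 Da heavier, while `AGSK` survives: a mass filter removes implausible candidates but cannot distinguish sequences with the same composition, so the model's own score still decides among them.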

Technical Details

Casanovo uses an encoder-decoder transformer with approximately 47 million parameters: 9 encoder layers and 9 decoder layers, an embedding dimension of 512, and 8 attention heads. The vocabulary covers 28 tokens, including the 20 canonical amino acids, tokens for common modifications, and a stop token. Spectra are encoded without discretization: m/z values are embedded with sinusoidal functions whose wavelengths span 0.001 to 10,000 m/z, and intensity values are projected through a learned linear layer. This design lets the model process raw peak lists directly and handle variable-length spectra without binning.
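As a rough illustration of the sinusoidal m/z embedding, the sketch below maps a single m/z value to a 512-dimensional vector using wavelengths spaced geometrically between 0.001 and 10,000. The split into a sine half and a cosine half, the geometric spacing, and the name `mz_encoding` are assumptions made for illustration; the released implementation may differ in detail.

```python
import math


def mz_encoding(mz, d_model=512, min_wavelength=0.001, max_wavelength=10_000.0):
    """Encode one m/z value as d_model sinusoidal features (illustrative sketch)."""
    n_sin = d_model // 2
    n_cos = d_model - n_sin

    def wavelengths(n):
        # Geometric progression from min_wavelength to max_wavelength.
        if n == 1:
            return [min_wavelength]
        ratio = max_wavelength / min_wavelength
        return [min_wavelength * ratio ** (i / (n - 1)) for i in range(n)]

    # Short wavelengths resolve fine mass differences; long ones capture coarse position.
    sin_part = [math.sin(2 * math.pi * mz / w) for w in wavelengths(n_sin)]
    cos_part = [math.cos(2 * math.pi * mz / w) for w in wavelengths(n_cos)]
    return sin_part + cos_part


vec_a = mz_encoding(500.00)
vec_b = mz_encoding(500.01)  # a nearby peak still gets a distinct encoding
```

Because the m/z value enters the sinusoids as a real number, no binning step is needed, and two peaks that differ by a small fraction of a dalton differ mainly in the short-wavelength features.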

The model was trained from scratch for a single epoch on 30 million high-confidence peptide-spectrum matches drawn from 227 public proteomics datasets in the MassIVE-KB spectral library, using 4 NVIDIA RTX 2080 Ti GPUs over approximately 8 days. On the revised nine-species cross-species benchmark — where training and test sets are drawn from different species to assess generalization — Casanovo achieves 0.95 average precision at the peptide level and 0.98 at the amino acid level, substantially outperforming earlier deep learning methods such as DeepNovo (0.70) and PointNovo (0.74), as well as the rule-based tool Novor (0.58).
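One plausible reading of "average precision" here is the area under a precision-coverage curve, in which predictions are ranked by model confidence and precision is measured over the top-ranked fraction. The sketch below follows that reading; the benchmark's exact definition may differ, and the function names and toy data are hypothetical.

```python
def precision_coverage(scores, correct):
    """Return (coverage, precision) points after ranking by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    n_correct = 0
    for rank, idx in enumerate(order, start=1):
        n_correct += 1 if correct[idx] else 0
        curve.append((rank / len(scores), n_correct / rank))
    return curve


def average_precision(scores, correct):
    """Mean precision over all coverage levels (uniform coverage steps)."""
    curve = precision_coverage(scores, correct)
    return sum(precision for _, precision in curve) / len(curve)


# Toy example: four predictions, the third-ranked one is wrong.
scores = [0.9, 0.8, 0.7, 0.6]
correct = [True, True, False, True]
curve = precision_coverage(scores, correct)
ap = average_precision(scores, correct)
```

Under this definition a model is rewarded both for being correct and for assigning its highest confidence to the predictions that are correct.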

Applications

Casanovo is particularly valuable in settings where database search is fundamentally limited. In immunopeptidomics, MHC-bound peptides arise from proteolytic cleavage patterns that diverge from standard tryptic digestion, and 87% of the unique predictions made by Casanovo's fine-tuned non-enzymatic variant are predicted to be likely MHC binders. In metaproteomics applied to ocean samples, Casanovo detected 44–47% more peptides than database search against environmental reference databases. The model also contributes to dark proteome analysis: it assigned sequences to nearly 197,000 spectra from a set of 3.4 million that database methods left unidentified, and 83% of the predicted variants correspond to plausible single-nucleotide substitutions. These capabilities extend Casanovo's reach to palaeoproteomics, venomics, and any proteomics context where peptide sequences cannot be fully enumerated in a reference.
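The notion of a "plausible single-nucleotide substitution" can be made concrete with the standard genetic code: an amino acid change is plausible in this sense if some codon for the original residue can become a codon for the new residue by changing one base. The sketch below is a generic check written for illustration and is not the procedure used in the paper; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_nucleotide_reachable(aa_from, aa_to):
    """True if one base change can turn a codon for aa_from into one for aa_to."""
    for codon, aa in CODON_TABLE.items():
        if aa != aa_from:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa_to:
                    return True
    return False


ala_to_val = single_nucleotide_reachable("A", "V")  # GCT -> GTT
trp_to_phe = single_nucleotide_reachable("W", "F")  # needs two base changes
```

A check of this kind gives a quick sanity test for de novo predictions that differ from a reference protein by one residue: substitutions that would require two or more base changes are less likely to reflect genuine genetic variants.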

Impact

Casanovo has established a clear benchmark for transformer-based de novo peptide sequencing and demonstrated that framing mass spectrometry analysis as sequence-to-sequence translation is both principled and practically effective. Its publication trajectory, from an ICML conference paper to a Nature Communications article with extended applications, reflects growing community interest in database-free proteomics. The release of pre-trained weights and a command-line tool has lowered the barrier to adoption. Two limitations remain: performance degrades on peptides with uncommon modifications that are poorly represented in MassIVE-KB, and inference on very large datasets is still computationally intensive compared with fast database search tools. Nonetheless, Casanovo has advanced the state of the art for open proteomics and enables scientific questions that were previously inaccessible with database-dependent methods.

Citation

Sequence-to-sequence translation from mass spectra to peptides with a transformer model

Yilmaz, M., et al. (2024) Sequence-to-sequence translation from mass spectra to peptides with a transformer model. Nature Communications.

DOI: 10.1038/s41467-024-49731-x

Metrics

GitHub

Stars: 184
Forks: 70
Open Issues: 34
Contributors: 12
Last Push: 3d ago
Language: Python
License: Apache-2.0

Citations

Total Citations: 90
Influential: 9
References: 58

Tags

foundation model, mass spectrometry, proteomics

Resources

GitHub Repository, Research Paper