Casanovo is a transformer-based model for de novo peptide sequencing — the task of reading an amino acid sequence directly from a tandem mass spectrum (MS/MS) without consulting a protein sequence database. Developed by William Stafford Noble's lab at the University of Washington, it reframes de novo sequencing as a sequence-to-sequence translation problem, applying the same attention-based architecture that powers modern natural language translation to the domain of proteomics.
Traditional database-dependent approaches to peptide identification can only identify peptides whose sequences appear in a reference database, making them blind to novel sequences, post-translational modifications outside the database, or peptides from organisms with poorly annotated genomes. Casanovo sidesteps this limitation entirely by treating each mass spectrum as an input sequence and the corresponding peptide as the output sequence to be decoded. This framing unlocks analysis of immunopeptidomics, metaproteomics, venomics, and other applications where the peptide space cannot be fully enumerated in advance.
The model was first presented at ICML 2022 and later substantially extended in a Nature Communications paper published in July 2024, which demonstrated improved benchmark performance, fine-tuning for non-enzymatic digestion, and applications to the "dark proteome" — the large fraction of detected spectra that database search methods fail to identify.
Casanovo uses an encoder-decoder transformer with approximately 47 million parameters: 9 encoder layers and 9 decoder layers, an embedding dimension of 512, and 8 attention heads. The vocabulary has 28 tokens: the 20 canonical amino acids, tokens for common post-translational modifications (PTMs), and a stop token. Spectra are encoded without discretization: each m/z value is embedded with sinusoidal functions whose frequencies are tuned to the relevant m/z range, and each intensity value is projected through a learned linear layer. This design lets the model process raw peak lists directly and handle variable-length spectra without binning.
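The sinusoidal m/z embedding can be illustrated with a minimal sketch. It assumes geometrically spaced wavelengths between 0.001 and 10,000 m/z and a sine-then-cosine feature layout; these values and the function names are illustrative assumptions, not taken from the Casanovo source code. The learned intensity projection is omitted because it depends on trained weights.

```python
import math

def mz_wavelengths(d_model=512, min_wavelength=0.001, max_wavelength=10_000.0):
    """Geometrically spaced wavelengths, one per sine/cosine pair (assumed range)."""
    n = d_model // 2
    ratio = max_wavelength / min_wavelength
    return [min_wavelength * ratio ** (i / (n - 1)) for i in range(n)]

def encode_mz(mz, wavelengths):
    """Embed one m/z value as [sine features..., cosine features...]."""
    angles = [2 * math.pi * mz / w for w in wavelengths]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

def encode_spectrum(peaks, d_model=512):
    """Embed every (m/z, intensity) peak; intensity is ignored in this sketch."""
    wavelengths = mz_wavelengths(d_model)
    return [encode_mz(mz, wavelengths) for mz, _intensity in peaks]

# A hypothetical three-peak spectrum: the output has one 512-dimensional
# vector per peak, so spectra of any length are handled without binning.
peaks = [(147.113, 0.42), (276.155, 1.00), (389.239, 0.17)]
embedded = encode_spectrum(peaks)
print(len(embedded), len(embedded[0]))  # 3 512
```

Short wavelengths separate peaks that differ by a small fraction of a dalton, while long wavelengths encode a peak's coarse position in the spectrum, which is why no discretization step is needed.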
The model was trained from scratch for a single epoch on 30 million high-confidence peptide-spectrum matches drawn from 227 public proteomics datasets in the MassIVE-KB spectral library, using 4 NVIDIA RTX 2080 Ti GPUs over approximately 8 days. On the revised nine-species cross-species benchmark — where training and test sets are drawn from different species to assess generalization — Casanovo achieves 0.95 average precision at the peptide level and 0.98 at the amino acid level, substantially outperforming earlier deep learning methods such as DeepNovo (0.70) and PointNovo (0.74), as well as the rule-based tool Novor (0.58).
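The difference between amino acid-level and peptide-level precision can be made concrete with a small sketch. It is a simplification that compares residues by exact position; benchmark evaluations of this kind typically match residues by mass within a tolerance, so this is not the benchmark's scoring code, and the function names and example peptides are hypothetical.

```python
def count_matching_residues(predicted, reference):
    """Count positions where the predicted residue equals the reference residue."""
    return sum(p == r for p, r in zip(predicted, reference))

def precision_scores(predictions, references):
    """Return (amino acid-level precision, peptide-level precision)."""
    aa_predicted = sum(len(p) for p in predictions)
    aa_correct = sum(
        count_matching_residues(p, r) for p, r in zip(predictions, references)
    )
    # A peptide counts as correct only if every residue matches.
    peptides_correct = sum(p == r for p, r in zip(predictions, references))
    return aa_correct / aa_predicted, peptides_correct / len(predictions)

predictions = ["PEPTADE", "ACDK", "GAVK"]
references = ["PEPTIDE", "ACDEK", "GAVK"]
aa_precision, peptide_precision = precision_scores(predictions, references)
print(f"{aa_precision:.3f} {peptide_precision:.3f}")  # 0.867 0.333
```

A single wrong residue invalidates the whole peptide, so peptide-level precision is always the stricter of the two measures.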
Casanovo is particularly valuable in settings where database search is fundamentally limited. In immunopeptidomics, MHC-bound peptides arise from proteolytic cleavage patterns that diverge from standard tryptic digestion; of the peptides uniquely predicted by Casanovo's fine-tuned non-enzymatic variant, 87% are likely MHC binders. In metaproteomics applied to ocean samples, Casanovo detected 44–47% more peptides than database search against environmental reference databases. The model also contributes to dark proteome analysis: it assigned sequences to nearly 197,000 spectra from a set of 3.4 million that database methods left unidentified, and 83% of the predicted variants correspond to plausible single-nucleotide substitutions. These capabilities extend Casanovo's reach to palaeoproteomics, venomics, and any proteomics context where peptide sequences cannot be fully enumerated in a reference.
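To show what a "plausible single-nucleotide substitution" means, the following sketch tests whether one amino acid can be turned into another by changing a single base in some codon of the standard genetic code. It is an illustration only, not the analysis pipeline used in the paper, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with "*" marking stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codons_for(amino_acid):
    """All codons that translate to the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]

def is_single_nucleotide_substitution(ref_aa, alt_aa):
    """True if some codon of ref_aa differs from some codon of alt_aa by one base."""
    if ref_aa == alt_aa:
        return False
    return any(
        sum(x != y for x, y in zip(ref_codon, alt_codon)) == 1
        for ref_codon in codons_for(ref_aa)
        for alt_codon in codons_for(alt_aa)
    )

print(is_single_nucleotide_substitution("A", "V"))  # True  (GCT -> GTT)
print(is_single_nucleotide_substitution("W", "F"))  # False (TGG needs two changes)
```

A predicted variant that passes this check is consistent with a point mutation or polymorphism, whereas one that fails it is more likely to be a sequencing error.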
Casanovo has established a clear benchmark for transformer-based de novo peptide sequencing and demonstrated that framing mass spectrometry analysis as sequence-to-sequence translation is both principled and practically effective. Its publication trajectory, from an ICML conference paper to a Nature Communications article with extended applications, reflects growing community interest in database-free proteomics. The release of pre-trained weights and a user-friendly command-line tool has lowered the barrier to adoption considerably. Two limitations remain: performance degrades on peptides with uncommon modifications that are poorly represented in MassIVE-KB, and inference on very large datasets is still computationally intensive compared with fast database search tools. Nonetheless, Casanovo has materially advanced the state of the art for open proteomics and directly enables scientific questions that were previously inaccessible with database-dependent methods.
Yilmaz, M., et al. (2024). Sequence-to-sequence translation from mass spectra to peptides with a transformer model. Nature Communications.
DOI: 10.1038/s41467-024-49731-x