bio.rodeo


PSALM

Harvard University / Howard Hughes Medical Institute

Protein language model for per-residue domain annotation, pairing an ESM-2 backbone with a probabilistic decoder to rival HMMER on sequence domain assignment.

Released: October 2024

Annotating the domain architecture of a protein — identifying which conserved functional modules a sequence contains and where they fall along the chain — is a foundational step in understanding protein function. For decades this task has been dominated by profile hidden Markov models (HMMs), most prominently the HMMER suite paired with the Pfam database. PSALM (Protein Sequence Annotation with Language Models) reframes domain annotation as a deep learning problem, asking whether a modern protein language model can match the sensitivity and specificity of these well-established statistical tools.

PSALM was developed at Harvard University by the Eddy lab, the group behind HMMER; Sean Eddy is an HHMI Investigator. The first preprint appeared in 2024. Rather than discarding the probabilistic rigor of HMM-based methods, PSALM combines the representational power of a pretrained protein language model with a structured probabilistic decoder, blending learned sequence features with explicit modeling of domain structure along the sequence.

This hybrid design situates PSALM at the intersection of two traditions: the language-model revolution in protein sequence analysis (ESM, ProtTrans) and the HMM-based annotation infrastructure that underpins databases like Pfam. The authors release code, model weights, and datasets, supporting reproducibility and adoption.

#Key Features

  • ESM-2 backbone: PSALM builds on the pretrained ESM-2 protein language model, reusing its learned per-residue representations rather than training a sequence encoder from scratch.
  • Per-residue domain-state classifier: A classifier assigns each residue to a domain state, producing fine-grained, position-level annotations along the sequence.
  • Structured probabilistic decoder: A decoding layer enforces coherent domain architecture across the sequence, echoing the structured inference of HMM-based methods.
  • HMMER-comparable accuracy: On large-scale benchmarks PSALM achieves sensitivity and specificity comparable to HMMER, with advantages at relaxed detection thresholds.
  • Fully open release: Code, trained model weights, and benchmark datasets are publicly available.
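To make the interplay between a per-residue classifier and a structured decoder concrete, here is a minimal sketch in plain Python. It is not PSALM's decoder: it is a generic Viterbi pass with a flat `switch_penalty` (an assumed parameter), run over made-up per-residue probabilities for a hypothetical family label `famA`. It only illustrates how structured decoding can turn fragmented per-residue calls into one coherent domain segment.

```python
import math

def viterbi_decode(log_probs, states, switch_penalty=1.0):
    """Best state path for a non-empty list of per-residue log-probabilities.

    log_probs: one dict per residue, mapping state -> log-probability.
    switch_penalty: cost paid whenever consecutive residues change state
    (an illustrative stand-in for a learned transition model).
    """
    best = {s: log_probs[0][s] for s in states}
    backpointers = []
    for row in log_probs[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Staying in the same state is free; switching pays the penalty.
            def score(p, s=s):
                return best[p] - (0.0 if p == s else switch_penalty)
            prev = max(states, key=score)
            new_best[s] = score(prev) + row[s]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def to_segments(path, background="none"):
    """Collapse a state path into (label, start, end) with 1-based inclusive coordinates."""
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            if path[start] != background:
                segments.append((path[start], start + 1, i))
            start = i
    return segments

# Toy per-residue probabilities of the hypothetical family "famA".
states = ["none", "famA"]
p_fam = [0.1, 0.9, 0.9, 0.4, 0.9, 0.9, 0.1]
log_probs = [{"famA": math.log(p), "none": math.log(1.0 - p)} for p in p_fam]

# Independent per-residue argmax splits the domain at the weak residue 4.
greedy = [max(states, key=lambda s: row[s]) for row in log_probs]
print(to_segments(greedy))    # [('famA', 2, 3), ('famA', 5, 6)]

# Structured decoding bridges the gap and reports one segment.
path = viterbi_decode(log_probs, states)
segments = to_segments(path)
print(segments)               # [('famA', 2, 6)]
```

The penalty value controls the trade-off: a larger penalty merges more aggressively, while a penalty of zero reduces to the per-residue argmax.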

#Technical Details

PSALM couples a pretrained ESM-2 transformer backbone with a per-residue domain-state classifier and a structured probabilistic decoder that imposes consistency on the predicted domain architecture. The system was benchmarked on an evaluation set of 89 million protein sequences containing 107 million domain instances, a scale that tests both sensitivity and computational practicality. On this benchmark PSALM delivers sensitivity–specificity performance comparable to HMMER, the long-standing gold standard for profile-HMM domain annotation, and shows particular strength at relaxed detection thresholds, the regime in which hard-to-detect remote homologs are typically found. By inheriting ESM-2's learned representations, PSALM can recognize domain signatures that may be difficult to capture with sequence-profile statistics alone.
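The sensitivity and specificity trade-off at different thresholds can be illustrated with a small calculation. The sketch below uses invented scores and truth labels, and it assumes calls are scored individually and reported when the score meets a threshold; the actual benchmark protocol (for example, whether hits are counted per residue or per domain) is not described here.

```python
def sensitivity_specificity(scored_calls, threshold):
    """Compute (sensitivity, specificity) for calls reported at score >= threshold.

    scored_calls: list of (score, is_true_domain) pairs (toy data in this sketch).
    """
    tp = sum(1 for score, truth in scored_calls if score >= threshold and truth)
    fn = sum(1 for score, truth in scored_calls if score < threshold and truth)
    fp = sum(1 for score, truth in scored_calls if score >= threshold and not truth)
    tn = sum(1 for score, truth in scored_calls if score < threshold and not truth)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Invented example: four true domains and four decoys with assorted scores.
toy_calls = [
    (0.95, True), (0.80, True), (0.60, True), (0.35, True),
    (0.55, False), (0.40, False), (0.10, False), (0.05, False),
]

for threshold in (0.9, 0.5, 0.3):
    sens, spec = sensitivity_specificity(toy_calls, threshold)
    print(f"threshold={threshold}: sensitivity={sens}, specificity={spec}")
# threshold=0.9: sensitivity=0.25, specificity=1.0
# threshold=0.5: sensitivity=0.75, specificity=0.75
# threshold=0.3: sensitivity=1.0, specificity=0.5
```

Relaxing the threshold recovers the weakly scoring true domain at the cost of admitting decoys, which is why comparisons at relaxed thresholds are informative for remote-homolog detection.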

#Applications

PSALM is aimed at researchers performing functional annotation of proteins and proteomes: assigning domain architectures to newly sequenced proteins, scanning large sequence databases, and detecting remote homologs that fall below the detection threshold of profile-HMM methods. Because it is released with weights and code, it can be integrated into annotation pipelines as a complement or alternative to HMMER and Pfam, and it provides a template for applying protein language models to structured sequence-labeling tasks.
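One way to use a language-model annotator as a complement to an existing profile-HMM pipeline is to separate its calls into those already supported by the HMM results and those that are new candidates. The sketch below assumes both tools' outputs have already been parsed into `(family, start, end)` tuples with 1-based inclusive coordinates; the tuple layout, the `min_overlap` cutoff, and the family names are all hypothetical and do not reflect the output format of PSALM or HMMER.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval a (1-based, inclusive) that is covered by interval b."""
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    return max(shared, 0) / (a_end - a_start + 1)

def split_by_support(lm_calls, hmm_calls, min_overlap=0.5):
    """Partition language-model calls into HMM-supported and novel candidates.

    A call counts as supported when an HMM call of the same family covers
    at least min_overlap of its length (an assumed, adjustable criterion).
    """
    supported, novel = [], []
    for family, start, end in lm_calls:
        hit = any(
            family == h_family
            and overlap_fraction(start, end, h_start, h_end) >= min_overlap
            for h_family, h_start, h_end in hmm_calls
        )
        (supported if hit else novel).append((family, start, end))
    return supported, novel

# Hypothetical parsed calls for a single protein.
lm_calls = [("famA", 10, 250), ("famB", 300, 380)]
hmm_calls = [("famA", 15, 260)]

supported, novel = split_by_support(lm_calls, hmm_calls)
print(supported)  # [('famA', 10, 250)]
print(novel)      # [('famB', 300, 380)]
```

Calls in the `novel` list are candidates for manual review rather than confirmed annotations, since they lack corroboration from the profile-HMM search.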

#Impact

Coming from the lab that created HMMER, PSALM is a notable signal that domain annotation, one of the most mature areas of computational protein analysis, can benefit from protein language models without abandoning probabilistic rigor. By demonstrating HMMER-comparable accuracy at the scale of tens of millions of sequences, it shows that learned representations can complement, rather than merely replace, established statistical infrastructure. Its open release lowers the barrier to adoption and to further research on hybrid language-model/probabilistic annotation methods. Because PSALM is a recent preprint, its long-term influence on Pfam-scale annotation workflows remains to be seen.

Tags

domain_annotation, sequence_annotation, transformer, language_model, transfer_learning, proteomics