Harvard University / Howard Hughes Medical Institute
Protein language model for per-residue domain annotation, pairing an ESM-2 backbone with a probabilistic decoder to rival HMMER on sequence domain assignment.
Annotating the domain architecture of a protein — identifying which conserved functional modules a sequence contains and where they fall along the chain — is a foundational step in understanding protein function. For decades this task has been dominated by profile hidden Markov models (HMMs), most prominently the HMMER suite paired with the Pfam database. PSALM (Protein Sequence Annotation with Language Models) reframes domain annotation as a deep learning problem, asking whether a modern protein language model can match the sensitivity and specificity of these well-established statistical tools.
PSALM was developed by the Eddy lab at Harvard University, the group behind HMMER, led by HHMI Investigator Sean Eddy. The first preprint appeared in 2024. Rather than discarding the probabilistic rigor of HMM-based methods, PSALM combines the representational power of a pretrained protein language model with a structured probabilistic decoder, so that learned sequence features are paired with explicit modeling of domain structure along the sequence.
This hybrid design situates PSALM at the intersection of two traditions: the language-model revolution in protein sequence analysis (ESM, ProtTrans) and the HMM-based annotation infrastructure that underpins databases like Pfam. The authors release code, model weights, and datasets, supporting reproducibility and adoption.
PSALM couples a pretrained ESM-2 transformer backbone with a per-residue domain-state classifier and a structured probabilistic decoder that imposes consistency on the predicted domain architecture. The system was benchmarked on a large evaluation set of 89 million protein sequences containing 107 million domain instances, a scale that tests both sensitivity and computational practicality. On this benchmark PSALM delivers sensitivity–specificity performance comparable to HMMER, the long-standing gold standard for profile-HMM domain annotation, and shows particular strength at relaxed thresholds, the regime in which hard-to-detect remote homologs come into play. By inheriting ESM-2's learned representations, PSALM can recognize domain signatures that may be difficult to capture with sequence-profile statistics alone.
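To make the idea of a decoder that "imposes consistency" concrete, the following is a minimal sketch of one generic way to do it: Viterbi decoding over per-residue state probabilities with a simple two-parameter transition model. It is not PSALM's actual decoder, whose details are not described here. The function name `viterbi_decode`, the `stay_prob` parameter, the state labels `"none"` and `"domA"`, and the toy probabilities are all assumptions made for illustration; in a real system the per-residue probabilities would come from the classifier on top of the language-model embeddings.

```python
import math


def viterbi_decode(emissions, states, stay_prob=0.9):
    """Return the most probable state path for a sequence of residues.

    emissions: list of dicts, one per residue, mapping state -> probability
               (stand-in for a per-residue classifier's output).
    states:    list of state labels, e.g. "none" plus domain family names.
    stay_prob: hypothetical probability of staying in the same state from
               one residue to the next; the remainder is split evenly
               across all other states.
    """
    if len(states) < 2:
        raise ValueError("need at least two states")
    if not emissions:
        return []

    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (len(states) - 1))
    tiny = 1e-12  # floor so that log(0) never occurs

    def trans(prev, cur):
        return log_stay if prev == cur else log_switch

    def emit(residue, state):
        return math.log(max(residue.get(state, 0.0), tiny))

    # Uniform prior over states at the first residue.
    score = {s: -math.log(len(states)) + emit(emissions[0], s) for s in states}
    backpointers = []

    for residue in emissions[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this residue.
            best_prev = max(states, key=lambda p: score[p] + trans(p, s))
            new_score[s] = score[best_prev] + trans(best_prev, s) + emit(residue, s)
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: a domain whose fifth residue is weakly predicted.
states = ["none", "domA"]
dom_a_probs = [0.05, 0.05, 0.8, 0.9, 0.4, 0.9, 0.8, 0.05]
emissions = [{"none": 1.0 - p, "domA": p} for p in dom_a_probs]

greedy = [max(states, key=lambda s: e[s]) for e in emissions]
smoothed = viterbi_decode(emissions, states)
print(greedy)
print(smoothed)
```

In this toy case the independent per-residue argmax breaks the domain in two at the fifth residue, while the decoded path keeps residues 3 through 7 as one contiguous `domA` region. That is the kind of architectural consistency a structured decoder is meant to supply.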
PSALM is aimed at researchers performing functional annotation of proteins and proteomes: assigning domain architectures to newly sequenced proteins, scanning large sequence databases, and detecting remote homologs that may fall below the detection threshold of profile-HMM methods. Because it is released with weights and code, it can be integrated into annotation pipelines as a complement or alternative to HMMER and Pfam, and it provides a template for applying protein language models to structured sequence-labeling tasks.
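For pipeline integration, per-residue labels usually need to be collapsed into domain intervals, the form in which domain architectures are typically reported. The sketch below shows one simple way to do that. The function name `labels_to_domains`, the background label `"none"`, and the family names `domA` and `domB` are hypothetical, and the output layout is an assumption rather than PSALM's own format.

```python
from itertools import groupby


def labels_to_domains(labels, background="none"):
    """Collapse per-residue labels into (family, start, end) intervals.

    Coordinates are 1-based and inclusive. Residues carrying the
    background label are skipped.
    """
    domains = []
    position = 1
    for family, run in groupby(labels):
        length = sum(1 for _ in run)
        if family != background:
            domains.append((family, position, position + length - 1))
        position += length
    return domains


labels = ["none", "none", "domA", "domA", "domA", "none", "domB", "domB"]
print(labels_to_domains(labels))
```

One limitation of this simple run-length approach is that two adjacent copies of the same family with no background residue between them are merged into a single interval. Separating tandem repeats would require distinct begin and end states in the label scheme.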
Coming from the lab that created HMMER, PSALM is a notable signal that domain annotation, one of the most mature areas of computational protein analysis, can benefit from protein language models without abandoning probabilistic rigor. By demonstrating HMMER-comparable accuracy at the scale of tens of millions of sequences, it shows that learned representations can complement established statistical infrastructure rather than simply replace it. Its open release lowers the barrier for adoption and for further research on hybrid language-model/probabilistic annotation methods. As a recent preprint, its long-term influence on Pfam-scale annotation workflows is still emerging.