PSALM

Harvard University / Howard Hughes Medical Institute

Protein domain annotation model pairing an ESM-2 backbone with a probabilistic decoder, bringing language-model sensitivity to Pfam-style assignment.

Released: October 2024

Annotating the domain architecture of a protein — identifying which conserved functional modules a sequence contains and where they fall along the chain — is a foundational step in understanding protein function. For decades this task has been dominated by profile hidden Markov models (HMMs), most prominently the HMMER suite paired with the Pfam database. PSALM (Protein Sequence Annotation with Language Models) reframes domain annotation as a deep learning problem, asking whether a modern protein language model can match the sensitivity and specificity of these well-established statistical tools.

PSALM was developed by the Eddy lab — the group behind HMMER — at Harvard University, with Sean Eddy as an HHMI Investigator. The first preprint appeared in 2024. Rather than discarding the probabilistic rigor of HMM-based methods, PSALM combines the representational power of a pretrained protein language model with a structured probabilistic decoder, blending learned sequence features with explicit modeling of domain structure along the sequence.

This hybrid design situates PSALM at the intersection of two traditions: the language-model revolution in protein sequence analysis (ESM, ProtTrans) and the HMM-based annotation infrastructure that underpins databases like Pfam. The authors release code, model weights, and datasets, supporting reproducibility and adoption.

Key Features

ESM-2 backbone: PSALM builds on the pretrained ESM-2 protein language model, reusing its learned per-residue representations rather than training a sequence encoder from scratch.
Per-residue domain-state classifier: A classifier assigns each residue to a domain state, producing fine-grained, position-level annotations along the sequence.
Structured probabilistic decoder: A decoding layer enforces coherent domain architecture across the sequence, echoing the structured inference of HMM-based methods.
HMMER-comparable accuracy: On large-scale benchmarks PSALM achieves sensitivity and specificity comparable to HMMER, with advantages at relaxed detection thresholds.
Fully open release: Code, trained model weights, and benchmark datasets are publicly available.
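To make the per-residue output concrete, the sketch below collapses a sequence of per-residue state labels into domain segments with start and end coordinates. It illustrates the general idea only and is not taken from the PSALM codebase: the function name `labels_to_segments`, the background label `"none"`, and the toy label sequence are all assumptions made for this example.

```python
from itertools import groupby


def labels_to_segments(labels, background="none"):
    """Collapse per-residue labels into (family, start, end) segments.

    Coordinates are 1-based and inclusive. The background label marks
    residues that belong to no domain (a hypothetical convention).
    """
    segments = []
    pos = 0  # number of residues consumed so far
    for label, run in groupby(labels):
        length = len(list(run))
        if label != background:
            segments.append((label, pos + 1, pos + length))
        pos += length
    return segments


# Toy input: 14 residues carrying two illustrative Pfam-style labels.
labels = ["none"] * 3 + ["PF00069"] * 5 + ["none"] * 2 + ["PF00017"] * 4
segments = labels_to_segments(labels)
print(segments)  # [('PF00069', 4, 8), ('PF00017', 11, 14)]
```

Run-length grouping like this only works cleanly when the labels are already coherent along the sequence, which is the job of the structured decoder described above.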

Technical Details

PSALM couples a pretrained ESM-2 transformer backbone with a per-residue domain-state classifier and a structured probabilistic decoder that imposes consistency on the predicted domain architecture. The system was benchmarked on a large evaluation set of 89 million protein sequences containing 107 million domain instances, a scale that tests both sensitivity and computational practicality. On this benchmark PSALM delivers sensitivity–specificity performance comparable to HMMER, the long-standing gold standard for profile-HMM domain annotation, and shows particular strength at relaxed thresholds where remote homologs are harder to detect. By inheriting ESM-2's learned representations, PSALM can recognize domain signatures that may be difficult to capture with sequence-profile statistics alone.
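The following sketch shows why a structured decoding step can help on top of a per-residue classifier. It is not PSALM's decoder: it is a generic Viterbi pass in which a single `switch_penalty` (an assumed, hand-picked value) stands in for learned transition probabilities, and the log-scores are invented for illustration rather than produced by ESM-2.

```python
def viterbi(emissions, switch_penalty=2.0):
    """Return the highest-scoring state path.

    emissions: list of dicts mapping state -> log-score, one dict per residue.
    switch_penalty: cost subtracted whenever adjacent residues change state
    (a hypothetical stand-in for a real transition model).
    """
    states = list(emissions[0])
    score = dict(emissions[0])
    back = []  # back[t][s] = best predecessor of state s at residue t + 1

    def step(prev, cur):
        return score[prev] - (0.0 if prev == cur else switch_penalty)

    for row in emissions[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: step(p, s))
            new_score[s] = step(best_prev, s) + row[s]
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up per-residue log-scores; residue 4 weakly prefers "none".
emissions = [
    {"none": -0.1, "domain": -2.3},
    {"none": -0.2, "domain": -1.6},
    {"none": -1.6, "domain": -0.2},
    {"none": -0.6, "domain": -0.8},
    {"none": -1.6, "domain": -0.2},
    {"none": -2.3, "domain": -0.1},
]

argmax_path = [max(row, key=row.get) for row in emissions]
decoded_path = viterbi(emissions)
print(argmax_path)   # ['none', 'none', 'domain', 'none', 'domain', 'domain']
print(decoded_path)  # ['none', 'none', 'domain', 'domain', 'domain', 'domain']
```

Taking the independent per-residue maximum splits the domain in two at residue 4, while the decoded path keeps a single contiguous domain because switching states twice costs more than accepting one slightly lower emission score.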

Applications

PSALM is aimed at researchers performing functional annotation of proteins and proteomes: assigning domain architectures to newly sequenced proteins, scanning large sequence databases, and detecting remote homologs that fall below the detection threshold of profile-HMM methods. Because it is released with weights and code, it can be integrated into annotation pipelines as a complement or alternative to HMMER and Pfam, and it provides a template for applying protein language models to structured sequence-labeling tasks.

Impact

Coming from the lab that created HMMER, PSALM is a notable signal that domain annotation — one of the most mature areas of computational protein analysis — can benefit from protein language models without abandoning probabilistic rigor. By demonstrating HMMER-comparable accuracy at the scale of tens of millions of sequences, it shows that learned representations can complement, rather than merely replace, established statistical infrastructure. Its open release lowers the barrier for adoption and for further research on hybrid language-model/probabilistic annotation methods. As a recent preprint, its long-term influence on Pfam-scale annotation workflows is still emerging.

Citation

Protein sequence domain annotation using a language model

Preprint

Sarkar, A., et al. (2024) Protein sequence domain annotation using a language model. bioRxiv.

DOI: 10.1101/2024.06.04.596712

Citations

Total Citations: 0

Influential: 0

References: 36

Openness

bio.rodeo openness: Fully open · usable and reproducible

Score: 91 (Open)

Usability (can I run it?): 99

Reproducibility (can I retrain it?): 87

Model Openness Framework

Class III

Open Model

Resources

Research Paper
