Overview

PLMSearch is a homologous protein search method developed at Fudan University that replaces traditional sequence-alignment heuristics with deep representations from pre-trained protein language models. Published in Nature Communications in 2024, it addresses a persistent limitation of tools like BLAST and MMseqs2: their reliance on sequence similarity scores that become unreliable when evolutionary divergence is high. Proteins can share a common fold and function while retaining less than 20% sequence identity — a regime where conventional methods lose most of their sensitivity.

By pairing embeddings from a pre-trained protein language model with a similarity predictor trained on structural similarity data, PLMSearch captures evolutionary signals that raw sequence identity misses. The result is a search tool whose sensitivity is comparable to that of structure-based search methods, which require 3D coordinates as input, while accepting only sequence as input. It is paired with a complementary alignment module, PLMAlign, for high-quality pairwise alignment of pre-filtered candidates.

Key Features

Remote homology detection: Achieves over threefold greater sensitivity than MMseqs2 on the SCOPe40 benchmark for detecting remote homologs with low sequence identity but conserved folds.
Sequence-only input: Requires only amino acid sequence, with no need for experimental or predicted structures, making it broadly applicable to any protein database or metagenomic dataset.
Two-stage search pipeline: A fast protein-level pre-filtering step (PLMSearch) narrows millions of candidates; a slower residue-level alignment step (PLMAlign) then scores the shortlist for high-quality hits.
Structure-informed training: The similarity predictor is trained on structural similarity labels rather than sequence identity, enabling the model to learn relationships that sequence-based metrics obscure.
Database-scale speed: Pre-filtering across millions of query-target pairs completes in seconds, making genome- and proteome-wide searches computationally tractable.
Competitive with structure search: Matches the sensitivity of state-of-the-art structure-based search methods on benchmark datasets while avoiding the computational cost of structure prediction as a prerequisite.
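
The two-stage pipeline described in the list above can be sketched in a few lines. This is a minimal illustration of the filter-then-align pattern, not the PLMSearch implementation: the function names, the `prefilter_threshold` value, and both toy scoring functions are assumptions chosen so the example runs without any model.

```python
from typing import Callable, List, Sequence, Tuple


def two_stage_search(
    query: str,
    targets: Sequence[str],
    prefilter_score: Callable[[str, str], float],
    align_score: Callable[[str, str], float],
    prefilter_threshold: float = 0.3,  # hypothetical cutoff, not from the paper
) -> List[Tuple[str, float]]:
    """Rank targets for one query using a cheap filter, then a costly scorer."""
    # Stage 1: cheap protein-level score for every target.
    shortlist = [
        t for t in targets if prefilter_score(query, t) >= prefilter_threshold
    ]
    # Stage 2: expensive residue-level score only for the shortlist.
    scored = [(t, align_score(query, t)) for t in shortlist]
    # Best-scoring hits first.
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


# Toy stand-ins for the two scorers (NOT the PLMSearch or PLMAlign models).
def shared_residue_fraction(a: str, b: str) -> float:
    """Cheap proxy: Jaccard overlap of the residue types present in a and b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def matching_positions(a: str, b: str) -> float:
    """Costlier proxy: number of identical residues at the same position."""
    return float(sum(x == y for x, y in zip(a, b)))


hits = two_stage_search(
    "MKTAY", ["MKTAY", "MKTWW", "GGGGG"], shared_residue_fraction, matching_positions
)
print(hits)
```

The point of the design is that `align_score` is never called for `"GGGGG"`, which fails the cheap filter. In a real search the same saving applies to the vast majority of database entries.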

Technical Details

PLMSearch is built around two pre-trained protein language models. The pre-filtering component uses ESM-1b (650 million parameters) to generate fixed-length per-protein embeddings; a learned similarity predictor scores pairs from these embeddings to rapidly identify high-confidence candidate homologs. The alignment component, PLMAlign, uses ProtT5-XL-UniRef50 (3 billion parameters, 1024-dimensional per-residue embeddings) to produce residue-level representations for detailed alignment of the pre-filtered pairs, analogous to how Smith-Waterman alignment follows BLAST pre-filtering.
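
To make the idea of a fixed-length per-protein embedding concrete, the sketch below averages per-residue vectors into one vector per protein and compares two proteins by cosine similarity. Both choices are assumptions for illustration: the text above does not state how residue embeddings are pooled, and cosine similarity only stands in for the learned similarity predictor. The two-dimensional vectors are made up.

```python
import math
from typing import List, Sequence


def mean_pool(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (L x D) into one length-D protein vector."""
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    length = len(per_residue)
    return [sum(column) / length for column in zip(*per_residue)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "embeddings"; real per-residue embeddings are far wider.
protein_a = mean_pool([[1.0, 0.0], [3.0, 0.0]])              # 2 residues
protein_b = mean_pool([[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]])  # 3 residues
protein_c = mean_pool([[0.0, 1.0]])                          # 1 residue

print(cosine_similarity(protein_a, protein_b))
print(cosine_similarity(protein_a, protein_c))
```

Pooling is what lets proteins of different lengths (two, three, and one residue here) be compared with a single vector operation, which is why the protein-level step is cheap enough to run over a whole database.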

Training uses a structural similarity framework in which pairwise TM-scores computed from known PDB structures serve as supervision labels. This cross-modal strategy, training a sequence model on structural targets, is what allows PLMSearch to generalize beyond the regime where sequence similarity is informative. On the SCOPe40-test benchmark, PLMSearch showed a more than threefold sensitivity improvement over MMseqs2, and on Swiss-Prot database searches it maintained accuracy after homologs of the training proteins were filtered out. The two-stage design keeps compute tractable: expensive residue-level alignment is applied only to the small fraction of pairs that pass the protein-level filter.
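
As a toy illustration of what it means to use TM-scores as supervision labels, the sketch below fits a straight line that maps a sequence-derived similarity value to a TM-score by least squares. It is deliberately much simpler than a learned neural predictor, and is not the model used by PLMSearch. The variable names and all numbers are invented for the example.

```python
from typing import Sequence, Tuple


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        raise ValueError("xs must not all be equal")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mean_squared_error(predicted: Sequence[float], labels: Sequence[float]) -> float:
    """Average squared difference between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)


# Made-up training pairs: a sequence-derived similarity for each protein pair
# and the TM-score of the same pair computed from known structures.
embedding_similarity = [0.0, 0.5, 1.0]
tm_score_labels = [0.2, 0.5, 0.8]

slope, intercept = fit_linear(embedding_similarity, tm_score_labels)
predictions = [slope * x + intercept for x in embedding_similarity]
print(slope, intercept, mean_squared_error(predictions, tm_score_labels))
```

The inputs come from sequence alone while the targets come from structure, so at search time the fitted function can estimate structural similarity for pairs that have no solved structure.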

Applications

PLMSearch is well suited to any research task that requires identifying distantly related proteins across large databases. Functional annotation workflows benefit directly: detecting remote homologs of characterized proteins can transfer Gene Ontology terms and active-site annotations to uncharacterized sequences. Evolutionary studies gain access to more complete homolog sets for phylogenetic reconstruction and identification of conserved functional domains. In structural biology, the method can supplement template-based structure prediction by surfacing remote structural templates that MMseqs2-based searches miss. For drug discovery, finding remote homologs of therapeutic targets can reveal off-target binding partners or structurally related proteins suitable for repurposing campaigns. The method is also efficient enough for metagenomic contexts, where millions of partial or divergent sequences need to be classified against reference databases.
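
The annotation-transfer workflow mentioned above can be sketched as follows. This is a hypothetical post-processing step, not part of PLMSearch: the function name, the `min_score` cutoff, the protein identifiers, and the scores are all assumptions, and the hit list is taken to be `(target_id, score)` pairs from any homology search.

```python
from typing import Dict, Iterable, Set, Tuple


def transfer_go_terms(
    hits: Iterable[Tuple[str, float]],
    annotations: Dict[str, Set[str]],
    min_score: float,
) -> Dict[str, float]:
    """Map each GO term to the best score among confident hits that carry it."""
    transferred: Dict[str, float] = {}
    for target_id, score in hits:
        if score < min_score:
            continue  # ignore hits below the confidence cutoff
        for term in annotations.get(target_id, set()):
            # Keep the strongest supporting hit for each term.
            transferred[term] = max(score, transferred.get(term, score))
    return transferred


# Made-up search results and annotations for an uncharacterized query.
hits = [("P1", 0.9), ("P2", 0.6), ("P3", 0.2)]
annotations = {
    "P1": {"GO:0003824"},
    "P2": {"GO:0003824", "GO:0005524"},
    "P3": {"GO:0016020"},
}

candidate_terms = transfer_go_terms(hits, annotations, min_score=0.5)
print(candidate_terms)
```

Terms transferred this way are candidates for review, not confirmed functions. The retained score shows how strong the best supporting hit was.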

Impact

PLMSearch demonstrates that pairing protein language model embeddings with a similarity predictor trained on structural data substantially extends the reach of sequence search, narrowing a long-standing gap between sequence- and structure-based homology detection. Its publication in Nature Communications and availability through an open GitHub repository have made it accessible to researchers who rely on MMseqs2-style workflows but need greater sensitivity for divergent proteins. A current limitation is that the approach depends on the quality of the underlying language model embeddings, which may be less reliable for sequences of highly unusual composition or for very short fragments. As protein language models continue to scale and improve, PLMSearch-style frameworks are likely to become a standard first step in large-scale homology analysis pipelines.

Citation

Liu, W., Wang, Z., You, R., Xie, C., Wei, H., Xiong, Y., Yang, J., & Zhu, S. (2024). PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nature Communications, 15(1), 2775.

DOI: 10.1038/s41467-024-46808-5

