Protein language model-based sequence search that detects remote homologs with threefold higher sensitivity than MMseqs2 at comparable speed.
PLMSearch is a homologous protein search method developed at Fudan University that replaces traditional sequence-alignment heuristics with deep representations from pre-trained protein language models. Published in Nature Communications in 2024, it addresses a persistent limitation of tools like BLAST and MMseqs2: their reliance on sequence similarity scores that become unreliable when evolutionary divergence is high. Proteins can share a common fold and function while retaining less than 20% sequence identity — a regime where conventional methods lose most of their sensitivity.
By embedding proteins with a pre-trained language model and scoring pairs with a predictor supervised on structural similarity, PLMSearch captures evolutionary signals that the language model has learned from sequence co-variation patterns rather than from raw identity. The result is a search tool whose sensitivity is comparable to that of structure-based search methods, which require 3D coordinates as input, while accepting only sequence. It is paired with a complementary alignment module, PLMAlign, which produces high-quality pairwise alignments of the pre-filtered candidates.
PLMSearch is built around two pre-trained protein language models. The pre-filtering component uses ESM-1b (650 million parameters) to generate fixed-length per-protein embeddings; a learned similarity predictor scores pairs from these embeddings to rapidly identify high-confidence candidate homologs. The alignment component, PLMAlign, uses ProtT5-XL-UniRef50 (3 billion parameters, 1024-dimensional per-residue embeddings) to produce residue-level representations for detailed alignment of the pre-filtered pairs, analogous to how Smith-Waterman alignment follows BLAST pre-filtering.
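The protein-level pre-filter described above can be illustrated with a minimal sketch. It is not the PLMSearch implementation: random vectors stand in for language-model embeddings, plain cosine similarity stands in for the learned similarity predictor, and the names `mean_pool`, `cosine`, and `prefilter` are hypothetical helpers introduced only for this example.

```python
import math
import random


def mean_pool(per_residue):
    """Average an L x D list of per-residue vectors into one D-dimensional vector."""
    length = len(per_residue)
    return [sum(column) / length for column in zip(*per_residue)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefilter(query_vec, target_vecs, top_k):
    """Score every target against the query and keep the top_k (index, score) pairs.

    Assumption: cosine similarity replaces the learned predictor used by PLMSearch.
    """
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(target_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


random.seed(0)
DIM = 16  # toy width, far smaller than a real language-model embedding

# Toy per-residue matrix for a 120-residue query, pooled to a fixed-length vector.
query_residues = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(120)]
query_vec = mean_pool(query_residues)

# Five toy target vectors; target 2 is planted to point in the query's direction.
targets = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
targets[2] = [2.0 * x for x in query_vec]

hits = prefilter(query_vec, targets, top_k=2)
print(hits)  # the planted target (index 2) ranks first
```

Because every protein is reduced to one fixed-length vector, target vectors can be computed once and reused across queries, which is what makes a protein-level filter cheap relative to residue-level alignment.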
Training of the similarity predictor uses a structural similarity framework in which pairwise TM-scores computed from known PDB structures serve as supervision labels. This cross-modal strategy, training a sequence model on structural targets, is what allows PLMSearch to generalize beyond the regime where sequence similarity is informative. On the SCOPe40-test benchmark, PLMSearch demonstrated a threefold improvement in sensitivity over MMseqs2, and on Swiss-Prot database searches it maintained its accuracy even after sequences homologous to the training set were filtered out. The two-stage design keeps compute tractable: expensive residue-level alignment is applied only to the small fraction of pairs that pass the protein-level filter.
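The idea of supervising a sequence-side predictor with structural labels can be sketched as a small regression problem. Everything in the block below is an assumption made for illustration: the pair features (an element-wise product of two embeddings), the single-layer sigmoid model, the mean-squared-error loss, and the synthetic labels that stand in for TM-scores. In a real setting the labels would come from structure comparison of known structures, and the source does not specify the architecture of the actual predictor.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_features(u, v):
    """Symmetric pair features: element-wise product of two embeddings (an assumption)."""
    return [a * b for a, b in zip(u, v)]


def predict(weights, bias, features):
    """Map pair features to a score in (0, 1), the same range as a TM-score."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def train(pairs, labels, dim, epochs=300, lr=0.5):
    """Full-batch gradient descent on mean squared error against the labels."""
    weights, bias, history = [0.0] * dim, 0.0, []
    n = len(pairs)
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * dim, 0.0, 0.0
        for feats, target in zip(pairs, labels):
            pred = predict(weights, bias, feats)
            err = pred - target
            loss += err * err
            # Chain rule through the squared error and the sigmoid.
            g = 2.0 * err * pred * (1.0 - pred)
            for i, f in enumerate(feats):
                grad_w[i] += g * f
            grad_b += g
        history.append(loss / n)  # loss measured before this epoch's update
        weights = [w - lr * gw / n for w, gw in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias, history


random.seed(0)
DIM = 4
hidden_w, hidden_b = [1.5, -1.0, 0.5, 2.0], -0.3  # toy relationship to recover

pairs, labels = [], []
for _ in range(40):
    u = [random.gauss(0, 1) for _ in range(DIM)]
    v = [random.gauss(0, 1) for _ in range(DIM)]
    feats = pair_features(u, v)
    pairs.append(feats)
    labels.append(predict(hidden_w, hidden_b, feats))  # synthetic stand-in for a TM-score

weights, bias, history = train(pairs, labels, DIM)
print(f"mean squared error: {history[0]:.4f} -> {history[-1]:.4f}")
```

The point of the sketch is only the shape of the supervision: the inputs are derived from sequence embeddings, while the regression target is a structural similarity value, so at search time no structure is needed.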
PLMSearch is well suited to any research task that requires identifying distantly related proteins across large databases. Functional annotation workflows benefit directly: detecting remote homologs of characterized proteins can transfer Gene Ontology terms and active-site annotations to uncharacterized sequences. Evolutionary studies gain access to more complete homolog sets for phylogenetic reconstruction and identification of conserved functional domains. In structural biology, the method can supplement template-based structure prediction by surfacing remote structural templates that MMseqs2-based searches miss. For drug discovery, finding remote homologs of therapeutic targets can reveal off-target binding partners or structurally related proteins suitable for repurposing campaigns. The method is also efficient enough for metagenomic contexts, where millions of partial or divergent sequences need to be classified against reference databases.
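As an illustration of the annotation-transfer use case, the sketch below copies Gene Ontology terms from annotated hits to a query when the hit score clears a threshold. The hit list, target identifiers, scores, and the 0.5 cutoff are hypothetical values chosen for the example, and `transfer_annotations` is a helper name introduced here, not part of PLMSearch.

```python
def transfer_annotations(hits, annotations, min_score):
    """Collect GO terms from hits scoring at least min_score.

    Returns a dict mapping each transferred term to the best score supporting it.
    """
    transferred = {}
    for target_id, score in hits:
        if score < min_score:
            continue  # too weak to justify transferring annotations
        for term in annotations.get(target_id, ()):
            transferred[term] = max(score, transferred.get(term, score))
    return transferred


# Hypothetical search output: (target identifier, predicted similarity).
hits = [("target_A", 0.82), ("target_B", 0.55), ("target_C", 0.21)]

# Hypothetical annotations for the characterized targets.
annotations = {
    "target_A": ["GO:0016787"],                 # hydrolase activity
    "target_B": ["GO:0016787", "GO:0005524"],   # hydrolase activity, ATP binding
    "target_C": ["GO:0003677"],                 # DNA binding
}

transferred = transfer_annotations(hits, annotations, min_score=0.5)
print(transferred)
```

Keeping the supporting score alongside each term lets downstream users apply a stricter cutoff later without rerunning the search.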
PLMSearch demonstrates that pairing protein language model embeddings with a predictor trained on structural similarity substantially extends the reach of sequence search, narrowing a long-standing gap between sequence- and structure-based homology detection. Its publication in Nature Communications and availability through an open GitHub repository have made it accessible to researchers who rely on MMseqs2-style workflows but need greater sensitivity for divergent proteins. A current limitation is that the approach depends on the quality of the underlying language model embeddings, which may be less reliable for sequences of highly unusual composition or for very short fragments. As protein language models continue to scale and improve, PLMSearch-style frameworks are likely to become a standard first step in large-scale homology analysis pipelines.
Liu, W., Wang, Z., You, R., Xie, C., Wei, H., Xiong, Y., Yang, J., & Zhu, S. (2024). PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nature Communications, 15(1), 2775.
DOI: 10.1038/s41467-024-46808-5