A distilled deep learning model that predicts structural similarity between proteins directly from sequence, with reported speedups of up to 258x over the original TM-Vec in large-scale homology search.
TM-Vec 2 is a deep learning method for detecting protein structural homology directly from amino acid sequence, without first predicting or aligning three-dimensional structures. Structural similarity often reveals evolutionary and functional relationships that sequence identity alone misses, but measuring it traditionally requires structures and structure-alignment tools such as TM-align or Foldseek. TM-Vec 2 instead encodes proteins into vectors whose distances approximate structural similarity scores, so that homology search reduces to fast nearest-neighbor lookups over embeddings.
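The reduction of homology search to nearest-neighbor lookup can be illustrated with a minimal sketch. It assumes cosine similarity between embeddings stands in for the structural similarity score, and it uses random vectors in place of real model output; the function names, the embedding dimension, and the database size are all hypothetical and are not taken from the TM-Vec 2 code.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def top_k_neighbors(query, database, k=3):
    """Return indices and cosine scores of the k database rows closest to the query."""
    q = normalize_rows(query[None, :])[0]
    db = normalize_rows(database)
    scores = db @ q                            # one cosine score per database entry
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k, O(n)
    top = top[np.argsort(-scores[top])]        # order those k by descending score
    return top, scores[top]

# Hypothetical stand-ins for embeddings produced by a sequence encoder.
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))
# A query that is a slightly perturbed copy of database entry 42.
query = database[42] + 0.01 * rng.normal(size=64)

indices, scores = top_k_neighbors(query, database, k=3)
print(indices, scores)
```

This brute-force scan is linear in database size. At the scale of hundreds of millions of sequences, an approximate nearest-neighbor index would normally replace the full matrix product, but the interface stays the same: a query vector in, ranked neighbors out.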
The model is the successor to the original TM-Vec, developed by Keluskar, Batra, Bezshapkin, Morton, and Zhu at Arizona State University and released as a February 2026 bioRxiv preprint. Its headline contribution is efficiency: a distilled variant, TM-Vec 2s, achieves up to 258x speedup over the original TM-Vec and up to 56x speedup over Foldseek for large-scale database queries, while reportedly improving accuracy over the original TM-Vec. This efficiency makes structure-aware search practical at the scale of modern protein databases, which now contain hundreds of millions of sequences.
TM-Vec 2 fits into the landscape of structure-informed search tools alongside Foldseek and the original TM-Vec, occupying the niche of sequence-only structural homology detection where no experimental or predicted structure is required at query time.
TM-Vec 2 builds on protein language model embeddings. It encodes each sequence into a vector such that the distance between two vectors approximates the structural similarity score of the corresponding proteins, and these vectors are then used for fast retrieval. The authors introduce a distilled student model, TM-Vec 2s, trained to reproduce the behavior of the larger model at much lower compute cost; this distillation is the source of the reported throughput gains. On large-scale database queries, TM-Vec 2s reaches up to 258x speedup relative to the original TM-Vec and up to 56x relative to Foldseek, while reporting higher accuracy than the original TM-Vec. The preprint details the training data, the structural similarity targets used for supervision, and the benchmark protocols; it is released under a CC BY-NC-ND license. Because this is a recent preprint, the public availability of code and trained weights should be confirmed with the authors before use.
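The summary above does not specify the distillation objective, so the following is only a generic sketch of one common formulation: the student is penalized when the similarity it predicts for a protein pair differs from the similarity the teacher predicts. Cosine similarity, the mean-squared-error loss, the embedding sizes, and all function names are assumptions made for illustration, not the authors' training procedure.

```python
import numpy as np

def pairwise_cosine(a, b):
    """Cosine similarity between row i of a and row i of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def distillation_loss(student_a, student_b, teacher_a, teacher_b):
    """Mean squared gap between student and teacher similarity predictions.

    Only the predicted scores are compared, so the student may use a
    smaller embedding dimension than the teacher.
    """
    student_sim = pairwise_cosine(student_a, student_b)
    teacher_sim = pairwise_cosine(teacher_a, teacher_b)
    return float(np.mean((student_sim - teacher_sim) ** 2))

# Hypothetical embeddings for 8 protein pairs (random stand-ins, not model output).
rng = np.random.default_rng(0)
teacher_a = rng.normal(size=(8, 512))
teacher_b = rng.normal(size=(8, 512))
student_a = rng.normal(size=(8, 128))
student_b = rng.normal(size=(8, 128))

loss = distillation_loss(student_a, student_b, teacher_a, teacher_b)
print(loss)
```

In an actual training loop this quantity would be minimized with respect to the student's parameters, typically in an automatic-differentiation framework rather than plain NumPy.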
TM-Vec 2 is aimed at researchers performing large-scale protein homology and function-annotation searches, including metagenomic and proteome-wide studies where most sequences lack experimentally determined structures. Because it operates from sequence and returns structure-aware matches quickly, it is well suited to annotating large collections of uncharacterized proteins, discovering remote homologs that escape sequence-based search, and clustering proteins by structural relatedness. The faster TM-Vec 2s variant is particularly useful for all-against-all comparisons across very large databases.
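An all-against-all comparison followed by clustering might look like the sketch below. It assumes cosine similarity between embeddings approximates a TM-score-like value and links any pair scoring at or above a cutoff; the cutoff of 0.5 echoes the conventional TM-score threshold for shared fold but is used here purely as an illustrative parameter. The toy embeddings and function names are hypothetical.

```python
import numpy as np

def all_against_all(embeddings):
    """Cosine similarity between every pair of embeddings (n x n matrix)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def cluster_by_threshold(sim, threshold=0.5):
    """Single-linkage clustering: connect pairs with sim >= threshold (union-find)."""
    n = sim.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit each unordered pair once via the strict upper triangle.
    rows, cols = np.where(np.triu(sim >= threshold, k=1))
    for i, j in zip(rows, cols):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri
    return [find(i) for i in range(n)]

# Toy data: two groups of five embeddings scattered around orthogonal centers.
rng = np.random.default_rng(0)
center_a = np.zeros(32)
center_a[0] = 1.0
center_b = np.zeros(32)
center_b[1] = 1.0
embeddings = np.vstack([
    center_a + 0.02 * rng.normal(size=(5, 32)),
    center_b + 0.02 * rng.normal(size=(5, 32)),
])

sim = all_against_all(embeddings)
labels = cluster_by_threshold(sim, threshold=0.5)
print(labels)
```

The dense similarity matrix grows quadratically with the number of proteins, so for very large databases the comparison would need to be processed in blocks or restricted to each protein's nearest neighbors.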
By delivering structure-aware homology search at speeds approaching or exceeding those of fast structure-alignment tools, TM-Vec 2 lowers the cost of incorporating structural similarity into routine large-scale protein analysis. The distillation strategy that yields TM-Vec 2s illustrates how a smaller student model can retain accuracy while dramatically improving throughput, a pattern increasingly relevant as protein databases grow. Because the work is a February 2026 preprint, the reported speedup and accuracy figures come from the authors and await independent benchmarking; performance on the most remote homologs and on proteins poorly represented in the training data remains to be characterized externally.