bio.rodeo
Protein foundation models

TM-Vec 2

Arizona State University

A distilled deep learning model that predicts structural similarity between proteins directly from sequence, with reported speedups of up to 258x over the original TM-Vec for large-scale homology search.

Released: February 2026

TM-Vec 2 is a deep learning method for detecting protein structural homology directly from amino acid sequence, without first predicting or aligning three-dimensional structures. Structural similarity often reveals evolutionary and functional relationships that sequence identity alone misses, but measuring it traditionally requires structures and structure-alignment tools such as TM-align or Foldseek. TM-Vec 2 instead encodes each protein as a vector such that the closeness of two vectors approximates the structural similarity score of the corresponding proteins, so homology search reduces to fast nearest-neighbor lookups over embeddings.

The model is the successor to the original TM-Vec. It was developed by Keluskar, Batra, Bezshapkin, Morton, and Zhu at Arizona State University and described in a bioRxiv preprint released in February 2026. Its headline contribution is efficiency: a distilled variant, TM-Vec 2s, is reported to achieve up to 258x speedup over the original TM-Vec and up to 56x speedup over Foldseek for large-scale database queries, while also improving accuracy over the original TM-Vec. This makes structure-aware search practical at the scale of modern protein databases, which now contain hundreds of millions of sequences.

TM-Vec 2 fits into the landscape of structure-informed search tools alongside Foldseek and the original TM-Vec, occupying the niche of sequence-only structural homology detection where no experimental or predicted structure is required at query time.

#Key Features

  • Sequence-only structural search: TM-Vec 2 predicts structural similarity (TM-score-like) between proteins from sequence alone, removing the need to compute or align 3D structures for each query.
  • Large speedups via distillation: The distilled TM-Vec 2s variant reaches up to 258x speedup over the original TM-Vec and up to 56x over Foldseek on large-scale queries.
  • Improved accuracy: Despite being faster, TM-Vec 2s is reported to achieve higher accuracy than the original TM-Vec on structural similarity benchmarks.
  • Embedding-based retrieval: Proteins are mapped to fixed-length vectors, so homology search becomes efficient nearest-neighbor lookup that scales to very large databases.
  • Two model variants: A full TM-Vec 2 model and a distilled TM-Vec 2s model let users trade maximum accuracy against maximum throughput.
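The embedding-based retrieval described above can be illustrated with a minimal sketch. It assumes cosine similarity as the closeness measure and uses made-up 3-dimensional vectors; the function names (`normalize`, `cosine_similarity`, `top_k`) are hypothetical and are not part of any TM-Vec 2 interface. A production system would use an approximate nearest-neighbor index rather than the brute-force scan shown here.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_unit, b_unit = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))

def top_k(query, database, k=2):
    """Return the k database entries most similar to the query, best first.

    `database` maps a protein identifier to its embedding vector.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Toy embeddings (assumed values for illustration only; real embeddings
# come from the model and have many more dimensions).
database = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.0, 1.0, 0.0],
    "protein_C": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

hits = top_k(query, database, k=2)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

Because each database embedding is computed once and reused, the per-query cost is one encoder pass plus vector comparisons, which is what lets this style of search scale to large collections.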

#Technical Details

TM-Vec 2 builds on protein language model embeddings, encoding each sequence into a vector such that the closeness of two vectors approximates the structural similarity score of the corresponding proteins; these vectors are then used for fast retrieval. The authors introduce a distilled student model, TM-Vec 2s, trained to reproduce the behavior of the larger model at much lower compute, which is the source of the reported throughput gains. On large-scale database queries, TM-Vec 2s is reported to reach up to 258x speedup relative to the original TM-Vec and up to 56x relative to Foldseek, with higher accuracy than the original TM-Vec. The preprint details the training data, the structural similarity targets used for supervision, and the benchmark protocols, and is released under a CC BY-NC-ND license. Because the work is a recent preprint, users should confirm the public availability of code and trained weights with the authors before relying on it.
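This page does not state the distillation objective used for TM-Vec 2s. As a rough illustration of the general idea only, the sketch below shows one common form of distillation for embedding models: the student is penalized when its pairwise similarities differ from the teacher's. The choice of cosine similarity, the mean-squared-error loss, and all names and values here are assumptions, not the authors' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_scores(embeddings):
    """Similarity for every unordered pair of embeddings, in a fixed order."""
    n = len(embeddings)
    return [
        cosine(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

def distillation_loss(teacher_embeddings, student_embeddings):
    """Mean squared difference between teacher and student pairwise scores.

    The two models may use different embedding sizes, since only the
    pairwise similarities are compared.
    """
    teacher = pairwise_scores(teacher_embeddings)
    student = pairwise_scores(student_embeddings)
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)

# Hypothetical embeddings for the same three proteins.
teacher = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]   # larger model, 4 dimensions
student_good = [[1, 0], [0, 1], [1, 1]]                # preserves the teacher's geometry
student_bad = [[1, 0], [1, 0], [1, 0]]                 # collapses all proteins together

loss_good = distillation_loss(teacher, student_good)
loss_bad = distillation_loss(teacher, student_bad)
print(loss_good, loss_bad)
```

In training, a loss of this kind would be minimized by gradient descent over the student's parameters; the sketch only evaluates it for fixed embeddings to show what "reproducing the behavior of the larger model" can mean in practice.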

#Applications

TM-Vec 2 is aimed at researchers performing large-scale protein homology and function-annotation searches, including metagenomic and proteome-wide studies where most sequences lack experimentally determined structures. Because it operates from sequence and returns structure-aware matches quickly, it is well suited to annotating large collections of uncharacterized proteins, discovering remote homologs that escape sequence-based search, and clustering proteins by structural relatedness. The faster TM-Vec 2s variant is particularly useful for all-against-all comparisons across very large databases.
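Clustering by structural relatedness from an all-against-all comparison can be sketched as follows. This is a simple single-linkage grouping with a union-find structure, not a procedure taken from the preprint. The 0.5 cutoff follows the common convention that a TM-score above 0.5 suggests a shared fold; whether that cutoff transfers to predicted scores is an assumption, and the protein names and scores are invented.

```python
from itertools import combinations

def cluster_by_threshold(names, score, threshold=0.5):
    """Group items so that any pair scoring at or above `threshold` is linked.

    `score(a, b)` returns the predicted similarity for a pair of names.
    Linked pairs are merged transitively (single linkage).
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # All-against-all comparison: n * (n - 1) / 2 pairs for n items.
    for a, b in combinations(names, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical predicted similarity scores for four proteins.
predicted = {
    frozenset(("p1", "p2")): 0.82,
    frozenset(("p1", "p3")): 0.21,
    frozenset(("p1", "p4")): 0.18,
    frozenset(("p2", "p3")): 0.30,
    frozenset(("p2", "p4")): 0.25,
    frozenset(("p3", "p4")): 0.64,
}
names = ["p1", "p2", "p3", "p4"]

clusters = cluster_by_threshold(names, lambda a, b: predicted[frozenset((a, b))])
print(clusters)
```

The number of pairs grows quadratically with the number of proteins, which is why per-comparison cost dominates all-against-all workloads on very large databases.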

#Impact

By delivering structure-aware homology search at speeds reported to match or exceed those of fast structure-alignment tools such as Foldseek, TM-Vec 2 lowers the cost of incorporating structural similarity into routine large-scale protein analysis. The distillation strategy that yields TM-Vec 2s illustrates how a smaller student model can retain accuracy while dramatically improving throughput, a pattern increasingly relevant as protein databases grow. Because TM-Vec 2 is described in a February 2026 preprint, the reported speedup and accuracy figures come from the authors and await independent benchmarking; performance on the most remote homologs and on proteins poorly represented in the training data remains to be characterized externally.

Tags

homology_detection, structure_prediction, transformer, representation_learning, embeddings, proteomics