bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein

TEA

Biozentrum / University of Basel / SIB Swiss Institute of Bioinformatics

Contrastively trained model that maps ESM2 protein-language-model embeddings to a 20-letter alphabet, enabling sequence-only remote homology detection that approaches structure-based sensitivity at MMseqs2 speed.

Released: November 2025
Parameters: 1.7 Million

TEA (The Embedded Alphabet) is a pretrained model from the Schwede and Durairaj groups at the Biozentrum, University of Basel, and the SIB Swiss Institute of Bioinformatics that rewrites protein sequences into a learned 20-letter alphabet, enabling structure-quality remote homology detection without ever computing or supplying a structure. Released as a bioRxiv preprint in late 2025, it targets a long-standing tension in sequence search: fast aligners such as BLAST and MMseqs2 scale to billions of sequences but lose sensitivity in the "twilight zone" of low sequence identity, while structure-based methods such as Foldseek recover distant relationships at the cost of needing 3D coordinates.

TEA bridges these regimes by distilling the rich representations of a protein language model into a discrete symbolic encoding. It takes per-residue embeddings from ESM2 (esm2_t33_650M_UR50D) and, via a small contrastively trained adapter, maps each residue to one of 20 new letters that capture structural and evolutionary context rather than raw amino acid identity. Because the output is an ordinary letter string, the converted "tea-FASTA" sequences can be fed directly into the standard MMseqs2 search engine, inheriting its mature, highly optimized speed while encoding the kind of information that normally requires a structure to access.
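The encoding step described above can be pictured as a per-residue classification: each embedding vector is projected to 20 scores, and the highest-scoring symbol is emitted. The following is a minimal sketch under stated assumptions, not the released adapter. The single linear layer, the toy 4-dimensional embeddings, the function name `encode_to_alphabet`, and the reuse of the 20 standard amino acid characters as output symbols are all illustrative choices that the source does not confirm.

```python
import random

# Assumption: the 20 output symbols are written with the standard amino acid
# characters so that ordinary FASTA tooling accepts the converted sequence.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def encode_to_alphabet(embeddings, weights, bias):
    """Map per-residue embedding vectors to one symbol each.

    embeddings: list of L vectors, each of length d
    weights:    20 rows of length d (hypothetical linear projection)
    bias:       20 offsets, one per symbol
    """
    letters = []
    for vec in embeddings:
        # One score (logit) per symbol of the alphabet.
        logits = [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)
        ]
        # Emit the highest-scoring symbol for this residue.
        best = max(range(len(logits)), key=logits.__getitem__)
        letters.append(ALPHABET[best])
    return "".join(letters)

# Toy stand-ins: real esm2_t33_650M_UR50D embeddings have 1280 dimensions.
rng = random.Random(0)
dim, length = 4, 6
toy_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
toy_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(len(ALPHABET))]
toy_bias = [0.0] * len(ALPHABET)

tea_string = encode_to_alphabet(toy_embeddings, toy_weights, toy_bias)
print(tea_string)  # a 6-character string over ALPHABET
```

The point of the sketch is the shape of the computation: one embedding in, one letter out, so the converted sequence has the same length as the input protein and can be written to a FASTA file like any other sequence.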

The model ships as a fixed checkpoint that is applied zero-shot, with no per-task fine-tuning. It is distinct from sequence-search methods such as PLMSearch, which compare language-model embeddings directly. TEA's contribution is the alphabet itself, plus a paired substitution matrix (MATCHA) that makes the encoded sequences searchable with off-the-shelf alignment tooling.

#Key Features

  • Learned 20-letter alphabet: A contrastively trained adapter converts ESM2 per-residue embeddings into a novel 20-symbol code ("tea-FASTA") that encodes structural and homology signal while remaining a plain text string.
  • Structure-quality, sequence-only search: Detects remote homologs approaching the sensitivity of structure-based methods while requiring only amino acid sequence as input — no predicted or experimental coordinates.
  • MMseqs2-speed retrieval: Because the output is a standard letter alphabet paired with a dedicated MATCHA substitution matrix, searches run inside the unmodified MMseqs2 engine at conventional sequence-search throughput.
  • Per-residue confidence: The model emits per-position logits and entropy, surfacing low-confidence residues (rendered in lowercase) so users can weight uncertain regions.
  • Fixed, zero-shot checkpoint: A single released model is used directly on tasks such as SCOPe remote homology without retraining, simplifying reproducibility and deployment.
  • STEAM companion tool: A separate fast-search package (PickyBinders/steam) combines TEA representations with standard amino acid information to accelerate searches and generate alignments.
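The per-residue confidence feature listed above can be illustrated with a short sketch that turns logits into Shannon entropy and lowercases uncertain positions. This is an assumption-laden illustration rather than TEA's actual rule: the function names, the toy 4-symbol logits, and the entropy cutoff are hypothetical, and the source does not state how the released model decides which residues to render in lowercase.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one residue's logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; 0 for a one-hot distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mask_low_confidence(letters, per_residue_logits, max_entropy_bits=2.0):
    """Lowercase every symbol whose predictive entropy exceeds the cutoff.

    The default 2.0-bit cutoff is a hypothetical value chosen for illustration.
    """
    out = []
    for letter, logits in zip(letters, per_residue_logits):
        h = entropy_bits(softmax(logits))
        out.append(letter.lower() if h > max_entropy_bits else letter.upper())
    return "".join(out)

# Three residues over a toy 4-symbol alphabet: confident, uniform, confident.
logits = [
    [9.0, 0.0, 0.0, 0.0],  # sharply peaked, so entropy is close to 0 bits
    [1.0, 1.0, 1.0, 1.0],  # uniform over 4 symbols, so entropy is 2 bits
    [0.0, 8.0, 0.0, 0.0],
]
print(mask_low_confidence("ACD", logits, max_entropy_bits=1.0))  # prints "AcD"
```

Downstream code could use the same entropy values as soft weights instead of a hard cutoff; the lowercase convention simply makes the uncertainty visible in the plain-text sequence.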

#Technical Details

TEA is a lightweight adapter (roughly 1.67M trainable parameters) layered on top of the frozen ESM2 650M-parameter transformer (esm2_t33_650M_UR50D). Training uses a contrastive objective so that residues sharing structural or evolutionary context are assigned consistent symbols from the 20-letter alphabet; the resulting encoder is distilled into a discrete vocabulary together with the MATCHA substitution matrix used for scoring alignments.

Converted sequences are searched with MMseqs2, which means TEA reuses an existing, heavily engineered alignment pipeline rather than introducing a new search algorithm. The authors evaluate the approach on SCOPe remote homology detection, where the alphabet recovers distant fold relationships that sequence-identity-based search misses, while preserving the linear-time, database-scale performance characteristic of MMseqs2.

The model is pip-installable (pip install git+https://github.com/PickyBinders/tea), released under the MIT license with weights on Hugging Face, and exposed through a web server at pickybinders.org/tea.
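To make the idea of a contrastive objective concrete, the sketch below implements InfoNCE, a widely used contrastive loss, for a single anchor residue. The source does not say which contrastive loss TEA uses, so InfoNCE is swapped in here purely as a generic example; the 2-dimensional vectors, the temperature of 0.1, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(scaled)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[0]

# Toy representations: the positive stands for a residue assumed to share
# structural context with the anchor; the negatives stand for unrelated ones.
anchor = [1.0, 0.0]
aligned_residue = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.2]]

good = info_nce(anchor, aligned_residue, unrelated)
bad = info_nce(anchor, unrelated[0], [aligned_residue, unrelated[1]])
print(good < bad)  # True
```

The loss is small when the anchor sits closer to its positive than to any negative, which is the pressure that pushes residues with shared context toward the same region of representation space and, after discretization, toward the same letter.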

#Applications

TEA is suited to any workflow that needs to find distantly related proteins across large sequence collections without the cost of structure prediction: functional annotation transfer, template discovery for structure modeling, metagenomic and proteome-scale homolog search, and evolutionary analyses that depend on complete homolog sets. A companion application paper, "Reading TEA leaves for de novo protein design" (Pantolini & Durairaj, ICLR 2026 LMRL workshop), demonstrates using the alphabet in protein design contexts, indicating the representation is useful beyond retrieval. The fixed, sequence-only checkpoint makes it straightforward to drop into existing MMseqs2-based pipelines.
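As a rough picture of what dropping TEA into an MMseqs2-based pipeline could look like, the sketch below assembles an `mmseqs easy-search` command that points at a custom substitution matrix through the `--sub-mat` option. All file names, including `matcha.out`, are placeholders, the helper `build_mmseqs_search` is hypothetical, and any additional search parameters the TEA authors recommend are not covered here; the project's own documentation should be treated as the reference.

```python
import shlex

def build_mmseqs_search(query_fasta, target_fasta, result_file, tmp_dir, matrix_file):
    """Return the argument list for an `mmseqs easy-search` run with a custom matrix."""
    return [
        "mmseqs", "easy-search",
        query_fasta, target_fasta, result_file, tmp_dir,
        "--sub-mat", matrix_file,
    ]

# Placeholder paths: both FASTA files are assumed to hold sequences that have
# already been converted to the TEA alphabet.
cmd = build_mmseqs_search(
    "query.tea.fasta", "targets.tea.fasta", "hits.m8", "tmp", "matcha.out",
)
print(shlex.join(cmd))
```

With MMseqs2 installed, the list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a shell string avoids quoting problems when paths contain spaces.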

#Impact

By turning protein-language-model knowledge into a plain alphabet that standard aligners can search, TEA narrows the gap between fast sequence search and sensitive structure-based search while sidestepping the need for 3D coordinates. The strategy is notable for its pragmatism: rather than building a bespoke neural search system, it encodes learned signal into the input of mature tools like MMseqs2, lowering the barrier to adoption for groups already running large-scale homology searches. As a recent preprint, its long-term influence is still emerging, and reported gains depend on the quality of the underlying ESM2 embeddings. Even so, the alphabet-rewriting approach offers a reusable template for injecting representation-learning signal into classical bioinformatics infrastructure.

Citation

Rewriting protein alphabets with language models

Preprint

Pantolini, L., et al. (2025) Rewriting protein alphabets with language models. bioRxiv.

DOI: 10.1101/2025.11.27.690975

Openness

Unclassified
Missing required components

Tags

contrastive_learning, homology_detection, proteomics, representation_learning, sequence_search, transformer, zero_shot

Resources

GitHub Repository, Research Paper, Research Paper, HuggingFace Model, Demo