TEA

Biozentrum / University of Basel / SIB Swiss Institute of Bioinformatics

Protein sequence encoder that maps ESM2 embeddings to a learned 20-letter alphabet for structure-quality remote homology detection at MMseqs2 speed.

Released: November 2025

Parameters: 1.7 Million

TEA (The Embedded Alphabet) is a pretrained model from the Schwede and Durairaj groups at the Biozentrum, University of Basel, and the SIB Swiss Institute of Bioinformatics that rewrites protein sequences into a learned 20-letter alphabet, enabling structure-quality remote homology detection without ever computing or supplying a structure. Released as a bioRxiv preprint in late 2025, it targets a long-standing tension in sequence search: fast aligners such as BLAST and MMseqs2 scale to billions of sequences but lose sensitivity in the "twilight zone" of low sequence identity, while structure-based methods such as Foldseek recover distant relationships at the cost of needing 3D coordinates.

TEA bridges these regimes by distilling the rich representations of a protein language model into a discrete symbolic encoding. It takes per-residue embeddings from ESM2 (esm2_t33_650M_UR50D) and, via a small contrastively trained adapter, maps each residue to one of 20 new letters that capture structural and evolutionary context rather than raw amino acid identity. Because the output is an ordinary letter string, the converted "tea-FASTA" sequences can be fed directly into the standard MMseqs2 search engine, inheriting its mature, highly optimized speed while encoding the kind of information that normally requires a structure to access.
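The embedding-to-letter step described above can be illustrated with a minimal sketch. It is not the released adapter: the random linear head, the 8-dimensional toy embeddings, and the use of the standard amino acid letters as placeholder symbols are all assumptions made for illustration. In practice the inputs would be ESM2 per-residue embeddings and the head would be the trained TEA checkpoint.

```python
import random

# Placeholder symbols (assumption): the actual TEA letters are not specified here.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def make_toy_head(embed_dim, seed=0):
    """Build a random linear head with one weight row per output letter.

    Hypothetical stand-in for the trained adapter.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in ALPHABET]


def residue_logits(embedding, head):
    """Score one residue embedding against every letter (dot products)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in head]


def encode(embeddings, head):
    """Map each per-residue embedding to its highest-scoring letter."""
    letters = []
    for emb in embeddings:
        logits = residue_logits(emb, head)
        best = max(range(len(logits)), key=logits.__getitem__)
        letters.append(ALPHABET[best])
    return "".join(letters)


# Toy input: 12 residues with 8-dimensional random "embeddings".
rng = random.Random(1)
toy_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(12)]
tea_string = encode(toy_embeddings, make_toy_head(8))
print(tea_string)
```

The output has one letter per input residue, which is what lets the encoded string be handled like an ordinary sequence by downstream tools.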

The model ships as a fixed checkpoint that is applied zero-shot, with no per-task fine-tuning. It differs from sequence-search methods like PLMSearch, which compare language-model embeddings directly: TEA's contribution is the alphabet itself, plus a paired substitution matrix that makes the encoded sequences searchable with off-the-shelf alignment tooling.

Key Features

Learned 20-letter alphabet: A contrastively trained adapter converts ESM2 per-residue embeddings into a novel 20-symbol code ("tea-FASTA") that encodes structural and homology signal while remaining a plain text string.
Structure-quality, sequence-only search: Detects remote homologs with sensitivity approaching that of structure-based methods while requiring only amino acid sequence as input, with no predicted or experimental coordinates.
MMseqs2-speed retrieval: Because the output is a standard letter alphabet paired with a dedicated MATCHA substitution matrix, searches run inside the unmodified MMseqs2 engine at conventional sequence-search throughput.
Per-residue confidence: The model emits per-position logits and entropy, surfacing low-confidence residues (rendered in lowercase) so users can weight uncertain regions.
Fixed, zero-shot checkpoint: A single released model is used directly on tasks such as SCOPe remote homology without retraining, simplifying reproducibility and deployment.
STEAM companion tool: A separate fast-search package (PickyBinders/steam) combines TEA representations with standard amino acid information to accelerate searches and generate alignments.
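The per-residue confidence feature listed above can be sketched as follows. This is an illustrative assumption about how entropy might drive the lowercase rendering, not the package's implementation: the function names and the 2-bit entropy cutoff are hypothetical, and the cutoff TEA actually uses is not stated here.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def render_with_confidence(letters, per_position_logits, max_entropy_bits=2.0):
    """Lowercase letters whose predictive entropy exceeds the cutoff.

    The cutoff value is a hypothetical choice for this sketch.
    """
    out = []
    for letter, logits in zip(letters, per_position_logits):
        h = entropy_bits(softmax(logits))
        out.append(letter.lower() if h > max_entropy_bits else letter.upper())
    return "".join(out)


# Position 1 is sharply peaked (confident); position 2 is uniform (uncertain).
confident = [10.0] + [0.0] * 19
uncertain = [0.0] * 20
rendered = render_with_confidence("AC", [confident, uncertain])
print(rendered)
```

A uniform distribution over 20 letters has entropy log2(20), about 4.32 bits, so any cutoff below that flags fully uncertain positions.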

Technical Details

TEA is a lightweight adapter (roughly 1.67M trainable parameters) layered on top of the frozen 650M-parameter ESM2 transformer (esm2_t33_650M_UR50D). Training uses a contrastive objective so that residues sharing structural or evolutionary context are assigned consistent symbols from the 20-letter alphabet. The encoder's output is discretized into this vocabulary, which is paired with the MATCHA substitution matrix used for scoring alignments.

Converted sequences are searched with MMseqs2, so TEA reuses an existing, heavily engineered alignment pipeline rather than introducing a new search algorithm. The authors evaluate the approach on SCOPe remote homology detection, where the alphabet recovers distant fold relationships that sequence-identity-based search misses, while preserving the linear-time, database-scale performance characteristic of MMseqs2.

The model is pip-installable (pip install git+https://github.com/PickyBinders/tea), released under the MIT license with weights on Hugging Face, and exposed through a web server at pickybinders.org/tea.
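The MMseqs2 search step described in this section might be invoked roughly as sketched below. The file names (queries.tea.fasta, targets.tea.fasta, matcha.out) are hypothetical placeholders, and the sketch assumes the MATCHA matrix is available as a file that can be passed through MMseqs2's --sub-mat option. The project may recommend further parameters, so its documentation should be checked before relying on this command.

```python
import shlex


def build_search_command(query_fasta, target_fasta, result_tsv, tmp_dir, matrix_path):
    """Assemble an `mmseqs easy-search` call with a custom substitution matrix.

    All paths are caller-supplied; nothing is executed here.
    """
    return [
        "mmseqs", "easy-search",
        query_fasta, target_fasta, result_tsv, tmp_dir,
        "--sub-mat", matrix_path,
    ]


# Hypothetical file names, for illustration only.
cmd = build_search_command(
    "queries.tea.fasta",
    "targets.tea.fasta",
    "hits.m8",
    "tmp",
    "matcha.out",
)
print(shlex.join(cmd))
# To run it (requires MMseqs2 on PATH): subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if paths contain spaces.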

Applications

TEA is suited to any workflow that needs to find distantly related proteins across large sequence collections without the cost of structure prediction: functional annotation transfer, template discovery for structure modeling, metagenomic and proteome-scale homolog search, and evolutionary analyses that depend on complete homolog sets. A companion application paper, "Reading TEA leaves for de novo protein design" (Pantolini & Durairaj, ICLR 2026 LMRL workshop), demonstrates using the alphabet in protein design contexts, indicating the representation is useful beyond retrieval. The fixed, sequence-only checkpoint makes it straightforward to drop into existing MMseqs2-based pipelines.

Impact

By turning protein-language-model knowledge into a plain alphabet that standard aligners can search, TEA narrows the gap between fast sequence search and sensitive structure-based search while sidestepping the need for 3D coordinates. The strategy is notable for its pragmatism: rather than building a bespoke neural search system, it encodes learned signal into the input of mature tools like MMseqs2, lowering the barrier to adoption for groups already running large-scale homology searches. As a recent preprint its long-term influence is still emerging, and reported gains depend on the quality of the underlying ESM2 embeddings, but the alphabet-rewriting approach offers a reusable template for injecting representation-learning signal into classical bioinformatics infrastructure.

Citation

Rewriting protein alphabets with language models

Preprint

Pantolini, L., et al. (2025) Rewriting protein alphabets with language models. bioRxiv.

DOI: 10.1101/2025.11.27.690975

Recent citations

Papers that recently cited this model.

Evolving strategies for virus discovery
Amanda Araujo Serrao de Andrade, A. Silverj, Theo Josephs, et al.
Microbial Genomics · Jul 2026 · 0 citations

Know Your Alphabet: Conformational Noise, Latent-Space Encodings, and the Future of Structural Phylogenetics
Madeline Schmid, Yixiao Liu, Ashar J. Malik, et al.
bioRxiv · May 2026 · 0 citations

Reading TEA leaves for de novo protein design
L. Pantolini, J. Durairaj
bioRxiv · Feb 2026 · 1 citation

Top citations

The most-cited papers that cite this model.

Template-based RNA structure prediction advanced through a blind code competition
Youhan Lee, Shujun He, Toshiyuki Oda, et al.
bioRxiv · Dec 2025 · 2 citations

Reading TEA leaves for de novo protein design
L. Pantolini, J. Durairaj
bioRxiv · Feb 2026 · 1 citation

Know Your Alphabet: Conformational Noise, Latent-Space Encodings, and the Future of Structural Phylogenetics
Madeline Schmid, Yixiao Liu, Ashar J. Malik, et al.
bioRxiv · May 2026 · 0 citations

Evolving strategies for virus discovery
Amanda Araujo Serrao de Andrade, A. Silverj, Theo Josephs, et al.
Microbial Genomics · Jul 2026 · 0 citations

Citations

Total Citations: 2

Influential: 0

References: 55

GitHub

Stars: 24

Forks: 0

Open Issues: 0

Contributors: 2

Last Push: 1mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 4.5K

Likes: 2

Last Modified: 2mo ago

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 50%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall score: 86 (Open)

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 76

Model Openness Framework: Unclassified (missing required components)

