Biozentrum / University of Basel / SIB Swiss Institute of Bioinformatics
Contrastive model mapping ESM2 protein-language-model embeddings to a 20-letter alphabet for structure-quality remote homology detection at MMseqs2 speed, sequence-only.
TEA (The Embedded Alphabet) is a pretrained model from the Schwede and Durairaj groups at the Biozentrum, University of Basel, and the SIB Swiss Institute of Bioinformatics that rewrites protein sequences into a learned 20-letter alphabet, enabling structure-quality remote homology detection without ever computing or supplying a structure. Released as a bioRxiv preprint in late 2025, it targets a long-standing tension in sequence search: fast aligners such as BLAST and MMseqs2 scale to billions of sequences but lose sensitivity in the "twilight zone" of low sequence identity, while structure-based methods such as Foldseek recover distant relationships at the cost of needing 3D coordinates.
TEA bridges these regimes by distilling the rich representations of a protein language model into a discrete symbolic encoding. It takes per-residue embeddings from ESM2 (esm2_t33_650M_UR50D) and, via a small contrastively trained adapter, maps each residue to one of 20 new letters that capture structural and evolutionary context rather than raw amino acid identity. Because the output is an ordinary letter string, the converted "tea-FASTA" sequences can be fed directly into the standard MMseqs2 search engine, inheriting its mature, highly optimized speed while encoding the kind of information that normally requires a structure to access.
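The encoding step described above can be illustrated with a minimal sketch. It assumes a hypothetical linear projection from each per-residue embedding to 20 logits, followed by an argmax. The source does not specify the adapter's architecture, so this is a stand-in rather than the authors' method. The reuse of the 20 standard amino-acid letters as output symbols, the toy embedding width of 8 (real esm2_t33_650M_UR50D embeddings are 1280-dimensional), the random weights, and all function names are assumptions made for illustration.

```python
import random

# Assumption: the 20 output symbols reuse the standard amino-acid letters so
# that unmodified aligners accept them; the real TEA symbol set may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def encode_residues(embeddings, weights):
    """Map each per-residue embedding to the letter with the highest logit.

    embeddings: list of L vectors, each of length D
    weights:    list of 20 vectors, each of length D (hypothetical linear head)
    """
    letters = []
    for vec in embeddings:
        # One logit per candidate letter: dot product of weight row and embedding.
        logits = [sum(w * x for w, x in zip(row, vec)) for row in weights]
        best = max(range(len(ALPHABET)), key=lambda k: logits[k])
        letters.append(ALPHABET[best])
    return "".join(letters)


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy stand-ins: random numbers replace real ESM2 embeddings and trained weights.
rng = random.Random(0)
DIM = 8
toy_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in ALPHABET]
toy_embeddings = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(12)]

tea_sequence = encode_residues(toy_embeddings, toy_weights)
fasta_record = to_fasta("toy_protein", tea_sequence)
```

The point of the sketch is the shape of the data flow: one embedding vector in, one letter out, and a result that is an ordinary FASTA record any aligner can read.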
The model ships as a fixed checkpoint that is applied zero-shot, with no per-task fine-tuning, and is distinct from sequence-search methods like PLMSearch that compare language-model embeddings directly. TEA's contribution is the alphabet itself, plus a paired substitution matrix that makes the encoded sequences searchable with off-the-shelf alignment tooling.
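To illustrate what a paired substitution matrix contributes, the sketch below scores an ungapped alignment of two encoded strings by summing per-position matrix entries. The three-symbol matrix and its values are invented for the example and are not the published TEA matrix. The matrix is assumed to be symmetric, and gap handling, which real aligners perform, is omitted.

```python
# Hypothetical toy matrix over three symbols; values are illustrative only.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "C"): -1, ("A", "D"): -2,
    ("C", "C"): 5, ("C", "D"): 1,
    ("D", "D"): 6,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up a substitution score, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq_a, seq_b, matrix=TOY_MATRIX):
    """Sum substitution scores position by position (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs equal-length sequences")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


identical = ungapped_score("ACD", "ACD")
swapped = ungapped_score("ACD", "CAD")
```

A matrix fitted to the learned alphabet lets the aligner reward substitutions between symbols that tend to occur in similar contexts, which identity-only matching cannot do.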
TEA is a lightweight adapter (roughly 1.67M trainable parameters) layered on top of the frozen ESM2 650M-parameter transformer (esm2_t33_650M_UR50D). Training uses a contrastive objective so that residues sharing structural or evolutionary context are assigned consistent symbols from the 20-letter alphabet; the encoder's output is a discrete vocabulary, paired with the MATCHA substitution matrix used for scoring alignments. Converted sequences are searched with MMseqs2, so TEA reuses an existing, heavily engineered alignment pipeline rather than introducing a new search algorithm. The authors evaluate the approach on SCOPe remote homology detection, where the alphabet recovers distant fold relationships that sequence-identity-based search misses while preserving the database-scale speed characteristic of MMseqs2. The model is pip-installable (pip install git+https://github.com/PickyBinders/tea), released under the MIT license with weights on Hugging Face, and exposed through a web server at pickybinders.org/tea.
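Because the search itself is plain MMseqs2, running it on converted sequences amounts to a standard invocation. The sketch below only assembles the argument list for `mmseqs easy-search`, passing a custom matrix through MMseqs2's `--sub-mat` option. All file names are placeholders, `matcha.out` is a hypothetical name for the matrix file, and any additional flags the TEA authors may recommend are not covered here.

```python
import shlex


def build_search_command(query_fasta, target_fasta, result_path, tmp_dir,
                         matrix_path=None):
    """Assemble an `mmseqs easy-search` argument list without running it."""
    argv = ["mmseqs", "easy-search", query_fasta, target_fasta, result_path, tmp_dir]
    if matrix_path is not None:
        # --sub-mat points MMseqs2 at a custom substitution matrix file.
        argv += ["--sub-mat", matrix_path]
    return argv


# Placeholder file names (assumptions); substitute real tea-FASTA files
# and the matrix file distributed with the model.
command = build_search_command(
    "query_tea.fasta", "target_tea.fasta", "hits.m8", "tmp",
    matrix_path="matcha.out",
)
command_line = shlex.join(command)
# To execute, pass `command` to subprocess.run(command, check=True)
# on a machine where the mmseqs binary is on PATH.
```

Keeping command construction separate from execution makes the call easy to inspect and log before a large search is launched.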
TEA is suited to any workflow that needs to find distantly related proteins across large sequence collections without the cost of structure prediction: functional annotation transfer, template discovery for structure modeling, metagenomic and proteome-scale homolog search, and evolutionary analyses that depend on complete homolog sets. A companion application paper, "Reading TEA leaves for de novo protein design" (Pantolini & Durairaj, ICLR 2026 LMRL workshop), demonstrates using the alphabet in protein design contexts, indicating the representation is useful beyond retrieval. The fixed, sequence-only checkpoint makes it straightforward to drop into existing MMseqs2-based pipelines.
By turning protein-language-model knowledge into a plain alphabet that standard aligners can search, TEA narrows the gap between fast sequence search and sensitive structure-based search while sidestepping the need for 3D coordinates. The strategy is notable for its pragmatism: rather than building a bespoke neural search system, it encodes learned signal into the input of mature tools like MMseqs2, lowering the barrier to adoption for groups already running large-scale homology searches. Because TEA is a recent preprint, its long-term influence is still emerging, and reported gains depend on the quality of the underlying ESM2 embeddings. Even so, the alphabet-rewriting approach offers a reusable template for injecting representation-learning signal into classical bioinformatics infrastructure.
Pantolini, L., et al. (2025) Rewriting protein alphabets with language models. bioRxiv.
DOI: 10.1101/2025.11.27.690975