Tri-modal protein language model unifying sequence, structure, and function via contrastive learning, enabling natural-language protein search across billions of entries.
ProTrek is a tri-modal protein language model developed at Westlake University that jointly learns from three complementary data types: amino acid sequences, 3D structures, and natural-language functional descriptions. Published in Nature Biotechnology in 2025, the model addresses a fundamental gap in protein informatics — the inability to search the protein universe using all three modalities simultaneously. Traditional tools like MMseqs2 or Foldseek operate within a single modality; ProTrek bridges all three through a unified contrastive learning framework.
The architecture aligns sequence, structure, and function representations in a shared embedding space via three pairwise contrastive objectives: sequence-structure, sequence-function, and structure-function alignment. This design enables nine distinct search tasks: the six cross-modal pairings (each modality pair, queried in either direction) plus three within-modality retrievals. A researcher can, for example, input a natural-language query such as "serine protease involved in blood coagulation" and retrieve structurally and functionally relevant proteins from a database of billions in seconds.
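Because all three modalities share one embedding space, any of these search tasks reduces to nearest-neighbor lookup: encode the query with one modality's encoder and rank precomputed embeddings from another modality by similarity. A minimal sketch in pure Python; the embedding values and entry names are made up for illustration and do not come from ProTrek's actual encoders:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, db_embs, top_k=2):
    # Rank database entries by similarity to the query embedding.
    ranked = sorted(db_embs.items(), key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy shared-space embeddings: the query is a text embedding, the
# database holds sequence embeddings (values are illustrative only).
text_query = [0.9, 0.1, 0.2]  # e.g. "serine protease ..." encoded as text
protein_db = {
    "P00734_thrombin":   [0.80, 0.20, 0.10],
    "P69905_hemoglobin": [0.10, 0.90, 0.30],
    "P00760_trypsin":    [0.85, 0.15, 0.20],
}

print(search(text_query, protein_db))  # the two protease-like entries rank first
```

In the real system the same pattern applies, except the embeddings come from the trained encoders and the ranking is done by an approximate-nearest-neighbor index rather than an exhaustive scan.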
ProTrek is available in two size variants and ships with precomputed embeddings covering over five billion proteins from NCBI, GOPC, and OMG/MGnify metagenomic repositories, accessible through a public web server.
ProTrek comprises three modality-specific encoders. The sequence encoder is based on the ESM-2 architecture (35M or 650M parameters depending on model variant). The structure encoder follows the Foldseek architecture (35M or 150M parameters). The text encoder is derived from BiomedNLP-PubMedBERT (130M parameters), providing semantic grounding in biomedical language. Total parameter counts are approximately 200M for ProTrek_35M and 930M for ProTrek_650M.
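The reported totals can be sanity-checked by summing the component sizes quoted above (a rough sum; the actual counts also include projection layers, hence "approximately"):

```python
# Component parameter counts in millions, as reported for each variant.
variants = {
    "ProTrek_35M":  {"sequence": 35,  "structure": 35,  "text": 130},
    "ProTrek_650M": {"sequence": 650, "structure": 150, "text": 130},
}

for name, parts in variants.items():
    total = sum(parts.values())
    print(f"{name}: ~{total}M parameters")  # ~200M and ~930M respectively
```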
Training used Swiss-Prot (manually curated, high-quality annotations) and TrEMBL50 (large-scale automatic annotations) as primary datasets. Contrastive alignment is implemented with temperature-scaled similarity scoring; mutual supervision across the three pairwise objectives drives all three modalities into the same latent space, enabling bidirectional translation between any pair. Precomputed embeddings span NCBI (700M sequences), GOPC (2B sequences), and OMG/MGnify (3B+ metagenomic sequences), indexed with FAISS for sub-second approximate nearest-neighbor retrieval at billion scale.
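A temperature-scaled contrastive objective of this kind typically works like CLIP's InfoNCE loss: pairwise similarities within a batch are divided by a temperature, and each matched pair is treated as the positive class in a softmax cross-entropy over the batch. The sketch below is a generic pure-Python version under that assumption; ProTrek's exact loss formulation and temperature value are not reproduced here:

```python
import math

def info_nce(a_embs, b_embs, temperature=0.07):
    """Symmetric temperature-scaled contrastive loss for paired
    embeddings from two modalities (e.g. sequence and text).
    a_embs[i] and b_embs[i] form a positive pair; every other
    in-batch combination serves as a negative."""
    n = len(a_embs)

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def xent(anchors, others):
        # Cross-entropy of selecting the matching index for each anchor.
        loss = 0.0
        for i in range(n):
            logits = [dot(anchors[i], others[j]) / temperature for j in range(n)]
            m = max(logits)  # subtract max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]  # -log softmax at the positive
        return loss / n

    # Average both directions (a -> b and b -> a), as in CLIP.
    return 0.5 * (xent(a_embs, b_embs) + xent(b_embs, a_embs))

# Toy unit embeddings: matched pairs point the same way, so loss is near zero.
seq = [[1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(seq, txt))  # near 0; swapping txt rows would make it large
```

A lower temperature sharpens the softmax, penalizing hard negatives more aggressively; applying this loss to each of the three modality pairs yields the mutual supervision described above.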
ProTrek is particularly well suited to protein discovery workflows where the query is functional rather than sequence-based. Drug discovery teams can search for novel proteins sharing functional annotations with known targets using natural-language descriptions, bypassing the requirement for a seed sequence or structure. Structural biologists benefit from cross-modal retrieval — a newly resolved structure can be searched against functionally annotated databases directly. Metagenomic researchers gain access to precomputed embeddings for over three billion environmental protein sequences, enabling rapid functional annotation of proteins with no known homologs. The model's strong transfer performance also makes it a practical drop-in replacement for ESM-2 as a general-purpose protein encoder in supervised prediction pipelines.
ProTrek's publication in Nature Biotechnology marks a significant step toward integrating all three primary axes of protein information — sequence, structure, and function — within a single retrieval framework. The publicly accessible search server and open model weights lower the barrier to adoption, particularly for experimental biologists who lack the infrastructure to run local homology searches. The model's demonstrated improvements over ESM-2 on downstream tasks suggest that multimodal pre-training provides richer protein representations than sequence-only approaches. A current limitation is that ProTrek encodes function through text annotations, meaning proteins with sparse or inaccurate database annotations may be poorly represented in the functional modality; retrieval quality therefore depends in part on the quality of the underlying protein databases.
Su, J., Zhou, X., Zhang, X., & Yuan, F. (2025). ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning. Nature Biotechnology.
DOI: 10.1038/s41587-025-02836-0