Tri-modal protein language model unifying sequence, structure, and function via contrastive learning, enabling natural-language protein search across billions of entries.
ProTrek is a tri-modal protein language model developed at Westlake University that jointly learns from three complementary data types: amino acid sequences, 3D structures, and natural-language functional descriptions. Published in Nature Biotechnology in 2025, the model addresses a fundamental gap in protein informatics — the inability to search the protein universe using all three modalities simultaneously. Traditional tools like MMseqs2 or Foldseek operate within a single modality; ProTrek bridges all three through a unified contrastive learning framework.
The architecture aligns sequence, structure, and function representations in a shared embedding space via three pairwise contrastive objectives: sequence-structure, sequence-function, and structure-function alignment. This design enables nine distinct search tasks: the six cross-modal pairings (each modality pair, queried in either direction) plus three within-modality retrievals. A researcher can, for example, input a natural-language query such as "serine protease involved in blood coagulation" and retrieve structurally and functionally relevant proteins from a database of billions in seconds.
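Because all three modalities share one embedding space, any of these search tasks reduces to nearest-neighbor lookup: encode the query with one modality's encoder and rank precomputed embeddings from another modality by similarity. A minimal sketch in pure Python; the embedding values and entry names are made up for illustration and do not come from ProTrek's actual encoders:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, db_embs, top_k=2):
    # Rank database entries by similarity to the query embedding.
    ranked = sorted(db_embs.items(), key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy shared-space embeddings: the query is a text embedding, the
# database holds sequence embeddings (values are illustrative only).
text_query = [0.9, 0.1, 0.2]  # e.g. "serine protease ..." encoded as text
protein_db = {
    "P00734_thrombin":   [0.80, 0.20, 0.10],
    "P69905_hemoglobin": [0.10, 0.90, 0.30],
    "P00760_trypsin":    [0.85, 0.15, 0.20],
}

print(search(text_query, protein_db))  # the two protease-like entries rank first
```

In the real system the same pattern applies, except the embeddings come from the trained encoders and the ranking is done by an approximate-nearest-neighbor index rather than an exhaustive scan.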
ProTrek is available in two size variants and ships with precomputed embeddings covering over five billion proteins from NCBI, GOPC, and OMG/MGnify metagenomic repositories, accessible through a public web server.
ProTrek comprises three modality-specific encoders. The sequence encoder is based on the ESM-2 architecture (35M or 650M parameters depending on model variant). The structure encoder follows the Foldseek architecture (35M or 150M parameters). The text encoder is derived from BiomedNLP-PubMedBERT (130M parameters), providing semantic grounding in biomedical language. Total parameter counts are approximately 200M for ProTrek_35M and 930M for ProTrek_650M.
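The reported totals can be sanity-checked by summing the component sizes quoted above (a rough sum; the actual counts also include projection layers, hence "approximately"):

```python
# Component parameter counts in millions, as reported for each variant.
variants = {
    "ProTrek_35M":  {"sequence": 35,  "structure": 35,  "text": 130},
    "ProTrek_650M": {"sequence": 650, "structure": 150, "text": 130},
}

for name, parts in variants.items():
    total = sum(parts.values())
    print(f"{name}: ~{total}M parameters")  # ~200M and ~930M respectively
```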
Training used Swiss-Prot (manually curated, high-quality annotations) and TrEMBL50 (large-scale automatic annotations) as primary datasets. Contrastive alignment is implemented with temperature-scaled similarity scoring; mutual supervision across the three pairwise objectives drives all three modalities into the same latent space, enabling bidirectional translation between any pair. Precomputed embeddings span NCBI (700M sequences), GOPC (2B sequences), and OMG/MGnify (3B+ metagenomic sequences), indexed with FAISS for sub-second approximate nearest-neighbor retrieval at billion scale.
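A temperature-scaled contrastive objective of this kind typically works like CLIP's InfoNCE loss: pairwise similarities within a batch are divided by a temperature, and each matched pair is treated as the positive class in a softmax cross-entropy over the batch. The sketch below is a generic pure-Python version under that assumption; ProTrek's exact loss formulation and temperature value are not reproduced here:

```python
import math

def info_nce(a_embs, b_embs, temperature=0.07):
    """Symmetric temperature-scaled contrastive loss for paired
    embeddings from two modalities (e.g. sequence and text).
    a_embs[i] and b_embs[i] form a positive pair; every other
    in-batch combination serves as a negative."""
    n = len(a_embs)

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def xent(anchors, others):
        # Cross-entropy of selecting the matching index for each anchor.
        loss = 0.0
        for i in range(n):
            logits = [dot(anchors[i], others[j]) / temperature for j in range(n)]
            m = max(logits)  # subtract max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]  # -log softmax at the positive
        return loss / n

    # Average both directions (a -> b and b -> a), as in CLIP.
    return 0.5 * (xent(a_embs, b_embs) + xent(b_embs, a_embs))

# Toy unit embeddings: matched pairs point the same way, so loss is near zero.
seq = [[1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(seq, txt))  # near 0; swapping txt rows would make it large
```

A lower temperature sharpens the softmax, penalizing hard negatives more aggressively; applying this loss to each of the three modality pairs yields the mutual supervision described above.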
ProTrek is particularly well suited to protein discovery workflows where the query is functional rather than sequence-based. Drug discovery teams can search for novel proteins sharing functional annotations with known targets using natural-language descriptions, bypassing the requirement for a seed sequence or structure. Structural biologists benefit from cross-modal retrieval — a newly resolved structure can be searched against functionally annotated databases directly. Metagenomic researchers gain access to precomputed embeddings for over three billion environmental protein sequences, enabling rapid functional annotation of proteins with no known homologs. The model's strong transfer performance also makes it a practical drop-in replacement for ESM-2 as a general-purpose protein encoder in supervised prediction pipelines.
ProTrek's publication in Nature Biotechnology marks a significant step toward integrating all three primary axes of protein information — sequence, structure, and function — within a single retrieval framework. The publicly accessible search server and open model weights lower the barrier to adoption, particularly for experimental biologists who lack the infrastructure to run local homology searches. The model's demonstrated improvements over ESM-2 on downstream tasks suggest that multimodal pre-training provides richer protein representations than sequence-only approaches. A current limitation is that ProTrek encodes function through text annotations, meaning proteins with sparse or inaccurate database annotations may be poorly represented in the functional modality; retrieval quality therefore depends in part on the quality of the underlying protein databases.
Su, J., Zhou, X., Zhang, X., & Yuan, F. (2025). ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning. Nature Biotechnology.
DOI: 10.1038/s41587-025-02836-0