Lawrence Livermore National Laboratory
A contrastive cross-modal encoder that aligns protein sequence (ESM-2) and structure (ProteinMPNN) representations into a shared embedding space for cross-modal retrieval.
ProtAlign is a contrastive cross-modal learning framework that aligns protein sequence and structure representations into a single shared embedding space. Proteins are inherently described by two complementary modalities — their amino acid sequence and their three-dimensional fold — yet most representation-learning pipelines model each in isolation, leaving the correspondence between them unexploited. ProtAlign addresses this gap by borrowing the contrastive alignment paradigm popularized by CLIP and applying it to pretrained protein encoders, so that a sequence and its matching structure map to nearby points in a common latent space.
The method was introduced in March 2026 by Aditya Ranganath, Hasin Us Sami, Kowshik Thopalli, Bhavya Kailkhura, and Wesam Sakla at Lawrence Livermore National Laboratory (Center for Applied Scientific Computing and Data Science Institute). Rather than training new encoders from scratch, ProtAlign reuses two established models — ESM-2 for sequences and ProteinMPNN for structures — and learns lightweight projection heads that bring their outputs into agreement.
The approach places ProtAlign alongside other cross-modal protein efforts such as CCPL and OneProt, but with a deliberately compact design. The paper is a short (~5-page) report demonstrating the feasibility of the approach, with a headline cross-modal retrieval result and a roadmap toward downstream applications.
ProtAlign extracts per-protein representations from a frozen ESM-2 sequence encoder and a frozen ProteinMPNN structure encoder, then passes each through a modality-specific multi-head self-attention projection (embedding dimension D=128, L=4 heads, LayerNorm output). The two projected representations are aligned with a contrastive loss that maximizes agreement between matched sequence–structure pairs while pushing apart unmatched pairs. Both a softmax (CLIP) and a sigmoid (SigLIP) formulation are evaluated, with the CLIP variant at temperature τ=0.07 performing best.
Training uses the Adam optimizer (learning rate 0.001, batch size 1024). The model is trained and evaluated on protein chains drawn from PDBBind, using experimentally resolved structures with ligand data excluded: 10,071 deduplicated training sequences, 3,387 validation sequences, and 215 test sequences.
On cross-modal retrieval, ProtAlign reaches Recall@5 of 99.1% and Recall@1 of 42.7% with the CLIP loss; with SigLIP the corresponding figures are 97.6% and 40.0%.
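The two contrastive objectives mentioned above can be sketched in a few lines of dependency-free Python. This is a minimal illustration of the general CLIP-style (softmax) and SigLIP-style (sigmoid) losses on already-projected embeddings, not the authors' implementation. The function names are hypothetical, the temperature 0.07 is taken from the text, and the sigmoid `scale` and `bias` defaults are assumptions chosen for illustration (in practice such parameters are typically learned).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(seq_emb, struct_emb):
    """Cosine similarity between every sequence and every structure embedding."""
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    return [[sum(a * b for a, b in zip(s, t)) for t in struct] for s in seq]

def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)

def clip_loss(seq_emb, struct_emb, temperature=0.07):
    """Symmetric softmax contrastive loss; pair i in one list matches pair i in the other."""
    sim = similarity_matrix(seq_emb, struct_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # structure-to-sequence direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))

def sigmoid_loss(seq_emb, struct_emb, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: each (i, j) pair is an independent binary decision.

    scale and bias are illustrative assumptions, not values from the paper.
    """
    sim = similarity_matrix(seq_emb, struct_emb)
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * (scale * sim[i][j] + bias)
            # -log(sigmoid(z)) written as a stable softplus of -z
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n

# Toy check: correctly paired embeddings should score a lower loss than swapped ones.
seq = [[1.0, 0.0], [0.0, 1.0]]
struct_matched = [[1.0, 0.0], [0.0, 1.0]]
struct_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(seq, struct_matched), clip_loss(seq, struct_swapped))
print(sigmoid_loss(seq, struct_matched), sigmoid_loss(seq, struct_swapped))
```

The softmax form normalizes each row over the whole batch, so every unmatched pair in the batch acts as a negative; the sigmoid form scores each pair independently. A batched tensor implementation would be used in practice, but the arithmetic is the same.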
ProtAlign targets tasks where reasoning jointly over sequence and structure is valuable. Its primary demonstrated use is cross-modal retrieval — for example, finding the structural neighbors of a query sequence — which is useful for annotation transfer, homolog discovery, and dataset curation. The authors position the shared embedding space as a foundation for downstream predictors of protein function and stability, where a unified representation could benefit protein engineering and design workflows. Because it reuses off-the-shelf encoders, the approach is comparatively inexpensive to adopt for groups already working with ESM-2 and ProteinMPNN.
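Cross-modal retrieval in a shared embedding space reduces to a nearest-neighbor search, and Recall@k measures how often the true partner appears among the k nearest neighbors. The sketch below illustrates both on toy 2-D vectors; the function names and data are hypothetical stand-ins for projected sequence and structure embeddings, and it assumes that query i's true match is gallery item i.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query, gallery, k):
    """Return the indices of the k gallery embeddings most similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in retrieve(q, gallery, k))
    return hits / len(queries)

# Toy data: "sequence" queries and "structure" gallery items, paired by index.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[0.9, 0.1], [0.7, 0.8], [0.2, 0.9]]

print(retrieve(queries[1], gallery, k=2))   # nearest structures for query 1
print(recall_at_k(queries, gallery, k=1))
print(recall_at_k(queries, gallery, k=2))
```

On this toy data the true match is always within the top 2 but is ranked first for only one of the three queries, the same qualitative pattern as a high Recall@5 paired with a much lower Recall@1.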
ProtAlign is an early-stage, proof-of-concept contribution showing that a simple contrastive head can tightly couple two widely used protein encoders, achieving near-perfect top-5 cross-modal retrieval (Recall@5 of 99.1%) on PDBBind. As a short report, it validates feasibility rather than establishing a large-scale foundation model: the training set is modest, downstream function and stability tasks are described as future work rather than benchmarked, and the much lower Recall@1 (42.7%) shows that the exact match is ranked first for fewer than half of the queries. The authors state that code will be released upon acceptance; as of this writing no public weights or repository have been confirmed. Its main significance is methodological: it reinforces that CLIP-style alignment is a practical, lightweight route to multimodal protein representations.