ProtAlign

Cross-modal protein encoder that aligns ESM-2 sequence embeddings with ProteinMPNN structure embeddings in a shared space for cross-modal retrieval.

Released: March 2026

ProtAlign is a contrastive cross-modal learning framework that aligns protein sequence and structure representations into a single shared embedding space. Proteins are inherently described by two complementary modalities — their amino acid sequence and their three-dimensional fold — yet most representation-learning pipelines model each in isolation, leaving the correspondence between them unexploited. ProtAlign addresses this gap by borrowing the contrastive alignment paradigm popularized by CLIP and applying it to pretrained protein encoders, so that a sequence and its matching structure map to nearby points in a common latent space.

The method was introduced in March 2026 by Aditya Ranganath, Hasin Us Sami, Kowshik Thopalli, Bhavya Kailkhura, and Wesam Sakla at Lawrence Livermore National Laboratory (Center for Applied Scientific Computing and Data Science Institute). Rather than training new encoders from scratch, ProtAlign reuses two established models — ESM-2 for sequences and ProteinMPNN for structures — and learns lightweight projection heads that bring their outputs into agreement.

This places ProtAlign alongside other cross-modal protein efforts such as CCPL and OneProt, but with a deliberately compact design. The paper is a short (~5-page) report demonstrating the feasibility of the approach, with a headline cross-modal retrieval result and a roadmap toward downstream applications.

Key Features

Cross-modal alignment: Maps ESM-2 sequence embeddings and ProteinMPNN structure embeddings into a shared 128-dimensional space, enabling sequence-to-structure retrieval and vice versa.
Frozen-encoder design: Leverages pretrained ESM-2 and ProteinMPNN, learning only modality-specific projection heads rather than retraining large backbones.
Multi-head attention projections: Each modality is summarized via a multi-head self-attention layer with learnable query tokens (D=128, 4 heads) followed by LayerNorm before alignment.
Flexible contrastive objective: Supports both CLIP-style softmax and SigLIP-style sigmoid losses; the CLIP loss at temperature 0.07 gives the best reported performance.
Interpretable embeddings: The aligned space clusters structurally similar sequences, offering a basis for downstream function and stability analysis.
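
The attention-based projection named in the list above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes a single learnable query per modality, splits the vector into heads by slicing, and omits the learned key, value, and output projection matrices that a full attention layer would have. The names `attention_pool`, `layer_norm`, and the toy dimensions are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attention_pool(tokens, query, num_heads):
    """Pool per-residue vectors into a single vector with one query.

    tokens: list of length-D vectors (stand-ins for frozen encoder outputs)
    query:  one length-D query vector (a learnable parameter in practice)
    """
    dim = len(query)
    assert dim % num_heads == 0, "D must be divisible by the number of heads"
    head_dim = dim // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        # Scaled dot-product score between the query and every token.
        scores = [
            sum(a * b for a, b in zip(q, t[lo:hi])) / math.sqrt(head_dim)
            for t in tokens
        ]
        weights = softmax(scores)
        # Weighted average of this head's slice of the token vectors.
        for j in range(lo, hi):
            pooled.append(sum(w * t[j] for w, t in zip(weights, tokens)))
    return layer_norm(pooled)

# Toy usage: the page reports D=128 and 4 heads; D=8 keeps the demo small.
random.seed(0)
D, HEADS = 8, 4
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
query = [random.gauss(0, 1) for _ in range(D)]
embedding = attention_pool(tokens, query, HEADS)
```

Each protein, whatever its length, comes out as one fixed-width vector, which is what allows sequence and structure embeddings to be compared directly.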

Technical Details

ProtAlign extracts per-protein representations from a frozen ESM-2 sequence encoder and a frozen ProteinMPNN structure encoder, then passes each through a modality-specific multi-head self-attention projection (embedding dimension D=128, 4 heads, LayerNorm output).

The two projected representations are aligned with a contrastive loss that maximizes agreement between matched sequence–structure pairs while pushing apart unmatched pairs. Both a softmax (CLIP) and a sigmoid (SigLIP) formulation are evaluated, with the CLIP variant at temperature τ=0.07 performing best. Training uses the Adam optimizer (learning rate 0.001, batch size 1024).

The model is trained and evaluated on protein chains drawn from PDBBind, using experimentally resolved structures with ligand data excluded. The split comprises 10,071 deduplicated training sequences, 3,387 validation sequences, and 215 test sequences.

On cross-modal retrieval, ProtAlign reaches Recall@5 of 99.1% and Recall@1 of 42.7% with the CLIP loss, versus 97.6% and 40.0% with SigLIP.
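
The two contrastive formulations mentioned above can be sketched in plain Python as follows. This is a minimal illustration of the standard CLIP (symmetric softmax) and SigLIP (pairwise sigmoid) losses over a batch of matched pairs, not the authors' code. The temperature 0.07 comes from the text; the SigLIP `scale` and `bias` defaults are assumptions taken from the usual SigLIP initialization, and all function names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(seq_emb, struct_emb):
    # Cosine similarity between every sequence and every structure embedding.
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    return [[sum(a * b for a, b in zip(s, t)) for t in struct] for s in seq]

def log_softmax_at(row, index):
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z

def clip_loss(sim, temperature=0.07):
    """Symmetric cross-entropy; pair i in one modality matches pair i in the other."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Sequence -> structure: row i should select column i.
    seq_to_struct = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Structure -> sequence: column i should select row i.
    cols = [list(col) for col in zip(*logits)]
    struct_to_seq = -sum(log_softmax_at(cols[i], i) for i in range(n)) / n
    return 0.5 * (seq_to_struct + struct_to_seq)

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def siglip_loss(sim, scale=10.0, bias=-10.0):
    """Independent binary decision per pair: +1 on the diagonal, -1 elsewhere."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sim[i][j] + bias))
    return -total / n

# Toy check: correctly paired embeddings should score a lower loss than swapped ones.
seq = [[1.0, 0.0], [0.0, 1.0]]
aligned = similarity_matrix(seq, seq)
swapped = similarity_matrix(seq, seq[::-1])
clip_aligned, clip_swapped = clip_loss(aligned), clip_loss(swapped)
sig_aligned, sig_swapped = siglip_loss(aligned), siglip_loss(swapped)
```

The softmax variant normalizes each pair against the rest of the batch, while the sigmoid variant scores every pair on its own, which is the main practical difference between the two objectives.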

Applications

ProtAlign targets tasks where reasoning jointly over sequence and structure is valuable. Its primary demonstrated use is cross-modal retrieval — for example, finding the structural neighbors of a query sequence — which is useful for annotation transfer, homolog discovery, and dataset curation. The authors position the shared embedding space as a foundation for downstream predictors of protein function and stability, where a unified representation could benefit protein engineering and design workflows. Because it reuses off-the-shelf encoders, the approach is comparatively inexpensive to adopt for groups already working with ESM-2 and ProteinMPNN.
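
As a rough illustration of the retrieval use case, the sketch below ranks candidates by similarity and computes Recall@K, the metric quoted in the results. It assumes a precomputed similarity matrix in which query i's true match is candidate i; the function names and the toy scores are hypothetical and are not taken from the paper.

```python
def top_k(row, k):
    # Indices of the k most similar candidates for one query, best first.
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]

def recall_at_k(sim, k):
    """Fraction of queries whose true match (same index) is in the top k."""
    hits = sum(1 for i, row in enumerate(sim) if i in top_k(row, k))
    return hits / len(sim)

# Toy similarity matrix: rows are query sequences, columns are candidate structures.
sim = [
    [0.9, 0.1, 0.0],  # query 0: true match ranked first
    [0.8, 0.7, 0.1],  # query 1: true match ranked second
    [0.2, 0.3, 0.6],  # query 2: true match ranked first
]
r1 = recall_at_k(sim, 1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(sim, 2)  # all 3 queries hit within the top 2
```

The gap between `r1` and `r2` in this toy case mirrors the pattern in the reported numbers, where the correct structure is almost always among the top few candidates but is ranked first less often.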

Impact

ProtAlign is an early-stage, proof-of-concept contribution showing that a simple contrastive head can tightly couple two widely used protein encoders, achieving near-perfect top-5 cross-modal retrieval (Recall@5 of 99.1%) on PDBBind. As a short report, it validates feasibility rather than establishing a large-scale foundation model: the training set is modest, downstream function and stability tasks are described as future work rather than benchmarked, and the much lower Recall@1 (42.7% with the CLIP loss) shows that retrieval is far from exact at the single-match level. The authors state that code will be released upon acceptance; as of this writing, no public weights or repository have been confirmed. Its main significance is methodological: it reinforces that CLIP-style alignment is a practical, lightweight route to multimodal protein representations.

Citation

Ranganath, A., et al. (2026). ProtAlign: Contrastive learning paradigm for Sequence and structure alignment. Preprint.

DOI: 10.48550/arXiv.2603.06722

Citations

Total Citations: 0

Influential: 0

References: 15

Openness

bio.rodeo openness: 35 (Closed), low usability and reproducibility

Usability (can I run it?): 27

Reproducibility (can I retrain it?): 33

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
