CLASP

Tri-modal contrastive model aligning protein structure, sequence, and text in a shared space for zero-shot cross-modal retrieval and classification.

Released: August 2025

CLASP (Contrastive Language-Amino acid Sequence-Structure Pretraining) is a tri-modal representation-learning framework that embeds a protein's three-dimensional structure, amino-acid sequence, and natural-language description into a single shared vector space. Most protein representation models capture one or two modalities, such as sequence-only language models or structure encoders; CLASP instead aligns all three jointly so that information from any one modality can be retrieved or classified using any other.

The framework was developed by Nicolas Bolouri, Joseph Szymborski, and Amin Emad at McGill University and affiliated Montreal institutions (Mila, the Goodman Cancer Institute, and the Dahdaleh Institute of Genomic Medicine), with a preprint posted to bioRxiv in August 2025. It adapts a CLIP-style contrastive objective, generalized from prior multi-modal work, to the protein domain.

Trained with a contrastive objective across modality pairs, CLASP learns biologically meaningful relationships: its structure and sequence embeddings cluster by protein family and functional class. This enables zero-shot tasks such as identifying the correct sequence for a given structure or retrieving proteins from a text description, and the authors report that it outperforms baselines limited to fewer modalities on these tasks.

Key Features

Tri-modal alignment: Jointly embeds structure, sequence, and text description in one shared space, enabling retrieval and classification across any pair of modalities.
Zero-shot cross-modal retrieval: Matches structures to sequences, structures to descriptions, and sequences to descriptions without task-specific fine-tuning, outperforming single- and dual-modality baselines.
Biologically structured embeddings: Learned representations cluster by protein family and functional class, indicating capture of structure-function relationships.
Modality complementarity: Ablations show that removing any single modality during training degrades performance, confirming each contributes distinct signal.
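
The zero-shot retrieval feature can be illustrated with a minimal sketch, assuming embeddings from two modalities already live in the shared space. The function names (`l2_normalize`, `retrieve_top_k`) and the random toy vectors are hypothetical and are not taken from the CLASP codebase; the sketch only shows how cosine-similarity ranking would match a structure query to candidate sequences.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve_top_k(query_emb, candidate_embs, k=3):
    """Return indices and cosine scores of the k candidates closest to the query."""
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    c = l2_normalize(np.asarray(candidate_embs, dtype=float))
    scores = c @ q                      # cosine similarity to every candidate
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top, scores[top]

# Toy stand-ins for shared-space embeddings (random, NOT real model outputs).
rng = np.random.default_rng(0)
sequence_embs = rng.normal(size=(5, 8))              # 5 candidate sequences
structure_query = sequence_embs[2] + 0.01 * rng.normal(size=8)  # structure of protein 2

top_idx, top_scores = retrieve_top_k(structure_query, sequence_embs, k=3)
print(top_idx, top_scores)
```

Because no task-specific head is trained, swapping the query and candidate modalities (for example, text queries against structure candidates) would only change which embeddings are passed in.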

Technical Details

CLASP encodes each modality with a dedicated backbone and aligns them through a contrastive loss. Structure is represented with an E(3)-invariant graph neural network (EGNN) operating on protein graphs built with Graphein; sequence is embedded with the ProtT5 protein language model; and natural-language descriptions are embedded with BioGPT. The resulting per-modality embeddings are projected into a shared space and trained with a tri-modal contrastive objective inspired by CLIP-style 3D contrastive pretraining. The authors report training on the order of 13 hours per run on a single NVIDIA RTX 3090 GPU. Evaluations cover zero-shot classification and retrieval across all modality pairs, with ablations quantifying the contribution of each modality. Code is released on GitHub under GPL-3.0.
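
As a rough illustration of a tri-modal contrastive objective, the sketch below averages a symmetric CLIP-style (InfoNCE) loss over the three modality pairs. This is an assumption about the general shape of such a loss, not the authors' exact formulation: the pair weighting, the temperature value of 0.07, and all function names are hypothetical, and the inputs are assumed to be already projected into the shared space.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def pairwise_clip_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` should match row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    diag = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def trimodal_loss(structure, sequence, text, temperature=0.07):
    """Average the pairwise loss over all three modality pairs (assumed weighting)."""
    return (pairwise_clip_loss(structure, sequence, temperature)
            + pairwise_clip_loss(structure, text, temperature)
            + pairwise_clip_loss(sequence, text, temperature)) / 3.0

# Toy batch: random vectors standing in for projected embeddings.
rng = np.random.default_rng(0)
batch, dim = 4, 16
shared = rng.normal(size=(batch, dim))
aligned_loss = trimodal_loss(
    shared,
    shared + 0.01 * rng.normal(size=(batch, dim)),
    shared + 0.01 * rng.normal(size=(batch, dim)),
)
random_loss = trimodal_loss(
    rng.normal(size=(batch, dim)),
    rng.normal(size=(batch, dim)),
    rng.normal(size=(batch, dim)),
)
print(aligned_loss, random_loss)
```

In the toy run, the batch whose three modalities nearly coincide yields a much lower loss than the batch of unrelated vectors, which is the behavior a contrastive alignment objective is meant to reward.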

Applications

CLASP is useful for protein annotation and search workflows where queries and targets live in different modalities, for example finding sequences or structures that match a functional text description, or annotating an uncharacterized structure by retrieving similar described proteins. Its joint embeddings can serve as features for downstream family or function classification, supporting researchers in functional genomics and protein characterization.
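
One simple way to use joint embeddings as features for family classification is a nearest-centroid classifier, sketched below. The family names, embedding dimension, and toy data are hypothetical placeholders, and the source does not state which downstream classifier, if any, the authors use.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Compute one mean embedding per class label."""
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique labels
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(embeddings, classes, centroids):
    """Assign each embedding to the class whose centroid is closest in cosine similarity."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return classes[(e @ c.T).argmax(axis=1)]

# Toy data: two hypothetical families, each clustered around a random mean vector.
rng = np.random.default_rng(0)
family_means = {"kinase": rng.normal(size=8), "protease": rng.normal(size=8)}
train_x = np.vstack(
    [mean + 0.05 * rng.normal(size=(10, 8)) for mean in family_means.values()]
)
train_y = ["kinase"] * 10 + ["protease"] * 10

classes, centroids = fit_centroids(train_x, train_y)
query = family_means["protease"] + 0.05 * rng.normal(size=(1, 8))
predicted = predict(query, classes, centroids)
print(predicted)
```

A centroid approach only works well when embeddings cluster by class, which is the property the family and functional-class clustering described above would provide.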

Impact

CLASP extends contrastive multimodal learning, popularized for images and text, into a genuinely tri-modal protein setting, demonstrating that structure, sequence, and language can be aligned in one space at modest compute. The released code lowers the barrier to reuse and extension. As a recent preprint, its broader influence will depend on independent benchmarking against established protein-text models, but it illustrates a practical recipe for unifying heterogeneous protein data for zero-shot retrieval and classification.

Citation

Multi-Modal Protein Representation Learning with CLASP

Preprint

Bolouri, N., et al. (2025) Multi-Modal Protein Representation Learning with CLASP. bioRxiv.

DOI: 10.1101/2025.08.10.669533

Recent citations

Papers that recently cited this model.

When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR–Peptide Binding Prediction
Cong Qi, Wenbo Wang, Hanzhang Fang, et al.
bioRxiv · Apr 2026
Citations: 0

GATSBI: Improving context-aware protein embeddings through biologically motivated data splits
Gowri Nayar, R. Altman
bioRxiv · Feb 2026
Citations: 0

A flaw in using pretrained protein language models in protein–protein interaction inference models
Joseph Szymborski, Amin Emad
Nature Machine Intelligence · Feb 2026
Citations: 2

Top citations

The most-cited papers that cite this model.

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Qizhi Pei, Lijun Wu, Kaiyuan Gao, et al.
arXiv.org · Mar 2024
Citations: 27

A flaw in using pretrained protein language models in protein–protein interaction inference models
Joseph Szymborski, Amin Emad
Nature Machine Intelligence · Feb 2026
Citations: 2

GATSBI: Improving context-aware protein embeddings through biologically motivated data splits
Gowri Nayar, R. Altman
bioRxiv · Feb 2026
Citations: 0

When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR–Peptide Binding Prediction
Cong Qi, Wenbo Wang, Hanzhang Fang, et al.
bioRxiv · Apr 2026
Citations: 0

Citations

Total Citations: 4
Influential: 0
References: 33

GitHub

Stars: 4
Forks: 1
Open Issues: 0
Contributors: 2
Last Push: 6mo ago
Language: Python
License: GPL-3.0

Fields of citing research

Biology: 100%
Computer Science: 100%
Chemistry: 25%
Medicine: 25%

Share of papers citing this model.

Openness

bio.rodeo openness: Reproducible (reproducible, less usable)
Overall score: 42 (Partial)
Usability (can I run it?): 47
Reproducibility (can I retrain it?): 50

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository
Research Paper
