bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.

CLASP

McGill University

Tri-modal contrastive framework that aligns protein structure, sequence, and natural-language description in a shared space for zero-shot retrieval and classification.

Released: August 2025

CLASP (Contrastive Language-Amino acid Sequence-Structure Pretraining) is a tri-modal representation-learning framework that embeds a protein's three-dimensional structure, amino-acid sequence, and natural-language description into a single shared vector space. Most protein representation models capture one or two modalities, such as sequence-only language models or structure encoders; CLASP instead aligns all three jointly so that information from any one modality can be retrieved or classified using any other.

The framework was developed by Nicolas Bolouri, Joseph Szymborski, and Amin Emad at McGill University and affiliated Montreal institutions (Mila, the Goodman Cancer Institute, and the Dahdaleh Institute of Genomic Medicine), with a preprint posted to bioRxiv in August 2025. It adapts a CLIP-style contrastive objective, generalized from prior multi-modal work, to the protein domain.

By training on a contrastive objective across modality pairs, CLASP learns biologically meaningful relationships: its structure and sequence embeddings cluster by protein family and functional class. This enables zero-shot tasks such as identifying the correct sequence for a given structure or retrieving proteins from a text description. On these tasks, the authors report that CLASP outperforms baselines limited to fewer modalities.

#Key Features

  • Tri-modal alignment: Jointly embeds structure, sequence, and text description in one shared space, enabling retrieval and classification across any pair of modalities.
  • Zero-shot cross-modal retrieval: Matches structures to sequences, structures to descriptions, and sequences to descriptions without task-specific fine-tuning, outperforming single- and dual-modality baselines.
  • Biologically structured embeddings: Learned representations cluster by protein family and functional class, indicating capture of structure-function relationships.
  • Modality complementarity: Ablations show that removing any single modality during training degrades performance, indicating that each modality contributes distinct signal.
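
Once all modalities share one space, zero-shot cross-modal retrieval reduces to a nearest-neighbor search. The sketch below illustrates that idea with cosine similarity over hand-made toy vectors. It is not CLASP's released code: the function names, the three-dimensional embeddings, and the protein identifiers are all hypothetical, and cosine similarity is assumed here as the ranking score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, candidates, top_k=1):
    """Rank candidates (dict: id -> embedding) by similarity to the query."""
    ranked = sorted(
        candidates,
        key=lambda cid: cosine(query, candidates[cid]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy 3-d vectors standing in for embeddings in the shared space (hypothetical).
structure_query = [0.9, 0.1, 0.0]  # embedding of a query structure
sequence_bank = {
    "seq_kinase": [1.0, 0.2, 0.1],
    "seq_protease": [0.0, 1.0, 0.1],
    "seq_channel": [0.1, 0.0, 1.0],
}

best = retrieve(structure_query, sequence_bank, top_k=2)
print(best)  # ['seq_kinase', 'seq_protease']
```

The same `retrieve` call would serve any modality pair, since the query and the candidates only need to live in the same space.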

#Technical Details

CLASP encodes each modality with a dedicated backbone and aligns them through a contrastive loss. Structure is represented with an E(3)-invariant graph neural network (EGNN) operating on protein graphs built with Graphein; sequence is embedded with the ProtT5 protein language model; and natural-language descriptions are embedded with BioGPT. The resulting per-modality embeddings are projected into a shared space and trained with a tri-modal contrastive objective inspired by CLIP-style 3D contrastive pretraining. The authors report training on the order of 13 hours per run on a single NVIDIA RTX 3090 GPU. Evaluations cover zero-shot classification and retrieval across all modality pairs, with ablations quantifying the contribution of each modality. Code is released on GitHub under GPL-3.0.
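
One common way to build a tri-modal contrastive objective is to sum a symmetric, CLIP-style InfoNCE loss over the three modality pairs. The sketch below shows that construction in plain Python. It is an assumption about the general recipe, not the authors' exact loss: the equal pair weighting, the temperature of 0.07, and all function names are illustrative choices.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] within the batch."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)

def symmetric_pair_loss(a, b, temperature=0.07):
    """Average of the a->b and b->a directions, as in CLIP."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def tri_modal_loss(structure, sequence, text, temperature=0.07):
    """Sum of symmetric losses over the three modality pairs (equal weights assumed)."""
    s = [normalize(v) for v in structure]
    q = [normalize(v) for v in sequence]
    t = [normalize(v) for v in text]
    return (
        symmetric_pair_loss(s, q, temperature)
        + symmetric_pair_loss(s, t, temperature)
        + symmetric_pair_loss(q, t, temperature)
    )

# Toy batch of two proteins: row i of each modality describes the same protein.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # text rows swapped, so pairs no longer match

loss_aligned = tri_modal_loss(aligned, aligned, aligned)
loss_shuffled = tri_modal_loss(aligned, aligned, shuffled)
print(loss_aligned < loss_shuffled)  # True
```

In a real training loop the inputs would be the projected outputs of the three encoders, and the loss would be minimized with an autodiff framework rather than evaluated on fixed vectors.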

#Applications

CLASP is useful for protein annotation and search workflows where queries and targets live in different modalities, for example finding sequences or structures that match a functional text description, or annotating an uncharacterized structure by retrieving similar described proteins. Its joint embeddings can serve as features for downstream family or function classification, supporting researchers in functional genomics and protein characterization.
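
The annotation workflow described above can be sketched as label transfer from nearest neighbors: embed the uncharacterized protein, find the most similar described proteins, and take a majority vote over their labels. The example below is a minimal illustration with toy two-dimensional vectors. The `annotate` helper, the reference set, and the class labels are hypothetical and are not part of the CLASP codebase.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def annotate(query, reference, k=3):
    """Return the majority label among the k reference items nearest to the query.

    reference is a list of (embedding, label) pairs for described proteins.
    """
    ranked = sorted(reference, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy reference set of described proteins (hypothetical embeddings and labels).
reference = [
    ([1.0, 0.1], "hydrolase"),
    ([0.9, 0.2], "hydrolase"),
    ([0.1, 1.0], "transferase"),
    ([0.2, 0.9], "transferase"),
]

uncharacterized = [0.8, 0.3]  # embedding of an unannotated structure
label = annotate(uncharacterized, reference, k=3)
print(label)  # hydrolase
```

A vote over several neighbors is less sensitive to a single mislabeled reference entry than taking only the top hit, at the cost of one extra parameter, `k`.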

#Impact

CLASP extends contrastive multimodal learning, popularized for images and text, into a tri-modal protein setting, demonstrating that structure, sequence, and language can be aligned in one space at modest compute (the authors report roughly 13 hours per run on a single NVIDIA RTX 3090). The released code lowers the barrier to reuse and extension. As a recent preprint, its broader influence will depend on independent benchmarking against established protein-text models, but it illustrates a practical recipe for unifying heterogeneous protein data for zero-shot retrieval and classification.

Tags

cross_modal_retrieval, protein_classification, representation_learning, graph_neural_network, transformer, contrastive_learning, multimodal, zero_shot, protein_function, protein_structure