bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Protein

CLIPepPI

Hebrew University of Jerusalem

Dual-encoder contrastive model that embeds protein domains and peptides into a shared space to predict domain-peptide binding specificity at proteome scale.

Released: March 2026

Many cellular signaling and trafficking processes are governed by short linear motifs (SLiMs) — peptide segments that bind to modular protein domains such as SH3 and PDZ. These domain-peptide interactions are typically weak, transient, and underrepresented in structural databases, which makes them difficult to predict from sequence alone. CLIPepPI, developed by Hochner-Vilk and colleagues at the Hebrew University of Jerusalem and collaborators (preprint posted March 2026), addresses this problem with a contrastive-learning approach borrowed from multimodal representation learning.

Rather than treating interaction prediction as a supervised binary classification task that requires curated negative examples, CLIPepPI uses two separate encoders — one for protein domains and one for peptides — and trains them to project genuine binding partners into nearby points in a shared embedding space. Because the model learns from positive pairs alone, it sidesteps the chronic problem of constructing reliable negatives for sparse interaction data. A binding score is simply the similarity between a domain embedding and a peptide embedding.
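
The scoring and training idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the score and a CLIP-style symmetric contrastive loss in which the other pairs in a batch act as implicit negatives. The temperature value and the toy two-dimensional embeddings are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring rule)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def _mean_diagonal_nll(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(domain_embs, peptide_embs, temperature=0.07):
    """CLIP-style loss: domain i should match peptide i, in both directions.

    The temperature is a hypothetical value chosen for illustration.
    """
    logits = [[cosine_similarity(d, p) / temperature for p in peptide_embs]
              for d in domain_embs]
    logits_t = [list(col) for col in zip(*logits)]  # peptide-to-domain view
    return 0.5 * (_mean_diagonal_nll(logits) + _mean_diagonal_nll(logits_t))

# Toy embeddings: row i of each list is meant to be a true binding pair.
domains = [[1.0, 0.0], [0.0, 1.0]]
peptides_matched = [[0.9, 0.1], [0.1, 0.9]]
peptides_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = symmetric_contrastive_loss(domains, peptides_matched)
loss_swapped = symmetric_contrastive_loss(domains, peptides_swapped)
```

When true partners already sit close together (`peptides_matched`), the loss is lower than when they are misaligned (`peptides_swapped`). That gap is the signal that would pull genuine pairs together during training, with no explicitly curated negatives.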

The result is a single fixed checkpoint that generalizes across distinct benchmarks and scales to proteome-wide screens without task-specific retraining, positioning CLIPepPI alongside ESM-based representation models as a practical tool for SLiM biology.

#Key Features

  • Dual-encoder contrastive design: Separate domain and peptide encoders map sequences into a shared latent space, so binding specificity is read off as embedding similarity rather than a trained classifier output.
  • Positive-pairs-only training: The contrastive objective learns from genuine complexes alone, avoiding the need to fabricate negative interaction examples, a requirement that plagues supervised approaches on sparse data.
  • ESM-C initialization with LoRA: Both encoders inherit pretrained protein language-model representations and are adapted with lightweight LoRA adapters, keeping fine-tuning parameter-efficient.
  • Fixed checkpoint, broad generalization: One model handles three independent benchmarks (PPI3D, ProP-PD, and a nuclear export signal set) and proteome-scale tasks without retraining.
  • Proteome-scale inference: A command-line tool (clip_inference.py) embeds domains and peptides and scores pairs, processing roughly 100 domain-peptide pairs per second on an A40 GPU.
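
To get a feel for what the quoted throughput means for a screen, a back-of-the-envelope estimate helps. The sketch below takes the figure of roughly 100 pairs per second at face value and assumes cost scales with the number of pairs. The screen sizes are hypothetical, and real runs may be faster if embeddings are cached and reused.

```python
def screen_hours(n_domains, n_peptides, pairs_per_second=100.0):
    """Estimated wall-clock hours to score every domain-peptide pair.

    pairs_per_second defaults to the approximate A40 figure quoted above.
    """
    total_pairs = n_domains * n_peptides
    return total_pairs / pairs_per_second / 3600.0

# Hypothetical screen: 10 domains against 1,000,000 candidate peptides.
hours_large = screen_hours(10, 1_000_000)

# Hypothetical screen: 1 domain against 360,000 candidate peptides.
hours_small = screen_hours(1, 360_000)
```

Under these assumptions the larger screen takes a little under 28 hours on a single GPU and the smaller one takes one hour.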

#Technical Details

CLIPepPI is built on the ESM-C protein language model, with the domain and peptide encoders each initialized from ESM-C weights and fine-tuned using LoRA adapters for parameter-efficient training. The model was trained on roughly 3,000 protein-peptide complexes from PPI3D, augmented with approximately 150,000 domain-peptide pairs derived from protein-protein interfaces, optionally guided by marking interface residues. A single fixed checkpoint is evaluated across three benchmarks — PPI3D domain-peptide complexes, the large-scale ProP-PD phage-display dataset, and a curated set of nuclear export signals (NES) — and is further applied to proteome-scale NES scanning and variant-effect prediction. The inference CLI (clip_inference.py) supports separate domain-embedding, peptide-embedding, and pair-scoring modes with a default batch size of 64.
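
The parameter savings from LoRA come from freezing a pretrained weight matrix and learning only a low-rank correction to it. The sketch below shows the generic LoRA update in plain Python. The dimensions, rank, and scaling factor are hypothetical, and the source does not state which ESM-C layers receive adapters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=16.0):
    """Compute W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the small trainable adapter matrices.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_in, d_out, rank):
    """Return (frozen weight parameters, trainable adapter parameters)."""
    return d_in * d_out, rank * (d_in + d_out)

# Toy layer with d_in = 3, d_out = 2, rank = 1 (all values hypothetical).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]

# B starts at zero, so the adapted layer initially matches the frozen one.
out_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0)

# After training, a nonzero B shifts the output through the low-rank path.
out_tuned = lora_forward(W, A, [[0.5], [0.0]], x, alpha=2.0)

# For a hypothetical 1024 x 1024 layer at rank 8:
full_params, adapter_params = lora_param_counts(1024, 1024, 8)
```

In the hypothetical 1024 x 1024 case the adapter holds 16,384 trainable values against 1,048,576 frozen ones, which is the sense in which the fine-tuning is parameter-efficient.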

#Applications

CLIPepPI is aimed at researchers studying peptide-mediated interactions: mapping which domains a candidate motif binds, scanning a proteome for nuclear export signals, or estimating how a point mutation alters binding specificity (variant-effect prediction). Because the same checkpoint covers all of these tasks, structural biologists and systems biologists can apply it directly to new peptides or domains without assembling task-specific training sets, either through the inference CLI or the hosted web server at bio3d.cs.huji.ac.il.
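
Both the scanning and the variant-effect use cases reduce to calling a scoring function on many peptides. The sketch below illustrates that workflow with a deliberately crude stand-in scorer (the fraction of hydrophobic residues), not the CLIPepPI model. The function names, window width, and sequences are hypothetical; in practice the score would come from the inference CLI or web server.

```python
def sliding_windows(sequence, width):
    """Return (start, peptide) for every window of the given width."""
    return [(i, sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]

def scan_protein(sequence, width, score_fn, top_k=3):
    """Score every window and return the top_k as (score, start, peptide)."""
    scored = [(score_fn(pep), start, pep)
              for start, pep in sliding_windows(sequence, width)]
    # Highest score first; ties are broken by the earliest start position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]

def variant_effect(peptide, position, new_residue, score_fn):
    """Score change (mutant minus wild type) for a single substitution."""
    mutant = peptide[:position] + new_residue + peptide[position + 1:]
    return score_fn(mutant) - score_fn(peptide)

def toy_score(peptide):
    """Stand-in scorer: fraction of hydrophobic residues. NOT the real model."""
    return sum(aa in "LIVFM" for aa in peptide) / len(peptide)

# Hypothetical sequence scanned with 3-residue windows.
best_hit = scan_protein("AAALLLAAA", 3, toy_score, top_k=1)[0]

# Hypothetical substitution of the middle leucine with alanine.
delta = variant_effect("LLL", 1, "A", toy_score)
```

A negative `delta` means the substitution lowers the score relative to the wild-type peptide, which is how a predicted loss of binding would show up under this scheme.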

#Impact

By recasting domain-peptide specificity prediction as a contrastive embedding problem trained on positive pairs, CLIPepPI offers a scalable alternative to supervised interaction classifiers that struggle with weak, data-poor SLiM interactions. Its code is released under the Apache 2.0 license with weights deposited on HuggingFace, lowering the barrier to reuse and extension; the accompanying bioRxiv preprint, however, is distributed under a no-reuse license, so its text and figures are not freely reusable. As a preprint, its benchmark results await peer review, but the parameter-efficient, retraining-free design is a practical contribution to the growing toolkit for short-linear-motif biology.

Citation

CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning

Hochner-Vilk, T., et al. (2026) CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning. bioRxiv.

DOI: 10.64898/2026.03.18.712595

Openness

Unclassified
Restrictive license on core components

Tags

contrastive_learning, peptide_binding_prediction, protein_protein_interaction, proteomics, transfer_learning, transformer, variant_effect_prediction, zero_shot

Resources

GitHub Repository, Research Paper, HuggingFace Model, Demo, Dataset