CLIPepPI

Contrastive dual-encoder model embedding protein domains and peptides in one space to predict domain-peptide binding specificity at proteome scale.

Released: March 2026

Many cellular signaling and trafficking processes are governed by short linear motifs (SLiMs), peptide segments that bind modular protein domains such as SH3 and PDZ. These domain-peptide interactions are typically weak, transient, and underrepresented in structural databases, which makes them difficult to predict from sequence alone. CLIPepPI, developed by Hochner-Vilk and colleagues at the Hebrew University of Jerusalem and collaborators (preprint posted March 2026), addresses this problem with a contrastive-learning approach borrowed from multimodal representation learning.

Rather than treating interaction prediction as a supervised binary classification task that requires curated negative examples, CLIPepPI uses two separate encoders, one for protein domains and one for peptides, and trains them to project genuine binding partners to nearby points in a shared embedding space. Because the model learns from positive pairs alone, it sidesteps the chronic problem of constructing reliable negatives for sparse interaction data. A binding score is simply the similarity between a domain embedding and a peptide embedding.
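To make the scoring step concrete, here is a minimal sketch of ranking peptides against one domain by embedding similarity. It assumes cosine similarity as the measure (the description above says only "similarity"), and the three-dimensional vectors and peptide names are invented for illustration; they are not outputs of the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot score a zero-length embedding")
    return dot / (norm_u * norm_v)


def rank_peptides(domain_emb, peptide_embs):
    """Return (peptide_id, score) pairs, most similar first."""
    scores = {pid: cosine_similarity(domain_emb, emb)
              for pid, emb in peptide_embs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical values, assumed for illustration only).
domain = [1.0, 0.0, 1.0]
peptides = {
    "pep_A": [0.9, 0.1, 1.1],   # points in nearly the same direction
    "pep_B": [-1.0, 0.5, 0.0],  # points away from the domain
}
ranking = rank_peptides(domain, peptides)
```

Because the score depends only on the two embeddings, domain and peptide vectors can be computed once and reused across every pairing, which is what makes this kind of design attractive for large screens.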

The result is a single fixed checkpoint that generalizes across distinct benchmarks and scales to proteome-wide screens without task-specific retraining, positioning CLIPepPI alongside ESM-based representation models as a practical tool for SLiM biology.

Key Features

Dual-encoder contrastive design: Separate domain and peptide encoders map sequences into a shared latent space, so binding specificity is read off as embedding similarity rather than a trained classifier output.
Positive-pairs-only training: The contrastive objective learns from genuine complexes alone, so there is no need to fabricate negative interaction examples, a requirement that plagues supervised approaches on sparse data.
ESM-C initialization with LoRA: Both encoders inherit pretrained protein language-model representations and are adapted with lightweight LoRA adapters, keeping fine-tuning parameter-efficient.
Fixed checkpoint, broad generalization: One model handles three independent benchmarks (PPI3D, ProP-PD, and a nuclear export signal set) and proteome-scale tasks without retraining.
Proteome-scale inference: A command-line tool (clip_inference.py) embeds domains and peptides and scores pairs, processing roughly 100 domain-peptide pairs per second on an A40 GPU.
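The positive-pairs-only idea can be illustrated with a CLIP-style symmetric contrastive loss. This is a sketch under assumptions, not the preprint's stated objective: the symmetric cross-entropy form and the temperature of 0.07 are borrowed from common multimodal practice, and the function names are hypothetical. In this formulation the only labels are the true pairs on the diagonal of the similarity matrix; the other pairings within the batch serve as implicit contrast.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """sim[i][j] is the similarity of domain i and peptide j.

    True partners are assumed to sit on the diagonal (i == j).
    The temperature value is an assumption, not taken from the preprint.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Domain -> peptide: each domain should pick out its own peptide.
    loss_d2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Peptide -> domain: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_p2d = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_d2p + loss_p2d)


aligned = [[1.0, 0.0], [0.0, 1.0]]    # partners score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # partners score lowest
loss_aligned = symmetric_contrastive_loss(aligned)
loss_shuffled = symmetric_contrastive_loss(shuffled)
```

The loss is small when each true pair is the most similar entry in its row and column, and large when it is not, which is the behavior a contrastive dual encoder is trained to produce.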

Technical Details

CLIPepPI is built on the ESM-C protein language model, with the domain and peptide encoders each initialized from ESM-C weights and fine-tuned using LoRA adapters for parameter-efficient training. The model was trained on roughly 3,000 protein-peptide complexes from PPI3D, augmented with approximately 150,000 domain-peptide pairs derived from protein-protein interfaces, optionally guided by marking interface residues. A single fixed checkpoint is evaluated across three benchmarks (PPI3D domain-peptide complexes, the large-scale ProP-PD phage-display dataset, and a curated set of nuclear export signals, NES) and is further applied to proteome-scale NES scanning and variant-effect prediction. The inference CLI (clip_inference.py) supports separate domain-embedding, peptide-embedding, and pair-scoring modes with a default batch size of 64.
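For readers unfamiliar with LoRA, the sketch below shows the general low-rank update it applies to a frozen weight matrix, W_eff = W + (alpha / r) * B * A, and why it is parameter-efficient. The matrix sizes, rank, and alpha here are illustrative assumptions; the ranks and target layers used for CLIPepPI are not given in this description.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha, rank):
    """Frozen weight W (d_out x d_in) plus the scaled low-rank update B @ A.

    B has shape d_out x rank and A has shape rank x d_in; only A and B
    are trained, while W keeps its pretrained values.
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters divided by the parameters of the full matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)


# Toy 2x2 example with rank 1 (values assumed for illustration).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [0.0]]
a = [[0.0, 2.0]]
w_eff = lora_effective_weight(w, a, b, alpha=1.0, rank=1)

# With B initialized to zero, the adapted layer starts out identical to W.
w_start = lora_effective_weight(w, a, [[0.0], [0.0]], alpha=1.0, rank=1)

# Hypothetical layer size: a rank-8 adapter on a 1024 x 1024 matrix.
fraction = trainable_fraction(1024, 1024, 8)
```

In this hypothetical layer the adapter trains under 2% of the parameters of the full matrix, which is the sense in which LoRA fine-tuning is lightweight.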

Applications

CLIPepPI is aimed at researchers studying peptide-mediated interactions: mapping which domains a candidate motif binds, scanning a proteome for nuclear export signals, or estimating how a point mutation alters binding specificity (variant-effect prediction). Because the same checkpoint covers all of these tasks, structural biologists and systems biologists can apply it directly to new peptides or domains without assembling task-specific training sets, either through the inference CLI or the hosted web server at bio3d.cs.huji.ac.il.
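The scanning and variant-effect use cases can be sketched as follows, under stated assumptions. The scoring function is a deliberately crude placeholder (fraction of hydrophobic residues) standing in for the model's domain-peptide similarity, and the protein sequence, window width, and function names are all hypothetical. The throughput estimate uses the roughly 100 pairs per second reported for an A40 GPU.

```python
def sliding_windows(sequence, width):
    """Yield (start, peptide) for every window of the given width."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


def variant_effect(score_fn, domain, peptide, position, new_residue):
    """Score change caused by a point substitution (mutant minus wild type)."""
    mutant = peptide[:position] + new_residue + peptide[position + 1:]
    return score_fn(domain, mutant) - score_fn(domain, peptide)


def estimated_seconds(n_pairs, pairs_per_second=100.0):
    """Rough wall-clock estimate from the reported throughput."""
    return n_pairs / pairs_per_second


def toy_score(domain, peptide):
    """Placeholder score: fraction of hydrophobic residues. NOT the model."""
    hydrophobic = set("LIVFM")
    return sum(res in hydrophobic for res in peptide) / len(peptide)


protein = "MSDLQKLLEELTLSGG"  # made-up sequence for illustration
windows = list(sliding_windows(protein, 10))
# Rank candidate windows by score, best first.
hits = sorted(((toy_score("toy_domain", pep), start, pep)
               for start, pep in windows), reverse=True)
# Effect of substituting the first leucine of one window with alanine.
delta = variant_effect(toy_score, "toy_domain", "LQKLLEELTL", 0, "A")
# One million pairs at ~100 pairs/s is a little under three hours.
hours = estimated_seconds(1_000_000) / 3600
```

Swapping the placeholder for real embedding similarities leaves the surrounding logic unchanged: a scan is a ranked list of windows, and a variant effect is a difference of two scores.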

Impact

By recasting domain-peptide specificity prediction as a contrastive embedding problem trained on positive pairs, CLIPepPI offers a scalable alternative to supervised interaction classifiers that struggle with weak, data-poor SLiM interactions. Its code is released under the Apache 2.0 license with weights deposited on Hugging Face, lowering the barrier to reuse and extension; the accompanying bioRxiv preprint, however, is distributed under a no-reuse license, so its text and figures are not freely reusable. As a preprint, its benchmark results await peer review, but the parameter-efficient, retraining-free design is a practical contribution to the growing toolkit for short-linear-motif biology.

Citation

CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning

Hochner-Vilk, T., et al. (2026) CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning. bioRxiv.

DOI: 10.64898/2026.03.18.712595

Citations

Total Citations: 0

Influential: 0

References: 86

GitHub

Stars: 2

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 2mo ago

Language: Python

License: Apache-2.0

Openness

bio.rodeo openness (fully open = usable and reproducible): 50, Partial

Usability (can I run it?): 64

Reproducibility (can I retrain it?): 50

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository, Research Paper, Demo, Dataset
