Hebrew University of Jerusalem
Dual-encoder contrastive model that embeds protein domains and peptides into a shared space to predict domain-peptide binding specificity at proteome scale.
Many cellular signaling and trafficking processes are governed by short linear motifs (SLiMs) — peptide segments that bind to modular protein domains such as SH3 and PDZ. These domain-peptide interactions are typically weak, transient, and underrepresented in structural databases, which makes them difficult to predict from sequence alone. CLIPepPI, developed by Hochner-Vilk and colleagues at the Hebrew University of Jerusalem and collaborators (preprint posted March 2026), addresses this problem with a contrastive-learning approach borrowed from multimodal representation learning.
Rather than treating interaction prediction as a supervised binary classification task that requires curated negative examples, CLIPepPI uses two separate encoders — one for protein domains and one for peptides — and trains them to project genuine binding partners into nearby points in a shared embedding space. Because the model learns from positive pairs alone, it sidesteps the chronic problem of constructing reliable negatives for sparse interaction data. A binding score is simply the similarity between a domain embedding and a peptide embedding.
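The scoring and training idea described above can be illustrated with a minimal sketch. It assumes cosine similarity as the scoring function and a symmetric, CLIP-style InfoNCE loss in which the other pairs in a batch serve as implicit negatives. The source does not state the exact loss, temperature, or embedding dimension, so the function names, the toy 2-D embeddings, and `temperature=0.07` are illustrative assumptions rather than CLIPepPI's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (the assumed binding score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_style_loss(domain_embs, peptide_embs, temperature=0.07):
    """Symmetric InfoNCE loss: domain i and peptide i form a true pair;
    every other peptide (or domain) in the batch acts as an implicit negative."""
    logits = [[cosine(d, p) / temperature for p in peptide_embs]
              for d in domain_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))

# Toy 2-D embeddings (hypothetical values, not model output).
domains = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]   # peptide i lies close to domain i
swapped = [matched[1], matched[0]]   # true partners placed far apart

loss_matched = clip_style_loss(domains, matched)
loss_swapped = clip_style_loss(domains, swapped)
```

In this toy case the loss is much lower when true partners sit close together than when they are swapped, which is the pressure that pulls binding pairs toward each other in the shared space. At inference time only `cosine` (or an equivalent similarity) would be needed to score a candidate pair.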
The result is a single fixed checkpoint that generalizes across distinct benchmarks and scales to proteome-wide screens without task-specific retraining, positioning CLIPepPI alongside ESM-based representation models as a practical tool for SLiM biology.
The inference CLI (clip_inference.py) embeds domains and peptides and scores pairs, processing roughly 100 domain-peptide pairs per second on an A40 GPU.
CLIPepPI is built on the ESM-C protein language model, with the domain and peptide encoders each initialized from ESM-C weights and fine-tuned using LoRA adapters for parameter-efficient training. The model was trained on roughly 3,000 protein-peptide complexes from PPI3D, augmented with approximately 150,000 domain-peptide pairs derived from protein-protein interfaces, optionally guided by marked interface residues. A single fixed checkpoint is evaluated across three benchmarks: PPI3D domain-peptide complexes, the large-scale ProP-PD phage-display dataset, and a curated set of nuclear export signals (NES). The same checkpoint is further applied to proteome-scale NES scanning and variant-effect prediction. The CLI supports separate domain-embedding, peptide-embedding, and pair-scoring modes, with a default batch size of 64.
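To make the parameter-efficiency argument for LoRA concrete, the sketch below shows the general LoRA update, y = Wx + (alpha / r) · B(Ax), with the pretrained matrix W kept frozen and only the small matrices A and B trained. The rank, scaling factor, hidden size, and choice of adapted layers used by CLIPepPI are not given in the source, so `rank = 8` and `d_in = d_out = 960` are hypothetical values chosen only for illustration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters LoRA adds for one d_out x d_in weight matrix:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Parameter count for one square projection (hypothetical sizes).
d_in = d_out = 960
rank = 8
full_params = d_in * d_out
lora_params = lora_trainable_params(d_in, d_out, rank)
fraction = lora_params / full_params  # share of the matrix that is trainable

# Tiny rank-1 example: with B initialized to zero the adapter leaves the
# pretrained output unchanged, which is the usual LoRA starting point.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
B_trained = [[1.0], [0.0]]
x = [1.0, 2.0]
y_start = lora_forward(x, W, A, B_zero, alpha=2.0)
y_adapted = lora_forward(x, W, A, B_trained, alpha=2.0)
```

Under these assumed sizes the adapter trains under 2% of the parameters of the matrix it modifies, which is why two encoders can be fine-tuned from the same pretrained backbone at modest cost.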
CLIPepPI is aimed at researchers studying peptide-mediated interactions: mapping which domains a candidate motif binds, scanning a proteome for nuclear export signals, or estimating how a point mutation alters binding specificity (variant-effect prediction). Because the same checkpoint covers all of these tasks, structural biologists and systems biologists can apply it directly to new peptides or domains without assembling task-specific training sets, either through the inference CLI or the hosted web server at bio3d.cs.huji.ac.il.
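A proteome-scale scan of the kind mentioned above can be thought of as tiling each protein into candidate peptides and scoring every candidate against a domain. The sketch below only generates the windows and estimates the scoring time from the reported throughput of roughly 100 pairs per second. The window length of 15, the stride of 1, the toy sequence, and both helper functions are hypothetical; the source does not describe how CLIPepPI tiles sequences.

```python
def sliding_peptides(sequence, window=15, stride=1):
    """Yield (start_index, peptide) for every full-length window in the sequence."""
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]

def estimated_scan_seconds(n_windows, n_domains=1, pairs_per_second=100.0):
    """Rough scoring time, assuming throughput stays near the reported rate."""
    return n_windows * n_domains / pairs_per_second

# Made-up 20-residue sequence, used only to show the tiling.
toy_protein = "MSDLQKLEELTLDNAAGSPK"
windows = list(sliding_peptides(toy_protein, window=15))

# Hypothetical workload: one million candidate windows against one domain.
seconds = estimated_scan_seconds(1_000_000, n_domains=1)
hours = seconds / 3600
```

Each `(start_index, peptide)` pair would then be embedded and scored against the domain of interest, with the start index retained so that high-scoring windows can be mapped back to positions in the source protein.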
By recasting domain-peptide specificity prediction as a contrastive embedding problem trained on positive pairs, CLIPepPI offers a scalable alternative to supervised interaction classifiers that struggle with weak, data-poor SLiM interactions. Its code is released under the Apache 2.0 license with weights deposited on HuggingFace, lowering the barrier to reuse and extension; the accompanying bioRxiv preprint, however, is distributed under a no-reuse license, so its text and figures are not freely reusable. As a preprint, its benchmark results await peer review, but the parameter-efficient, retraining-free design is a practical contribution to the growing toolkit for short-linear-motif biology.
Hochner-Vilk, T., et al. (2026) CliPepPI: Scalable prediction of domain-peptide specificity using contrastive learning. bioRxiv.
DOI: 10.64898/2026.03.18.712595