bio.rodeo

Protein

ProtSent

Hebrew University of Jerusalem / Ben-Gurion University of the Negev

Contrastively fine-tuned ESM-2 (35M and 150M) protein language models that produce general-purpose sequence embeddings where biological similarity maps to embedding proximity.

Released: May 2026

ProtSent (Protein Sentence Transformers) addresses a practical limitation of protein language models such as ESM-2: their per-residue representations are optimized for masked-token prediction, not for producing a single sequence-level vector that places functionally or structurally related proteins close together. Many downstream workflows (retrieval, clustering, similarity search, and lightweight classification) rely on exactly such fixed-length embeddings, and naive mean-pooling of a pretrained backbone often fails to capture much of the available signal. ProtSent treats the problem the way sentence-embedding models do in natural language processing, adapting a general-purpose backbone into a dedicated embedding model through contrastive fine-tuning.
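As a rough illustration of what "mean-pooling" means here, the sketch below averages per-residue vectors into one sequence-level vector while skipping padding positions. The function name `mean_pool` and the toy numbers are illustrative assumptions, not taken from the ProtSent code; real embeddings would come from the ESM-2 backbone and have hundreds of dimensions.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average per-residue vectors, skipping positions whose mask entry is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("the mask removes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (invented values): three real residues plus one padding position.
residues = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(residues, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector is ignored, so the result depends only on the real residues and has the same length for every sequence.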

Developed by Dan Ofer and Michal Linial at the Hebrew University of Jerusalem together with Nadav Rappoport and Oriel Perets at Ben-Gurion University of the Negev, the work was released as an arXiv preprint in May 2026. The authors release two checkpoints, fine-tuned from the ESM-2 35M and 150M backbones, so users can trade off embedding quality against compute and memory footprint.

The method is deliberately lightweight: each variant was trained on a single GPU in about 1.3 days, and the resulting checkpoints are evaluated frozen, with no per-task retraining of the backbone.

#Key Features

  • Two model scales: Checkpoints fine-tuned from ESM-2 35M and ESM-2 150M let users choose between a compact, fast embedder and a higher-capacity variant with stronger downstream performance.
  • General-purpose embeddings: A single mean-pooled vector per sequence supports classification, regression, retrieval, clustering, and similarity search without task-specific architecture changes.
  • Multi-source contrastive supervision: Training pairs are drawn from five datasets that cover four complementary signals of biological similarity (family membership, structure, interactions, and mutational fitness) rather than a single proxy for relatedness.
  • Frozen, drop-in use: The checkpoints are used without further training and are evaluated with a simple k-nearest-neighbor probe, so they slot directly into existing pipelines as feature extractors.
  • Open release: Code and both weight checkpoints are released under the MIT license and are publicly available through GitHub and Hugging Face.
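To make the k-nearest-neighbor probe mentioned above concrete, here is a minimal sketch of a majority-vote probe over frozen embeddings. It assumes cosine similarity and k = 3; the source does not state the distance metric or the value of k used in the paper, and the vectors and labels below are invented toy data.

```python
import math
from collections import Counter
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query: List[float], train_vecs: List[List[float]],
                train_labels: List[str], k: int = 3) -> str:
    """Label a query by majority vote among its k most similar training embeddings."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query, train_vecs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for frozen embeddings (hypothetical labels, 2-D for readability).
train_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["kinase", "kinase", "protease", "protease"]

print(knn_predict([0.8, 0.3], train_vecs, train_labels))  # kinase
print(knn_predict([0.1, 0.9], train_vecs, train_labels))  # protease
```

Because the probe has no trainable parameters, any difference in accuracy between two embedders reflects the geometry of the embeddings themselves.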

#Technical Details

ProtSent is built on the SentenceTransformers framework, applying MultipleNegativesRankingLoss together with CoSENTLoss over mean-pooled representations of the ESM-2 backbone. Training uses roughly 70 million protein pairs assembled from five datasets: Pfam families (~32.9M pairs), structurally derived Pfam hard negatives (~1.8M), AlphaFold DB structural pairs (~133.9M sampled), STRING-DB interactions (~36.5M), and ProteinGym deep mutational scanning data (~2.2M). Reported settings include the AdamW optimizer, a batch size of 1024, a contrastive temperature of 0.05, and dropout of 0.1. Each variant trained in about 1.3 days on a single NVIDIA RTX 6000 Ada (48 GB).
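The in-batch contrastive objective behind MultipleNegativesRankingLoss can be sketched in a few lines. This is a plain-Python illustration rather than the library's implementation: it assumes cosine similarity and treats the reported temperature of 0.05 as a divisor on the similarities (equivalent to a scale factor of 20). CoSENTLoss, the second objective named above, is not shown.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(anchors: List[List[float]],
                              positives: List[List[float]],
                              temperature: float = 0.05) -> float:
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positive in the batch serves as a negative."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / n


# Toy batch (invented values): aligned pairs versus deliberately mismatched pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
mismatched = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_mismatched = in_batch_contrastive_loss(anchors, mismatched)
print(loss_aligned < loss_mismatched)  # True
```

A low temperature sharpens the softmax, so pairs that are only slightly more similar than the in-batch negatives still produce a strong training signal. Larger batches supply more negatives per anchor, which is one reason a batch size of 1024 is relevant to this kind of loss.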

Checkpoints are evaluated frozen across 23 downstream tasks spanning binary classification, multiclass classification, and regression, using a k-nearest-neighbor (KNN) probe. Relative to the unmodified backbone, the 150M variant shows a +105% improvement on remote homology (fold) detection and a +19.9% gain in SCOPe-40 Recall@1 for structural retrieval, alongside gains such as +17.3% on GB1 variant effect prediction. The smaller 35M variant improves remote homology by +40.5% and SCOPe-40 Recall@1 by +15.5%. Depending on scale, the models improve on 15 to 16 of the 23 tasks.
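For readers unfamiliar with the retrieval metric, the sketch below computes Recall@1 (the fraction of queries whose single nearest neighbor carries the same label) and a relative gain over a baseline. It assumes cosine similarity, a separate query set and index, and that the reported percentages are relative changes; the source does not spell out any of these, and all numbers in the example are invented.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_1(query_vecs: List[List[float]], query_labels: List[str],
                index_vecs: List[List[float]], index_labels: List[str]) -> float:
    """Fraction of queries whose most similar index entry shares the query's label."""
    hits = 0
    for q, q_label in zip(query_vecs, query_labels):
        best = max(range(len(index_vecs)), key=lambda i: cosine(q, index_vecs[i]))
        hits += index_labels[best] == q_label
    return hits / len(query_vecs)


def relative_gain_percent(baseline: float, finetuned: float) -> float:
    """Relative change of a score over its baseline, in percent."""
    return 100.0 * (finetuned - baseline) / baseline


# Toy index and queries (hypothetical fold labels "a", "b", "c").
index_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
index_labels = ["a", "b", "c"]
query_vecs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.2]]
query_labels = ["a", "b", "c"]

score = recall_at_1(query_vecs, query_labels, index_vecs, index_labels)
print(round(score, 3))  # 0.667: the third query retrieves "a" instead of "c"

# Invented scores: a metric moving from 0.20 to 0.41 is a gain of about 105%.
print(round(relative_gain_percent(0.20, 0.41), 1))
```

Under the relative reading, a gain above 100% means the fine-tuned score is more than double the baseline, which is only possible when the baseline score is below 0.5.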

#Applications

ProtSent targets researchers who need protein sequence embeddings as inputs to lightweight downstream models or search systems: detecting remote homologs, retrieving structurally similar proteins, clustering large sequence collections, and training simple probes for functional or fitness prediction. Because the checkpoints are used frozen and emit a fixed-length vector per sequence, they are convenient for groups that lack the resources to fine-tune large backbones and instead want a strong, general-purpose feature extractor that can be precomputed once and reused.
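The "precompute once, reuse" pattern might look like the following sketch: embeddings are L2-normalized a single time when the index is built, after which each search is just a dot product per entry. The protein identifiers and three-dimensional vectors are hypothetical placeholders for real ProtSent embeddings, and `build_index` and `search` are illustrative names rather than part of any released API.

```python
import math
from typing import Dict, List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(embeddings: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Normalize every stored embedding once, up front."""
    return {pid: l2_normalize(vec) for pid, vec in embeddings.items()}


def search(index: Dict[str, List[float]], query: List[float],
           top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k (identifier, cosine similarity) pairs for a query vector."""
    q = l2_normalize(query)
    scored = [(pid, sum(a * b for a, b in zip(q, vec))) for pid, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed embeddings keyed by made-up protein identifiers.
index = build_index({
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 0.0, 1.0],
})

results = search(index, [1.0, 0.0, 0.1], top_k=2)
print([pid for pid, _ in results])  # ['protA', 'protB']
```

For large collections the linear scan would typically be replaced by an approximate nearest-neighbor index, but the stored embeddings themselves would not need to be recomputed.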

#Impact

ProtSent demonstrates that contrastive fine-tuning with biologically grounded pair supervision can substantially sharpen the sequence-level embeddings of existing protein language models without scaling up parameters or compute. The gains are most pronounced on structure-aware tasks such as remote homology detection and structural retrieval. By packaging the approach as small, openly licensed, drop-in checkpoints, it lowers the barrier to high-quality protein embeddings for similarity and retrieval workflows. Because it is a recent preprint, its broader adoption and independent benchmarking remain to be established, and the relative scale of its training-pair sources (heavily weighted toward AlphaFold DB and Pfam) may shape where its embeddings are strongest.

Citation

Preprint

DOI: 10.48550/arXiv.2605.06830

Openness

Class III
Open Model

Tags

contrastive_learning, embeddings, proteomics, remote_homology_detection, representation_learning, self_supervised, transformer, variant_effect_prediction

Resources

GitHub Repository, Research Paper, HuggingFace Model