ProtSent

Hebrew University of Jerusalem / Ben-Gurion University of the Negev

Protein sequence embedding model, contrastively fine-tuned from ESM-2, that places functionally and structurally related proteins close together.

Released: May 2026

ProtSent (Protein Sentence Transformers) addresses a practical limitation of protein language models such as ESM-2: their per-residue representations are optimized for masked-token prediction, not for producing a single sequence-level vector that places functionally or structurally related proteins close together. Many downstream workflows—retrieval, clustering, similarity search, and lightweight classification—rely on exactly such fixed-length embeddings, and naive mean-pooling of a pretrained backbone often leaves substantial signal on the table. ProtSent reframes the problem in the spirit of sentence-embedding models from natural language processing, adapting a general-purpose backbone into a dedicated embedding model through contrastive fine-tuning.

Developed by Dan Ofer and Michal Linial at the Hebrew University of Jerusalem together with Nadav Rappoport and Oriel Perets at Ben-Gurion University of the Negev, the work was released as an arXiv preprint in May 2026. The authors release two checkpoints, fine-tuned from the ESM-2 35M and 150M backbones, so users can trade off embedding quality against compute and memory footprint.

The method is deliberately lightweight: each variant was trained on a single GPU in roughly 1.3 days, and the resulting checkpoints are evaluated frozen, with no per-task retraining of the backbone.

Key Features

Two model scales: Checkpoints fine-tuned from ESM-2 35M and ESM-2 150M let users choose between a compact, fast embedder and a higher-capacity variant with stronger downstream performance.
General-purpose embeddings: A single mean-pooled vector per sequence supports classification, regression, retrieval, clustering, and similarity search without task-specific architecture changes.
Multi-source contrastive supervision: Training pairs are drawn from five datasets covering four kinds of biological similarity (family membership, structure, interactions, and mutational fitness) rather than a single proxy for relatedness.
Frozen, drop-in use: The checkpoints are evaluated frozen with a simple k-nearest-neighbor probe, so they can serve directly as feature extractors in existing pipelines.
Open release: Code (MIT) and both weight checkpoints (MIT) are publicly available through GitHub and Hugging Face.
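
The mean-pooled vector mentioned above can be illustrated with a minimal sketch. It assumes the backbone returns one vector per residue plus a padding mask; the function name `mean_pool` and the toy values are hypothetical and are not taken from the ProtSent code.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the per-residue vectors whose mask entry is 1 (padding is 0)."""
    if len(token_vectors) != len(mask):
        raise ValueError("token_vectors and mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no residues")
    dim = len(kept[0])
    # Column-wise mean over the residues that are not padding.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy per-residue output for a 3-residue protein padded to length 4 (made-up values).
residues = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]  # the last position is padding and must not affect the result
embedding = mean_pool(residues, mask)
print(embedding)  # [3.0, 4.0]
```

Masking matters because sequences in a batch are padded to a common length; averaging over padding positions would shift the embedding of shorter proteins.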

Technical Details

ProtSent is built on the SentenceTransformers framework, applying MultipleNegativesRankingLoss together with CoSENTLoss over mean-pooled representations of the ESM-2 backbone. Training uses roughly 70 million protein pairs assembled from five datasets: Pfam families (~32.9M pairs) and structurally derived Pfam hard negatives (~1.8M), AlphaFold DB structural pairs (~133.9M sampled), STRING-DB interactions (~36.5M), and ProteinGym deep mutational scanning data (~2.2M). Reported settings include the AdamW optimizer, batch size 1024, a contrastive temperature of 0.05, and dropout of 0.1; each variant trained in about 1.3 days on a single NVIDIA RTX 6000 Ada (48 GB).
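
To make the in-batch ranking objective concrete, here is a small pure-Python sketch of a loss in the style of MultipleNegativesRankingLoss. It assumes the reported temperature of 0.05 acts as a divisor on cosine similarities (equivalent to a scale of 20) and that every other positive in the batch serves as a negative. The function names and toy vectors are hypothetical, and the CoSENTLoss term is not shown.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, temperature: float = 0.05) -> float:
    """Mean cross-entropy where anchor i should rank positive i above all other positives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, pos) / temperature for pos in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each anchor is paired with the positive at the same index.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)
```

With a low temperature the softmax is sharp, so correctly matched pairs give a loss near zero while mismatched pairs are penalised heavily. This is also why a large batch helps: each anchor sees more in-batch negatives.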

Checkpoints are evaluated frozen across 23 downstream tasks spanning binary classification, multiclass classification, and regression, using a KNN probe. The 150M variant shows a +105% improvement on remote homology (fold) detection and a +19.9% gain in SCOPe-40 Recall@1 for structural retrieval relative to the unmodified backbone, alongside gains such as +17.3% on GB1 variant effect prediction. The smaller 35M variant improves remote homology by +40.5% and SCOPe-40 Recall@1 by +15.5%, with the two models improving 15–16 of the 23 tasks depending on scale.
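
A frozen-embedding KNN probe of the kind described above can be sketched as follows. This is an illustrative majority-vote classifier over cosine similarity, not the authors' evaluation code; the value of k, the labels, and the vectors are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_predict(query, train_vectors, train_labels, k: int = 3):
    """Label a query by majority vote among its k most similar training embeddings."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: cosine(query, train_vectors[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D "embeddings" standing in for frozen model outputs.
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["kinase", "kinase", "protease", "protease"]

queries = [[0.8, 0.3], [0.1, 0.8]]
truth = ["kinase", "protease"]
predictions = [knn_predict(q, train_vectors, train_labels) for q in queries]
accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(predictions, accuracy)
```

Because the probe has no trainable parameters, any change in its score can be attributed to the embeddings themselves, which makes it a reasonable way to compare a fine-tuned checkpoint against its unmodified backbone.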

Applications

ProtSent targets researchers who need protein sequence embeddings as inputs to lightweight downstream models or search systems: detecting remote homologs, retrieving structurally similar proteins, clustering large sequence collections, and training simple probes for functional or fitness prediction. Because the checkpoints are used frozen and emit a fixed-length vector per sequence, they are convenient for groups that lack the resources to fine-tune large backbones and instead want a strong, general-purpose feature extractor that can be precomputed once and reused.
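
The precompute-once, reuse-many-times pattern can be sketched as a tiny similarity index. The example assumes embeddings have already been produced by a frozen checkpoint; the protein names, vectors, and the helper names `build_index` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_index(embeddings: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Normalize every stored embedding once, up front."""
    return {name: l2_normalize(vec) for name, vec in embeddings.items()}


def top_k(index: Dict[str, List[float]], query: Sequence[float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Made-up precomputed embeddings keyed by a hypothetical protein identifier.
precomputed = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 1.0, 0.0],
    "protD": [0.0, 0.0, 1.0],
}
index = build_index(precomputed)
hits = top_k(index, [1.0, 0.05, 0.0], k=2)
print([name for name, _ in hits])
```

For collections too large for a linear scan, the same normalized vectors could be handed to an approximate nearest-neighbor library; that choice is outside the scope of this sketch.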

Impact

ProtSent demonstrates that contrastive fine-tuning with biologically grounded pair supervision can substantially sharpen the sequence-level embeddings of existing protein language models without scaling up parameters or compute—gains that are most pronounced on structure-aware tasks like remote homology and structural retrieval. By packaging the approach as small, openly licensed, drop-in checkpoints, it lowers the barrier to high-quality protein embeddings for similarity and retrieval workflows. As a recent preprint, its broader adoption and independent benchmarking remain to be established, and the relative scale of its training-pair sources (heavily weighted toward AlphaFold DB and Pfam) may shape where its embeddings are strongest.

Citation

ProtSent: Protein Sentence Transformers

Preprint

Ofer, D., et al. (2026) ProtSent: Protein Sentence Transformers.

DOI: 10.48550/arXiv.2605.06830

Citations

Total Citations: 0

Influential: 0

References: 82

GitHub

Stars: 6

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 2mo ago

Language: Python

HuggingFace

Downloads: 0

Likes: 2

Last Modified: 2mo ago

Pipeline: sentence-similarity

Openness

bio.rodeo openness: Fully open · usable and reproducible

Openness score: 87 (Open)

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 70

Model Openness Framework: Class III (Open Model)

