bio.rodeo
Protein · Language model

ProLoc

Nanjing University

Text-guided localization model that grounds natural-language functional descriptions to specific residue regions of a protein sequence.

Released: June 2026

ProLoc addresses a gap between protein function prediction and mechanistic interpretation. Most protein-text and protein function models capture global, protein-level associations: they can tell you that a protein has a given function, but not which residues are responsible for it. For researchers trying to understand a mechanism or to prioritize residues for experimental validation, that whole-protein answer is too coarse. ProLoc reframes the problem as a span-level grounding task: given a protein sequence and a free-text functional description, it identifies the specific residue regions—domains, motifs, or functional sites—that correspond to that description.

Developed by Peishuo Liu, Jiaxin Fan, Mianzhi Pan, and Jianbing Zhang at Nanjing University and released as a preprint in June 2026, ProLoc introduces both the task formulation, which the authors call text-guided protein functional region localization, and a model built to solve it. The work pairs a curated benchmark derived from InterPro annotations with a text-conditioned localization model that combines a protein language model and a biomedical text encoder.

The framing borrows the notion of visual grounding from vision-language research and applies it to proteins, treating the residue sequence as the medium to be localized within and the functional description as the query. This makes ProLoc useful as a residue-level annotation and hypothesis-generation tool rather than a global classifier.

#Key Features

  • Text-guided residue localization: Given a protein sequence and a natural-language functional description, ProLoc returns the residue spans corresponding to that description, enabling residue-level interpretation rather than whole-protein labels.
  • Generic, open-vocabulary inference: A single trained checkpoint accepts any protein sequence and any free-text query within the InterPro annotation space, with no per-task retraining.
  • Anchor-free span proposals: Beyond a direct localization output, the model generates anchor-free span proposals that improve recovery of multiple disjoint functional sites within one protein.
  • Dual-encoder design: ProLoc builds on ESM2-650M for the protein sequence and PubMedBERT for the text description, conditioning residue-level predictions on the functional query.
  • Purpose-built benchmark: The accompanying InterPro-derived benchmark provides explicit protein-text-region examples with sequence-similarity-aware splits and a unified span-level evaluation protocol.
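
The sequence-similarity-aware splits named in the last bullet can be illustrated with a minimal sketch. It assumes each protein already carries a cluster ID from some sequence-clustering step (the source does not say which tool or identity threshold the authors used), and it assigns whole clusters to one split so that similar sequences never straddle train and test. The function name, the hash-based assignment, and the split fractions are all hypothetical, not the authors' procedure.

```python
import hashlib
from collections import defaultdict


def cluster_aware_split(protein_to_cluster, test_frac=0.1, valid_frac=0.1):
    """Assign every protein to train/valid/test according to its cluster ID.

    All members of one cluster land in the same split, so no test protein
    shares a cluster with a training protein. The fractions are hypothetical
    and apply to clusters, not to individual proteins.
    """
    splits = defaultdict(list)
    for protein_id, cluster_id in sorted(protein_to_cluster.items()):
        # Deterministic pseudo-random number in [0, 1) derived from the cluster ID.
        digest = hashlib.sha256(str(cluster_id).encode("utf-8")).hexdigest()
        u = int(digest[:8], 16) / 16**8
        if u < test_frac:
            name = "test"
        elif u < test_frac + valid_frac:
            name = "valid"
        else:
            name = "train"
        splits[name].append(protein_id)
    return dict(splits)


# Toy input: protein accession -> cluster ID (made-up values).
toy = {"P1": "c1", "P2": "c1", "P3": "c2", "P4": "c3", "P5": "c3"}
assignment = cluster_aware_split(toy)
```

Hashing the cluster ID rather than the protein ID is the design choice that matters here: the split decision is made once per cluster, so it stays stable when new proteins are added to an existing cluster.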

#Technical Details

ProLoc is a text-conditioned localization model built on a frozen-vocabulary pairing of ESM2-650M, a 650-million-parameter protein language model, and PubMedBERT, a biomedical-domain text encoder. It performs direct residue-level localization and includes an anchor-free span proposal mechanism for recovering multiple functional regions.

Training and evaluation use a benchmark constructed from InterPro annotations covering both domain-level and functional-site descriptions, with sequence-similarity-aware splits designed to test generalization to dissimilar sequences.

On the held-out test set, the two outputs have complementary strengths. The direct output gives the strongest single-region localization, at 0.7730 IoU@1. The anchor-free proposal output is better at recovering multiple sites in the visible multi-site (VM) setting, reaching 0.9671 VM R@10 IoU50 and 0.9489 VM All-Hit@50. The authors report that ProLoc substantially outperforms window-based adaptations of representative protein and protein-text models on the same benchmark.
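
The reported metrics are built on span overlap, which a short sketch can make concrete. The code below computes intersection-over-union (IoU) between residue spans and three derived scores: top-1 IoU, recall at K with an IoU threshold, and an all-hit check. It assumes inclusive residue coordinates and the conventional reading of these metric names. The exact definitions behind IoU@1, VM R@10 IoU50, and VM All-Hit@50 are in the paper and may differ, and every function name here is hypothetical.

```python
def span_iou(a, b):
    """IoU of two residue spans given as inclusive (start, end) pairs."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def top1_iou(ranked_preds, truths):
    """Best IoU between the top-ranked prediction and any ground-truth span."""
    if not ranked_preds or not truths:
        return 0.0
    return max(span_iou(ranked_preds[0], t) for t in truths)


def recall_at_k(ranked_preds, truths, k=10, thresh=0.5):
    """Fraction of ground-truth spans matched by any of the top-k predictions."""
    if not truths:
        return 0.0
    top = ranked_preds[:k]
    hits = sum(any(span_iou(p, t) >= thresh for p in top) for t in truths)
    return hits / len(truths)


def all_hit_at_k(ranked_preds, truths, k=50, thresh=0.5):
    """True when every ground-truth span is matched within the top-k predictions."""
    return bool(truths) and recall_at_k(ranked_preds, truths, k, thresh) == 1.0


# Toy protein with two disjoint functional sites (made-up coordinates).
truths = [(10, 50), (120, 160)]
preds = [(12, 48), (200, 240), (118, 165)]  # ranked by confidence
```

In this toy case the top-ranked prediction overlaps the first site well, but only a longer proposal list recovers the second site, which is the kind of gap a multi-proposal output is meant to close.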

#Applications

ProLoc supports residue-level functional annotation of proteins, particularly for newly sequenced or under-characterized proteins where a functional description is available but the responsible regions are unknown. By localizing text descriptions to specific spans, it helps researchers prioritize residues for experimental validation, interpret the structural or mechanistic basis of a function, and pinpoint domains, motifs, and functional sites. The open-vocabulary text query makes it adaptable across the breadth of InterPro annotations without retraining for each function of interest.
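
For residue prioritization, one practical post-processing step is to merge overlapping predicted spans into disjoint candidate regions. The sketch below is an illustration only: the source describes no such post-processing, no ProLoc API has been released, and the input format (inclusive spans with a confidence score), the function name, and the score cutoff are all assumptions.

```python
def merge_spans(scored_spans, min_score=0.5):
    """Merge overlapping or adjacent spans, keeping the best score per region.

    scored_spans: iterable of (start, end, score) with inclusive residue indices.
    Returns disjoint (start, end, score) regions sorted by descending score.
    """
    # Drop low-confidence spans, then sort by start position.
    kept = sorted(s for s in scored_spans if s[2] >= min_score)
    merged = []
    for start, end, score in kept:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region: extend it.
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return sorted(merged, key=lambda r: r[2], reverse=True)


# Made-up predictions for one protein and one text query.
candidates = [(10, 40, 0.91), (35, 60, 0.72), (120, 150, 0.95), (300, 310, 0.20)]
regions = merge_spans(candidates)
```

Here the two overlapping spans collapse into one region covering residues 10 to 60, the low-scoring span is discarded, and the remaining regions are ordered by confidence so the strongest candidate comes first.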

#Impact

ProLoc defines text-guided protein functional region localization as a distinct span-level grounding task and supplies both a benchmark and a baseline model for it, establishing an evaluation framework that future protein-text models can be measured against. Its emphasis on residue-level grounding rather than global classification moves protein-text modeling toward mechanistic interpretability and experimental prioritization. As of mid-2026 the work is a preprint awaiting peer review. No source code, pretrained weights, or hosted API has been released, and the work is distributed under a restrictive (non-commercial) license; together these currently limit independent reproduction and downstream reuse.

Citation

DOI: 10.64898/2026.06.24.733131


Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Overall: 10 (Closed)
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 14
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

functional_region_localization, language_model, multimodal, protein_function_prediction, proteomics, transformer

Resources

Research Paper