ProteinSage

Structure-aware protein language model using structure-guided masking and a causal objective for variant effect prediction and protein discovery.

Released: March 2026

ProteinSage is a protein language model from BioMap, released as a bioRxiv preprint in March 2026, that addresses how protein language models acquire structural knowledge. Standard sequence-based models learn structure only implicitly, as a byproduct of masked language modeling over large sequence corpora. ProteinSage instead makes structure an explicit objective, incorporating structural signals through structure-guided masking and a causal objective designed to capture long-range dependencies.

The result is a pretraining framework that learns transferable protein representations under explicit structural constraints while using less data and computation than purely implicit approaches. Across diverse structure-aware and general protein modeling benchmarks, ProteinSage reports competitive or superior performance, supporting the argument that explicitly guiding what the model attends to during pretraining is more efficient than hoping structure emerges on its own.

Beyond benchmarks, the authors apply ProteinSage to a discovery task involving challenging multi-pass transmembrane proteins. In a zero-shot setting, the model identifies six previously unannotated microbial rhodopsin homologs, suggesting that its structure-informed representations generalize to distant, hard-to-annotate families.

Key Features

Structure-guided masking: Masking is informed by structural signals rather than applied uniformly at random, focusing learning on structurally meaningful positions.
Causal long-range objective: A causal training objective is used to capture long-range dependencies across the sequence.
Explicit structural constraints: Structure is treated as an explicit pretraining target, moving from implicit to explicit structure learning.
Data and compute efficiency: Achieves transferable representations with less data and computation than implicit-only protein language models.
Zero-shot discovery: Identified six previously unannotated microbial rhodopsin homologs zero-shot, demonstrating generalization to distant protein families.
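
To make the structure-guided masking idea concrete, the following is a minimal Python sketch, not the authors' implementation. This page does not describe the preprint's actual masking criterion, so everything here is an assumption: the per-residue `structural_scores` (for example, a contact count per residue), the `mask_fraction` and `temperature` parameters, and the function name `structure_guided_mask` are all hypothetical. The sketch samples mask positions without replacement from a softmax over the scores, so residues with higher scores are masked more often than under uniform random masking.

```python
import math
import random


def structure_guided_mask(structural_scores, mask_fraction=0.15,
                          temperature=1.0, seed=None):
    """Return a list of booleans marking which positions to mask.

    Hypothetical illustration: positions with higher structural scores
    are more likely to be masked than under uniform random masking.
    """
    rng = random.Random(seed)
    n = len(structural_scores)
    n_mask = max(1, round(mask_fraction * n))

    # Gumbel top-k trick: adding Gumbel noise to each logit and keeping the
    # n_mask largest keys samples without replacement from softmax(logits).
    keys = []
    for position, score in enumerate(structural_scores):
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # keep both logarithms finite
        gumbel_noise = -math.log(-math.log(u))
        keys.append((score / temperature + gumbel_noise, position))

    chosen = {position for _, position in sorted(keys, reverse=True)[:n_mask]}
    return [i in chosen for i in range(n)]


# Usage: a toy 20-residue protein whose second half is assumed to be
# structurally more important (scores are made up for illustration).
toy_scores = [0.0] * 10 + [5.0] * 10
toy_mask = structure_guided_mask(toy_scores, mask_fraction=0.2, seed=0)
```

Raising `temperature` flattens the distribution toward uniform masking, while lowering it concentrates masking on the highest-scoring residues.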

Technical Details

ProteinSage is a transformer-based protein language model whose pretraining injects structural signals through structure-guided masking combined with a causal objective that models long-range dependencies. This structure-constrained pretraining is reported to yield transferable representations using less data and computation than implicit masked-language-modeling baselines, while matching or exceeding them across structure-aware and general protein modeling benchmarks. As a downstream validation, the model was applied to multi-pass transmembrane helical proteins and recovered six previously unannotated microbial rhodopsin homologs in a zero-shot setting.
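
The causal objective mentioned above can be illustrated with a generic next-token formulation. This is a sketch of a standard causal language-modeling loss, assumed here for illustration only. The page does not specify ProteinSage's exact loss, and the names `causal_attention_mask` and `next_token_nll` are hypothetical. The first function builds the lower-triangular visibility pattern that stops a position from attending to later residues. The second computes the mean negative log-likelihood of each residue given the logits produced at the preceding position.

```python
import math


def causal_attention_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def next_token_nll(logits, token_ids):
    """Mean negative log-likelihood of token t+1 under the logits at position t.

    logits: one list of vocabulary scores per sequence position.
    token_ids: integer token index per sequence position.
    """
    if len(token_ids) < 2 or len(logits) != len(token_ids):
        raise ValueError("need matching logits and at least two tokens")

    total = 0.0
    count = 0
    # Pair the logits at position t with the token at position t + 1.
    for row, target in zip(logits[:-1], token_ids[1:]):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count


# Usage: uninformative (all-zero) logits over a 20-letter amino-acid alphabet.
uniform_logits = [[0.0] * 20 for _ in range(5)]
loss = next_token_nll(uniform_logits, [3, 1, 4, 1, 5])
```

With uninformative logits the loss equals the log of the vocabulary size, which is a convenient sanity check before training.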

Applications

ProteinSage is intended as a backbone for protein representation learning where structural context matters, such as variant-effect prediction, structure-aware property modeling, and homolog discovery in difficult families like transmembrane proteins. Its rhodopsin discovery result highlights a concrete use case: mining sequence databases for functionally relevant but unannotated proteins, supporting protein engineering and functional genomics workflows at BioMap and potentially beyond.
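
One common way to mine a sequence database with a protein language model is to rank candidates by how close their embeddings are to a query embedding. The sketch below shows that generic approach. It is not a description of how the authors found the rhodopsin homologs, and no public ProteinSage weights or API exist to produce the embeddings, so the vectors, the candidate names, and the functions `cosine_similarity` and `rank_candidates` are all hypothetical placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidate_embeddings, top_k=5):
    """Return up to top_k (name, similarity) pairs, most similar first.

    candidate_embeddings: dict mapping a sequence name to its embedding.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in candidate_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Usage with made-up 2-dimensional embeddings (real ones would be much larger).
query = [1.0, 0.0]
candidates = {
    "candidate_a": [1.0, 0.0],
    "candidate_b": [0.0, 1.0],
    "candidate_c": [-1.0, 0.0],
    "candidate_d": [2.0, 0.1],
}
hits = rank_candidates(query, candidates, top_k=2)
```

In practice the top-ranked hits would still need independent validation, for example by checking conserved motifs or predicted structure, before being treated as homologs.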

Impact

ProteinSage contributes to an active line of research on how, and how efficiently, protein language models learn structure, arguing for explicit structural objectives over purely implicit learning. The zero-shot recovery of unannotated rhodopsin homologs is an encouraging sign that the approach generalizes to hard cases. Its broader impact is currently limited by access: the preprint is released under a CC-BY-NC license from a commercial developer, and no public weights or code accompany it, so independent reproduction and downstream adoption are constrained.

Citation

ProteinSage: From implicit learning to explicit structural constraints for efficient protein language modeling

Shen, L., et al. (2026) ProteinSage: From implicit learning to explicit structural constraints for efficient protein language modeling. bioRxiv.

DOI: 10.64898/2026.03.17.712034

Citations

Total citations: 0

Influential: 0

References: 84

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Overall: 12 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper
