A protein language model that injects explicit structural constraints via structure-guided masking and a causal objective for efficient, structure-aware representation learning.
ProteinSage is a protein language model from BioMap, released as a bioRxiv preprint in March 2026, that addresses how protein language models acquire structural knowledge. Standard sequence-based models learn structure only implicitly, as a byproduct of masked language modeling over large sequence corpora. ProteinSage instead makes structure an explicit objective, incorporating structural signals through structure-guided masking and a causal objective designed to capture long-range dependencies.
The result is a pretraining framework that learns transferable protein representations under explicit structural constraints while reportedly using less data and computation than purely implicit approaches. Across diverse structure-aware and general protein modeling benchmarks, the authors report competitive or superior performance, which supports their argument that explicitly guiding what the model learns to predict during pretraining is more efficient than waiting for structure to emerge implicitly.
Beyond benchmarks, the authors apply ProteinSage to a real discovery task involving challenging multi-pass transmembrane proteins, using it zero-shot to identify six previously unannotated microbial rhodopsin homologs. They present this as a demonstration that its structure-informed representations generalize to distant, hard-to-annotate families.
ProteinSage is a transformer-based protein language model whose pretraining injects structural signals through structure-guided masking combined with a causal objective that models long-range dependencies. This structure-constrained pretraining is reported to yield transferable representations using less data and computation than implicit masked-language-modeling baselines, while matching or exceeding them across structure-aware and general protein modeling benchmarks. As a downstream validation, the model was applied to multi-pass transmembrane helical proteins and recovered six previously unannotated microbial rhodopsin homologs in a zero-shot setting.
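The summary above does not say how masked positions are chosen, so the following is only a minimal sketch of one way structure-guided masking could work. It assumes a residue contact list is available and that residues with more long-range contacts should be masked more often. The function names, the `min_separation` cutoff, and the +1 smoothing are hypothetical choices for illustration and are not taken from the preprint.

```python
import random


def contact_counts(contacts, length):
    """Count how many contacts each residue position takes part in."""
    counts = [0] * length
    for i, j in contacts:
        counts[i] += 1
        counts[j] += 1
    return counts


def structure_guided_mask(sequence, contacts, mask_rate=0.15,
                          min_separation=6, seed=0, mask_token="<mask>"):
    """Mask residues with probability weighted by long-range contact count.

    Assumption (not from the source): positions involved in more contacts
    between residues at least `min_separation` apart are better masking targets.
    """
    rng = random.Random(seed)
    length = len(sequence)

    # Keep only contacts between residues that are far apart in sequence.
    long_range = [(i, j) for i, j in contacts if abs(i - j) >= min_separation]
    counts = contact_counts(long_range, length)

    # +1 smoothing so residues without contacts can still be masked.
    weights = [c + 1 for c in counts]
    positions = list(range(length))
    n_mask = max(1, round(mask_rate * length))

    # Weighted sampling without replacement.
    chosen = []
    for _ in range(n_mask):
        pick = rng.choices(range(len(positions)), weights=weights)[0]
        chosen.append(positions.pop(pick))
        weights.pop(pick)
    chosen.sort()

    tokens = list(sequence)
    for p in chosen:
        tokens[p] = mask_token
    return tokens, chosen
```

Calling `structure_guided_mask("MKTAYIAKQRQISFVKSHFS", [(0, 15), (1, 14), (2, 19)])` returns the token list with three positions replaced by `<mask>`, along with the sorted masked indices. With uniform weights the same routine reduces to ordinary random masking, which makes the structural weighting the only difference between the two schemes in this sketch.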
ProteinSage is intended as a backbone for protein representation learning where structural context matters: variant-effect prediction, structure-aware property modeling, and homolog discovery in difficult families such as transmembrane proteins. Its rhodopsin discovery result highlights a concrete use case, namely mining sequence databases for functionally relevant but unannotated proteins, which supports protein engineering and functional genomics workflows at BioMap and potentially beyond.
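The source does not describe how the database mining was carried out. As a hedged illustration of the general idea, the sketch below ranks candidate proteins by cosine similarity between precomputed embedding vectors and a query embedding. The embeddings are assumed to come from some protein language model. No ProteinSage API is used, since no public weights or code are available, and the function names and the `threshold` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Treat zero vectors as dissimilar to everything.
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings, threshold=0.8):
    """Return (name, score) pairs at or above `threshold`, best first.

    `candidate_embeddings` maps a sequence identifier to its embedding.
    The threshold is an illustrative value, not one from the source.
    """
    scored = [
        (name, cosine(query_embedding, emb))
        for name, emb in candidate_embeddings.items()
    ]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

In practice the hits from a search like this would still need independent checks, such as predicted topology or conserved motifs, before being treated as homologs.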
ProteinSage contributes to an active line of research on how, and how efficiently, protein language models learn structure, arguing for explicit structural objectives over purely implicit learning. The zero-shot recovery of unannotated rhodopsin homologs is an encouraging sign that the approach generalizes to hard cases. Its broader impact is currently limited by access: the preprint is released under a CC-BY-NC license from a commercial developer, and no public weights or code accompany it, so independent reproduction and downstream adoption are constrained.