Hebrew University of Jerusalem
Universal protein language model pretrained on 106M UniRef90 sequences with dual objectives: masked language modeling and Gene Ontology annotation prediction.
ProteinBERT is a protein language model developed by Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial at the Hebrew University of Jerusalem and Ben-Gurion University. Published in Bioinformatics in 2022, it was among the first models to explicitly incorporate functional annotation signals — specifically Gene Ontology (GO) terms — as a pretraining objective alongside masked language modeling, distinguishing it from contemporaries that relied solely on sequence reconstruction.
The model addresses a longstanding challenge in protein representation learning: how to encode both local sequence patterns and global functional properties within a single unified architecture. Traditional transformer-based language models process sequences through self-attention, whose cost grows quadratically with sequence length, making them impractical for proteins longer than a few thousand residues. ProteinBERT introduced a dual-pathway architecture that separates local (per-residue) and global (whole-protein) representations, enabling efficient processing of sequences from 128 to over 16,000 amino acids.
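To make the complexity contrast concrete, here is a minimal NumPy sketch of a global-attention step in the spirit of ProteinBERT's: the fixed-size global representation supplies the only query, so scoring touches each residue once and never builds an L-by-L matrix. All dimensions and weight names below are invented for illustration; this is not the model's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_local, d_global = 4096, 128, 512        # toy sizes, not the real model's

local = rng.standard_normal((L, d_local))    # per-residue representations
global_rep = rng.standard_normal(d_global)   # fixed-size whole-protein state

# Hypothetical projection weights, scaled for stable magnitudes.
Wq = rng.standard_normal((d_global, d_local)) / np.sqrt(d_global)
Wk = rng.standard_normal((d_local, d_local)) / np.sqrt(d_local)
Wv = rng.standard_normal((d_local, d_local)) / np.sqrt(d_local)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

q = global_rep @ Wq                            # one query vector: (d_local,)
scores = (local @ Wk) @ q / np.sqrt(d_local)   # (L,) scores, not (L, L)
weights = softmax(scores)
summary = weights @ (local @ Wv)               # attention-pooled summary: (d_local,)

print(summary.shape)   # (128,)
print(f"self-attention scores: {L * L:,} entries; global attention: {L:,}")
```

Because the score vector has one entry per residue rather than one per residue pair, memory and compute grow linearly with sequence length, which is the property that lets the architecture scale to very long proteins.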
Despite containing only approximately 16 million parameters — an order of magnitude smaller than many contemporary protein language models — ProteinBERT achieves competitive or superior performance across a broad range of downstream property prediction benchmarks. This efficiency made it particularly attractive as a transfer learning backbone in settings where computational resources are constrained.
ProteinBERT consists of six transformer-like blocks arranged in a dual-pathway design. The local pathway processes per-residue amino acid tokens using a combination of 1D convolutional layers and global-attention layers, with skip connections and layer normalization throughout. The global pathway maintains a fixed-size whole-protein representation updated by fully connected layers and receives compressed summaries from the local track at each block. This architecture allows the two pathways to exchange information at every layer, enabling the final representations to jointly reflect position-specific and protein-wide signals.
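As a concrete (and heavily simplified) illustration, the sketch below implements one such block in NumPy under assumed toy dimensions and invented parameter names: a same-padding 1D convolution updates the local track, a fully connected layer updates the global track, the global state is broadcast into every residue position, and a single-query attention pooling (a stand-in for the model's global-attention layers) carries a summary of the local track back to the global state.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_loc, d_glob, k = 512, 128, 512, 9   # toy sizes; the real model's differ

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def conv1d(x, W):
    # same-padding 1D convolution: x is (L, d_in), W is (k, d_in, d_out)
    pad = W.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[i:i + W.shape[0]], W, axes=2)
                     for i in range(len(x))])

def block(local, glob, p):
    # local track: convolution over residues, with skip connection and norm
    local = layer_norm(local + gelu(conv1d(local, p["conv"])))
    # global -> local: broadcast the global state into every residue position
    local = layer_norm(local + gelu(glob @ p["g2l"]))
    # local -> global: single-query attention pooling (linear in L),
    # a simplified stand-in for the global-attention layers
    s = local @ p["pool_q"]
    w = np.exp(s - s.max()); w /= w.sum()
    summary = w @ local
    # global track: fully connected update, with skip connection and norm
    glob = layer_norm(glob + gelu(glob @ p["g2g"] + summary @ p["l2g"]))
    return local, glob

p = {
    "conv":   rng.standard_normal((k, d_loc, d_loc)) / np.sqrt(k * d_loc),
    "g2l":    rng.standard_normal((d_glob, d_loc)) / np.sqrt(d_glob),
    "pool_q": rng.standard_normal(d_loc) / np.sqrt(d_loc),
    "g2g":    rng.standard_normal((d_glob, d_glob)) / np.sqrt(d_glob),
    "l2g":    rng.standard_normal((d_loc, d_glob)) / np.sqrt(d_loc),
}

local = rng.standard_normal((L, d_loc))
glob = rng.standard_normal(d_glob)
for _ in range(6):                        # six blocks, as in the model
    local, glob = block(local, glob, p)   # (real blocks have separate weights)
print(local.shape, glob.shape)            # (512, 128) (512,)
```

Nothing here ever forms an L-by-L interaction matrix: cross-position mixing comes only from the convolutions and the pooled summary, which keeps per-block compute linear in sequence length.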
The model was pretrained on approximately 106 million sequences from UniRef90, covering the breadth of known protein sequence space. Training ran for roughly 28 days and approximately 6.4 epochs over the full dataset. The GO annotation pretraining task used 8,943 terms — those appearing at least 100 times across the dataset — as multi-label prediction targets, providing a rich functional supervision signal beyond sequence reconstruction alone. Downstream benchmarks span nine tasks including secondary structure prediction, remote homology detection, post-translational modification (PTM) site prediction, fluorescence and stability regression, signal peptide cleavage site prediction, and neuropeptide cleavage.
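To show how the two pretraining signals can combine, here is a minimal sketch of a joint loss, assuming invented array names, a 26-token vocabulary, and an unweighted sum of the two terms (the paper's exact corruption scheme and loss weighting are not reproduced): token-level cross-entropy at corrupted positions plus multi-label binary cross-entropy over the 8,943 GO targets.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

def dual_pretraining_loss(aa_logits, aa_targets, corrupted, go_logits, go_targets):
    """aa_logits: (L, 26) per-residue token logits; aa_targets: (L,) true token ids;
    corrupted: (L,) bool mask of positions to score; go_logits/go_targets: (8943,)."""
    # masked-LM term: cross-entropy at corrupted positions only
    logp = log_softmax(aa_logits)
    mlm = -logp[np.arange(len(aa_targets)), aa_targets][corrupted].mean()
    # GO term: multi-label binary cross-entropy (independent sigmoid per term)
    p = 1.0 / (1.0 + np.exp(-go_logits))
    eps = 1e-7
    go = -(go_targets * np.log(p + eps)
           + (1 - go_targets) * np.log(1 - p + eps)).mean()
    return mlm + go   # assumed unweighted sum, for illustration only

# toy example with random logits and sparse GO labels
rng = np.random.default_rng(2)
L, V, G = 128, 26, 8943
loss = dual_pretraining_loss(
    rng.standard_normal((L, V)), rng.integers(0, V, L), rng.random(L) < 0.15,
    rng.standard_normal(G), (rng.random(G) < 0.01).astype(float),
)
print(f"{loss:.3f}")
```

In the full model, the GO annotations also enter as corrupted inputs during pretraining rather than serving purely as output targets, so the network learns to both use and recover functional labels.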
ProteinBERT serves as a transfer learning backbone for predicting diverse protein properties where labeled data is scarce. Researchers apply it to PTM site prediction, signal peptide identification, and secondary structure annotation. Its linear-complexity attention is particularly valuable for analyzing large proteins or full proteome scans where standard transformer models would be computationally prohibitive. Because fine-tuning requires only minutes on commodity hardware, it is well-suited to academic and resource-limited environments. The integrated GO annotation embeddings also make it a natural fit for function prediction tasks that require awareness of broad biological process, molecular function, and cellular component categories.
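As a usage illustration, the snippet below follows the embedding-extraction pattern shown in the project's GitHub README; the function names (load_pretrained_model, encode_X, and so on) come from the proteinbert package, but treat the exact signatures, the package name, and the example sequence as assumptions to verify against the current repository.

```python
# Sketch of extracting embeddings with the proteinbert package
# (pip install protein-bert), following the README's usage pattern;
# verify names and signatures against the repository before use.
from proteinbert import load_pretrained_model
from proteinbert.conv_and_global_attention_model import get_model_with_hidden_layers_as_outputs

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]   # hypothetical example sequence
seq_len = 512                                  # padded length; must cover the longest input

pretrained_model_generator, input_encoder = load_pretrained_model()
model = get_model_with_hidden_layers_as_outputs(
    pretrained_model_generator.create_model(seq_len))

encoded_x = input_encoder.encode_X(seqs, seq_len)
local_reps, global_reps = model.predict(encoded_x, batch_size=32)
# local_reps: per-residue embeddings for token-level tasks (e.g. PTM sites);
# global_reps: one fixed-size vector per protein, a convenient feature set
# for a lightweight downstream classifier.
```

The global vectors are a natural drop-in feature set for small classifiers on scarce labeled data, while the per-residue outputs support token-level tasks such as PTM or cleavage site prediction.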
ProteinBERT demonstrated that combining sequence and functional annotation objectives during pretraining could yield competitive protein representations with far fewer parameters than pure sequence-based models, influencing subsequent work on multi-task protein pretraining strategies. Its open release on GitHub with pretrained weights and a step-by-step Colab demo lowered the barrier to entry for wet-lab researchers seeking to apply deep learning to protein property prediction. The model has been cited extensively as a baseline in the protein representation learning literature and remains a practical reference point for evaluating the trade-off between model scale and predictive performance. A notable limitation is that ProteinBERT does not encode three-dimensional structural information and was not designed for structure prediction; it is best understood as a sequence- and function-aware embedding model rather than a structure-prediction system.
Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. (2022) ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8), 2102–2110.
DOI: 10.1093/bioinformatics/btac020