Protein

MaskedProteinEnT

GrayLab

An equivariant graph transformer trained with masked language modeling on protein structure to learn contextual amino acid encodings for sequence design and interface modeling.

Released: 2023

Overview

MaskedProteinEnT is a structure-conditioned protein encoder developed by Sai Pooja Mahajan, Jeffrey A. Ruffolo, and Jeffrey J. Gray at the Johns Hopkins University GrayLab. Introduced as a preprint in July 2023 and published in Cell Systems in 2025, the model addresses a central question in computational protein biology: how much information about the optimal amino acid at a given position can be recovered from structural context alone, without observing the identity of neighboring residues?

The model builds on the masked language modeling (MLM) paradigm — the training approach that underpins large protein sequence models like ESM — but applies it to protein graph representations derived from three-dimensional structures. Rather than masking tokens in a sequence, MaskedProteinEnT masks residue identities in a structure-based graph and trains the model to predict the correct amino acid from the geometric and chemical environment encoded by the surrounding unmasked residues. This structural MLM objective teaches the model which amino acids are compatible with a given spatial context, a question at the heart of the protein design problem of finding sequences that adopt a target fold.
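To make the objective concrete, the sketch below shows structure-conditioned masked training in PyTorch. It assumes a hypothetical model(inputs, coords) that maps residue identities and coordinates to per-residue amino acid logits; the names, shapes, and masking fraction are illustrative assumptions, not the authors' released API.

```python
import torch
import torch.nn.functional as F

NUM_AA = 20        # standard amino acid alphabet
MASK_IDX = NUM_AA  # extra index reserved for the mask token

def masked_structure_lm_loss(model, aa_labels, coords, mask_frac=0.15):
    """aa_labels: (N,) residue identities; coords: (N, 3) C-alpha coordinates."""
    mask = torch.rand(aa_labels.shape[0]) < mask_frac  # positions to hide
    inputs = aa_labels.clone()
    inputs[mask] = MASK_IDX            # hide identity, keep the geometry intact
    logits = model(inputs, coords)     # (N, NUM_AA) structure-conditioned logits
    # BERT-style objective: cross-entropy only at the masked positions
    return F.cross_entropy(logits[mask], aa_labels[mask])
```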

A distinctive aspect of MaskedProteinEnT is its hierarchical training strategy: the model is pretrained on general protein structures, then fine-tuned on increasingly specific contexts — protein-protein interfaces, antibody-antigen complexes — using progressively smaller but more specialized datasets. This contextual transfer approach allows the model to leverage structural databases at different scales and improves performance on tasks like CDR loop design, where labeled data is limited.
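A staged schedule of this kind might look like the sketch below, which reuses the masked loss from the previous example. The ordering of stages mirrors the description above, but the loop structure, optimizer, and learning rates are assumptions rather than the released training code.

```python
import torch

def hierarchical_finetune(model, loss_fn, stages, epochs_per_stage=1):
    """Fine-tune one model through progressively specialized datasets.

    stages: list of (data_loader, learning_rate) pairs ordered from the
    largest, most general corpus (e.g. general protein structures) down to
    the smallest, most specialized one (e.g. antibody-antigen complexes).
    """
    for loader, lr in stages:
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs_per_stage):
            for aa_labels, coords in loader:
                loss = loss_fn(model, aa_labels, coords)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return model  # weights from each stage carry over into the next
```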

Key Features

  • Structure-conditioned masked language modeling: Amino acid identities are masked in a structural graph and predicted from the three-dimensional environment of surrounding residues, teaching the model context-dependent sequence compatibility without evolutionary data.
  • E(n)-equivariant graph transformer architecture: The model uses an equivariant graph neural network that respects the rotational and translational symmetry of protein structures, ensuring that predictions are invariant to the choice of coordinate frame.
  • Hierarchical fine-tuning for specialized contexts: Pretraining on general protein structures is followed by fine-tuning on protein-protein interfaces and antibody-antigen complexes, improving performance on CDR loop design — particularly the hypervariable CDR H3.
  • Multi-context training: The model integrates data sources of varying quality and specificity, including general protein structural databases, synthetic antibody libraries, and experimental antibody-antigen complex structures.
  • Sequence space exploration: The sampled sequence distributions from MaskedProteinEnT recapitulate the evolutionary sequence neighborhood of wildtype proteins, suggesting the model has learned functionally meaningful constraints rather than memorizing training structures.

Technical Details

MaskedProteinEnT implements an E(n)-equivariant graph transformer in which protein residues are nodes and edges encode pairwise geometric relationships (distances, dihedral angles, and relative orientations). Node and edge features are updated through equivariant message-passing layers that preserve the SE(3) symmetry of three-dimensional space. The masking objective randomly hides residue identity labels at a fraction of positions and computes cross-entropy loss against the true amino acids, analogous to BERT-style pretraining in NLP.
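The symmetry property can be illustrated with a minimal E(n)-equivariant message-passing layer in the style of EGNN (Satorras et al., 2021). This is a generic sketch of the layer family, not the published MaskedProteinEnT architecture, which additionally carries edge features such as dihedral angles and relative orientations.

```python
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    """EGNN-style update: invariant features h, equivariant coordinates x."""

    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU())
        self.coord_mlp = nn.Linear(dim, 1, bias=False)
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU())

    def forward(self, h, x):
        # h: (N, dim) node features; x: (N, 3) residue coordinates
        n = h.size(0)
        diff = x[:, None, :] - x[None, :, :]        # (N, N, 3) displacements
        dist2 = (diff ** 2).sum(-1, keepdim=True)   # squared distances: invariant
        hi = h[:, None, :].expand(-1, n, -1)
        hj = h[None, :, :].expand(n, -1, -1)
        m = self.edge_mlp(torch.cat([hi, hj, dist2], dim=-1))  # invariant messages
        # The coordinate update is a weighted sum of displacement vectors, so
        # it rotates and translates together with the input frame (equivariance).
        x = x + (diff * self.coord_mlp(m)).mean(dim=1)
        h = self.node_mlp(torch.cat([h, m.sum(dim=1)], dim=-1))
        return h, x
```

Because the amino acid classifier reads only the invariant features h, rigidly rotating or translating the input coordinates leaves the predicted logits unchanged, which is the frame-invariance property noted under Key Features.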

Pretraining used structures drawn from the Protein Data Bank combined with high-confidence AlphaFold predictions; fine-tuning for antibody design used curated datasets of synthetic and experimentally determined antibody-antigen complexes. On native sequence recovery benchmarks, MaskedProteinEnT achieves recovery rates competitive with leading inverse folding methods. For antibody CDR H3 loop design — the most challenging CDR due to its length and structural diversity — models fine-tuned with the hierarchical strategy show improved native recovery over models trained on antibody data alone, confirming the value of pretraining on broader protein contexts. Sequences designed for highly plastic (conformationally flexible) structures preserve the flexibility encoded in those structures, indicating that the model captures functional rather than merely structural constraints.
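Native sequence recovery, as used in these benchmarks, is simply the fraction of positions where the model's top-ranked amino acid matches the native residue; a minimal version of the metric (not the paper's evaluation code) is:

```python
import torch

def sequence_recovery(logits: torch.Tensor, native: torch.Tensor) -> float:
    """logits: (N, 20) per-residue predictions; native: (N,) true identities."""
    predicted = logits.argmax(dim=-1)  # top-ranked amino acid per position
    return (predicted == native).float().mean().item()
```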

Applications

MaskedProteinEnT is directly applicable to computational protein design, particularly in cases where a target structure is known or predicted and new sequences compatible with that structure must be generated. In antibody engineering, the fine-tuned model provides contextually aware sequence suggestions for CDR loops at the antibody-antigen interface, supporting lead optimization and humanization campaigns. For general protein engineering, the pretrained model enables rapid generation of diverse sequences that adopt a given fold, complementing physics-based design tools. The model's contextual representations also serve as features for downstream tasks such as binding affinity prediction, stability scoring, and interface hotspot identification. The Zenodo-deposited sampled sequences and structure datasets support further benchmarking by the community.
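For design use cases like these, candidate sequences are typically drawn by sampling from the structure-conditioned distribution rather than taking the argmax, so that a single backbone yields a diverse library. The sketch below shows one generic way to do this with temperature-scaled sampling; the function name and temperature knob are illustrative assumptions, not the authors' released sampler.

```python
import torch

def sample_sequences(logits, num_samples=10, temperature=1.0):
    """logits: (N, 20) per-residue predictions -> (num_samples, N) sequences.

    Lower temperature concentrates samples near the model's top choices;
    higher temperature explores more of the compatible sequence space.
    """
    probs = torch.softmax(logits / temperature, dim=-1)          # (N, 20)
    samples = torch.multinomial(probs, num_samples, replacement=True)
    return samples.T  # (num_samples, N) sampled residue indices
```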

Impact

MaskedProteinEnT contributed to establishing the paradigm of structure-conditioned masked language modeling as a principled approach to inverse folding and protein design, demonstrating that the information content of local structural environments is sufficient to substantially constrain sequence identity. The hierarchical fine-tuning approach provided a practical recipe for adapting general structural knowledge to specialized protein classes with limited labeled data — a challenge common in antibody engineering. Published in Cell Systems, the work added to a growing body of evidence that MLM-based objectives are broadly transferable across biomolecular modalities. A limitation is that the model relies on structural input, requiring a known or predicted structure to generate sequences; it does not operate in a purely sequence-to-sequence mode like ESM or ProtT5.

Tags

inverse folding, protein design, antibody design, graph neural network, transformer, self-supervised, representation learning, antibody

Resources

GitHub Repository
Research Paper