ESM-GearNet is a protein representation learning framework that systematically investigates how to best combine two complementary sources of biological information: amino acid sequence and three-dimensional protein structure. Developed by Zuobai Zhang, Chuanrui Wang, Minghao Xu, and colleagues at Mila — Quebec AI Institute and IBM Research under the direction of Jian Tang, the work addresses a fundamental tension in the field: protein language models (PLMs) trained on sequence alone capture rich evolutionary information but struggle to incorporate 3D geometry, while structure-based graph neural networks leverage spatial relationships but are bottlenecked by the limited number of experimentally determined structures available for pre-training.
The model integrates ESM-2, Meta AI's 650-million-parameter protein language model, with GearNet, a multi-relational graph neural network that encodes protein 3D geometry. Rather than pitting the two approaches against each other, ESM-GearNet treats them as complementary representations whose fusion yields embeddings richer than either modality alone. The study was released as an arXiv preprint in March 2023 and systematically evaluates three distinct fusion architectures and six pre-training objectives across multiple protein function benchmarks.
The core contribution is not a single architecture but a thorough empirical investigation that identifies which design choices matter most when combining sequence and structure. The finding that serial fusion — where ESM-2 embeddings initialize the node features of the GearNet structure encoder — consistently outperforms more complex cross-attention approaches provides clear, actionable guidance for practitioners building joint protein representations.
The default ESM-GearNet model pairs the ESM-2-650M sequence encoder with a 6-layer GearNet structure encoder using 512 hidden dimensions per layer. In serial fusion, the final-layer hidden states from ESM-2 replace the standard amino acid one-hot encodings as residue-level node features fed into GearNet. GearNet then applies relational message passing across the multi-relational protein graph, aggregating information from sequential, spatial, and k-NN neighbors under separate learned weight matrices. The full model's parameter count is dominated by ESM-2's 650 million parameters; GearNet contributes roughly 10 million additional parameters for the structural encoder layers.
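To make serial fusion concrete, the following is a minimal PyTorch sketch, not the authors' TorchDrug implementation: the module names (`RelationalGNNLayer`, `SerialFusion`) are invented for illustration, the 1280-dimensional input matches ESM-2-650M's published embedding width, and seven relation types are assumed, following the original GearNet graph construction.

```python
import torch
import torch.nn as nn

class RelationalGNNLayer(nn.Module):
    """GearNet-style relational message passing: one weight matrix per edge type."""
    def __init__(self, in_dim: int, out_dim: int, num_relations: int):
        super().__init__()
        self.relation_linears = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_relations)]
        )
        self.activation = nn.ReLU()

    def forward(self, x, edge_index, edge_type):
        # x: (num_residues, in_dim); edge_index: (2, num_edges); edge_type: (num_edges,)
        src, dst = edge_index
        out = torch.zeros(x.size(0), self.relation_linears[0].out_features, device=x.device)
        for r, linear in enumerate(self.relation_linears):
            mask = edge_type == r
            messages = linear(x[src[mask]])          # transform features of r-type neighbors
            out.index_add_(0, dst[mask], messages)   # sum messages into destination residues
        return self.activation(out)

class SerialFusion(nn.Module):
    """Serial fusion: ESM-2 residue embeddings become the structure encoder's node features."""
    def __init__(self, esm_dim=1280, hidden_dim=512, num_layers=6, num_relations=7):
        super().__init__()
        self.input_proj = nn.Linear(esm_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [RelationalGNNLayer(hidden_dim, hidden_dim, num_relations)
             for _ in range(num_layers)]
        )

    def forward(self, esm_embeddings, edge_index, edge_type):
        h = self.input_proj(esm_embeddings)          # replaces one-hot residue features
        for layer in self.layers:
            h = layer(h, edge_index, edge_type) + h  # residual connection per layer
        return h                                     # structure-aware per-residue representation
```

The key design point the sketch preserves is that each edge type gets its own learned weight matrix, so sequential, spatial, and k-NN neighborhoods contribute distinct messages rather than being averaged together.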
Pre-training was performed on the AlphaFold Database v1 (365K structures) using 4 NVIDIA A100 GPUs for 50 epochs with a batch size of 256 and a learning rate of 2×10⁻⁴. A critical implementation detail is the use of a differential learning rate for the PLM component: the ESM-2 weights are updated at one-tenth of the structure encoder's learning rate to protect the pre-trained sequence representations from being overwritten during structural pre-training. On the Enzyme Commission (EC) number prediction benchmark, the best ESM-GearNet variant achieves an Fmax of 0.897, compared to 0.730 for GearNet alone and 0.880 for ESM-2 alone, demonstrating clear synergy between modalities. On Gene Ontology (GO) biological process prediction, Fmax improves from 0.488 (no pre-training) to 0.514 with SiamDiff pre-training.
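That prescription maps directly onto PyTorch optimizer parameter groups. The snippet below is a sketch under the hyperparameters quoted above; the `Model` stub and its `esm` and `gearnet` attribute names are placeholders rather than the released code.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """Stand-in container; in practice these are ESM-2-650M and the GearNet encoder."""
    def __init__(self):
        super().__init__()
        self.esm = nn.Linear(1280, 1280)    # placeholder for the pre-trained PLM
        self.gearnet = nn.Linear(512, 512)  # placeholder for the structure encoder

model = Model()
base_lr = 2e-4  # learning rate reported for structural pre-training
optimizer = torch.optim.Adam([
    {"params": model.gearnet.parameters(), "lr": base_lr},    # full rate for the structure encoder
    {"params": model.esm.parameters(), "lr": base_lr * 0.1},  # one-tenth rate protects PLM weights
])
```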
ESM-GearNet is particularly well suited to protein function annotation tasks where both evolutionary context and structural geometry are informative. Key use cases include enzyme classification; Gene Ontology term prediction across the biological process, molecular function, and cellular component categories; protein stability prediction from mutations; and protein structure quality ranking. Researchers working on any task where AlphaFold-predicted structures are available (which now encompasses most characterized proteomes) can use ESM-GearNet's pre-trained weights as a strong initialization, fine-tuning on labeled datasets that may be substantially smaller than those required to train a structure encoder from scratch. The framework is implemented on top of TorchDrug and is accessible to researchers with standard GPU hardware.
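A fine-tuning setup along those lines might look like the sketch below, which reuses the `SerialFusion` module from the earlier example; the checkpoint filename is hypothetical, and the 538-way multi-label head matches the class count of the EC benchmark.

```python
import torch
import torch.nn as nn

# Load pre-trained weights into the encoder, then attach a task-specific head.
encoder = SerialFusion()                               # module from the earlier sketch
state = torch.load("esm_gearnet_pretrained.pt")        # hypothetical checkpoint name
encoder.load_state_dict(state, strict=False)           # ignore any pre-training-only keys

head = nn.Linear(512, 538)                             # 538 EC classes in the benchmark
criterion = nn.BCEWithLogitsLoss()                     # multi-label targets per protein

def predict_logits(esm_embeddings, edge_index, edge_type):
    residue_repr = encoder(esm_embeddings, edge_index, edge_type)
    protein_repr = residue_repr.mean(dim=0)            # mean-pool residues into one vector
    return head(protein_repr)                          # one logit per function label
```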
ESM-GearNet represents an important empirical contribution to the emerging consensus that protein representation learning should leverage both sequence and structure. By demonstrating that simple serial fusion of a pre-trained PLM with a geometric encoder outperforms more complex attention-based fusion on standard benchmarks, the paper provides clear architectural guidance that has influenced subsequent joint representation approaches. The systematic nature of the study, covering dozens of design choices in a controlled fashion, makes it a useful reference for practitioners choosing between fusion strategies. A limitation of the current work is its dependence on predicted structures from AlphaFold, which introduces biases from the structure prediction process and can degrade performance on proteins where AlphaFold confidence is low. The evaluation also focuses primarily on function annotation; how well joint representations transfer to protein design or fitness landscape modeling remains an open question.