Huazhong University of Science and Technology / Microsoft Research
Graph neural network framework for antigen-specific antibody CDR design, combining a pre-trained antibody language model with one-shot sequence and structure generation.
ABGNN is a computational framework for antigen-specific antibody design that combines a pre-trained antibody language model with a hierarchical graph neural network. Presented at the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023), the system addresses a core challenge in antibody engineering: jointly generating the amino acid sequence and three-dimensional structure of complementarity-determining regions (CDRs) given a specific antigen context.
Traditional approaches to CDR design generate amino acids autoregressively — one residue at a time — which accumulates prediction errors and is computationally expensive. ABGNN replaces this sequential procedure with a one-shot generation strategy that simultaneously predicts all residues in a CDR loop. This is made possible by a pre-trained antibody language model called AbBERT, which supplies rich sequence-level embeddings that inform both the sequence and structure generation components of the framework.
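The difference between the two decoding regimes can be sketched in a few lines. The function below is a toy illustration, not ABGNN's actual decoder: it takes per-position logits (a stand-in for model output) and predicts every CDR residue in a single parallel step, which is what "one-shot" means in practice.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

def one_shot_decode(logits):
    """Predict every CDR residue in parallel from per-position logits.

    logits: (L, 20) array, one row per masked CDR position.
    Unlike autoregressive decoding, no position conditions on a
    previously *predicted* residue, so errors cannot accumulate
    along the loop and generation costs a single forward pass.
    """
    return [AMINO_ACIDS[i] for i in logits.argmax(axis=-1)]

# Toy logits for a 5-residue CDR loop (random stand-in for model output).
logits = rng.normal(size=(5, 20))
cdr = one_shot_decode(logits)
```

An autoregressive decoder would instead run one forward pass per residue, feeding each prediction back in as context for the next; the parallel variant trades that conditioning for speed and robustness to early mistakes.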
ABGNN was developed through a collaboration between Kaiyuan Gao at Huazhong University of Science and Technology (HUST) and researchers at Microsoft Research AI4Science, including Lijun Wu. It builds on earlier work in geometric deep learning for molecular design and adapts transformer-based pre-training — well established in protein language modeling — to the narrow but immunologically critical domain of antibody sequences.
The ABGNN framework operates in two stages. In the pre-training stage, AbBERT is trained using a masked language modeling objective on antibody heavy and light chain sequences from the Observed Antibody Space (OAS) database. This gives the model an understanding of antibody-specific sequence grammar that general protein language models may not capture with equal fidelity, since antibody sequences occupy a restricted region of sequence space shaped by V(D)J recombination and somatic hypermutation.
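Masked language modeling, the pre-training objective named above, can be illustrated with a generic token-level sketch. The masking rate, mask token, and per-residue (rather than span-level) masking below are illustrative choices, not AbBERT's actual configuration:

```python
import random

def mask_for_mlm(seq, mask_rate=0.15, mask_token="#", rng=None):
    """BERT-style masked-language-modeling corruption: hide a fraction
    of residues and keep their identities as prediction targets.
    (mask_rate and mask_token are illustrative, not AbBERT's settings.)
    """
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = aa          # the model must recover this residue
        else:
            masked.append(aa)
    return "".join(masked), targets

heavy_chain = "EVQLVESGGGLVQPGGSLRLSCAAS"   # toy heavy-chain fragment
corrupted, targets = mask_for_mlm(heavy_chain)
```

Training then minimizes cross-entropy between the model's predictions at the masked positions and the hidden residues, which forces the model to internalize the sequence regularities of antibody repertoires.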
In the fine-tuning stage, the soft output distribution from AbBERT is passed to Hseq, a graph neural network where nodes represent residues and edges encode spatial proximity or sequence adjacency within the CDR scaffold. Hseq refines the sequence representation, which is then passed to Hstr for coordinate prediction. The two networks are trained jointly on antigen-antibody complex data. Fine-tuning experiments use the MEAN dataset for CDR generation benchmarks and HSRN docking data for antigen-binding tasks.
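The data flow of the fine-tuning stage can be sketched with toy tensors. Everything below — the layer width, the chain-only adjacency, the random parameters, and the single message-passing step — is an illustrative stand-in for the real Hseq/Hstr networks, showing only how a soft residue distribution becomes refined sequence logits and then coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)

def gnn_layer(h, adj, W):
    """One message-passing step: average neighbor features, transform,
    add residual. A minimal stand-in for the Hseq/Hstr GNN layers."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return h + np.tanh((adj @ h / deg) @ W)

L, d = 5, 16                                     # toy CDR length, feature width
soft_probs = rng.dirichlet(np.ones(20), size=L)  # stand-in for AbBERT's soft output
embed = rng.normal(scale=0.1, size=(20, d))      # residue-type embedding
adj = np.eye(L, k=1) + np.eye(L, k=-1)           # sequence-adjacency edges only (toy)

h = soft_probs @ embed                                        # initialize node features
h = gnn_layer(h, adj, rng.normal(scale=0.1, size=(d, d)))     # "Hseq" refinement
seq_logits = h @ embed.T                                      # refined sequence prediction
coords = gnn_layer(h, adj, rng.normal(scale=0.1, size=(d, d))) \
         @ rng.normal(scale=0.1, size=(d, 3))                 # "Hstr" coordinate head
```

In the real framework the graph also carries spatial edges from the antigen-antibody complex, and both networks are trained jointly so that sequence and structure predictions inform one another.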
On the CDR-H3 design benchmark, ABGNN achieves an amino acid recovery rate (AAR) of 39.63% and a structural RMSD of 1.56 Angstroms, improving on the prior MEAN baseline by approximately 3 percentage points in AAR. Antigen-binding evaluation is conducted on 60 test complexes spanning diverse antigen types.
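Both evaluation metrics are straightforward to state precisely. The sketch below gives standard definitions (AAR as per-position sequence identity, RMSD over pre-superposed coordinates); the exact evaluation protocol, such as which backbone atoms enter the RMSD, follows the benchmark rather than this code:

```python
import numpy as np

def aar(pred_seq, true_seq):
    """Amino acid recovery: fraction of positions where the designed
    residue matches the native one."""
    assert len(pred_seq) == len(true_seq)
    return sum(p == t for p, t in zip(pred_seq, true_seq)) / len(true_seq)

def rmsd(pred_xyz, true_xyz):
    """Root-mean-square deviation between predicted and native
    coordinates (assumes the structures are already superposed)."""
    d2 = ((pred_xyz - true_xyz) ** 2).sum(axis=-1)
    return float(np.sqrt(d2.mean()))

print(aar("ARDYW", "ARDGW"))   # 4 of 5 positions match -> 0.8
```

A lower RMSD alongside a higher AAR indicates that the designed loops recover both the native sequence and its three-dimensional geometry.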
ABGNN is primarily aimed at computational antibody discovery teams working on therapeutic antibody development. The framework is relevant when a target antigen structure is known and the goal is to generate CDR sequences — particularly the CDR-H3 loop, which dominates antigen contacts — that are both structurally plausible and likely to bind. This positions ABGNN within workflows for hit generation from antigen structures, complementing experimental display methods such as phage or yeast display by providing a ranked set of candidate sequences for synthesis and testing. The antibody optimization mode is additionally relevant for lead maturation, where an existing antibody with marginal affinity needs systematic sequence improvement.
ABGNN contributes to a growing body of work applying pre-training paradigms — originally developed for natural language and general protein modeling — to the more specialized domain of antibody engineering. Its one-shot generation approach has influenced subsequent methods that similarly reject autoregressive CDR decoding in favor of parallel prediction. The model's explicit coupling of sequence and structure prediction, rather than treating them as independent tasks, reflects a broader trend toward co-design frameworks in computational protein engineering. A practical limitation is that ABGNN requires antigen structural information as input, which restricts its direct applicability to targets without known or predicted structures, though the widespread availability of AlphaFold 2 predictions partially mitigates this constraint. The codebase is publicly available and was trained and evaluated on standard benchmark datasets, enabling direct comparison with subsequent methods in the field.
Gao, K., et al. (2023). Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023).
DOI: 10.1145/3580305.3599468