Huazhong University of Science and Technology / Microsoft Research
Graph neural network framework for antigen-specific antibody CDR design, combining a pre-trained antibody language model with one-shot sequence and structure generation.
ABGNN is a computational framework for antigen-specific antibody design that combines a pre-trained antibody language model with a hierarchical graph neural network. Presented at the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023), the system addresses a core challenge in antibody engineering: jointly generating the amino acid sequence and three-dimensional structure of complementarity-determining regions (CDRs) given a specific antigen context.
Traditional approaches to CDR design generate amino acids autoregressively — one residue at a time — which accumulates prediction errors and is computationally expensive. ABGNN replaces this sequential procedure with a one-shot generation strategy that simultaneously predicts all residues in a CDR loop. This is made possible by a pre-trained antibody language model called AbBERT, which supplies rich sequence-level embeddings that inform both the sequence and structure generation components of the framework.
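The difference between the two decoding regimes can be sketched in a few lines. The function below is a toy illustration, not ABGNN's actual decoder: it takes per-position logits (a stand-in for model output) and predicts every CDR residue in a single parallel step, which is what "one-shot" means in practice.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

def one_shot_decode(logits):
    """Predict every CDR residue in parallel from per-position logits.

    logits: (L, 20) array, one row per masked CDR position.
    Unlike autoregressive decoding, no position conditions on a
    previously *predicted* residue, so errors cannot accumulate
    along the loop and generation costs a single forward pass.
    """
    return [AMINO_ACIDS[i] for i in logits.argmax(axis=-1)]

# Toy logits for a 5-residue CDR loop (random stand-in for model output).
logits = rng.normal(size=(5, 20))
cdr = one_shot_decode(logits)
```

An autoregressive decoder would instead run one forward pass per residue, feeding each prediction back in as context for the next; the parallel variant trades that conditioning for speed and robustness to early mistakes.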
ABGNN was developed through a collaboration between Kaiyuan Gao at Huazhong University of Science and Technology (HUST) and researchers at Microsoft Research AI4Science, including Lijun Wu. It builds on earlier work in geometric deep learning for molecular design and adapts transformer-based pre-training — well established in protein language modeling — to the narrow but immunologically critical domain of antibody sequences.
The ABGNN framework operates in two stages. In the pre-training stage, AbBERT is trained using a masked language modeling objective on antibody heavy and light chain sequences from the Observed Antibody Space (OAS) database. This gives the model an understanding of antibody-specific sequence grammar that general protein language models may not capture with equal fidelity, since antibody sequences occupy a restricted region of sequence space shaped by V(D)J recombination and somatic hypermutation.
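Masked language modeling, the pre-training objective named above, can be illustrated with a generic token-level sketch. The masking rate, mask token, and per-residue (rather than span-level) masking below are illustrative choices, not AbBERT's actual configuration:

```python
import random

def mask_for_mlm(seq, mask_rate=0.15, mask_token="#", rng=None):
    """BERT-style masked-language-modeling corruption: hide a fraction
    of residues and keep their identities as prediction targets.
    (mask_rate and mask_token are illustrative, not AbBERT's settings.)
    """
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = aa          # the model must recover this residue
        else:
            masked.append(aa)
    return "".join(masked), targets

heavy_chain = "EVQLVESGGGLVQPGGSLRLSCAAS"   # toy heavy-chain fragment
corrupted, targets = mask_for_mlm(heavy_chain)
```

Training then minimizes cross-entropy between the model's predictions at the masked positions and the hidden residues, which forces the model to internalize the sequence regularities of antibody repertoires.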
In the fine-tuning stage, the soft output distribution from AbBERT is passed to Hseq, a graph neural network where nodes represent residues and edges encode spatial proximity or sequence adjacency within the CDR scaffold. Hseq refines the sequence representation, which is then passed to Hstr for coordinate prediction. The two networks are trained jointly on antigen-antibody complex data. Fine-tuning experiments use the MEAN dataset for CDR generation benchmarks and HSRN docking data for antigen-binding tasks.
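The data flow of the fine-tuning stage can be sketched with toy tensors. Everything below — the layer width, the chain-only adjacency, the random parameters, and the single message-passing step — is an illustrative stand-in for the real Hseq/Hstr networks, showing only how a soft residue distribution becomes refined sequence logits and then coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)

def gnn_layer(h, adj, W):
    """One message-passing step: average neighbor features, transform,
    add residual. A minimal stand-in for the Hseq/Hstr GNN layers."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return h + np.tanh((adj @ h / deg) @ W)

L, d = 5, 16                                     # toy CDR length, feature width
soft_probs = rng.dirichlet(np.ones(20), size=L)  # stand-in for AbBERT's soft output
embed = rng.normal(scale=0.1, size=(20, d))      # residue-type embedding
adj = np.eye(L, k=1) + np.eye(L, k=-1)           # sequence-adjacency edges only (toy)

h = soft_probs @ embed                                        # initialize node features
h = gnn_layer(h, adj, rng.normal(scale=0.1, size=(d, d)))     # "Hseq" refinement
seq_logits = h @ embed.T                                      # refined sequence prediction
coords = gnn_layer(h, adj, rng.normal(scale=0.1, size=(d, d))) \
         @ rng.normal(scale=0.1, size=(d, 3))                 # "Hstr" coordinate head
```

In the real framework the graph also carries spatial edges from the antigen-antibody complex, and both networks are trained jointly so that sequence and structure predictions inform one another.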
On the CDR-H3 design benchmark, ABGNN achieves an amino acid recovery rate (AAR) of 39.63% and a structural RMSD of 1.56 Angstroms, improving on the prior MEAN baseline by approximately 3 percentage points in AAR. Antigen-binding evaluation is conducted on 60 test complexes spanning diverse antigen types.
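Both evaluation metrics are straightforward to state precisely. The sketch below gives standard definitions (AAR as per-position sequence identity, RMSD over pre-superposed coordinates); the exact evaluation protocol, such as which backbone atoms enter the RMSD, follows the benchmark rather than this code:

```python
import numpy as np

def aar(pred_seq, true_seq):
    """Amino acid recovery: fraction of positions where the designed
    residue matches the native one."""
    assert len(pred_seq) == len(true_seq)
    return sum(p == t for p, t in zip(pred_seq, true_seq)) / len(true_seq)

def rmsd(pred_xyz, true_xyz):
    """Root-mean-square deviation between predicted and native
    coordinates (assumes the structures are already superposed)."""
    d2 = ((pred_xyz - true_xyz) ** 2).sum(axis=-1)
    return float(np.sqrt(d2.mean()))

print(aar("ARDYW", "ARDGW"))   # 4 of 5 positions match -> 0.8
```

A lower RMSD alongside a higher AAR indicates that the designed loops recover both the native sequence and its three-dimensional geometry.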
ABGNN is primarily aimed at computational antibody discovery teams working on therapeutic antibody development. The framework is relevant when a target antigen structure is known and the goal is to generate CDR sequences — particularly the CDR-H3 loop, which dominates antigen contacts — that are both structurally plausible and likely to bind. This positions ABGNN within workflows for hit generation from antigen structures, complementing experimental display methods such as phage or yeast display by providing a ranked set of candidate sequences for synthesis and testing. The antibody optimization mode is additionally relevant for lead maturation, where an existing antibody with marginal affinity needs systematic sequence improvement.
ABGNN contributes to a growing body of work applying pre-training paradigms — originally developed for natural language and general protein modeling — to the more specialized domain of antibody engineering. Its one-shot generation approach has influenced subsequent methods that similarly reject autoregressive CDR decoding in favor of parallel prediction. The model's explicit coupling of sequence and structure prediction, rather than treating them as independent tasks, reflects a broader trend toward co-design frameworks in computational protein engineering. A practical limitation is that ABGNN requires antigen structural information as input, which restricts its direct applicability to targets without known or predicted structures, though the widespread availability of AlphaFold 2 predictions partially mitigates this constraint. The codebase is publicly available and was trained and evaluated on standard benchmark datasets, enabling direct comparison with subsequent methods in the field.
Gao, K., et al. (2023). Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023).
DOI: 10.1145/3580305.3599468