IgLM (Immunoglobulin Language Model) is a deep generative language model developed at the Gray Lab at Johns Hopkins University for the design of synthetic antibody libraries. The model addresses a fundamental limitation of earlier autoregressive sequence generators: because those models read sequences strictly left-to-right, they cannot use downstream context when designing internal regions such as complementarity-determining region (CDR) loops. IgLM solves this by framing antibody design as a text-infilling problem — the same formulation used in natural language processing for fill-in-the-blank tasks — allowing the model to draw on both upstream and downstream sequence context when generating variable-length spans within an existing antibody framework.
Trained on 558 million heavy- and light-chain variable region sequences, IgLM captures the statistical patterns governing functional antibody sequences across multiple species. Each training sequence is conditioned on its chain type (heavy or light) and species of origin, giving the model explicit control over these biological properties during generation. This conditioning means researchers can direct the model to produce human-like sequences, mouse sequences, or other species-specific variants without post-hoc filtering.
Published in Cell Systems in November 2023 by Richard Shuai, Jeffrey Ruffolo, and Jeffrey Gray, IgLM was validated by generating libraries for 49 therapeutic antibody targets and evaluating the resulting sequences against multiple in silico developability metrics.
Bidirectional infilling: By rearranging masked spans to the end of the input sequence during training, IgLM learns to generate arbitrary-length sequence segments conditioned on both the preceding and following context — a capability unavailable to standard left-to-right language models.
Massive training corpus: The model was trained on 558 million antibody heavy- and light-chain variable sequences, providing broad coverage of natural immunoglobulin sequence diversity across species and repertoires.
Chain type and species conditioning: Control tokens specify whether the output should be a heavy or light chain and which species it should resemble, enabling targeted generation of human-like sequences directly rather than relying on downstream humanization steps.
CDR loop diversification: IgLM can generate diverse libraries targeting specific CDR loops (including the highly variable CDR H3) while the surrounding framework sequence remains fixed, producing variants that retain the original antibody's binding region geometry; see the infilling sketch after this list.
Improved in silico developability: Libraries generated by IgLM display lower predicted aggregation propensity and improved estimated solubility and human-likeness compared to parent sequences, as assessed by in silico developability tools including BioPhi's OASis humanness score.
Flexible generation modes: The model supports both full-length antibody sequence generation from scratch and targeted infilling of specific sub-regions within an existing sequence, making it adaptable to a wide range of design scenarios.
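To make the conditioning and infilling controls above concrete, the sketch below shows how a CDR loop might be diversified with the iglm Python package released alongside the paper. The method and argument names (IgLM(), infill, infill_range, num_to_generate) are recalled from the project's README and should be treated as assumptions to verify against the Graylab/IgLM repository; the parent sequence and span indices are purely illustrative.

```python
# Hedged sketch of CDR diversification with the iglm package; method and
# argument names are assumptions recalled from the README, not verified.
from iglm import IgLM

model = IgLM()  # loads the pre-trained IgLM weights

# Toy heavy-chain variable region; not a real therapeutic antibody.
parent = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY"
    "ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTLVTVSS"
)

# Re-design an internal span (roughly where CDR H3 sits) while the flanking
# framework stays fixed; the indices are illustrative, not ANARCI-derived.
variants = model.infill(
    parent,
    "[HEAVY]",             # chain-type conditioning token
    "[HUMAN]",             # species conditioning token
    infill_range=(96, 100),
    num_to_generate=100,
)
print(len(variants), variants[0])
```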
IgLM uses a decoder-only transformer architecture adapted from GPT-2 with an embedding and hidden dimension of 512, feed-forward layers of size 2048, 4 transformer layers, and 8 attention heads per layer. This yields approximately 12.9 million trainable parameters — a deliberately compact model that can run efficiently on a single GPU. The infilling capability is achieved without modifying the underlying architecture: during training, randomly selected sequence spans are masked and moved to the end of the sequence, then the model is trained to predict them autoregressively. At inference time, this rearrangement strategy allows the model to condition generation on surrounding context that would be inaccessible to a standard left-to-right decoder.
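A minimal, self-contained sketch of that rearrangement is shown below. The special tokens ([MASK], [SEP], [ANS]) and the helper function are illustrative stand-ins, not IgLM's actual vocabulary or training code.

```python
# Illustrative span rearrangement for infilling; token names are hypothetical.
def rearrange_for_infilling(seq: str, start: int, end: int) -> str:
    """Cut out seq[start:end], mark where it came from, and append it after a
    separator so a left-to-right decoder sees both flanks before predicting
    the masked residues autoregressively."""
    masked_context = seq[:start] + "[MASK]" + seq[end:]
    answer_span = seq[start:end]
    return f"{masked_context} [SEP] {answer_span} [ANS]"

# Hide a short internal stretch of a toy heavy-chain fragment.
print(rearrange_for_infilling("EVQLVESGGGLVQPGGSLRLSCAAS", 8, 14))
# EVQLVESG[MASK]GGSLRLSCAAS [SEP] GGLVQP [ANS]
```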
The training dataset of 558 million sequences was drawn from the Observed Antibody Space (OAS) database, representing paired and unpaired heavy and light chain sequences from human, mouse, rat, rabbit, rhesus macaque, and other species. IgLM achieves an average infilling perplexity of 1.53 on the held-out test set. In comparative evaluations, IgLM-generated sequences showed lower perplexity (higher naturalness) than ProGen2-OAS-generated sequences for CDR infilling tasks, and the generated libraries consistently scored higher on human-likeness metrics.
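For reference, perplexity here is the exponential of the mean per-token negative log-likelihood, so values close to 1.0 indicate the model assigns high probability to the held-out residues. The sketch below computes it from toy log-probabilities, which are not real IgLM outputs.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood per token); lower is better."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy stand-ins for the probabilities a model assigns to an infilled span.
print(perplexity([math.log(0.7), math.log(0.6), math.log(0.65)]))  # ~1.54
```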
IgLM is suited to any antibody engineering workflow that involves library generation, lead optimization, or humanization. Researchers can use it to diversify the CDR loops of a known binder while preserving the framework, a common need in affinity maturation campaigns where variants of a lead antibody are explored. It can also generate full-length antibody sequences from scratch for use as starting scaffolds in de novo discovery. Because the model conditions on species and chain type, it can be used to produce humanized variants of murine antibodies directly, reducing the need for iterative manual CDR grafting. The model was specifically benchmarked against 49 therapeutic antibody targets, demonstrating applicability to real-world drug discovery pipelines.
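As an illustration of the from-scratch generation mode, the sketch below again uses the iglm package, this time to sample full-length human heavy-chain variable regions from a short N-terminal prompt; as before, the exact method and argument names are assumptions to check against the repository.

```python
# Hedged sketch of full-length generation with the iglm package; argument
# names are recalled from the README and may differ in the released version.
from iglm import IgLM

model = IgLM()

# Sample human heavy-chain variable regions, optionally seeded with a prompt.
sequences = model.generate(
    "[HEAVY]",               # chain-type conditioning token
    "[HUMAN]",               # species conditioning token
    prompt_sequence="EVQ",   # short N-terminal prompt (optional)
    num_to_generate=10,
)
print(sequences[0])
```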
IgLM established infilling as a productive paradigm for antibody sequence design and demonstrated that a relatively compact language model (fewer than 13 million parameters) trained exclusively on immunoglobulin sequences can outperform much larger general-purpose protein language models on antibody-specific tasks. The code and pre-trained weights are publicly available under the JHU Academic Software License, which has made the model accessible to academic groups and contributed to a growing ecosystem of antibody-focused generative models, including subsequent work such as p-IgGen and AbGPT. A key limitation is that IgLM operates at the sequence level only and does not incorporate structural information; generated sequences must be separately evaluated for structure, binding affinity, and experimental developability before wet-lab synthesis.
Shuai, R. W., Ruffolo, J. A., & Gray, J. J. (2023). IgLM: Infilling language modeling for antibody sequence design. Cell Systems, 14(11), 979-989.e4. https://doi.org/10.1016/j.cels.2023.10.001