AbLang is a transformer-based language model developed by the Oxford Protein Informatics Group (OPIG) and trained exclusively on antibody sequences from the Observed Antibody Space (OAS) database. Released in 2022, it is purpose-built for antibody-specific tasks rather than general protein modelling, enabling it to capture the distinctive sequence patterns of immunoglobulins more precisely than models trained on broad protein corpora.
The central problem AbLang addresses is the prevalence of incomplete sequences in large-scale antibody datasets. Over 40% of sequences in OAS are missing the first 15 N-terminal residues, a systematic artefact of common high-throughput sequencing protocols. These truncated sequences cannot be reliably used for structure prediction, similarity analysis, or therapeutic development without restoration. AbLang solves this by treating missing-residue recovery as a masked-language-modelling task and learning the strong positional and compositional biases inherent to antibody N-termini.
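The restoration setup can be illustrated with a toy sketch: mask tokens stand in for the missing N-terminal residues, and a predictor fills each masked position. Here a position-wise consensus over a few invented reference sequences stands in for AbLang's trained transformer; the sequences and the expected length are illustrative assumptions, not real antibody data.

```python
from collections import Counter

# Hypothetical complete N-termini (first 15 positions); invented examples,
# not sequences from OAS.
REFERENCE = [
    "EVQLVESGGGLVQPG",
    "EVQLVESGGGLVKPG",
    "QVQLVESGGGLVQPG",
]

def mask_missing(truncated, expected_len):
    """Prepend '*' mask tokens for the missing N-terminal residues."""
    n_missing = expected_len - len(truncated)
    return "*" * n_missing + truncated

def consensus_fill(masked, reference):
    """Fill each '*' with the most common residue at that position in the
    reference set -- a toy stand-in for the model's learned prediction."""
    filled = []
    for i, aa in enumerate(masked):
        if aa == "*":
            counts = Counter(ref[i] for ref in reference)
            filled.append(counts.most_common(1)[0][0])
        else:
            filled.append(aa)
    return "".join(filled)

# A sequence missing its first five residues is masked, then restored.
print(consensus_fill(mask_missing("ESGGGLVQPG", 15), REFERENCE))
# → EVQLVESGGGLVQPG
```

AbLang replaces the consensus step with a transformer that conditions on the entire observed sequence, which is what lets it recover positions where germline usage alone is ambiguous.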
By training separate models for heavy and light chains, AbLang achieves domain-specific representations that outperform general protein language models such as ESM-1b on antibody benchmarks, while running approximately seven times faster — an important practical advantage when processing millions of sequences from immune repertoire studies.
AbLang is built on a BERT-style transformer architecture with 12 attention blocks, 12 attention heads per block, a hidden dimension of 768, and a feedforward inner dimension of 3,072. The maximum sequence length is 160 positions. Positional embeddings are learned during training rather than fixed. Separate model instances are trained for heavy and light chains, each with the same architecture but distinct weights and training corpora. Each chain model contains roughly 86 million parameters.
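A back-of-the-envelope tally confirms these dimensions are consistent with the stated parameter count. The vocabulary size used below (~24 tokens: 20 amino acids plus a few special tokens) is an assumption, and the tally covers the encoder only, not the decoding head:

```python
hidden, ffn, layers, max_len, vocab = 768, 3072, 12, 160, 24

attn = 4 * (hidden * hidden + hidden)              # Q, K, V, output projections
ffwd = hidden * ffn + ffn + ffn * hidden + hidden  # two dense layers with biases
norms = 2 * 2 * hidden                             # two LayerNorms per block
per_block = attn + ffwd + norms

embeddings = vocab * hidden + max_len * hidden     # token + learned positional
total = layers * per_block + embeddings
print(f"{total / 1e6:.1f}M parameters")            # → 85.2M parameters
```

The result (~85.2M) lands within rounding distance of the 86 million figure, with the small remainder attributable to vocabulary details and the output head.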
Heavy chain models were trained for 20 epochs with a batch size of 8,192; light chain models for 40 epochs with a batch size of 4,096. Both used the Adam optimizer with cosine learning rate decay and a peak learning rate of 0.0002. Training data was sourced from the publicly available OAS database. On N-terminal restoration benchmarks, AbLang achieves 98% per-position accuracy for heavy chains and 96% for light chains over the first 15 positions — a substantial improvement over both IMGT-based germline assignment and the general-purpose ESM-1b model tested on the same task.
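The cited schedule can be sketched as follows. The peak learning rate of 2e-4 comes from the text above; the warmup length and total step count are illustrative assumptions, not values reported for AbLang:

```python
import math

PEAK_LR = 2e-4  # peak learning rate stated in the text

def cosine_lr(step, total_steps, warmup=100):
    """Linear warmup to PEAK_LR, then cosine decay to zero.

    Warmup and total_steps are illustrative; the actual training
    configuration may differ.
    """
    if step < warmup:
        return PEAK_LR * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))

# Learning rate rises to the peak, then decays smoothly to zero.
for step in (0, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

In practice the same curve is available off the shelf, e.g. via PyTorch's `CosineAnnealingLR`, so a hand-rolled schedule like this is mainly useful for checking intuitions.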
AbLang is primarily used to rescue incomplete sequences from large antibody sequencing campaigns, enabling researchers to restore millions of OAS entries that would otherwise be excluded from downstream analyses. The high-quality embeddings produced by AbRep, AbLang's representation module, are also used as input features for machine learning models that predict binding affinity, immunogenicity, developability, and stability of antibody candidates. In structural biology workflows, AbLang-completed sequences can be passed directly to antibody structure prediction tools such as ABodyBuilder or AlphaFold-Multimer, which require full-length input. Immune repertoire studies benefit from standardized, complete sequences when comparing diversity metrics or identifying clonally related antibodies across samples.
AbLang demonstrated that domain-specific language models — even when trained on a fraction of the data used by general protein models — can outperform their broader counterparts on specialised tasks. The model has been adopted within computational antibody discovery pipelines and influenced subsequent antibody-specific modelling efforts, including AbLang-2, which extended the approach with a larger training corpus and improved CDR loop representations. A key limitation is the focus on sequence-level modelling: AbLang does not predict 3D structure directly, and its representations may not fully capture conformational flexibility in the highly variable CDR-H3 loop. Additionally, the light chain training set (187K sequences) is substantially smaller than the heavy chain corpus, which may limit generalisation to rare light chain subtypes.
Olsen, T. H., Moal, I. H., & Deane, C. M. (2022). AbLang: an antibody language model for completing antibody sequences. Bioinformatics Advances, 2(1), vbac046. doi:10.1093/bioadv/vbac046