AntiBERTa is an antibody-specific language model developed by Alchemab Therapeutics and published in Patterns, a Cell Press journal, in July 2022. The model applies a BERT-style transformer architecture to antibody sequences, treating the amino acid string of a B cell receptor (BCR) as a natural language that encodes structural and functional information. Where general-purpose protein language models such as ESM are trained across the entire protein universe, AntiBERTa is deliberately specialized: every parameter is shaped by the distinctive sequence grammar of antibodies, including the hypervariable complementarity-determining regions (CDRs) that determine antigen binding.
The model was pre-trained using masked language modeling on 57 million unpaired BCR sequences drawn from the Observed Antibody Space (OAS) database — 42 million heavy-chain and 15 million light-chain sequences. It was subsequently fine-tuned on paratope prediction using structural data from the Structural Antibody Database (SAbDab). Training was performed on NVIDIA's Cambridge-1 supercomputer. AntiBERTa demonstrated state-of-the-art performance on paratope prediction at the time of publication, outperforming earlier sequence- and structure-based methods across multiple evaluation metrics.
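As a rough illustration of the masked-language-modeling objective, the sketch below applies the standard BERT masking recipe (mask 15% of residues; of those, 80% become a mask token, 10% a random amino acid, 10% are left unchanged) to an integer-encoded antibody sequence. The vocabulary layout, special-token ids, and exact masking ratios here are assumptions chosen to illustrate the objective, not details taken from the paper.

```python
import torch

# Hypothetical vocabulary: 20 amino acids plus assumed special-token ids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = 21

def encode(seq: str) -> torch.Tensor:
    """Map a single-letter amino acid string to integer token ids."""
    return torch.tensor([AMINO_ACIDS.index(a) for a in seq])

def mask_for_mlm(tokens: torch.Tensor, mlm_prob: float = 0.15):
    """BERT-style masking: the loss is computed only at masked positions (label -100 elsewhere)."""
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < mlm_prob
    labels[~selected] = -100

    inputs = tokens.clone()
    masked = selected & (torch.rand(tokens.shape) < 0.8)
    inputs[masked] = MASK_ID                                             # 80% of selected -> [MASK]
    randomized = selected & ~masked & (torch.rand(tokens.shape) < 0.5)
    inputs[randomized] = torch.randint(0, 20, (int(randomized.sum()),))  # 10% -> random amino acid
    return inputs, labels                                                # remaining 10% kept unchanged

# Example on a short, made-up CDR-H3-like fragment.
inputs, labels = mask_for_mlm(encode("ARDRSTGGYFDY"))
print(inputs)
print(labels)
```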
AntiBERTa is a 12-layer bidirectional transformer following the BERT architecture. Each layer applies multi-head self-attention over the full antibody sequence, allowing residues anywhere in the chain to influence the representation of any other residue. This global receptive field is critical for antibodies because CDR loops are spatially proximal in the folded structure even when separated in sequence. The model was implemented in PyTorch 1.9.0 using Hugging Face Transformers 4.7.0, and tokenization operates at the single amino acid level.
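A minimal Hugging Face Transformers sketch of such an encoder is shown below: a 12-layer BERT-style masked language model over a single-amino-acid vocabulary. The hidden size, head count, vocabulary size, and maximum sequence length are assumptions chosen to match a BERT-base-like layout, not hyperparameters quoted from the paper.

```python
from transformers import BertConfig, BertForMaskedLM

# Assumed BERT-base-like hyperparameters for a 12-layer antibody encoder;
# the small vocabulary covers 20 amino acids plus a handful of special tokens.
config = BertConfig(
    vocab_size=25,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,  # far longer than a single antibody variable domain (~120 residues)
)

model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} trainable parameters")  # roughly 86 million with this small vocabulary
```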
Pre-training data comprised 42 million unpaired heavy-chain and 15 million unpaired light-chain BCR sequences from OAS, covering natural antibody diversity across human donors and disease states. Fine-tuning for paratope prediction used a snapshot of SAbDab from August 2021. The model generates per-residue embeddings that can be read out directly for token-classification tasks (paratope labeling) or aggregated for sequence-level tasks. On paratope prediction benchmarks, AntiBERTa improved upon earlier sequence- and structure-based paratope predictors such as Parapred in residue-level precision, recall, and F1. The 12-layer BERT-base-style configuration, paired with a roughly 25-token amino acid vocabulary, corresponds to approximately 86 million parameters; the ~110 million figure usually quoted for BERT-base includes a 30,000-token WordPiece embedding table that an amino-acid vocabulary does not need.
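Because fine-tuning is framed as per-residue labeling, it maps directly onto the token-classification interface in Hugging Face Transformers. The sketch below attaches a binary paratope/non-paratope head to a pre-trained encoder; the checkpoint path is a placeholder (not the published weights location), and the input tensors are stand-ins for real tokenized sequences and SAbDab-derived labels.

```python
import torch
from transformers import BertForTokenClassification

# Placeholder checkpoint path; point this at the published AntiBERTa weights in practice.
model = BertForTokenClassification.from_pretrained(
    "path/to/antiberta-pretrained",
    num_labels=2,  # 0 = non-paratope residue, 1 = paratope residue
)

# Stand-in batch: one heavy-chain sequence of 120 token ids with per-residue paratope labels.
input_ids = torch.randint(0, 25, (1, 120))
labels = torch.randint(0, 2, (1, 120))

outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # fine-tunes the encoder and the new classification head jointly
```

In this setup the pre-trained encoder supplies the per-residue representations and only the small classification head is trained from scratch, which is what allows the limited structural annotation available in SAbDab to go a long way.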
AntiBERTa is used in antibody drug discovery and immunological research workflows where sequence-level understanding of BCR repertoires is needed. Therapeutic antibody programs can apply the model to rank and filter candidate sequences by predicted paratope quality or to identify convergent antibodies across patient cohorts that may indicate shared protective responses. B cell repertoire analysis pipelines benefit from the rich contextual embeddings for clustering and annotation tasks. Vaccine researchers can use the model to characterize antibody responses following immunization or infection, identifying sequences enriched in antigen-specific binding positions. The model is also appropriate as a sequence encoder in multi-task learning settings where paratope annotation is one component of a broader developability or affinity prediction pipeline.
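As one concrete pattern for the repertoire-analysis use case, the sketch below mean-pools the final hidden states into a fixed-length vector per sequence and clusters the vectors with k-means. The checkpoint path is a placeholder, the input batch is a stand-in for tokenized BCR sequences, and the pooling and clustering choices are illustrative assumptions rather than the paper's protocol.

```python
import torch
from sklearn.cluster import KMeans
from transformers import BertModel

# Placeholder checkpoint path; point this at the published AntiBERTa weights in practice.
encoder = BertModel.from_pretrained("path/to/antiberta-pretrained")
encoder.eval()

def embed(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-residue hidden states into one embedding per BCR sequence."""
    with torch.no_grad():
        hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1)          # zero out padded positions before averaging
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Stand-in batch: 64 padded sequences of token ids with a matching attention mask.
ids = torch.randint(0, 25, (64, 130))
attn = torch.ones_like(ids)

vectors = embed(ids, attn).numpy()
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(vectors)  # group similar BCRs for annotation
```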
AntiBERTa established that domain-specialized pre-training on antibody sequences yields meaningful gains over general protein language models for antibody-centric tasks, influencing successors such as AntiBERTa2 and informing the broader family of antibody-specific language models that includes IgLM. The paper demonstrated that self-supervised learning on large unlabeled BCR repertoire data transfers effectively to supervised tasks with limited structural annotations, a practically important finding given how sparse experimentally determined antibody-antigen structures remain relative to the size of natural repertoires. A key limitation is that AntiBERTa encodes heavy and light chains independently rather than as paired sequences, which means it does not model the interface between the two chains that shapes the complete binding site. This gap was a direct motivation for the paired-sequence models that followed. The model weights and training code are publicly available on GitHub, enabling reproducibility and community fine-tuning on new task-specific datasets.
Leem, J., Mitchell, L.S., Farmery, J.H.R., Barton, J., & Galson, J.D. (2022). Deciphering the language of antibodies using self-supervised learning. Patterns, 3(7), 100513.
DOI: 10.1016/j.patter.2022.100513