AntiBERTa is an antibody-specific language model developed by Alchemab Therapeutics and published in Patterns, a Cell Press journal, in July 2022. The model applies a BERT-style transformer architecture to antibody sequences, treating the amino acid string of a B cell receptor (BCR) as a natural language that encodes structural and functional information. Where general-purpose protein language models such as ESM are trained across the entire protein universe, AntiBERTa is deliberately specialized: every parameter is shaped by the distinctive sequence grammar of antibodies, including the hypervariable complementarity-determining regions (CDRs) that determine antigen binding.
The model was pre-trained using masked language modeling on 57 million unpaired BCR sequences drawn from the Observed Antibody Space (OAS) database — 42 million heavy-chain and 15 million light-chain sequences. It was subsequently fine-tuned on paratope prediction using structural data from the Structural Antibody Database (SAbDab). Training was performed on NVIDIA's Cambridge-1 supercomputer. AntiBERTa demonstrated state-of-the-art performance on paratope prediction at the time of publication, outperforming earlier sequence- and structure-based methods across multiple evaluation metrics.
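As a rough illustration of the masked-language-modeling objective, the sketch below applies the standard BERT masking recipe (mask 15% of residues; of those, 80% become a mask token, 10% a random amino acid, 10% are left unchanged) to an integer-encoded antibody sequence. The vocabulary layout, special-token ids, and exact masking ratios here are assumptions chosen to illustrate the objective, not details taken from the paper.

```python
import torch

# Hypothetical vocabulary: 20 amino acids plus assumed special-token ids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = 21

def encode(seq: str) -> torch.Tensor:
    """Map a single-letter amino acid string to integer token ids."""
    return torch.tensor([AMINO_ACIDS.index(a) for a in seq])

def mask_for_mlm(tokens: torch.Tensor, mlm_prob: float = 0.15):
    """BERT-style masking: the loss is computed only at masked positions (label -100 elsewhere)."""
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < mlm_prob
    labels[~selected] = -100

    inputs = tokens.clone()
    masked = selected & (torch.rand(tokens.shape) < 0.8)
    inputs[masked] = MASK_ID                                             # 80% of selected -> [MASK]
    randomized = selected & ~masked & (torch.rand(tokens.shape) < 0.5)
    inputs[randomized] = torch.randint(0, 20, (int(randomized.sum()),))  # 10% -> random amino acid
    return inputs, labels                                                # remaining 10% kept unchanged

# Example on a short, made-up CDR-H3-like fragment.
inputs, labels = mask_for_mlm(encode("ARDRSTGGYFDY"))
print(inputs)
print(labels)
```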
AntiBERTa is a 12-layer bidirectional transformer following the BERT architecture. Each layer applies multi-head self-attention over the full antibody sequence, allowing residues anywhere in the chain to influence the representation of any other residue. This global receptive field is critical for antibodies because CDR loops are spatially proximal in the folded structure even when separated in sequence. The model was implemented in PyTorch 1.9.0 using Hugging Face Transformers 4.7.0, and tokenization operates at the single amino acid level.
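A minimal Hugging Face Transformers sketch of such an encoder is shown below: a 12-layer BERT-style masked language model over a single-amino-acid vocabulary. The hidden size, head count, vocabulary size, and maximum sequence length are assumptions chosen to match a BERT-base-like layout, not hyperparameters quoted from the paper.

```python
from transformers import BertConfig, BertForMaskedLM

# Assumed BERT-base-like hyperparameters for a 12-layer antibody encoder;
# the small vocabulary covers 20 amino acids plus a handful of special tokens.
config = BertConfig(
    vocab_size=25,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,  # far longer than a single antibody variable domain (~120 residues)
)

model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} trainable parameters")  # roughly 86 million with this small vocabulary
```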
Pre-training data comprised 42 million unpaired heavy-chain and 15 million unpaired light-chain BCR sequences from OAS, covering natural antibody diversity across human donors and disease states. Fine-tuning for paratope prediction used a snapshot of SAbDab from August 2021. The model generates per-residue embeddings that can be read out directly for token-classification tasks (paratope labeling) or aggregated for sequence-level tasks. On paratope prediction benchmarks, AntiBERTa improved upon earlier sequence- and structure-based paratope predictors such as Parapred in residue-level precision, recall, and F1. The 12-layer BERT-base-style configuration, paired with a roughly 25-token amino acid vocabulary, corresponds to approximately 86 million parameters; the ~110 million figure usually quoted for BERT-base includes a 30,000-token WordPiece embedding table that an amino-acid vocabulary does not need.
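Because fine-tuning is framed as per-residue labeling, it maps directly onto the token-classification interface in Hugging Face Transformers. The sketch below attaches a binary paratope/non-paratope head to a pre-trained encoder; the checkpoint path is a placeholder (not the published weights location), and the input tensors are stand-ins for real tokenized sequences and SAbDab-derived labels.

```python
import torch
from transformers import BertForTokenClassification

# Placeholder checkpoint path; point this at the published AntiBERTa weights in practice.
model = BertForTokenClassification.from_pretrained(
    "path/to/antiberta-pretrained",
    num_labels=2,  # 0 = non-paratope residue, 1 = paratope residue
)

# Stand-in batch: one heavy-chain sequence of 120 token ids with per-residue paratope labels.
input_ids = torch.randint(0, 25, (1, 120))
labels = torch.randint(0, 2, (1, 120))

outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # fine-tunes the encoder and the new classification head jointly
```

In this setup the pre-trained encoder supplies the per-residue representations and only the small classification head is trained from scratch, which is what allows the limited structural annotation available in SAbDab to go a long way.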
AntiBERTa is used in antibody drug discovery and immunological research workflows where sequence-level understanding of BCR repertoires is needed. Therapeutic antibody programs can apply the model to rank and filter candidate sequences by predicted paratope quality or to identify convergent antibodies across patient cohorts that may indicate shared protective responses. B cell repertoire analysis pipelines benefit from the rich contextual embeddings for clustering and annotation tasks. Vaccine researchers can use the model to characterize antibody responses following immunization or infection, identifying sequences enriched in antigen-specific binding positions. The model is also appropriate as a sequence encoder in multi-task learning settings where paratope annotation is one component of a broader developability or affinity prediction pipeline.
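As one concrete pattern for the repertoire-analysis use case, the sketch below mean-pools the final hidden states into a fixed-length vector per sequence and clusters the vectors with k-means. The checkpoint path is a placeholder, the input batch is a stand-in for tokenized BCR sequences, and the pooling and clustering choices are illustrative assumptions rather than the paper's protocol.

```python
import torch
from sklearn.cluster import KMeans
from transformers import BertModel

# Placeholder checkpoint path; point this at the published AntiBERTa weights in practice.
encoder = BertModel.from_pretrained("path/to/antiberta-pretrained")
encoder.eval()

def embed(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-residue hidden states into one embedding per BCR sequence."""
    with torch.no_grad():
        hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1)          # zero out padded positions before averaging
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Stand-in batch: 64 padded sequences of token ids with a matching attention mask.
ids = torch.randint(0, 25, (64, 130))
attn = torch.ones_like(ids)

vectors = embed(ids, attn).numpy()
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(vectors)  # group similar BCRs for annotation
```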
AntiBERTa established that domain-specialized pre-training on antibody sequences yields meaningful gains over general protein language models for antibody-centric tasks, influencing successors such as AntiBERTa2 and informing the broader family of antibody-specific language models that includes IgLM. The paper demonstrated that self-supervised learning on large unlabeled BCR repertoire data transfers effectively to supervised tasks with limited structural annotations, a practically important finding given how sparse experimentally determined antibody-antigen structures remain relative to the size of natural repertoires. A key limitation is that AntiBERTa encodes heavy and light chains independently rather than as paired sequences, which means it does not model the interface between the two chains that shapes the complete binding site. This gap was a direct motivation for the paired-sequence models that followed. The model weights and training code are publicly available on GitHub, enabling reproducibility and community fine-tuning on new task-specific datasets.
Leem, J., Mitchell, L.S., Farmery, J.H.R., Barton, J., & Galson, J.D. (2022). Deciphering the language of antibodies using self-supervised learning. Patterns, 3(7), 100513.
DOI: 10.1016/j.patter.2022.100513