A RoBERTa transformer pretrained solely on antibody heavy-chain CDR-H3 sequences, producing embeddings for repertoire analysis, B-cell maturation profiling, and bnAb classification.
H3BERTa is a protein language model trained exclusively on the third complementarity-determining region of the antibody heavy chain (CDR-H3), the loop that contributes most of the diversity and antigen-binding specificity of an antibody. Whereas most antibody language models learn over the entire variable domain, H3BERTa concentrates its modeling capacity on this single hypervariable region, testing the hypothesis that CDR-H3 alone carries enough signal to characterize immune repertoires. It was developed by Chiara Rodella and Thomas Lemmin at the Institute of Biochemistry and Molecular Medicine, University of Bern, and released as a bioRxiv preprint in November 2025.
The authors explicitly distinguish H3BERTa from two related approaches: AntiBERTa, which is trained on full variable-region sequences, and the CDR-Masked Paired Antibody Language Model, which fine-tunes the general-purpose ESM-2 protein model. H3BERTa instead pretrains a RoBERTa encoder from scratch on CDR-H3 sequences only, so that its learned representations are shaped entirely by the statistics of this region.
By focusing on CDR-H3, H3BERTa aims to provide a lightweight, region-specific foundation for antibody repertoire analysis — extracting embeddings that reflect germline gene usage and B-cell maturation state, scoring sequences by pseudo-perplexity, and serving as a frozen feature extractor for downstream classification tasks with limited labeled data.
H3BERTa adopts a RoBERTa-base encoder architecture (an encoder-only transformer trained with masked language modeling) with approximately 85.7M parameters, a maximum sequence length of 100 amino acids, and a compact 25-token vocabulary covering the 20 standard amino acids plus special tokens. Pretraining used more than 17 million curated CDR-H3 sequences drawn from healthy-donor repertoires in the Observed Antibody Space (IgG and IgA isotypes). The authors evaluate the model through three lenses: zero-shot embedding analysis (recovering J-gene usage and maturation signals), pseudo-perplexity comparisons separating healthy from HIV-1-derived repertoires, and few-shot classification of broadly neutralizing antibodies built on frozen embeddings with SVM and GAN-BERTa classifiers. The training, validation, and test datasets, along with all trained model weights, are deposited on Zenodo, mirroring the GitHub repository.
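To make the vocabulary and length figures concrete, the following is a minimal sketch of a character-level CDR-H3 tokenizer. It is not the released H3BERTa tokenizer: the five special-token names follow the usual RoBERTa convention and are an assumption (chosen because 20 amino acids plus 5 special tokens gives 25), as are the token ids, the `encode` helper, and the choice to count only residues toward the 100-position limit.

```python
# Hypothetical character-level tokenizer sketch; NOT the released H3BERTa tokenizer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Assumed RoBERTa-style special tokens; the real names and ids may differ.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

# Special tokens take ids 0-4, amino acids take ids 5-24 (an arbitrary ordering).
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}

MAX_LEN = 100  # maximum sequence length stated for the model


def encode(cdr_h3, max_len=MAX_LEN):
    """Map a CDR-H3 string to token ids wrapped in <s> ... </s>.

    Assumption: the length limit applies to residues only, so the
    start and end tokens are added after truncation.
    """
    residues = cdr_h3.strip().upper()[:max_len]
    ids = [VOCAB["<s>"]]
    # Any character outside the 20 standard amino acids maps to <unk>.
    ids.extend(VOCAB.get(aa, VOCAB["<unk>"]) for aa in residues)
    ids.append(VOCAB["</s>"])
    return ids


if __name__ == "__main__":
    print(len(VOCAB))
    print(encode("ARDYW"))
```

With one token per residue, the input length seen by the encoder tracks the CDR-H3 length directly, which is what keeps the vocabulary this small.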
H3BERTa is aimed at immunologists and antibody engineers who analyze large B-cell receptor repertoires. Its embeddings can characterize repertoire composition, track germline gene usage, and infer maturation state directly from CDR-H3 sequences, while its pseudo-perplexity scores provide a way to compare healthy and disease-associated repertoires at scale. As a frozen feature extractor, it lowers the labeled-data barrier for downstream tasks such as identifying candidate broadly neutralizing antibodies against HIV-1, making it a practical building block for therapeutic antibody discovery and immune monitoring pipelines.
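Pseudo-perplexity for a masked language model is conventionally computed by masking each position in turn, taking the log-probability the model assigns to the true residue, and exponentiating the negative mean. The sketch below follows that standard definition; whether the authors use exactly this procedure is not stated here. The `masked_log_prob` callable is a hypothetical stand-in for a real model call, and `uniform_log_prob` is a toy scorer used only to show the arithmetic.

```python
import math


def pseudo_perplexity(tokens, masked_log_prob):
    """Pseudo-perplexity of a sequence under a masked language model.

    tokens: sequence of residues (a string or a list of tokens).
    masked_log_prob: hypothetical callable (tokens, i) -> natural-log
        probability of tokens[i] when position i is masked and all
        other positions are visible.
    """
    if len(tokens) == 0:
        raise ValueError("sequence must contain at least one token")
    # Sum the log-probabilities of the true token at each masked position.
    total = sum(masked_log_prob(tokens, i) for i in range(len(tokens)))
    # Exponentiate the negative mean log-probability.
    return math.exp(-total / len(tokens))


def uniform_log_prob(tokens, position):
    """Toy scorer: every one of the 20 amino acids is equally likely."""
    return math.log(1.0 / 20.0)


if __name__ == "__main__":
    # A model with no information about the sequence scores about 20,
    # the number of equally likely choices at each position.
    print(pseudo_perplexity("ARDYW", uniform_log_prob))
```

Lower values mean the model finds a sequence more predictable, so comparing the score distributions of two repertoires gives a label-free measure of how far each departs from the healthy-donor training distribution.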
H3BERTa contributes to a growing line of antibody-specific language models by demonstrating that a model trained on CDR-H3 alone can recover biologically meaningful structure (germline usage, maturation, and disease-associated repertoire differences) without requiring the full variable region or a general-purpose protein backbone. Its open release of weights, code, and datasets under a permissive license makes it readily reusable for repertoire analysis and antibody-discovery research. Because it is a recent preprint, its broader adoption and its standing relative to AntiBERTa and ESM-2-based approaches remain to be established through independent evaluation, and the model card does not yet report standardized benchmark metrics or an explicit limitations discussion.
Rodella, C. & Lemmin, T. (2025) H3BERTa: A CDR-H3 specific language model for antibody repertoire analysis. bioRxiv.
DOI: 10.1101/2025.11.03.686198