H3BERTa is a protein language model trained exclusively on the third complementarity-determining region of the antibody heavy chain (CDR-H3), the loop that contributes most of the diversity and antigen-binding specificity of an antibody. Whereas most antibody language models learn over the entire variable domain, H3BERTa concentrates its modeling capacity on this single hypervariable region, testing the hypothesis that CDR-H3 alone carries enough signal to characterize immune repertoires. It was developed by Chiara Rodella and Thomas Lemmin at the Institute of Biochemistry and Molecular Medicine, University of Bern, and released as a bioRxiv preprint in November 2025.

The authors contrast H3BERTa with two related approaches: AntiBERTa, which is trained on full variable-region sequences, and the CDR-Masked Paired Antibody Language Model, which fine-tunes the general-purpose ESM-2 protein model. H3BERTa instead pretrains a RoBERTa encoder from scratch on CDR-H3 sequences only, so its learned representations are shaped entirely by the statistics of this region.

By focusing on CDR-H3, H3BERTa aims to provide a lightweight, region-specific foundation for antibody repertoire analysis: extracting embeddings that reflect germline gene usage and B-cell maturation state, scoring sequences by pseudo-perplexity, and serving as a frozen feature extractor for downstream classification tasks with limited labeled data.

Key Features

CDR-H3-only pretraining: Trained on more than 17 million curated CDR-H3 sequences from healthy-donor repertoires (Observed Antibody Space, IgG/IgA), concentrating the model on the most diverse antibody region rather than the full variable domain.
Informative zero-shot embeddings: Without any task-specific fine-tuning, the learned representations capture J-gene usage and patterns associated with B-cell maturation, indicating the model encodes biologically meaningful structure.
Pseudo-perplexity profiling: Masked-language-model scoring distinguishes healthy repertoires from HIV-1-derived repertoires, offering a repertoire-level signal of immune state.
Few-shot bnAb classification: Frozen H3BERTa embeddings support few-shot classifiers that flag candidate HIV-1 broadly neutralizing antibodies, useful when labeled examples are scarce.
Open release: Model weights, training code, and the curated datasets are publicly available on HuggingFace, GitHub, and Zenodo under an MIT license.
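
The pseudo-perplexity score mentioned above can be sketched as follows. This is a minimal illustration of the commonly used definition (mask each residue in turn, score the true residue given the rest, then exponentiate the negative mean log-probability); the preprint's exact procedure may differ. The `masked_log_prob` callable and `uniform_scorer` are hypothetical stand-ins for a real masked-language-model forward pass, not part of the H3BERTa release.

```python
import math
from typing import Callable, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical scorer signature (an assumption for this sketch): given the
# residues and the index of the masked position, return
# log P(true residue at that index | all other residues).
MaskedLogProb = Callable[[Sequence[str], int], float]


def pseudo_perplexity(seq: str, masked_log_prob: MaskedLogProb) -> float:
    """exp of the negative mean masked log-probability over all positions."""
    if not seq:
        raise ValueError("sequence must not be empty")
    residues = list(seq)
    # One scoring call per position: each residue is masked in turn.
    total = sum(masked_log_prob(residues, i) for i in range(len(residues)))
    return math.exp(-total / len(residues))


def uniform_scorer(residues: Sequence[str], index: int) -> float:
    """Toy stand-in for a model: every amino acid is equally likely."""
    return math.log(1.0 / len(AMINO_ACIDS))


# Illustrative CDR-H3-like string, not taken from the training data.
ppl = pseudo_perplexity("ARDYYGSSYFDY", uniform_scorer)
print(f"pseudo-perplexity under a uniform scorer: {ppl:.1f}")
```

A scorer that knows nothing about the sequence yields a pseudo-perplexity equal to the number of candidate residues, so lower values indicate sequences the model finds more predictable. Comparing the distribution of these scores across repertoires is the kind of repertoire-level signal the feature describes.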

Technical Details

H3BERTa adopts a RoBERTa-base encoder architecture (an encoder-only transformer trained with masked language modeling) with approximately 85.7M parameters, a maximum sequence length of 100 amino acids, and a compact 25-token vocabulary covering the 20 standard amino acids plus special tokens. Pretraining used more than 17 million curated CDR-H3 sequences drawn from healthy-donor repertoires in the Observed Antibody Space (IgG and IgA isotypes). The authors evaluate the model through three lenses: zero-shot embedding analysis (recovering J-gene usage and maturation signals), pseudo-perplexity comparisons separating healthy from HIV-1-derived repertoires, and few-shot classification of broadly neutralizing antibodies built on frozen embeddings with SVM and GAN-BERTa classifiers. The datasets used for training, validation, and testing, along with all trained model weights, are deposited on Zenodo and mirrored from the GitHub repository.
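
As a sanity check on the figures above, the vocabulary size and parameter count can be reproduced with simple arithmetic. The sketch below assumes standard RoBERTa-base dimensions (hidden size 768, 12 layers, feed-forward size 3072), the five usual RoBERTa special tokens, a position table of 100 + 2 rows (the RoBERTa convention of reserving two extra positions), and a masked-LM head whose output weights are tied to the input embeddings. None of these values is stated in the source beyond "RoBERTa-base", so treat them as assumptions.

```python
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
# Assumed special tokens (RoBERTa convention); the actual tokenizer may differ.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
VOCAB = SPECIAL_TOKENS + AMINO_ACIDS


def roberta_mlm_param_count(
    vocab_size: int,
    max_positions: int,
    hidden: int = 768,
    layers: int = 12,
    intermediate: int = 3072,
    type_vocab: int = 1,
) -> int:
    """Count weights and biases of a RoBERTa-style encoder with a tied MLM head."""
    layer_norm = 2 * hidden  # scale + shift
    # Token, position and token-type tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (with biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (with biases), plus LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    encoder = layers * (attention + feed_forward)
    # MLM head: dense + LayerNorm + output bias; output weights are tied.
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


total = roberta_mlm_param_count(len(VOCAB), max_positions=100 + 2)
print(f"{len(VOCAB)} tokens, {total / 1e6:.1f}M parameters")
```

Under these assumptions the count comes to about 85.7M, consistent with the reported size. Almost all of it sits in the transformer layers, because the 25-token vocabulary makes the embedding table negligible compared with a general-purpose text model.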

Applications

H3BERTa is aimed at immunologists and antibody engineers who analyze large B-cell receptor repertoires. Its embeddings can characterize repertoire composition, track germline gene usage, and infer maturation state directly from CDR-H3 sequences, while its pseudo-perplexity scores provide a way to compare healthy and disease-associated repertoires at scale. As a frozen feature extractor, it lowers the labeled-data barrier for downstream tasks such as identifying candidate broadly neutralizing antibodies against HIV-1, making it a practical building block for therapeutic antibody discovery and immune monitoring pipelines.
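
The frozen-feature-extractor workflow can be illustrated with a small, dependency-free sketch. It assumes per-sequence embeddings are obtained by mean-pooling token vectors over non-padding positions, and it swaps in a nearest-centroid classifier as a simpler stand-in for the SVM and GAN-BERTa classifiers the authors use. All vectors and labels below are toy values invented for illustration, not model outputs.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Vector], mask: Sequence[int]) -> Vector:
    """Average token vectors where mask == 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(len(kept[0]))]


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a set of embeddings."""
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(len(vectors[0]))]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def fit_centroids(support: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """One centroid per class from a handful of labeled embeddings."""
    return {label: centroid(vectors) for label, vectors in support.items()}


def predict(embedding: Vector, centroids: Dict[str, Vector]) -> str:
    """Assign the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))


# Toy 2-dimensional "embeddings"; real ones would come from the frozen encoder.
support = {
    "bnAb": [[0.9, 0.1], [0.8, 0.2]],
    "non-bnAb": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = fit_centroids(support)

# Three token vectors, the last one being padding (mask = 0).
query = mean_pool([[1.0, 0.0], [0.6, 0.2], [0.0, 0.0]], mask=[1, 1, 0])
label = predict(query, centroids)
print(label)
```

Because the encoder stays frozen, only the small classifier on top is fitted, which is what keeps the labeled-data requirement low. A margin-based classifier such as an SVM would replace `fit_centroids` and `predict` without changing the pooling step.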

Impact

H3BERTa contributes to a growing line of antibody-specific language models by showing that a model trained on CDR-H3 alone can recover biologically meaningful structure (germline usage, maturation, and disease-associated repertoire differences) without the full variable region or a general-purpose protein backbone. Its open release of weights, code, and datasets under a permissive license makes it readily reusable for repertoire analysis and antibody-discovery research. Because it is a recent preprint, its broader adoption and its standing relative to AntiBERTa and ESM-2-based approaches remain to be established through independent evaluation, and the model card does not yet report standardized benchmark metrics or an explicit discussion of limitations.

Citation

H3BERTa: A CDR-H3 specific language model for antibody repertoire analysis
