bio.rodeo
Protein · Language model

H3BERTa

University of Bern

A RoBERTa transformer pretrained solely on antibody heavy-chain CDR-H3 sequences, producing embeddings for repertoire analysis, B-cell maturation profiling, and classification of broadly neutralizing antibodies (bnAbs).

Released: November 2025
Parameters: 85.7 million

H3BERTa is a protein language model trained exclusively on the third complementarity-determining region of the antibody heavy chain (CDR-H3), the loop that contributes most of the diversity and antigen-binding specificity of an antibody. Whereas most antibody language models learn over the entire variable domain, H3BERTa concentrates its modeling capacity on this single hypervariable region, testing the hypothesis that CDR-H3 alone carries enough signal to characterize immune repertoires. It was developed by Chiara Rodella and Thomas Lemmin at the Institute of Biochemistry and Molecular Medicine, University of Bern, and released as a bioRxiv preprint in November 2025.

The model is positioned against two related approaches it explicitly distinguishes itself from: AntiBERTa, which is trained on full variable-region sequences, and the CDR-Masked Paired Antibody Language Model, which fine-tunes the general-purpose ESM-2 protein model. H3BERTa instead pretrains a RoBERTa encoder from scratch on CDR-H3 sequences only, so that its learned representations are shaped entirely by the statistics of this region.

By focusing on CDR-H3, H3BERTa aims to provide a lightweight, region-specific foundation for antibody repertoire analysis: it extracts embeddings that reflect germline gene usage and B-cell maturation state, scores sequences by pseudo-perplexity, and serves as a frozen feature extractor for downstream classification tasks with limited labeled data.

#Key Features

  • CDR-H3-only pretraining: Trained on more than 17 million curated CDR-H3 sequences from healthy-donor repertoires (Observed Antibody Space, IgG/IgA), concentrating the model on the most diverse antibody region rather than the full variable domain.
  • Informative zero-shot embeddings: Without any task-specific fine-tuning, the learned representations capture J-gene usage and patterns associated with B-cell maturation, indicating the model encodes biologically meaningful structure.
  • Pseudo-perplexity profiling: Masked-language-model scoring distinguishes healthy repertoires from HIV-1-derived repertoires, offering a repertoire-level signal of immune state.
  • Few-shot bnAb classification: Frozen H3BERTa embeddings support few-shot classifiers that flag candidate HIV-1 broadly neutralizing antibodies, useful when labeled examples are scarce.
  • Open release: Model weights, training code, and the curated datasets are publicly available on HuggingFace, GitHub, and Zenodo under an MIT license.
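
The pseudo-perplexity profiling mentioned above can be sketched as follows. This is a minimal illustration of the standard masked-language-model definition (mask each residue in turn, take the log-probability the model assigns to the true residue, and exponentiate the negative mean), not the authors' code. Whether H3BERTa normalizes its scores in exactly this way is an assumption. `masked_log_prob` and `uniform_scorer` are hypothetical names: the first stands for any function that queries a masked language model, and the second is a toy stand-in used only so the example runs without model weights.

```python
import math
from typing import Callable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pseudo_perplexity(seq: str, masked_log_prob: Callable[[str, int], float]) -> float:
    """Return exp(-(1/N) * sum_i log P(x_i | sequence with position i masked)).

    `masked_log_prob(seq, i)` is a hypothetical callable that should return the
    natural-log probability a masked language model assigns to the true residue
    at position i when that position is masked.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    total = sum(masked_log_prob(seq, i) for i in range(len(seq)))
    return math.exp(-total / len(seq))


def uniform_scorer(seq: str, position: int) -> float:
    """Toy stand-in for a model: every residue is equally likely at every position."""
    return math.log(1.0 / len(AMINO_ACIDS))


# A model with no knowledge of CDR-H3 statistics scores every sequence the same.
ppl = pseudo_perplexity("ARDYYGSSYFDY", uniform_scorer)
print(round(ppl, 6))
```

With the uniform scorer the result equals the alphabet size (20, up to floating-point rounding), which gives a useful upper reference point: a trained model should score in-distribution CDR-H3 sequences well below it, and sequences from a shifted repertoire would be expected to score higher.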

#Technical Details

H3BERTa adopts a RoBERTa-base encoder architecture (an encoder-only transformer trained with masked language modeling) with approximately 85.7M parameters, a maximum sequence length of 100 amino acids, and a compact 25-token vocabulary covering the 20 standard amino acids plus special tokens. Pretraining used more than 17 million curated CDR-H3 sequences drawn from healthy-donor repertoires in the Observed Antibody Space (IgG and IgA isotypes). The authors evaluate the model through three lenses: zero-shot embedding analysis (recovering J-gene usage and maturation signals), pseudo-perplexity comparisons separating healthy from HIV-1-derived repertoires, and few-shot classification of broadly neutralizing antibodies built on frozen embeddings with SVM and GAN-BERTa classifiers. The datasets used for training, validation, and testing, along with all trained model weights, are deposited on Zenodo and mirrored from the GitHub repository.
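
As a rough sanity check on the stated size, the parameter count of a RoBERTa-base style masked language model can be tallied by hand. The sketch below assumes the usual RoBERTa-base dimensions (12 layers, hidden size 768, feed-forward size 3072), an output projection tied to the input embeddings, and 102 position embeddings (100 residues plus the offset of 2 that RoBERTa implementations typically reserve). None of these values are confirmed by the source beyond "RoBERTa-base", the 25-token vocabulary, and the 100-residue limit, and the function name is hypothetical.

```python
def roberta_mlm_param_count(
    vocab_size: int,
    max_positions: int,
    hidden: int = 768,         # assumed RoBERTa-base hidden size
    layers: int = 12,          # assumed RoBERTa-base depth
    intermediate: int = 3072,  # assumed RoBERTa-base feed-forward size
    type_vocab: int = 1,       # RoBERTa uses a single token type
) -> int:
    """Count trainable parameters of an encoder plus masked-LM head (tied decoder)."""
    layer_norm = 2 * hidden  # scale and shift vectors

    # Token, position, and token-type embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (up- and down-projection) plus LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    encoder = layers * (attention + feed_forward)

    # LM head: dense layer, LayerNorm, and an output bias; decoder weights are tied.
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size

    return embeddings + encoder + lm_head


total = roberta_mlm_param_count(vocab_size=25, max_positions=102)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the tally comes to 85,746,457, which is consistent with the reported figure of roughly 85.7M. Almost all of it sits in the encoder layers, since a 25-token vocabulary makes the embedding table negligible. The agreement is suggestive rather than a confirmation of the authors' exact configuration.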

#Applications

H3BERTa is aimed at immunologists and antibody engineers who analyze large B-cell receptor repertoires. Its embeddings can characterize repertoire composition, track germline gene usage, and infer maturation state directly from CDR-H3 sequences, while its pseudo-perplexity scores provide a way to compare healthy and disease-associated repertoires at scale. As a frozen feature extractor, it lowers the labeled-data barrier for downstream tasks such as identifying candidate broadly neutralizing antibodies against HIV-1, making it a practical building block for therapeutic antibody discovery and immune monitoring pipelines.
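
The frozen-feature-extractor workflow described above can be illustrated with a small sketch: embeddings are computed once by the pretrained model and a lightweight classifier is fitted on a handful of labeled examples. The example uses a nearest-centroid rule with cosine similarity as a dependency-free stand-in for the SVM and GAN-BERTa classifiers the authors report. The three-dimensional vectors are toy placeholders, not real H3BERTa embeddings, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(support: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """'Train' by averaging the few labeled embeddings available per class."""
    return {label: centroid(vectors) for label, vectors in support.items()}


def predict(centroids: Dict[str, List[float]], embedding: Vector) -> str:
    """Assign the label whose class centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], embedding))


# Toy support set: two labeled examples per class (placeholder vectors).
support = {
    "bnAb": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "non-bnAb": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = fit_centroids(support)
print(predict(centroids, [0.7, 0.3, 0.1]))
```

Because the encoder stays frozen, only the per-class centroids (or, in the authors' setup, the classifier parameters) depend on the labeled data, which is what keeps the approach workable when only a few confirmed bnAbs are available.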

#Impact

H3BERTa contributes to a growing line of antibody-specific language models by showing that a model trained on CDR-H3 alone can recover biologically meaningful structure (germline usage, maturation, and disease-associated repertoire differences) without requiring the full variable region or a general-purpose protein backbone. Its open release of weights, code, and datasets under a permissive license makes it readily reusable for repertoire analysis and antibody-discovery research. As a recent preprint, its broader adoption and benchmark standing relative to AntiBERTa and ESM-2-based approaches remain to be established through independent evaluation, and the model card does not yet report standardized benchmark metrics or an explicit limitations discussion.

Citation

Rodella, C. & Lemmin, T. (2025). H3BERTa: A CDR-H3 specific language model for antibody repertoire analysis. bioRxiv (preprint).

DOI: 10.1101/2025.11.03.686198

Citations

Total citations: 0
Influential: 0
References: 51

GitHub

Stars: 1
Forks: 1
Open issues: 0
Contributors: 2
Last push: 6d ago
Language: Jupyter Notebook

HuggingFace

Downloads: 67
Likes: 0
Last modified: 7mo ago
Pipeline: fill-mask

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Overall score: 83 (Open)
Usability (can I run it?): 92
Reproducibility (can I retrain it?): 87
Model Openness Framework: Unclassified (restrictive license on core components)
