CDR-Masked Paired Antibody Language Model

Paired heavy/light antibody language model that fine-tunes ESM-2 and ESM-C with CDR-preferential masking to produce embeddings for zero-shot binding-affinity prediction.

Released: October 2025

This work, from Talaei and colleagues at Boston University (with corresponding author Diane Joseph-McCarthy, and funded by Merck Research Laboratories), presents a pretrained paired antibody language model designed to produce representations tailored to antibody binding. The preprint was posted to bioRxiv on October 31, 2025, with a revised version on May 5, 2026. The paper does not assign an official short name to the model; "CDR-Masked Paired Antibody Language Model" is a descriptive label used here.

The central problem the model addresses is that general protein language models, while powerful, are trained on broad natural-protein corpora and do not capture the distinctive sequence statistics of paired antibody heavy and light chains, nor do they emphasize the complementarity-determining regions (CDRs) that dominate antigen recognition. Rather than training from scratch, the authors fine-tune two established backbones — ESM-2 (3B parameters) and ESM-C (600M parameters) — on a large corpus of paired antibody sequences, using a masking strategy that preferentially masks CDR positions to focus learning on the loops most relevant to binding.

A key design choice is that the resulting embeddings are applied zero-shot from a fixed checkpoint: the antibody language model is not further fine-tuned on any binding-affinity labels. Instead, its frozen representations are fed to downstream affinity-prediction tasks, testing whether CDR-aware pretraining alone produces features that transfer to quantitative binding prediction across diverse antigens.
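The frozen-embedding setup described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the source does not say which downstream predictor is used, so the ridge-regression head, the mean pooling, the embedding dimension, and the random stand-in embeddings are all assumptions. The point it shows is that only the small regression head sees affinity labels, while the features stay fixed.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (length, dim) per-residue embedding into one (dim,) vector."""
    return token_embeddings.mean(axis=0)

def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> tuple[np.ndarray, float]:
    """Closed-form ridge regression on fixed features; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, float(y_mean - x_mean @ w)

def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Apply the linear head to pooled embeddings."""
    return X @ w + b

# Stand-in data: random arrays play the role of frozen per-residue embeddings.
# In practice these would come from the fixed language-model checkpoint.
rng = np.random.default_rng(0)
dim = 16  # hypothetical embedding width, chosen small for the demo
token_embs = [rng.normal(size=(int(rng.integers(200, 260)), dim)) for _ in range(200)]
X = np.stack([mean_pool(e) for e in token_embs])

# Synthetic affinity labels generated from the pooled features (illustration only).
true_w = rng.normal(size=dim)
y = X @ true_w + 0.01 * rng.normal(size=200)

# Fit the head on 150 variants and predict the held-out 50; the embeddings never change.
w, b = fit_ridge(X[:150], y[:150], alpha=1e-3)
pred = predict(X[150:], w, b)
```

Keeping the language model out of the optimization loop means any difference in downstream accuracy between two checkpoints can be attributed to the representations themselves rather than to task-specific adaptation.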

Key Features

Paired-chain pretraining: Trained on more than 1.6 million paired heavy/light antibody sequences, so representations reflect the joint heavy-light context that determines an antibody's paratope rather than treating chains in isolation.
CDR-preferential masking: The masked-language-modeling objective biases masking toward CDR positions, concentrating the model's predictive capacity on the hypervariable loops that drive antigen specificity and affinity.
Fine-tunes strong backbones: Builds on ESM-2 (3B) and ESM-C (600M) rather than training de novo, inheriting general protein knowledge while specializing it for antibodies.
Zero-shot, fixed-checkpoint embeddings: Downstream binding-affinity prediction uses frozen embeddings with no task-specific fine-tuning of the language model, isolating the value of the pretraining strategy itself.
Broad affinity evaluation: Assessed across six antigens and more than 90,000 variants, reporting improvements of up to 27% over antibody-specific baselines on affinity prediction.

Technical Details

The approach adapts transformer-based protein language models to the paired-antibody setting. Two backbones are fine-tuned: ESM-2 at 3 billion parameters and ESM-C at 600 million parameters. Pretraining uses a masked-language-modeling objective over a corpus of more than 1.6 million paired heavy/light antibody sequences, with a CDR-preferential masking scheme that raises the probability of masking residues in the complementarity-determining regions relative to framework positions. After pretraining, a single fixed checkpoint is used to extract embeddings, which are applied to binding-affinity prediction without further fine-tuning. The evaluation spans six antigens and more than 90,000 sequence variants, on which the CDR-masked paired embeddings are reported to outperform antibody-specific baseline models by up to 27%. The preprint is released under an all-rights-reserved license (cc_no), and neither version links to public code or model weights, so the model is not currently available for download.
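A minimal sketch of what a CDR-preferential masking step could look like is given below. The masking rates (`p_cdr`, `p_framework`), the function names, the toy sequence fragment, and the CDR span are all hypothetical; the source states only that CDR positions are masked with higher probability than framework positions, not the actual rates or the annotation scheme.

```python
import numpy as np

def cdr_preferential_mask(is_cdr, p_cdr=0.30, p_framework=0.10, rng=None):
    """Sample a boolean mask in which CDR positions are masked more often.

    is_cdr      : per-residue booleans, True where the residue lies in a CDR
    p_cdr       : masking probability for CDR residues (hypothetical value)
    p_framework : masking probability for framework residues (hypothetical value)
    """
    if rng is None:
        rng = np.random.default_rng()
    is_cdr = np.asarray(is_cdr, dtype=bool)
    # Each position gets its own Bernoulli probability depending on its region.
    probs = np.where(is_cdr, p_cdr, p_framework)
    return rng.random(is_cdr.shape[0]) < probs

def apply_mask(sequence, mask, mask_token="<mask>"):
    """Replace masked residues with the mask token; other residues are kept."""
    return [mask_token if m else aa for aa, m in zip(sequence, mask)]

# Toy 30-residue fragment with an illustrative CDR span (not a real annotation).
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFS"
is_cdr = np.zeros(len(sequence), dtype=bool)
is_cdr[25:30] = True

mask = cdr_preferential_mask(is_cdr, rng=np.random.default_rng(0))
masked_tokens = apply_mask(sequence, mask)
```

With uniform masking, the short CDR loops would contribute only a small share of the prediction targets; raising their per-position rate shifts more of the training signal onto them without changing the objective itself.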

Applications

The model targets antibody discovery and optimization workflows, where ranking candidate variants by predicted binding affinity can prioritize which sequences advance to experimental characterization. Because the affinity-prediction setting is zero-shot — using frozen embeddings rather than a model fine-tuned on each target — the representations are intended to generalize across antigens, making them useful when antigen-specific binding data are limited. Therapeutic antibody engineering teams, including those in industry settings such as the Merck-funded program behind this work, are the primary intended beneficiaries, alongside academic groups studying sequence determinants of antibody affinity.
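The prioritization step mentioned above amounts to sorting variants by a predicted score and keeping the top few. The sketch below is illustrative only: the variant names and scores are made up, and `top_k_variants` is a hypothetical helper, not part of any released tool.

```python
import numpy as np

def top_k_variants(names, predicted_scores, k=3, higher_is_better=True):
    """Return the k best (name, score) pairs according to predicted score."""
    scores = np.asarray(predicted_scores, dtype=float)
    # Negate scores for descending order; a stable sort keeps ties in input order.
    order = np.argsort(-scores if higher_is_better else scores, kind="stable")
    return [(names[i], float(scores[i])) for i in order[:k]]

# Made-up predictions for four hypothetical variants.
names = ["variant_A", "variant_B", "variant_C", "variant_D"]
scores = [0.2, 1.5, -0.3, 0.9]
shortlist = top_k_variants(names, scores, k=2)
```

The `higher_is_better` flag matters because affinity is reported in different conventions: for a dissociation constant a lower value indicates tighter binding, whereas for an enrichment-style score a higher value does.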

Impact

This study contributes to a growing line of antibody-specific language models by combining paired-chain training with CDR-focused masking and demonstrating that the resulting frozen embeddings can improve zero-shot binding-affinity prediction over antibody baselines. Its practical reach is currently constrained: the preprint is under an all-rights-reserved license, no official model name is given, and no public code or weights are released in either version, so independent reproduction and adoption are limited pending a release. The reported gains nonetheless add evidence that masking strategy and chain pairing — not just model scale — meaningfully shape how well antibody language models capture the features underlying binding.

Citation

Preferential CDR masking in paired antibody language models improves binding affinity prediction

Preprint

Talaei, M., et al. (2026) Preferential CDR masking in paired antibody language models improves binding affinity prediction. bioRxiv.

DOI: 10.1101/2025.10.31.685149

Citations

Total Citations: 0

Influential: 0

References: 44

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 8 (Closed)

Usability (can I run it?): 6

Reproducibility (can I retrain it?): 9

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

Research Paper
