A paired heavy/light antibody language model that fine-tunes ESM-2 and ESM-C with CDR-preferential masking, yielding zero-shot embeddings for binding-affinity prediction.
This work, from Talaei and colleagues at Boston University (with corresponding author Diane Joseph-McCarthy, and funded by Merck Research Laboratories), presents a pretrained paired antibody language model designed to produce representations tailored to antibody binding. The preprint was posted to bioRxiv on October 31, 2025, with a revised version on May 5, 2026. The paper does not assign an official short name to the model; "CDR-Masked Paired Antibody Language Model" is a descriptive label used here.
The central problem the model addresses is that general protein language models are trained on broad natural-protein corpora. As a result, they do not capture the distinctive sequence statistics of paired antibody heavy and light chains, nor do they emphasize the complementarity-determining regions (CDRs) that dominate antigen recognition. Rather than training from scratch, the authors fine-tune two established backbones, ESM-2 (3B parameters) and ESM-C (600M parameters), on a large corpus of paired antibody sequences, using a masking strategy that preferentially masks CDR positions to focus learning on the loops most relevant to binding.
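The idea of preferentially masking CDR positions can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `<mask>` token string, and the two masking rates (0.3 for CDR positions, 0.1 for framework positions) are all assumptions chosen for illustration, and the sketch omits details such as the random-replacement step used in some masked-language-modeling recipes. The CDR flags are assumed to come from an external numbering tool.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol; real tokenizers define their own


def cdr_preferential_mask(tokens, cdr_flags, p_cdr=0.3, p_framework=0.1, rng=None):
    """Mask CDR residues with a higher probability than framework residues.

    tokens      : list of residue tokens for one chain (or a paired sequence)
    cdr_flags   : list of bools, True where the residue lies in a CDR
    p_cdr       : masking probability inside CDRs (illustrative value)
    p_framework : masking probability in framework regions (illustrative value)

    Returns (masked_tokens, labels); labels hold the original residue at
    masked positions and None elsewhere, so a loss is computed only there.
    """
    if len(tokens) != len(cdr_flags):
        raise ValueError("tokens and cdr_flags must have the same length")
    rng = rng or random.Random()

    masked_tokens, labels = [], []
    for token, in_cdr in zip(tokens, cdr_flags):
        p_mask = p_cdr if in_cdr else p_framework
        if rng.random() < p_mask:
            masked_tokens.append(MASK_TOKEN)
            labels.append(token)      # the model must recover this residue
        else:
            masked_tokens.append(token)
            labels.append(None)       # position ignored by the loss
    return masked_tokens, labels


# Usage: a toy 10-residue fragment whose last three residues are flagged as CDR.
fragment = list("QVQLVQSGAE")
flags = [False] * 7 + [True] * 3
masked, targets = cdr_preferential_mask(fragment, flags, rng=random.Random(0))
```

Setting `p_cdr` equal to `p_framework` recovers uniform masking, which makes the CDR preference a single adjustable knob in this sketch.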
A key design choice is that the resulting embeddings are applied zero-shot from a fixed checkpoint: the antibody language model is not further fine-tuned on any binding-affinity labels. Instead, its frozen representations are fed to downstream affinity-prediction tasks, testing whether CDR-aware pretraining alone produces features that transfer to quantitative binding prediction across diverse antigens.
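A rough sketch of how frozen embeddings might feed a downstream predictor is given below. Everything in it is an assumption made for illustration: the summary above does not say how per-residue embeddings are pooled or which downstream predictor is used, so mean pooling, heavy/light concatenation, and a k-nearest-neighbour regressor are stand-ins. The per-residue vectors are assumed to have been extracted beforehand from the fixed checkpoint, and the language model is never updated.

```python
import math
from statistics import fmean


def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length vector."""
    if not residue_vectors:
        raise ValueError("need at least one residue vector")
    return [fmean(column) for column in zip(*residue_vectors)]


def paired_embedding(heavy_vectors, light_vectors):
    """Concatenate pooled heavy- and light-chain vectors (assumed scheme)."""
    return mean_pool(heavy_vectors) + mean_pool(light_vectors)


def knn_predict(query, train_embeddings, train_affinities, k=3):
    """Predict affinity as the mean label of the k nearest frozen embeddings.

    The k-NN head is a placeholder for whatever lightweight predictor sits
    on top of the frozen features; only this head sees affinity labels.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(
        zip(train_embeddings, train_affinities),
        key=lambda pair: math.dist(query, pair[0]),
    )
    return fmean(affinity for _, affinity in ranked[:k])


# Usage with toy 2-dimensional "embeddings" per residue.
variant = paired_embedding([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]])
reference = [[2.0, 3.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
score = knn_predict(variant, reference, [7.5, 1.0], k=1)
```

Because the embeddings are computed once and reused, candidate variants can be scored in this way without any gradient updates to the language model.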
The approach adapts transformer-based protein language models to the paired-antibody setting. Two backbones are fine-tuned: ESM-2 at 3 billion parameters and ESM-C at 600 million parameters. Pretraining uses a masked-language-modeling objective over a corpus of more than 1.6 million paired heavy/light antibody sequences, with a CDR-preferential masking scheme that masks residues within the complementarity-determining regions with higher probability than framework positions. After pretraining, a single fixed checkpoint is used to extract embeddings, which are applied to binding-affinity prediction without further fine-tuning. The evaluation spans six antigens and over 90,000 sequence variants, on which the CDR-masked paired embeddings outperform antibody-specific baseline models by reported margins of up to 27%. The preprint is released under an all-rights-reserved license, and neither version provides a public link to code or model weights, so the model is not currently available for download.
The model targets antibody discovery and optimization workflows, where ranking candidate variants by predicted binding affinity can prioritize which sequences advance to experimental characterization. Because the affinity-prediction setting is zero-shot, using frozen embeddings rather than a model fine-tuned on each target, the representations are intended to generalize across antigens, which would make them useful when antigen-specific binding data are limited. Therapeutic antibody engineering teams, including those in industry settings such as the Merck-funded program behind this work, are the primary intended beneficiaries, alongside academic groups studying sequence determinants of antibody affinity.
This study contributes to a growing line of antibody-specific language models by combining paired-chain training with CDR-focused masking and demonstrating that the resulting frozen embeddings can improve zero-shot binding-affinity prediction over antibody baselines. Its practical reach is currently constrained: the preprint is under an all-rights-reserved license, no official model name is given, and no public code or weights are released in either version, so independent reproduction and adoption are limited pending a release. The reported gains nonetheless add evidence that masking strategy and chain pairing, and not model scale alone, meaningfully shape how well antibody language models capture the features underlying binding.
Talaei, M., et al. (2026) Preferential CDR masking in paired antibody language models improves binding affinity prediction. bioRxiv.
DOI: 10.1101/2025.10.31.685149