Zero-shot antibody affinity maturation using ESM pseudolikelihood scoring. Improves binding affinity up to 160-fold with no antigen-specific training data.
This method, developed by Brian Hie, Peter Kim, and colleagues at Stanford University and published in Nature Biotechnology in April 2023, demonstrates that general-purpose protein language models can guide efficient antibody affinity maturation without any antigen-specific training data. Rather than training a task-specific model on binding measurements — which requires a substantial initial dataset and significant experimental investment — the approach repurposes the evolutionary knowledge already encoded in large sequence-based language models to identify mutations that are biologically plausible and likely to improve function.
The core insight is that protein language models, trained on tens of millions of natural protein sequences, implicitly learn which substitutions are evolutionarily tolerated at each position. By scoring candidate mutations using the pseudolikelihood of each amino acid given its sequence context, the method identifies substitutions that nature has already vetted across evolutionary time. This strategy sidesteps the cold-start problem inherent in supervised machine-learning-guided directed evolution: it requires nothing more than the wild-type sequence to generate mutation recommendations.
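As a concrete illustration, the sketch below scores a single candidate substitution with one ESM model via the fair-esm package. This is a minimal single-model example under our own assumptions, not the paper's full pipeline; the function name `substitution_log_ratio` and the choice to demonstrate with ESM-1b alone are illustrative.

```python
import torch
import esm  # pip install fair-esm

# Load ESM-1b, one of the six models the paper ensembles.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

@torch.no_grad()
def substitution_log_ratio(seq: str, pos: int, mut_aa: str) -> float:
    """log p(mutant) - log p(wild type) at 0-based position `pos`,
    scored by masking that position, BERT-style."""
    _, _, tokens = batch_converter([("wt", seq)])
    tokens[0, pos + 1] = alphabet.mask_idx        # +1 skips the BOS token
    logits = model(tokens)["logits"]              # (1, seq_len + 2, vocab)
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    return (log_probs[alphabet.get_idx(mut_aa)]
            - log_probs[alphabet.get_idx(seq[pos])]).item()
```

A positive value means the model judges the mutant residue more evolutionarily plausible than the wild-type residue in that context.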
Experimental validation across seven therapeutically relevant antibodies — including those targeting influenza hemagglutinin, Ebola glycoprotein, and SARS-CoV-2 receptor-binding domain — demonstrated that the approach can substantially improve binding affinity while simultaneously maintaining thermostability and, in most cases, improving viral neutralization potency. Critically, the entire laboratory evolution campaign required screening no more than 20 variants per antibody across just two rounds of testing, making the workflow practical for resource-constrained discovery programs.
The method uses ESM-1b (650 million parameters, trained on UniRef50, approximately 27 million sequences) and an ensemble of five ESM-1v models (each 650 million parameters, trained on UniRef90, approximately 98 million sequences) as the scoring backbone. For a given wild-type sequence, the pipeline computes the log-likelihood ratio of each possible single-site substitution relative to the wild-type residue, using BERT-style masked language modeling. A substitution is recommended only if its likelihood ratio exceeds a threshold α in at least k of the six models, where k is a tunable stringency parameter: higher k demands broader consensus and yields fewer, more conservative recommendations.
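The consensus rule itself reduces to a few lines. Below is a hedged sketch of that filter, assuming each model's scores are collected into a dict keyed by (position, mutant residue); the defaults α = 1 (any mutation scored more likely than wild type) and k = 2 are illustrative, not the paper's exact settings.

```python
import math

def recommend_mutations(scores_per_model, alpha=1.0, k=2):
    """Keep substitutions whose mutant/wild-type likelihood ratio exceeds
    `alpha` (i.e., log-ratio > log(alpha)) in at least `k` of the models.

    scores_per_model: list of dicts mapping (pos, mut_aa) -> log-likelihood
    ratio, one dict per language model in the ensemble.
    """
    threshold = math.log(alpha)
    votes = {}
    for scores in scores_per_model:
        for mutation, log_ratio in scores.items():
            if log_ratio > threshold:
                votes[mutation] = votes.get(mutation, 0) + 1
    return sorted(m for m, count in votes.items() if count >= k)
```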
This consensus strategy outperformed 47 alternative variant-effect predictors on a standardized benchmark and consistently exceeded antibody-specific models such as AbLang and Sapiens, which, despite being trained exclusively on antibody sequences, failed to match the evolutionary signal captured by general protein models. The authors attribute this counterintuitive result to the much larger and more diverse training corpora of the general models, which encode richer epistatic information. The method is also agnostic to antigen identity, antibody class, and target indication, and has been validated across eight diverse protein families beyond antibodies, including beta-lactamase and influenza hemagglutinin.
The primary application is antibody affinity maturation during therapeutic development, where the method can be deployed after initial hit identification to improve binding potency without a large experimental dataset. It is particularly valuable for targets where antigen-specific training data are scarce, such as emerging pathogens or novel disease targets. The pipeline is also applicable to the optimization of unmatured germline antibodies, which may retain desirable breadth properties but require affinity improvement for clinical utility. More broadly, the consensus pseudolikelihood scoring framework can be applied to any protein engineering campaign where the goal is to improve an existing function without dramatically altering evolutionary character — including enzyme optimization, cytokine engineering, and the improvement of biosimilar candidates.
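Putting the two sketches together, deployment against a new wild-type sequence amounts to scanning every single-site substitution under each model and applying the consensus filter. The glue function below is hypothetical and reuses `substitution_log_ratio` and `recommend_mutations` from the sketches above; note that in practice one masked forward pass per position scores all 19 alternative residues at once, which is far cheaper than this naive loop.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def scan_sequence(seq, score_fns, alpha=1.0, k=2):
    """Score every single-site substitution of `seq` under each model,
    then return the consensus-recommended mutations.

    score_fns: one scoring callable per ensemble member, each with the
    signature of substitution_log_ratio above.
    """
    scores_per_model = []
    for score_fn in score_fns:
        scores = {
            (i, aa): score_fn(seq, i, aa)
            for i in range(len(seq))
            for aa in AMINO_ACIDS
            if aa != seq[i]
        }
        scores_per_model.append(scores)
    return recommend_mutations(scores_per_model, alpha=alpha, k=k)

# Example: recommendations for an antibody heavy-chain variable region (VH),
# using a single model for brevity (the paper requires multi-model consensus):
# recommended = scan_sequence(vh_sequence, [substitution_log_ratio], alpha=1.0, k=1)
```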
The study was influential in establishing that general protein language models, without antibody-specific fine-tuning, provide competitive or superior guidance for antibody engineering compared to domain-specific models. It has contributed to a broader shift in the field toward zero-shot and few-shot approaches to protein design, reducing the reliance on large labeled datasets and making machine-learning-guided engineering accessible to smaller laboratories. The paper's emphasis on experimental efficiency — achieving meaningful improvements through minimal screening — directly addresses a practical bottleneck in therapeutic antibody development. Key limitations include the restriction to function-improving rather than function-switching mutations, reduced effectiveness when wild-type sequences already occupy fitness peaks, and the known challenges of generalizing to mutations far outside natural sequence distributions. The experimental code and data are openly available under an MIT license, facilitating adoption across academic and industrial settings.
Hie, B. L., et al. (2023). Efficient evolution of human antibodies from general protein language models. Nature Biotechnology.
DOI: 10.1038/s41587-023-01763-2