Prescient Design / Genentech
Discrete generative model for antibody protein sequences combining MCMC walks on a smoothed energy landscape with one-step denoising jumps.
Generating novel protein sequences with generative models has long been technically difficult because the data are discrete. Continuous generative frameworks (diffusion models, variational autoencoders, flow-based methods) rely on smooth interpolation in a continuous space, but amino acid sequences are categorical objects: a single substitution can abolish function, and gradients are not defined on the sequence space itself. A common workaround is to embed sequences in a continuous space, generate there, and round back to discrete tokens; the rounding step can introduce systematic artifacts and does not directly model the combinatorial structure of sequence diversity. Walk-Jump Sampling (WJS), developed by researchers at Prescient Design (Genentech) and collaborators, addresses this problem by splitting sampling into two complementary operations, each of which can be performed efficiently and stably.
The Discrete Walk-Jump Sampling formalism was introduced in a paper accepted at ICLR 2024 as an Oral Presentation, which received the Outstanding Paper Award (one of five awarded from more than 7,300 submissions). It resolves the discrete generation problem by operating on a smoothed version of the data distribution. Rather than generating sequences directly in discrete space, the method learns a smoothed energy function on a continuous relaxation of the sequence space, runs Langevin MCMC ("walk" steps) on this smooth landscape to explore diverse sequence configurations, and then projects each sample back to the discrete data manifold with a single denoising step (the "jump"). The formulation combines the contrastive divergence training of energy-based models with the generation quality of score-based denoising models, and it needs only a single noise level rather than the multi-level noise schedules that make standard diffusion models harder to tune and sample from.
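The smoothing, walk, and jump operations described above can be written compactly. The notation here is chosen for illustration and is not taken from the paper: x is a clean one-hot sequence, y its noisy continuous counterpart, f_θ the learned energy with p_σ(y) ∝ exp(−f_θ(y)), and δ a step size. The overdamped Langevin update is a simplifying assumption; the sampler used in the paper may differ in detail.

```latex
\begin{aligned}
\text{smoothing:}\quad & y = x + \sigma\,\varepsilon,
  \qquad \varepsilon \sim \mathcal{N}(0, I_d) \\
\text{walk:}\quad & y_{k+1} = y_k - \delta\,\nabla_y f_\theta(y_k) + \sqrt{2\delta}\,\varepsilon_k,
  \qquad \varepsilon_k \sim \mathcal{N}(0, I_d) \\
\text{jump:}\quad & \hat{x}(y) = y + \sigma^2\,\nabla_y \log p_\sigma(y)
  = y - \sigma^2\,\nabla_y f_\theta(y)
\end{aligned}
```

The jump is the least-squares denoising estimate of x given y for Gaussian noise at the single level σ, which is why no noise schedule is needed. A discrete sequence is then read off by taking the most likely token at each position.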
The method was validated on antibody sequence generation, one of the most scientifically and therapeutically important discrete sequence generation problems, and reported first-round wet-lab success rates well above those typically reported for comparable computational protein design methods. The work was authored by Nathan C. Frey, Daniel Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hötzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, Andreas Loukas, Vladimir Gligorijević, and Saeed Saremi, representing a collaboration between Prescient Design and New York University.
The Walk-Jump Sampling framework operates on one-hot encoded antibody sequences aligned to a standardized numbering scheme. Antibody sequences from the Observed Antibody Space (OAS) database are aligned with the AHo numbering scheme using the ANARCI package, which gives every sequence a fixed length of d = L = 297 positions covering the paired heavy and light chains, and each position is then one-hot encoded. The smoothed energy function is parameterized by a neural network denoiser trained with a contrastive divergence-inspired objective that trains the energy landscape and the denoising function together in a unified framework.
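As a small illustration of the encoding step, the sketch below one-hot encodes a sequence that has already been aligned. The 21-token vocabulary (20 amino acids plus a gap character), the helper names, and the short toy sequence are assumptions made for illustration; the ANARCI alignment itself is not reproduced here.

```python
import numpy as np

# Assumed vocabulary: 20 standard amino acids plus "-" for alignment gaps.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
TOKEN_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_encode(aligned_seq):
    """Return a (positions, vocabulary) one-hot matrix for an aligned sequence."""
    idx = np.array([TOKEN_INDEX[aa] for aa in aligned_seq])
    encoded = np.zeros((len(aligned_seq), len(ALPHABET)), dtype=np.float32)
    encoded[np.arange(len(aligned_seq)), idx] = 1.0
    return encoded


def decode(encoded):
    """Map a (positions, vocabulary) matrix back to a string via per-row argmax."""
    return "".join(ALPHABET[i] for i in encoded.argmax(axis=-1))


# Hypothetical toy fragment; a real AHo-aligned sequence would have 297 positions.
toy = "EVQL--VESGG"
x = one_hot_encode(toy)
print(x.shape, decode(x))
```

Because alignment pads every sequence to the same length with gap tokens, all encoded sequences share one shape, which is what lets a single fixed-size network operate on them.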
The energy-based model component learns to assign low energy to natural antibody sequences and high energy to corrupted versions. The score-based denoiser learns to map noisy continuous sequences back toward clean sequences on the natural data manifold. During inference, each walk step computes the gradient of the smoothed energy function with respect to the continuous representation and applies a Langevin dynamics update, which combines a gradient term with a stochastic noise term. After a predetermined number of walk steps, a single denoising jump projects the continuous sample to a discrete sequence. The critical noise level σ_c ≈ 0.5 was identified both theoretically and empirically: at this noise level, the smoothed energy landscape retains the multimodal structure of the true discrete distribution while being smooth enough for stable gradient-based sampling.
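The inference loop described above can be sketched on a toy problem. This is not the authors' implementation: in place of a trained neural energy model and denoiser, it uses the analytic score of a Gaussian-smoothed empirical distribution over a handful of made-up token sequences, and a plain overdamped Langevin update. All function names, the step size, and the step count are assumptions; only σ = 0.5 comes from the text.

```python
import numpy as np


def smoothed_score(y, data, sigma):
    """Analytic score of (1/N) * sum_i N(y; x_i, sigma^2 I).

    Stands in for the gradient of a learned energy model (assumption).
    """
    diffs = data - y                                   # (N, d)
    logits = -0.5 * np.sum(diffs ** 2, axis=1) / sigma ** 2
    weights = np.exp(logits - logits.max())            # stable softmax
    weights /= weights.sum()
    return (weights[:, None] * diffs).sum(axis=0) / sigma ** 2


def walk(y, data, sigma, n_steps, step, rng):
    """Langevin MCMC on the smoothed density: gradient term plus noise term."""
    for _ in range(n_steps):
        noise = rng.standard_normal(y.shape)
        y = y + step * smoothed_score(y, data, sigma) + np.sqrt(2 * step) * noise
    return y


def jump(y, data, sigma, n_pos, n_tok):
    """Single denoising step, then argmax per position to get discrete tokens."""
    x_hat = y + sigma ** 2 * smoothed_score(y, data, sigma)
    return x_hat.reshape(n_pos, n_tok).argmax(axis=1)


def walk_jump(data_tokens, n_tok, sigma=0.5, n_steps=200, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, n_pos = data_tokens.shape
    data = np.eye(n_tok)[data_tokens].reshape(n, -1)   # flattened one-hot
    # Start the chain from a noised training point.
    y = data[rng.integers(n)] + sigma * rng.standard_normal(data.shape[1])
    y = walk(y, data, sigma, n_steps, step, rng)
    return jump(y, data, sigma, n_pos, n_tok)


# Hypothetical toy "sequences": 3 examples, 4 positions, 4 token types.
tokens = np.array([[0, 1, 2, 3], [0, 1, 3, 3], [2, 2, 2, 0]])
print(walk_jump(tokens, n_tok=4, seed=1))
```

With the analytic score, the denoised estimate is a weighted average of the training one-hot vectors, so the jump can only recombine tokens seen at each position. A learned denoiser generalizes beyond the training set, which is what makes novel sequences possible in the real method.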
Experimental validation was conducted on antibody binding to the SARS-CoV-2 receptor binding domain (RBD) as a benchmark target. Sequences were scored with the distributional conformity score before selection for laboratory testing. This computational filter prioritizes sequences that are not only individually plausible but that also conform, as a set, to the distribution of functional antibodies. The resulting wet-lab success rate, with 70% of designs showing binding equal to or better than reference antibodies in the first experimental round, substantially exceeds what is typically reported for computational antibody design methods, which usually need multiple iterative rounds of design and selection to reach comparable success rates.
Walk-Jump Sampling is directly applicable to therapeutic antibody discovery and optimization. Its ability to generate diverse, expressible antibodies with high functional success rates in the first experimental round makes it well suited to early-stage drug discovery campaigns, where the goal is to rapidly produce a diverse panel of functional leads against a new target. Beyond antibodies, the underlying Discrete WJS formalism is general to any discrete sequence generation problem and has been extended to other protein families and molecular design tasks. The fast-mixing MCMC property is particularly valuable where broad coverage of sequence space is needed, for example when generating diverse antibody libraries for developability screening or exploring the functional neighborhood of a therapeutic lead without restricting the search to local mutation space. Because it uses a single noise level, the method is also accessible to groups without the computational resources to tune complex multi-step diffusion schedules, which positions WJS as a practical tool for protein engineering labs with standard GPU hardware.
Walk-Jump Sampling's ICLR 2024 Outstanding Paper Award reflects the significance of its methodological contribution: a principled, practically effective framework for discrete protein sequence generation with rigorous theoretical grounding. The wet-lab validation results, published alongside the theoretical framework, set a benchmark for what constitutes a credible demonstration of computational protein design, moving beyond in silico metrics toward direct experimental confirmation of generated sequence quality. The method has influenced subsequent work on discrete generative models for biological sequences, and the MCMC mixing result in particular has opened new research directions in understanding the topology of protein sequence space. A notable limitation is that the current implementation focuses on antibody sequences with fixed alignment and length; extending it to variable-length or structurally diverse protein families requires adapting the alignment and encoding scheme. In addition, while the method generates sequences with high expression rates, predicting specific binding affinities ahead of experimental testing remains an open challenge.