bio.rodeo

Protein

Walk-Jump Sampling

Prescient Design / Genentech

Discrete generative model for antibody protein sequences combining MCMC walks on a smoothed energy landscape with one-step denoising jumps.

Released: 2024

Overview

Generating novel protein sequences with discrete generative models has long posed a fundamental technical challenge. Continuous generative frameworks such as diffusion models, variational autoencoders, and flow-based methods rely on smooth interpolation in a continuous latent space, but amino acid sequences are inherently discrete: a single substitution can catastrophically alter function, and gradients are not defined on a categorical sequence space. Existing workarounds, such as embedding sequences into continuous space before generating and then rounding back to discrete tokens, introduce systematic artifacts and fail to model the combinatorial structure of sequence diversity. Walk-Jump Sampling (WJS), developed by researchers at Prescient Design (Genentech) and collaborators, addresses this problem by separating sampling into two complementary operations, each of which can be performed efficiently and stably.

The Discrete Walk-Jump Sampling formalism, introduced in a paper accepted at ICLR 2024 as an Oral Presentation and recipient of the Outstanding Paper Award (one of five awarded from more than 7,000 submissions), resolves the discrete generation problem by operating on a smoothed version of the data manifold. Rather than attempting to generate sequences directly in discrete space, the method learns a smoothed energy function on a continuous relaxation of the discrete sequence space, performs Langevin MCMC sampling ("walk" steps) on this smooth manifold to explore diverse sequence configurations, and then projects each sample back to the true discrete data manifold with a single denoising step (the "jump"). This formulation combines the stable training dynamics of energy-based models with the generation quality of score-based denoising models, while requiring only a single noise level rather than the multi-step noise schedules that make standard diffusion models complex to tune and sample from.
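
The walk and jump steps described above can be written compactly. The notation below is a sketch in the usual neural empirical Bayes style and is not taken from the paper: x denotes a clean one-hot sequence, y its noisy counterpart, f_θ the learned smoothed energy, δ a step size, and K the number of walk steps. These symbols are assumptions chosen for illustration, and the overdamped Langevin update shown is one common discretization, not necessarily the one the authors use.

```latex
% Smoothing: add isotropic Gaussian noise at a single level \sigma
y = x + \sigma \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I_d)

% Walk: Langevin MCMC on the smoothed density p_\sigma(y) \propto e^{-f_\theta(y)}
y_{k+1} = y_k - \delta \, \nabla_y f_\theta(y_k) + \sqrt{2\delta}\, \varepsilon_k,
\qquad \varepsilon_k \sim \mathcal{N}(0, I_d)

% Jump: one-step least-squares denoising, then per-position discretization
\hat{x}(y_K) = y_K + \sigma^2 \nabla_y \log p_\sigma(y_K) = y_K - \sigma^2 \nabla_y f_\theta(y_K),
\qquad x^{\text{discrete}}_{l} = \arg\max_{a} \, \hat{x}(y_K)_{l,a}
```

The same gradient appears in both steps: it drives the walk and, scaled by σ², gives the denoising correction used in the jump.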

The method was validated on antibody sequence generation, one of the most scientifically and therapeutically important discrete sequence generation problems. In wet-lab tests, 97–100% of generated sequences were expressed and purified, and 70% of functional designs bound as well as or better than known functional antibodies, rates that exceed those typically reported for comparable computational protein design methods. The work was authored by Nathan C. Frey, Daniel Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hötzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, Andreas Loukas, Vladimir Gligorijević, and Saeed Saremi, representing a collaboration between Prescient Design and New York University.

Key Features

  • Principled discrete sequence generation: Walk-Jump Sampling provides a theoretically grounded framework for generating discrete protein sequences by separating the sampling process into MCMC exploration on a smoothed manifold and one-step denoising projection, avoiding the rounding artifacts of approaches that generate in a continuous embedding space.
  • Single noise level simplicity: Unlike standard diffusion models that require learning and sampling across an entire schedule of noise levels, the Discrete WJS framework requires only a single critical noise level (σ_c ≈ 0.5 for antibody sequences), substantially simplifying hyperparameter selection and training.
  • Fast-mixing long-run MCMC chains: The authors report the first demonstration of long-run, fast-mixing Langevin MCMC chains over antibody sequence space, in which a single continuous MCMC trajectory visits diverse antibody protein classes without becoming trapped in local minima. This is a fundamental requirement for broad exploration of sequence diversity.
  • Distributional conformity scoring: A novel metric — the distributional conformity score — is introduced to evaluate how well generated samples populate the true distribution of functional sequences rather than simply evaluating individual sequence quality. This enables the optimization and selection of generated sequences for experimental testing.
  • Exceptional wet-lab validation rate: 97–100% of generated antibody sequences were successfully expressed and purified in bacterial expression systems, and 70% of functional designs showed equal or improved binding affinity compared to known functional antibodies in a single round of laboratory experiments with no iterative optimization.
  • Combinatorial diversity exploration: Because the MCMC walk operates on a smooth energy landscape, the sampler can traverse between distinct antibody classes in a single chain, generating combinatorially diverse sequences that are not accessible by local mutation or interpolation-based methods.

Technical Details

The Walk-Jump Sampling framework operates on one-hot encoded antibody sequences aligned to a standardized scheme. Antibody sequences from the Observed Antibody Space (OAS) database are aligned using the AHo numbering scheme via the ANARCI package, giving a fixed length of L = 297 positions that covers the full paired heavy and light chain sequence, and each position is then one-hot encoded. The smoothed energy function is parameterized by a neural network denoiser trained with a contrastive divergence-inspired objective that simultaneously trains the energy landscape and the denoising function in a unified framework.
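
The encoding step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' pipeline: the 21-token vocabulary (20 amino acids plus a gap symbol), the function names, and the `pad_to_alignment` helper are all assumptions made here, and right-padding with gaps is only a toy stand-in for real AHo alignment with ANARCI.

```python
import numpy as np

# Assumed vocabulary: 20 standard amino acids plus a gap token "-".
VOCAB = "ACDEFGHIKLMNPQRSTVWY-"
TOKEN_TO_INDEX = {tok: i for i, tok in enumerate(VOCAB)}


def pad_to_alignment(seq: str, length: int = 297) -> str:
    """Toy stand-in for alignment: right-pad with gaps to a fixed length."""
    if len(seq) > length:
        raise ValueError("sequence is longer than the alignment length")
    return seq.ljust(length, "-")


def one_hot_encode(aligned_seq: str) -> np.ndarray:
    """Encode an aligned sequence as a (length, vocab_size) one-hot matrix."""
    idx = np.array([TOKEN_TO_INDEX[tok] for tok in aligned_seq])
    x = np.zeros((len(aligned_seq), len(VOCAB)), dtype=np.float32)
    x[np.arange(len(aligned_seq)), idx] = 1.0
    return x


def add_noise(x: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Smooth a one-hot matrix by adding isotropic Gaussian noise of scale sigma."""
    return x + sigma * rng.standard_normal(x.shape)


def one_hot_decode(x: np.ndarray) -> str:
    """Map a (possibly noisy) matrix back to tokens by per-position argmax."""
    return "".join(VOCAB[i] for i in x.argmax(axis=-1))
```

For example, `one_hot_encode(pad_to_alignment("EVQLVESGGGLVQPGG"))` yields a matrix of shape (297, 21). Decoding a heavily noised matrix by argmax alone will generally introduce errors, which is why the method relies on a learned denoiser for the jump instead of simple rounding.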

The energy-based model component learns to assign low energy to natural antibody sequences and high energy to corrupted versions. The score-based denoiser learns to map noisy continuous sequences back to the nearest discrete sequence on the natural data manifold. During inference, MCMC walk steps are taken by computing the gradient of the smoothed energy function with respect to the continuous representation and following a Langevin dynamics update, which incorporates both a gradient term and a stochastic noise term. After a predetermined number of walk steps, a single denoising jump projects the continuous sample to a discrete sequence. The critical noise level σ_c ≈ 0.5 was identified both theoretically and empirically: at this noise level, the smoothed energy landscape retains the multimodal structure of the true discrete distribution while being sufficiently smooth for stable gradient-based sampling.
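
The walk-then-jump loop described above can be illustrated on a toy problem. The sketch below makes several assumptions that are not from the paper: instead of a trained neural energy model it uses the closed-form score of a Gaussian-smoothed empirical distribution over a handful of one-hot sequences, it uses a plain overdamped Langevin update, and the function names, step size, and number of walk steps are hypothetical choices for illustration.

```python
import numpy as np


def posterior_weights(y, data, sigma):
    """Weights p(x_i | y) when x is uniform over `data` and y = x + sigma * noise."""
    sq_dist = ((data - y) ** 2).sum(axis=1)
    logits = -sq_dist / (2.0 * sigma**2)
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def denoise(y, data, sigma):
    """Least-squares denoiser E[x | y] for the empirical distribution over `data`."""
    return posterior_weights(y, data, sigma) @ data


def score(y, data, sigma):
    """Score of the smoothed density: grad log p_sigma(y) = (E[x | y] - y) / sigma^2."""
    return (denoise(y, data, sigma) - y) / sigma**2


def walk_jump_sample(data, seq_shape, sigma=0.5, step=0.05, n_walk=200, seed=0):
    """Run one Langevin chain (walk), then denoise once and discretize (jump)."""
    rng = np.random.default_rng(seed)
    # Start the chain at a randomly chosen data point with noise added.
    y = data[rng.integers(len(data))] + sigma * rng.standard_normal(data.shape[1])
    for _ in range(n_walk):
        noise = rng.standard_normal(y.shape)
        y = y + step * score(y, data, sigma) + np.sqrt(2.0 * step) * noise
    x_hat = denoise(y, data, sigma)  # single denoising jump
    return x_hat.reshape(seq_shape).argmax(axis=-1)  # token index per position


# Toy data: three length-4 sequences over a 3-letter alphabet, flattened one-hot.
toy_tokens = np.array([[0, 1, 2, 0], [2, 2, 1, 0], [1, 0, 0, 2]])
toy_data = np.eye(3)[toy_tokens].reshape(len(toy_tokens), -1)
sample = walk_jump_sample(toy_data, seq_shape=(4, 3), seed=1)
print(sample)
```

In the real setting the score would come from the gradient of the learned energy function and the jump from the trained denoiser; the analytic versions here only make the control flow concrete.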

Experimental validation was conducted on antibody binding to the SARS-CoV-2 receptor binding domain (RBD) as a benchmark target. Sequences were scored using the distributional conformity score before selection for laboratory testing, providing a computational filter that prioritizes sequences that are not only individually plausible but that collectively represent the distribution of functional antibodies. The resulting wet-lab success rate — 70% of designs showing binding equal or better than reference antibodies in the first experimental round — substantially exceeds what is typically reported for computational antibody design methods, where multiple iterative rounds of design and selection are usually required to reach comparable success rates.
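
To make the idea of a distribution-level filter concrete, the sketch below shows one plausible way such a score could be computed. It is an assumption-laden illustration and not the paper's exact definition: each sequence is represented by a hypothetical vector of numeric properties, a Gaussian kernel density estimate is fit to a reference set, and a candidate is scored by the fraction of held-out calibration points whose density is no higher than its own. The function names and the bandwidth value are invented for this example.

```python
import numpy as np


def kde_logpdf(points, reference, bandwidth):
    """Log-density of `points` under an isotropic Gaussian KDE fit to `reference`."""
    points = np.atleast_2d(points)
    d = reference.shape[1]
    sq = ((points[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -sq / (2.0 * bandwidth**2) - 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    m = log_kernel.max(axis=1, keepdims=True)  # log-mean-exp for stability
    return (m + np.log(np.exp(log_kernel - m).mean(axis=1, keepdims=True))).ravel()


def conformity_score(candidate, reference, calibration, bandwidth=0.5):
    """Fraction of calibration points whose density is <= the candidate's density."""
    cand = kde_logpdf(candidate, reference, bandwidth)[0]
    calib = kde_logpdf(calibration, reference, bandwidth)
    return float((calib <= cand).mean())


# Synthetic two-dimensional property vectors standing in for real sequence features.
rng = np.random.default_rng(0)
reference = rng.standard_normal((200, 2))
calibration = rng.standard_normal((100, 2))
typical = conformity_score(np.zeros(2), reference, calibration)
outlier = conformity_score(np.full(2, 10.0), reference, calibration)
print(typical, outlier)
```

A candidate near the bulk of the reference distribution receives a score close to 1, while one far outside it receives a score close to 0, so a threshold on this value can act as a pre-screening filter.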

Applications

Walk-Jump Sampling is directly applicable to therapeutic antibody discovery and optimization. The method's ability to generate diverse, expressible antibodies with high functional success rates in the first experimental round makes it well-suited for early-stage drug discovery campaigns where the goal is to rapidly generate a diverse panel of functional leads against a new target. Beyond antibodies, the underlying Discrete WJS formalism is general to any discrete sequence generation problem and has been extended to other protein families and molecular design tasks. The fast-mixing MCMC property is particularly valuable for applications where broad coverage of sequence space is needed — for example, generating diverse antibody libraries for developability screening or exploring the functional neighborhood of a therapeutic lead without restricting search to local mutation space. The method's single-noise-level simplicity also makes it accessible for groups without the computational resources to tune complex multi-step diffusion schedules, positioning WJS as a practical tool for protein engineering labs that have access to standard GPU hardware.

Impact

Walk-Jump Sampling's ICLR 2024 Outstanding Paper Award reflects the significance of its methodological contribution: a principled, practically effective framework for discrete protein sequence generation with rigorous theoretical grounding. The wet-lab validation results, published alongside the theoretical framework, set a high bar for what constitutes a credible demonstration of computational protein design, moving beyond in silico metrics toward direct experimental confirmation of generated sequence quality. The method has influenced subsequent work on discrete generative models for biological sequences, and the MCMC mixing result in particular has opened new research directions in understanding the topology of protein sequence space. A notable limitation is that the current implementation focuses on antibody sequences with fixed alignment and length; extending it to variable-length or structurally diverse protein families requires adapting the alignment and encoding scheme. Additionally, while the method generates sequences with high expression rates, predicting specific binding affinities before experimental testing remains an open challenge.

Tags

protein design, de novo design, antibody design, diffusion, energy-based model, generative, self-supervised, antibody

Resources

GitHub Repository, Research Paper