University of Washington / Amazon Web Services
Discrete diffusion model for conditional antibody sequence generation that restricts learning to somatic variation via a germline-absorbing noising process.
Antibodies are generated by the adaptive immune system through V(D)J recombination of germline gene segments followed by somatic hypermutation, a process that introduces targeted variation onto a relatively conserved germline scaffold. Generative models for antibody sequences must therefore capture two very different kinds of variability: the combinatorial diversity of germline recombination and the comparatively subtle, functionally consequential somatic mutations that drive affinity maturation. MochiDiff, developed by researchers at the University of Washington and Amazon Web Services, is a discrete diffusion model for conditional antibody sequence generation that addresses this challenge by aligning the model's learning objective with the biology of antibody development.
The central innovation of MochiDiff is a biologically motivated "germline-absorbing" noising process. In standard discrete (absorbing-state) diffusion, tokens are progressively corrupted toward a generic mask state, which forces the model to relearn the entire sequence, including germline-encoded positions that are not the interesting degrees of freedom in antibody design. MochiDiff instead corrupts sequences toward their inferred germline, so that the forward (noising) process effectively reverses somatic hypermutation and the learned reverse process concentrates on modeling somatic variation rather than germline recombination. This focuses model capacity on the variation that distinguishes one mature antibody from another.
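The contrast with mask-based corruption can be made concrete with a toy sketch of a germline-absorbing forward step. It assumes the mature sequence and its inferred germline are already aligned to the same length, and it uses a linear schedule in which each position reverts to its germline residue with probability t. The function name `germline_absorb`, the schedule, and the example residues are all illustrative assumptions, not details taken from the preprint.

```python
import random

def germline_absorb(mature, germline, t, rng=None):
    """Noise `mature` toward `germline` at noise level t in [0, 1].

    Each position is independently replaced by its germline residue with
    probability t (an assumed linear schedule, not the paper's).
    """
    if len(mature) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    return "".join(
        g if rng.random() < t else m
        for m, g in zip(mature, germline)
    )

# Toy aligned pair (made-up residues, not a real antibody).
germline = "QVQLVQSGAEVKKPGASVKVSCKAS"
mature = "QVQLVQSGAEVRKPGSSVKVSCEAS"  # differs from germline at 3 positions

if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.0, 0.5, 1.0):
        print(t, germline_absorb(mature, germline, t, rng))
```

In this sketch, positions where the mature and germline residues already agree never change, so only the somatic mutations carry signal for a denoiser to recover. At t = 1 the sequence is exactly the germline, which plays the role that the all-mask sequence plays in standard absorbing-state diffusion.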
MochiDiff is built by fine-tuning the ESM-2 650M protein language model on a large, deduplicated corpus of natural antibody sequences. The work is described in a preprint (arXiv:2605.06720) submitted on 7 May 2026.
MochiDiff fine-tunes the 650M-parameter ESM-2 transformer as the denoising network for a discrete diffusion process. Rather than the standard absorbing-state formulation in which tokens decay to a mask symbol, the forward process is parameterized to absorb each position toward its germline residue, so that fully noised sequences approximate germline configurations and the reverse process learns to introduce somatic mutations. Training data were assembled from the Observed Antibody Space: 337M sequences were clustered to yield a non-redundant set of 25.6M diverse antibody sequences used for fine-tuning. At inference, the model supports classifier-guided conditional sampling from a fixed checkpoint, so additional generation constraints can be imposed without retraining. On held-out perplexity, a standard intrinsic measure of how well a model captures the antibody sequence distribution (lower is better), MochiDiff reaches values as low as 1.293, outperforming AbLang-2 (1.875) and IgLM (1.411).
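To put the perplexity figures in context, the following is a minimal sketch of the textbook definition: the exponential of the negative mean per-token log-probability. The exact evaluation protocol used in the preprint is not described here and may differ, for example by deriving the value from a diffusion likelihood bound. The function name and the example probabilities are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Return exp(-mean log-probability) over the scored tokens (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

if __name__ == "__main__":
    # Hypothetical per-residue probabilities for one held-out sequence.
    log_probs = [math.log(p) for p in (0.9, 0.8, 0.95, 0.6, 0.85)]
    print(perplexity(log_probs))
```

Under this definition, a perplexity of 1.293 corresponds to a geometric-mean per-residue probability of about 1/1.293, or roughly 0.77. Values this close to 1 are plausible for antibodies because most positions are determined by the germline scaffold.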
MochiDiff is aimed at therapeutic antibody discovery and engineering, where the goal is to generate or optimize sequences that resemble naturally matured antibodies. By concentrating generation on somatic variation, the model is well suited to tasks such as proposing diverse complementarity-determining region (CDR) variants on a fixed germline framework, library design, and humanness-aware sequence generation. Classifier-guided conditional generation makes it possible to steer sampling toward desired properties from a single checkpoint, which is convenient for antibody engineering teams that need to apply multiple objectives without training separate models. As a sequence-level model, MochiDiff complements downstream structure prediction and developability filtering rather than replacing experimental validation.
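The idea of steering a fixed checkpoint with a separate classifier can be illustrated with a generic form of classifier guidance for a single sequence position. The sketch reweights the denoiser's categorical distribution over residues by a classifier's likelihood of the target property, raised to a guidance scale, and then renormalizes. This is a common textbook formulation and an assumption here; the preprint's actual guidance scheme may differ. The function name, the `scale` parameter, and all probabilities are hypothetical.

```python
import math

def guided_distribution(denoiser_probs, classifier_log_probs, scale=1.0):
    """Combine p(token) with p(property | token)**scale and renormalize.

    denoiser_probs: dict mapping token -> strictly positive probability
    classifier_log_probs: dict mapping token -> log p(property | token)
    """
    tokens = list(denoiser_probs)
    if any(denoiser_probs[tok] <= 0.0 for tok in tokens):
        raise ValueError("denoiser probabilities must be strictly positive")
    logits = [
        math.log(denoiser_probs[tok]) + scale * classifier_log_probs[tok]
        for tok in tokens
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(tokens, weights)}

if __name__ == "__main__":
    # Hypothetical denoiser output at one CDR position.
    denoiser = {"A": 0.5, "S": 0.3, "T": 0.2}
    # Hypothetical classifier scores for a desired property.
    classifier = {"A": math.log(0.2), "S": math.log(0.7), "T": math.log(0.4)}
    print(guided_distribution(denoiser, classifier, scale=2.0))
```

With `scale=0` the classifier is ignored and the denoiser's distribution is returned unchanged. Because the denoiser itself is never modified, different classifiers can be swapped in to apply different objectives without retraining.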
MochiDiff contributes to a growing line of work that adapts protein language models and discrete diffusion to the specific structure of antibody repertoires, and its germline-absorbing noising process is a notable example of encoding immunological prior knowledge directly into a generative training objective rather than relying on generic corruption schemes. The reported perplexity improvements over AbLang-2 and IgLM suggest that this inductive bias helps the model capture antibody sequence distributions more faithfully. As a recent preprint, MochiDiff has not yet accumulated independent benchmarking or wet-lab validation, so the reported results should be treated as preliminary. At the time of writing the authors do not provide a public link to code or model weights, and the license status of any released model artifact is unconfirmed; the paper text is distributed under CC BY-NC-SA 4.0. These factors currently limit external reproducibility and adoption.