bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein

MochiDiff

University of Washington / Amazon Web Services

Discrete diffusion model for conditional antibody sequence generation that restricts learning to somatic variation via a germline-absorbing noising process.

Released: May 2026
Parameters: 650 Million

Antibodies are generated by the adaptive immune system through V(D)J recombination of germline gene segments followed by somatic hypermutation, a process that introduces targeted variation onto a relatively conserved germline scaffold. Generative models for antibody sequences must therefore capture two very different kinds of variability: the combinatorial diversity of germline recombination and the comparatively subtle, functionally consequential somatic mutations that drive affinity maturation. MochiDiff, developed by researchers at the University of Washington and Amazon Web Services, is a discrete diffusion model for conditional antibody sequence generation that addresses this challenge by aligning the model's learning objective with the biology of antibody development.

The central innovation of MochiDiff is a biologically motivated "germline-absorbing" noising process. In standard absorbing-state discrete diffusion, tokens are progressively corrupted toward a generic mask state, which forces the model to relearn the entire sequence, including germline-encoded positions that are not the interesting degrees of freedom in antibody design. MochiDiff instead corrupts each sequence toward its inferred germline, so the forward (noising) process effectively reverses somatic hypermutation and the learned reverse process concentrates on somatic variation rather than germline recombination. This focuses model capacity on the variation that distinguishes one mature antibody from another.
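
To make the idea concrete, here is a minimal Python sketch of a germline-absorbing forward step. It is an illustration only, not the authors' implementation: the function names, the linear per-position reversion probability `t`, and the toy sequences are all assumptions, and the preprint's actual noise schedule and germline inference are not reproduced here.

```python
import random

def germline_absorb(mature: str, germline: str, t: float, rng: random.Random) -> str:
    """Revert each position to its germline residue with probability t.

    Assumption: positions are corrupted independently and t acts directly as
    the reversion probability (t = 0 keeps the mature sequence, t = 1 yields
    the germline). The real schedule may differ.
    """
    if len(mature) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return "".join(g if rng.random() < t else m for m, g in zip(mature, germline))

def somatic_positions(seq: str, germline: str) -> list[int]:
    """Indices where a sequence differs from its germline."""
    return [i for i, (s, g) in enumerate(zip(seq, germline)) if s != g]

# Toy, hypothetical aligned fragments (not taken from the paper or from OAS).
germline = "EVQLVESGGGLVQPGGSLRLSCAAS"
mature = "EVQLVETGGGLVQPGRSLRLSCVAS"

rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    noised = germline_absorb(mature, germline, t, rng)
    # Only somatically mutated positions can change; germline positions are
    # already in the absorbing state.
    print(t, noised, somatic_positions(noised, germline))
```

Under this sketch, positions that already match the germline never change, so a denoiser trained to invert the corruption only has to predict where and how somatic mutations are reintroduced.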

MochiDiff is built by fine-tuning the ESM-2 650M protein language model on a large, deduplicated corpus of natural antibody sequences. The model is described in a preprint (arXiv:2605.06720) submitted on 7 May 2026.

#Key Features

  • Germline-absorbing diffusion: A custom discrete noising process corrupts antibody sequences toward their germline rather than a generic mask token, restricting what the reverse process must learn to somatic variation and aligning the generative objective with antibody maturation biology.
  • Classifier-guided conditional generation: A single trained checkpoint supports conditional generation at inference time through classifier guidance, allowing different generation objectives to be specified without retraining the diffusion model.
  • Protein language model backbone: The model is initialized from ESM-2 (650M parameters), inheriting representations learned from large-scale protein pretraining and adapting them to the antibody domain.
  • Diverse, deduplicated training corpus: Training uses 25.6M diverse antibody sequences obtained by clustering 337M sequences from the Observed Antibody Space (OAS), reducing redundancy and over-representation of common clonotypes.
  • Strong sequence-modeling performance: MochiDiff achieves lower perplexity than established antibody language models, reaching a perplexity as low as 1.293 versus 1.875 for AbLang-2 and 1.411 for IgLM.
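
For readers less familiar with perplexity, the sketch below shows the usual definition: the exponential of the mean negative log-likelihood per token. How the preprint defines the scored positions and evaluation set is not restated here, so the function and the toy probabilities are assumptions for illustration.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical probabilities a model assigns to the true residue at each position.
probs = [0.9, 0.8, 0.95, 0.6]
print(perplexity([math.log(p) for p in probs]))

# Reference point: a uniform guess over the 20 standard amino acids gives
# a perplexity of 20.
uniform = [math.log(1 / 20)] * 10
print(perplexity(uniform))
```

Read this way, a perplexity of 1.293 corresponds to a geometric-mean probability of about 1/1.293 ≈ 0.77 on the correct residue, compared with about 0.71 for IgLM (1.411) and 0.53 for AbLang-2 (1.875).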

#Technical Details

MochiDiff fine-tunes the 650M-parameter ESM-2 transformer as the denoising network for a discrete diffusion process. Rather than the standard absorbing-state formulation, in which tokens decay to a mask symbol, the forward process is parameterized to absorb each position toward its germline residue. Fully noised sequences therefore approximate germline configurations, and the reverse process learns to introduce somatic mutations. Training data were assembled from the Observed Antibody Space: 337M sequences were clustered to yield a non-redundant set of 25.6M diverse antibody sequences used for fine-tuning. At inference, the model supports classifier-guided conditional sampling from a fixed checkpoint, so additional generation constraints can be imposed without retraining. On held-out perplexity, a standard intrinsic measure of how well a model captures the antibody sequence distribution (lower is better), MochiDiff reaches values as low as 1.293, outperforming AbLang-2 (1.875) and IgLM (1.411).
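
The classifier-guided sampling described above can be illustrated with the generic Bayes-rule form of guidance, in which the denoiser's token distribution is reweighted by a property classifier: p(x | y) ∝ p(x) · p(y | x)^scale. This is a sketch of the general technique, not the preprint's specific guidance mechanism; the function name, the `scale` parameter, and all numbers are hypothetical.

```python
import math

def guided_distribution(
    denoiser_logits: dict[str, float],
    classifier_log_probs: dict[str, float],
    scale: float = 1.0,
) -> dict[str, float]:
    """Reweight candidate tokens by scale * log p(y | x) and renormalize."""
    if denoiser_logits.keys() != classifier_log_probs.keys():
        raise ValueError("both inputs must cover the same candidate tokens")
    scores = {
        tok: denoiser_logits[tok] + scale * classifier_log_probs[tok]
        for tok in denoiser_logits
    }
    peak = max(scores.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Hypothetical scores for three candidate residues at one position.
denoiser_logits = {"S": 2.0, "T": 1.0, "R": 0.0}
# Hypothetical classifier estimates of log p(desired property | residue).
classifier_log_probs = {"S": math.log(0.1), "T": math.log(0.3), "R": math.log(0.9)}

unguided = guided_distribution(denoiser_logits, classifier_log_probs, scale=0.0)
guided = guided_distribution(denoiser_logits, classifier_log_probs, scale=1.0)
print(max(unguided, key=unguided.get), max(guided, key=guided.get))
```

Because the classifier only enters at sampling time, swapping in a different classifier changes the generation objective while the diffusion checkpoint stays fixed, which is the property the model card highlights.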

#Applications

MochiDiff is aimed at therapeutic antibody discovery and engineering, where the goal is to generate or optimize sequences that resemble naturally matured antibodies. By concentrating generation on somatic variation, the model is well suited to tasks such as proposing diverse complementarity-determining region (CDR) variants on a fixed germline framework, library design, and humanness-aware sequence generation. The classifier-guided conditional generation makes it possible to steer sampling toward desired properties from a single checkpoint, which is convenient for antibody engineering teams that need to apply multiple objectives without training separate models. As a sequence-level model, MochiDiff complements downstream structure prediction and developability filtering rather than replacing experimental validation.

#Impact

MochiDiff contributes to a growing line of work that adapts protein language models and discrete diffusion to the specific structure of antibody repertoires. Its germline-absorbing noising process is a notable example of encoding immunological prior knowledge directly into a generative training objective rather than relying on generic corruption schemes. The reported perplexity improvements over AbLang-2 and IgLM suggest that this inductive bias helps the model capture antibody sequence distributions more faithfully. As a recent preprint, MochiDiff has not yet accumulated independent benchmarking or wet-lab validation, so the reported results should be treated as preliminary. At the time of writing the authors do not provide a public link to code or model weights, and the license status of any released model artifact is unconfirmed; the paper text is distributed under CC BY-NC-SA 4.0. These factors currently limit external reproducibility and adoption.

Citation

Preprint

DOI: 10.48550/arXiv.2605.06720


Openness

Unclassified
Restrictive license on core components

Tags

antibody, antibody_design, de_novo_design, diffusion, generative, protein_design, proteomics, self_supervised, transformer
