A retrieval-augmented discrete denoising diffusion model for protein inverse folding that conditions sequence generation on profiles built from structurally similar proteins.
RadDiff addresses protein inverse folding: the task of designing an amino acid sequence that will fold into a given target backbone structure. Inverse folding is a cornerstone of computational protein engineering, underpinning enzyme design, antibody optimization, and the realization of de novo backbones produced by structure-generation tools. While modern methods such as ProteinMPNN, PiFold, and protein-language-model designers have pushed native sequence recovery steadily upward, they typically generate sequences purely from the query structure, leaving the rich evolutionary signal contained in known homologs untapped.
RadDiff, introduced in late 2025 by Jin Han, Tianfan Fu, and Wu-Jun Li at Nanjing University's National Key Laboratory for Novel Software Technology, reframes inverse folding as a retrieval-augmented generation problem. For each target backbone it retrieves structurally similar proteins from large databases, aligns them residue by residue to build a position-specific amino acid profile, and uses that profile as an evolutionarily informed prior to condition a discrete denoising diffusion process. This mirrors the broader retrieval-augmented generation trend in language modeling and transplants it into structure-based protein design, so the generator can lean on observed sequence diversity rather than memorizing it in its weights.
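The profile-building step described above can be illustrated with a minimal sketch. It assumes the retrieved hits have already been aligned to the query, so that each hit is a string with one character per query position and `-` marks an unaligned position. The function name `build_profile`, the pseudocount smoothing, and the toy sequences are illustrative assumptions, not details taken from the RadDiff paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_profile(query_length, aligned_hits, pseudocount=1.0):
    """Return one amino acid frequency distribution per query position.

    aligned_hits: strings of length query_length; '-' marks a gap.
    pseudocount: additive smoothing (an assumption of this sketch) so that
    residues never seen at a position still get non-zero probability.
    """
    profile = []
    for pos in range(query_length):
        # Count only standard residues; gaps and unknowns are skipped.
        counts = Counter(
            hit[pos] for hit in aligned_hits if hit[pos] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Toy example: three hypothetical hits aligned to a 4-residue query.
hits = ["MKV-", "MRVL", "MKIL"]
profile = build_profile(4, hits)
most_likely = "".join(max(p, key=p.get) for p in profile)
```

Each row of `profile` is a probability distribution over the 20 residues, which is the kind of per-position prior a conditional generator could consume. How RadDiff actually weights hits or injects the profile into the diffusion model is not specified here.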
The result is a method that the authors report improves sequence recovery by up to 19% (relative) over prior approaches across standard benchmarks, while producing highly foldable sequences and scaling gracefully as the retrieval database grows.
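For readers unfamiliar with the metric, native sequence recovery is the fraction of positions at which the designed residue matches the native one. The sketch below computes it and shows how a relative gain differs from an absolute one. The helper names and all numbers in the example are hypothetical and are not results from the paper.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Hypothetical 6-residue example: 4 of 6 positions match.
rec = sequence_recovery("MKVLAT", "MKILAS")

# Hypothetical recoveries: 0.50 -> 0.60 is +10 points absolute, +20% relative.
gain = relative_gain(0.60, 0.50)
```

The distinction matters when reading headline numbers: a relative gain of a given percentage corresponds to a smaller absolute change in recovery whenever the baseline is below 100%.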
RadDiff couples its diffusion generator to a structure encoder built on an equivariant graph neural network (a 6-layer EGNN with hidden dimension 128 and global context vectors), alongside an invariant point attention module for the masked sequence designer. The retrieval stage filters candidates with Foldseek (using a sequence-identity threshold) and US-align (TM-score > 0.5), then aligns the retained hits residue by residue to form the conditioning profile. On standard inverse folding benchmarks the authors report native sequence recovery of roughly 67% on CATH v4.2, about 72% on CATH v4.3, 75.6% on TS50, and 76.2% on PDB2022. These figures are consistently ahead of GNN-based baselines (ProteinMPNN, PiFold, GVP, AlphaDesign), protein-language-model designers (LM-Design, KW-Design), and prior diffusion methods (GraDe-IF, MapDiff), with relative gains of up to 19%.
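The retrieval filter can be sketched as a simple predicate over precomputed scores. The code below does not call Foldseek or US-align; it assumes their outputs have already been parsed into records. The TM-score cutoff of 0.5 comes from the description above, while the `Hit` record, the `max_identity` value of 0.9, and the choice to treat the identity threshold as an upper bound are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str            # hypothetical identifier of a retrieved structure
    tm_score: float      # structural similarity to the query, in [0, 1]
    seq_identity: float  # sequence identity to the query, in [0, 1]


def filter_hits(hits, min_tm=0.5, max_identity=0.9):
    """Keep hits that are structurally similar but not too sequence-identical.

    min_tm: TM-score cutoff (hits must exceed it).
    max_identity: assumed upper bound on sequence identity; both the value
    and the direction of this threshold are placeholders in this sketch.
    """
    kept = [
        h for h in hits
        if h.tm_score > min_tm and h.seq_identity <= max_identity
    ]
    # Most structurally similar hits first.
    return sorted(kept, key=lambda h: h.tm_score, reverse=True)


candidates = [
    Hit("hitA", tm_score=0.82, seq_identity=0.35),
    Hit("hitB", tm_score=0.45, seq_identity=0.20),  # fails the TM-score cutoff
    Hit("hitC", tm_score=0.91, seq_identity=0.98),  # fails the identity bound
    Hit("hitD", tm_score=0.63, seq_identity=0.50),
]
retained = [h.name for h in filter_hits(candidates)]
```

The retained hits would then be aligned residue by residue to the query to build the conditioning profile.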
RadDiff targets practitioners who need to design sequences for a fixed target fold: enzyme engineers seeking thermostable or activity-tuned variants, antibody and binder designers, and researchers redesigning sequences for de novo backbones generated by structure-design pipelines. Its retrieval-augmented formulation is particularly attractive when close structural homologs exist in the PDB, since the model can directly exploit that evolutionary context to propose foldable, higher-recovery sequences.
RadDiff demonstrates that retrieval augmentation, already transformative in language modeling, translates effectively to structure-based protein design. It offers a complementary path to ever-larger end-to-end models by externalizing knowledge into a searchable database. The reported recovery gains across CATH, TS50, and PDB2022 are substantial for a maturing benchmark suite. At the preprint stage, however, no model weights or license had been released and only partial code was provided, with the authors stating that a full open-source implementation will follow upon publication; until then, independent reproduction and downstream adoption remain limited.
Han, J., Fu, T., and Li, W.-J. (2025). RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding. arXiv preprint arXiv:2512.00126.
DOI: 10.48550/arXiv.2512.00126