Modular deep-learning framework for 3D-structure-based RNA sequence design, pairing a direct GNN predictor (SCRU-Seq) and a diffusion model (SCRU-Diff) built on self-contained RNA units.
Designing an RNA sequence that will fold into a target three-dimensional shape — the "inverse folding" problem — is a central challenge in synthetic biology and RNA therapeutics. Progress has been bottlenecked by the scarcity of high-resolution 3D RNA structures, which leaves data-hungry deep-learning models prone to overfitting, and by the computational cost of leading methods that rely on autoregressive or iterative sampling over whole molecules. SCRU-Seq and SCRU-Diff, introduced by Jian Wang and Nikolay V. Dokholyan at the University of Virginia School of Medicine in a 2026 bioRxiv preprint, attack both problems with a modular strategy that designs RNA from reusable structural building blocks rather than treating each molecule as monolithic.
The work first decomposes complex RNAs into Self-Contained RNA Units (SCRUs): structurally autonomous modules identified through tertiary-contact clustering, each of which behaves as a self-stabilizing, foldable physical unit. Assembling these units into the SCRU-DB database yields more than 61,000 SCRUs spanning over 8,200 unique structural clusters. The authors describe this library as substantially larger than prior RNA motif collections, and it expands the effective training signal available from a limited pool of experimental structures.
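To make the idea of grouping residues by tertiary contacts concrete, here is a toy Python sketch. It is not the preprint's procedure, whose details are not given in this summary. The function name `contact_clusters`, the one-point-per-residue input, the 8.0 distance cutoff, and the minimum sequence separation of 4 are all assumptions chosen for illustration. The sketch links residues that are close in space but distant in sequence, then reports the connected groups.

```python
import math
from itertools import combinations


def contact_clusters(coords, cutoff=8.0, min_separation=4):
    """Group residues into connected components of a long-range contact graph.

    coords: list of (x, y, z) tuples, one representative point per residue.
    cutoff and min_separation are illustrative values, not from the preprint.
    """
    parent = list(range(len(coords)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root so output order is stable.
            parent[max(ri, rj)] = min(ri, rj)

    for i, j in combinations(range(len(coords)), 2):
        far_in_sequence = (j - i) >= min_separation
        close_in_space = math.dist(coords[i], coords[j]) <= cutoff
        if far_in_sequence and close_in_space:
            union(i, j)

    clusters = {}
    for i in range(len(coords)):
        clusters.setdefault(find(i), []).append(i)
    # Residues with no long-range contact stay as singletons and are dropped.
    return [members for members in clusters.values() if len(members) > 1]


# Hypothetical coordinates: residues 0 and 5 are spatial neighbours.
example = [(0, 0, 0), (30, 0, 0), (60, 0, 0), (90, 0, 0), (60, 30, 0), (5, 0, 0)]
print(contact_clusters(example))
```

A real decomposition would also need to handle base pairing, chain breaks, and the requirement that each unit be foldable on its own, none of which this sketch attempts.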
On top of this data foundation the authors present two complementary pretrained models with fixed checkpoints. SCRU-Seq is a graph neural network that predicts a sequence directly from a target structure in a single forward pass (described as O(1) inference, meaning one network evaluation per design rather than one per sampling step), while SCRU-Diff is a diffusion model that refines sequences iteratively for higher accuracy. Together they let users trade speed against fidelity within one framework.
SCRU-Seq takes a target 3D RNA backbone as input and predicts the nucleotide sequence in one non-autoregressive pass, so its inference cost does not grow with a number of sampling steps as it does for iterative competitors. SCRU-Diff conditions on the same target structure and denoises toward a sequence over multiple steps, gaining accuracy at the cost of additional compute. Both are trained on the SCRU-DB corpus of 61,000+ self-contained units (8,200+ clusters) and evaluated against established RNA inverse-folding systems such as NA-MPNN and RiboDiffusion. Performance is reported as native sequence recovery (NSR), the fraction of positions at which a designed sequence matches the native one. On the curated set112 benchmark, SCRU-Diff attains a Best NSR of 79.2% and SCRU-Seq attains 63.7% NSR, with the modular SCRU representation credited for the gains under limited 3D-structure data.
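For readers unfamiliar with the metric, native sequence recovery can be computed in a few lines of Python. The sketch below is illustrative only: the function names and example sequences are hypothetical, and reading "Best NSR" as the highest recovery among several sampled designs is an assumption, since the exact evaluation protocol is not spelled out here.

```python
def native_sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed nucleotide equals the native one."""
    if len(designed) != len(native):
        raise ValueError("designed and native sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed.upper(), native.upper()))
    return matches / len(native)


def best_nsr(samples, native):
    """Highest recovery among sampled designs (assumed meaning of 'Best NSR')."""
    return max(native_sequence_recovery(s, native) for s in samples)


# Hypothetical 8-nucleotide target and three candidate designs.
native = "GGCAUCGA"
samples = ["GGCUUCGA", "GACAUCCA", "AAAAAAAA"]

for s in samples:
    print(s, native_sequence_recovery(s, native))
print("best:", best_nsr(samples, native))  # best: 0.875
```

Under this reading, a single-pass model yields one design and therefore one NSR value per target, whereas a sampling model can report the best of several draws, which is worth keeping in mind when comparing the two headline figures.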
The framework targets researchers in RNA nanotechnology, synthetic biology, and RNA-based therapeutics who need to engineer sequences that fold into specified 3D conformations — for example designing structured aptamers, ribozymes, riboswitches, or scaffolds for RNA drug development. SCRU-Seq's single-pass speed suits high-throughput screening and large design libraries, while SCRU-Diff's iterative refinement fits cases where maximizing structural fidelity matters more than runtime, letting practitioners choose the appropriate point on the speed-accuracy curve.
By reframing RNA inverse folding around reusable self-contained units, this work offers a route past the field's defining obstacle — the scarcity of experimental 3D RNA structures — and reports state-of-the-art native sequence recovery on set112. The accompanying SCRU-DB database is itself a contribution that could support future RNA modeling efforts beyond sequence design. As of its 2026 preprint release the work has not yet been peer-reviewed, and no public code or model weights have been located, so independent reproduction and adoption remain to be demonstrated.