University of California, Berkeley
A conditional encoder-decoder language model that designs RNA sequences under simultaneous secondary-structure, fixed-base, and coding constraints.
GoForth addresses RNA sequence design as a conditional generative modeling problem: given a target secondary structure, a set of fixed bases, and coding constraints, generate nucleotide sequences that satisfy all of them simultaneously. This is the inverse of the folding problem — rather than predicting how a given RNA folds, the model proposes sequences likely to fold into a desired shape while respecting other user-specified requirements such as preserving a reading frame or pinning specific positions.
The model was developed by Michael Lindsey and released as a preprint (arXiv:2605.07608) in May 2026. (The arXiv record does not list an institutional affiliation; the author is a faculty member in the Department of Mathematics at the University of California, Berkeley, so the organization here is attributed by inference rather than from the paper itself.) Its central design choice is to train a forward encoder-decoder language model directly on witnessed RNA folds rather than distilling from an inverse-design teacher. The method separates three components that are usually entangled: a sequence prior, a forward folding sampler, and a likelihood oracle.
GoForth sits within the growing space of RNA design and foundation models — alongside structure-aware language models such as ERNIE-RNA and RNA-FM — but is specialized for constraint-satisfying generation rather than representation learning. It ships not just as research code but as a self-contained local workbench, lowering the barrier to interactive RNA design.
GoForth is a sequence-to-sequence autoregressive designer built as a PyTorch encoder-decoder language model with condition encoders for the different constraint modalities. Structure is supplied in dot-bracket notation with additional tokens for unknown (?) and paired-unknown (#) positions, while the base mask accepts concrete nucleotides alongside ambiguity tokens (?, N, #). Two released "small" checkpoints (~41 MB each) cover the main use cases: full_structure_small.pt for full-structure targets and fsb_partial_base_small.pt for partial structures and base constraints. Both are distributed as GitHub release assets with SHA256 verification and are fetched via scripts/download_checkpoints.sh. The model is trained on observed RNA fold data drawn from ETERNA100v2 and Rfam. Evaluation in the preprint covers full inverse-folding benchmarks and mixed structure/sequence/coding tasks, where the author reports fast, high-quality candidate generation along with learned semantic task embeddings and an emergent notion of design feasibility. Exact parameter counts are not stated in the public documentation.
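To make the input notation concrete, the following is a minimal sketch of how a partial dot-bracket target and a base mask could be parsed and checked against a candidate sequence. It is not GoForth's own code: the function names (parse_structure, satisfies) are hypothetical, the token sets are inferred from the description above, and treating ?, N, and # in the base mask as "any nucleotide" is an assumption. The check is only a necessary condition (fixed bases match and explicitly paired positions can form canonical or G-U wobble pairs); it does not show that the sequence actually folds into the target.

```python
from typing import Dict, List, Tuple

# Assumption: these token sets mirror the notation described in the text;
# the exact vocabulary accepted by GoForth may differ.
STRUCTURE_TOKENS = set("().?#")
BASE_WILDCARDS = set("?N#")  # assumed to mean "any nucleotide" in the base mask
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pairs
}


def parse_structure(structure: str) -> Dict[str, list]:
    """Split a partial dot-bracket string into explicit pairs and position classes."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    classes: Dict[str, List[int]] = {".": [], "?": [], "#": []}
    for i, tok in enumerate(structure):
        if tok not in STRUCTURE_TOKENS:
            raise ValueError(f"unexpected structure token {tok!r} at position {i}")
        if tok == "(":
            stack.append(i)
        elif tok == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        else:
            classes[tok].append(i)
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return {
        "pairs": sorted(pairs),
        "unpaired": classes["."],
        "unknown": classes["?"],
        "paired_unknown": classes["#"],
    }


def satisfies(sequence: str, structure: str, base_mask: str) -> bool:
    """Necessary-condition check: fixed bases match and explicit pairs can form."""
    if not (len(sequence) == len(structure) == len(base_mask)):
        raise ValueError("sequence, structure, and base mask must have equal length")
    for base, wanted in zip(sequence, base_mask):
        if wanted not in BASE_WILDCARDS and base != wanted:
            return False
    pairs = parse_structure(structure)["pairs"]
    return all((sequence[i], sequence[j]) in CANONICAL_PAIRS for i, j in pairs)
```

For example, satisfies("GCAAAUGC", "((#..?))", "G??NN#?C") passes the check, while changing the last base to A breaks the outer G-C pair and fails it. Positions marked # are recorded but not checked here, since their partner is unspecified.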
GoForth is aimed at researchers designing functional RNAs — riboswitches, aptamers, structured untranslated regions, and other elements where a specific fold must coexist with fixed motifs or a preserved coding sequence. Because it accepts coding constraints alongside structure, it is well suited to mRNA and synthetic-biology workflows where a designed sequence must both fold correctly and translate a required protein. The bundled workbench makes it usable by experimentalists without a deep ML background: a user enters constraints, generates candidates, and inspects ViennaRNA-folded structures before committing to synthesis and wet-lab validation.
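A coding constraint amounts to requiring that a designed region still translates to a given protein. The sketch below illustrates that check with the standard genetic code; it is an independent illustration rather than part of GoForth, and the function names (translate, preserves_protein, synonymous_codons) are hypothetical. The number of synonymous codons per residue indicates how much freedom a designer has to adjust structure without changing the protein.

```python
from itertools import product
from typing import List

# Standard genetic code, codons enumerated in U, C, A, G order with the
# third base varying fastest; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str, start: int = 0) -> str:
    """Translate an RNA string in the frame beginning at `start`.

    Trailing bases that do not complete a codon are ignored.
    """
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError(f"cannot translate codon {codon!r} at position {i}")
        protein.append(CODON_TABLE[codon])
    return "".join(protein)


def preserves_protein(candidate: str, protein: str, start: int = 0) -> bool:
    """True if the coding region of `candidate` translates exactly to `protein`."""
    region = candidate[start:start + 3 * len(protein)]
    return len(region) == 3 * len(protein) and translate(region) == protein


def synonymous_codons(amino_acid: str) -> List[str]:
    """All codons that encode `amino_acid` under the standard code."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]
```

For instance, AUGGCUUAA and AUGGCCUAA both translate to M, A, stop, so either satisfies a coding constraint of "MA*", whereas AUGGGCUAA does not because GGC encodes glycine.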
GoForth contributes a methodological shift for RNA design by showing that a forward language model trained on observed folds, rather than distilled from an inverse-design teacher, can generate sequences under combined structure, sequence, and coding constraints. The work is a recent single-author preprint with open Apache-2.0 code and a ready-to-run local workbench, so its long-term adoption and benchmark standing are still emerging, and its results should be read with the usual caveats for unreviewed work. Notable limitations include the absence of published parameter counts, reliance on secondary-structure (ViennaRNA) scoring rather than tertiary or pseudoknot-aware evaluation, and the release of only "small" checkpoints so far, which leaves headroom for larger models and broader benchmarking.