A 110M-parameter multimodal RNA language model that designs RNA sequences from secondary structure, consensus, and Gene Ontology constraints via discrete diffusion.
yakRNA Design is a generative RNA language model that composes new RNA sequences from semantic, structural, and functional specifications rather than from sequence context alone. Most RNA foundation models are trained primarily to understand sequences, for example by predicting structure, stability, or variant effects. yakRNA is built to write them: a researcher supplies a target secondary structure, a consensus motif, a description of desired function, or any combination of these, and the model samples sequences intended to satisfy all supplied constraints at once. This positions it alongside structure-to-sequence inverse-folding tools, but with the added ability to condition on Gene Ontology (GO) terms as a proxy for biological function.
The model was developed at Stanford University and released as a bioRxiv preprint in April 2026 under the title "yakRNA Design: A semantic multimodal RNA composer." It is distributed as a single 110M-parameter checkpoint with an inference-only code repository, so users download pretrained weights and run conditional generation without retraining. The "semantic multimodal" framing refers to the model's joint conditioning interface, which mixes natural-language-style functional labels (GO terms), structural notation (dot-bracket), and sequence-level constraints in a single generation call.
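To make the structural part of that interface concrete, the following is a minimal sketch of how a dot-bracket string encodes base pairs. It is not taken from the yakRNA repository: the function name `parse_dot_bracket` is hypothetical, and the sketch assumes plain nested notation using only `(`, `)`, and `.`. Whether the released generator accepts extended bracket symbols for pseudoknots is not stated in the material summarized here.

```python
def parse_dot_bracket(structure: str) -> list:
    """Return sorted 0-based (i, j) base pairs encoded by a dot-bracket string.

    Hypothetical helper for illustration; raises ValueError on unknown
    symbols or unbalanced brackets.
    """
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# A 12-nucleotide hairpin: a 4-base-pair stem closing a 4-nucleotide loop.
hairpin = "((((....))))"
print(parse_dot_bracket(hairpin))
```

A check like this is useful before submitting a target structure, since an unbalanced string cannot describe a valid fold.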
The headline validation comes from a wet-lab design campaign for frameshift-stimulating RNA elements, structured RNAs that program ribosomal −1 frameshifting. Of 84 candidates designed in a zero-shot setting, the authors report 17 experimentally active elements (about 20%), including at least one design with no detectable identity to any known sequence in the searched universe. This result is offered as evidence that the model generates genuinely novel functional RNA rather than memorized variants of training examples.
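The arithmetic behind the reported result can be sketched as follows. The counts (17 active out of 84 designs) come from the text above; the Wilson score interval is an addition made here to illustrate the sampling uncertainty of a proportion at this sample size, and is not a statistic reported by the authors.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


active, designed = 17, 84          # counts reported for the frameshift campaign
rate = active / designed
low, high = wilson_interval(active, designed)
print(f"hit rate = {rate:.1%}, 95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

With 84 designs the interval spans roughly 13% to 30%, so the observed rate of about 20% should be read as an estimate with considerable uncertainty rather than a precise figure.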
yakRNA Design is a 110M-parameter model built on a ModernBERT transformer backbone adapted for discrete diffusion over RNA token sequences, supporting designs up to 636 nucleotides. It was trained on the full Rfam database of structured RNA families, giving it broad coverage of non-coding RNA structural and functional space, and its GO-term conditioning vocabulary spans 280 functional categories. The released artifacts include the pretrained yakRNA_110M.pt checkpoint on Hugging Face and a command-line generator that accepts a YAML configuration plus structure, consensus, and GO-term arguments and emits FASTA output. The model's central empirical result is the frameshift-stimulating RNA campaign, in which 17 of 84 zero-shot designs were experimentally active. The repository is inference-only and does not document training hyperparameters; detailed evaluation and ablations are reported in the preprint.
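For readers unfamiliar with discrete diffusion, the toy sketch below shows one common formulation, in which a sequence starts fully masked and positions are filled in over several steps. This is an illustrative assumption, not the yakRNA sampler: the source does not specify the noising scheme or schedule, and `toy_denoiser`, `sample_sequence`, the mask symbol, and the even unmasking schedule are all hypothetical. The random denoiser stands in for the trained network, which would instead predict bases from the structure, consensus, and GO-term inputs. Only the 636-nucleotide limit is taken from the text.

```python
import math
import random
from typing import Optional

ALPHABET = "ACGU"
MASK = "_"          # placeholder for a position that has not been generated yet
MAX_LENGTH = 636    # design length limit stated for the model


def toy_denoiser(tokens: list, rng: random.Random) -> dict:
    """Hypothetical stand-in for a trained network.

    Proposes a uniformly random base for every masked position.
    """
    return {i: rng.choice(ALPHABET) for i, tok in enumerate(tokens) if tok == MASK}


def sample_sequence(length: int, steps: int = 8,
                    fixed: Optional[dict] = None, seed: int = 0) -> str:
    """Iteratively unmask a sequence, keeping any fixed (position -> base) entries."""
    if not 0 < length <= MAX_LENGTH:
        raise ValueError(f"length must be between 1 and {MAX_LENGTH}")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    rng = random.Random(seed)
    tokens = [MASK] * length
    for pos, base in (fixed or {}).items():
        if not 0 <= pos < length or base not in set(ALPHABET):
            raise ValueError(f"invalid fixed entry {pos}: {base!r}")
        tokens[pos] = base          # conserved positions are never resampled
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(tokens, rng)
        # Reveal an even share of the remaining masked positions at each step,
        # so the final step reveals everything that is left.
        n_reveal = math.ceil(len(masked) / (steps - step))
        for i in rng.sample(masked, n_reveal):
            tokens[i] = proposals[i]
    return "".join(tokens)


print(sample_sequence(24, steps=6, fixed={0: "G", 23: "C"}))
```

Passing `fixed` positions mimics infilling: the pinned bases stay put while the rest of the sequence is generated around them.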
yakRNA Design targets RNA engineering tasks where a researcher knows what structure or function they want but not which sequence to use. Concrete uses include designing frameshift-stimulating elements and other structured regulatory RNAs, scaffolding novel sequences around conserved motifs via infilling, and proposing function-specified candidates for synthetic biology, riboswitch and aptamer engineering, and mRNA element design. Because it conditions on GO terms, it is well suited to exploratory design where the goal is a functional class rather than a precise structure, and the inference-only distribution lowers the barrier for experimental labs to generate candidate libraries.
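When building a candidate library around a conserved motif, a lab would typically check generated sequences against the motif before ordering them. The sketch below assumes the consensus is written with standard IUPAC nucleotide codes, which is an assumption since the consensus format expected by the generator is not described here. The helper names `matches_consensus` and `to_fasta` and the example sequences are hypothetical.

```python
# Standard IUPAC ambiguity codes, written for the RNA alphabet.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def matches_consensus(sequence: str, consensus: str) -> bool:
    """True if every base is allowed by the IUPAC code at the same position."""
    if len(sequence) != len(consensus):
        return False
    for base, code in zip(sequence.upper(), consensus.upper()):
        if code not in IUPAC_RNA:
            raise ValueError(f"unknown IUPAC code {code!r}")
        if base not in IUPAC_RNA[code]:
            return False
    return True


def to_fasta(records: list) -> str:
    """Format (name, sequence) pairs as FASTA text, one sequence per line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


consensus = "GGNNNNYCC"                      # hypothetical motif
candidates = [("design_1", "GGAUACUCC"),     # satisfies the motif
              ("design_2", "GGAUACACC")]     # A at the Y position, rejected
kept = [(name, seq) for name, seq in candidates if matches_consensus(seq, consensus)]
print(to_fasta(kept), end="")
```

A filter like this only confirms sequence-level agreement with the motif; it says nothing about whether a candidate folds into the intended structure or has the intended function.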
yakRNA Design contributes to a shift in RNA AI from understanding-focused foundation models toward generative, function-conditioned design, paralleling earlier transitions in protein modeling from structure prediction to de novo design. Its most consequential result is experimental: zero-shot designs that are active in the lab, including a functional element with no recognizable relative in the searched sequence universe, supporting the claim that the model extrapolates beyond its training distribution. Adoption signals remain early given the recent preprint, and the inference-only release, sparse model card, and absence of a formal data card or published training recipe currently limit reproducibility and independent benchmarking.