Transformer-based generative language model for de novo RNA sequence design, pre-trained on 16 million sequences to generate novel, structurally stable RNAs.
GenerRNA is a Transformer-based generative language model developed by Preferred Networks, Inc. for de novo RNA sequence design. Published in PLOS ONE in 2024, it is among the first large language model systems applied specifically to RNA generation, extending the paradigm established by protein language models into the RNA domain. The model allows researchers to produce entirely novel RNA sequences without requiring predefined secondary structures or template sequences as input.
The model was pre-trained on approximately 16.09 million deduplicated RNA sequences drawn from the RNAcentral database, covering more than 2,600 Rfam families and 30 RNA types (mRNA was excluded to focus on non-coding and functional RNA classes). This broad coverage exposes GenerRNA to the diverse sequence-structure relationships present across RNA sequence space. Generated sequences are structurally distinct from natural ones while retaining comparable thermodynamic stability: mean minimum free energy (MFE) is -174.7 kcal/mol for generated sequences versus -177.9 kcal/mol for natural sequences (p = 0.811, Wilcoxon test), and approximately 70% of generated sequences have no identical alignment to any known sequence in public databases.
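This kind of stability evaluation can be reproduced in outline with standard tooling. Below is a minimal sketch, assuming the ViennaRNA Python bindings for secondary-structure prediction and SciPy's Wilcoxon rank-sum test; the two sequence lists are toy placeholders, not the paper's evaluation data.

```python
# Minimal sketch of the stability comparison: fold each sequence with
# ViennaRNA, collect minimum free energies, and test whether the generated
# and natural distributions differ. The sequence lists are toy placeholders.
import RNA  # ViennaRNA Python bindings (pip install ViennaRNA)
from scipy.stats import ranksums

generated = [  # placeholder model outputs
    "GGGAAACGCUCGAGUAGAGCGUUUCCC",
    "GCGCGGCACCGUCCGCGGAACAAACGG",
]
natural = [  # placeholder RNAcentral samples
    "GGGCUAUUAGCUCAGUUGGUUAGAGCGCACCCCUGAUAAGGGUGAGGUCGCUGGUUCAAAUCCGGCUAGCCCA",
    "GCCGGGUAGCUCAGUCGGUAGAGCAUGGGACUUUUAAUCCCAGGGUCGUGGGUUCGAGCCCCACCCCGGCCG",
]

def mfe(seq: str) -> float:
    """Minimum free energy (kcal/mol) of the predicted secondary structure."""
    _structure, energy = RNA.fold(seq)
    return energy

gen_mfe = [mfe(s) for s in generated]
nat_mfe = [mfe(s) for s in natural]

# The paper reports p = 0.811, i.e. no significant stability difference.
stat, p = ranksums(gen_mfe, nat_mfe)
print(f"generated mean MFE: {sum(gen_mfe) / len(gen_mfe):.1f} kcal/mol")
print(f"natural mean MFE:   {sum(nat_mfe) / len(nat_mfe):.1f} kcal/mol")
print(f"Wilcoxon rank-sum p = {p:.3f}")
```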
GenerRNA uses a decoder-only Transformer architecture with 350 million parameters distributed across 24 layers, a model dimension of 1,280, and a context window of 1,024 tokens (roughly 4,000 nucleotides). The architecture follows the standard autoregressive language-modeling paradigm: given the preceding tokens, the model learns to predict the next one, with a byte-pair encoding (BPE) tokenizer compressing common RNA subsequences into single vocabulary items. This design mirrors the GPT-style approach that proved effective for protein sequence generation.
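The next-token objective is the standard causal language-modeling setup. The sketch below builds a generic GPT-2-style model at the reported scale with the HuggingFace transformers library; the vocabulary size and attention-head count are illustrative assumptions, not values from the paper.

```python
# Sketch of the decoder-only, next-token-prediction setup described above,
# using a generic GPT-2-style configuration at GenerRNA's reported scale.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_000,  # assumption: BPE vocabulary over RNA subsequences
    n_positions=1024,   # 1,024-token context (~4,000 nt after BPE merging)
    n_embd=1280,        # model dimension reported in the paper
    n_layer=24,         # 24 Transformer decoder layers
    n_head=20,          # assumption: 1280 / 20 = 64-dimensional heads
)
model = GPT2LMHeadModel(config)

# One training step: predict token t+1 from tokens 1..t. Passing
# labels=input_ids makes the library shift targets internally and
# return the causal cross-entropy loss.
input_ids = torch.randint(0, config.vocab_size, (2, 128))  # dummy BPE tokens
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
print(f"next-token cross-entropy: {loss.item():.3f}")
```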
Pre-training used release 22 of the RNAcentral database: 34.39 million sequences were reduced to 16.09 million after deduplication, totaling 11.6 billion nucleotides. Training ran for 12 epochs over approximately 4 days on 16 NVIDIA A100 GPUs. The model is available on HuggingFace Hub and requires a CUDA environment with at least 8 GB of VRAM and PyTorch 2.0 or later. Fine-tuning experiments showed that the pre-trained representations transfer effectively to protein-binding RNA design tasks with comparatively small labeled datasets.
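Loading and sampling then follow the usual causal-LM pattern. A minimal sketch, assuming a transformers-compatible checkpoint: the repository id below is a placeholder (substitute the id given in the paper's availability section), and the released checkpoint may ship its own tokenizer or loading code.

```python
# Sketch: load the released checkpoint and sample de novo RNA sequences.
# The repository id is a placeholder assumption. Requires a CUDA GPU with
# >= 8 GB VRAM and PyTorch >= 2.0, per the stated requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pfnet/GenerRNA"  # assumption: substitute the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).to("cuda").eval()

# Unconditional (zero-shot) generation: start from the BOS token and let
# the model sample token by token. Assumes the tokenizer defines a BOS id.
input_ids = torch.tensor([[tokenizer.bos_token_id]], device="cuda")
with torch.no_grad():
    out = model.generate(
        input_ids,
        max_new_tokens=256,   # up to the 1,024-token context window
        do_sample=True,       # stochastic sampling, not greedy decoding
        temperature=1.0,
        top_k=50,
        num_return_sequences=4,
    )
for seq in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(seq)
```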
GenerRNA is applicable wherever researchers need to sample novel RNA sequences for functional screening or therapeutic development. Drug discovery teams can use it to design RNA aptamers and RNA-based inhibitors targeting specific proteins, complementing structure-based design and experimental selection approaches such as SELEX. Synthetic biologists can fine-tune the model to engineer riboswitches, RNA sensors, or regulatory non-coding RNAs with tailored properties, as sketched below. The model's zero-shot generation mode is also useful for exploring the structural diversity of RNA sequence space, generating candidate libraries for high-throughput screening campaigns without prior knowledge of the target structure.
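As a concrete example of the fine-tuning pathway, the sketch below continues causal-LM training on a small file of sequences known to bind a target protein (for instance, SELEX hits), so that subsequent sampling is biased toward binder-like sequences. The file path, repository id, and hyperparameters are illustrative assumptions.

```python
# Sketch: fine-tune the pre-trained model on a small set of known binders.
# File path, repository id, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

repo_id = "pfnet/GenerRNA"  # assumption: substitute the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:  # GPT-style tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id)

# One RNA sequence per line in a plain-text file of known binders.
dataset = load_dataset("text", data_files={"train": "binders.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="generrna-ft",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    # mlm=False keeps the causal (next-token) objective from pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```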
GenerRNA establishes a proof of concept that the large-scale generative language modeling paradigm, which has been highly productive in protein design, extends meaningfully to RNA. Its publication demonstrates that a decoder-only Transformer pre-trained on public RNA databases can capture sufficient sequence-structure relationships to generate thermodynamically stable, novel sequences across diverse RNA families. A notable limitation is that the model generates primary sequences without directly optimizing for three-dimensional structure or explicit molecular interactions, meaning functional validation of generated candidates still requires downstream computational structure prediction or wet-lab experimentation. The model's availability on HuggingFace and its demonstrated fine-tuning pathway position it as a practical starting point for groups working on RNA-based therapeutics and synthetic biology tools.
Zhao, Y., Oono, K., Takizawa, H., & Kotera, M. (2024). GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE, 19(10), e0310814.
DOI: 10.1371/journal.pone.0310814