Transformer-based generative language model for de novo RNA sequence design, pre-trained on 16 million sequences to generate novel, structurally stable RNAs.
GenerRNA is a Transformer-based generative language model developed by Preferred Networks, Inc. for de novo RNA sequence design. Published in PLOS ONE in 2024, it is among the first large language model systems applied specifically to RNA generation, extending the paradigm established by protein language models into the RNA domain. The model allows researchers to produce entirely novel RNA sequences without requiring predefined secondary structures or template sequences as input.
The model was pre-trained on approximately 16.09 million deduplicated RNA sequences drawn from the RNAcentral database, covering more than 2,600 Rfam families and 30 RNA types (mRNA was excluded to focus on non-coding and functional RNA classes). This broad coverage exposes GenerRNA to the diverse sequence-structure relationships present across RNA sequence space. Generated sequences are structurally distinct from natural ones while retaining comparable thermodynamic stability: mean minimum free energy (MFE) is -174.7 kcal/mol for generated sequences versus -177.9 kcal/mol for natural sequences (p = 0.811, Wilcoxon test), and approximately 70% of generated sequences have no identical alignment to any known sequence in public databases.
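This kind of stability evaluation can be reproduced in outline with standard tooling. Below is a minimal sketch, assuming the ViennaRNA Python bindings for secondary-structure prediction and SciPy's Wilcoxon rank-sum test; the two sequence lists are toy placeholders, not the paper's evaluation data.

```python
# Minimal sketch of the stability comparison: fold each sequence with
# ViennaRNA, collect minimum free energies, and test whether the generated
# and natural distributions differ. The sequence lists are toy placeholders.
import RNA  # ViennaRNA Python bindings (pip install ViennaRNA)
from scipy.stats import ranksums

generated = [  # placeholder model outputs
    "GGGAAACGCUCGAGUAGAGCGUUUCCC",
    "GCGCGGCACCGUCCGCGGAACAAACGG",
]
natural = [  # placeholder RNAcentral samples
    "GGGCUAUUAGCUCAGUUGGUUAGAGCGCACCCCUGAUAAGGGUGAGGUCGCUGGUUCAAAUCCGGCUAGCCCA",
    "GCCGGGUAGCUCAGUCGGUAGAGCAUGGGACUUUUAAUCCCAGGGUCGUGGGUUCGAGCCCCACCCCGGCCG",
]

def mfe(seq: str) -> float:
    """Minimum free energy (kcal/mol) of the predicted secondary structure."""
    _structure, energy = RNA.fold(seq)
    return energy

gen_mfe = [mfe(s) for s in generated]
nat_mfe = [mfe(s) for s in natural]

# The paper reports p = 0.811, i.e. no significant stability difference.
stat, p = ranksums(gen_mfe, nat_mfe)
print(f"generated mean MFE: {sum(gen_mfe) / len(gen_mfe):.1f} kcal/mol")
print(f"natural mean MFE:   {sum(nat_mfe) / len(nat_mfe):.1f} kcal/mol")
print(f"Wilcoxon rank-sum p = {p:.3f}")
```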
GenerRNA uses a decoder-only Transformer architecture with 350 million parameters distributed across 24 layers, a model dimension of 1,280, and a context window of 1,024 tokens (roughly 4,000 nucleotides). The architecture follows the standard autoregressive language-modeling paradigm: given the preceding tokens, the model learns to predict the next one, with a byte-pair encoding (BPE) tokenizer compressing common RNA subsequences into single vocabulary items. This design mirrors the GPT-style approach that proved effective for protein sequence generation.
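The next-token objective is the standard causal language-modeling setup. The sketch below builds a generic GPT-2-style model at the reported scale with the HuggingFace transformers library; the vocabulary size and attention-head count are illustrative assumptions, not values from the paper.

```python
# Sketch of the decoder-only, next-token-prediction setup described above,
# using a generic GPT-2-style configuration at GenerRNA's reported scale.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_000,  # assumption: BPE vocabulary over RNA subsequences
    n_positions=1024,   # 1,024-token context (~4,000 nt after BPE merging)
    n_embd=1280,        # model dimension reported in the paper
    n_layer=24,         # 24 Transformer decoder layers
    n_head=20,          # assumption: 1280 / 20 = 64-dimensional heads
)
model = GPT2LMHeadModel(config)

# One training step: predict token t+1 from tokens 1..t. Passing
# labels=input_ids makes the library shift targets internally and
# return the causal cross-entropy loss.
input_ids = torch.randint(0, config.vocab_size, (2, 128))  # dummy BPE tokens
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
print(f"next-token cross-entropy: {loss.item():.3f}")
```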
Pre-training used release 22 of the RNAcentral database: 34.39 million sequences were reduced to 16.09 million after deduplication, totaling 11.6 billion nucleotides. Training ran for 12 epochs over approximately 4 days on 16 NVIDIA A100 GPUs. The model is available on HuggingFace Hub and requires a CUDA environment with at least 8 GB of VRAM and PyTorch 2.0 or later. Fine-tuning experiments showed that the pre-trained representations transfer effectively to protein-binding RNA design tasks with comparatively small labeled datasets.
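Loading and sampling then follow the usual causal-LM pattern. A minimal sketch, assuming a transformers-compatible checkpoint: the repository id below is a placeholder (substitute the id given in the paper's availability section), and the released checkpoint may ship its own tokenizer or loading code.

```python
# Sketch: load the released checkpoint and sample de novo RNA sequences.
# The repository id is a placeholder assumption. Requires a CUDA GPU with
# >= 8 GB VRAM and PyTorch >= 2.0, per the stated requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pfnet/GenerRNA"  # assumption: substitute the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).to("cuda").eval()

# Unconditional (zero-shot) generation: start from the BOS token and let
# the model sample token by token. Assumes the tokenizer defines a BOS id.
input_ids = torch.tensor([[tokenizer.bos_token_id]], device="cuda")
with torch.no_grad():
    out = model.generate(
        input_ids,
        max_new_tokens=256,   # up to the 1,024-token context window
        do_sample=True,       # stochastic sampling, not greedy decoding
        temperature=1.0,
        top_k=50,
        num_return_sequences=4,
    )
for seq in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(seq)
```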
GenerRNA is applicable wherever researchers need to sample novel RNA sequences for functional screening or therapeutic development. Drug discovery teams can use it to design RNA aptamers and RNA-based inhibitors targeting specific proteins, complementing structure-based design and experimental selection approaches such as SELEX. Synthetic biologists can fine-tune the model to engineer riboswitches, RNA sensors, or regulatory non-coding RNAs with tailored properties, as sketched below. The model's zero-shot generation mode is also useful for exploring the structural diversity of RNA sequence space, generating candidate libraries for high-throughput screening campaigns without prior knowledge of the target structure.
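As a concrete example of the fine-tuning pathway, the sketch below continues causal-LM training on a small file of sequences known to bind a target protein (for instance, SELEX hits), so that subsequent sampling is biased toward binder-like sequences. The file path, repository id, and hyperparameters are illustrative assumptions.

```python
# Sketch: fine-tune the pre-trained model on a small set of known binders.
# File path, repository id, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

repo_id = "pfnet/GenerRNA"  # assumption: substitute the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:  # GPT-style tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id)

# One RNA sequence per line in a plain-text file of known binders.
dataset = load_dataset("text", data_files={"train": "binders.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="generrna-ft",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    # mlm=False keeps the causal (next-token) objective from pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```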
GenerRNA establishes a proof of concept that the large-scale generative language modeling paradigm, which has been highly productive in protein design, extends meaningfully to RNA. Its publication demonstrates that a decoder-only Transformer pre-trained on public RNA databases can capture sufficient sequence-structure relationships to generate thermodynamically stable, novel sequences across diverse RNA families. A notable limitation is that the model generates primary sequences without directly optimizing for three-dimensional structure or explicit molecular interactions, meaning functional validation of generated candidates still requires downstream computational structure prediction or wet-lab experimentation. The model's availability on HuggingFace and its demonstrated fine-tuning pathway position it as a practical starting point for groups working on RNA-based therapeutics and synthetic biology tools.
Zhao, Y., Oono, K., Takizawa, H., & Kotera, M. (2024). GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE, 19(10), e0310814.
DOI: 10.1371/journal.pone.0310814