A 1.6-billion-parameter RNA foundation model pretrained on 42 million non-coding RNA sequences, achieving state-of-the-art performance across 24 of 26 RNA understanding benchmarks.
AIDO.RNA is a large-scale RNA foundation model developed by GenBio AI as part of the AIDO (AI-Driven Digital Organism) platform. Released in November 2024, it is a 1.6-billion-parameter encoder-only transformer pretrained on 42 million non-coding RNA sequences sourced from RNAcentral, the most comprehensive database of non-coding RNA sequences spanning all kingdoms of life. The model addresses a persistent gap in the RNA modeling landscape: while protein language models have scaled to billions of parameters and DNA models have followed suit, RNA-specific foundation models have historically operated at smaller scales with narrower task coverage. AIDO.RNA extends the foundation model paradigm to the full complexity of RNA biology, spanning structural prediction, post-transcriptional regulation, genetic regulation, and RNA sequence design.
RNA is biologically distinct from DNA in ways that make dedicated modeling worthwhile. Non-coding RNAs — including microRNAs, long non-coding RNAs, ribosomal RNAs, transfer RNAs, small nuclear RNAs, and many other classes — perform structural, catalytic, and regulatory functions that cannot be inferred from DNA sequence alone. RNA secondary structure, with its base-pairing stems and loops, is determined by thermodynamic constraints that interact with sequence in complex, context-dependent ways. Translational efficiency of messenger RNAs depends on codon usage, 5' and 3' untranslated region structure, and the interplay between these elements. Splicing is governed by intronic and exonic regulatory sequences whose combinatorial logic is far from fully understood. A model pretrained at sufficient scale on diverse RNA sequences has the potential to internalize these various layers of sequence-function grammar and provide a flexible backbone for downstream applications across the breadth of RNA biology.
AIDO.RNA achieves state-of-the-art performance on 24 out of 26 RNA sequence understanding tasks in a comprehensive benchmark suite assembled by the authors, encompassing RNA secondary structure prediction, translation efficiency estimation, splice site classification, ncRNA family classification, RNA modification site detection, and cross-species regulatory tasks. This breadth of performance distinguishes AIDO.RNA from prior single-task RNA models and positions it as a general-purpose backbone for RNA bioinformatics. The model is released with pretrained weights on Hugging Face and fine-tuning code through the AIDO.ModelGenerator framework on GitHub.
AIDO.RNA is an encoder-only transformer based on the BERT architecture, pretrained using a masked language modeling (MLM) objective. The architecture comprises 32 transformer layers with 32 attention heads per layer, a hidden size of 2,048, and a feed-forward hidden size of 5,440. Positional information is encoded using Rotary Position Embeddings (RoPE), and the model employs LayerNorm with SwiGLU activation functions — a combination that reflects current best practices from large language model development. Training was implemented using the Megatron-LM distributed training framework with FlashAttention-2 for efficient attention computation and BFloat16 mixed-precision arithmetic. The MLM masking protocol follows the standard BERT approach: 15% of input nucleotide tokens are selected, of which 80% are replaced with a mask token, 10% with a random nucleotide, and 10% left unchanged.
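To make the masking protocol concrete, the sketch below implements it in PyTorch. The token IDs, mask ID, and function name are illustrative placeholders and are not taken from AIDO.RNA's actual vocabulary or training code; the 15% selection rate and the 80/10/10 split follow the description above.

```python
import torch

MASK_ID = 5                    # hypothetical [MASK] token id
NUCLEOTIDE_IDS = (0, 1, 2, 3)  # hypothetical ids for A, C, G, U

def mask_for_mlm(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (corrupted_ids, labels) for one masked-language-modeling step."""
    labels = input_ids.clone()
    # Select ~15% of positions to participate in the MLM loss.
    selected = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~selected] = -100  # positions outside the 15% are ignored by the loss

    corrupted = input_ids.clone()
    split = torch.rand_like(input_ids, dtype=torch.float)
    # 80% of selected positions are replaced with the mask token ...
    corrupted[selected & (split < 0.8)] = MASK_ID
    # ... 10% with a random nucleotide ...
    random_slots = selected & (split >= 0.8) & (split < 0.9)
    random_tokens = torch.randint(len(NUCLEOTIDE_IDS), input_ids.shape)
    corrupted[random_slots] = random_tokens[random_slots]
    # ... and the remaining 10% are left unchanged.
    return corrupted, labels

# Toy usage: a batch of 16-nucleotide sequences encoded as integer ids.
ids = torch.randint(len(NUCLEOTIDE_IDS), (2, 16))
corrupted, labels = mask_for_mlm(ids)
```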
The training corpus consists of 41.5 million distinct non-coding RNA sequences from RNAcentral, comprising approximately 30 billion unique nucleotides over which the model was trained for six epochs. This corpus spans microRNAs (miRNAs), long non-coding RNAs (lncRNAs), ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), and other ncRNA classes, with representation from bacteria, archaea, plants, fungi, and animals. The breadth of ncRNA class coverage is important: each RNA class has distinct structural motifs and functional logic, and a model trained across all classes develops representations that capture general principles of RNA sequence-structure-function relationships rather than overfitting to the peculiarities of any single class.
On the benchmark suite, selected results illustrate the model's performance profile. For RNA secondary structure prediction, AIDO.RNA achieves an F1 score of 0.787 on the bpRNA-TS0 test set, outperforming RNAErnie and RiNALMo by substantial margins, a notable result given that secondary structure prediction is a core RNA biophysics task. For ncRNA family classification, the model reaches 0.993 accuracy. For RNA modification site prediction, it achieves an average AUROC of 0.971 across modification types. For translation efficiency prediction, the Pearson correlation of 0.560 averaged across seven organisms represents a 42% relative improvement over CaLM. The model is also evaluated on RNA sequence design tasks, demonstrating that fine-tuned variants can generate sequences with specified functional properties. The two tasks on which AIDO.RNA fell short of the state of the art are not broken out in detail here, but the authors note that the overall benchmark profile reflects broad generality across RNA biology rather than optimization for any single application.
The HuggingFace release includes both the base 1.6B model (AIDO.RNA-1.6B) and a coding-sequence-specific variant (AIDO.RNA-1.6B-CDS) pretrained with additional emphasis on messenger RNA coding sequences, providing a specialist option for researchers focused on protein-coding transcripts.
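The snippet below sketches one way to load the released weights and extract per-nucleotide embeddings through the Hugging Face transformers API. The repository identifier and the availability of remote-code loading are assumptions inferred from the release described above; consult the model card and the AIDO.ModelGenerator documentation for the officially supported loading path.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "genbio-ai/AIDO.RNA-1.6B"  # assumed repository id; check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

sequence = "AUGGCUACGUUAGCCUAG"  # toy RNA sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
print(embeddings.shape)
```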
AIDO.RNA serves a wide range of researchers working across the expanding landscape of RNA biology and RNA-based therapeutics. Structural biologists and computational chemists can use the model as a sequence encoder to predict RNA secondary structure, identify conserved structural motifs, and prioritize targets for experimental structure determination. RNA therapeutics researchers — developing antisense oligonucleotides, siRNAs, miRNA mimics, or mRNA-based vaccines and therapeutics — can apply AIDO.RNA to estimate translation efficiency, predict off-target hybridization, and optimize codon usage for maximum protein output in therapeutic contexts. Splicing researchers and RNA processing biologists benefit from the model's strong splice site prediction performance and its ability to model the regulatory sequence grammar governing alternative splicing, exon inclusion, and intron retention events. Epitranscriptomics researchers studying RNA modifications such as m6A, pseudouridine, and m5C can apply the model's modification site prediction capability to prioritize sites for experimental validation. For synthetic biology applications, AIDO.RNA's sequence design capability enables the rational engineering of regulatory RNA elements with specified functional properties, a valuable tool for metabolic engineering and genetic circuit design in both prokaryotic and eukaryotic systems.
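For readers wiring the model into one of these applications, the sketch below shows a generic downstream pattern: mean-pooling the encoder's token embeddings and regressing a scalar such as translation efficiency. This is an illustrative setup under assumed shapes, not the fine-tuning recipe shipped with AIDO.ModelGenerator; the hidden size matches the architecture described earlier.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 2048  # hidden size of the 1.6B encoder

class RegressionHead(nn.Module):
    """Mean-pool token embeddings over non-padding positions, then regress a scalar."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, 512),
            nn.GELU(),
            nn.Linear(512, 1),
        )

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(pooled).squeeze(-1)

# Shapes-only usage with random stand-ins for encoder outputs.
embeddings = torch.randn(4, 128, HIDDEN_SIZE)        # (batch, seq_len, hidden)
attention_mask = torch.ones(4, 128, dtype=torch.long)
head = RegressionHead()
predictions = head(embeddings, attention_mask)        # (batch,)
```

The head can be trained jointly with the encoder or on frozen embeddings, depending on compute budget and dataset size.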
AIDO.RNA sets a new standard for RNA foundation models in both scale and breadth of task coverage. Achieving state-of-the-art results on 24 of 26 RNA understanding benchmarks without task-specific architectural modifications demonstrates the power of large-scale pretraining for RNA biology and validates the foundation model approach for this modality. As a component of the AIDO multiscale platform, AIDO.RNA is positioned to contribute to cross-modal biological reasoning in combination with AIDO.DNA, AIDO.Protein, and AIDO.Cell, part of a long-term vision of modeling biology at all relevant scales within a unified computational framework. The open release of model weights, including a coding-sequence-optimized variant, and integration with the AIDO.ModelGenerator fine-tuning stack lower the barriers to adoption for research groups without dedicated ML infrastructure. A notable limitation of the current model is that it was pretrained on non-coding RNA sequences from RNAcentral, with limited coverage of full mRNA sequences including untranslated regions and coding sequences in their native genomic context; the CDS variant partially addresses this gap, but it remains relevant for researchers working on full-length transcript modeling. The field will benefit from future work examining how AIDO.RNA representations compare to those of RNA structure prediction tools that incorporate explicit thermodynamic priors, and whether the foundation model approach can ultimately surpass physics-based approaches for RNA secondary structure prediction at scale.