BERT-based RNA foundation model pre-trained on 1 billion sequences, achieving state-of-the-art performance in secondary structure, tertiary structure, and functional annotation tasks.
UNI-RNA is a universal RNA foundation model developed by DP Technology and released in July 2023. It was pre-trained on an unprecedented dataset of 1 billion RNA sequences drawn from multiple species and diverse RNA categories, making it the largest-scale RNA pre-training effort reported at the time. The model learns context-aware representations that capture evolutionary conservation patterns and structural constraints embedded in RNA sequences without requiring labeled data, then transfers this knowledge to downstream prediction tasks through fine-tuning.
The model addresses a long-standing bottleneck in RNA research: the scarcity of experimentally determined structures relative to the rapidly growing number of known RNA sequences. By learning from the statistical regularities of a billion sequences, UNI-RNA extracts signals that correlate with structure and function in a way that was previously inaccessible to smaller models trained on curated, annotated datasets.
UNI-RNA is a family of models spanning 25 million to 400 million parameters. The 400M-parameter variant consistently achieves the strongest downstream performance, and the reported scaling results suggest performance plateaus near that size given the architecture and training data. Access to the model and associated notebooks is provided through DP Technology's Bohrium platform rather than a traditional open-source repository.
UNI-RNA uses a BERT-style transformer encoder as its backbone, enhanced by three architectural choices that distinguish it from earlier RNA language models. Rotary position embeddings replace absolute positional encodings, providing better generalization across variable-length RNA sequences. Flash attention reduces memory overhead and accelerates training on long sequences. Fused layer normalization combines normalization operations to improve throughput and numerical stability during pre-training.
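To make the rotary position embedding choice concrete, the sketch below applies RoPE to a tensor of token features. It follows the standard RoFormer/GPT-NeoX formulation rather than UNI-RNA's actual (unpublished) implementation; the function name, shapes, and base frequency are illustrative assumptions.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (seq_len, dim) tensor.

    Each channel pair is rotated by an angle proportional to token
    position, so dot products depend only on relative offsets.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-channel rotation frequencies, theta_i = base^(-2i/dim).
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by the position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because the rotation encodes position multiplicatively inside the attention dot product, the scheme extrapolates to sequence lengths beyond those seen in training better than learned absolute embeddings, which is the motivation for using it on variable-length RNAs.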
The training corpus comprises sequences from five established RNA databases covering both coding RNAs (mRNAs) and non-coding RNAs (ncRNAs, including rRNA, tRNA, lncRNA, and others), spanning multiple kingdoms of life. MMseqs2 clustering was applied before pre-training to reduce redundancy while preserving sequence diversity. The model was pre-trained with masked language modeling on this 1-billion-sequence corpus. For structure prediction, pairwise relationships between nucleotides are represented as two-dimensional maps: base-pairing contact maps for secondary structure and inter-nucleotide distance matrices for tertiary geometry. Benchmarked on the bpRNA-1m and PDB-derived test sets, UNI-RNA outperformed all previous methods in F1-score, precision, and recall at the time of publication.
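As a concrete illustration of the masked language modeling objective, the snippet below corrupts an RNA sequence BERT-style and builds the labels a cross-entropy loss would consume. The four-letter vocabulary, 15% mask rate, and 80/10/10 corruption split follow BERT conventions and are assumptions about UNI-RNA's exact recipe, not confirmed details.

```python
import torch

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "[MASK]": 4}

def mask_tokens(seq: str, mask_rate: float = 0.15):
    """Return (input_ids, labels) for one RNA sequence.

    Labels are -100 (ignored by cross-entropy) everywhere except at
    masked positions, where they hold the original nucleotide id.
    """
    ids = torch.tensor([VOCAB[nt] for nt in seq])
    labels = torch.full_like(ids, -100)
    picked = torch.rand(len(ids)) < mask_rate
    labels[picked] = ids[picked]
    rand = torch.rand(len(ids))
    ids[picked & (rand < 0.8)] = VOCAB["[MASK]"]   # 80%: replace with [MASK]
    swap = picked & (rand >= 0.8) & (rand < 0.9)   # 10%: replace with random base
    ids[swap] = torch.randint(0, 4, ids.shape)[swap]
    return ids, labels                             # remaining 10%: keep original

input_ids, labels = mask_tokens("AUGGCUACGUAGC")
```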
UNI-RNA is suited to any RNA research task where learned sequence representations can substitute for or complement experimental data. Structural biologists can use its secondary structure and tertiary contact predictions as priors before investing in experimental validation. Functional genomics researchers can fine-tune the backbone for regulatory element annotation, RNA-binding protein interaction prediction, or classification of novel ncRNA families. In therapeutic contexts, UNI-RNA may help accelerate mRNA optimization, identification of RNA-based drug targets, and characterization of ribozymes or aptamers. Hosting on the Bohrium platform provides notebook-based access that lowers the barrier for wet-lab groups without dedicated computational infrastructure.
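A minimal sketch of the fine-tuning pattern described above: a classification head on top of a pre-trained encoder. The `encoder` argument stands in for the UNI-RNA backbone, which is not distributed as an importable package; the mean-pooling and linear head are common choices, not confirmed details of the paper.

```python
import torch
import torch.nn as nn

class RNAClassifier(nn.Module):
    """Pre-trained backbone plus task head, e.g. for ncRNA family classification."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, n_classes: int):
        super().__init__()
        self.encoder = encoder              # hypothetical UNI-RNA-style backbone
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_ids)    # (batch, seq_len, hidden_dim)
        pooled = hidden.mean(dim=1)         # mean-pool over sequence positions
        return self.head(pooled)            # (batch, n_classes) logits
```

Freezing the encoder and training only the head gives a cheap linear-probe baseline; unfreezing the backbone recovers full fine-tuning at higher compute cost.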
UNI-RNA established the largest RNA pre-training benchmark at the time of its release and demonstrated that scaling sequence-level training beyond 100 million parameters yields measurable gains across a range of downstream RNA tasks. It contributed to a broader wave of RNA foundation models, including RNA-FM, RiNALMo, and ERNIE-RNA, that collectively shifted the field toward transfer learning approaches analogous to those pioneered in protein language models. Two limitations stand out. First, UNI-RNA remains a preprint as of early 2026 and has not undergone formal peer review, and its access model through Bohrium notebooks is less open than fully released repositories such as RNA-FM's, restricting community-driven reproducibility. Second, the paper's scope is primarily limited to sequence-based tasks; explicitly modeling RNA 3D coordinates at atomic resolution, as done by tools like RhoFold+, lies outside UNI-RNA's current framework.
Wang, X., Gu, R., Chen, Z., Li, Y., Ji, X., Ke, G., & Wen, H. (2023). UNI-RNA: Universal Pre-trained Models Revolutionize RNA Research. bioRxiv, 2023.07.11.548588.
DOI: 10.1101/2023.07.11.548588