An mRNA language foundation model trained on ~115M coding sequences across the tree of life for unified mRNA sequence perception and generation.
NUWA is a large language foundation model for messenger RNA (mRNA), developed by researchers at Kitasato University and released as a bioRxiv preprint in late 2025 (with revisions into early 2026). Most nucleotide language models to date have been trained on genomic DNA or on non-coding/regulatory contexts; NUWA instead focuses specifically on protein-coding sequences, learning the statistical structure of mRNA across a broad sampling of the tree of life. This positions it as a specialized counterpart to general DNA models like the Nucleotide Transformer and to protein language models such as ESM.
The model is designed to be "unified" in the sense that a single pretrained backbone supports both sequence perception (understanding and scoring existing mRNA) and sequence generation (proposing novel coding sequences with desired properties). This dual capability targets a practical bottleneck in mRNA therapeutic and vaccine design, where candidate sequences must be both biologically plausible and optimized for expression.
NUWA was trained across a broad taxonomic range, drawing coding sequences from bacterial, eukaryotic, and archaeal genomes. This exposes the model to the codon usage and sequence patterns that differ across the domains of life.
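To make "codon usage" concrete, the following minimal Python sketch computes the relative frequency of each codon in a coding sequence. It is illustrative only and is not taken from the NUWA preprint; the function name `codon_usage` and the toy sequences are assumptions made for this example.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Return the relative frequency of each codon in an in-frame coding sequence.

    Hypothetical helper for illustration; not part of NUWA.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA alphabets
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


# Two toy sequences encoding the same peptide (Met-Leu-Leu-stop)
# with different synonymous leucine codons (CTG vs. TTA).
toy_a = "ATGCTGCTGTAA"
toy_b = "ATGTTATTATAA"
usage_a = codon_usage(toy_a)
usage_b = codon_usage(toy_b)
```

Both toy sequences translate to the same peptide under the standard genetic code, yet their codon frequency profiles differ. This kind of synonymous variation is the signal that differs between organisms and that a model trained across many species can observe.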
NUWA uses a BERT-style transformer architecture trained with self-supervised masked-language modeling on mRNA coding sequences. According to the preprint, the training corpus comprised approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, for roughly 115 million coding sequences in total. The model is released under a CC BY-NC-ND license. Because NUWA is a recent preprint, its full architectural hyperparameters, parameter count, and the availability of released weights should be confirmed against the latest version of the manuscript; benchmark comparisons against other nucleotide language models are reported in the paper itself.
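The masked-language-modeling objective can be illustrated with a short Python sketch that hides a fraction of the tokens in a coding sequence and records the originals as prediction targets. The codon-level tokenization, the 15% masking rate (the convention from the original BERT paper), and the names `mask_codons` and `MASK` are assumptions made for this example; the source does not state NUWA's actual tokenizer or masking settings.

```python
import random

MASK = "[MASK]"  # placeholder token name; assumed, not NUWA's vocabulary


def mask_codons(cds: str, mask_rate: float = 0.15, seed: int = 0):
    """Split a coding sequence into codons and mask a random subset.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the codon the model would be trained to recover.
    """
    if not cds or len(cds) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")
    rng = random.Random(seed)  # seeded for reproducibility
    tokens = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_mask = max(1, round(mask_rate * len(tokens)))  # mask at least one token
    positions = rng.sample(range(len(tokens)), n_mask)
    labels = {p: tokens[p] for p in positions}
    masked = [MASK if p in labels else t for p, t in enumerate(tokens)]
    return masked, labels


original = "ATGGCTGCTAAAGGTTAA"
masked_tokens, labels = mask_codons(original)
```

During training, a BERT-style model sees `masked_tokens` and is optimized to predict the codons stored in `labels` from the surrounding context. No labels beyond the sequences themselves are required, which is what makes the objective self-supervised.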
NUWA is aimed at researchers designing and optimizing mRNA sequences, including teams working on mRNA vaccines and protein-replacement or other therapeutic mRNAs. Its perception capabilities can be used to score or annotate candidate sequences, while its generative mode can propose novel coding sequences, potentially tuned for expression or for other properties learned from the diverse training corpus. These candidates can then be triaged computationally before wet-lab synthesis and testing.
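As one example of computational triage, the sketch below applies basic structural checks that any candidate coding sequence might pass through before model-based scoring. It is a generic pre-filter and is not part of NUWA. The function name `passes_basic_cds_checks` is hypothetical, and the checks assume the standard genetic code with an ATG start codon (some organisms use alternative start codons, which this sketch would reject).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def passes_basic_cds_checks(seq: str) -> bool:
    """Return True if seq looks like a well-formed coding sequence.

    Checks: valid alphabet, in-frame length, ATG start,
    terminal stop codon, and no premature in-frame stop.
    """
    s = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    if not set(s) <= set("ACGT"):
        return False
    if len(s) < 6 or len(s) % 3 != 0:
        return False
    codons = [s[i:i + 3] for i in range(0, len(s), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Reject sequences with an in-frame stop before the final codon.
    return not any(c in STOP_CODONS for c in codons[1:-1])


candidates = [
    "ATGGCTTAA",     # well-formed
    "ATGTAAGCTTAA",  # premature stop codon
    "ATGGCTGC",      # length not a multiple of 3
    "AUGGCUUGA",     # well-formed, RNA alphabet
]
kept = [c for c in candidates if passes_basic_cds_checks(c)]
```

Sequences that survive such structural checks could then be ranked by a model-derived score, which keeps obviously malformed candidates out of the more expensive scoring and synthesis steps.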
NUWA contributes to a growing class of nucleotide foundation models that move beyond genomic DNA toward functionally focused, generation-capable RNA models. By concentrating on coding sequences and supporting both understanding and design in one framework, it offers mRNA-therapeutics researchers a task-aligned alternative to repurposing general genomic language models. Because it is a recent preprint released under a non-commercial license, its broader adoption and independent validation remain to be established.