NUWA

mRNA language foundation model trained on ~115M protein-coding sequences across the tree of life, unifying mRNA perception and generation.

Released: February 2026

NUWA is a large language foundation model for messenger RNA (mRNA), developed by researchers at Kitasato University and released as a bioRxiv preprint in late 2025 (with revisions into early 2026). Most nucleotide language models to date have been trained on genomic DNA or on non-coding/regulatory contexts; NUWA instead focuses specifically on protein-coding sequences, learning the statistical structure of mRNA across a broad sampling of the tree of life. This positions it as a specialized counterpart to general DNA models like the Nucleotide Transformer and to protein language models such as ESM.

The model is designed to be "unified" in the sense that a single pretrained backbone supports both sequence perception (understanding and scoring existing mRNA) and sequence generation (proposing novel coding sequences with desired properties). This dual capability targets a practical bottleneck in mRNA therapeutic and vaccine design, where candidate sequences must be both biologically plausible and optimized for expression.

NUWA's training data is taxonomically broad: it draws coding sequences from bacterial, eukaryotic, and archaeal genomes, which exposes the model to the codon usage and sequence patterns that differ across the domains of life.
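To make "codon usage" concrete, the following minimal Python sketch computes relative codon frequencies for a coding sequence. It is an illustration only and is not taken from the NUWA preprint; the function name `codon_usage` and the toy sequences are assumptions introduced here.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Return the relative frequency of each codon in a coding sequence.

    Hypothetical helper for illustration; not part of NUWA.
    """
    seq = cds.upper().replace("T", "U")  # accept DNA or RNA spelling
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


# Two toy sequences that encode the same peptide (Met-Leu-Leu-stop)
# but use different synonymous codons for leucine.
usage_a = codon_usage("AUGCUGCUGUAA")
usage_b = codon_usage("AUGUUAUUAUAA")
```

Both toy sequences translate to the same peptide, yet their codon frequency profiles differ. Synonymous variation of this kind is what differs between organisms and what a model trained on coding sequences can pick up.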

Key Features

mRNA-specialized pretraining: Trained directly on protein-coding sequences rather than whole genomes, focusing representational capacity on the mRNA design space relevant to therapeutics.
Unified perception and generation: A single backbone supports both scoring and interpreting existing mRNA and generating novel coding sequences.
Cross-kingdom training data: Sequences span bacteria, eukaryotes, and archaea, exposing the model to diverse codon usage and compositional patterns.
Therapeutic orientation: Explicitly motivated by mRNA vaccine and therapeutic design, where sequence optimization affects expression and stability.

Technical Details

NUWA uses a BERT-style transformer architecture trained with self-supervised masked-language modeling on mRNA coding sequences. According to the preprint, the training corpus comprised approximately 80 million sequences from 19,676 bacterial species, 33 million from 4,688 eukaryotic species, and 2.1 million from 702 archaeal species, roughly 115 million coding sequences in total. The model is released under a CC BY-NC-ND license. Because NUWA is a recent preprint, architectural hyperparameters, parameter counts, and the availability of released weights should be confirmed against the latest version of the manuscript. Benchmark comparisons against other nucleotide language models are reported in the paper itself.
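The masked-language-modeling objective mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not NUWA's actual pipeline: codon-level tokenization, the `[MASK]` token string, and the 15% mask rate (the convention from the original BERT) are all assumptions, and the helpers `tokenize_codons` and `mask_tokens` are hypothetical. For simplicity, every selected position is replaced by the mask token.

```python
import random

MASK = "[MASK]"  # assumed mask token; NUWA's vocabulary may differ


def tokenize_codons(cds: str) -> list[str]:
    """Split a coding sequence into codon tokens (assumed tokenization)."""
    seq = cds.upper().replace("T", "U")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Hide a random subset of tokens and return (masked_tokens, labels).

    labels maps each masked position to the original token the model
    would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = tokenize_codons("AUGGCUGCUAAGGAAUUGCUGAAAUAA")
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

During pretraining, a model sees `masked` as input and is penalized for failing to recover the entries of `labels`, which is how it learns sequence statistics without annotated data.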

Applications

NUWA is aimed at researchers designing and optimizing mRNA sequences, including teams working on mRNA vaccines and protein-replacement or other therapeutic mRNAs. Its perception capabilities can be used to score or annotate candidate sequences, while its generative mode can propose novel coding sequences, potentially tuned for expression or for properties learned from the diverse training corpus. Those proposals can then be triaged computationally before wet-lab synthesis and testing.
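One simple form of computational triage is to check that each generated candidate is a well-formed coding sequence that still encodes the intended protein. The sketch below assumes the standard genetic code and uses hypothetical helper names (`translate`, `is_valid_candidate`) and toy candidates; it does not call NUWA or reflect any interface described in the preprint.

```python
# Standard genetic code, codons ordered by base U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence to one-letter amino acids ('*' = stop)."""
    seq = cds.upper().replace("T", "U")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def is_valid_candidate(cds: str, target_protein: str) -> bool:
    """Keep a candidate only if it encodes exactly the target protein
    followed by a single terminal stop codon."""
    try:
        protein = translate(cds)
    except (ValueError, KeyError):  # bad length or non-ACGU characters
        return False
    return protein == target_protein + "*"


# Toy candidates for the target peptide Met-Leu-Lys ("MLK").
candidates = ["AUGCUGAAAUAA", "AUGUUAAAGUGA", "AUGCUGUAAAAA", "AUGCUGAA"]
kept = [c for c in candidates if is_valid_candidate(c, "MLK")]
```

Here the first two candidates survive because they are synonymous encodings of the target, the third is rejected for a premature stop codon, and the fourth for a length that is not a multiple of three. In practice a filter like this would sit in front of model-based scoring rather than replace it.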

Impact

NUWA contributes to a growing class of nucleotide foundation models that move beyond genomic DNA toward functionally focused, generation-capable RNA models. By concentrating on coding sequences and supporting both understanding and design in one framework, it offers mRNA-therapeutics researchers a task-aligned alternative to repurposing general genomic language models. Because it is a recent preprint released under a non-commercial license, its broader adoption and independent validation remain to be established.

Citation

Large mRNA language foundation modeling with NUWA for unified sequence perception and generation

Preprint

Zhong, Y., et al. (2026) Large mRNA language foundation modeling with NUWA for unified sequence perception and generation. bioRxiv.

DOI: 10.1101/2025.11.01.686058

Citations

Total citations: 0
Influential: 0
References: 70

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 16 (Closed)
Usability (can I run it?): 17
Reproducibility (can I retrain it?): 13
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper
