Multimodal reverse-translation language model that generates species-aware mRNA coding sequences from protein sequences, conditioned on host taxonomy.
Pro2RNA addresses the reverse-translation problem: given a target protein sequence, which mRNA coding sequence should encode it? Because the genetic code is degenerate, the number of synonymous coding sequences for a single protein grows exponentially with its length, and the "best" choice depends on the host organism. Codon usage, GC content, and other sequence features that govern translation efficiency and expression differ markedly across species. Pro2RNA reframes this design task as a conditional generative language-modeling problem and makes the host organism an explicit input.
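To make the scale of the degeneracy concrete, the following minimal sketch counts how many coding sequences encode a given protein under the standard genetic code (NCBI translation table 1). It is an illustration only and is not taken from the Pro2RNA preprint; the function name `count_synonymous_cds` and the toy peptide are assumptions introduced here, and codons are written in the DNA alphabet (T rather than U).

```python
from collections import Counter
from itertools import product
from math import log10, prod

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons per amino acid (for example, 6 for leucine).
# A plain dict is used so that unknown residue letters raise KeyError.
DEGENERACY = dict(Counter(CODON_TABLE.values()))


def count_synonymous_cds(protein: str) -> int:
    """Count coding sequences that encode `protein` (stop codon excluded)."""
    return prod(DEGENERACY[aa] for aa in protein.upper())


if __name__ == "__main__":
    toy_protein = "MKLSR"  # hypothetical five-residue peptide
    n = count_synonymous_cds(toy_protein)
    print(f"{toy_protein}: {n} synonymous coding sequences (~10^{log10(n):.1f})")
```

Because the count is a product of per-residue degeneracies, it grows multiplicatively with every added residue, which is why exhaustive enumeration is impractical for full-length proteins.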
Developed by researchers at Kitasato University and collaborators (preprint posted to bioRxiv in March 2026), Pro2RNA is a multimodal model that couples three pretrained components. ESM2 embeds the protein sequence, the scientific-text model SciBERT embeds a taxonomic description of the host (domain, phylum, order, family), and both conditioning signals are fed into a generative RNA language model that emits the coding sequence codon by codon.
By learning species-dependent genetic codes and codon-usage patterns directly from data, Pro2RNA generates host-adapted, natural-like coding sequences without relying on the hand-built frequency tables used by classical codon-optimization tools. It sits alongside generative mRNA models such as those used for vaccine and therapeutic design, but is distinguished by its taxonomy-aware, protein-conditioned formulation.
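For contrast, the frequency-table approach used by classical tools can be sketched in a few lines: look up each amino acid in a per-host table and emit its most frequent codon. This is a deliberately simplified baseline, not a description of any specific tool or of Pro2RNA. The table `TOY_USAGE`, the host names, and every number in it are made up for illustration.

```python
# Hypothetical codon frequencies for two imaginary hosts. The numbers are
# invented for illustration and do not describe any real organism.
TOY_USAGE = {
    "host_a": {
        "M": {"ATG": 1.0},
        "K": {"AAA": 0.75, "AAG": 0.25},
        "L": {"CTG": 0.50, "CTC": 0.10, "CTT": 0.10,
              "CTA": 0.05, "TTA": 0.15, "TTG": 0.10},
    },
    "host_b": {
        "M": {"ATG": 1.0},
        "K": {"AAA": 0.40, "AAG": 0.60},
        "L": {"CTG": 0.10, "CTC": 0.05, "CTT": 0.15,
              "CTA": 0.10, "TTA": 0.25, "TTG": 0.35},
    },
}


def most_frequent_codon_cds(protein: str, usage: dict) -> str:
    """Reverse-translate by picking the most frequent codon per residue."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein.upper())


if __name__ == "__main__":
    for host, table in TOY_USAGE.items():
        print(host, most_frequent_codon_cds("MKL", table))
```

A rule like this ignores sequence context and needs a separate table per host, which is the limitation a learned, taxonomy-conditioned model is meant to address.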
Pro2RNA keeps the ESM2 protein encoder and the SciBERT taxonomy encoder frozen and introduces Low-Rank Adaptation (LoRA) modules so that adaptation remains parameter-efficient. The protein and taxonomy representations are concatenated, projected through an MLP, and used to condition a generative RNA language model that produces the coding sequence one codon at a time. Training uses paired mRNA and protein sequences drawn from eukaryotic and bacterial genomes, allowing the model to internalize species-dependent genetic codes and codon-usage biases. The reported evaluations emphasize the naturalness and host adaptation of generated sequences, measured relative to the source organism's codon distribution.
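One simple way to compare a generated sequence with a reference codon distribution is sketched below: compute codon frequencies and GC content, then measure the total variation distance between two frequency profiles. This is an assumed, generic comparison and not necessarily the metric used in the preprint; all function names are introduced here for illustration.

```python
from collections import Counter


def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into codons, normalizing RNA (U) to DNA (T)."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def codon_frequencies(cds: str) -> dict[str, float]:
    """Relative frequency of each codon observed in the sequence."""
    counts = Counter(split_codons(cds))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc_content(cds: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = cds.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 for identical profiles, 1 for disjoint ones."""
    codons = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in codons)


if __name__ == "__main__":
    generated = "AUGAAACUG"   # hypothetical model output
    reference = "ATGAAGCTG"   # hypothetical host sequence
    distance = total_variation(codon_frequencies(generated),
                               codon_frequencies(reference))
    print(f"GC content: {gc_content(generated):.2f}, TV distance: {distance:.2f}")
```

In practice the reference profile would be estimated from many genes of the host rather than from a single sequence, and a lower distance indicates codon usage closer to the host's.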
Pro2RNA is intended for designing mRNA coding sequences in settings where host context matters: mRNA vaccines and nucleic-acid therapeutics, heterologous and recombinant protein expression, and synthetic-biology constructs that must be tuned for a specific production organism. By conditioning on taxonomy, it offers a single model that can propose codon choices for many hosts, which is useful to molecular biologists and bioprocess engineers who otherwise switch between species-specific optimization heuristics.
Pro2RNA advances the use of language models for sequence design by treating reverse translation as a conditional generation task and by making host taxonomy a first-class input rather than a post-hoc filter. Because the work is currently a preprint under a CC-BY-NC license, with no released code or weights, its near-term impact is primarily methodological: it illustrates how frozen foundation encoders for protein, text, and RNA can be composed for species-aware mRNA design. Independent benchmarking against established codon-optimization tools would help clarify its practical advantages.