bio.rodeo

© 2026 Pulsatance. All rights reserved.
RNA foundation models
RNA, Protein

Pro2RNA

Kitasato University

Multimodal reverse-translation language model that generates species-aware mRNA coding sequences from protein sequences, conditioned on host taxonomy.

Released: March 2026

Pro2RNA addresses the reverse-translation problem: given a target protein sequence, which mRNA coding sequence should encode it? Because the genetic code is degenerate (most amino acids are encoded by more than one codon), the number of synonymous coding sequences for a single protein grows exponentially with its length, and the "best" choice depends on the host organism. Codon usage, GC content, and other sequence features that govern translation efficiency and expression differ markedly across species. Pro2RNA reframes this design task as a conditional generative language-modeling problem and makes the host organism an explicit input.
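To give a sense of the scale of this search space, the short sketch below counts the synonymous coding sequences for a protein under the standard genetic code. It is an illustration only and not part of Pro2RNA; the function names are chosen here for the example, and alternative genetic codes used by some organisms are ignored.

```python
from math import log10, prod

# Number of synonymous codons per amino acid in the standard genetic code
# (stop codons excluded). The 20 values sum to the 61 sense codons.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_cds(protein: str) -> int:
    """Exact number of distinct coding sequences that translate to `protein`."""
    return prod(DEGENERACY[aa] for aa in protein)


def log10_synonymous_cds(protein: str) -> float:
    """Same count on a log10 scale, convenient for long proteins."""
    return sum(log10(DEGENERACY[aa]) for aa in protein)


print(count_synonymous_cds("MLSR"))  # 1 * 6 * 6 * 6 = 216
```

Even this four-residue peptide has 216 candidate coding sequences, and every additional residue multiplies the count by that residue's degeneracy.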

Developed by researchers at Kitasato University and collaborators (preprint posted to bioRxiv in March 2026), Pro2RNA is a multimodal model that couples three pretrained components. Protein sequences are embedded with ESM2, and taxonomic descriptions of the host (domain, phylum, order, family) are embedded with the scientific-text model SciBERT. Both conditioning signals are then fed into a generative RNA language model that emits the coding sequence codon by codon.

By learning species-dependent genetic codes and codon-usage patterns directly from data, Pro2RNA generates host-adapted, natural-like coding sequences without relying on the hand-built frequency tables used by classical codon-optimization tools. It sits alongside generative mRNA models such as those used for vaccine and therapeutic design, but is distinguished by its taxonomy-aware, protein-conditioned formulation.
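For contrast, the classical frequency-table approach can be sketched in a few lines: each residue is mapped independently to the codon that is most frequent in the chosen host. The table below is a hypothetical toy with made-up numbers for two amino acids and two unnamed hosts; it is not real codon-usage data and is not taken from the preprint.

```python
# Hypothetical relative codon frequencies; the numbers are invented for
# illustration and do NOT describe any real organism.
TOY_USAGE = {
    "host_a": {"K": {"AAA": 0.74, "AAG": 0.26}, "F": {"UUU": 0.58, "UUC": 0.42}},
    "host_b": {"K": {"AAA": 0.43, "AAG": 0.57}, "F": {"UUU": 0.46, "UUC": 0.54}},
}


def most_frequent_codon_cds(protein: str, usage: dict) -> str:
    """Classical baseline: choose each residue's most frequent codon,
    independently of every other position in the sequence."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)


print(most_frequent_codon_cds("KFK", TOY_USAGE["host_a"]))  # AAAUUUAAA
print(most_frequent_codon_cds("KFK", TOY_USAGE["host_b"]))  # AAGUUCAAG
```

The same protein yields different coding sequences per host, but only through a lookup table that someone has to build and maintain for each organism, which is the dependency a learned, taxonomy-conditioned model aims to remove.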

#Key Features

  • Reverse translation as language modeling: Generates full mRNA coding sequences from a protein sequence at codon-level resolution, rather than independently selecting codons position by position.
  • Taxonomy conditioning: Encodes host organism taxonomy with SciBERT so the same protein can be optimized for expression in different species.
  • Multimodal fusion: Combines a frozen ESM2 protein encoder, a frozen SciBERT taxonomy encoder, and a generative RNA language model, with lightweight LoRA adapters for efficient task-specific tuning.
  • Cross-kingdom training: Trained on mRNA-protein pairs from both eukaryotic and bacterial datasets, capturing species-specific codon preferences across kingdoms.
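The difference between codon-by-codon generation and independent per-position selection can be illustrated with a toy greedy decoder. In the sketch below, `toy_score` is a hypothetical stand-in for a learned conditional score, and `gc_preference` is an invented scalar that plays the role of host conditioning. Restricting candidates to codons synonymous with the target residue is also an assumption made here so that the output always translates back to the input; the source does not state how Pro2RNA enforces this.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; "*" marks stops.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(codons):
    """Translate a list of codons back into a protein string."""
    return "".join(CODON_TO_AA[c] for c in codons)


def toy_score(prefix, candidate, gc_preference):
    """Hypothetical stand-in for a model score. It depends on the already
    generated prefix (penalizing an immediate codon repeat) and on a
    host-like preference for or against G/C bases."""
    gc_count = sum(base in "GC" for base in candidate)
    repeat_penalty = 1.0 if prefix and prefix[-1] == candidate else 0.0
    return gc_preference * gc_count - repeat_penalty


def generate_cds(protein, gc_preference):
    """Greedy autoregressive decoding over synonymous codons only."""
    codons = []
    for aa in protein:
        best = max(AA_TO_CODONS[aa],
                   key=lambda c: toy_score(codons, c, gc_preference))
        codons.append(best)
    return codons


protein = "MKKLF"
for gc_preference in (1.0, -1.0):
    codons = generate_cds(protein, gc_preference)
    assert translate(codons) == protein  # output is always synonymous
    print(gc_preference, "".join(codons))
```

Because the score sees the prefix, the choice at one position can change the choice at the next, which a per-position frequency lookup cannot express. A real model would typically sample from a learned distribution rather than take a greedy maximum over a hand-written score.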

#Technical Details

Pro2RNA keeps the ESM2 protein encoder and SciBERT taxonomy encoder frozen, introducing Low-Rank Adaptation (LoRA) modules for parameter-efficient adaptation. The protein and taxonomy representations are concatenated, projected through an MLP, and used to condition a generative RNA language model that produces the coding sequence one codon at a time. Training uses paired mRNA and protein sequences drawn from eukaryotic and bacterial genomes, allowing the model to internalize species-dependent genetic codes and codon-usage biases. The reported evaluations emphasize the naturalness and host-adaptation of generated sequences relative to the source organism's codon distribution.
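The fusion step described above (concatenate, then project through an MLP) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the source does not confirm: the embeddings are treated as fixed-length pooled vectors, the MLP has two layers with a ReLU, all dimensions are toy values, and random vectors stand in for the ESM2 and SciBERT outputs. How the resulting conditioning vector enters the RNA language model is not specified in this summary and is not modeled here.

```python
import random


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (toy initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def fuse(protein_emb, taxonomy_emb, layer1, layer2):
    """Concatenate both embeddings and project them with a 2-layer MLP."""
    joint = protein_emb + taxonomy_emb  # list concatenation
    hidden = relu(linear(joint, *layer1))
    return linear(hidden, *layer2)


rng = random.Random(0)
PROTEIN_DIM, TAXONOMY_DIM, HIDDEN_DIM, COND_DIM = 8, 6, 16, 4  # toy sizes

layer1 = init_layer(PROTEIN_DIM + TAXONOMY_DIM, HIDDEN_DIM, rng)
layer2 = init_layer(HIDDEN_DIM, COND_DIM, rng)

# Random stand-ins for the frozen encoders' outputs (assumed pooled vectors).
protein_emb = [rng.gauss(0, 1) for _ in range(PROTEIN_DIM)]
taxonomy_emb = [rng.gauss(0, 1) for _ in range(TAXONOMY_DIM)]

cond = fuse(protein_emb, taxonomy_emb, layer1, layer2)
print(len(cond))  # 4
```

In this sketch only the MLP weights would be trainable; the two encoders that produce the inputs stay frozen, which keeps the number of task-specific parameters small.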

#Applications

Pro2RNA is intended for designing mRNA coding sequences in settings where host context matters: mRNA vaccines and nucleic-acid therapeutics, heterologous and recombinant protein expression, and synthetic-biology constructs that must be tuned for a specific production organism. By conditioning on taxonomy, it offers a single model that can propose codon choices for many hosts, which is useful to molecular biologists and bioprocess engineers who otherwise switch between species-specific optimization heuristics.

#Impact

Pro2RNA advances the use of language models for sequence design by treating reverse translation as a conditional generation task and by making host taxonomy a first-class input rather than a post-hoc filter. Because it is a preprint with no released code or weights, and is distributed under a CC-BY-NC license, its near-term impact is primarily methodological: it illustrates how pretrained foundation models for protein, text, and RNA can be composed for species-aware mRNA design. Independent benchmarking against established codon-optimization tools will help clarify its practical advantages.

Tags

mrna_design, codon_optimization, sequence_generation, transformer, language_model, multimodal, mrna, codon_usage