bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance

mRNA-GPT

Kitasato University / University of Tokyo / National Institute of Advanced Industrial Science and Technology

GPT-2-style generative foundation model pretrained on ~165M mRNA coding sequences across all three domains of life for de novo CDS generation and design.

Released: December 2025
Parameters: 302 Million

mRNA-GPT is a generative language foundation model for messenger RNA coding sequences (CDS). It was developed by researchers at Kitasato University with collaborators at the University of Tokyo and Japan's National Institute of Advanced Industrial Science and Technology (AIST), and was posted to bioRxiv in December 2025. Protein and DNA language models have proliferated, but generative language models aimed specifically at de novo mRNA sequence design remain comparatively underexplored. mRNA-GPT is positioned as the first large generative language model built explicitly for mRNA coding-sequence generation and design.

This entry describes the Kitasato/Saito-group model and is distinct from the similarly named Sanofi-affiliated "mRNA-GPT" (a separate full-length 5' UTR–CDS–3' UTR design model). The Kitasato mRNA-GPT instead focuses on the coding sequence and is trained in a domain-aware fashion across bacteria, eukaryotes, and archaea, so that a single architecture captures lineage-specific regularities such as codon-usage regimes and compositional preferences.

The model is intended as a reusable backbone: after large-scale autoregressive pretraining, it can be steered toward concrete therapeutic and research objectives through lightweight fine-tuning on property-labeled data, rather than relying on fixed codon-optimization tables.
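For contrast, the following is a minimal sketch of what a fixed codon-optimization table does. It illustrates the baseline approach only and is not code from the mRNA-GPT repository. The reference sequences are made-up toy strings, and the function names (`build_fixed_table`, `back_translate`) are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(cds):
    """Split a CDS (DNA alphabet) into codons; length must be a multiple of 3."""
    if len(cds) % 3:
        raise ValueError("CDS length is not a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def build_fixed_table(reference_cds):
    """Map each amino acid to its most frequent codon in the reference set."""
    counts = Counter(codon for cds in reference_cds for codon in split_codons(cds))
    table = {}
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        # Keep the first-seen codon on ties; replace only on a strictly higher count.
        if aa not in table or n > counts[table[aa]]:
            table[aa] = codon
    return table


def back_translate(protein, table):
    """Encode a protein by looking up one fixed codon per amino acid."""
    return "".join(table[aa] for aa in protein)


# Toy reference sequences (invented for illustration, not real genes).
reference = ["ATGGCTGCTAAATAA", "ATGGCCGCTAAGAAATGA"]
table = build_fixed_table(reference)
print(back_translate("MAK", table))
```

A table like this assigns every alanine the same codon regardless of its neighbors. An autoregressive model instead conditions each choice on the sequence generated so far, which is the distinction the paragraph above draws.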

#Key Features

  • Three-domain pretraining: Trained on coding sequences spanning bacteria, eukaryotes, and archaea, giving the model broad coverage of natural mRNA diversity across the tree of life.
  • Domain-aware specialist models: Separate bacteria, eukaryote, and archaea models are pretrained so each encodes lineage-specific codon usage and composition rather than averaging across organisms.
  • Task-conditioned generation: Lightweight fine-tuning on property-relevant datasets steers generation toward objectives such as translation efficiency, mRNA stability, or expression while retaining naturalness.
  • Reinforcement-learning optimization: The framework supports Proximal Policy Optimization (PPO) with oracle-based reward signals to iteratively optimize target properties such as half-life and translation efficiency.
  • Public training and inference code: A GitHub repository releases the training, supervised fine-tuning, PPO, and inference/preprocessing code, though it carries no LICENSE file and is therefore effectively all-rights-reserved.

#Technical Details

mRNA-GPT uses a GPT-2-style decoder-only transformer (reported at roughly 24 layers) with approximately 302 million parameters, trained autoregressively on mRNA coding sequences collected and preprocessed from NCBI. Pretraining is split across the three domains of life: ~80 million bacterial CDS from 19,676 species, ~83 million eukaryotic CDS from 4,688 species, and ~2 million archaeal CDS from 702 species, roughly 165 million coding sequences in aggregate. Training each domain model separately lets the series encode domain-specific codon and compositional structure.
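The corpus figures above can be tallied with a few lines of arithmetic. This sketch uses only the rounded counts quoted in this entry, so the derived shares and per-species averages are approximate. The dictionary layout and the `domain_summary` helper are illustrative choices.

```python
# Approximate counts as quoted in the text above (rounded in the source).
corpus = {
    "bacteria":   {"cds": 80_000_000, "species": 19_676},
    "eukaryotes": {"cds": 83_000_000, "species": 4_688},
    "archaea":    {"cds": 2_000_000,  "species": 702},
}

total_cds = sum(d["cds"] for d in corpus.values())
total_species = sum(d["species"] for d in corpus.values())


def domain_summary(name):
    """Return (share of all CDS, mean CDS per species) for one domain."""
    d = corpus[name]
    return d["cds"] / total_cds, d["cds"] / d["species"]


for name in corpus:
    share, per_species = domain_summary(name)
    print(f"{name:<10} {share:6.1%} of corpus, ~{per_species:,.0f} CDS per species")
print(f"total: {total_cds:,} CDS across {total_species:,} species")
```

On these numbers archaea contribute only about 1% of the pooled sequences. That imbalance is one plausible motivation for training per-domain models, although this is an interpretation rather than a rationale quoted from the preprint.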

For downstream design, the pretrained model is fine-tuned on property-specific datasets. The preprint reports that fine-tuned mRNA-GPT generates sequences with significantly higher translation-efficiency scores, and that additional fine-tuning on mRNA-stability and mRNA-expression datasets likewise yields high-performing designed sequences. An optional PPO-based reinforcement-learning stage allows direct property optimization.
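To make the PPO stage concrete, here is a minimal sketch of the standard clipped surrogate objective that gives PPO its name. This entry only establishes that PPO is used with oracle-based rewards. The sequence-level log-probabilities, the mean-reward baseline, and the `clip_eps=0.2` default are assumptions made for illustration, and the numbers are invented.

```python
import math


def ppo_clipped_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over sampled sequences."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(seq) / pi_old(seq)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def advantages_from_rewards(rewards):
    """Assumed baseline: subtract the batch mean oracle reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical oracle scores (e.g. predicted half-life) for three sampled CDS.
rewards = [0.9, 0.4, 0.5]
adv = advantages_from_rewards(rewards)
old_logps = [-120.0, -118.0, -125.0]   # log-probs under the sampling policy
new_logps = [-119.5, -118.2, -125.0]   # log-probs under the updated policy
print(ppo_clipped_objective(new_logps, old_logps, adv))
```

The clipping bounds how far a single update can move the policy from the one that produced the samples. In a design setting this limits how aggressively the generator chases the oracle's score in one step.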

#Applications

mRNA-GPT targets the design of mRNA coding sequences for therapeutic and research use, including vaccines, protein-replacement therapies, and other mRNA-based modalities, where translation efficiency, stability, and expression level are key product attributes. Because the backbone is fine-tunable, groups can adapt it to specialized objectives such as tissue-specific expression or organism-specific stability using their own labeled data, integrating generative design into existing mRNA-engineering pipelines without building a model from scratch.

#Impact

mRNA-GPT contributes a generative, foundation-model approach to a part of the mRNA-design problem space that had been dominated by tabular codon-optimization heuristics and discriminative property predictors. The authors release training and inference code via a public GitHub repository, though it ships without a license and is thus effectively all-rights-reserved.

The preprint states that pretrained models are publicly available, but the openly findable checkpoints do not clearly substantiate this. The only weight-bearing HuggingFace repository located is a GPT-2 config of roughly 124 million parameters (12 layers), which does not match the paper's stated ~302M / 24-layer model, and it ships an empty model card with no license (a companion repository contains only a tokenizer). Open availability of weights matching the 302M model therefore could not be confirmed.

As a December 2025 preprint, its empirical claims await peer review and broader independent benchmarking, but it establishes a clear template for generative mRNA coding-sequence design, complementing related efforts on mRNA representation models (mRNA-FM) and full-length therapeutic mRNA design tools.
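The size mismatch described above can be sanity-checked with back-of-envelope arithmetic. The sketch below counts parameters for a GPT-2-style decoder with tied input and output embeddings. The stock GPT-2 small shape (12 layers, width 768, 50,257-token vocabulary, 1,024 positions) reproduces the ~124M figure. For the 24-layer model, the hidden width of 1,024 and the 70-token vocabulary are assumptions chosen for illustration; neither value is stated in this entry.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions=1024):
    """Parameter count of a GPT-2-style decoder with tied input/output embeddings."""
    d = n_embd
    # Per block: attention (4*d*d weights) + MLP (8*d*d weights),
    # plus 9*d biases and 4*d LayerNorm scale/shift terms.
    per_block = 12 * d * d + 13 * d
    embeddings = (vocab_size + n_positions) * d   # token + learned position tables
    final_norm = 2 * d
    return n_layer * per_block + embeddings + final_norm


# Stock GPT-2 small shape: matches the ~124M checkpoint size noted above.
small = gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257)

# Hypothetical 24-layer shape: width and vocabulary size are assumed, not sourced.
large = gpt2_param_count(n_layer=24, n_embd=1024, vocab_size=70)

print(f"12 layers, width 768:  {small / 1e6:.1f}M parameters")
print(f"24 layers, width 1024: {large / 1e6:.1f}M parameters")
```

Under these assumptions the 24-layer shape lands close to the stated ~302M, while no 12-layer, width-768 configuration can reach it. Layer count alone is therefore enough to tell the two apart when inspecting a checkpoint's config.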

Citation

DOI: 10.64898/2025.12.22.695962

Openness

bio.rodeo openness: 39, Closed (low usability and reproducibility)
Usability (can I run it?): 38
Reproducibility (can I retrain it?): 27
Model Openness Framework: Unclassified (missing required components)

Tags

codon, codon_optimization, foundation_model, generative, gpt, language_model, mrna, mrna_design, sequence_generation, transformer

Resources

GitHub Repository
Research Paper