mRNA-GPT

Kitasato University / University of Tokyo / National Institute of Advanced Industrial Science and Technology

GPT-style generative language model for mRNA coding sequences, pretrained across bacteria, eukaryotes, and archaea for de novo CDS design.

Released: December 2025

Parameters: 302 Million

mRNA-GPT is a generative language foundation model for messenger RNA (mRNA) coding sequences (CDS). It was developed by researchers at Kitasato University with collaborators at the University of Tokyo and Japan's National Institute of Advanced Industrial Science and Technology (AIST), and was posted to bioRxiv in December 2025. Protein and DNA language models have proliferated, but generative language models aimed specifically at de novo mRNA sequence design have remained comparatively underexplored. mRNA-GPT is positioned as the first large generative language model built explicitly for mRNA coding-sequence generation and design.

This entry describes the Kitasato/Saito-group model and is distinct from the similarly named Sanofi-affiliated "mRNA-GPT" (a separate full-length 5' UTR–CDS–3' UTR design model). The Kitasato mRNA-GPT instead focuses on the coding sequence and is trained in a domain-aware fashion across bacteria, eukaryotes, and archaea, so that a single architecture captures lineage-specific regularities such as codon-usage regimes and compositional preferences.
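To make "codon-usage regimes and compositional preferences" concrete, the following minimal sketch computes two such statistics for a single coding sequence: relative codon frequencies and GC content. It is an illustration only and is not taken from the mRNA-GPT repository; the function names `codon_usage` and `gc_content` and the toy sequence are assumptions introduced here.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in one coding sequence.

    Hypothetical helper for illustration; not part of mRNA-GPT.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA alphabets
    if not seq or len(seq) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


def gc_content(cds: str) -> float:
    """Fraction of G and C bases, a simple compositional statistic."""
    seq = cds.upper().replace("U", "T")
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


# Toy sequence (an assumption, not real data): ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
gc = gc_content("ATGGCTGCTGCCTAA")
print(usage)
print(round(gc, 3))
```

Aggregating such profiles over many genes per organism is one common way to compare codon usage between lineages. Whether mRNA-GPT's learned representations track these statistics is a question for the preprint's own analyses.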

The model is intended as a reusable backbone: after large-scale autoregressive pretraining, it can be steered toward concrete therapeutic and research objectives through lightweight fine-tuning on property-labeled data, rather than relying on fixed codon-optimization tables.

Key Features

Three-domain pretraining: Trained on coding sequences spanning bacteria, eukaryotes, and archaea, giving the model broad coverage of natural mRNA diversity across the tree of life.
Domain-aware specialist models: Separate bacteria, eukaryote, and archaea models are pretrained so each encodes lineage-specific codon usage and composition rather than averaging across organisms.
Task-conditioned generation: Lightweight fine-tuning on property-relevant datasets steers generation toward objectives such as translation efficiency, mRNA stability, or expression while retaining naturalness.
Reinforcement-learning optimization: The framework supports Proximal Policy Optimization (PPO) with oracle-based reward signals to iteratively optimize target properties such as half-life and translation efficiency.
Public training and inference code: A GitHub repository releases the training, supervised fine-tuning, PPO, and inference/preprocessing code, though it carries no LICENSE file and is therefore effectively all-rights-reserved.

Technical Details

mRNA-GPT uses a GPT-2-style decoder-only transformer (reported at roughly 24 layers) with approximately 302 million parameters, trained autoregressively on mRNA coding sequences collected and preprocessed from NCBI. Pretraining is split across the three domains of life: ~80 million bacterial CDS from 19,676 species, ~83 million eukaryotic CDS from 4,688 species, and ~2 million archaeal CDS from 702 species, or roughly 165 million coding sequences in aggregate. Training each domain model separately lets the series encode domain-specific codon and compositional structure.
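The corpus figures above can be checked with a few lines of arithmetic. The sketch below uses only the rounded counts quoted in this entry, so the derived shares and per-species averages are approximate; the variable names are illustrative.

```python
# Rounded figures quoted in this entry (CDS counts in millions).
cds_millions = {"bacteria": 80, "eukaryotes": 83, "archaea": 2}
species = {"bacteria": 19_676, "eukaryotes": 4_688, "archaea": 702}

total_cds_millions = sum(cds_millions.values())
total_species = sum(species.values())

# Share of the combined corpus contributed by each domain.
shares = {d: n / total_cds_millions for d, n in cds_millions.items()}

# Approximate mean number of CDS per species in each domain.
cds_per_species = {d: cds_millions[d] * 1_000_000 / species[d] for d in species}

print(total_cds_millions, total_species)
for domain in cds_millions:
    print(f"{domain}: {shares[domain]:.1%} of CDS, "
          f"~{cds_per_species[domain]:,.0f} CDS per species")
```

On these rounded numbers archaea contribute only about 1% of the combined corpus, which illustrates how a single pooled model could be dominated by bacterial and eukaryotic sequences.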

For downstream design, the pretrained model is fine-tuned on property-specific datasets. The preprint reports that fine-tuned mRNA-GPT generates sequences with significantly higher translation-efficiency scores, and that additional fine-tuning on mRNA-stability and mRNA-expression datasets likewise yields high-performing designed sequences. An optional PPO-based reinforcement-learning stage supports direct property optimization.
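For readers unfamiliar with PPO, the sketch below shows the standard clipped surrogate objective applied to per-sequence log-probabilities, with advantages derived from oracle scores. It is a generic illustration, not the repository's implementation: the mean-reward baseline, the `clip_eps=0.2` default, and the example scores are all assumptions made here.

```python
import math


def advantages_from_rewards(rewards: list[float]) -> list[float]:
    """Centre oracle rewards on the batch mean (a simple assumed baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def ppo_clipped_objective(logp_new: list[float], logp_old: list[float],
                          advantages: list[float], clip_eps: float = 0.2) -> float:
    """Mean PPO clipped surrogate objective, which the policy update maximizes."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_new(seq) / pi_old(seq)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any gain from moving the ratio
        # outside the [1 - eps, 1 + eps] trust interval.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Hypothetical oracle scores for three generated sequences.
rewards = [0.9, 0.5, 0.1]
adv = advantages_from_rewards(rewards)
objective = ppo_clipped_objective([-10.0, -12.0, -11.0], [-10.5, -12.0, -10.8], adv)
print(adv, objective)
```

Clipping the probability ratio limits how far one update can move the policy away from the model that generated the samples. This is commonly relied on when the reward comes from an imperfect learned oracle.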

Applications

mRNA-GPT targets the design of mRNA coding sequences for therapeutic and research use (including vaccines, protein-replacement therapies, and other mRNA-based modalities), where translation efficiency, stability, and expression level are key product attributes. Because the backbone is fine-tunable, groups can adapt it to specialized objectives such as tissue-specific expression or organism-specific stability using their own labeled data. This lets them integrate generative design into existing mRNA-engineering pipelines without building a model from scratch.
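When a generated coding sequence is meant to preserve a specific protein, a pipeline would typically include a translation check. The sketch below is one such check, assuming the standard genetic code (NCBI translation table 1); it is not part of mRNA-GPT, the helper names are hypothetical, and organisms that use alternative genetic codes would need a different table.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G at each position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a CDS (DNA or RNA alphabet) into one-letter amino acids."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def encodes_same_protein(candidate: str, reference: str) -> bool:
    """True if two coding sequences differ only by synonymous codons."""
    return translate(candidate) == translate(reference)


print(translate("AUGGCUUAA"))
print(encodes_same_protein("ATGGCTTAA", "ATGGCCTAA"))
```

A check like this is independent of the generator, so it can sit in front of any downstream property oracle and reject designs that change the encoded protein.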

Impact

mRNA-GPT contributes a generative, foundation-model approach to a part of the mRNA-design problem space that had been dominated by tabular codon-optimization heuristics and discriminative property predictors. The authors release training and inference code via a public GitHub repository, though it ships without a license and is thus effectively all-rights-reserved. The preprint states that pretrained models are publicly available, but the openly findable checkpoints do not clearly substantiate this. The only weight-bearing Hugging Face repository located is a GPT-2 config of roughly 124 million parameters (12 layers), which does not match the paper's stated ~302M / 24-layer model, and it ships an empty model card with no license; a companion repository contains only a tokenizer. Open availability of weights matching the 302M model therefore could not be confirmed. As a December 2025 preprint, its empirical claims await peer review and broader independent benchmarking. It nonetheless establishes a clear template for generative mRNA coding-sequence design, complementing related efforts on mRNA representation models (mRNA-FM) and full-length therapeutic mRNA design tools.
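The size mismatch can be sanity-checked by counting parameters for a GPT-2-style configuration. The sketch below uses the usual GPT-2 layout (tied input and output embeddings, learned position embeddings, a 4x MLP). The first call uses the default GPT-2 small settings, which are facts about GPT-2 rather than about the repository in question. The second call's hidden width of 1024, vocabulary of 70 tokens, and context of 1024 positions are assumptions made here, since this entry does not state them.

```python
def gpt2_param_count(n_layer: int, n_embd: int, vocab_size: int,
                     n_positions: int = 1024) -> int:
    """Parameter count for a GPT-2-style decoder with tied embeddings."""
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Per block: attention QKV and output projection (4 * d^2), MLP with a
    # 4x expansion (8 * d^2), plus biases and two LayerNorms (13 * d).
    per_block = 12 * n_embd ** 2 + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_block + final_layer_norm


# Default GPT-2 small settings: 12 layers, width 768, vocabulary 50,257.
small = gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50_257)

# Hypothetical 24-layer configuration; width and vocabulary are assumed.
assumed_24_layer = gpt2_param_count(n_layer=24, n_embd=1024, vocab_size=70)

print(f"{small:,}")
print(f"{assumed_24_layer:,}")
```

Under these assumptions a 12-layer default configuration lands near 124 million parameters, while a 24-layer model of width 1024 with a small vocabulary lands near 300 million. That is consistent with the two sizes discussed above belonging to different architectures rather than to one model described two ways.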

Citation

Large generative mRNA language foundation model for efficient coding sequence generation and design with mRNA-GPT

Bian, B., et al. (2025) Large generative mRNA language foundation model for efficient coding sequence generation and design with mRNA-GPT. bioRxiv.

DOI: 10.64898/2025.12.22.695962

Recent citations

Papers that recently cited this model.

Designing mRNA coding sequence via multimodal reverse translation language modeling with Pro2RNA
Bian Bian, Yiming Zhang, Jichen Zhang, et al.
bioRxiv · Mar 2026
0

Top citations

The most-cited papers that cite this model.

Designing mRNA coding sequence via multimodal reverse translation language modeling with Pro2RNA
Bian Bian, Yiming Zhang, Jichen Zhang, et al.
bioRxiv · Mar 2026
0

Citations

Total Citations: 1

Influential: 0

References: 49

GitHub

Stars: 4

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 8mo ago

Language: Python

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

39 (Closed)

Usability (can I run it?): 38

Reproducibility (can I retrain it?): 27

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository

Research Paper
