bio.rodeo

RNA

ProtmRNA

Fudan University / Shanghai AI Laboratory / Hunan Normal University

A cross-modal transfer-learning model that adapts the ESM-2 650M protein language model to mRNA analysis by swapping amino-acid tokens for codon tokens, and is then applied to eight mRNA benchmarks without retraining the backbone.

Released: May 2026
Parameters: 650 Million

Messenger RNA (mRNA) has become a central modality in modern therapeutics and a rich substrate for computational modeling, yet mRNA-specific foundation models face a chronic shortage of labeled data and the considerable expense of pretraining large language models from scratch. Protein language models, by contrast, have been trained on hundreds of millions of sequences and encode a deep statistical understanding of how biological sequences fold, function, and evolve. ProtmRNA is built on the observation that much of this knowledge is transferable: because coding sequences and the proteins they specify are two views of the same underlying biology, a protein model's representations can be repurposed for the messenger RNA that encodes them.

ProtmRNA, described in a bioRxiv preprint posted on 20 May 2026 by Gang Xu, Xinyu Wu, and Jianpeng Ma (Fudan University, Shanghai AI Laboratory, and Hunan Normal University), implements this idea as a cross-modal transfer-learning recipe. The authors take ESM-2 650M, a 33-layer, 1280-dimensional protein transformer, replace its amino-acid vocabulary with a 78-token codon vocabulary, and continue pretraining at codon resolution. According to the preprint, the reused architecture and inherited protein-derived weights let the model converge using less than half the training compute of a comparable model trained from scratch.

The model's most distinctive claim is generality without task-specific retraining: after codon-level pretraining, ProtmRNA is applied directly to eight downstream mRNA benchmarks — spanning mRNA stability, gene expression, transcript abundance, and SARS-CoV-2 vaccine degradation — without further fine-tuning of the backbone. This positions ProtmRNA alongside codon-resolution models such as CodonFM and generative designers such as mRNA-GPT, but with a transfer-learning emphasis distinct from both.

#Key Features

  • Cross-modal protein-to-mRNA transfer: Initializes from ESM-2 650M protein weights rather than training from scratch, transferring sequence knowledge learned on proteins to messenger RNA and substantially reducing pretraining cost.
  • Codon-level tokenization: Replaces the amino-acid vocabulary with a 78-token codon vocabulary so the model reads coding sequences at codon resolution, preserving synonymous-codon information invisible to amino-acid-only models.
  • Architecture preservation: Retains the full 33-layer, 1280-dimensional ESM-2 backbone, allowing inherited protein weights to be reused with minimal structural modification.
  • Compute-efficient pretraining: Converges at under 50% of the training compute required for an equivalent from-scratch model, a direct benefit of weight reuse.
  • Retraining-free downstream use: Applied to eight mRNA benchmarks without retraining the backbone, which tests the transferred representations as general-purpose mRNA features.
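
To make the codon-level tokenization feature concrete, here is a minimal Python sketch. It is not the authors' tokenizer: it assumes a vocabulary of the 64 standard codons plus five hypothetical special tokens (the composition of ProtmRNA's actual 78-token vocabulary is not described here), and every name in it is illustrative. It shows two synonymous coding sequences that translate to the same protein but produce different codon token IDs.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))

# Assumption: a hypothetical special-token set; the real vocabulary has 78 tokens.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}


def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into codons, accepting DNA (T) or RNA (U)."""
    seq = cds.upper().replace("T", "U")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def encode(cds: str) -> list[int]:
    """Map a coding sequence to codon token IDs with <cls> and <eos> markers."""
    body = [VOCAB.get(codon, VOCAB["<unk>"]) for codon in split_codons(cds)]
    return [VOCAB["<cls>"], *body, VOCAB["<eos>"]]


def translate(cds: str) -> str:
    """Translate codons to one-letter amino acids ('X' for unknown codons)."""
    return "".join(CODON_TO_AA.get(codon, "X") for codon in split_codons(cds))


# Two synonymous sequences: same protein (Met-Ala-Lys), different codons.
seq_a = "AUGGCUAAA"
seq_b = "AUGGCGAAG"
print(translate(seq_a), translate(seq_b))
print(encode(seq_a))
print(encode(seq_b))
```

An amino-acid tokenizer would map both sequences to the same input, which is why synonymous-codon signal is only available to a model that reads at codon resolution.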

#Technical Details

ProtmRNA is a transformer encoder derived from ESM-2 650M, comprising 33 layers and a hidden dimension of 1280 (~650M parameters). The central architectural change is the input vocabulary: the amino-acid tokens are replaced with 78 codon tokens, and the model is pretrained at codon level so its representations are conditioned on the actual coding sequence rather than the translated protein. The remainder of the backbone is preserved so that protein-pretrained weights provide the initialization, which the authors report enables convergence at under half the compute of a from-scratch baseline.
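
As a rough sanity check on the stated size, the following Python sketch estimates the parameter count from the figures above (33 layers, hidden dimension 1280, 78 tokens). It assumes a standard transformer encoder layer with a 4x feed-forward expansion and counts only the large weight matrices, ignoring biases, layer norms, and any output head, so the result is an approximation rather than the exact count of the released checkpoint. The function name and the `ffn_mult` parameter are illustrative.

```python
def approx_encoder_params(layers: int, hidden: int, vocab_size: int,
                          ffn_mult: int = 4) -> tuple[int, int]:
    """Return (total, embedding) weight counts for a plain transformer encoder.

    Assumption: each layer has four hidden x hidden attention projections
    (query, key, value, output) and a two-matrix feed-forward block with
    inner width ffn_mult * hidden. Biases and layer norms are ignored.
    """
    attention = 4 * hidden * hidden
    feed_forward = 2 * hidden * (ffn_mult * hidden)
    embedding = vocab_size * hidden
    total = layers * (attention + feed_forward) + embedding
    return total, embedding


total, embedding = approx_encoder_params(layers=33, hidden=1280, vocab_size=78)
print(f"approximate parameters: {total:,}")        # 648,906,240
print(f"codon embedding table:  {embedding:,}")    # 99,840
print(f"embedding share: {embedding / total:.4%}")
```

Under these assumptions the estimate lands close to the quoted ~650M, and the codon embedding table accounts for only a tiny fraction of the weights. That is consistent with the description of the vocabulary swap as a minimal structural change that leaves almost all inherited protein weights reusable.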

For evaluation, the pretrained model is applied without retraining the backbone to eight downstream mRNA tasks covering mRNA stability, gene expression, transcript abundance, and SARS-CoV-2 mRNA-vaccine degradation prediction. The preprint reports the architecture, the codon-vocabulary construction, the transfer-learning procedure, and the comparative compute savings. Because the work is a bioRxiv preprint (DOI 10.64898/2026.05.19.726141, CC BY-NC-ND 4.0), these results have not yet undergone peer review.
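
One common way to use a pretrained backbone without retraining it is a frozen-feature probe: embed each sequence once, then fit a small head on the fixed features. The sketch below illustrates that general pattern with NumPy on synthetic data. It is an assumption for illustration only; the text above does not specify which downstream protocol the authors use. The "backbone" here is a stand-in lookup table with a toy width of 16 (the real model uses 1280), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 78   # vocabulary size quoted in the text
HIDDEN = 16       # toy width for the sketch; the real hidden size is 1280

# Stand-in for a frozen backbone: a fixed table that is never updated.
FROZEN_TABLE = rng.normal(size=(VOCAB_SIZE, HIDDEN))


def embed(token_ids: np.ndarray) -> np.ndarray:
    """Mean-pool per-token vectors into one fixed-length sequence feature."""
    return FROZEN_TABLE[np.asarray(token_ids)].mean(axis=0)


def fit_ridge(X: np.ndarray, y: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


# Synthetic token sequences and a synthetic linear target (no real data).
sequences = [rng.integers(0, VOCAB_SIZE, size=rng.integers(30, 60))
             for _ in range(200)]
X = np.stack([embed(s) for s in sequences])
w_true = rng.normal(size=HIDDEN)
y = X @ w_true

# Only the head weights w_hat are fitted; FROZEN_TABLE is left untouched.
w_hat = fit_ridge(X, y, lam=1e-8)
print("max abs prediction error:", float(np.max(np.abs(X @ w_hat - y))))
```

In this pattern the expensive model runs once per sequence and only the lightweight head is task-specific, which is what makes retraining-free reuse cheap for groups without pretraining budgets.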

#Applications

ProtmRNA targets mRNA analysis problems where labeled data is scarce and pretraining from scratch is impractical — predicting mRNA stability, gene expression, and transcript abundance, and forecasting degradation of mRNA-vaccine constructs as illustrated by its SARS-CoV-2 benchmark. Because the model reuses an existing protein backbone and is applied without task-specific backbone retraining, it is attractive to RNA-therapeutics and computational-biology groups that want strong codon-level representations without the cost of training a dedicated mRNA foundation model.

#Impact

ProtmRNA contributes a concrete demonstration that protein language models can serve as a starting point for messenger RNA modeling, reframing the data and compute bottleneck in mRNA foundation models as a transfer-learning problem rather than a from-scratch pretraining one. Its broad, retraining-free evaluation across eight benchmarks is its main evidence of generality. Several caveats temper adoption. The work is an unreviewed preprint. The released code carries no stated license, and the pretrained weights are distributed via Google Drive rather than a persistent archive such as Hugging Face or Zenodo, which affects long-term reproducibility and reuse. No formal model card or data card accompanies the repository beyond the README and the preprint.

Citation

Xu, G., Wu, X., and Ma, J. (2026). ProtmRNA: Cross-Modal Knowledge Transfer from Proteins to Messenger RNA. bioRxiv.

DOI: 10.64898/2026.05.19.726141

Openness

Unclassified
Restrictive license on core components

Tags

codon, gene_expression, language_model, mrna, proteomics, rna_stability_prediction, transfer_learning, transformer, variant_effect_prediction, zero_shot

Resources

GitHub Repository
Research Paper