University of Maryland, College Park
A 150M-parameter conditional codon language model that generates species-optimized coding sequences from a protein and its taxonomic lineage.
Most amino acids are specified by more than one codon, so any given protein can be encoded by an astronomical number of synonymous coding sequences. Which codons an organism actually uses is far from arbitrary: synonymous codon choice shapes translation speed, mRNA stability, and ultimately how much protein a cell makes. This matters acutely in biotechnology, where a human gene expressed in E. coli, yeast, or CHO cells often yields little protein unless its sequence is "codon-optimized" for the host. Classical optimization tools lean on simple heuristics, such as matching the host's most frequent codons or maximizing the codon adaptation index, that ignore the rich, context-dependent grammar of natural sequences.
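To make the size of the synonymous design space concrete, the following minimal sketch counts how many coding sequences encode a given protein under the standard genetic code. The function name `count_synonymous` and the example peptide are illustrative only and are not part of CodonTranslator.

```python
from collections import Counter

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))

# Number of codons available for each amino acid ('*' marks stop codons).
DEGENERACY = Counter(CODON_TO_AA.values())


def count_synonymous(protein: str) -> int:
    """Return how many distinct coding sequences encode `protein`."""
    total = 1
    for aa in protein:
        if aa not in DEGENERACY or aa == "*":
            raise ValueError(f"unsupported residue: {aa!r}")
        total *= DEGENERACY[aa]
    return total


# Met has 1 codon, Lys has 2, Leu has 6, so "MKL" has 1 * 2 * 6 encodings.
print(count_synonymous("MKL"))  # -> 12
```

Because the count is a product over residues, it grows exponentially with protein length, which is why exhaustive search over synonymous sequences is impractical for real proteins.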
CodonTranslator, released in November 2025 by researchers at the University of Maryland, College Park, reframes codon optimization as a conditional sequence-generation problem. It is a 150-million-parameter decoder-only Transformer that generates a coding DNA sequence given a target protein and a target species, learning the relationship between protein context, taxonomic lineage, and codon choice directly from natural data. Rather than scoring sequences against a single host's codon table, the model "translates" a protein into the codon usage of whichever organism is requested and emits a design tailored to that organism.
A distinguishing design choice is the model's explicit conditioning on hierarchical species lineages, which lets it generalize codon usage patterns to species with little or no training data — including organisms outside the training set entirely. This positions CodonTranslator as a cross-species generative tool rather than a per-host optimizer. (It should not be confused with CodonFM, the separate codon-resolution language-model family from NVIDIA and the Arc Institute; CodonFM is built primarily for analysis and variant interpretation, whereas CodonTranslator is a conditional generator focused on optimization across and beyond its training species.)
CodonTranslator is a decoder-only Transformer with roughly 150 million parameters — reported as a hidden dimension of 750, 20 layers, 15 attention heads, and an MLP ratio of 3.2 — trained in BF16 with fully-sharded data parallelism. Training used over 62 million CDS–protein pairs drawn from 2,100+ species. The released dataset (CodonTranslator-data) provides representative-only train/validation/test shards (about 36.9M / 374K / 331K rows) de-duplicated by MMseqs clustering in protein space, with splits engineered to eliminate exact-protein leakage and to hold out entire species for testing — so reported generalization reflects genuinely unseen organisms rather than memorized sequences. Precomputed species-conditioning embeddings and a taxonomy database accompany the data. According to the preprint, generated designs meet or exceed existing optimization methods on codon-usage metrics and on predicted mRNA stability; the public model card does not yet tabulate these benchmark numbers directly.
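As a rough sanity check on the reported hyperparameters, the sketch below estimates the parameter count of the Transformer blocks alone. It assumes a textbook block layout (four d×d attention projections plus an MLP with either two or three weight matrices) and ignores biases, normalization layers, token embeddings, and any species-conditioning modules. Which MLP variant the model actually uses is not stated here, so both are shown as assumptions.

```python
def block_params(d_model: int, n_layers: int, mlp_ratio: float, gated: bool) -> int:
    """Approximate weight count of the Transformer blocks only.

    Assumed layout (not confirmed by the source):
      attention: Q, K, V and output projections, each d_model x d_model
      MLP: 2 matrices (plain) or 3 matrices (gated), each d_model x d_ff
    """
    d_ff = round(d_model * mlp_ratio)
    attention = 4 * d_model * d_model
    mlp = (3 if gated else 2) * d_model * d_ff
    return n_layers * (attention + mlp)


# Hyperparameters as reported: hidden size 750, 20 layers, MLP ratio 3.2.
plain = block_params(750, 20, 3.2, gated=False)
gated = block_params(750, 20, 3.2, gated=True)

print(f"plain MLP: {plain / 1e6:.0f}M")  # -> plain MLP: 117M
print(f"gated MLP: {gated / 1e6:.0f}M")  # -> gated MLP: 153M
print(f"head dim:  {750 // 15}")         # -> head dim:  50
```

Under these assumptions the blocks account for about 117M to 153M weights, which is consistent in order of magnitude with the roughly 150M total reported; the remainder (or difference) would come from components this estimate leaves out.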
CodonTranslator targets the everyday problem of expressing a protein of interest in a chosen host. Protein engineers and synthetic biologists can request a coding sequence for a specific organism and receive a species-tailored design without curating codon tables by hand, which is useful for recombinant protein production, metabolic-pathway engineering, and synthetic-gene construction. The lineage conditioning is particularly valuable for non-model and rare organisms where reference codon statistics are sparse, and the protein-context conditioning suits cases where local sequence features influence expression. Because amino-acid identity is enforced during sampling, every output is a valid synonymous design, which lowers the barrier to plugging the tool into existing gene-synthesis pipelines.
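One common way to enforce amino-acid identity during sampling is to restrict each step to the codons that encode the target residue before drawing from the model's distribution. The sketch below illustrates that idea only; it is not the authors' implementation. `toy_logits` is a hypothetical stand-in for a real model's per-codon scores, and `sample_cds` and `translate` are illustrative names.

```python
import math
import random

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))


def toy_logits(prefix, position):
    """Hypothetical model stub: one score per codon, here all equal."""
    return [0.0] * len(CODONS)


def sample_cds(protein, logits_fn=toy_logits, seed=0):
    """Sample one codon per residue, restricted to synonymous codons."""
    rng = random.Random(seed)
    chosen = []
    for position, aa in enumerate(protein):
        logits = logits_fn(chosen, position)
        # Mask: keep only the codon indices that encode the target residue.
        allowed = [k for k, codon in enumerate(CODONS) if CODON_TO_AA[codon] == aa]
        if not allowed or aa == "*":
            raise ValueError(f"unsupported residue: {aa!r}")
        # Softmax over the allowed codons (shifted by the max for stability).
        top = max(logits[k] for k in allowed)
        weights = [math.exp(logits[k] - top) for k in allowed]
        pick = rng.choices(allowed, weights=weights, k=1)[0]
        chosen.append(CODONS[pick])
    return "".join(chosen)


def translate(cds):
    """Translate a coding sequence back to protein with the standard code."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


design = sample_cds("MKLVW")
assert translate(design) == "MKLVW"  # the design is synonymous by construction
```

Because disallowed codons never receive probability mass, every sampled sequence translates back to the input protein regardless of how the underlying scores are distributed.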
By casting codon optimization as conditional generation over the full diversity of life, CodonTranslator moves the field beyond single-host frequency heuristics toward learned, transferable codon grammar. Its open release — inference-ready weights under an MIT license on HuggingFace, a documented training dataset, and accompanying code — makes it readily usable and reproducible by the synthetic-biology community. As a preprint released in late 2025, its real-world expression gains await independent wet-lab validation, and like other coding-sequence models it does not model untranslated regions that also govern expression. Its main contribution is demonstrating that explicit species-lineage conditioning enables a single generative model to optimize codons across, and beyond, the species it was trained on.
Chen, Y., et al. (2025) CodonTranslator: a conditional codon language model for codon optimization across life domains. bioRxiv.
DOI: 10.1101/2025.11.24.690310