Most amino acids are specified by more than one codon, so any given protein can be encoded by an astronomical number of synonymous coding sequences. Which codons an organism actually uses is far from arbitrary: synonymous codon choice shapes translation speed, mRNA stability, and ultimately how much protein a cell makes. This matters acutely in biotechnology, where a human gene expressed in E. coli, yeast, or CHO cells often yields little protein unless its sequence is "codon-optimized" for the host. Classical optimization tools lean on simple heuristics, such as matching the host's most frequent codons or maximizing the codon adaptation index (CAI), that ignore the rich, context-dependent grammar of natural sequences.
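For contrast with the learned approach described next, the classical frequency heuristics can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the two reference sequences are toy inputs invented for the example, the function names are hypothetical, and the 0.5 pseudocount for unseen codons is an assumption of this sketch. Each codon's weight is its count relative to the most-used synonymous codon, and the codon adaptation index is the geometric mean of those weights over a sequence.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def split_codons(cds):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def relative_adaptiveness(reference_cdss, pseudocount=0.5):
    """Weight of each codon = its count / count of the top synonymous codon."""
    counts = Counter(c for cds in reference_cdss for c in split_codons(cds))
    # Unseen codons get a small pseudocount so no weight is exactly zero.
    adjusted = {c: max(counts[c], pseudocount) for c in CODON_TABLE}
    weights = {}
    for codon, aa in CODON_TABLE.items():
        family_max = max(adjusted[c] for c, a in CODON_TABLE.items() if a == aa)
        weights[codon] = adjusted[codon] / family_max
    return weights

def cai(cds, weights):
    """Geometric mean of codon weights, skipping Met, Trp, and stop codons."""
    logs = [math.log(weights[c]) for c in split_codons(cds)
            if CODON_TABLE[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

def most_frequent_design(protein, weights):
    """Position-independent heuristic: always pick the top-weighted codon."""
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in best or weights[codon] > weights[best[aa]]:
            best[aa] = codon
    return "".join(best[aa] for aa in protein)

# Toy reference set (invented for illustration, not real host data).
reference = ["ATGAAACTGGTGAAACTGTAA", "ATGAAGCTGGTTAAACTGTAA"]
weights = relative_adaptiveness(reference)
design = most_frequent_design("MKLV", weights)
print(design, cai(design, weights))
```

Note that the design depends only on the amino acid at each position: the same residue always receives the same codon, which is exactly the context-blindness the paragraph above describes.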

CodonTranslator, released in November 2025 by researchers at the University of Maryland, College Park, reframes codon optimization as a conditional sequence-generation problem. It is a 150-million-parameter decoder-only Transformer that generates a coding DNA sequence given a target protein and a target species, learning the relationship between protein context, taxonomic lineage, and codon choice directly from natural data. Rather than scoring sequences against a single host's codon table, the model treats the genetic code of life as a translatable space and emits designs tailored to a requested organism.

A distinguishing design choice is the model's explicit conditioning on hierarchical species lineages, which lets it generalize codon usage patterns to species with little or no training data — including organisms outside the training set entirely. This positions CodonTranslator as a cross-species generative tool rather than a per-host optimizer. (It should not be confused with CodonFM, the separate codon-resolution language-model family from NVIDIA and the Arc Institute; CodonFM is built primarily for analysis and variant interpretation, whereas CodonTranslator is a conditional generator focused on optimization across and beyond its training species.)

Key Features

Conditional codon generation: The model autoregressively generates a coding sequence conditioned jointly on the target protein and the target species, producing synonymous designs that respect the requested organism's codon preferences while preserving the encoded protein.
Hierarchical species-lineage embeddings: Taxonomic lineage is embedded so that codon-usage knowledge transfers along evolutionary relationships, enabling reasonable designs for low-data and even out-of-distribution species beyond the 2,100+ in training.
Protein-context awareness: Conditioning on protein context (via protein representations) lets codon choice depend on the surrounding sequence rather than position-independent frequency tables.
Cross-domain coverage: Trained on coding sequences spanning all domains of life, the model targets prokaryotic and eukaryotic hosts within a single framework.
Controllable sampling: Inference supports temperature, top-k, and top-p (nucleus) sampling with amino-acid enforcement constraints, so generated DNA is guaranteed to translate back to the requested protein.
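The constrained sampling described in the last feature can be illustrated with a small sketch. It is not the released inference code: `toy_logits` is a hypothetical stand-in for the model's per-position codon scores, and the function names and the order in which the filters are applied are assumptions made for illustration. The idea it shows is that at each position the candidate set is first restricted to codons encoding the required amino acid, and temperature, top-k, and top-p are then applied within that set, so the output cannot encode a different protein.

```python
import math
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
CODONS = sorted(CODON_TABLE)

def filter_probs(logits, allowed, temperature=1.0, top_k=None, top_p=None):
    """Softmax over the allowed codons only, then top-k / top-p truncation."""
    scaled = {c: logits[c] / temperature for c in allowed}
    peak = max(scaled.values())
    exps = {c: math.exp(v - peak) for c, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((p / total, c) for c, p in exps.items()), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for p, c in ranked:
            kept.append((p, c))
            mass += p
            if mass >= top_p:  # smallest prefix reaching the nucleus mass
                break
        ranked = kept
    norm = sum(p for p, _ in ranked)
    return {c: p / norm for p, c in ranked}

def sample_cds(protein, logits_fn, rng, **sampling):
    """Generate one codon per residue, left to right, under the constraint."""
    out = []
    for pos, aa in enumerate(protein):
        allowed = [c for c in CODONS if CODON_TABLE[c] == aa]
        if not allowed:
            raise ValueError(f"no codon encodes residue {aa!r}")
        probs = filter_probs(logits_fn(protein, out, pos), allowed, **sampling)
        choices, weights = zip(*probs.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(out)

def toy_logits(protein, prefix, pos):
    """Hypothetical stand-in for a model: deterministic pseudo-random scores."""
    scorer = random.Random(pos)
    return {c: scorer.gauss(0.0, 1.0) for c in CODONS}

def translate(cds):
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))

protein = "MKTAYIAKQR"
cds = sample_cds(protein, toy_logits, random.Random(0),
                 temperature=0.8, top_k=3, top_p=0.9)
print(cds, translate(cds) == protein)
```

Because the restriction happens before sampling rather than as a post-hoc filter, no draw is ever rejected, and the guarantee holds for any sampling settings. The sketch omits the stop codon, which a real pipeline would append or generate explicitly.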

Technical Details

CodonTranslator is a decoder-only Transformer with roughly 150 million parameters — reported as a hidden dimension of 750, 20 layers, 15 attention heads, and an MLP ratio of 3.2 — trained in BF16 with fully-sharded data parallelism. Training used over 62 million CDS–protein pairs drawn from 2,100+ species. The released dataset (CodonTranslator-data) provides representative-only train/validation/test shards (about 36.9M / 374K / 331K rows) de-duplicated by MMseqs clustering in protein space, with splits engineered to eliminate exact-protein leakage and to hold out entire species for testing — so reported generalization reflects genuinely unseen organisms rather than memorized sequences. Precomputed species-conditioning embeddings and a taxonomy database accompany the data. According to the preprint, generated designs meet or exceed existing optimization methods on codon-usage metrics and on predicted mRNA stability; the public model card does not yet tabulate these benchmark numbers directly.
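As a rough sanity check on the reported shape, the back-of-the-envelope count below estimates the Transformer block parameters from the stated hidden dimension, depth, and MLP ratio. Several things here are assumptions rather than facts from the source: the MLP type is not stated, so both a gated (three-matrix) and a plain (two-matrix) feed-forward layer are shown, and embeddings, normalization weights, biases, and the conditioning projections are ignored. The function name is hypothetical.

```python
def estimate_block_params(d_model=750, n_layers=20, mlp_ratio=3.2, gated_mlp=True):
    """Approximate parameter count of the Transformer blocks only."""
    d_ff = round(d_model * mlp_ratio)          # feed-forward width: 2400
    attention = 4 * d_model * d_model          # Q, K, V, and output projections
    mlp_matrices = 3 if gated_mlp else 2       # gated MLPs add a third matrix
    mlp = mlp_matrices * d_model * d_ff
    return n_layers * (attention + mlp)

head_dim = 750 // 15                           # 15 heads divide 750 evenly
print(head_dim)                                # 50
print(estimate_block_params(gated_mlp=True))   # 153000000
print(estimate_block_params(gated_mlp=False))  # 117000000
```

The gated variant lands near the reported figure of roughly 150 million, which is consistent with, but does not confirm, a gated feed-forward design; the authoritative count is whatever the released weights contain.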

Applications

CodonTranslator targets the everyday problem of expressing a protein of interest in a chosen host. Protein engineers and synthetic biologists can request a coding sequence for a specific organism and receive a species-tailored design without curating codon tables by hand, which is useful for recombinant protein production, metabolic-pathway engineering, and synthetic-gene construction. The lineage conditioning is particularly valuable for non-model and rare organisms where reference codon statistics are sparse, and the protein-context conditioning suits cases where local sequence features influence expression. Because amino-acid identity is enforced during sampling, every output is a valid synonymous design, which lowers the barrier to plugging the tool into existing gene-synthesis pipelines.
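A pipeline that consumes generated sequences will usually still verify them before ordering synthesis. The check below is a minimal sketch of such a guard, assuming the standard genetic code and an optional trailing stop codon; the function name and the wording of the messages are hypothetical and not part of the released code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def check_synonymous(cds, protein):
    """Return a list of problems; an empty list means the design is valid."""
    cds = cds.upper()
    problems = []
    if set(cds) - set("ACGT"):
        return ["sequence contains characters other than A, C, G, T"]
    if len(cds) % 3:
        problems.append("length is not a multiple of 3")
    usable = len(cds) - len(cds) % 3
    translated = "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))
    if translated.endswith("*"):
        translated = translated[:-1]  # a single terminal stop is acceptable
    if "*" in translated:
        problems.append("internal stop codon")
    if translated != protein:
        problems.append("translation does not match the target protein")
    return problems

print(check_synonymous("ATGAAACTGTAA", "MKL"))  # []
print(check_synonymous("ATGAAGTTATAA", "MKL"))  # []
print(check_synonymous("ATGTAACTG", "MKL"))
```

The first two calls show that different codon choices for the same protein both pass, which is what "synonymous design" means in practice; the third fails because a stop codon replaces the lysine.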

Impact

By casting codon optimization as conditional generation over the full diversity of life, CodonTranslator moves the field beyond single-host frequency heuristics toward learned, transferable codon grammar. Its open release (inference-ready weights under an MIT license on HuggingFace, a documented training dataset, and accompanying code) makes it readily usable and reproducible by the synthetic-biology community. Because the work is a preprint released in late 2025, its real-world expression gains still await independent wet-lab validation, and like other coding-sequence models it does not model the untranslated regions that also govern expression. Its main contribution is demonstrating that explicit species-lineage conditioning enables a single generative model to optimize codons across, and beyond, the species it was trained on.
