NVIDIA / Arc Institute
A family of codon-resolution language models trained on 130 million coding sequences from 22,000 species, revealing a context-dependent codon grammar that governs translation and mRNA stability.
The genetic code maps 64 possible codons to 20 amino acids plus stop signals, and because most amino acids are encoded by multiple synonymous codons, the same protein sequence can be produced by an enormous number of different coding DNA or RNA sequences. For decades, synonymous codons were assumed to be largely interchangeable — a seductively simple view that has been progressively undermined by experimental evidence showing that codon usage influences mRNA stability, translation speed and accuracy, co-translational protein folding, and the regulation of gene expression. Organisms across the tree of life show non-random codon usage biases that reflect evolutionary optimization of these properties, yet the rules governing codon choice have proven difficult to extract from first principles or statistical analysis alone.
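To give a concrete sense of this degeneracy, the short calculation below counts how many distinct coding sequences translate to the same protein under the standard genetic code; the ten-residue peptide is a hypothetical example chosen only for illustration.

```python
# Illustrative only: count the synonymous coding sequences for a short peptide
# using the degeneracy of the standard genetic code.
DEGENERACY = {  # number of synonymous codons per amino acid (standard code)
    "M": 1, "W": 1, "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2,
    "Q": 2, "Y": 2, "I": 3, "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}

def synonymous_encodings(peptide: str) -> int:
    """Number of distinct coding sequences that translate to `peptide`."""
    count = 1
    for aa in peptide:
        count *= DEGENERACY[aa]
    return count

# Even a 10-residue peptide admits tens of thousands of encodings; a typical
# 300-residue protein admits astronomically many.
print(synonymous_encodings("MKTAYIAKQR"))  # 18432
```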
CodonFM is a family of codon-resolution language models developed through a collaboration between NVIDIA's Digital Biology Research Lab and the Arc Institute, announced publicly at NVIDIA GTC in Washington, D.C., in 2025. The models are trained directly on codon sequences rather than amino acid sequences or raw nucleotides, treating each codon as a single token and learning a statistical model of the language of protein-coding sequences. By operating at codon resolution, CodonFM preserves synonymous variation that is invisible to protein language models (which see only amino acids) while learning from more structured, semantically meaningful units than raw nucleotide language models that treat individual bases as tokens.
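A minimal sketch of what codon-level tokenization looks like in practice is shown below; the vocabulary layout and special tokens are assumptions for illustration, not the released CodonFM tokenizer.

```python
# Minimal sketch of codon-level tokenization (not the released CodonFM
# tokenizer): an in-frame CDS is split into 3-nt codons, each mapped to one
# token ID drawn from a 64-codon vocabulary plus a few special tokens.
from itertools import product

BASES = "ACGT"
SPECIALS = ["<pad>", "<mask>", "<cls>", "<eos>"]          # illustrative special tokens
CODON_VOCAB = {tok: i for i, tok in enumerate(SPECIALS)}
CODON_VOCAB.update(
    {"".join(c): len(SPECIALS) + i for i, c in enumerate(product(BASES, repeat=3))}
)

def tokenize_cds(cds: str) -> list:
    """Split an in-frame coding sequence into codon token IDs."""
    cds = cds.upper().replace("U", "T")
    assert len(cds) % 3 == 0, "coding sequence must be a whole number of codons"
    return [CODON_VOCAB[cds[i:i + 3]] for i in range(0, len(cds), 3)]

# Two synonymous CDSs encode the same protein but yield different token
# streams, which is exactly the variation a protein language model cannot see.
print(tokenize_cds("ATGCTGAAA"))  # Met-Leu(CTG)-Lys(AAA)
print(tokenize_cds("ATGCTTAAG"))  # Met-Leu(CTT)-Lys(AAG)
```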
The result is a family of models that reveals context-dependent patterns in codon usage that correlate with translation efficiency, mRNA abundance, and cellular function — providing what the developers describe as a grammar of codon translation. The CodonFM family comprises two complementary architectures — Encodon (bidirectional BERT-style) and Decodon (autoregressive GPT-style) — available in three scales (80M, 600M, and 1B parameters), and is fully open-sourced on GitHub and HuggingFace, with training conducted on NVIDIA infrastructure using the NeMo framework.
Codon-resolution tokenization: Each codon (three nucleotides encoding one amino acid or a stop signal) is treated as a single vocabulary token, giving the model access to synonymous variation that protein language models discard. This tokenization allows CodonFM to model which specific codon is used at each position in a coding sequence, not just which amino acid is encoded.
Dual architecture family — Encodon and Decodon: Encodon uses a bidirectional BERT-style architecture with masked codon modeling pretraining, processing the entire coding sequence simultaneously and capturing both upstream and downstream contextual dependencies. Decodon uses an autoregressive GPT-style architecture trained with causal language modeling, predicting the next codon from all preceding codons and enabling generative sequence design. The two architectures are complementary: Encodon excels at analysis and prediction tasks, Decodon at sequence generation and optimization.
Trained on 130 million coding sequences from 22,000 species: The pretraining corpus was derived from the NCBI RefSeq database, covering protein-coding sequences across the full diversity of life — bacteria, archaea, and eukaryotes. Training across 22,000 species allows CodonFM to capture universal features of codon usage as well as lineage-specific adaptations, providing a cross-species comparative perspective on codon grammar.
Long-context window for full-gene modeling: Encodon supports a context window of 2,046 codon tokens (corresponding to 6,138 nucleotides), sufficient to model the complete coding sequences of most human and microbial genes in a single forward pass. This long-range context enables the model to capture codon-usage dependencies that span entire genes, not just local sequence neighborhoods.
Synonymous variant interpretation: Unlike protein language models that assign the same representation to all synonymous codons encoding the same amino acid, CodonFM distinguishes between synonymous codons by their context-dependent usage patterns. This enables interpretation of synonymous variants — changes that do not alter the encoded protein sequence but may affect mRNA stability, ribosome stalling, or translational regulation — a class of variants largely invisible to existing variant effect predictors (a scoring sketch for such variants follows these feature descriptions).
Three model scales with scale-dependent accuracy: The model family is available at 80M, 600M, and 1B parameters. Larger models more accurately distinguish between synonymous codons that encode the same amino acid, demonstrating that the statistical patterns underlying codon grammar are subtle enough to require significant model capacity to capture reliably.
Application to pathogenic missense variant detection: The fine-tuned 1B-parameter Encodon model achieves competitive performance on pathogenic missense mutation detection benchmarks, demonstrating that codon-level representations encode medically relevant information about which amino acid substitutions are likely to be deleterious. The model extends this capability to synonymous variants, predicting translation-level effects from codon context.
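To make the synonymous variant interpretation described above concrete, the sketch below scores a synonymous change by masking the variant position and comparing the model's probabilities for the reference and alternative codons. The `masked_codon_probs` callable is a stand-in for whatever inference interface the released Encodon checkpoints expose; it is an assumption for illustration, not the official CodonFM API.

```python
# Sketch of scoring a synonymous variant with a masked-codon model such as
# Encodon. `masked_codon_probs` is a placeholder (an assumption, not the
# official API): given a codon sequence and a masked position, it returns
# P(codon | context) over all 64 codons.
import math
from typing import Callable, Sequence

def synonymous_variant_score(
    codons: Sequence[str],          # in-frame reference CDS, as a list of codons
    position: int,                  # 0-based codon index of the variant
    alt_codon: str,                 # synonymous alternative codon
    masked_codon_probs: Callable[[Sequence[str], int], dict],
) -> float:
    """Log-likelihood ratio of the alternative vs. reference codon at a masked position.

    Negative values mean the model finds the alternative codon less plausible in
    this sequence context, flagging the synonymous change for follow-up.
    """
    probs = masked_codon_probs(codons, position)
    ref_codon = codons[position]
    return math.log(probs[alt_codon]) - math.log(probs[ref_codon])
```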
Encodon, the bidirectional member of the CodonFM family, follows a BERT-style architecture: a stack of transformer encoder layers with multi-head self-attention, trained with a masked codon modeling objective. During pretraining, a fraction of codons in each sequence are randomly masked, and the model is trained to predict the masked codons from the remaining context. This objective forces the model to learn the statistical dependencies between codon choices at different positions in a gene — the codon grammar that CodonFM aims to capture. The context window of 2,046 codon tokens (6,138 nucleotides) is substantially longer than earlier codon-aware models, enabling modeling of long-range dependencies across complete coding sequences.
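The sketch below shows how a masked codon modeling batch might be constructed, using the standard BERT masking recipe (15% of positions selected; of those, 80% replaced by a mask token, 10% by a random codon, 10% left unchanged). The fractions and token IDs are assumptions borrowed from BERT, not hyperparameters reported for Encodon.

```python
import torch

MASK_ID, NUM_SPECIALS, VOCAB_SIZE = 1, 4, 68  # 64 codons + 4 special tokens (illustrative IDs)

def mask_codons(token_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (corrupted_inputs, labels) for masked codon modeling."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob
    labels[~selected] = -100                                                  # loss only on masked positions

    use_mask = selected & (torch.rand(token_ids.shape) < 0.8)                 # 80%: mask token
    use_random = selected & ~use_mask & (torch.rand(token_ids.shape) < 0.5)   # 10%: random codon
    corrupted = token_ids.clone()
    corrupted[use_mask] = MASK_ID
    corrupted[use_random] = torch.randint(NUM_SPECIALS, VOCAB_SIZE, (int(use_random.sum()),))
    return corrupted, labels

# Example: one short codon-tokenized gene of 12 codons.
inputs, labels = mask_codons(torch.randint(NUM_SPECIALS, VOCAB_SIZE, (1, 12)))
```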
Decodon, the autoregressive member, uses a GPT-style decoder-only transformer trained with a causal language modeling objective: given all codons preceding position i in the sequence, predict the codon at position i. This left-to-right generation scheme enables Decodon to generate novel coding sequences codon by codon, making it applicable to sequence design tasks such as codon optimization for recombinant protein expression, synthetic gene design, or codon-usage-guided vaccine antigen engineering.
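A sketch of this kind of constrained, codon-by-codon design is shown below: decoding proceeds left to right, but each step is restricted to codons synonymous with the target protein. The `next_codon_logits` callable and the truncated synonym table are assumptions for illustration, not the Decodon inference API.

```python
# Sketch of autoregressive codon optimization in the Decodon style: sample a
# coding sequence one codon at a time, constrained to synonymous choices.
# `next_codon_logits` stands in for the model's actual next-token scores.
import math
import random
from typing import Callable, Sequence

SYNONYMS = {  # truncated codon table, for illustration only
    "M": ["ATG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"],
    "*": ["TAA", "TAG", "TGA"],
}

def design_cds(
    protein: str,
    next_codon_logits: Callable[[Sequence[str]], dict],
    temperature: float = 1.0,
) -> list:
    """Sample a coding sequence for `protein`, one synonymous codon at a time."""
    codons = []
    for aa in protein:
        logits = next_codon_logits(codons)       # model scores over all 64 codons
        allowed = SYNONYMS[aa]                   # constrain to synonymous codons
        weights = [math.exp(logits[c] / temperature) for c in allowed]
        codons.append(random.choices(allowed, weights=weights, k=1)[0])
    return codons
```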
The pretraining corpus was assembled from NCBI RefSeq protein-coding sequences across 22,000 species spanning all three domains of life, totaling over 130 million coding sequences. Data processing included quality filtering, deduplication at the sequence level, and conversion to codon-tokenized format. Training was conducted on NVIDIA GPU infrastructure using the NeMo framework, with matrix operations accelerated by the cuDNN and cuBLAS libraries and data streamed through memory-mapped file I/O. The 1B-parameter Encodon model was trained on hundreds of billions of codon tokens, following scaling trends established in large language model research.
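A minimal sketch of that preprocessing step is shown below; the specific filters (frame check, ambiguity check, exact-sequence deduplication) are illustrative assumptions rather than the exact pipeline applied to RefSeq.

```python
# Illustrative preprocessing in the spirit described above: quality filtering,
# sequence-level deduplication, and conversion to codon-tokenized form.
def preprocess(cds_records: dict) -> dict:
    """Map record ID -> codon list, keeping only clean, unique coding sequences."""
    seen = set()
    out = {}
    for rec_id, seq in cds_records.items():
        seq = seq.upper().replace("U", "T")
        if len(seq) % 3 != 0 or set(seq) - set("ACGT"):   # frame and ambiguity filter
            continue
        if seq in seen:                                    # exact-sequence deduplication
            continue
        seen.add(seq)
        out[rec_id] = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return out

print(preprocess({"gene1": "ATGCTGAAA", "gene2": "ATGCTGAAA", "gene3": "ATGCTNAAA"}))
# Only gene1 survives: gene2 is an exact duplicate, gene3 contains an ambiguous base.
```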
Downstream evaluations demonstrated that Encodon embeddings capture biologically meaningful properties of coding sequences beyond amino acid identity. The model assigns lower probabilities to codons that are rare or disfavored in a given sequence context, and fine-tuning on pathogenic variant datasets yields a model capable of distinguishing disease-associated from benign amino acid substitutions with performance comparable to dedicated variant effect predictors trained at the protein sequence level. Critically, the model also performs non-trivially on synonymous variants, a category for which protein-level models are uninformative by construction.
CodonFM addresses a range of applied problems in RNA biology, synthetic biology, and molecular medicine where codon-level sequence properties matter. Codon optimization is a standard workflow in recombinant protein production and mRNA therapeutics: when a human protein is expressed in bacteria, yeast, or mammalian cells for research or pharmaceutical purposes, the native codon usage may be poorly matched to the host organism's translation machinery, reducing expression levels. Decodon can generate optimized synonymous sequence variants predicted to improve expression in specific host systems based on their codon usage patterns. In mRNA vaccine and therapeutics development, codon optimization of antigen-encoding sequences is a critical manufacturing step; CodonFM's understanding of codon grammar provides a principled, data-driven basis for optimization rather than simple frequency-matching heuristics. For basic research, Encodon embeddings provide a representation of coding sequences that accounts for synonymous variation, enabling comparative analyses of codon usage across genes, organisms, or evolutionary lineages. In clinical genomics, the model's demonstrated ability to interpret synonymous variants opens a new avenue for variant effect prediction: synonymous mutations have historically been undervalued as potential disease mechanisms, but evidence is accumulating that they can disrupt splicing, reduce mRNA stability, alter translation speed, or impair co-translational folding. Encodon's context-dependent codon representations provide a computational tool for prioritizing synonymous variants of potential functional consequence. Collaborators at Therna Biosciences, Greenstone Biosciences, and Moonwalk Biosciences participated in early access testing, indicating industrial interest in therapeutic and research applications.
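For contrast, the snippet below sketches the simple frequency-matching heuristic that model-based optimization is meant to improve on: pick, for every amino acid, whichever codon the host uses most often, ignoring sequence context entirely. The usage table is a tiny hypothetical fragment, not real host data.

```python
# The frequency-matching baseline: choose the host's most frequent codon per
# amino acid, with no notion of sequence context. Frequencies are hypothetical.
HOST_USAGE = {
    "M": {"ATG": 1.00},
    "L": {"CTG": 0.50, "TTA": 0.13, "CTT": 0.10},
    "K": {"AAA": 0.74, "AAG": 0.26},
}

def frequency_match(protein: str) -> str:
    """Naive codon optimization: most frequent host codon for each amino acid."""
    return "".join(max(HOST_USAGE[aa], key=HOST_USAGE[aa].get) for aa in protein)

print(frequency_match("MLK"))  # ATGCTGAAA
```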
CodonFM establishes a new level of resolution for biological sequence language models by operating at the codon — the natural unit of translation — rather than at amino acids or raw nucleotides. The demonstration that larger CodonFM models more accurately distinguish synonymous codons confirms that codon grammar is a real, learnable statistical property of coding sequences, and that it is complex enough to benefit from large-scale modeling. This supports and deepens the biological hypothesis that codon usage is not random but is shaped by evolutionary selection for translational properties, providing a computational tool to study this phenomenon systematically across the tree of life. The open release of models at three scales on HuggingFace, with code on GitHub and integration into the CZI Virtual Cell Platform, makes CodonFM broadly accessible to researchers in computational biology, synthetic biology, and genomic medicine. The collaboration between NVIDIA's engineering expertise (NeMo training infrastructure, GPU optimization) and Arc Institute's biological domain knowledge represents a productive model for large-scale biological foundation model development. A key limitation is that pretraining is restricted to protein-coding sequences, meaning untranslated regions (UTRs), which play important roles in mRNA stability and translational regulation, are not currently modeled. Additionally, codon usage patterns vary substantially across cell types, growth conditions, and stress responses in ways that are not captured by species-level sequence statistics; cell-type-specific codon usage models may be needed for precision applications in specific biological contexts. The preprint describing CodonFM in detail is available from NVIDIA Research, and the models are actively being used by early-access collaborators to explore applications in mRNA therapeutics design and variant interpretation.