CaLM (Codon adaptation Language Model) is a BERT-style language model developed by Carlos Outeiral and Charlotte M. Deane at the University of Oxford that learns from protein-coding DNA sequences at the codon level rather than from amino acid sequences. This single architectural choice unlocks biological information that is erased by translation: codon usage bias, synonymous codon preferences, organism-specific adaptation, and genomic signals linked to translational efficiency and mRNA stability. Published in Nature Machine Intelligence in 2024, CaLM demonstrates that the choice of input representation can matter more than model scale.
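To make the representational difference concrete, the following is a minimal sketch of codon-level tokenization; the function and vocabulary are illustrative assumptions, not CaLM's actual tokenizer. Translating the same sequence to amino acids would collapse every group of synonymous codons into a single residue, which is exactly the signal codon-level input retains.

```python
# Minimal sketch of codon-level tokenization (illustrative only; the
# function and vocabulary are hypothetical, not CaLM's actual code).
from itertools import product

# The 64 codons plus a few special tokens form the core vocabulary.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
VOCAB = {tok: i for i, tok in enumerate(["<cls>", "<eos>", "<pad>", "<mask>"] + CODONS)}

def tokenize_cds(cds: str) -> list[int]:
    """Split a coding sequence into codon tokens. Translation to amino
    acids would merge synonymous codons (e.g. GCT/GCC/GCA/GCG -> Ala)
    and discard the usage signal a codon model learns from."""
    assert len(cds) % 3 == 0, "coding sequences are read in triplets"
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [VOCAB["<cls>"]] + [VOCAB[c] for c in codons] + [VOCAB["<eos>"]]

print(tokenize_cds("ATGGCTGCC"))  # ATG followed by two synonymous alanine codons
```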
At just 86 million parameters, CaLM outperforms protein language models with over 4 billion parameters on several downstream tasks, including protein abundance prediction, transcript abundance prediction, species recognition, and melting point estimation. This result challenges the assumption that larger amino acid-based models are the default best choice for all protein engineering applications and opens a complementary avenue of codon-aware representation learning.
CaLM uses a 12-layer transformer encoder architecture with 12 attention heads, a hidden dimension of 768, an intermediate feedforward size of 3,072, and rotary positional embeddings. The maximum input length is 1,024 codon tokens, corresponding to sequences of up to 3,072 nucleotides. The model was trained for 14 epochs using the AdamW optimizer with a learning rate of 1e-4 and a cosine decay schedule with 1,000 warmup steps, using a batch size of 1,000 sequences on four NVIDIA Quadro RTX 4000 GPUs (8 GB each).
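As a sanity check on these numbers, a back-of-the-envelope parameter count can be derived from the stated hyperparameters alone. The arithmetic below ignores biases, layer norms, and the output head, and the vocabulary size is an assumption (64 codons plus a handful of special tokens).

```python
# Rough parameter count from the stated hyperparameters (illustrative
# arithmetic; biases, layer norms, and the LM head are omitted).
n_layers, d_model, d_ff = 12, 768, 3072
vocab = 64 + 4                       # 64 codons + special tokens (assumed)

attn = 4 * d_model * d_model         # Q, K, V, and output projections
ffn = 2 * d_model * d_ff             # two feed-forward projections
total = n_layers * (attn + ffn) + vocab * d_model
print(f"{total / 1e6:.1f}M parameters")  # ~85.0M, consistent with the reported 86M
```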
The training corpus was constructed from the European Nucleotide Archive by filtering for sequences that begin with ATG, contain no internal stop codons, contain no unknown nucleotides, and have a length divisible by three. After these quality filters, 9.86 million sequences remained from the original 114 million. CD-HIT was then applied at 40% sequence identity to reduce redundancy before pre-training.
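A sketch of these quality filters, assuming the standard stop codons and an A/C/G/T alphabet, might look like the following; the function name is hypothetical, and the published pipeline may differ in details such as how terminal stop codons are handled.

```python
# Sketch of the described quality filters (function name hypothetical;
# the paper's actual pipeline may differ in details).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def passes_filters(seq: str) -> bool:
    """Keep sequences that start with ATG, contain only A/C/G/T,
    have a length divisible by three, and have no internal stop codon."""
    if len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    if set(seq) - set("ACGT"):        # unknown nucleotides (e.g. N)
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # An internal stop disqualifies; a terminal stop codon is fine.
    return not any(c in STOP_CODONS for c in codons[:-1])

assert passes_filters("ATGGCTTAA")         # valid CDS with terminal stop
assert not passes_filters("ATGTAAGCTTAA")  # internal stop codon
```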
CaLM is particularly well suited to protein engineering tasks where expression, production efficiency, or host organism context matter. Researchers use it for codon optimization, that is, selecting among synonymous codons to maximize expression in a target host, and for predicting steady-state protein and transcript abundance without wet-lab experiments. Its embeddings support melting point prediction for thermostability-guided design and species-of-origin classification for metagenomic and synthetic biology applications. Because CaLM captures signals that translation discards and amino acid models therefore never see, it complements rather than replaces standard protein language models; combining CaLM embeddings with amino acid-level representations is a natural direction for tasks where both genomic context and amino acid chemistry are relevant.
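As an illustration of the embedding workflow, the sketch below fits a simple ridge regressor on pooled sequence embeddings for melting point prediction. The `calm_embed` function is a stand-in (here a random vector, so the example runs), and the sequences and melting points are made-up toy values, not data from the paper; the released model's actual API may differ.

```python
# Hypothetical downstream use: pooled per-sequence embeddings feeding a
# simple regressor for melting point. `calm_embed` is a stand-in for
# whatever embedding call the released model exposes.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def calm_embed(cds: str) -> np.ndarray:
    """Stand-in: returns a random 768-d vector where the real model
    would mean-pool its per-codon hidden states."""
    return rng.standard_normal(768)

# Toy illustrative data: coding sequences with melting points (deg C).
seqs = ["ATGGCTGCC", "ATGGCAGCG", "ATGGCGGCT"]
tm = np.array([52.1, 48.7, 55.3])

X = np.stack([calm_embed(s) for s in seqs])
model = Ridge(alpha=1.0).fit(X, tm)
print(model.predict(calm_embed("ATGGCCGCA")[None, :]))
```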
CaLM's publication in Nature Machine Intelligence established that codon-level language modeling is a productive and underexplored direction in biological AI. Its demonstration that an 86M-parameter model can outperform billion-parameter protein language models on biologically meaningful tasks has prompted renewed interest in input representation design as a lever for model performance. The work is particularly significant for synthetic biology and biomanufacturing communities, where protein expression efficiency is a primary engineering objective. A key limitation is that CaLM operates on single coding sequences and does not model multi-gene or genomic context, and its maximum sequence length of 1,024 codons excludes very long coding sequences. Its performance on structure-centric tasks where amino acid chemistry dominates is also expected to be weaker than on expression-related tasks where codon usage is the primary signal.
Outeiral, C., & Deane, C. M. (2024). Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 6(2), 170-179. https://doi.org/10.1038/s42256-024-00791-0