bio.rodeo
Protein

CaLM

Oxpig

Codon-level BERT model that captures genomic signals invisible to amino acid models, outperforming billion-parameter PLMs with just 86M parameters.

Released: 2024
Parameters: 85,750,000

Overview

CaLM (Codon adaptation Language Model) is a BERT-style language model developed by Carlos Outeiral and Charlotte M. Deane at the University of Oxford that learns from protein-coding DNA sequences at the codon level rather than from amino acid sequences. This single architectural choice unlocks biological information that is erased by translation: codon usage bias, synonymous codon preferences, organism-specific adaptation, and genomic signals linked to translational efficiency and mRNA stability. Published in Nature Machine Intelligence in 2024, CaLM demonstrates that the choice of input representation can matter more than model scale.

At just 86 million parameters, CaLM outperforms protein language models with over 4 billion parameters on several downstream tasks, including protein abundance prediction, transcript abundance prediction, species recognition, and melting point estimation. This result challenges the assumption that larger amino acid-based models are the default best choice for all protein engineering applications and opens a complementary avenue of codon-aware representation learning.

Key Features

  • Codon-level tokenization: Sequences are tokenized as codon triplets rather than amino acids, allowing the model to learn synonymous codon usage patterns, GC content, and regulatory signals encoded in DNA but absent from protein sequences.
  • Exceptional parameter efficiency: With 85.75 million parameters, CaLM surpasses models exceeding 4 billion parameters on multiple tasks, demonstrating that representation choice can substitute for scale.
  • BERT-style masked language modeling: Pre-trained with a 25% masking rate using the standard MLM objective, with 80% mask tokens, 10% random codon substitutions, and 10% unchanged positions.
  • Large curated training corpus: Trained on 9.86 million non-redundant coding DNA sequences filtered from 114 million raw sequences in the European Nucleotide Archive (April 2022 snapshot), with CD-HIT clustering at 40% sequence identity to reduce redundancy.
  • Broad task coverage: Supports sequence classification, sequence regression, token-level prediction, and contact prediction, making it suitable for a wide range of protein engineering workflows.
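The tokenization and masking scheme above can be made concrete with a short sketch. This is a plain-Python illustration, not CaLM's actual tokenizer API (the function names are hypothetical); it shows codon-triplet tokenization and the 25% masking rate with the 80/10/10 corruption split:

```python
import random

def codon_tokenize(cds: str) -> list[str]:
    """Split a coding DNA sequence into codon triplets, CaLM's input unit."""
    assert len(cds) % 3 == 0, "coding sequence length must be divisible by 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def mlm_mask(tokens: list[str], mask_rate: float = 0.25, seed: int = 0):
    """BERT-style MLM corruption at the reported rates: of the selected
    positions, 80% become [MASK], 10% a random codon, 10% stay unchanged."""
    rng = random.Random(seed)
    all_codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        targets[i] = tok  # the model must recover the original codon here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(all_codons)
        # else: leave the codon unchanged (remaining 10%)
    return corrupted, targets
```

Note that synonymous codons (e.g. GCT vs. GCC, both alanine) remain distinct tokens here, which is exactly the information an amino acid tokenizer discards.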

Technical Details

CaLM uses a 12-layer transformer encoder architecture with 12 attention heads, a hidden dimension of 768, an intermediate feedforward size of 3,072, and rotary positional embeddings. The maximum input length is 1,024 codon tokens, corresponding to sequences of up to 3,072 nucleotides. The model was trained for 14 epochs using the AdamW optimizer with a learning rate of 1e-4 and a cosine decay schedule with 1,000 warmup steps, using a batch size of 1,000 sequences on four NVIDIA Quadro RTX 4000 GPUs (8 GB each).
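As a sanity check, the stated hyperparameters can be plugged into a back-of-the-envelope encoder parameter formula. The vocabulary size (64 codons plus a few special tokens) and the bias/layer-norm accounting below are assumptions, so this is an estimate rather than the exact published count:

```python
def approx_param_count(layers=12, d=768, ffn=3072, vocab=70) -> int:
    """Rough parameter count for a BERT-style encoder with rotary positions
    (rotary embeddings add no learned positional parameters)."""
    attn = 4 * d * d + 4 * d          # Q, K, V, output projections + biases
    ffn_p = d * ffn + ffn + ffn * d + d  # up/down projections + biases
    ln = 2 * 2 * d                    # two layer norms (gamma + beta) per layer
    emb = vocab * d                   # codon token embeddings
    return layers * (attn + ffn_p + ln) + emb

approx_param_count()  # ≈ 85.1M, close to the reported 85.75M
```

The small gap to 85.75M is plausibly the MLM output head and special-token bookkeeping, which this sketch omits.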

The training corpus was constructed from the European Nucleotide Archive by filtering for sequences that begin with ATG, contain no internal stop codons, have no unknown nucleotides, and are divisible by three. After these quality filters, 9.86 million sequences remained from the original 114 million. CD-HIT was applied at 40% sequence identity to reduce redundancy before pre-training.
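The quality filters described above are simple to state as code. A minimal sketch (the function name is hypothetical, not from the CaLM codebase):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def passes_cds_filters(cds: str) -> bool:
    """Apply the coding-sequence quality filters described in the text:
    starts with ATG, length divisible by three, no unknown nucleotides,
    and no stop codon before the final position."""
    cds = cds.upper()
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    if set(cds) - set("ACGT"):
        return False  # unknown nucleotides such as N
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return not any(c in STOP_CODONS for c in codons[:-1])
```

Sequences passing these filters would then be clustered (CD-HIT at 40% identity in the published pipeline) before pre-training.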

Applications

CaLM is particularly well suited to protein engineering tasks where expression, production efficiency, or host organism context matter. Researchers use it for codon optimization — selecting among synonymous codons to maximize expression in a target host — and for predicting steady-state protein and transcript abundance without wet-lab experiments. Its embeddings support melting point prediction for thermostability-guided design and species-of-origin classification for metagenomic and synthetic biology applications. Because CaLM captures signals that amino acid models cannot, it complements rather than replaces standard protein language models; combining CaLM embeddings with amino acid-level representations is a natural direction for tasks where both genomic context and amino acid chemistry are relevant.
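A common way to feed such embeddings into a downstream regressor (e.g. for melting point or abundance prediction) is mask-aware mean pooling over the per-codon representations. The pooling choice and array shapes below are illustrative assumptions, not prescribed by the paper:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, pad_mask: np.ndarray) -> np.ndarray:
    """Collapse (batch, codons, dim) per-codon embeddings into one
    fixed-size vector per sequence, ignoring padding positions."""
    m = pad_mask[..., None].astype(float)  # (batch, codons, 1)
    return (token_embeddings * m).sum(axis=1) / m.sum(axis=1)
```

The pooled vectors can then be passed to any standard regressor, optionally concatenated with amino acid-level embeddings when both genomic context and residue chemistry matter.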

Impact

CaLM's publication in Nature Machine Intelligence established that codon-level language modeling is a productive and underexplored direction in biological AI. Its demonstration that an 86M-parameter model can outperform billion-parameter protein language models on biologically meaningful tasks has prompted renewed interest in input representation design as a lever for model performance. The work is particularly significant for synthetic biology and biomanufacturing communities, where protein expression efficiency is a primary engineering objective. A key limitation is that CaLM operates on single coding sequences and does not model multi-gene or genomic context, and its maximum sequence length of 1,024 codons excludes very long coding sequences. Its performance on structure-centric tasks where amino acid chemistry dominates is also expected to be weaker than on expression-related tasks where codon usage is the primary signal.

Citation

Codon language embeddings provide strong signals for use in protein engineering

Outeiral, C., & Deane, C. M. (2024). Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 6(2), 170-179.

DOI: 10.1038/s42256-024-00791-0

Metrics

GitHub

Stars: 54
Forks: 14
Open Issues: 7
Contributors: 1
Last Push: 1y ago
Language: Python
License: BSD-3-Clause

Citations

Total Citations: 40
Influential: 3
References: 74

Tags

protein engineering · BERT · foundation model · masked language model · DNA · codon

Resources

GitHub Repository · Research Paper