A family of transformer-based DNA language models supporting context lengths up to 36,000 bp via BPE tokenization and BigBird sparse attention.
GENA-LM is a family of open-source transformer-based language models developed by AIRI (the Artificial Intelligence Research Institute) in Moscow, trained on raw DNA sequences to serve as general-purpose foundation models for genomics. First described in a June 2023 preprint and published in Nucleic Acids Research in January 2025, GENA-LM addresses a fundamental bottleneck in genomic sequence modeling: most earlier models were constrained to windows of a few hundred base pairs because standard self-attention scales quadratically with sequence length. This limitation is consequential because many biologically relevant signals, such as promoter architecture, splicing decisions, and transcription factor cooperativity, span thousands to tens of thousands of nucleotides.
The family's defining design choice is Byte-Pair Encoding (BPE) tokenization adapted to nucleotide sequences, combined with two complementary architectures for different context-length regimes. Standard BERT-style models handle sequences up to approximately 4,500 base pairs, while BigBird-based variants with sparse attention extend that window to roughly 36,000 base pairs. Because BPE tokens average about 9 base pairs each, a 512-token window, which covers only 512 nucleotides in a character-level model, instead spans roughly 4,500 bp, enough to cover an entire gene locus in many cases.
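To make the arithmetic concrete, the sketch below tokenizes a short sequence with a GENA-LM tokenizer from HuggingFace. The checkpoint id AIRI-Institute/gena-lm-bert-base-t2t follows the naming on the project's HuggingFace pages, and the toy sequence and expected token counts are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch: BPE tokenization of a DNA string with a GENA-LM tokenizer.
# The checkpoint id is assumed from the project's HuggingFace naming scheme;
# substitute whichever GENA-LM checkpoint you actually use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")

seq = "ATGCGTACGTTAGCCGGATATCGATCGGCTAGCTAGGCTA" * 10  # 400 bp toy sequence
tokens = tokenizer.tokenize(seq)

# At ~9 bp per token, 400 bp should compress to on the order of 45 tokens,
# which is how a 512-token window stretches to roughly 4,500 bp of DNA.
print(len(seq), "bp ->", len(tokens), "tokens")
print(tokens[:5])
```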
GENA-LM was benchmarked against leading DNA foundation models, including DNABERT, DNABERT-2, the Nucleotide Transformer, and HyenaDNA, on a broad panel of tasks: transcription factor binding prediction, promoter identification, splice site detection, and chromatin accessibility profiling. Across most benchmarks, GENA-LM models matched or exceeded prior state-of-the-art performance, with the 336M-parameter BERT-large variant outperforming the 2.5-billion-parameter Nucleotide Transformer on an 18-task benchmark suite.
GENA-LM models are pre-trained with a masked language modeling (MLM) objective that masks 15% of BPE tokens. The BERT-base variants follow the standard 12-layer transformer encoder design (110M parameters), while the BERT-large variant scales to 24 layers and 336M parameters. All models use pre-layer normalization and rotary position embeddings in place of the absolute positional encodings of the original BERT, which improves generalization to variable-length sequences. BigBird-based models retain 12 layers (~110M parameters) but replace dense self-attention with a sparse attention pattern that scales linearly with sequence length, supporting 4,096 tokens (~36 kb of DNA) per forward pass.
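The 15% masking rate matches the default behavior of the stock HuggingFace masked-LM collator, which makes the objective easy to reproduce. The sketch below is a minimal illustration under that assumption, not the authors' actual pre-training pipeline; the checkpoint id is again assumed from the project's HuggingFace naming.

```python
# Illustrative sketch of the MLM objective: mask 15% of BPE tokens and
# compute loss only at masked positions. Uses the standard HuggingFace
# collator as a stand-in for the authors' pre-training code.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # mask 15% of tokens, as described for GENA-LM
)

batch = collator([tokenizer("ATGCGTACGTTAGCCGGATATCGATCGGCTAGCTAGG")])
# "labels" keeps the original ids at masked positions and -100 elsewhere,
# so cross-entropy is computed only over the masked tokens.
print(batch["input_ids"].shape, (batch["labels"] != -100).sum().item(), "masked")
```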
Training data combined the T2T v2 human assembly with a multispecies corpus drawn from 32 Arabidopsis thaliana ecotypes, 142 yeast strains, 298 Drosophilid species, and additional organisms from Ensembl release 106, totaling roughly 1 trillion base pairs. Pre-training ran for 1–2 million steps with batch size 256 on 8–16 NVIDIA A100 GPUs. On the 18-task Nucleotide Transformer benchmark suite, gena-lm-bert-large-t2t achieved a mean Matthews Correlation Coefficient (MCC) of 0.707, versus 0.691 for the 2.5B-parameter Nucleotide Transformer, an improvement of roughly 2% with 7.6-fold fewer parameters. On chromatin profiling tasks from the DeepSEA dataset, the BigBird variant reached a ROC AUC of 96.81 on transcription factor binding prediction at 1-kb context, and the BERT-large variant reached a ROC AUC of 92.8 on DNase I hypersensitivity prediction.
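The sparse-attention variant's long-context budget can be exercised directly in a single forward pass. The sketch below embeds a 30 kb toy sequence; the checkpoint id AIRI-Institute/gena-lm-bigbird-base-t2t and the trust_remote_code loading path are assumptions based on how the family is distributed on HuggingFace.

```python
# Hedged sketch: embed a ~30 kb locus with the sparse-attention variant
# in one forward pass. Checkpoint id and loading path are assumed from
# the GENA-LM HuggingFace pages; adjust to the checkpoint you use.
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "AIRI-Institute/gena-lm-bigbird-base-t2t"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)

locus = "ACGT" * 7500  # 30,000 bp toy sequence; ~9 bp/token keeps it under 4,096 tokens
inputs = tokenizer(locus, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    out = model(**inputs)

# Custom remote code may return a ModelOutput or a plain tuple; handle both.
hidden = out.last_hidden_state if hasattr(out, "last_hidden_state") else out[0]
print(inputs["input_ids"].shape, hidden.shape)  # (1, n_tokens), (1, n_tokens, d)
```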
GENA-LM is well-suited for researchers working on regulatory genomics who need a pretrained encoder that handles longer genomic context than older short-context models permit. Immediate applications include transcription factor binding site prediction, promoter and enhancer identification, splice site annotation, and chromatin state inference, all of which benefit from the model's capacity to integrate signals distributed over several kilobases. The model family also serves as a starting point for transfer learning on custom datasets where labeled data are scarce but sequence information is abundant. The publicly hosted web service at dnalm.airi.net provides inference without requiring local GPU resources, which lowers the barrier for wet-lab biologists who want to interrogate specific loci without setting up a deep learning environment. The availability of all model weights on HuggingFace under a CC-BY-NC-ND 4.0 license further supports integration into academic research pipelines.
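For transfer learning, one common pattern is to pool the encoder's token embeddings and train a small task head on top. The sketch below shows one minimal way to do this for a binary task such as promoter identification; the pooling strategy, head design, and the PromoterClassifier name are illustrative choices, not the fine-tuning recipe from the paper.

```python
# Hedged transfer-learning sketch: GENA-LM encoder + small classification
# head for a binary task (e.g. promoter vs. non-promoter). Head design,
# pooling, and names are illustrative; the checkpoint id is assumed.
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "AIRI-Institute/gena-lm-bert-base-t2t"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt, trust_remote_code=True)

class PromoterClassifier(torch.nn.Module):  # hypothetical helper class
    def __init__(self, encoder, n_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = torch.nn.Linear(encoder.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state if hasattr(out, "last_hidden_state") else out[0]
        # Mean-pool token embeddings over real (non-padding) positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)

model = PromoterClassifier(encoder)
batch = tokenizer(["ATGC" * 200], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (1, 2); fine-tune with cross-entropy on labeled loci
```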
GENA-LM is one of a small number of open-source DNA foundation model families that directly address the long-context limitation that constrained earlier models like DNABERT. Its publication in Nucleic Acids Research provides peer-reviewed validation of results initially circulated as a preprint, and the concurrent release of model weights, training code, and fine-tuning examples through HuggingFace and GitHub has enabled adoption across regulatory genomics research groups. The model's efficiency relative to its scale, outperforming the much larger Nucleotide Transformer at a fraction of the parameter count, reflects the impact of domain-appropriate tokenization on genomic tasks. A notable limitation is that the current model family was primarily trained and validated on human sequence, and performance on highly divergent genomes (invertebrates, plants, microbes) degrades with evolutionary distance from the training distribution. The non-commercial CC-BY-NC-ND 4.0 license also restricts industrial applications without separate agreements with AIRI.