A family of transformer-based DNA language models supporting context lengths up to 36,000 bp via BPE tokenization and BigBird sparse attention.
GENA-LM is a family of open-source transformer-based language models developed by AIRI (the Artificial Intelligence Research Institute) in Moscow, trained on raw DNA sequences to serve as general-purpose foundation models for genomics. First described in a June 2023 preprint and published in Nucleic Acids Research in January 2025, GENA-LM addresses a fundamental bottleneck in genomic sequence modeling: most earlier models were constrained to windows of a few hundred base pairs because standard self-attention scales quadratically with sequence length. This limitation is consequential because many biologically relevant signals, such as promoter architecture, splicing decisions, and transcription factor cooperativity, span thousands to tens of thousands of nucleotides.
The family's defining design choice is Byte-Pair Encoding (BPE) tokenization adapted to nucleotide sequences, combined with two complementary architectures for different context-length regimes. Standard BERT-style models handle sequences up to approximately 4,500 base pairs, while BigBird-based variants with sparse attention extend that window to roughly 36,000 base pairs. Because BPE tokens average about 9 base pairs each, a 512-token window, which covers only 512 nucleotides in a character-level model, instead spans roughly 4,500 bp, enough to cover an entire gene locus in many cases.
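To make the arithmetic concrete, the sketch below tokenizes a short sequence with a GENA-LM tokenizer from HuggingFace. The checkpoint id AIRI-Institute/gena-lm-bert-base-t2t follows the naming on the project's HuggingFace pages, and the toy sequence and expected token counts are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch: BPE tokenization of a DNA string with a GENA-LM tokenizer.
# The checkpoint id is assumed from the project's HuggingFace naming scheme;
# substitute whichever GENA-LM checkpoint you actually use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")

seq = "ATGCGTACGTTAGCCGGATATCGATCGGCTAGCTAGGCTA" * 10  # 400 bp toy sequence
tokens = tokenizer.tokenize(seq)

# At ~9 bp per token, 400 bp should compress to on the order of 45 tokens,
# which is how a 512-token window stretches to roughly 4,500 bp of DNA.
print(len(seq), "bp ->", len(tokens), "tokens")
print(tokens[:5])
```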
GENA-LM was benchmarked against leading DNA foundation models, including DNABERT, DNABERT-2, the Nucleotide Transformer, and HyenaDNA, on a broad panel of tasks: transcription factor binding prediction, promoter identification, splice site detection, and chromatin accessibility profiling. Across most benchmarks, GENA-LM models matched or exceeded prior state-of-the-art performance, with the 336M-parameter BERT-large variant outperforming the 2.5-billion-parameter Nucleotide Transformer on an 18-task benchmark suite.
GENA-LM models are pre-trained with a masked language modeling (MLM) objective that masks 15% of BPE tokens. The BERT-base variants follow the standard 12-layer transformer encoder design (110M parameters), while the BERT-large variant scales to 24 layers and 336M parameters. All models use pre-layer normalization and rotary position embeddings in place of the absolute positional encodings of the original BERT, which improves generalization to variable-length sequences. BigBird-based models retain 12 layers (~110M parameters) but replace dense self-attention with a sparse attention pattern that scales linearly with sequence length, supporting 4,096 tokens (~36 kb of DNA) per forward pass.
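The 15% masking rate matches the default behavior of the stock HuggingFace masked-LM collator, which makes the objective easy to reproduce. The sketch below is a minimal illustration under that assumption, not the authors' actual pre-training pipeline; the checkpoint id is again assumed from the project's HuggingFace naming.

```python
# Illustrative sketch of the MLM objective: mask 15% of BPE tokens and
# compute loss only at masked positions. Uses the standard HuggingFace
# collator as a stand-in for the authors' pre-training code.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # mask 15% of tokens, as described for GENA-LM
)

batch = collator([tokenizer("ATGCGTACGTTAGCCGGATATCGATCGGCTAGCTAGG")])
# "labels" keeps the original ids at masked positions and -100 elsewhere,
# so cross-entropy is computed only over the masked tokens.
print(batch["input_ids"].shape, (batch["labels"] != -100).sum().item(), "masked")
```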
Training data combined the T2T v2 human assembly with a multispecies corpus drawn from 32 Arabidopsis thaliana ecotypes, 142 yeast strains, 298 Drosophilid species, and additional organisms from Ensembl release 106, totaling roughly 1 trillion base pairs. Pre-training ran for 1–2 million steps with batch size 256 on 8–16 NVIDIA A100 GPUs. On the 18-task Nucleotide Transformer benchmark suite, gena-lm-bert-large-t2t achieved a mean Matthews Correlation Coefficient (MCC) of 0.707, versus 0.691 for the 2.5B-parameter Nucleotide Transformer, an improvement of roughly 2% with 7.6-fold fewer parameters. On chromatin profiling tasks from the DeepSEA dataset, the BigBird variant reached a ROC AUC of 96.81 on transcription factor binding prediction at 1-kb context, and the BERT-large variant reached a ROC AUC of 92.8 on DNase I hypersensitivity prediction.
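The sparse-attention variant's long-context budget can be exercised directly in a single forward pass. The sketch below embeds a 30 kb toy sequence; the checkpoint id AIRI-Institute/gena-lm-bigbird-base-t2t and the trust_remote_code loading path are assumptions based on how the family is distributed on HuggingFace.

```python
# Hedged sketch: embed a ~30 kb locus with the sparse-attention variant
# in one forward pass. Checkpoint id and loading path are assumed from
# the GENA-LM HuggingFace pages; adjust to the checkpoint you use.
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "AIRI-Institute/gena-lm-bigbird-base-t2t"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True)

locus = "ACGT" * 7500  # 30,000 bp toy sequence; ~9 bp/token keeps it under 4,096 tokens
inputs = tokenizer(locus, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    out = model(**inputs)

# Custom remote code may return a ModelOutput or a plain tuple; handle both.
hidden = out.last_hidden_state if hasattr(out, "last_hidden_state") else out[0]
print(inputs["input_ids"].shape, hidden.shape)  # (1, n_tokens), (1, n_tokens, d)
```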
GENA-LM is well-suited for researchers working on regulatory genomics who need a pretrained encoder that handles longer genomic context than older short-context models permit. Immediate applications include transcription factor binding site prediction, promoter and enhancer identification, splice site annotation, and chromatin state inference, all of which benefit from the model's capacity to integrate signals distributed over several kilobases. The model family also serves as a starting point for transfer learning on custom datasets where labeled data are scarce but sequence information is abundant. The publicly hosted web service at dnalm.airi.net provides inference without requiring local GPU resources, which lowers the barrier for wet-lab biologists who want to interrogate specific loci without setting up a deep learning environment. The availability of all model weights on HuggingFace under a CC-BY-NC-ND 4.0 license further supports integration into academic research pipelines.
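For transfer learning, one common pattern is to pool the encoder's token embeddings and train a small task head on top. The sketch below shows one minimal way to do this for a binary task such as promoter identification; the pooling strategy, head design, and the PromoterClassifier name are illustrative choices, not the fine-tuning recipe from the paper.

```python
# Hedged transfer-learning sketch: GENA-LM encoder + small classification
# head for a binary task (e.g. promoter vs. non-promoter). Head design,
# pooling, and names are illustrative; the checkpoint id is assumed.
import torch
from transformers import AutoTokenizer, AutoModel

ckpt = "AIRI-Institute/gena-lm-bert-base-t2t"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
encoder = AutoModel.from_pretrained(ckpt, trust_remote_code=True)

class PromoterClassifier(torch.nn.Module):  # hypothetical helper class
    def __init__(self, encoder, n_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = torch.nn.Linear(encoder.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state if hasattr(out, "last_hidden_state") else out[0]
        # Mean-pool token embeddings over real (non-padding) positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)

model = PromoterClassifier(encoder)
batch = tokenizer(["ATGC" * 200], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (1, 2); fine-tune with cross-entropy on labeled loci
```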
GENA-LM is one of a small number of open-source DNA foundation model families that directly address the long-context limitation that constrained earlier models like DNABERT. Its publication in Nucleic Acids Research provides peer-reviewed validation of results initially circulated as a preprint, and the concurrent release of model weights, training code, and fine-tuning examples through HuggingFace and GitHub has enabled adoption across regulatory genomics research groups. The model's efficiency relative to its scale, outperforming the much larger Nucleotide Transformer at a fraction of the parameter count, reflects the impact of domain-appropriate tokenization on genomic tasks. A notable limitation is that the current model family was primarily trained and validated on human sequence, and performance on highly divergent genomes (invertebrates, plants, microbes) degrades with evolutionary distance from the training distribution. The non-commercial CC-BY-NC-ND 4.0 license also restricts industrial applications without separate agreements with AIRI.