BERT-based pre-trained model for DNA sequences using k-mer tokenization. Achieves state-of-the-art performance on promoter, splice site, and transcription factor binding prediction.
DNABERT is a pre-trained bidirectional transformer model designed to learn general-purpose representations of DNA sequences. Published in Bioinformatics in 2021 by Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V. Davuluri at Northwestern University and Stony Brook University, it was among the first models to apply the BERT pre-training paradigm directly to genomic sequence data. Prior to DNABERT, most computational genomics tools were task-specific: a separate model was built and trained for each prediction problem. DNABERT introduced a single, transferable model that, once pre-trained on unlabeled genomic DNA, could be fine-tuned with small amounts of labeled data to address a wide range of downstream tasks.
The key innovation in DNABERT is its k-mer tokenization strategy. Rather than treating individual nucleotides as tokens — a vocabulary of only four characters — the model represents DNA as overlapping k-mers (substrings of length k), yielding vocabularies of 64, 256, 1,024, or 4,096 tokens for k=3, 4, 5, or 6, respectively. This richer vocabulary allows the model to capture local sequence context directly at the token level, analogous to how wordpiece tokenization captures morphological structure in natural language. Separate pre-trained checkpoints are released for each k-mer size, and users select the k most appropriate for their target task and sequence length constraints.
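As a concrete illustration of the tokenization, the short helper below converts a raw sequence into the space-separated, overlapping k-mers that DNABERT consumes; the function name and example sequence are illustrative, not part of the released codebase.

```python
def seq_to_kmers(seq: str, k: int = 6) -> str:
    """Convert a DNA sequence into space-separated overlapping k-mers.

    A sequence of length L yields L - k + 1 tokens, each sharing
    k - 1 bases with its neighbor.
    """
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

print(seq_to_kmers("ATCGTACGATCG", k=6))
# ATCGTA TCGTAC CGTACG GTACGA TACGAT ACGATC CGATCG
```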
Pre-training follows the standard masked language modeling (MLM) objective from BERT: k-mer tokens are randomly masked and the model is trained to reconstruct them from bidirectional context. An important modification for genomic sequences is contiguous masking: because adjacent k-mers share k−1 nucleotides, an isolated masked token can be trivially recovered from its immediate neighbors. DNABERT instead masks contiguous spans of k-mers, making the pre-training task genuinely informative. The resulting representations encode both local sequence composition and broader positional context within the genome.
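A simplified sketch of the idea follows: rather than masking single positions, a run of roughly k consecutive k-mer tokens is hidden so that no unmasked neighbor still exposes the central nucleotides. This is an illustrative approximation, not the authors' exact sampling procedure.

```python
import random

def mask_contiguous_spans(tokens, k=6, mask_rate=0.15, mask_token="[MASK]"):
    """Mask contiguous spans of overlapping k-mer tokens (illustrative only).

    Hiding a run of ~k tokens ensures the masked nucleotides cannot simply
    be read off a neighboring, unmasked k-mer.
    """
    tokens = list(tokens)
    target = min(len(tokens), max(k, int(len(tokens) * mask_rate)))
    masked = 0
    while masked < target:
        start = random.randrange(0, max(1, len(tokens) - k))
        for i in range(start, min(start + k, len(tokens))):
            if tokens[i] != mask_token:
                tokens[i] = mask_token
                masked += 1
    return tokens

kmers = "ATCGTA TCGTAC CGTACG GTACGA TACGAT ACGATC CGATCG GATCGT".split()
print(mask_contiguous_spans(kmers, k=3, mask_rate=0.3))
```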
DNABERT follows the BERT-base architecture: 12 transformer encoder layers, 768 hidden dimensions, 12 attention heads, and approximately 110 million parameters. The model is pre-trained on the human reference genome (GRCh38.p13, i.e. hg38), with training sequences drawn by both non-overlapping splitting and random sampling, constrained to lengths between 5 and 510 tokens. Pre-training ran for 120,000 steps with a batch size of 2,000 sequences. The masking schedule progressed from 15% masked k-mers in the first 100,000 steps to 20% in the final 20,000 steps, with a warmup learning rate schedule peaking at 4×10⁻⁴.
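For reference, the configuration below reproduces this BERT-base shape with the Hugging Face transformers API. The vocabulary size is an assumption (4,096 six-mers plus a handful of special tokens) and may differ slightly from the released checkpoints.

```python
from transformers import BertConfig, BertForMaskedLM

# BERT-base shape as described in the DNABERT paper; vocab_size assumes the
# 6-mer setting (4**6 = 4096 sequence tokens plus 5 special tokens) and is an
# approximation, not a value read from the released checkpoint.
config = BertConfig(
    vocab_size=4101,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
```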
On the core benchmark tasks evaluated in the paper, DNABERT outperformed prior CNN-based and RNN-based methods across the board. For promoter identification, it improved classification of both TATA and non-TATA promoters as measured by accuracy, F1, and Matthews correlation coefficient (MCC). For transcription factor binding site prediction across 690 ENCODE ChIP-Seq datasets, DNABERT consistently improved accuracy, precision, recall, F1, MCC, and AUC relative to baselines. For splice site prediction, the model significantly outperformed competing methods on both donor and acceptor site classification in multiclass settings. Pre-trained weights for k=3, 4, 5, and 6 are available on HuggingFace under the zhihan1996/DNA_bert_* identifiers.
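The released checkpoints can be pulled directly from the Hub; the snippet below loads the 6-mer variant and embeds one sequence. It assumes the checkpoint loads through the standard Auto classes (trust_remote_code is passed defensively and may not be required).

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

# The tokenizer expects space-separated overlapping 6-mers.
seq = "ATCGTACGATCGATCGTACG"
kmers = " ".join(seq[i:i + 6] for i in range(len(seq) - 5))
inputs = tokenizer(kmers, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, n_kmers + 2, 768) incl. [CLS]/[SEP]
print(hidden.shape)
```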
DNABERT is well suited for any sequence-based genomic classification or annotation task where labeled training data is scarce. Regulatory genomics researchers use it to identify transcription factor binding sites and characterize promoter architecture across thousands of experiments from public databases such as ENCODE. Computational biologists apply it to splice site prediction for genome annotation and alternative splicing analysis. The model's attention visualization capability supports motif discovery, making it useful for identifying putative regulatory elements in non-coding regions without requiring prior knowledge of binding sequence logos. Because the pre-training corpus is human genomic DNA, DNABERT is particularly effective for human genome applications, though cross-species fine-tuning to mouse has been demonstrated. Researchers working on functional variant prioritization — identifying which variants in non-coding regions are likely to perturb regulatory activity — have used DNABERT's attention scores to flag candidate functional sites.
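A minimal fine-tuning sketch for a binary task such as promoter identification is shown below, assuming the standard AutoModelForSequenceClassification head; the toy sequences and labels are placeholders, not data from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2, trust_remote_code=True
)

def to_kmers(seq: str, k: int = 6) -> str:
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Placeholder examples: label 1 = promoter, 0 = background.
seqs = ["TATAAAAGGCGTACGATCGATCGTACG", "GGCTAGCTAGGCTAGCGCGCGATCGTA"]
labels = torch.tensor([1, 0])

batch = tokenizer([to_kmers(s) for s in seqs], return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(out.loss))
```

For the attention-based motif analysis mentioned above, the same model can return per-head attention maps by passing output_attentions=True to the forward call.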
DNABERT established the foundation model paradigm for DNA sequence analysis, demonstrating that a single pre-trained transformer could transfer across mechanistically distinct genomic prediction tasks — a result that had not been shown convincingly for DNA prior to this work. The paper has accumulated thousands of citations and directly motivated a subsequent generation of genomic language models. The authors' own follow-up, DNABERT-2 (ICLR 2024), replaced k-mer tokenization with byte-pair encoding (BPE), extended pre-training to multi-species genomes, and incorporated attention with linear biases (ALiBi) for improved long-sequence handling. Broader successors including the Nucleotide Transformer and HyenaDNA have extended the approach to larger training corpora and alternative architectures. A key limitation of the original DNABERT is its fixed context window of 512 tokens, which restricts the genomic span the model can process in a single pass — long-range regulatory interactions spanning tens of kilobases fall outside this window. The overlapping k-mer scheme also carries a cost: because adjacent k-mers share k−1 nucleotides, the representation is highly redundant, so the 512-token window spans only on the order of 512 base pairs and masked tokens can be partially inferred from their unmasked neighbors. These limitations have been explicitly addressed in DNABERT-2 and HyenaDNA, but DNABERT remains a well-validated and computationally accessible baseline.