
DNA & Gene Models
Genomic sequence modeling and gene expression analysis
106 models in this category
What DNA & gene foundation models do
DNA and gene foundation models learn the regulatory and functional grammar of genomic sequences, predicting how nucleotide changes propagate through gene regulatory networks to alter expression, splicing, and cellular phenotype. Models like Enformer predict cell-type-specific gene expression tracks from sequence alone, while Evo and the Nucleotide Transformer learn broader representations spanning prokaryotic and eukaryotic genomes. DNABERT applies BERT-style masking to overlapping DNA k-mers, and its successor DNABERT-2 swaps k-mers for byte-pair encoding; both can be fine-tuned for tasks from promoter classification to variant effect prediction.
Common applications and use cases
Variant effect prediction is among the highest-value applications: models like Enformer and Sei can score non-coding variants for regulatory impact, informing GWAS interpretation and rare disease diagnosis. Regulatory element classification — identifying enhancers, promoters, and silencers — and CRISPR guide efficiency scoring are other well-established use cases. Evo's pretraining at genomic scale also supports sequence generation tasks, including the design of novel regulatory elements and protein-coding sequences.
Notable Models
Top-rated DNA & gene models from our evaluations
Sparse attention transformer extending BERT to sequences up to 8x longer via random, local, and global attention patterns, with demonstrated applications in genomic sequence modeling.
Multi-species genomic foundation model replacing k-mer tokenization with BPE, achieving state-of-the-art performance with 21x fewer parameters than prior leading models.
An open autoregressive genomic foundation model (0.5B–8B params) with a 6-mer DNA tokenizer, matching Evo2-7B win rates at far higher throughput.
Transformer model that predicts gene expression and regulatory activity from 200 kb DNA sequences, capturing enhancer-promoter interactions up to 100 kb away.
Bidirectional, reverse-complement equivariant DNA language models built on Mamba SSMs. They outperform models 10x larger on long-range variant effect prediction.
Frequently asked questions
What is a DNA and gene foundation model?
A DNA and gene foundation model is a neural network pretrained on large collections of genomic sequences — DNA or RNA — to learn representations of regulatory syntax, coding potential, and sequence function. These representations transfer to downstream tasks like variant effect prediction, gene expression modeling, and regulatory element classification. Examples include Enformer, Nucleotide Transformer, and Evo.
How do genomic foundation models handle the non-coding genome?
Most genomic foundation models are pretrained on whole-genome sequences that include non-coding regions, meaning they implicitly encode information about regulatory elements, transposons, and intergenic space. Models like Enformer were specifically designed to predict transcription factor binding and chromatin accessibility from non-coding sequence windows, making them well-suited to interpreting GWAS hits in regulatory regions.
What is the difference between DNABERT and Enformer?
DNABERT applies BERT-style masked language modeling to tokenized DNA k-mers, producing general-purpose genomic embeddings useful across many classification and regression tasks. Enformer is trained directly to predict genomic assay tracks (CAGE, ATAC-seq, ChIP-seq) from long input windows of up to 200 kb, which makes it particularly strong at gene expression and regulatory prediction rather than general-purpose sequence representation.
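To make the k-mer and masking idea concrete, here is a minimal Python sketch, not DNABERT's actual preprocessing code. The function names (`kmer_tokenize`, `mask_tokens`), the 15% mask rate, and the `[MASK]` token string are illustrative assumptions. It masks tokens independently for simplicity; because overlapping k-mers share bases with their neighbors, practical pipelines typically mask contiguous runs of tokens so the answer cannot be read off adjacent k-mers.

```python
import random


def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens, BERT-style.

    Returns (masked, labels). labels[i] holds the original token at every
    masked position and None elsewhere, so a training loss would only be
    computed where labels[i] is not None.
    Assumption: tokens are masked independently (a simplification).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            masked[i] = mask_token
    return masked, labels


tokens = kmer_tokenize("ATGCGTACGTTAG", k=6)
masked, labels = mask_tokens(tokens)
```

A 13-base sequence yields 8 overlapping 6-mers. A masked language model would be trained to recover each hidden k-mer from the surrounding tokens.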
Can DNA foundation models predict variant pathogenicity?
Yes, this is one of the most actively developed applications. Models like Enformer, Sei, and Nucleotide Transformer can score the predicted functional impact of single-nucleotide variants by comparing reference and alternate sequence outputs. However, most current models are stronger at regulatory variants in well-characterized tissues than at rare coding variants, and calibration against clinical databases remains an active area of evaluation.
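The reference-versus-alternate comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scoring code of Enformer, Sei, or Nucleotide Transformer: `toy_predict` is a hypothetical stand-in for a trained model (it reports GC fraction and a count of the motif TATA), and `apply_snv` and `variant_effect` are illustrative helper names.

```python
def apply_snv(ref_seq, pos, ref, alt):
    """Return ref_seq with the base at 0-based index pos changed from ref to alt."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: returns two 'tracks'.

    Track 0 is GC fraction, track 1 is the number of TATA motifs.
    A real model would return predicted assay signals instead.
    """
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    tata = float(seq.count("TATA"))
    return [gc, tata]


def variant_effect(ref_seq, pos, ref, alt, predict=toy_predict):
    """Score a single-nucleotide variant as predict(alt) - predict(ref), per track."""
    alt_seq = apply_snv(ref_seq, pos, ref, alt)
    ref_out = predict(ref_seq)
    alt_out = predict(alt_seq)
    return [a - r for r, a in zip(ref_out, alt_out)]


window = "GGCTATAAAGGC"
delta = variant_effect(window, pos=4, ref="A", alt="G")
```

Here the A-to-G change disrupts the only TATA motif in the window, so the second track drops by one while the GC fraction rises slightly. With a real model, `predict` would run the network on each sequence window, and the per-track differences would typically be summarized per tissue or assay.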