A DNA language model for unsupervised genome-wide variant effect prediction, trained on multispecies genomes via masked language modeling without functional annotation labels.
GPN (Genomic Pre-trained Network) is a DNA language model developed by Gonzalo Benegas, Sanjit Singh Batra, and Yun S. Song at UC Berkeley's Song Lab. It was designed to address a fundamental gap in variant effect prediction: the inability to assess the functional impact of genetic variants genome-wide without relying on expensive, cell-type-specific functional genomics data. GPN learns representations of genomic sequence through self-supervised pretraining on raw DNA from multiple species, then uses those representations to score variants as a zero-shot log-likelihood ratio between the alternate and reference allele — no labeled training data required.
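The zero-shot score described above can be sketched in a few lines: mask the variant position, obtain the model's distribution over the four nucleotides at that position, and take the log-ratio of the alternate-allele probability to the reference-allele probability. The "model" below is a deliberately toy stand-in (a smoothed count over the surrounding window), not the actual GPN network; only the scoring arithmetic mirrors the paper's definition.

```python
import math

NUCS = "ACGT"

def toy_masked_model(context: str, pos: int) -> dict:
    # Stand-in for GPN: returns P(nucleotide | context with `pos` masked).
    # A real model would run the masked sequence through the network;
    # here we fake a distribution from the surrounding 5-bp windows.
    window = (context[max(0, pos - 5):pos] + context[pos + 1:pos + 6]) or "A"
    counts = {n: window.count(n) + 1 for n in NUCS}  # add-one smoothing
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

def gpn_score(seq: str, pos: int, ref: str, alt: str) -> float:
    # Zero-shot variant effect score: log P(alt | context) - log P(ref | context).
    # More negative means the alternate allele is less expected by the model,
    # i.e. putatively more deleterious.
    probs = toy_masked_model(seq, pos)
    return math.log(probs[alt]) - math.log(probs[ref])

seq = "ACGTACGTACGTACGT"
score = gpn_score(seq, pos=8, ref=seq[8], alt="C")
```

Swapping `toy_masked_model` for a pretrained masked language model turns this into the genome-wide scorer: no labels enter the computation at any point.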
The core innovation of GPN is its application of masked language modeling, widely used in protein language models, to entire genomes rather than to coding sequences alone. By training on unaligned reference genomes from a focal species and closely related species, GPN captures the evolutionary signatures of constraint that distinguish functionally important positions from neutral variation. The original model was demonstrated on Arabidopsis thaliana and seven related Brassicales species, achieving state-of-the-art performance at predicting the deleteriousness of variants across the full genome — including non-coding regions that make up the vast majority of eukaryotic DNA.
A subsequent extension, GPN-MSA, incorporated whole-genome multiple-sequence alignments and a transformer architecture to scale to the human genome (hg38), achieving top performance across coding and non-coding variant benchmarks. GPN-MSA was published in Nature Biotechnology in early 2025. The GPN framework, including code for training models from any reference genome, is freely available on GitHub.
The original GPN model uses a deep convolutional neural network (ConvNet) with residual connections. It takes a 512-bp DNA sequence as input, processes it through stacked convolutional layers to produce a high-dimensional contextual embedding at each nucleotide position, and outputs a probability distribution over the four nucleotides at each masked position via a final projection layer. During training, 15% of positions are randomly masked; during inference for variant effect prediction, only the variant position is masked. The model is implemented within the HuggingFace Transformers framework, supporting both the ConvNetEncoder and a transformer-based RoFormerEncoder with rotary position embeddings.
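The two masking regimes in the previous paragraph can be illustrated directly. The sketch below uses `"?"` as a stand-in mask symbol (the real model uses a dedicated mask token id from its tokenizer); the point is the contrast between masking a random 15% of positions for training and masking only the variant site for scoring.

```python
import random

MASK = "?"  # stand-in mask symbol; the real model uses a [MASK] token id

def mask_for_training(seq: str, rate: float = 0.15, rng=random) -> tuple[str, list[int]]:
    # Training regime: hide a random ~15% of positions; the model is
    # trained to reconstruct the original nucleotide at each hidden site.
    positions = [i for i in range(len(seq)) if rng.random() < rate]
    masked = "".join(MASK if i in positions else n for i, n in enumerate(seq))
    return masked, positions

def mask_for_inference(seq: str, variant_pos: int) -> str:
    # Scoring regime: mask only the variant site, so the model predicts
    # a nucleotide distribution there from the rest of the 512-bp window.
    return seq[:variant_pos] + MASK + seq[variant_pos + 1:]

rng = random.Random(0)
masked_train, hidden = mask_for_training("ACGT" * 8, rng=rng)
masked_infer = mask_for_inference("ACGT" * 8, variant_pos=5)
```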
GPN-MSA extends this framework to whole-genome multiple-sequence alignments, feeding columns of aligned nucleotides across species into a transformer network. This allows the model to directly encode cross-species conservation at each genomic position. GPN-MSA models for the human genome are available at three phylogenetic scales (vertebrate, mammalian, and primate), each a roughly 200M-parameter model trained on the hg38 reference with Zoonomia-derived alignments. GPN-Star, a further development in the same framework, trains on star-topology alignments and achieves state-of-the-art results across variant effect prediction benchmarks spanning both coding missense variants (ClinVar, COSMIC) and non-coding regulatory variants (OMIM, gnomAD enrichment, splicing effects).
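The "columns of aligned nucleotides" input can be made concrete with a minimal encoding sketch. The integer vocabulary and column layout below are illustrative assumptions, not GPN-MSA's actual tokenization (which is defined in the released code); the sketch only shows how each genomic position becomes one token carrying the alignment column across species.

```python
# Encode a multiple-sequence alignment so each genomic position becomes
# one input token: the column of aligned nucleotides across species.
# Simplified sketch with an assumed vocabulary; '-' marks alignment gaps.
VOCAB = {c: i for i, c in enumerate("ACGT-")}

def encode_msa(rows: list[str]) -> list[list[int]]:
    # rows: aligned sequences of equal length, focal species first.
    length = len(rows[0])
    assert all(len(r) == length for r in rows), "rows must be aligned"
    # One column per position: [focal, species_2, ..., species_n].
    return [[VOCAB[row[j]] for row in rows] for j in range(length)]

msa = [
    "ACGTA",  # focal species (e.g. human for GPN-MSA)
    "ACG-A",  # aligned species with a gap at position 3
    "ACTTA",
]
columns = encode_msa(msa)
```

Positions that are conserved across species yield uniform columns, while diverged or gapped positions yield mixed columns, which is how cross-species conservation reaches the transformer directly.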
GPN is designed for researchers studying the functional consequences of genetic variation, particularly in contexts where experimental data is sparse or absent. Human geneticists can use GPN-MSA scores to prioritize variants of uncertain significance from clinical sequencing, complementing tools such as CADD, Enformer, and SpliceAI. Population geneticists can use the zero-shot scores to investigate signatures of natural selection or to rank putatively deleterious rare variants from genome-wide association studies. Plant biologists and model-organism researchers can train species-specific GPN models from publicly available genome assemblies to gain variant-level functional insight without access to organism-specific functional genomics datasets. Because GPN requires no labeled training data and covers the entire genome, it is particularly valuable for non-coding variant interpretation — a problem where supervised models are often limited by the availability and cell-type specificity of their training labels.
GPN established that DNA language models trained through self-supervised masked language modeling can serve as competitive, genome-wide variant effect predictors without functional annotation. The original PNAS paper (2023) demonstrated that evolutionary information encoded by multispecies pretraining is sufficient to identify deleterious variants across the full genome of Arabidopsis thaliana, and GPN-MSA extended this result to the human genome in a paper published in Nature Biotechnology (2025). These results are notable because they challenge the assumption that large supervised datasets of functional measurements are necessary for accurate non-coding variant scoring. The publicly released code and pretrained models lower the barrier for academic labs to apply genomic language models to their own variant interpretation problems. Current limitations include sensitivity to the phylogenetic breadth of training species — models trained on divergent alignments may capture different aspects of constraint — and the absence of cell-type-specific regulatory context, which supervised approaches using chromatin accessibility or histone modification data can provide.
Benegas, G., et al. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nature Biotechnology.
DOI: 10.1038/s41587-024-02511-w