A DNA language model for unsupervised genome-wide variant effect prediction, trained on multispecies genomes via masked language modeling without functional annotation labels.
GPN (Genomic Pre-trained Network) is a DNA language model developed by Gonzalo Benegas, Sanjit Singh Batra, and Yun S. Song at UC Berkeley's Song Lab. It was designed to address a fundamental gap in variant effect prediction: the inability to assess the functional impact of genetic variants genome-wide without relying on expensive, cell-type-specific functional genomics data. GPN learns representations of genomic sequence through self-supervised pretraining on raw DNA from multiple species, then uses those representations to score variants as a zero-shot log-likelihood ratio between the alternate and reference allele — no labeled training data required.
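The zero-shot score described above can be sketched in a few lines: mask the variant position, obtain the model's distribution over the four nucleotides at that position, and take the log-ratio of the alternate-allele probability to the reference-allele probability. The "model" below is a deliberately toy stand-in (a smoothed count over the surrounding window), not the actual GPN network; only the scoring arithmetic mirrors the paper's definition.

```python
import math

NUCS = "ACGT"

def toy_masked_model(context: str, pos: int) -> dict:
    # Stand-in for GPN: returns P(nucleotide | context with `pos` masked).
    # A real model would run the masked sequence through the network;
    # here we fake a distribution from the surrounding 5-bp windows.
    window = (context[max(0, pos - 5):pos] + context[pos + 1:pos + 6]) or "A"
    counts = {n: window.count(n) + 1 for n in NUCS}  # add-one smoothing
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

def gpn_score(seq: str, pos: int, ref: str, alt: str) -> float:
    # Zero-shot variant effect score: log P(alt | context) - log P(ref | context).
    # More negative means the alternate allele is less expected by the model,
    # i.e. putatively more deleterious.
    probs = toy_masked_model(seq, pos)
    return math.log(probs[alt]) - math.log(probs[ref])

seq = "ACGTACGTACGTACGT"
score = gpn_score(seq, pos=8, ref=seq[8], alt="C")
```

Swapping `toy_masked_model` for a pretrained masked language model turns this into the genome-wide scorer: no labels enter the computation at any point.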
The core innovation of GPN is its application of masked language modeling, widely used in protein language models, to entire genomes rather than to coding sequences alone. By training on unaligned reference genomes from a focal species and closely related species, GPN captures the evolutionary signatures of constraint that distinguish functionally important positions from neutral variation. The original model was demonstrated on Arabidopsis thaliana and seven related Brassicales species, achieving state-of-the-art performance at predicting the deleteriousness of variants across the full genome — including non-coding regions that make up the vast majority of eukaryotic DNA.
A subsequent extension, GPN-MSA, incorporated whole-genome multiple-sequence alignments and a transformer architecture to scale to the human genome (hg38), achieving top performance across coding and non-coding variant benchmarks. GPN-MSA was published in Nature Biotechnology in early 2025. The GPN framework, including code for training models from any reference genome, is freely available on GitHub.
The original GPN model uses a deep convolutional neural network (ConvNet) with residual connections. It takes a 512-bp DNA sequence as input, processes it through stacked convolutional layers to produce a high-dimensional contextual embedding at each nucleotide position, and outputs a probability distribution over the four nucleotides at each masked position via a final projection layer. During training, 15% of positions are randomly masked; during inference for variant effect prediction, only the variant position is masked. The model is implemented within the HuggingFace Transformers framework, supporting both the ConvNetEncoder and a transformer-based RoFormerEncoder with rotary position embeddings.
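The two masking regimes in the previous paragraph can be illustrated directly. The sketch below uses `"?"` as a stand-in mask symbol (the real model uses a dedicated mask token id from its tokenizer); the point is the contrast between masking a random 15% of positions for training and masking only the variant site for scoring.

```python
import random

MASK = "?"  # stand-in mask symbol; the real model uses a [MASK] token id

def mask_for_training(seq: str, rate: float = 0.15, rng=random) -> tuple[str, list[int]]:
    # Training regime: hide a random ~15% of positions; the model is
    # trained to reconstruct the original nucleotide at each hidden site.
    positions = [i for i in range(len(seq)) if rng.random() < rate]
    masked = "".join(MASK if i in positions else n for i, n in enumerate(seq))
    return masked, positions

def mask_for_inference(seq: str, variant_pos: int) -> str:
    # Scoring regime: mask only the variant site, so the model predicts
    # a nucleotide distribution there from the rest of the 512-bp window.
    return seq[:variant_pos] + MASK + seq[variant_pos + 1:]

rng = random.Random(0)
masked_train, hidden = mask_for_training("ACGT" * 8, rng=rng)
masked_infer = mask_for_inference("ACGT" * 8, variant_pos=5)
```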
GPN-MSA extends this framework to whole-genome multiple-sequence alignments, feeding columns of aligned nucleotides across species into a transformer network. This allows the model to directly encode cross-species conservation at each genomic position. GPN-MSA models for the human genome are available at three phylogenetic scales (vertebrate, mammalian, and primate), each a roughly 200M-parameter model trained on the hg38 reference with Zoonomia-derived alignments. GPN-Star, a further development in the same framework, trains on star-topology alignments and achieves state-of-the-art results across variant effect prediction benchmarks spanning both coding missense variants (ClinVar, COSMIC) and non-coding regulatory variants (OMIM, gnomAD enrichment, splicing effects).
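The "columns of aligned nucleotides" input can be made concrete with a minimal encoding sketch. The integer vocabulary and column layout below are illustrative assumptions, not GPN-MSA's actual tokenization (which is defined in the released code); the sketch only shows how each genomic position becomes one token carrying the alignment column across species.

```python
# Encode a multiple-sequence alignment so each genomic position becomes
# one input token: the column of aligned nucleotides across species.
# Simplified sketch with an assumed vocabulary; '-' marks alignment gaps.
VOCAB = {c: i for i, c in enumerate("ACGT-")}

def encode_msa(rows: list[str]) -> list[list[int]]:
    # rows: aligned sequences of equal length, focal species first.
    length = len(rows[0])
    assert all(len(r) == length for r in rows), "rows must be aligned"
    # One column per position: [focal, species_2, ..., species_n].
    return [[VOCAB[row[j]] for row in rows] for j in range(length)]

msa = [
    "ACGTA",  # focal species (e.g. human for GPN-MSA)
    "ACG-A",  # aligned species with a gap at position 3
    "ACTTA",
]
columns = encode_msa(msa)
```

Positions that are conserved across species yield uniform columns, while diverged or gapped positions yield mixed columns, which is how cross-species conservation reaches the transformer directly.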
GPN is designed for researchers studying the functional consequences of genetic variation, particularly in contexts where experimental data is sparse or absent. Human geneticists can use GPN-MSA scores to prioritize variants of uncertain significance from clinical sequencing, complementing tools such as CADD, Enformer, and SpliceAI. Population geneticists can use the zero-shot scores to investigate signatures of natural selection or to rank putatively deleterious rare variants from genome-wide association studies. Plant biologists and model-organism researchers can train species-specific GPN models from publicly available genome assemblies to gain variant-level functional insight without access to organism-specific functional genomics datasets. Because GPN requires no labeled training data and covers the entire genome, it is particularly valuable for non-coding variant interpretation — a problem where supervised models are often limited by the availability and cell-type specificity of their training labels.
GPN established that DNA language models trained through self-supervised masked language modeling can serve as competitive, genome-wide variant effect predictors without functional annotation. The original PNAS paper (2023) demonstrated that evolutionary information encoded by multispecies pretraining is sufficient to identify deleterious variants across the full genome of Arabidopsis thaliana, and GPN-MSA extended this result to the human genome in a paper published in Nature Biotechnology (2025). These results are notable because they challenge the assumption that large supervised datasets of functional measurements are necessary for accurate non-coding variant scoring. The publicly released code and pretrained models lower the barrier for academic labs to apply genomic language models to their own variant interpretation problems. Current limitations include sensitivity to the phylogenetic breadth of training species — models trained on divergent alignments may capture different aspects of constraint — and the absence of cell-type-specific regulatory context, which supervised approaches using chromatin accessibility or histone modification data can provide.
Benegas, G., et al. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nature Biotechnology.
DOI: 10.1038/s41587-024-02511-w