Transformer-based DNA language model using whole-genome multispecies alignments for genome-wide variant effect prediction across coding and non-coding regions.
GPN-MSA (Genomic Pre-trained Network with Multiple-Sequence Alignment) is a DNA language model developed by Yun S. Song's group at UC Berkeley. It addresses a long-standing gap in genomic AI: prior DNA models had not achieved strong variant effect prediction across both coding and non-coding regions of complex genomes such as the human genome. The model was released as a preprint in October 2023 and was published in Nature Biotechnology in January 2025.
Where earlier DNA language models such as Nucleotide Transformer were trained on unaligned genome sequences and required weeks on more than a hundred GPUs to train, GPN-MSA takes a different approach: it encodes evolutionary information directly through whole-genome multiple sequence alignments (MSAs) spanning 100 vertebrate species. Rather than learning conservation implicitly from large amounts of sequence data, the model is explicitly given the alignment of orthologous positions across species as input, allowing it to learn which positions are constrained by selection and which vary freely. This strategy yields strong performance at a fraction of the computational cost.
The result is a model that can score all ~9 billion possible single-nucleotide variants (SNVs) in the human genome, producing pre-computed deleteriousness scores made freely available via HuggingFace. These scores cover intronic, intergenic, splicing, and coding variants alike, making GPN-MSA one of the few methods with strong, genome-wide generalization.
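The "~9 billion" figure follows from simple arithmetic: roughly 3 billion reference positions, each with 3 possible alternate alleles. The sketch below enumerates SNVs for a toy sequence and scores them with a log-likelihood ratio between the alternate and reference allele. That scoring rule is a common convention for masked language models and is an assumption here, since the text above does not spell out the formula. The function names and the toy probabilities are hypothetical.

```python
import math
from typing import Dict, Iterator, Tuple

NUCLEOTIDES = "ACGT"


def enumerate_snvs(ref_seq: str, start: int = 1) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, ref, alt) for every possible SNV in a reference sequence."""
    for offset, ref in enumerate(ref_seq.upper()):
        if ref not in NUCLEOTIDES:
            continue  # skip N and other ambiguous bases
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield (start + offset, ref, alt)


def llr_score(probs: Dict[str, float], ref: str, alt: str) -> float:
    """ASSUMED scoring rule: log P(alt) - log P(ref) at a masked position."""
    return math.log(probs[alt]) - math.log(probs[ref])


# Toy example: two reference bases give 2 * 3 = 6 candidate SNVs.
snvs = list(enumerate_snvs("AC"))

# Hypothetical predicted distribution at a conserved position whose reference is A.
toy_probs = {"A": 0.90, "C": 0.02, "G": 0.05, "T": 0.03}
scores = {alt: llr_score(toy_probs, "A", alt) for alt in "CGT"}
```

Under this convention, every alternate allele at a strongly conserved position receives a negative score, and the least probable allele receives the most negative one.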
GPN-MSA uses RoFormer, a transformer architecture with rotary position embeddings, applied to MSA columns rather than individual sequences. The input to the model is a 128-bp window of an MSA, that is, a matrix of nucleotides across positions (columns) and species (rows), with a subset of human reference positions masked. The model's task is to predict the nucleotide at each masked human position given both sequence context (adjacent positions) and evolutionary context (orthologous positions in aligned species). Cross-species information is not handled by a separate attention mechanism over species; it enters implicitly, because each input position carries its full alignment column.
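As an illustration of the masking setup described above, the following sketch masks a random subset of positions in the human row of a toy MSA window while leaving the other species untouched. It is not the authors' implementation: the mask symbol, the 15% mask fraction, and the convention that row 0 is human are all assumptions made for the example.

```python
import random
from typing import Dict, List, Tuple

MASK = "?"  # hypothetical mask symbol


def mask_human_row(
    msa: List[str], mask_fraction: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Mask random positions in row 0 (assumed human); return masked MSA and targets.

    msa: list of equal-length strings, one row per species.
    targets: maps each masked column index to the true human nucleotide.
    """
    rng = random.Random(seed)
    human = list(msa[0])
    n_mask = max(1, round(mask_fraction * len(human)))  # mask at least one position
    targets: Dict[int, str] = {}
    for pos in sorted(rng.sample(range(len(human)), n_mask)):
        targets[pos] = human[pos]
        human[pos] = MASK
    # Other species rows are passed through unchanged as evolutionary context.
    return ["".join(human)] + list(msa[1:]), targets


# Toy window: 3 species x 8 columns (the real windows are 128 bp wide).
toy_msa = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
masked_msa, targets = mask_human_row(toy_msa)
```

A model trained on such inputs would be asked to recover each entry of `targets` from the unmasked human positions and the full columns of the other rows.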
Training data consists of human whole-genome alignments with 100 vertebrate species, drawn from public multi-alignment resources. To focus learning, the top 5% most conserved genomic windows are fully sampled during training, with 0.1% random sampling of the remaining genome; chromosomes 21 and 22 are held out for validation and testing respectively. On standard benchmarks for variant deleteriousness — including ClinVar pathogenic vs. benign missense variants, COSMIC somatic mutations, OMIM regulatory variants, and gnomAD rare vs. common variant enrichment — GPN-MSA outperforms or matches methods including CADD, phyloP, phastCons, and Nucleotide Transformer (2.5B parameters). It also outperforms Enformer on non-coding variant benchmarks.
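The window sampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the window tuples, the conservation score, and the function name are hypothetical, and whether the top 5% is ranked before or after removing the held-out chromosomes is not specified in the text (here it is ranked after).

```python
import random
from typing import List, Tuple

Window = Tuple[str, int, float]  # (chrom, start, conservation score), hypothetical layout

HELD_OUT = {"chr21": "validation", "chr22": "test"}


def select_training_windows(
    windows: List[Window],
    top_fraction: float = 0.05,
    background_rate: float = 0.001,
    seed: int = 0,
) -> List[Window]:
    """Keep all of the most conserved windows plus a sparse random sample of the rest."""
    rng = random.Random(seed)
    # Drop held-out chromosomes before ranking (an assumption, see lead-in).
    candidates = [w for w in windows if w[0] not in HELD_OUT]
    ranked = sorted(candidates, key=lambda w: w[2], reverse=True)
    n_top = int(len(ranked) * top_fraction)
    top, rest = ranked[:n_top], ranked[n_top:]
    # Each remaining window is kept independently with probability background_rate.
    sampled_rest = [w for w in rest if rng.random() < background_rate]
    return top + sampled_rest


# Toy genome: 10,000 windows on chr1 with random conservation, 100 on chr21.
toy_rng = random.Random(1)
toy_windows = [("chr1", i * 128, toy_rng.random()) for i in range(10_000)]
toy_windows += [("chr21", i * 128, 1.0) for i in range(100)]
selected = select_training_windows(toy_windows)
```

With these toy numbers, the 500 most conserved chr1 windows are always retained, and on average about 10 of the remaining 9,500 are added by the background sample.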
GPN-MSA is designed for researchers and clinicians who need to prioritize genetic variants for functional follow-up. In rare disease genetics, the pre-computed SNV scores can be integrated into variant filtering pipelines to identify candidate pathogenic variants in patients without a diagnosis. In population genetics, the scores enable rare variant burden testing and are useful for assessing constraint on non-coding elements. In functional genomics, the deleteriousness scores can complement experimental readouts from deep mutational scanning or CRISPR screens. Because scores for all ~9 billion human SNVs are pre-computed and publicly available, GPN-MSA can be queried directly without re-running the model, lowering the barrier to adoption for wet-lab groups with limited computational infrastructure.
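A variant filtering step of the kind mentioned above might look like the sketch below, which looks up a patient's variants in a table of pre-computed scores and keeps those beyond a cutoff. Everything specific here is an assumption for illustration: the column layout of the score table, the threshold of -5.0, and the convention that more negative scores indicate greater predicted deleteriousness. The actual format of the published score files should be checked before adapting this.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

# Hypothetical score table; the real files may use a different layout.
TOY_SCORES_TSV = (
    "chrom\tpos\tref\talt\tscore\n"
    "chr1\t1000\tA\tG\t-8.2\n"
    "chr1\t1000\tA\tC\t-7.5\n"
    "chr1\t2000\tC\tT\t-0.4\n"
    "chr2\t500\tG\tA\t-5.9\n"
)


def load_scores(handle: TextIO) -> Dict[Variant, float]:
    """Read a tab-separated score table into a dict keyed by variant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["chrom"], int(row["pos"]), row["ref"], row["alt"]): float(row["score"])
        for row in reader
    }


def prioritize(
    variants: Iterable[Variant], scores: Dict[Variant, float], threshold: float = -5.0
) -> List[Tuple[Variant, float]]:
    """Return variants scoring at or below the threshold, most negative first."""
    hits = [(v, scores[v]) for v in variants if v in scores and scores[v] <= threshold]
    return sorted(hits, key=lambda item: item[1])


scores = load_scores(io.StringIO(TOY_SCORES_TSV))
patient_variants = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "A"),
    ("chr3", 42, "T", "C"),  # absent from the toy table, so it is skipped
]
candidates = prioritize(patient_variants, scores)
```

For genome-scale tables, an indexed lookup by genomic coordinate would be more practical than loading everything into memory; the dict is used here only to keep the example self-contained.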
GPN-MSA demonstrates that incorporating evolutionary information through explicit multispecies alignments is a highly effective inductive bias for genomic AI, one that allows a modestly sized model trained in hours to compete with or surpass billion-parameter models trained for weeks. Its publication in Nature Biotechnology and the availability of genome-wide pre-computed scores have made it a practical reference for the variant effect prediction community. A notable limitation is that the model's performance on splice-region variants lags behind specialized tools such as SpliceAI, and variants in regions that are poorly represented in multispecies alignments (e.g., primate-specific non-conserved elements) may be harder to interpret. The underlying songlab-cal/gpn repository also includes the original GPN model trained on single-species data, which provides a useful point of comparison for isolating the contribution of the MSA input.
Benegas, G., et al. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nature Biotechnology.
DOI: 10.1038/s41587-024-02511-w