GPN-MSA

DNA language model for variant effect prediction across coding and non-coding regions, using whole-genome alignments of 100 vertebrate species.

Released: October 2023

Parameters: 86 million

GPN-MSA (Genomic Pre-trained Network with Multiple-Sequence Alignment) is a DNA language model developed by Yun S. Song's group at UC Berkeley. It addresses a long-standing gap in genomic AI: earlier DNA language models did not achieve strong variant effect prediction across both coding and non-coding regions of complex genomes such as the human genome. The model was released as a preprint in October 2023 and subsequently published in Nature Biotechnology in 2025.

Earlier DNA language models such as Nucleotide Transformer were trained on unaligned genomic sequences and required weeks of training on more than a hundred GPUs. GPN-MSA takes a different approach: it encodes evolutionary information directly through whole-genome multiple sequence alignments (MSAs) spanning 100 vertebrate species. Rather than learning conservation implicitly from large amounts of sequence data, the model receives the alignment of orthologous positions across species as input, which lets it learn which positions are constrained by selection and which vary freely. This strategy yields strong performance at a fraction of the computational cost.

The result is a model that can score all ~9 billion possible single-nucleotide variants (SNVs) in the human genome. Pre-computed deleteriousness scores are freely available via HuggingFace and cover intronic, intergenic, splice-region, and coding variants alike, making GPN-MSA one of the few methods that generalize well across the whole genome.
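The ~9 billion figure follows from simple counting: every reference position has three possible alternate bases. A minimal sketch of that arithmetic, assuming a genome length of roughly 3.1 billion base pairs (an approximation that is not stated on this page; the exact count depends on which contigs and gaps are included):

```python
# Rough count of possible single-nucleotide variants (SNVs) in the human genome.
# ASSUMPTION: ~3.1 billion reference positions; this approximate length is not
# taken from the page and varies with the assembly and the contigs counted.
GENOME_LENGTH_BP = 3_100_000_000
NUCLEOTIDES = "ACGT"

# Each position carries one reference base and three possible alternate bases.
alts_per_position = len(NUCLEOTIDES) - 1
total_snvs = GENOME_LENGTH_BP * alts_per_position


def possible_snvs(ref_base: str) -> list[str]:
    """Return the three alternate alleles for a given reference base."""
    return [b for b in NUCLEOTIDES if b != ref_base.upper()]


print(f"{total_snvs / 1e9:.1f} billion possible SNVs")
print(possible_snvs("G"))
```

Under that assumed length the count comes to about 9.3 billion, in line with the "~9 billion" quoted above.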

Key Features

Multispecies alignment input: Each training example is a 128-bp window of a whole-genome MSA across 100 vertebrate species, with the 10 closest primate species withheld during training to reduce data leakage and test generalization to human-proximal variation.
Computationally efficient training: The 86-million-parameter RoFormer model trains in approximately 3.5 hours on 4 NVIDIA A100 GPUs, compared with 28 days on 128 GPUs for Nucleotide Transformer, while achieving superior predictive performance.
Genome-wide coverage: Pre-computed deleteriousness scores are available for approximately 9 billion possible human SNVs, covering coding and non-coding regions without requiring per-variant inference.
Coding and non-coding generalization: Unlike splicing-specific tools (SpliceAI) or protein-centric approaches (ESM), GPN-MSA performs strongly across missense, synonymous, regulatory, and splice-adjacent variants in a single unified model.
Fast inference: The model scores approximately 25 million variants per hour, enabling large-scale analyses such as genome-wide association study (GWAS) prioritization and rare variant burden testing.
Weighted training objective: The masked-language modeling loss is designed to down-weight repetitive and low-complexity genomic elements and up-weight conserved regions, focusing model capacity on functionally relevant sequence.
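The weighted training objective in the last item can be illustrated with a small sketch. It computes a cross-entropy averaged over masked positions, where each position carries its own weight. The weighting rule (`position_weight`), its `repeat_weight=0.1` default, and the conservation scale in [0, 1] are hypothetical choices for illustration; the page only states that repeats are down-weighted and conserved regions up-weighted.

```python
import math


def weighted_mlm_loss(probs, targets, masked, weights):
    """Weighted cross-entropy averaged over masked positions.

    probs:   per-position dicts of predicted probability for each nucleotide
    targets: true human reference nucleotide at each position
    masked:  True where the human base was hidden from the model
    weights: per-position loss weight
    """
    num = 0.0
    den = 0.0
    for p, t, m, w in zip(probs, targets, masked, weights):
        if not m:
            continue  # unmasked positions do not contribute to the loss
        num += w * -math.log(p[t])
        den += w
    return num / den if den > 0 else 0.0


def position_weight(is_repeat, conservation, repeat_weight=0.1):
    """HYPOTHETICAL rule: shrink repeats, grow with conservation in [0, 1]."""
    base = repeat_weight if is_repeat else 1.0
    return base * (1.0 + conservation)


# Toy example: three positions, two of them masked.
uniform = {b: 0.25 for b in "ACGT"}
confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
probs = [uniform, confident, uniform]
targets = ["C", "A", "G"]
masked = [True, True, False]
weights = [
    position_weight(is_repeat=True, conservation=0.0),   # repetitive, unconserved
    position_weight(is_repeat=False, conservation=1.0),  # highly conserved
    position_weight(is_repeat=False, conservation=0.5),  # unmasked, ignored
]
loss = weighted_mlm_loss(probs, targets, masked, weights)
print(round(loss, 4))
```

In this toy case the poorly predicted repeat position contributes little to the loss, while the conserved position dominates the average.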

Technical Details

GPN-MSA uses RoFormer, a transformer architecture with rotary position embeddings, applied to MSA columns rather than to individual sequences. The input is a 128-bp window of an MSA, that is, a matrix of nucleotides with positions as columns and species as rows, in which a subset of human reference positions is masked. The model predicts the nucleotide at each masked human position from both sequence context (adjacent positions) and evolutionary context (orthologous positions in the aligned species). Cross-species information is not handled by a separate attention step; it enters implicitly through the column-structured input representation.
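The column-structured input can be sketched as follows. Each position's feature vector is built by concatenating the one-hot codes of all species in that alignment column, and a few human positions are hidden first. This is an illustrative sketch only: the gap symbol in `VOCAB`, the `MASK` symbol, the 15% mask fraction, and the all-zero encoding of masked bases are assumptions, not details given on this page.

```python
import random

VOCAB = "ACGT-"  # four nucleotides plus an alignment-gap symbol (assumed)
MASK = "?"       # HYPOTHETICAL mask symbol, applied only to the human row


def one_hot(symbol):
    """One-hot encode a symbol; the mask or any unknown symbol maps to zeros."""
    return [1.0 if symbol == v else 0.0 for v in VOCAB]


def mask_human_row(human, mask_fraction, rng):
    """Hide a random subset of human positions; return the row and hidden indices."""
    n_mask = max(1, round(mask_fraction * len(human)))
    hidden = sorted(rng.sample(range(len(human)), n_mask))
    row = list(human)
    for i in hidden:
        row[i] = MASK
    return "".join(row), hidden


def encode_window(msa_rows):
    """Turn an MSA window (row 0 = human) into one feature vector per position.

    Each vector concatenates the one-hot codes of every species in that column,
    so evolutionary context travels with the position itself.
    """
    length = len(msa_rows[0])
    assert all(len(r) == length for r in msa_rows), "rows must be aligned"
    features = []
    for pos in range(length):
        column = []
        for row in msa_rows:
            column.extend(one_hot(row[pos]))
        features.append(column)
    return features


# Toy 8-bp window with three species (the real input is 128 bp wide).
human = "ACGTACGT"
others = ["ACGTACGA", "AC-TACGT"]
rng = random.Random(0)
masked_human, hidden = mask_human_row(human, mask_fraction=0.15, rng=rng)
features = encode_window([masked_human] + others)
print(masked_human, hidden)
print(len(features), len(features[0]))
```

A sequence model such as a transformer would then attend across the positions of `features`, with the other species' bases still visible at the masked human positions.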

Training data consists of a whole-genome alignment of 100 vertebrate species anchored on the human genome, drawn from public multiple-alignment resources. To focus learning, the 5% most conserved genomic windows are all used for training, together with a 0.1% random sample of the remaining windows; chromosomes 21 and 22 are held out for validation and testing, respectively. On standard variant deleteriousness benchmarks (ClinVar pathogenic vs. benign missense variants, COSMIC somatic mutations, OMIM regulatory variants, and gnomAD rare vs. common variant enrichment), GPN-MSA matches or outperforms CADD, phyloP, phastCons, and the 2.5-billion-parameter Nucleotide Transformer. It also outperforms Enformer on non-coding variant benchmarks.
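The window selection and chromosome split described above can be sketched in a few lines. The per-window conservation score and the function names `sample_windows` and `split_for` are hypothetical; how conservation is measured per window is not specified here, so synthetic scores stand in for it.

```python
import random


def split_for(chrom):
    """Assign a chromosome to a data split: chr21 validation, chr22 test."""
    if chrom == "chr21":
        return "validation"
    if chrom == "chr22":
        return "test"
    return "train"


def sample_windows(windows, top_fraction=0.05, background_rate=0.001, seed=0):
    """Keep the most conserved windows plus a sparse random sample of the rest.

    windows: list of (window_id, conservation_score); higher = more conserved.
    """
    rng = random.Random(seed)
    ranked = sorted(windows, key=lambda w: w[1], reverse=True)
    n_top = round(len(ranked) * top_fraction)
    top, rest = ranked[:n_top], ranked[n_top:]
    # Each remaining window is kept independently with a small probability.
    background = [w for w in rest if rng.random() < background_rate]
    return top + background


# Synthetic conservation scores, for illustration only.
score_rng = random.Random(1)
windows = [(i, score_rng.random()) for i in range(100_000)]
selected = sample_windows(windows)
print(len(selected), split_for("chr21"), split_for("chr22"), split_for("chr1"))
```

With 100,000 synthetic windows, the selection contains the 5,000 most conserved windows plus, on average, about 95 background windows.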

Applications

GPN-MSA is designed for researchers and clinicians who need to prioritize genetic variants for functional follow-up. In rare disease genetics, the pre-computed SNV scores can be integrated into variant filtering pipelines to identify candidate pathogenic variants in patients without a diagnosis. In population genetics, the scores enable rare variant burden testing and are useful for assessing constraint on non-coding elements. In functional genomics, the deleteriousness scores can complement experimental readouts from deep mutational scanning or CRISPR screens. Because scores for all ~9 billion human SNVs are pre-computed and publicly available, GPN-MSA can be queried directly without re-running the model, lowering the barrier to adoption for wet-lab groups with limited computational infrastructure.
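A variant filtering step built on pre-computed scores might look like the sketch below. It assumes the score is a log-likelihood ratio of the alternate versus the reference allele, a common convention for masked language models, with more negative values read as more likely deleterious. The score table, its `(chrom, pos, ref, alt)` key layout, the values in it, and the threshold are all hypothetical; the format of the released score files is not described on this page.

```python
import math


def llr_score(probs, ref, alt):
    """Log-likelihood ratio of the alternate vs. the reference allele.

    ASSUMPTION: more negative means the alternate allele is less plausible
    than the reference, read here as more likely deleterious.
    """
    return math.log(probs[alt] / probs[ref])


def prioritize(variants, scores, threshold):
    """Return scored variants at or below the threshold, most negative first."""
    hits = [(v, scores[v]) for v in variants if v in scores and scores[v] <= threshold]
    return sorted(hits, key=lambda item: item[1])


# HYPOTHETICAL pre-computed scores keyed by (chrom, pos, ref, alt).
scores = {
    ("chr1", 1000, "A", "G"): -7.2,
    ("chr1", 1000, "A", "T"): -0.3,
    ("chr2", 5000, "C", "T"): -4.1,
}
patient_variants = [
    ("chr1", 1000, "A", "T"),
    ("chr2", 5000, "C", "T"),
    ("chr1", 1000, "A", "G"),
    ("chr3", 42, "G", "A"),  # not present in the toy score table
]
candidates = prioritize(patient_variants, scores, threshold=-2.0)
for variant, score in candidates:
    print(variant, score)

# Scoring a single variant from model probabilities at a masked position.
example = llr_score({"A": 0.90, "C": 0.05, "G": 0.01, "T": 0.04}, ref="A", alt="G")
print(round(example, 2))
```

Because the lookup is a plain table query, this kind of filter needs no GPU and no model inference, which is the practical benefit of pre-computed scores.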

Impact

GPN-MSA demonstrates that explicit multispecies alignments are a highly effective inductive bias for genomic AI: a modestly sized model trained in hours can compete with or surpass billion-parameter models trained for weeks. Its publication in Nature Biotechnology and the availability of genome-wide pre-computed scores have made it a practical reference for the variant effect prediction community. A notable limitation is that performance on splice-region variants lags behind specialized tools such as SpliceAI, and predictions in regions that are poorly represented in multispecies alignments (e.g., primate-specific non-conserved elements) may be harder to interpret. The underlying songlab-cal/gpn repository also includes the original GPN model, which takes unaligned sequence as input, providing useful ablation context for the specific contribution of the MSA.
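The "hours versus weeks" comparison can be made concrete using the training figures quoted under Key Features. The sketch below converts both to GPU-hours. It is a coarse comparison only, since it ignores any differences in hardware, precision, and utilization between the two training runs.

```python
def gpu_hours(wall_clock_hours, n_gpus):
    """Total GPU-hours for a training run."""
    return wall_clock_hours * n_gpus


gpn_msa = gpu_hours(3.5, 4)                        # 3.5 hours on 4 A100 GPUs
nucleotide_transformer = gpu_hours(28 * 24, 128)   # 28 days on 128 GPUs
ratio = nucleotide_transformer / gpn_msa

print(gpn_msa, nucleotide_transformer, round(ratio))
# prints: 14.0 86016 6144
```

On these figures, GPN-MSA uses roughly three to four orders of magnitude fewer GPU-hours than Nucleotide Transformer.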

Citation

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants

Benegas, G., et al. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nature Biotechnology.

DOI: 10.1038/s41587-024-02511-w

Recent citations

Papers that recently cited this model.

Prediction of human pathogenic start loss variants based on iterative feature representation learning
Jinsong Cai, Chen Wei, Junfeng Xia, et al.
PeerJ Computer Science · Jul 2026 · 0 citations
Current challenges in GWAS integration and fine-mapping for variant interpretation
Omar Ahmed, Neha Saravanan, A. B. Rovsing, et al.
bioRxiv · Jul 2026 · 0 citations
Advancing bioinformatics with language models: components, applications, and perspectives
Jiajia Liu, Mengyuan Yang, Yankai Yu, et al.
Briefings in Bioinformatics · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Transformers and genome language models
Micaela Elisa Consens, Cameron Dufault, Michael Wainberg, et al.
Nature Machine Intelligence · Mar 2025 · 80 citations
Genome modelling and design across all domains of life with Evo 2
G. Brixi, Matthew G. Durrant, Jerome Ku, et al.
Nature · Mar 2026 · 59 citations
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, et al.
arXiv.org · May 2025 · 40 citations
GENERator: A Long-Context Generative Genomic Foundation Model
Wei Wu, Qiuyi Li, Yuanyuan Zhang, et al.
Feb 2025 · 38 citations
Genomic language models: opportunities and challenges
Gonzalo Benegas, Chengzhong Ye, Carlos Albors, et al.
Trends in Genetics · Jan 2025 · 33 citations

Citations

Total Citations: 90
Influential: 11
References: 57

GitHub

Stars: 349
Forks: 49
Open Issues: 3
Contributors: 4
Last Push: 11h ago
Language: Jupyter Notebook
License: MIT

HuggingFace

Downloads: 216
Likes: 8
Last Modified: 1y ago
Pipeline: fill-mask

Fields of citing research

Biology: 92%
Computer Science: 90%
Medicine: 52%
Environmental Science: 7%
Agricultural and Food Sciences: 3%
Law: 1%
Linguistics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 87 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 87

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset

