bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

DNA & Gene

GPN-MSA

UC Berkeley

Transformer-based DNA language model using whole-genome multispecies alignments for genome-wide variant effect prediction across coding and non-coding regions.

Released: 2023
Parameters: 86,000,000

Overview

GPN-MSA (Genomic Pre-trained Network with Multiple-Sequence Alignment) is a DNA language model developed by Yun S. Song's group at UC Berkeley. It targets a long-standing gap in genomic AI: prior DNA language models did not achieve strong variant effect prediction across both coding and non-coding regions of complex genomes such as the human genome. The model was published as a preprint in October 2023 and subsequently appeared in Nature Biotechnology in December 2024.

Earlier DNA language models such as Nucleotide Transformer were trained on unaligned genomic sequences and required weeks on hundreds of GPUs to train. GPN-MSA instead encodes evolutionary information directly through whole-genome multiple sequence alignments (MSAs) spanning 100 vertebrate species. Rather than learning conservation implicitly from large amounts of sequence data, the model receives the alignment of orthologous positions across species as input, which lets it learn which positions are constrained by selection and which vary freely. This strategy yields strong performance at a small fraction of the computational cost: hours on 4 GPUs rather than weeks on 128.

The result is a model that can score all ~9 billion possible single-nucleotide variants (SNVs) in the human genome. The pre-computed deleteriousness scores are freely available via HuggingFace and cover intronic, intergenic, splicing, and coding variants alike, making GPN-MSA one of the few methods with strong genome-wide generalization.

Key Features

  • Multispecies alignment input: Each training example is a 128-bp window of a whole-genome MSA across 100 vertebrate species, with the 10 closest primate species withheld during training to reduce data leakage and test generalization to human-proximal variation.
  • Computationally efficient training: The 86-million-parameter RoFormer model trains in approximately 3.5 hours on 4 NVIDIA A100 GPUs (compared with 28 days on 128 GPUs for Nucleotide Transformer) while achieving superior predictive performance.
  • Genome-wide coverage: Pre-computed deleteriousness scores are available for approximately 9 billion possible human SNVs, covering coding and non-coding regions without requiring per-variant inference.
  • Coding and non-coding generalization: Unlike splicing-specific tools (SpliceAI) or protein-centric approaches (ESM), GPN-MSA performs strongly across missense, synonymous, regulatory, and splice-adjacent variants in a single unified model.
  • Fast inference: The model scores approximately 25 million variants per hour, enabling large-scale analyses such as genome-wide association study (GWAS) prioritization and rare variant burden testing.
  • Weighted training objective: The masked-language modeling loss is designed to down-weight repetitive and low-complexity genomic elements and up-weight conserved regions, focusing model capacity on functionally relevant sequence.
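The training and inference figures in the list above can be put on a common scale with some quick arithmetic. The sketch below uses only the numbers quoted in this section. GPU-hours are a crude comparison, and the section does not state the hardware behind the 25-million-variants-per-hour figure, so the scoring time is for that unspecified setup.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.

# Training cost in GPU-hours.
gpn_msa_gpu_hours = 3.5 * 4        # 3.5 hours on 4 A100 GPUs
nt_gpu_hours = 28 * 24 * 128       # 28 days on 128 GPUs (Nucleotide Transformer)
ratio = nt_gpu_hours / gpn_msa_gpu_hours

# Time to score every possible human SNV at the quoted throughput.
total_snvs = 9_000_000_000         # ~9 billion possible SNVs
variants_per_hour = 25_000_000     # ~25 million variants per hour
scoring_hours = total_snvs / variants_per_hour

print(f"GPN-MSA training: {gpn_msa_gpu_hours:.0f} GPU-hours")
print(f"Nucleotide Transformer training: {nt_gpu_hours:,} GPU-hours")
print(f"Ratio: ~{ratio:,.0f}x")
print(f"Genome-wide scoring: {scoring_hours:.0f} hours (~{scoring_hours / 24:.0f} days)")
```

On these figures, training differs by roughly three to four orders of magnitude in GPU-hours, and a full genome-wide scoring pass takes on the order of two weeks at the quoted rate, which is one reason the pre-computed scores matter in practice.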

Technical Details

GPN-MSA uses RoFormer, a transformer architecture with rotary position embeddings, applied to MSA columns rather than individual sequences. The input to the model is a 128-bp window of an MSA (a matrix of nucleotides across positions, as columns, and species, as rows) with a subset of human reference positions masked. The model's task is to predict the nucleotide at each masked human position given both sequence context (adjacent positions) and evolutionary context (orthologous positions in aligned species). Evolutionary context enters through the column-structured input representation rather than through a separate cross-species attention mechanism.
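The input layout and masking step described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the `?` mask symbol, the 15% mask fraction, the helper names, and the example probabilities are all assumptions made for illustration. The log-likelihood-ratio score at the end is a common way to turn masked-nucleotide probabilities into a variant score; this section does not spell out the exact formula GPN-MSA uses.

```python
import math
import random

MASK = "?"  # hypothetical mask symbol chosen for this sketch


def mask_human_row(window, human_row=0, mask_fraction=0.15, seed=0):
    """Return a copy of an MSA window with some human positions masked.

    `window` is a list of equal-length strings, one per species (rows);
    each string index is an alignment column (position).
    """
    rng = random.Random(seed)
    length = len(window[human_row])
    n_masked = max(1, round(mask_fraction * length))
    masked_positions = sorted(rng.sample(range(length), n_masked))
    human = list(window[human_row])
    for pos in masked_positions:
        human[pos] = MASK
    masked = list(window)
    masked[human_row] = "".join(human)
    return masked, masked_positions


def log_likelihood_ratio(probs, ref, alt):
    """Score a variant as log P(alt) - log P(ref) at a masked position."""
    return math.log(probs[alt]) - math.log(probs[ref])


# Toy 4-species x 8-column window; row 0 plays the role of human.
toy_window = [
    "ACGTACGT",  # human
    "ACGTACGT",
    "ACGTTCGT",
    "ACATACGT",
]
masked_window, masked_positions = mask_human_row(toy_window)

# Invented stand-in for a model's predicted distribution at one masked position.
example_probs = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
llr = log_likelihood_ratio(example_probs, ref="A", alt="C")

print(masked_window[0], masked_positions)
print(f"LLR for A>C: {llr:.2f}")
```

Only the human row is altered; the other species rows stay visible, which is what lets a model use orthologous positions as evidence when filling in the masked nucleotide.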

Training data consists of human whole-genome alignments with 100 vertebrate species, drawn from public multi-alignment resources. To focus learning, the top 5% most conserved genomic windows are fully sampled during training, with 0.1% random sampling of the remaining genome; chromosomes 21 and 22 are held out for validation and testing, respectively. On standard benchmarks for variant deleteriousness (ClinVar pathogenic vs. benign missense variants, COSMIC somatic mutations, OMIM regulatory variants, and gnomAD rare vs. common variant enrichment), GPN-MSA outperforms or matches methods including CADD, phyloP, phastCons, and Nucleotide Transformer (2.5B parameters). It also outperforms Enformer on non-coding variant benchmarks.
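The window-sampling scheme described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the `(chrom, start, conservation)` tuple layout, the function name, and the choice to rank only windows outside the held-out chromosomes are assumptions, and the conservation values are randomly generated toy data.

```python
import random


def select_training_windows(windows, top_fraction=0.05, background_rate=0.001,
                            held_out=("chr21", "chr22"), seed=0):
    """Keep all of the most conserved windows plus a sparse random background.

    `windows` is a list of (chrom, start, conservation) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    # Exclude the chromosomes reserved for validation and testing.
    candidates = [w for w in windows if w[0] not in held_out]
    ranked = sorted(candidates, key=lambda w: w[2], reverse=True)
    n_top = int(len(ranked) * top_fraction)
    top, rest = ranked[:n_top], ranked[n_top:]
    # Sample each remaining window independently at the background rate.
    background = [w for w in rest if rng.random() < background_rate]
    return top + background


# Toy data: 128-bp windows with random conservation scores.
data_rng = random.Random(1)
windows = [("chr1", i * 128, data_rng.random()) for i in range(10_000)]
windows += [("chr21", i * 128, data_rng.random()) for i in range(100)]
windows += [("chr22", i * 128, data_rng.random()) for i in range(100)]

selected = select_training_windows(windows)
print(f"{len(selected)} of {len(windows)} windows selected for training")
```

With these rates, the conserved windows dominate the training set, while the sparse background keeps some exposure to the rest of the genome.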

Applications

GPN-MSA is designed for researchers and clinicians who need to prioritize genetic variants for functional follow-up. In rare disease genetics, the pre-computed SNV scores can be integrated into variant filtering pipelines to identify candidate pathogenic variants in patients without a diagnosis. In population genetics, the scores enable rare variant burden testing and are useful for assessing constraint on non-coding elements. In functional genomics, the deleteriousness scores can complement experimental readouts from deep mutational scanning or CRISPR screens. Because scores for all ~9 billion human SNVs are pre-computed and publicly available, GPN-MSA can be queried directly without re-running the model, lowering the barrier to adoption for wet-lab groups with limited computational infrastructure.
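A variant filtering step built on pre-computed scores could look something like the sketch below. Everything here is hypothetical: the in-memory `score_table`, the `(chrom, pos, ref, alt)` key layout, the coordinates, and the score values are invented for illustration and do not reflect the format of the published score files. The sketch also assumes that more negative scores indicate more deleterious variants, which should be checked against the official documentation before use.

```python
def prioritize(variants, score_table, top_n=3):
    """Rank variants by pre-computed score, most negative first.

    Returns the top-ranked (score, variant) pairs and the variants
    that had no entry in the score table.
    """
    scored = [(score_table[v], v) for v in variants if v in score_table]
    missing = [v for v in variants if v not in score_table]
    scored.sort(key=lambda item: item[0])
    return scored[:top_n], missing


# Invented toy scores keyed by (chrom, pos, ref, alt).
score_table = {
    ("chr1", 1000, "A", "G"): -7.2,
    ("chr1", 2000, "C", "T"): -0.4,
    ("chr2", 3000, "G", "A"): -3.1,
    ("chr2", 4000, "T", "C"): 0.8,
}

patient_variants = [
    ("chr1", 2000, "C", "T"),
    ("chr2", 3000, "G", "A"),
    ("chr1", 1000, "A", "G"),
    ("chr3", 5000, "A", "T"),  # not present in the toy table
]

top, missing = prioritize(patient_variants, score_table, top_n=2)
for score, variant in top:
    print(score, variant)
print("unscored:", missing)
```

Reporting unscored variants separately, rather than silently dropping them, makes it easier to notice coordinate or genome-build mismatches between a patient call set and the score table.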

Impact

GPN-MSA demonstrates that incorporating evolutionary information through explicit multispecies alignments is a highly effective inductive bias for genomic AI: it allows a modestly sized model trained in hours to compete with or surpass billion-parameter models trained for weeks. Its publication in Nature Biotechnology and the availability of genome-wide pre-computed scores have made it a practical reference for the variant effect prediction community. A notable limitation is that the model's performance on splice-region variants lags behind specialized tools such as SpliceAI, and regions of the genome poorly represented in multispecies alignments (e.g., primate-specific non-conserved elements) may be harder to interpret. The underlying songlab-cal/gpn repository also includes the original GPN model trained on single-species data, a useful point of comparison for isolating the contribution of the MSA input.

Citation

Benegas, G., et al. (2025) A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nature Biotechnology.

DOI: 10.1038/s41587-024-02511-w

Metrics

GitHub

Stars: 339
Forks: 50
Open Issues: 2
Contributors: 4
Last Push: 1mo ago
Language: Jupyter Notebook
License: MIT

Citations

Total Citations: 65
Influential: 6
References: 57

Tags

variant effect prediction, MSA-based, foundation model, DNA, genomics

Resources

GitHub Repository, Research Paper, HuggingFace Model