bio.rodeo
DNA & Gene

VariantFormer

Chan Zuckerberg Initiative / Chan Zuckerberg Biohub

A 1.2-billion-parameter hierarchical transformer that predicts personalized gene expression from diploid genomes, integrating individual genetic variants for ancestry-robust eQTL analysis.

Released: 2025
Parameters: 1,200,000,000

Overview

Predicting gene expression from DNA sequence is one of the foundational challenges of regulatory genomics. A functional genomic regulatory code specifies how the sequence of a genome — its enhancers, promoters, splice sites, and transcription factor binding sites — controls the level at which each gene is expressed in each tissue type. Deciphering this code computationally would enable researchers to predict the transcriptional consequences of any genetic variant, whether it is a common single-nucleotide polymorphism (SNP), a rare disease-causing mutation, or a somatic alteration in a cancer cell.

Existing sequence-based deep learning models, such as Enformer and Borzoi, have made significant progress toward this goal by learning to predict regulatory signals (epigenetic marks, chromatin accessibility, gene expression) from reference genome sequences. However, these models have a fundamental limitation: they are trained on the reference genome, so they do not learn directly from the natural genetic variation that distinguishes one individual from another. Every human carries millions of genetic variants relative to the reference, and many of these variants influence gene expression in tissue-specific and ancestry-specific ways. Population-based genetic approaches, such as statistical fine-mapping of eQTLs (expression quantitative trait loci), capture individual variation but are limited to variants observed in studied cohorts and require large sample sizes to detect effects.

VariantFormer, developed by researchers at the Chan Zuckerberg Initiative and CZ Biohub and released as a preprint in November 2025, bridges this gap. It is a 1.2-billion-parameter hierarchical transformer that predicts gene-level RNA abundance directly from personalized diploid genomes — accepting not the reference genome but an individual's own diploid sequence, with both haplotypes represented, as input. Trained on 21,004 genome-transcriptome pairs from 2,330 donors across multiple tissues from the GTEx cohort, VariantFormer is the first model of its scale to treat personalized diploid genetic variation as a first-class input for expression prediction. The result is a model that achieves state-of-the-art performance on both sequence-based and population-based prediction benchmarks, generalizes across ancestries, and improves eQTL effect size estimation for low-frequency and ancestry-specific variants that are systematically underrepresented in European-centric statistical genetics datasets.

Key Features

  • Diploid genome representation: VariantFormer accepts both haplotypes of an individual's genome as input, explicitly modeling heterozygous and homozygous variant effects and allele-specific expression rather than approximating variant effects as small perturbations from a reference sequence. This is a fundamental architectural departure from reference-based sequence models.

  • Hierarchical transformer architecture: The model uses a hierarchical design that processes genomic sequence at multiple resolutions, capturing both local sequence features (transcription factor binding motifs, splice sites) and long-range regulatory interactions (enhancer-promoter loops spanning hundreds of kilobases) within a unified framework.

  • 1.2 billion parameters trained on matched genome-transcriptome pairs: VariantFormer was trained on 21,004 matched genome-transcriptome pairs from 2,330 donors across GTEx tissues. The large parameter count and matched training data allow the model to learn a high-fidelity mapping from personalized genetic sequence to tissue-specific expression.

  • State-of-the-art expression prediction across tissues: VariantFormer achieves competitive or superior performance on gene expression prediction across all benchmarked tissues, outperforming both sequence-based baselines (Enformer, Borzoi) and population-based eQTL methods on held-out donors and out-of-distribution contexts including somatic mutations in cancer cell lines.

  • Improved eQTL effect size estimation for rare and ancestry-specific variants: Standard statistical eQTL methods require large sample sizes to detect low-frequency variants. VariantFormer's sequence-based representation allows it to predict the regulatory impact of any variant directly from sequence, improving effect size estimation for variants with insufficient population-level statistical power — including low-frequency and ancestry-specific alleles.

  • Robust generalization across ancestries: Models trained predominantly on European cohorts frequently underperform on non-European populations. VariantFormer demonstrates maintained predictive accuracy across diverse ancestries in the GTEx cohort, reflecting the model's learning of regulatory sequence logic rather than population-specific linkage disequilibrium patterns.

  • In silico mutagenesis for variant interpretation: VariantFormer supports systematic in silico mutagenesis — computing expression predictions for all possible single-nucleotide substitutions at any position in the genome — enabling genome-wide regulatory variant effect mapping at scale without any experimental intervention.
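
The in silico mutagenesis procedure in the last feature can be illustrated with a short sketch. This is not VariantFormer's API: `toy_predict_expression` is a hypothetical stand-in scorer (it simply counts G and C bases) and `in_silico_mutagenesis` is an illustrative helper name. A real run would replace the scorer with a call to the trained model.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def toy_predict_expression(seq: str) -> float:
    """Hypothetical stand-in for a trained model: scores a sequence by GC count."""
    return float(seq.count("G") + seq.count("C"))


def in_silico_mutagenesis(
    seq: str, predict: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Score every single-nucleotide substitution against the unmutated sequence.

    Returns a mapping (position, alternate base) -> predict(mutant) - predict(seq).
    """
    baseline = predict(seq)
    effects: Dict[Tuple[int, str], float] = {}
    for pos, original_base in enumerate(seq):
        for alt in BASES:
            if alt == original_base:
                continue  # substituting a base with itself is not a mutation
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = predict(mutant) - baseline
    return effects


effects = in_silico_mutagenesis("ACGT", toy_predict_expression)
```

A sequence of length n yields 3n substitutions, so the cost is 3n + 1 model evaluations. For a large model, genome-wide scans are therefore dominated by inference time.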

Technical Details

VariantFormer is a 1.2-billion-parameter hierarchical transformer. The hierarchical design processes genomic sequence at multiple scales: local convolutional or attention layers first capture short-range sequence features such as transcription factor binding motifs and splice donor/acceptor sequences, while subsequent long-range attention layers integrate these local features across the full regulatory landscape of a gene's locus, spanning potentially hundreds of kilobases upstream and downstream of the transcription start site.

Crucially, the model takes diploid sequence as input. For each genomic locus, both haplotypes are provided to the model, with individual variants introduced into the appropriate haplotype-specific sequence context. The model produces a gene-level RNA abundance prediction — quantifying steady-state transcript levels — rather than intermediate regulatory signals like chromatin accessibility or histone marks, though the regulatory landscape (including predictions of epigenetic features) can be extracted as an intermediate representation.
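
As a rough illustration of what "introducing variants into the appropriate haplotype-specific sequence context" can mean, the sketch below builds two haplotype sequences from a reference window and a list of phased single-nucleotide variants. The `PhasedSNV` class and `build_haplotypes` function are hypothetical names for this example, not part of VariantFormer's code. The sketch assumes phased genotypes and handles substitutions only, since insertions and deletions would shift coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PhasedSNV:
    pos: int                   # 0-based offset within the reference window
    ref: str                   # base expected in the reference at pos
    alt: str                   # alternate base
    genotype: Tuple[int, int]  # phased alleles (hap1, hap2): 0 = ref, 1 = alt


def build_haplotypes(reference: str, variants: List[PhasedSNV]) -> Tuple[str, str]:
    """Apply each variant to the haplotype(s) that carry its alternate allele."""
    hap1, hap2 = list(reference), list(reference)
    for v in variants:
        if reference[v.pos] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        if v.genotype[0] == 1:
            hap1[v.pos] = v.alt
        if v.genotype[1] == 1:
            hap2[v.pos] = v.alt
    return "".join(hap1), "".join(hap2)


reference = "ACGTACGT"
variants = [
    PhasedSNV(pos=1, ref="C", alt="T", genotype=(1, 0)),  # heterozygous, on haplotype 1
    PhasedSNV(pos=6, ref="G", alt="A", genotype=(1, 1)),  # homozygous alternate
]
hap1, hap2 = build_haplotypes(reference, variants)
```

Here the heterozygous variant appears on only one haplotype while the homozygous variant appears on both, which is the distinction a reference-only input cannot express.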

The model was trained on the GTEx v8 dataset, which provides genome-wide RNA-seq across up to 54 tissues and matched whole-genome sequencing for each donor. The training set comprises 21,004 genome-transcriptome pairs from 2,330 donors, an average of about nine pairs per donor (21,004 / 2,330 ≈ 9.0), with each pair matching a donor's genome to the expression measured in one of that donor's tissue samples. This paired design allows the model to learn both cross-tissue gene expression patterns and the personalized variant effects that distinguish one individual's expression profile from another.
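
Because each donor contributes several pairs, evaluating on held-out donors requires that the split happen at the donor level, not the pair level. The source does not describe the exact split procedure, so the following is only a sketch of one common approach. The `split_by_donor` function, the donor identifiers, and the tissue names are all made up for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (donor_id, tissue) identifies one genome-transcriptome pair


def split_by_donor(
    pairs: List[Pair], test_fraction: float, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Assign whole donors to train or test so no donor appears in both."""
    by_donor: Dict[str, List[Pair]] = defaultdict(list)
    for pair in pairs:
        by_donor[pair[0]].append(pair)
    donors = sorted(by_donor)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(donors)
    n_test = max(1, round(len(donors) * test_fraction))
    test_donors: Set[str] = set(donors[:n_test])
    train = [p for p in pairs if p[0] not in test_donors]
    test = [p for p in pairs if p[0] in test_donors]
    return train, test


# Toy data: 10 donors with 3 tissues each (30 pairs in total).
pairs = [(f"donor{d}", t) for d in range(10) for t in ("liver", "lung", "whole_blood")]
train, test = split_by_donor(pairs, test_fraction=0.2)
```

Splitting by pair instead would let the same genome appear on both sides of the split, which can inflate apparent accuracy on "unseen" data.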

Benchmark results demonstrate that VariantFormer achieves state-of-the-art Spearman correlation between predicted and observed gene expression across held-out donors, substantially outperforming Enformer and Borzoi on personalized expression prediction tasks where individual variant effects matter. On eQTL effect size estimation, VariantFormer shows particularly strong gains for low-frequency variants (minor allele frequency < 1%) and ancestry-specific variants underrepresented in European GWAS cohorts.

Applied to Alzheimer's disease, gene embeddings from VariantFormer successfully prioritize known causal genes and relevant tissue contexts. In silico mutagenesis of the APOE locus, testing the ε2, ε3, and ε4 alleles, faithfully recovers their known risk-modifying effects on gene expression, providing an important biological validation of the model's regulatory sequence understanding.
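
For readers unfamiliar with the benchmark metric, Spearman correlation is the Pearson correlation of the ranks of two variables. The sketch below is a dependency-free illustration, not the evaluation code used for VariantFormer. The function names and the expression values are invented for the example.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


observed = [1.0, 2.5, 3.0, 8.0]   # invented expression values for four donors
predicted = [0.9, 2.0, 4.5, 4.0]  # invented model outputs for the same donors
rho = spearman(observed, predicted)
```

Because only ranks matter, the metric rewards getting the ordering of donors right and is insensitive to the scale of the predicted values.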

Applications

VariantFormer is designed for researchers working at the intersection of human genetics, regulatory genomics, and precision medicine who need to predict the transcriptional consequences of specific genetic variants in tissue-specific contexts.

The most immediate application is variant effect prediction: given a patient's whole-genome sequence, VariantFormer can predict the expression impact of every variant in every measured tissue, producing a full personalized regulatory map. This is directly useful for identifying the causal regulatory variants underlying disease associations from GWAS, where dozens of candidate SNPs in linkage disequilibrium with a trait-associated locus must be prioritized for functional validation.

For rare disease genetics, where statistical approaches lack power due to small cohort sizes, VariantFormer's sequence-based predictions provide a way to evaluate the likely functional impact of rare or private variants without requiring matched population data.

In oncology, VariantFormer's demonstrated generalization to cancer cell lines suggests it can model the expression consequences of somatic mutations that alter the regulatory landscape of tumor cells, enabling predictions about how specific oncogenic alterations influence the transcriptome.

The model also supports population genetics applications: by comparing expression predictions across haplotypes with different ancestry-specific alleles, researchers can identify variants with differential regulatory effects across populations, a capability that is important for understanding why disease prevalence and drug response vary across ancestries.
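
The candidate-SNP prioritization use case can be sketched as follows. For each candidate, the example predicts expression at alternate-allele dosages 0, 1, and 2, fits a least-squares slope as a stand-in effect size, and ranks candidates by absolute slope. The source does not specify how VariantFormer derives effect sizes, so this is an assumption-laden illustration. `toy_diploid_predict` is a hypothetical additive scorer, and all function names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def toy_diploid_predict(hap1: str, hap2: str) -> float:
    """Hypothetical stand-in for a diploid model: GC count summed over both haplotypes."""
    return float(sum(h.count("G") + h.count("C") for h in (hap1, hap2)))


def least_squares_slope(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def predicted_effect_size(
    reference: str, pos: int, alt: str, predict: Callable[[str, str], float]
) -> float:
    """Slope of predicted expression on alternate-allele dosage (0, 1, 2)."""
    alt_seq = reference[:pos] + alt + reference[pos + 1:]
    genotypes = [(reference, reference), (alt_seq, reference), (alt_seq, alt_seq)]
    predictions = [predict(h1, h2) for h1, h2 in genotypes]
    return least_squares_slope([0.0, 1.0, 2.0], predictions)


def rank_candidates(
    reference: str,
    candidates: List[Tuple[int, str]],
    predict: Callable[[str, str], float],
) -> List[Tuple[int, str, float]]:
    """Order candidate (position, alt) pairs by absolute predicted effect, largest first."""
    scored = [(pos, alt, predicted_effect_size(reference, pos, alt, predict))
              for pos, alt in candidates]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


ranked = rank_candidates("ACGTACGT", [(0, "T"), (1, "A"), (3, "A")], toy_diploid_predict)
```

Unlike association statistics, which are shared among SNPs in linkage disequilibrium, each candidate here is scored in isolation, which is what makes sequence-based prediction useful for separating them.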

Impact

VariantFormer is a significant scaling step in sequence-based gene expression modeling: it is the first model to combine 1.2-billion-parameter scale with diploid, personalized sequence inputs and matched genome-transcriptome training data. The explicit diploid representation is a principled approach to a problem that simpler models paper over with first-order variant effect approximations. It allows the model to capture haplotype-specific regulatory effects, compound heterozygous interactions, and allele-specific expression that are missed by reference-based approaches.

The demonstrated generalization to out-of-distribution somatic mutation contexts and across ancestries suggests the model has learned something genuine about regulatory sequence logic rather than merely memorizing donor-specific expression patterns from the training set. The improved eQTL effect size estimation for rare and ancestry-specific variants is a practically important result: large-scale statistical genetics has systematically undercharacterized the regulatory genetics of non-European populations, and sequence-based models that can predict variant effects from first principles have the potential to narrow this gap without requiring additional cohort recruitment.

A key limitation is that VariantFormer was trained primarily on GTEx tissues, and its performance may degrade for cell types or tissues not represented in that dataset. The model also predicts steady-state RNA abundance and does not currently model post-transcriptional regulatory processes such as mRNA stability, RNA editing, or translation efficiency, which means predicted expression levels may not fully correspond to protein abundance or functional output.

As the biobank era brings matched whole-genome sequencing and multi-tissue transcriptomics data to hundreds of thousands of individuals, models like VariantFormer are well positioned to scale further and become core tools for interpreting the regulatory genome.

Tags

variant effect prediction, gene expression, transformer, foundation model, transfer learning, genomics, DNA

Resources

GitHub Repository, Research Paper, Official Website