bio.rodeo

DNA & Gene

Borzoi

Calico Life Sciences

Deep learning model predicting cell-type-specific RNA-seq coverage at 32 bp resolution from 524 kb of DNA sequence, jointly modeling transcription, splicing, and polyadenylation.

Released: 2025

Overview

Borzoi is a deep convolutional-transformer neural network developed at Calico Life Sciences that predicts RNA-seq read coverage directly from DNA sequence. Published in Nature Genetics in 2025, it addresses a longstanding fragmentation in regulatory genomics: transcription levels, alternative splicing, and polyadenylation have traditionally required separate, task-specific models. Borzoi unifies all three processes in a single forward pass, earning its description as a "unifying model of gene regulation."

The model accepts 524 kilobases of genomic sequence as input (524,288 bp, roughly 2.7 times the 196,608 bp input of its predecessor Enformer) and produces coverage predictions at 32 bp resolution. This combination of broad context and fine resolution allows Borzoi to capture long-range enhancer–promoter interactions while still resolving individual exons and splice sites within the same model. Predictions can be made for hundreds of human and mouse cell types and tissues simultaneously, making the model well suited for investigating how genetic variation shapes cell-type-specific gene regulation.

Key Features

  • Extended sequence context: Accepts 524,288 bp of input DNA, enabling the model to capture distal regulatory elements and long-range chromatin interactions across entire gene loci.
  • High-resolution output: Predicts RNA-seq coverage at 32 bp resolution, four times finer than Enformer, sufficient to distinguish individual exons, splice sites, and polyadenylation signals.
  • Unified regulatory modeling: Jointly predicts transcription, alternative splicing, and cleavage/polyadenylation from a single model, enabling variant scoring across all three regulatory processes simultaneously.
  • Multi-species training: Trained on paired human (hg38) and mouse (mm10) genomic data, improving generalization across evolutionarily conserved regulatory elements.
  • Ensemble of replicates: Released as four independently trained model replicates, supporting uncertainty quantification through prediction averaging.
  • Variant effect scoring: Scores non-coding genetic variants (eQTLs, sQTLs, paQTLs) by comparing predicted coverage between reference and alternate alleles using in-silico mutagenesis.
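The last two features, replicate averaging and reference-versus-alternate scoring, can be illustrated with a minimal sketch. This is not the Borzoi API: `make_toy_replicate`, `score_variant`, and the weight vectors are hypothetical stand-ins, and each toy "replicate" returns a single number where a real model would return per-bin, per-track coverage.

```python
from statistics import mean

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def make_toy_replicate(weights):
    """Hypothetical stand-in for one trained replicate.

    Returns a function mapping a one-hot sequence to one coverage value.
    A real model would return coverage per bin and per track instead.
    """
    def predict(encoded):
        return sum(w * x for row in encoded for w, x in zip(weights, row))
    return predict

def score_variant(replicates, ref_seq, pos, alt_base):
    """Average (alt - ref) predicted coverage across all replicates."""
    if ref_seq[pos].upper() == alt_base.upper():
        raise ValueError("alt allele equals reference allele")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_x, alt_x = one_hot(ref_seq), one_hot(alt_seq)
    return mean(model(alt_x) - model(ref_x) for model in replicates)

# Four toy replicates with made-up per-base weights (assumption, not real data).
replicates = [make_toy_replicate(w) for w in (
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.2, 0.2, 0.4],
    [0.0, 0.3, 0.3, 0.4],
    [0.1, 0.1, 0.4, 0.4],
)]

# Score a G>T substitution at position 2 of a short example sequence.
delta = score_variant(replicates, "ACGTACGT", pos=2, alt_base="T")
```

Averaging the per-replicate differences gives one score per variant; the spread of those differences across replicates could serve as a rough indication of how much the replicates disagree.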

Technical Details

Borzoi uses a hybrid convolutional and transformer architecture. A convolutional stem processes the 524,288 bp one-hot encoded DNA input, progressively downsampling through max-pooling to produce 4,096 embeddings at 128 bp resolution. Eight transformer blocks with multi-head self-attention and relative positional encodings then operate over these embeddings, allowing the model to learn long-range dependencies without fixed positional assumptions. A U-Net decoder with skip connections from the convolutional stem upsamples predictions back to 32 bp resolution. Separate linear output heads produce coverage tracks for each RNA-seq or genomic assay target.
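The resolutions quoted above imply some simple shape arithmetic, worked through in the sketch below. It assumes every pooling or upsampling stage changes resolution by a factor of two and ignores any edge cropping the released model may apply to its output; the helper name `halvings` is illustrative only.

```python
INPUT_BP = 524_288      # input length stated in the text
ATTN_RES_BP = 128       # resolution at which the transformer blocks operate
OUTPUT_RES_BP = 32      # final output resolution

def halvings(from_res, to_res):
    """Number of 2x steps between two resolutions (assumes powers of two)."""
    steps = 0
    while from_res < to_res:
        from_res *= 2
        steps += 1
    if from_res != to_res:
        raise ValueError("resolutions are not related by a power of two")
    return steps

pool_steps = halvings(1, ATTN_RES_BP)                  # 1 bp -> 128 bp
upsample_steps = halvings(OUTPUT_RES_BP, ATTN_RES_BP)  # 128 bp -> 32 bp
attn_positions = INPUT_BP // ATTN_RES_BP               # sequence length seen by attention
output_bins = INPUT_BP // OUTPUT_RES_BP                # bins at output resolution, uncropped

print(pool_steps, upsample_steps, attn_positions, output_bins)
# prints: 7 2 4096 16384
```

Under these assumptions, seven halvings take the input from single-base to 128 bp resolution, and two doublings in the decoder bring it back to 32 bp.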

Training data comprised 866 human and 279 mouse RNA-seq datasets from ENCODE, strand-specific RNA-seq from 31 GTEx tissues, DNase-seq and ATAC-seq from ENCODE and CATlas, ChIP-seq histone and transcription factor data, and CAGE data from FANTOM5. Four model replicates were trained using 2-fold and 4-fold cross-validation schemes. Borzoi substantially outperforms Enformer on held-out RNA-seq track prediction across human and mouse tissues, and achieves competitive or superior performance to specialized models on GTEx eQTL, sQTL, and paQTL scoring benchmarks.

Applications

Borzoi is suited for any research connecting DNA sequence to RNA output. Variant interpretation is a primary use case: researchers can score the effect of non-coding SNPs and indels on gene expression, splice site usage, and polyadenylation — all from a single model pass — making it directly applicable to GWAS variant prioritization and clinical variant classification. Regulatory element characterization via in-silico mutagenesis or gradient-based attribution identifies which bases within the 524 kb window drive predicted expression changes. The model also supports splicing analysis, gene therapy construct design (predicting how synthetic sequences affect transcription and RNA processing), and comparative genomics studies using the jointly trained human and mouse models.
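In-silico mutagenesis, mentioned above, can be sketched in a few lines. The scoring function here is a hypothetical toy (it counts a "TATA" motif) standing in for a model prediction; a real workflow would instead call the model and aggregate predicted coverage over a gene of interest.

```python
BASES = "ACGT"

def toy_expression(seq):
    """Hypothetical scoring function standing in for a model prediction.

    Counts occurrences of the motif "TATA" in the sequence.
    """
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))

def saturation_mutagenesis(seq, score_fn, start, end):
    """Return {(position, alt_base): score(alt) - score(ref)} for start <= position < end."""
    ref_score = score_fn(seq)
    effects = {}
    for pos in range(start, end):
        for alt in BASES:
            if alt == seq[pos]:
                continue  # skip the reference base
            mutated = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutated) - ref_score
    return effects

def position_importance(effects):
    """Summarize each position by its largest absolute effect."""
    importance = {}
    for (pos, _alt), delta in effects.items():
        importance[pos] = max(importance.get(pos, 0.0), abs(delta))
    return importance

seq = "GGCTATAAGGC"
effects = saturation_mutagenesis(seq, toy_expression, 0, len(seq))
importance = position_importance(effects)
```

In this toy case only positions 3 to 6, the bases of the motif, receive non-zero importance. Note that exhaustive mutagenesis needs three model evaluations per position, which is why gradient-based attribution is often preferred over long windows.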

Impact

Borzoi represents a significant advance in sequence-to-function modeling by demonstrating that transcription, splicing, and polyadenylation can be jointly learned from raw sequence at scale. Its extended context window and U-Net decoder design have influenced subsequent regulatory genomics models. The release of four model replicates with open weights on GitHub (for non-commercial research use), along with Jupyter notebook workflows for eQTL scoring and mutagenesis, has enabled broad adoption. Key limitations include the model's sequence-only input (chromatin state and epigenetic marks must be learned implicitly), its prediction of read coverage rather than discrete isoform abundances, and computational cost at inference scale due to transformer attention over 4,096 positions on 524 kb inputs.
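The attention cost named in the last limitation can be made concrete with a rough count of query-key pairs. The sketch assumes full (dense) self-attention and counts pairs per attention head; the number of heads is not given above, so it is left out.

```python
def attention_pairs(seq_bp, resolution_bp):
    """Number of query-key pairs in one dense self-attention map."""
    positions = seq_bp // resolution_bp
    return positions * positions

SEQ_BP = 524_288
N_BLOCKS = 8   # transformer blocks, as stated in Technical Details

at_128 = attention_pairs(SEQ_BP, 128)   # resolution where attention is applied
at_32 = attention_pairs(SEQ_BP, 32)     # hypothetical: attention at output resolution

print(at_128)               # 16777216 pairs per head per block
print(at_128 * N_BLOCKS)    # 134217728 pairs per head across all blocks
print(at_32 // at_128)      # 16 times more pairs if attention ran at 32 bp
```

The 16-fold difference suggests why attention runs at 128 bp and the finer 32 bp output is recovered by the convolutional decoder instead.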

Citation

Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation

Linder, J., et al. (2025). Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nature Genetics.

DOI: 10.1038/s41588-024-02053-6

Metrics

GitHub

Stars: 236
Forks: 31
Open Issues: 13
Contributors: 5
Last Push: 7mo ago
Language: Python
License: Apache-2.0

Citations

Total Citations: 207
Influential: 25
References: 120

Tags

gene expression, regulatory genomics, sequence to function, variant effect prediction, CNN, transformer, RNA-seq, splicing

Resources

  • GitHub Repository
  • Research Paper