Deep learning model predicting cell-type-specific RNA-seq coverage at 32 bp resolution from 524 kb of DNA sequence, jointly modeling transcription, splicing, and polyadenylation.
Borzoi is a deep convolutional-transformer neural network developed at Calico Life Sciences that predicts RNA-seq read coverage directly from DNA sequence. Published in Nature Genetics in 2025, it addresses a longstanding fragmentation in regulatory genomics: transcription levels, alternative splicing, and polyadenylation have traditionally required separate, task-specific models. Borzoi unifies all three processes in a single forward pass, earning its description as a "unifying model of gene regulation."
The model accepts 524 kilobases (524,288 bp) of genomic sequence as input, roughly 2.7 times the 196,608 bp input of its predecessor Enformer, and produces coverage predictions at 32 bp resolution. This combination of broad context and fine resolution allows Borzoi to capture long-range enhancer–promoter interactions while still resolving individual exons and splice sites within the same model. Predictions are made simultaneously for hundreds of human and mouse cell types and tissues, making the model well suited for investigating how genetic variation shapes cell-type-specific gene regulation.
Borzoi uses a hybrid convolutional and transformer architecture. A convolutional stem processes the 524,288 bp one-hot encoded DNA input, progressively downsampling through max-pooling to produce 4,096 embeddings at 128 bp resolution. Eight transformer blocks with multi-head self-attention and relative positional encodings then operate over these embeddings, allowing the model to learn long-range dependencies based on the distance between positions rather than their absolute coordinates. A U-Net decoder with skip connections from the convolutional stem upsamples the transformer output back to 32 bp resolution. Separate linear output heads produce coverage tracks for each RNA-seq or genomic assay target.
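The resolutions quoted above can be checked with simple arithmetic. The following is a minimal sketch in plain Python, not Borzoi's own code. It assumes that every pooling stage halves the sequence length and every U-Net stage doubles it, and it ignores any cropping of the output edges. The helper names (one_hot, doubling_stages) are hypothetical.

```python
# Shape bookkeeping for the architecture described above.
# Assumption: each pooling stage halves the length and each U-Net
# upsampling stage doubles it (a simplification, not Borzoi's code).

INPUT_BP = 524_288          # one-hot encoded input length
TRANSFORMER_BIN_BP = 128    # resolution seen by the transformer blocks
OUTPUT_BIN_BP = 32          # resolution of the predicted coverage tracks


def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T]; unknown bases become all zeros."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def doubling_stages(start_bin_bp, target_bin_bp):
    """Count the 2x stages separating two bin widths."""
    stages = 0
    width = start_bin_bp
    while width < target_bin_bp:
        width *= 2
        stages += 1
    return stages


pool_stages = doubling_stages(1, TRANSFORMER_BIN_BP)                  # 1 bp -> 128 bp
transformer_len = INPUT_BP // TRANSFORMER_BIN_BP                      # embeddings fed to attention
upsample_stages = doubling_stages(OUTPUT_BIN_BP, TRANSFORMER_BIN_BP)  # 128 bp -> 32 bp
output_bins = INPUT_BP // OUTPUT_BIN_BP                               # 32 bp bins over the full input

print(one_hot("ACGN"))
print(pool_stages, transformer_len, upsample_stages, output_bins)
```

Under these assumptions, seven halvings take the input from single-base to 128 bp resolution, and two doublings in the decoder recover 32 bp bins.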
Training data comprised 866 human and 279 mouse RNA-seq datasets from ENCODE, strand-specific RNA-seq from 31 GTEx tissues, DNase-seq and ATAC-seq from ENCODE and CATlas, ChIP-seq histone and transcription factor data, and CAGE data from FANTOM5. Four model replicates were trained using 2-fold and 4-fold cross-validation schemes. Borzoi substantially outperforms Enformer on held-out RNA-seq track prediction across human and mouse tissues, and achieves competitive or superior performance to specialized models on GTEx eQTL, sQTL, and paQTL scoring benchmarks.
Borzoi is suited to research that connects DNA sequence to RNA output. Variant interpretation is a primary use case: researchers can score the effect of non-coding SNPs and indels on gene expression, splice site usage, and polyadenylation with a single model, which makes it directly applicable to GWAS variant prioritization and clinical variant classification. Regulatory element characterization via in-silico mutagenesis or gradient-based attribution identifies which bases within the 524 kb window drive predicted expression changes. The model also supports splicing analysis, gene therapy construct design (predicting how synthetic sequences affect transcription and RNA processing), and comparative genomics studies using the jointly trained human and mouse models.
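Variant scoring and in-silico mutagenesis both reduce to comparing predictions for a reference sequence and an edited copy of it. The sketch below illustrates that pattern for SNPs only. It does not use the Borzoi API: toy_predict is a hypothetical stand-in that counts G/C bases per 32 bp bin so the example runs without model weights, and the log2 fold change of summed coverage is one plausible statistic, not necessarily the one used by the authors.

```python
import math

BIN_BP = 32  # output resolution from the text


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: one coverage value per 32 bp bin.

    It counts G/C bases per bin so the example runs without weights.
    """
    return [
        float(sum(base in "GC" for base in seq[i:i + BIN_BP]))
        for i in range(0, len(seq) - BIN_BP + 1, BIN_BP)
    ]


def apply_snp(seq, pos, alt):
    """Return a copy of seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def variant_score(predict, ref_seq, pos, alt, bins, pseudocount=1.0):
    """Log2 fold change of summed coverage over `bins` (alt versus ref)."""
    ref_cov = sum(predict(ref_seq)[b] for b in bins)
    alt_cov = sum(predict(apply_snp(ref_seq, pos, alt))[b] for b in bins)
    return math.log2((alt_cov + pseudocount) / (ref_cov + pseudocount))


def in_silico_mutagenesis(predict, ref_seq, positions, bins):
    """Score every possible substitution at each position in `positions`."""
    scores = {}
    for pos in positions:
        for alt in "ACGT":
            if alt != ref_seq[pos]:
                scores[(pos, alt)] = variant_score(predict, ref_seq, pos, alt, bins)
    return scores


ref = "AT" * 64  # 128 bp toy sequence, i.e. four 32 bp bins
score = variant_score(toy_predict, ref, pos=40, alt="G", bins=[1])
ism = in_silico_mutagenesis(toy_predict, ref, positions=[40, 41], bins=[1])
print(score, len(ism))
```

In a real workflow, the stand-in would be replaced by the trained model, and `bins` would be chosen to overlap the exons, splice junctions, or polyadenylation sites of the gene of interest.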
Borzoi represents a significant advance in sequence-to-function modeling by demonstrating that transcription, splicing, and polyadenylation can be jointly learned from raw sequence at scale. Its extended context window and U-Net decoder design have influenced subsequent regulatory genomics models. The release of four model replicates with open weights on GitHub (for non-commercial research use), along with Jupyter notebook workflows for eQTL scoring and mutagenesis, has enabled broad adoption. Key limitations include the model's sequence-only input (chromatin state and epigenetic marks must be learned implicitly), its prediction of read coverage rather than discrete isoform abundances, and computational cost at inference scale due to transformer attention over 4,096 positions on 524 kb inputs.
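The attention cost noted above grows quadratically with the number of positions. As a rough illustration, the sketch below counts only the entries of the pairwise score matrix per head per layer, ignoring heads, layers, and implementation details. The function name is hypothetical.

```python
def attention_pairs(seq_bp, bin_bp):
    """Pairwise attention scores per head per layer at a given bin width."""
    positions = seq_bp // bin_bp
    return positions * positions


at_128 = attention_pairs(524_288, 128)  # attention at 128 bp, as described in the text
at_32 = attention_pairs(524_288, 32)    # hypothetical: attention run directly at 32 bp

print(at_128, at_32, at_32 // at_128)
```

By this count, attention over 4,096 positions involves about 16.8 million scores per head per layer, and running it directly at the 32 bp output resolution would cost 16 times more.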
Linder, J., et al. (2025) Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nature Genetics. Preprint: bioRxiv (2023).
DOI: 10.1038/s41588-024-02053-6