Basenji

Dilated convolutional network that predicts cell-type-specific epigenetic and transcriptional profiles from DNA sequence across mammalian genomes.

Released: May 2018

Basenji is a deep convolutional neural network developed by David Kelley and colleagues at Calico Life Sciences. It predicts cell-type-specific epigenetic and transcriptional profiles directly from DNA sequence across large mammalian genomes. Published in Genome Research in 2018, the model extended the earlier Basset architecture to capture distal regulatory interactions, a fundamental challenge in regulatory genomics that prior approaches had largely been unable to address.

The central problem Basenji was designed to address is the sequence-to-function mapping challenge: given a stretch of DNA, how does that sequence determine the quantitative levels of gene expression and chromatin accessibility across dozens or hundreds of different cell types and tissues? This is complicated by the fact that regulatory elements such as enhancers can act over tens of thousands of base pairs of genomic distance, requiring models capable of integrating information from broad sequence contexts. Basenji accomplishes this through dilated convolutional layers that progressively expand the model's receptive field while keeping computational costs manageable, enabling the model to simultaneously identify promoters and distal regulatory elements and synthesize their collective contributions to predict quantitative genomic profiles.
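The effect of dilation on the receptive field can be illustrated with a short calculation. The sketch below is not Basenji's published configuration: the kernel sizes, pooling widths, and number of dilated layers are hypothetical values chosen only to show how exponentially increasing dilation rates widen the context seen by each output position.

```python
def receptive_field(layers):
    """Receptive field, in input base pairs, of a stack of 1D layers.

    Each layer is a (kernel_size, dilation, stride) tuple. Max-pooling is
    modelled as a layer whose kernel_size equals its stride.
    """
    rf = 1    # a single input position sees 1 bp
    jump = 1  # spacing in bp between adjacent positions at the current depth
    for kernel_size, dilation, stride in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical local stack: width-5 convolutions interleaved with pooling
# of 2, 4, 4, 4, which downsamples the sequence to one position per 128 bp.
local_stack = [
    (5, 1, 1), (2, 1, 2),
    (5, 1, 1), (4, 1, 4),
    (5, 1, 1), (4, 1, 4),
    (5, 1, 1), (4, 1, 4),
]

# Seven width-3 layers, with and without exponentially increasing dilation.
dilated = [(3, 2 ** i, 1) for i in range(7)]  # dilation 1, 2, 4, ..., 64
undilated = [(3, 1, 1) for _ in range(7)]

print(receptive_field(local_stack))              # 300
print(receptive_field(local_stack + undilated))  # 2092
print(receptive_field(local_stack + dilated))    # 32812
```

With the same number of layers and parameters, the dilated stack in this toy configuration reaches roughly 32 kb, while the undilated one stays near 2 kb.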

Unlike its predecessor Basset, which predicted binary chromatin accessibility across cell types, Basenji predicts continuous, quantitative signals, including RNA-seq and ChIP-seq tracks, at high genomic resolution. This shift from classification to regression allowed the model to capture the dynamic range of gene expression and epigenomic signals that vary across cell types, making it far more informative for studying cis-regulatory variation. Basenji also moved from a single prediction per input sequence to sequential predictions along entire chromosomes, reflecting the continuous nature of genomic regulatory activity.

Key Features

Dilated convolutional architecture: Dilated (atrous) convolutions progressively expand the receptive field across the chromosome, allowing the model to integrate information from regulatory elements tens of kilobases away from any given position without proportionally increasing compute requirements.
Multi-task quantitative prediction: Trained simultaneously on more than 4,000 genomic datasets spanning RNA-seq, CAGE, DNase-seq, ATAC-seq, and ChIP-seq experiments across many cell types, sharing information across experiments to improve generalization.
Sequential chromosomal predictions: Unlike window-based models, Basenji makes predictions along entire chromosomes in a sliding fashion, preserving local context while enabling genome-wide coverage.
Variant effect scoring: Scores the functional impact of single nucleotide variants by comparing reference and alternate allele predictions across all tracks, producing a quantitative score profile useful for prioritizing regulatory variants from GWAS and eQTL studies.
Cell-type-specific regulatory modeling: Jointly predicts profiles for many cell types from a single model pass, enabling comparative studies of how sequence variation has different regulatory consequences across tissue contexts.
Open weights and code: Model weights and training pipelines are released openly via the Calico GitHub, facilitating fine-tuning and extension for new datasets and organisms.
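The variant effect scoring feature listed above can be sketched as a comparison of predictions for two alleles. This is a minimal illustration, not the Basenji codebase's API: `toy_predict` is a hypothetical stand-in for a trained model (it simply counts bases per bin), and `variant_scores` is an illustrative name.

```python
def toy_predict(seq, bin_size=4):
    """Hypothetical stand-in for a trained model.

    Returns one row per bin with two made-up tracks:
    track 0 counts G/C bases, track 1 counts A bases.
    """
    rows = []
    for start in range(0, len(seq), bin_size):
        chunk = seq[start:start + bin_size]
        rows.append([sum(base in "GC" for base in chunk), chunk.count("A")])
    return rows


def variant_scores(predict, ref_seq, pos, alt_base):
    """Per-track score: alternate minus reference prediction, summed over bins."""
    if ref_seq[pos] == alt_base:
        raise ValueError("alternate allele equals the reference base")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred = predict(ref_seq)
    alt_pred = predict(alt_seq)
    n_tracks = len(ref_pred[0])
    return [
        sum(alt_row[t] - ref_row[t] for ref_row, alt_row in zip(ref_pred, alt_pred))
        for t in range(n_tracks)
    ]


# A>G substitution at position 0 of a toy sequence.
scores = variant_scores(toy_predict, "ACGTACGT", 0, "G")
print(scores)  # [1, -1]
```

A real model would return one score per trained track, giving the quantitative score profile described above.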

Technical Details

Basenji uses a hierarchical convolutional architecture designed for long-range sequence modeling. Input DNA is one-hot encoded (4 channels) and passed through an initial stack of standard convolutional blocks with max-pooling, which downsample the sequence and extract local motifs. These representations are then processed by dilated convolutional layers with exponentially increasing dilation rates, which expand the effective receptive field from a few hundred base pairs to tens of kilobases while keeping computation efficient.

The model was trained on more than 4,000 genomic datasets drawn from ENCODE and Roadmap Epigenomics, including CAGE-seq, RNA-seq, DNase-seq, ATAC-seq, and ChIP-seq experiments across a diverse range of human cell lines and tissues. Targets were binned at 128 bp resolution along chromosomes, and the model was optimized with a Poisson regression loss suited to count-like sequencing data. The convolutional layers are shared across all cell types and assays, so cell-type specificity emerges from the learned relationship between local sequence features and distal regulatory context in the training data.

In benchmark analyses, Basenji substantially improved on Basset for predicting gene expression from sequence, achieving higher Pearson correlations on held-out chromosomes for both CAGE-seq (correlations typically exceeding 0.6 for protein-coding genes) and DNase-seq profiles. Variant effect predictions from Basenji correlated significantly with eQTL effect sizes from the GTEx consortium across multiple tissues, validating the model's ability to learn biologically meaningful regulatory logic from sequence alone.
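The input encoding, target binning, and loss described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the project's preprocessing code: the function names are hypothetical, unknown bases are encoded as all zeros, and coverage is summed (not averaged) within each bin.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode DNA as rows of 4 channels (A, C, G, T).

    Assumption: bases outside ACGT, such as N, become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def bin_coverage(coverage, bin_size=128):
    """Sum per-base coverage into consecutive bins.

    Assumption: a trailing partial bin is dropped.
    """
    n_bins = len(coverage) // bin_size
    return [sum(coverage[i * bin_size:(i + 1) * bin_size]) for i in range(n_bins)]


def poisson_nll(targets, rates, eps=1e-8):
    """Mean Poisson negative log-likelihood.

    The log(y!) term is omitted because it does not depend on the model.
    """
    total = sum(r - y * math.log(r + eps) for y, r in zip(targets, rates))
    return total / len(targets)


print(one_hot("AN"))                    # [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(bin_coverage([1] * 256 + [5]))    # [128, 128]
# The loss is lowest when the predicted rate matches the observed count.
print(poisson_nll([4.0], [4.0]) < poisson_nll([4.0], [2.0]))  # True
```

The Poisson form penalizes predicted rates that are too low or too high relative to the observed count, which suits the non-negative, count-like targets.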

Applications

Basenji is used by computational biologists and human geneticists working on regulatory genomics, noncoding variant interpretation, and gene regulation. A primary application is in silico scoring of noncoding variants from GWAS or clinical sequencing studies: by comparing predicted epigenomic profiles between reference and alternate alleles, Basenji generates quantitative evidence for or against a causal regulatory role for each variant across many cell types simultaneously. The model has also been applied to study the mechanistic basis of eQTLs by identifying which cell types and regulatory tracks are most perturbed by a given variant. Researchers have used Basenji predictions for comparative genomics, studying how regulatory sequences have changed across mammalian evolution, and for prioritizing regulatory elements for CRISPR perturbation experiments. The architecture and training pipeline served as the direct foundation for both Enformer and Borzoi, the leading models in this lineage.
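Identifying the tracks most perturbed by a variant amounts to ranking per-track scores by magnitude. In the minimal sketch below, the track labels and score values are invented for illustration and `rank_tracks` is a hypothetical helper, not part of the Basenji codebase.

```python
def rank_tracks(scores, top_k=3):
    """Return the top_k (track, score) pairs with the largest absolute score.

    `scores` maps a track label to a signed variant effect score.
    """
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_k]


# Made-up scores for one variant; labels and values are illustrative only.
example_scores = {
    "CAGE:liver": -0.8,
    "DNase:K562": 0.1,
    "H3K27ac:heart": 0.5,
}

print(rank_tracks(example_scores, top_k=2))
# [('CAGE:liver', -0.8), ('H3K27ac:heart', 0.5)]
```

Ranking by absolute value keeps both gains and losses of predicted activity, and the retained sign shows the direction of the effect.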

Impact

Basenji established a new paradigm for quantitative sequence-to-function modeling at genome scale, shifting the field from binary accessibility prediction toward continuous, multi-track regulatory prediction. Its introduction of dilated convolutions for long-range regulatory integration was widely adopted in subsequent genomic deep learning architectures. The Genome Research paper has been cited extensively and is recognized as a landmark contribution to computational regulatory genomics. The model directly enabled the development of Enformer, which replaced convolutional long-range integration with transformer self-attention, and Borzoi, which further expanded context and resolution. A notable limitation is that the model processes sequence only, without incorporating any experimental epigenomic measurements that could condition predictions on cell-type-specific chromatin state, and the convolutional architecture has a fixed maximum receptive field that cannot grow beyond the dilated stack's coverage. Despite these constraints, Basenji remains a widely used reference model and its open codebase continues to serve as the foundation for sequence-to-function research at Calico and beyond.

Citation

Sequential regulatory activity prediction across chromosomes with convolutional neural networks

Preprint

Kelley, D. R., et al. (2017) Sequential regulatory activity prediction across chromosomes with convolutional neural networks. bioRxiv.

DOI: 10.1101/gr.227819.117

Recent citations

Papers that recently cited this model.

mwHIT: accelerated and accurate histone modification imputation using multi-scale window attention
Zhaoxi Zhang, Lijuan Jia, Xiaoya Fan, et al.
Frontiers in Genetics · Jul 2026 · 0 citations

GAMETE maps the genetic architecture of chromatin accessibility in rice pollen at single-nucleus resolution
Yinmeng Liu, Chunjiao Xia, Junjie Li, et al.
bioRxiv · Jul 2026 · 0 citations

K-attention: a biologically informed attention operator for data-efficient sequence-based omics modeling
Tao Liu, Jing-Yi Li, Ziyu Chen, et al.
Briefings in Bioinformatics · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications
W. Samek, G. Montavon, S. Lapuschkin, et al.
Proceedings of the IEEE · Mar 2021 · 1.2K citations

Effective gene expression prediction from sequence by integrating long-range interactions
Žiga Avsec, Vikram Agarwal, D. Visentin, et al.
Nature Methods · Apr 2021 · 1.2K citations

Deep learning: new computational modelling techniques for genomics
Gökçen Eraslan, Žiga Avsec, J. Gagneur, et al.
Nature Reviews Genetics · Jul 2019 · 1K citations

Base-resolution models of transcription factor binding reveal soft motif syntax
Žiga Avsec, Melanie Weilert, Avanti Shrikumar, et al.
Nature Genetics · Feb 2021 · 541 citations

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza Revilla, et al.
bioRxiv · Oct 2024 · 508 citations

Citations

Total Citations: 515
Influential: 54
References: 75

GitHub

Stars: 473
Forks: 137
Open Issues: 87
Contributors: 7
Last Push: 6mo ago
Language: Python
License: Apache-2.0

Fields of citing research

Biology: 69%
Computer Science: 68%
Medicine: 54%
Environmental Science: 3%
Agricultural and Food Sciences: 1%
Engineering: 1%
Mathematics: 1%
Economics: 0%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall score: 73 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 58
Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository · Research Paper · Documentation · Dataset
