Calico Life Sciences
Deep convolutional neural network that predicts cell-type-specific epigenetic and transcriptional profiles from DNA sequence across large mammalian genomes.
Basenji is a deep convolutional neural network developed at Calico Life Sciences by David Kelley and colleagues that predicts cell-type-specific epigenetic and transcriptional profiles directly from DNA sequence across large mammalian genomes. Published in Genome Research in 2018, the model extends the earlier Basset architecture to capture distal regulatory interactions, a long-standing challenge in regulatory genomics that earlier sequence-based models, which read only short windows of DNA, could not address.
The central problem Basenji was designed to address is the sequence-to-function mapping challenge: given a stretch of DNA, how does that sequence determine the quantitative levels of gene expression and chromatin accessibility across dozens or hundreds of different cell types and tissues? This is complicated by the fact that regulatory elements such as enhancers can act over tens of thousands of base pairs of genomic distance, requiring models capable of integrating information from broad sequence contexts. Basenji accomplishes this through dilated convolutional layers that progressively expand the model's receptive field while keeping computational costs manageable, enabling the model to simultaneously identify promoters and distal regulatory elements and synthesize their collective contributions to predict quantitative genomic profiles.
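The growth of the receptive field under dilation can be checked with simple arithmetic. The sketch below is illustrative only: the kernel width of 3, the seven doubling dilation rates, and the 128 bp pooled resolution are assumptions chosen for the example, not a statement of the published hyperparameters, and the small context contributed by the initial convolution blocks is ignored.

```python
def receptive_field_bp(kernel_size, dilations, bin_size_bp):
    """Approximate receptive field, in base pairs, of stacked stride-1
    dilated 1D convolutions applied to a sequence that has already been
    pooled down to `bin_size_bp` base pairs per position."""
    positions = 1  # a single output position sees itself
    for dilation in dilations:
        # Each layer widens the view by (kernel_size - 1) * dilation positions.
        positions += (kernel_size - 1) * dilation
    return positions * bin_size_bp


# Assumed, illustrative settings (not taken from the source text).
doubling = [1, 2, 4, 8, 16, 32, 64]
undilated = [1] * len(doubling)

dilated_rf = receptive_field_bp(3, doubling, 128)
plain_rf = receptive_field_bp(3, undilated, 128)
print(dilated_rf, plain_rf)  # 32640 1920
```

With the same number of layers and parameters, doubling the dilation rate at each layer widens the view by roughly a factor of 17 in this example, which is why dilation is an inexpensive way to reach distal elements.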
Unlike its predecessor Basset, which predicted binary chromatin accessibility across cell types, Basenji predicts continuous, quantitative coverage signals, including gene expression measured by CAGE and ChIP-seq tracks, at high genomic resolution. This shift from classification to regression lets the model capture the dynamic range of expression and epigenomic signals across cell types, so a predicted change can be read as a change in magnitude instead of a flip between two classes, which is what studies of cis-regulatory variation need. Basenji also moved from single-position predictions to sequential predictions along entire chromosomes, reflecting the continuous nature of genomic regulatory activity.
Basenji uses a hierarchical convolutional architecture designed for long-range sequence modeling. Input DNA is one-hot encoded (4 channels) and passed through an initial stack of standard convolutional blocks with max-pooling, which extract local motifs and downsample the sequence. These representations are then processed by dilated convolutions with exponentially increasing dilation rates, which expand the effective receptive field to tens of thousands of base pairs while keeping computation efficient.
The model was trained on over 4,000 genomic datasets drawn from ENCODE, Roadmap Epigenomics, and FANTOM5, including DNase-seq, histone ChIP-seq, and CAGE experiments across a diverse range of human cell lines and tissues. Targets were binned at 128 bp resolution along chromosomes, and the model was optimized with a Poisson regression loss suited to count-like sequencing data. Importantly, the convolutional layers are shared across all cell types and only the final output layer carries target-specific weights, so cell-type specificity emerges from how each output combines shared sequence features and distal regulatory context, not from a separate model per cell type.
In benchmark analyses, Basenji substantially improved on Basset for predicting gene expression from sequence, achieving higher Pearson correlations on held-out chromosomes for both CAGE (correlations typically exceeding 0.6 for protein-coding genes) and DNase-seq profiles. Variant effect predictions from Basenji correlated significantly with eQTL effect sizes from the GTEx consortium across multiple tissues, supporting the model's ability to learn biologically meaningful regulatory logic from sequence alone.
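The input encoding, target binning, and loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names are hypothetical, coverage is assumed to be summed within each 128 bp bin, and the loss drops the log(y!) term because it does not depend on the prediction.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode DNA as 4-channel rows (A, C, G, T).
    Unknown bases such as N become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def bin_coverage(per_base_counts, bin_size=128):
    """Sum per-base coverage into consecutive fixed-width bins,
    dropping any partial bin at the end (an assumption of this sketch)."""
    n_bins = len(per_base_counts) // bin_size
    return [
        sum(per_base_counts[i * bin_size:(i + 1) * bin_size])
        for i in range(n_bins)
    ]


def poisson_nll(predicted_rates, observed_counts, eps=1e-8):
    """Mean Poisson negative log-likelihood over bins, without the
    constant log(y!) term. `eps` guards against log(0)."""
    total = 0.0
    for rate, count in zip(predicted_rates, observed_counts):
        total += rate - count * math.log(rate + eps)
    return total / len(observed_counts)


encoded = one_hot("ACGTN")
coverage = [1] * 256 + [3] * 128          # toy per-base read counts
targets = bin_coverage(coverage)          # three 128 bp bins
good_fit = poisson_nll([128.0, 128.0, 384.0], targets)
poor_fit = poisson_nll([200.0, 200.0, 200.0], targets)
```

For each bin the loss is minimized when the predicted rate equals the observed count, so `good_fit` is lower than `poor_fit`. Unlike squared error, the penalty scales with the expected count, which suits sequencing coverage.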
Basenji is used by computational biologists and human geneticists working on regulatory genomics, noncoding variant interpretation, and gene regulation. A primary application is the in-silico scoring of noncoding variants from GWAS or clinical sequencing studies: by comparing predicted epigenomic profiles between reference and alternate alleles, Basenji generates quantitative evidence for or against a causal regulatory role for each variant across many cell types simultaneously. The model has also been applied to study the mechanistic basis of eQTLs by identifying which cell types and regulatory tracks are most perturbed by a given variant. Researchers have used Basenji predictions for comparative genomics, studying how regulatory sequences have changed across mammalian evolution, and for prioritizing regulatory elements for CRISPR perturbation experiments. The architecture and training pipeline served as the direct foundation for both Enformer and Borzoi, the leading models in this lineage.
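The reference-versus-alternate comparison used for variant scoring can be sketched as follows. Everything here is hypothetical scaffolding: `toy_predict` is a stand-in that counts fixed motifs per bin so the example runs without a trained network, the track names are invented, and summing the predicted signal across bins is only one of several plausible ways to collapse a profile into a score.

```python
def apply_snv(sequence, position, alt_base):
    """Return a copy of `sequence` with the 0-based `position` set to `alt_base`."""
    return sequence[:position] + alt_base + sequence[position + 1:]


def score_variant(predict, ref_sequence, position, alt_base):
    """Per-track effect score: predicted signal summed across bins for the
    alternate allele minus the same sum for the reference allele."""
    alt_sequence = apply_snv(ref_sequence, position, alt_base)
    ref_tracks = predict(ref_sequence)
    alt_tracks = predict(alt_sequence)
    return {
        track: sum(alt_tracks[track]) - sum(ref_tracks[track])
        for track in ref_tracks
    }


def toy_predict(sequence, bin_size=8):
    """Hypothetical stand-in for a trained model: each track counts
    occurrences of one fixed motif, assigned to the bin where it starts."""
    motifs = {"toy_track_a": "GATA", "toy_track_b": "TTGC"}
    n_bins = len(sequence) // bin_size
    tracks = {}
    for name, motif in motifs.items():
        bins = [0.0] * n_bins
        for start in range(len(sequence) - len(motif) + 1):
            in_range = start // bin_size < n_bins
            if in_range and sequence[start:start + len(motif)] == motif:
                bins[start // bin_size] += 1.0
        tracks[name] = bins
    return tracks


reference = "CCGATACCTTGCCCAA"
# An A-to-C substitution at 0-based position 3 disrupts the GATA motif.
scores = score_variant(toy_predict, reference, 3, "C")
```

Here `scores` reports a decrease on the track that depends on the disrupted motif and no change on the other, mirroring how a single variant can perturb some cell types or assays while leaving others untouched.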
Basenji established a new paradigm for quantitative sequence-to-function modeling at genome scale, shifting the field from binary accessibility prediction toward continuous, multi-track regulatory prediction. Its use of dilated convolutions for long-range regulatory integration was widely adopted in subsequent genomic deep learning architectures. The Genome Research paper has been cited extensively and is recognized as a landmark contribution to computational regulatory genomics. The model directly enabled the development of Enformer, which replaced convolutional long-range integration with transformer self-attention, and Borzoi, which further expanded context and resolution. One notable limitation is that the model takes sequence as its only input, so its predictions cannot be conditioned on experimental measurements of a cell's chromatin state. Another is that the receptive field is fixed by the depth and dilation rates of the convolutional stack, so regulatory elements that lie farther from a target than that span cannot influence its prediction. Despite these constraints, Basenji remains a widely used reference model, and its open codebase continues to serve as the foundation for sequence-to-function research at Calico and beyond.