bio.rodeo

DNA & Gene

Basenji

Calico Life Sciences

Deep convolutional neural network that predicts cell-type-specific epigenetic and transcriptional profiles from DNA sequence across large mammalian genomes.

Released: 2018

Overview

Basenji is a deep convolutional neural network developed at Calico Life Sciences by David Kelley and colleagues. It predicts cell-type-specific epigenetic and transcriptional profiles directly from DNA sequence across large mammalian genomes. Described in a 2018 Genome Research paper, the model extends the earlier Basset architecture to handle distal regulatory interactions, a fundamental challenge in regulatory genomics that had been largely intractable with prior approaches.

The central problem Basenji addresses is the sequence-to-function mapping: given a stretch of DNA, how does that sequence determine the quantitative levels of gene expression and chromatin accessibility across dozens or hundreds of cell types and tissues? The difficulty is that regulatory elements such as enhancers can act over tens of thousands of base pairs, so a model must integrate information from a broad sequence context. Basenji does this with dilated convolutional layers that progressively expand the receptive field while keeping computational cost manageable. The model can therefore identify promoters and distal regulatory elements together and combine their contributions when predicting quantitative genomic profiles.
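The effect of dilation on the receptive field can be checked with a little arithmetic. The sketch below is illustrative only: the layer count, kernel width, and doubling dilation schedule are assumptions chosen for the example, not Basenji's published configuration, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field, in input positions, of a stack of stride-1 1D convolutions.

    Each layer with kernel width k and dilation d widens the field by (k - 1) * d.
    """
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        field += (kernel - 1) * dilation
    return field


# Assumed example stack: seven layers, kernel width 3.
n_layers = 7
kernels = [3] * n_layers

plain = receptive_field(kernels, [1] * n_layers)                         # no dilation
dilated = receptive_field(kernels, [2 ** i for i in range(n_layers)])    # 1, 2, 4, ..., 64

print(plain, dilated)
```

With these assumed settings the undilated stack sees 15 positions and the dilated stack sees 255, for the same number of layers and weights. If each position entering the dilated stack corresponds to one 128 bp bin (the target resolution reported for the model), 255 positions span 32,640 bp, which is in the tens-of-kilobases range.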

Unlike its predecessor Basset, which predicted binary chromatin accessibility across cell types, Basenji predicts continuous, quantitative signals — including RNA-seq and ChIP-seq tracks — at high genomic resolution. This shift from classification to regression allowed the model to capture the dynamic range of gene expression and epigenomic signals that vary across cell types, making it far more informative for studying cis-regulatory variation. Basenji also moved from single-position predictions to sequential predictions along entire chromosomes, reflecting the continuous nature of genomic regulatory activity.

Key Features

  • Dilated convolutional architecture: Dilated (atrous) convolutions progressively expand the receptive field across the chromosome, allowing the model to integrate information from regulatory elements tens of kilobases away from any given position without proportionally increasing compute requirements.
  • Multi-task quantitative prediction: Trained simultaneously on more than 4,000 genomic datasets spanning RNA-seq, CAGE, DNase-seq, ATAC-seq, and ChIP-seq experiments across many cell types, sharing information across experiments to improve generalization.
  • Sequential chromosomal predictions: Unlike window-based models, Basenji makes predictions along entire chromosomes in a sliding fashion, preserving local context while enabling genome-wide coverage.
  • Variant effect scoring: Scores the functional impact of single nucleotide variants by comparing reference and alternate allele predictions across all tracks, producing a quantitative score profile useful for prioritizing regulatory variants from GWAS and eQTL studies.
  • Cell-type-specific regulatory modeling: Jointly predicts profiles for many cell types from a single model pass, enabling comparative studies of how sequence variation has different regulatory consequences across tissue contexts.
  • Open weights and code: Model weights and training pipelines are released openly via the Calico GitHub, facilitating fine-tuning and extension for new datasets and organisms.
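The variant effect scoring feature listed above comes down to running the model twice and subtracting. A minimal sketch follows, assuming a predictor that maps a sequence to a list of bins, each holding one value per track. `toy_predict` is a hypothetical stand-in for a trained model and `score_variant` is an illustrative helper; neither is part of the Basenji codebase.

```python
def toy_predict(seq, bin_size=4):
    """Stand-in for a trained model: returns [bins][tracks].

    Track 0 is the G/C count per bin and track 1 is the A/T count per bin.
    A real model would return learned signal predictions instead.
    """
    bins = []
    for start in range(0, len(seq) - bin_size + 1, bin_size):
        window = seq[start:start + bin_size]
        gc = sum(base in "GC" for base in window)
        bins.append([float(gc), float(bin_size - gc)])
    return bins


def score_variant(predict, ref_seq, pos, alt_base):
    """Per-track score: alternate minus reference prediction, summed over bins."""
    if ref_seq[pos] == alt_base:
        raise ValueError("alternate base matches the reference base")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred = predict(ref_seq)
    alt_pred = predict(alt_seq)
    n_tracks = len(ref_pred[0])
    return [
        sum(alt_bin[t] - ref_bin[t] for ref_bin, alt_bin in zip(ref_pred, alt_pred))
        for t in range(n_tracks)
    ]


# A -> G at position 0 raises the G/C track by one and lowers the A/T track by one.
scores = score_variant(toy_predict, "ACGTACGTACGT", pos=0, alt_base="G")
print(scores)
```

Summing the difference over bins gives one number per track, which is convenient for ranking variants. Other summaries, such as the maximum absolute difference across bins, are equally reasonable choices under this sketch.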

Technical Details

Basenji uses a hierarchical convolutional architecture designed for long-range sequence modeling. Input DNA is one-hot encoded (4 channels) and passed through an initial stack of standard convolutional blocks with max-pooling, which downsample the sequence and extract local motifs. These representations are then processed by dilated convolution layers with exponentially increasing dilation rates, which expand the effective receptive field from a few hundred base pairs to tens of kilobases while keeping computation efficient.

The architecture was trained on over 4,000 genomic datasets drawn from ENCODE and Roadmap Epigenomics, including CAGE-seq, RNA-seq, DNase-seq, ATAC-seq, and ChIP-seq experiments across a diverse range of human cell lines and tissues. Targets were binned at 128 bp resolution along chromosomes, and the model was optimized with a Poisson regression loss, which suits count-like sequencing data. The model shares all parameters across cell types, so cell-type specificity emerges entirely from the learned relationship between local sequence features and distal regulatory context encoded in the training data.

In benchmark analyses, Basenji substantially improved on Basset for predicting gene expression from sequence, achieving higher Pearson correlations on held-out chromosomes for both CAGE-seq (correlations typically exceeding 0.6 for protein-coding genes) and DNase-seq profiles. Variant effect predictions from Basenji correlated significantly with eQTL effect sizes from the GTEx consortium across multiple tissues, supporting the view that the model learns biologically meaningful regulatory logic from sequence alone.
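The input encoding, target binning, and loss described above can be sketched in a few lines of plain Python. This is a sketch under stated assumptions, not the project's training code: the helper names are hypothetical, unknown bases are encoded as all zeros by assumption, and coverage is summed within each bin by assumption.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode DNA as rows of 4 channels (A, C, G, T); other symbols become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def bin_coverage(coverage, bin_size=128):
    """Sum per-base coverage into fixed-width bins, dropping a trailing partial bin."""
    n_bins = len(coverage) // bin_size
    return [sum(coverage[i * bin_size:(i + 1) * bin_size]) for i in range(n_bins)]


def poisson_nll(pred, target, eps=1e-8):
    """Mean Poisson negative log-likelihood of targets given predicted rates.

    The log(target!) term is omitted because it does not depend on the prediction.
    """
    total = sum(p - t * math.log(p + eps) for p, t in zip(pred, target))
    return total / len(pred)


encoded = one_hot("ACGN")
targets = bin_coverage([1] * 256 + [5])        # two full bins; the last base is dropped
loss_at_target = poisson_nll([2.0], [2.0])
loss_too_high = poisson_nll([3.0], [2.0])
print(encoded, targets, loss_at_target, loss_too_high)
```

For a single bin the loss is smallest when the predicted rate equals the observed count, and the penalty for a given absolute error depends on the size of the count, which is why a Poisson objective is a natural fit for read-count tracks.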

Applications

Basenji is used by computational biologists and human geneticists working on regulatory genomics, noncoding variant interpretation, and gene regulation. A primary application is the in-silico scoring of noncoding variants from GWAS or clinical sequencing studies: by comparing predicted epigenomic profiles between reference and alternate alleles, Basenji generates quantitative evidence for or against a causal regulatory role for each variant across many cell types simultaneously. The model has also been applied to study the mechanistic basis of eQTLs by identifying which cell types and regulatory tracks are most perturbed by a given variant. Researchers have used Basenji predictions for comparative genomics, studying how regulatory sequences have changed across mammalian evolution, and for prioritizing regulatory elements for CRISPR perturbation experiments. The architecture and training pipeline served as the direct foundation for both Enformer and Borzoi, the leading models in this lineage.

Impact

Basenji established a new paradigm for quantitative sequence-to-function modeling at genome scale, shifting the field from binary accessibility prediction toward continuous, multi-track regulatory prediction. Its use of dilated convolutions for long-range regulatory integration was widely adopted in subsequent genomic deep learning architectures. The Genome Research paper has been cited extensively and is recognized as a landmark contribution to computational regulatory genomics. The model directly enabled the development of Enformer, which replaced convolutional long-range integration with transformer self-attention, and Borzoi, which further expanded context and resolution.

Two limitations are worth noting. First, the model takes sequence as its only input and does not incorporate experimental epigenomic measurements that could condition predictions on cell-type-specific chromatin state. Second, the convolutional architecture has a fixed maximum receptive field that cannot grow beyond the coverage of the dilated stack. Despite these constraints, Basenji remains a widely used reference model, and its open codebase continues to serve as a foundation for sequence-to-function research at Calico and beyond.

Tags

gene expression, regulatory genomics, variant effect prediction, CNN, self-supervised, chromatin, genomics

Resources

  • GitHub Repository
  • Research Paper