Calico Life Sciences
Updated Basenji architecture enabling cross-species regulatory sequence activity prediction, trained jointly on human and mouse genomes with improved generalization.
Basenji2 is an updated and refined version of the Basenji deep convolutional neural network, developed at Calico Life Sciences by David Kelley. Published in PLOS Computational Biology in 2020, Basenji2 introduced architectural improvements and a cross-species training strategy that substantially improved the model's ability to predict regulatory sequence activity across both the human and mouse genomes. The primary innovation was joint training on human and mouse genomic data, allowing the model to leverage evolutionary conservation as a form of biological regularization and markedly improving its generalization to noncoding regulatory sequences.
The original Basenji model demonstrated that dilated convolutional networks could predict quantitative epigenomic profiles from DNA sequence at chromosomal scale, but it was trained on human data alone. Basenji2 addressed this by reformulating the training procedure to predict thousands of human and mouse genomic tracks simultaneously from a single shared model, exploiting the evolutionary relationship between the two genomes as a learning signal. This cross-species approach yielded significant improvements in regulatory prediction accuracy and enabled the model to better distinguish conserved regulatory logic from species-specific sequence drift.
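The shared-model, per-species-output arrangement described above can be sketched in a few lines. Everything here is a toy illustration under stated assumptions: the linear trunk, the dimensions, and the track counts are invented for clarity and are not the published Basenji2 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared trunk with species-specific output heads. All dimensions are
# illustrative assumptions, not the published Basenji2 configuration.
D, H = 8, 4                                   # input features, trunk width
W_trunk = rng.normal(size=(D, H))             # shared across both species
heads = {"human": rng.normal(size=(H, 3)),    # 3 toy human output tracks
         "mouse": rng.normal(size=(H, 2))}    # 2 toy mouse output tracks

def forward(x, species):
    """Shared representation followed by the species-specific head."""
    hidden = np.maximum(x @ W_trunk, 0.0)     # shared trunk (ReLU)
    return hidden @ heads[species]

# Training alternates batches from each genome: gradients from every batch
# reach the shared trunk, while only the matching head is updated.
for species in ["human", "mouse", "human", "mouse"]:
    batch = rng.normal(size=(5, D))
    preds = forward(batch, species)           # shape (5, 3) or (5, 2)
```

The design point is that most parameters sit in the trunk and see data from both genomes, so conserved regulatory features are learned from twice the data, while only the thin output heads remain species-specific.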
Basenji2 was the immediate predecessor of Enformer, which replaced the dilated convolutional long-range integration with transformer self-attention over a much longer ~200 kilobase input. The Enformer paper benchmarked directly against Basenji2, reporting that Enformer substantially outperformed it across all assay types (paired Wilcoxon P < 10^-38) and establishing the performance gap that transformer-based approaches opened over purely convolutional methods for long-range regulatory modeling. This context makes Basenji2 an important reference model in the sequence-to-function literature, representing the peak of CNN-only performance before the transformer era in genomic deep learning.
Basenji2 retains the core dilated convolutional architecture of its predecessor but incorporates several refinements. The input DNA is one-hot encoded and processed through a stack of convolutional blocks with progressively dilated filters, expanding the effective receptive field to approximately 40 kilobases of genomic context. Cross-species training was implemented by alternating batches of human and mouse sequences through a shared convolutional trunk, with the output heads split to predict species-specific genomic tracks. Training data comprised 5,313 human tracks from ENCODE and FANTOM5 alongside 1,643 mouse tracks, a substantial expansion over the original Basenji training regime. The model was optimized with a Poisson negative log-likelihood loss on read-count-like targets binned at 128 bp resolution. Evaluation on held-out chromosomes showed consistent improvements over Basenji in CAGE expression prediction, with mean Pearson correlations for protein-coding gene expression increasing across human cell types. When Enformer was introduced in 2021, it was benchmarked directly against Basenji2; the 200 kb transformer model improved median correlation across all output tracks from ~0.636 (Basenji2) to ~0.687 (Enformer) on held-out chromosomes, quantifying the gain from transformer-based long-range modeling.
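Two details from this paragraph can be made concrete: how doubling dilation rates grow the receptive field over 128 bp bins, and the Poisson negative log-likelihood used on the count targets. The kernel width and dilation schedule below are assumptions for illustration, not the published Basenji2 hyperparameters.

```python
import numpy as np

def receptive_field_bins(kernel_size, dilations):
    """Receptive field (in bins) of a stack of dilated 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Illustrative schedule: kernel width 3, dilation doubling per layer.
dilations = [2 ** i for i in range(7)]        # 1, 2, 4, ..., 64
rf_bins = receptive_field_bins(3, dilations)  # 255 bins
rf_bp = rf_bins * 128                         # targets binned at 128 bp
# → ~33 kb, the same order as Basenji2's ~40 kb effective receptive field

def poisson_nll(y_true, y_pred):
    """Poisson negative log-likelihood on read-count-like targets,
    dropping the constant log(y!) term that does not affect gradients."""
    return np.mean(y_pred - y_true * np.log(y_pred + 1e-8))
```

Dilation doubling is what makes the ~40 kb context affordable: the receptive field grows exponentially with depth while the parameter count grows only linearly.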
Basenji2 is used as a reference model for regulatory genomics benchmarks and as a direct comparison point when evaluating newer sequence-to-function architectures. Practically, it serves the same applications as its successor Enformer at lower computational cost: scoring noncoding variants from GWAS for their predicted functional impact on gene expression and chromatin state, studying the regulatory consequences of CRISPR perturbations in silico, and identifying transcription factor binding sites from sequence features via attribution methods such as gradient-based saliency, DeepLIFT, or in-silico mutagenesis. The cross-species training strategy has been particularly useful for researchers studying regulatory evolution between human and mouse, enabling direct comparison of predicted regulatory activity at orthologous loci. Basenji2 has also been used in transfer learning pipelines, where pre-trained convolutional representations are fine-tuned on smaller datasets for specialized regulatory prediction tasks.
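Of the attribution methods mentioned, in-silico mutagenesis is the simplest to sketch: substitute every possible base at every position and record the change in the model's prediction. The toy GC-counting "model" below is a stand-in assumption for a real Basenji2 output track, which would be far too large to inline here.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into an (L, 4) array (A, C, G, T)."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def ism_scores(seq, predict):
    """In-silico mutagenesis: score every single-base substitution by the
    change it induces in a scalar model prediction. `predict` stands in
    for any sequence-to-activity model (e.g. one Basenji2 output track)."""
    ref = predict(one_hot(seq))
    scores = np.zeros((len(seq), 4))          # (position, alternate base)
    for i in range(len(seq)):
        for j, base in enumerate(BASES):
            if base == seq[i]:
                continue                      # reference base scores 0
            mutant = seq[:i] + base + seq[i + 1:]
            scores[i, j] = predict(one_hot(mutant)) - ref
    return scores

# Toy "model": counts C and G bases (columns 1 and 2), a GC-content proxy.
gc_model = lambda x: float(x[:, 1].sum() + x[:, 2].sum())
```

Each mutant requires a full forward pass, so ISM costs 3L predictions for a length-L sequence; gradient-based saliency trades that exhaustiveness for a single backward pass.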
Basenji2 occupies a historically important position in the genomic deep learning literature as the state-of-the-art sequence-to-function model immediately preceding the transformer revolution in regulatory genomics. Its cross-species training strategy — later adopted and extended by Enformer and Borzoi — demonstrated that joint human-mouse training provides a broadly applicable inductive bias for learning conserved regulatory logic. The PLOS Computational Biology paper established a rigorous benchmark against which subsequent models, including Enformer, were directly compared, anchoring the quantitative performance improvements of the transformer era. The model remains available through the Calico GitHub repository and is still used in regulatory genomics workflows where the full context window of Enformer is not required, as Basenji2 runs with substantially lower GPU memory requirements within its roughly 40 kb effective receptive field.