Calico Life Sciences
Updated Basenji architecture enabling cross-species regulatory sequence activity prediction, trained jointly on human and mouse genomes with improved generalization.
Basenji2 is an updated and refined version of the Basenji deep convolutional neural network, developed at Calico Life Sciences by David Kelley. Published in PLOS Computational Biology in 2020, Basenji2 introduced architectural improvements and a cross-species training strategy that substantially improved the model's ability to predict regulatory sequence activity across both the human and mouse genomes. The primary innovation was joint training on human and mouse genomic data, allowing the model to leverage evolutionary conservation as a form of biological regularization and markedly improving its generalization to noncoding regulatory sequences.
The original Basenji model demonstrated that dilated convolutional networks could predict quantitative epigenomic profiles from DNA sequence at chromosomal scale, but it was trained on human data alone. Basenji2 addressed this by reformulating the training procedure to predict thousands of human and mouse genomic tracks simultaneously from a single shared model, exploiting the evolutionary relationship between the two genomes as a learning signal. This cross-species approach yielded significant improvements in regulatory prediction accuracy and enabled the model to better distinguish conserved regulatory logic from species-specific sequence drift.
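The shared-model, per-species-output arrangement described above can be sketched in a few lines. Everything here is a toy illustration under stated assumptions: the linear trunk, the dimensions, and the track counts are invented for clarity and are not the published Basenji2 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared trunk with species-specific output heads. All dimensions are
# illustrative assumptions, not the published Basenji2 configuration.
D, H = 8, 4                                   # input features, trunk width
W_trunk = rng.normal(size=(D, H))             # shared across both species
heads = {"human": rng.normal(size=(H, 3)),    # 3 toy human output tracks
         "mouse": rng.normal(size=(H, 2))}    # 2 toy mouse output tracks

def forward(x, species):
    """Shared representation followed by the species-specific head."""
    hidden = np.maximum(x @ W_trunk, 0.0)     # shared trunk (ReLU)
    return hidden @ heads[species]

# Training alternates batches from each genome: gradients from every batch
# reach the shared trunk, while only the matching head is updated.
for species in ["human", "mouse", "human", "mouse"]:
    batch = rng.normal(size=(5, D))
    preds = forward(batch, species)           # shape (5, 3) or (5, 2)
```

The design point is that most parameters sit in the trunk and see data from both genomes, so conserved regulatory features are learned from twice the data, while only the thin output heads remain species-specific.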
Basenji2 was the immediate predecessor of Enformer, which replaced the dilated convolutional long-range integration with transformer self-attention over a much longer ~200 kilobase input. The Enformer paper benchmarked directly against Basenji2, reporting that Enformer substantially outperformed it across all assay types (paired Wilcoxon P < 10^-38) and establishing the performance gap that transformer-based approaches opened over purely convolutional methods for long-range regulatory modeling. This context makes Basenji2 an important reference model in the sequence-to-function literature, representing the peak of CNN-only performance before the transformer era in genomic deep learning.
Basenji2 retains the core dilated convolutional architecture of its predecessor but incorporates several refinements. The input DNA is one-hot encoded and processed through a stack of convolutional blocks with progressively dilated filters, expanding the effective receptive field to approximately 40 kilobases of genomic context. Cross-species training was implemented by alternating batches of human and mouse sequences through a shared convolutional trunk, with the output heads split to predict species-specific genomic tracks. Training data comprised 5,313 human tracks from ENCODE and FANTOM5 alongside 1,643 mouse tracks, a substantial expansion over the original Basenji training regime. The model was optimized with a Poisson negative log-likelihood loss on read-count-like targets binned at 128 bp resolution. Evaluation on held-out chromosomes showed consistent improvements over Basenji in CAGE expression prediction, with mean Pearson correlations for protein-coding gene expression increasing across human cell types. When Enformer was introduced in 2021, it was benchmarked directly against Basenji2; the 200 kb transformer model improved median correlation across all output tracks from ~0.636 (Basenji2) to ~0.687 (Enformer) on held-out chromosomes, quantifying the gain from transformer-based long-range modeling.
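Two details from this paragraph can be made concrete: how doubling dilation rates grow the receptive field over 128 bp bins, and the Poisson negative log-likelihood used on the count targets. The kernel width and dilation schedule below are assumptions for illustration, not the published Basenji2 hyperparameters.

```python
import numpy as np

def receptive_field_bins(kernel_size, dilations):
    """Receptive field (in bins) of a stack of dilated 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Illustrative schedule: kernel width 3, dilation doubling per layer.
dilations = [2 ** i for i in range(7)]        # 1, 2, 4, ..., 64
rf_bins = receptive_field_bins(3, dilations)  # 255 bins
rf_bp = rf_bins * 128                         # targets binned at 128 bp
# → ~33 kb, the same order as Basenji2's ~40 kb effective receptive field

def poisson_nll(y_true, y_pred):
    """Poisson negative log-likelihood on read-count-like targets,
    dropping the constant log(y!) term that does not affect gradients."""
    return np.mean(y_pred - y_true * np.log(y_pred + 1e-8))
```

Dilation doubling is what makes the ~40 kb context affordable: the receptive field grows exponentially with depth while the parameter count grows only linearly.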
Basenji2 is used as a reference model for regulatory genomics benchmarks and as a direct comparison point when evaluating newer sequence-to-function architectures. Practically, it serves the same applications as its successor Enformer at lower computational cost: scoring noncoding variants from GWAS for their predicted functional impact on gene expression and chromatin state, studying the regulatory consequences of CRISPR perturbations in silico, and identifying transcription factor binding sites from sequence features via attribution methods such as gradient-based saliency, DeepLIFT, or in-silico mutagenesis. The cross-species training strategy has been particularly useful for researchers studying regulatory evolution between human and mouse, enabling direct comparison of predicted regulatory activity at orthologous loci. Basenji2 has also been used in transfer learning pipelines, where pre-trained convolutional representations are fine-tuned on smaller datasets for specialized regulatory prediction tasks.
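Of the attribution methods mentioned, in-silico mutagenesis is the simplest to sketch: substitute every possible base at every position and record the change in the model's prediction. The toy GC-counting "model" below is a stand-in assumption for a real Basenji2 output track, which would be far too large to inline here.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into an (L, 4) array (A, C, G, T)."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def ism_scores(seq, predict):
    """In-silico mutagenesis: score every single-base substitution by the
    change it induces in a scalar model prediction. `predict` stands in
    for any sequence-to-activity model (e.g. one Basenji2 output track)."""
    ref = predict(one_hot(seq))
    scores = np.zeros((len(seq), 4))          # (position, alternate base)
    for i in range(len(seq)):
        for j, base in enumerate(BASES):
            if base == seq[i]:
                continue                      # reference base scores 0
            mutant = seq[:i] + base + seq[i + 1:]
            scores[i, j] = predict(one_hot(mutant)) - ref
    return scores

# Toy "model": counts C and G bases (columns 1 and 2), a GC-content proxy.
gc_model = lambda x: float(x[:, 1].sum() + x[:, 2].sum())
```

Each mutant requires a full forward pass, so ISM costs 3L predictions for a length-L sequence; gradient-based saliency trades that exhaustiveness for a single backward pass.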
Basenji2 occupies a historically important position in the genomic deep learning literature as the state-of-the-art sequence-to-function model immediately preceding the transformer revolution in regulatory genomics. Its cross-species training strategy — later adopted and extended by Enformer and Borzoi — demonstrated that joint human-mouse training provides a broadly applicable inductive bias for learning conserved regulatory logic. The PLOS Computational Biology paper established a rigorous benchmark against which subsequent models, including Enformer, were directly compared, anchoring the quantitative performance improvements of the transformer era. The model remains available through the Calico GitHub repository and is still used in regulatory genomics workflows where the full context window of Enformer is not required, as Basenji2 runs with substantially lower GPU memory requirements within its roughly 40 kb effective receptive field.