bio.rodeo

DNA & Gene

Basset

Harvard University

Deep convolutional neural network that learns the regulatory code of DNA accessibility from DNase-seq data across 164 cell types, enabling variant effect prediction at cis-regulatory elements.

Released: 2016

Overview

Basset is a deep convolutional neural network developed by David Kelley, Jasper Snoek, and John Rinn at Harvard University and published in Genome Research in 2016. It was among the first models to apply deep learning to the regulatory code of DNA accessibility: the rules encoded in raw sequence that determine where chromatin is open and accessible to transcription factors in different cell types. Basset helped establish the use of convolutional neural networks to decode cis-regulatory grammar directly from genomic sequence, and it preceded and directly inspired the Basenji and Enformer lineage of models.

The central problem Basset addresses is DNA accessibility prediction: given a 600-base-pair window of genomic sequence, the model predicts, for each of 164 cell types, whether chromatin at that locus is accessible as measured by DNase-seq. Chromatin accessibility is a key readout of cis-regulatory activity, since open chromatin marks active enhancers, promoters, and insulators. Trained on a large compendium of ENCODE DNase-seq datasets, Basset learned to recognize transcription factor binding motifs and the combinatorial logic by which they determine cell-type-specific accessibility, without being given any prior knowledge of transcription factor binding sites.

At the time of its publication, Basset improved on existing methods for predicting DNA accessibility and regulatory activity from sequence. Earlier approaches, such as linear models and shallow neural networks, had limited capacity to learn the combinatorial interactions between sequence motifs that determine regulatory specificity across cell types. Trained jointly on all 164 cell types, Basset's three convolutional layers learn sequence motifs at multiple scales together with the regulatory logic governing their cell-type-specific activity. It was also one of the first genomic deep learning models to show that convolutional filters learned de novo from sequence correspond closely to known transcription factor binding motifs, which gave it biological interpretability alongside improved predictive performance.

Key Features

  • Multi-cell-type accessibility prediction: Predicts binary chromatin accessibility across 164 human cell types simultaneously from a 600 bp sequence window, learning shared and cell-type-specific regulatory logic in a single model.
  • De novo motif discovery: Convolutional filters in the first layer learn sequence motifs de novo from data, with many filters corresponding closely to known transcription factor binding sites from JASPAR and ENCODE, providing biologically interpretable learned representations.
  • Variant effect scoring for GWAS: Predicts the differential chromatin accessibility between reference and alternate alleles at single nucleotide variants, enabling prioritization of putatively causal regulatory variants from genome-wide association studies.
  • Open-chromatin classification across cell contexts: Simultaneously models accessible chromatin across diverse cell types including lymphocytes, fibroblasts, embryonic stem cells, and cancer cell lines, covering the range of cell contexts represented in the ENCODE compendium.
  • Interpretation via in-silico mutagenesis: Supports systematic perturbation of individual bases within an input sequence to produce position-by-position activity maps, enabling identification of which bases within a regulatory element contribute most to predicted accessibility.
  • Lua/Torch implementation with retraining support: Released as an open-source Lua and Torch framework (later ported to PyTorch-compatible formats) with training scripts and pre-trained model weights for immediate use on new datasets.
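
The in-silico mutagenesis idea listed above can be sketched in a few lines. This is a minimal illustration, not Basset's own code: `toy_score` is a hypothetical stand-in for a trained model (a sigmoid over the count of one fixed motif), and the sequence is far shorter than Basset's 600 bp input.

```python
import math

BASES = "ACGT"

def toy_score(seq):
    """Hypothetical stand-in for a trained model: a sigmoid over the
    count of one fixed motif. NOT Basset; for illustration only."""
    motif = "TGACTCA"
    count = sum(1 for i in range(len(seq) - len(motif) + 1)
                if seq[i:i + len(motif)] == motif)
    return 1.0 / (1.0 + math.exp(-(2.0 * count - 1.0)))

def in_silico_mutagenesis(seq, predict):
    """For each position, return the change in predicted score
    (mutant minus reference) for each of the three alternative bases."""
    ref_score = predict(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            row[alt] = predict(mutant) - ref_score
        effects.append(row)
    return effects

# Motif embedded at positions 5..11 of a 17 bp toy sequence.
seq = "GGGCC" + "TGACTCA" + "CCGGG"
effects = in_silico_mutagenesis(seq, toy_score)

# Most damaging substitution at each position (most negative delta).
loss = [min(row.values()) for row in effects]
```

With this toy scorer, `loss` is negative only at the seven motif positions and zero in the flanks, which is the kind of position-by-position map the feature describes. A real model would need 3 × 600 forward passes per sequence, usually batched.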

Technical Details

Basset uses a three-layer deep convolutional architecture designed for fixed-length 600 bp DNA sequences represented as one-hot encoded matrices (4 channels × 600 positions). The first convolutional layer applies 300 filters of width 19 bp to capture core transcription factor binding motifs; subsequent layers with filters of width 11 bp and 7 bp capture higher-order combinations of motifs and inter-motif spacing rules. Max-pooling layers between convolutional blocks reduce spatial dimensionality while preserving the most salient local features. The convolutional stack feeds into two fully connected layers (1,000 and 164 units) producing sigmoid-activated accessibility predictions for each of 164 cell types. The network was trained on 2.2 million genomic sequences drawn from a 164-cell-type ENCODE DNase-seq compendium, with training, validation, and test sets defined by chromosome-level splits to prevent data leakage across closely related sequences. Despite its relatively simple architecture compared to later models, Basset achieved accuracy on par with or exceeding ensemble methods and kernel-based approaches on the ENCODE accessibility prediction benchmark. An important design choice was the use of a joint multi-task learning objective across all 164 cell types, which allowed the model to learn shared regulatory features while maintaining cell-type-specific discrimination — a strategy subsequently adopted and scaled by Basenji, Basenji2, and Enformer.
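
As a rough illustration of the input encoding and the first convolutional layer described above, the following pure-Python sketch one-hot encodes a sequence and scans it with a single filter. The hand-built 7 bp filter, the pooling width of 3, the zero encoding for unknown bases, and the presence of one bias per filter are assumptions made for the example; the text above specifies only learned 19 bp filters and max pooling between layers.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as 4 rows (A, C, G, T) by len(seq) columns.
    Unknown bases such as N become all-zero columns (an assumption)."""
    return [[1.0 if base == b else 0.0 for base in seq.upper()] for b in BASES]

def conv1d_valid(x, filt):
    """Cross-correlate a 4 x L input with one 4 x W filter, no padding.
    Returns L - W + 1 activations, one per window start."""
    length, width = len(x[0]), len(filt[0])
    return [
        sum(filt[c][k] * x[c][start + k] for c in range(4) for k in range(width))
        for start in range(length - width + 1)
    ]

def relu_maxpool(acts, pool):
    """ReLU followed by non-overlapping max pooling of width `pool`."""
    rect = [max(0.0, a) for a in acts]
    return [max(rect[i:i + pool]) for i in range(0, len(rect) - pool + 1, pool)]

# A hand-built filter that matches "TGACTCA" exactly (illustrative only;
# real first-layer filters are learned from data).
motif = "TGACTCA"
filt = one_hot(motif)

seq = "ACGT" * 3 + motif + "ACGT" * 3
x = one_hot(seq)                      # 4 x 31 matrix
acts = conv1d_valid(x, filt)          # 25 activations
best = max(range(len(acts)), key=acts.__getitem__)  # window start of the motif
pooled = relu_maxpool(acts, 3)        # pooling width 3 is an assumption

# First layer as described above: 300 filters of 4 x 19 weights,
# plus one bias per filter (the bias term is an assumption).
first_layer_params = 300 * 4 * 19 + 300
```

The filter's activation peaks where the window matches the motif, which is why first-layer filters can be read as motif detectors. Under the stated assumptions the first layer holds 23,100 parameters.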

Applications

Basset was applied most prominently to the interpretation of GWAS-identified noncoding variants. By scoring the predicted change in chromatin accessibility between reference and alternate alleles of common SNPs, Basset provided a sequence-based functional filter that enriched for putatively causal variants in trait-associated loci. The authors demonstrated that Basset-predicted accessibility changes were significantly elevated for GWAS index SNPs compared to SNPs in linkage disequilibrium, providing an early proof-of-concept for deep learning-based variant prioritization. Basset was also used to identify which cell types are most affected by regulatory variants — a capability that became foundational for connecting GWAS signals to disease-relevant tissues. Additionally, the Basset framework was adopted for interpreting regulatory elements in cancer genomics, studying enhancer activity changes in differentiation, and generating training data for regulatory sequence design experiments.
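
The reference-versus-alternate scoring described above reduces to two forward passes and a subtraction per cell type. The sketch below assumes a hypothetical multi-task predictor, `toy_predict`, with invented cell-type names and motifs; it stands in for a trained model and is not Basset's API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical stand-in for a trained multi-task model: each "cell type"
# responds to one motif. Names and weights are invented for illustration.
TOY_MOTIFS = {"cell_A": "TGACTCA", "cell_B": "GATAAG", "cell_C": "CACGTG"}

def toy_predict(seq):
    """Return a dict of cell type -> predicted accessibility probability."""
    return {cell: sigmoid(3.0 * seq.count(motif) - 1.5)
            for cell, motif in TOY_MOTIFS.items()}

def variant_effect(ref_seq, pos, alt, predict):
    """Score a single nucleotide variant as predict(alt) - predict(ref)
    for every cell type."""
    if ref_seq[pos] == alt:
        raise ValueError("alt allele equals the reference base")
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    ref_pred, alt_pred = predict(ref_seq), predict(alt_seq)
    return {cell: alt_pred[cell] - ref_pred[cell] for cell in ref_pred}

ref = "CCCC" + "TGACTCA" + "CCCC" + "GATAAG" + "CCCC"
# Substitute the central C of the first motif (index 7) with G.
deltas = variant_effect(ref, 7, "G", toy_predict)

# Cell type whose predicted accessibility changes the most.
most_affected = max(deltas, key=lambda cell: abs(deltas[cell]))
```

Here the variant disrupts the motif used by `cell_A` only, so that cell type shows a large negative change while the others are unchanged. Ranking cell types by absolute change is one simple way to express the "which cell types are most affected" use described above.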

Impact

Basset is widely recognized as one of the pioneering models of genomic deep learning, establishing the CNN paradigm for sequence-to-function prediction that dominated the field from 2016 through 2021. Its demonstration that convolutional filters learn biologically interpretable transcription factor motifs from raw sequence data, without any prior knowledge of binding sites, was highly influential and helped build confidence in the mechanistic relevance of deep learning representations in genomics. The Genome Research paper has accumulated thousands of citations and is widely cited as a foundational reference in later work on regulatory genomic deep learning, including Basenji and Enformer. (DeepSEA, often mentioned alongside Basset, was published in 2015 and is a contemporary rather than a successor.) Kelley subsequently moved to Calico Life Sciences, where he led the development of Basenji and contributed to Enformer, a collaboration with DeepMind; both built directly on Basset's architecture and training paradigm. A key limitation of Basset is its 600 bp input window, which cannot capture the long-range enhancer-promoter interactions that are central to cell-type-specific gene regulation. Basenji addressed this limitation with dilated convolutions, and Enformer with transformer attention.

Tags

variant effect prediction, regulatory genomics, CNN, self-supervised, chromatin, genomics

Resources

GitHub Repository, Research Paper