Basset

Convolutional neural network that predicts DNA accessibility from sequence across 164 DNase-seq cell types, enabling variant effect prediction.

Released: July 2016

Basset is a deep convolutional neural network developed by David Kelley, Jasper Snoek, and John Rinn at Harvard University and published in Genome Research in 2016. It was among the first computational models to apply deep learning to the problem of learning the regulatory code of DNA accessibility — the rules encoded in raw sequence that determine where chromatin is open and accessible to transcription factors across different cell types. Basset established a foundational paradigm for using convolutional neural networks to decode cis-regulatory grammar directly from genomic sequence, predating and directly inspiring the Basenji and Enformer lineage of models.

The central scientific problem Basset addresses is DNA accessibility prediction: given a 600-base-pair window of genomic sequence, the model predicts which of 164 cell types will have accessible chromatin at that locus, as measured by DNase-seq. Chromatin accessibility is a critical readout of cis-regulatory activity — open chromatin marks active enhancers, promoters, and insulators. By training a CNN on a large compendium of ENCODE DNase-seq datasets, Basset learned to recognize transcription factor binding motifs and the combinatorial regulatory logic with which they interact to determine cell-type-specific accessibility, without any explicit prior knowledge of transcription factor binding sites.
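The input representation can be illustrated with a short sketch. The helper name `one_hot_encode`, the A, C, G, T channel order, and the all-zero treatment of ambiguous bases are assumptions made for illustration; they are not taken from the Basset code.

```python
from typing import List

# Assumed channel order; the original implementation may order bases differently.
BASES = "ACGT"


def one_hot_encode(seq: str) -> List[List[int]]:
    """Return a 4 x len(seq) matrix with one row per base in BASES.

    Bases outside A/C/G/T (for example N) become all-zero columns,
    which is an illustrative choice rather than a documented convention.
    """
    seq = seq.upper()
    return [[1 if base == b else 0 for base in seq] for b in BASES]


# A toy 600 bp window (not a real genomic sequence).
window = "ACGTN" * 120
matrix = one_hot_encode(window)
print(len(matrix), len(matrix[0]))  # 4 600
```

Each column of the matrix describes one position of the window, so a 600 bp window yields the 4 × 600 input that a convolutional model of this kind consumes.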

At the time of its publication, Basset represented a substantial advance over existing methods for predicting DNA accessibility and regulatory activity from sequence. Prior approaches such as linear models and shallow neural networks had limited capacity to learn the complex combinatorial interactions between sequence motifs that determine regulatory specificity across cell types. By training a three-layer deep convolutional network on a compendium spanning 164 cell types, Basset was able to simultaneously learn relevant sequence motifs at multiple scales and the regulatory logic governing their cell-type-specific activity. It was one of the first genomic deep learning models to demonstrate that convolutional filters learned de novo from sequence data corresponded closely to known transcription factor binding motifs, providing biological interpretability alongside improved predictive performance.

Key Features

Multi-cell-type accessibility prediction: Predicts binary chromatin accessibility across 164 human cell types simultaneously from a 600 bp sequence window, learning shared and cell-type-specific regulatory logic in a single model.
De novo motif discovery: Convolutional filters in the first layer learn sequence motifs de novo from data, with many filters corresponding closely to known transcription factor binding sites from JASPAR and ENCODE, providing biologically interpretable learned representations.
Variant effect scoring for GWAS: Predicts the differential chromatin accessibility between reference and alternate alleles at single nucleotide variants, enabling prioritization of putatively causal regulatory variants from genome-wide association studies.
Open-chromatin classification across cell contexts: Simultaneously models accessible chromatin across diverse cell types including lymphocytes, fibroblasts, embryonic stem cells, and cancer cell lines, covering the range of cis-regulatory contexts represented in the ENCODE compendium.
Interpretation via in-silico mutagenesis: Supports systematic perturbation of individual bases within an input sequence to produce position-by-position activity maps, enabling identification of which bases within a regulatory element contribute most to predicted accessibility.
Lua/Torch implementation with retraining support: Released as an open-source Lua and Torch framework (later ported to PyTorch-compatible formats) with training scripts and pre-trained model weights for immediate use on new datasets.
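The in-silico mutagenesis feature listed above can be sketched in a few lines. This is a minimal illustration, not Basset's implementation: `toy_score` is a hypothetical stand-in for a trained model that simply counts a made-up motif, and `in_silico_mutagenesis` is an illustrative name.

```python
BASES = "ACGT"


def toy_score(seq):
    """Hypothetical stand-in for a trained model: counts a made-up motif."""
    motif = "TGACTCA"
    n = len(motif)
    return float(sum(seq[i:i + n] == motif for i in range(len(seq) - n + 1)))


def in_silico_mutagenesis(seq, score_fn):
    """For every position, score each single-base substitution.

    Returns one dict per position mapping the alternate base to the
    change in score relative to the unmodified sequence.
    """
    ref_score = score_fn(seq)
    deltas = []
    for i, ref_base in enumerate(seq):
        row = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            row[alt] = score_fn(mutated) - ref_score
        deltas.append(row)
    return deltas


seq = "GGGTGACTCAGGG"  # toy sequence with the motif at positions 3-9
deltas = in_silico_mutagenesis(seq, toy_score)

# Positions where at least one substitution lowers the score.
important = [i for i, row in enumerate(deltas) if min(row.values()) < 0]
print(important)  # [3, 4, 5, 6, 7, 8, 9]
```

With a real model, `score_fn` would return the predicted accessibility for one cell type, and the per-position deltas would form the position-by-position activity map.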

Technical Details

Basset uses a three-layer convolutional architecture designed for fixed-length 600 bp DNA sequences represented as one-hot encoded matrices (4 channels × 600 positions). The first convolutional layer applies 300 filters of width 19 bp to capture core transcription factor binding motifs; the second and third layers, with filters of width 11 bp and 7 bp, capture higher-order combinations of motifs and inter-motif spacing rules. Max-pooling layers between convolutional blocks reduce spatial dimensionality while preserving the most salient local features. The convolutional stack feeds into two fully connected hidden layers of 1,000 units each, followed by a 164-unit output layer whose sigmoid activations give an accessibility prediction for each cell type.

The network was trained on 2.2 million genomic sequences drawn from a 164-cell-type ENCODE DNase-seq compendium, with training, validation, and test sets defined by chromosome-level splits to prevent data leakage across closely related sequences. Despite its relatively simple architecture compared to later models, Basset achieved accuracy on par with or exceeding ensemble methods and kernel-based approaches on the ENCODE accessibility prediction benchmark. An important design choice was the joint multi-task learning objective across all 164 cell types, which allowed the model to learn shared regulatory features while maintaining cell-type-specific discrimination. Basenji, Basenji2, and Enformer later adopted and scaled this strategy.
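To make the effect of stacking convolution and pooling concrete, the sketch below traces how the sequence axis shrinks through three blocks. The filter widths (19, 11, 7) come from the description above; the pooling widths (3, 4, 4), the stride of 1, and the absence of padding are assumptions made for this example and are not stated in this section.

```python
def conv_out(length, filter_width):
    """Output length of an unpadded, stride-1 convolution."""
    return length - filter_width + 1


def pool_out(length, pool_width):
    """Output length of non-overlapping max pooling (remainder dropped)."""
    return length // pool_width


def trace_lengths(seq_len, filter_widths, pool_widths):
    """Return the sequence-axis length after each conv + pool block."""
    lengths = []
    length = seq_len
    for fw, pw in zip(filter_widths, pool_widths):
        length = pool_out(conv_out(length, fw), pw)
        lengths.append(length)
    return lengths


# Filter widths from the text; pooling widths are assumed for illustration.
lengths = trace_lengths(600, (19, 11, 7), (3, 4, 4))
print(lengths)  # [194, 46, 10]
```

Under these assumptions the 600-position input is reduced to 10 positions before the fully connected layers, which shows how a few pooling steps let later filters combine motifs that lie far apart in the original window.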

Applications

Basset was applied most prominently to the interpretation of GWAS-identified noncoding variants. By scoring the predicted change in chromatin accessibility between reference and alternate alleles of common SNPs, Basset provided a sequence-based functional filter that enriched for putatively causal variants in trait-associated loci. The authors demonstrated that Basset-predicted accessibility changes were significantly elevated for GWAS index SNPs compared to SNPs in linkage disequilibrium, providing an early proof-of-concept for deep learning-based variant prioritization. Basset was also used to identify which cell types are most affected by regulatory variants — a capability that became foundational for connecting GWAS signals to disease-relevant tissues. Additionally, the Basset framework was adopted for interpreting regulatory elements in cancer genomics, studying enhancer activity changes in differentiation, and generating training data for regulatory sequence design experiments.
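The variant scoring described above amounts to predicting on the reference and alternate sequences and taking the difference per cell type. The sketch below shows that logic under stated assumptions: `toy_predict`, its two cell-type names, and its motif weights are hypothetical stand-ins for a trained model, not part of Basset.

```python
import math


def toy_predict(seq):
    """Hypothetical stand-in for a trained model.

    Returns one accessibility probability per toy cell type, driven by
    counts of a made-up motif for each cell type.
    """
    weights = {"cell_type_A": ("GATA", 1.5), "cell_type_B": ("CACGTG", 2.0)}
    preds = {}
    for cell, (motif, w) in weights.items():
        logit = w * seq.count(motif) - 1.0
        preds[cell] = 1.0 / (1.0 + math.exp(-logit))
    return preds


def variant_effect(ref_seq, pos, alt, predict_fn):
    """Predicted accessibility change (alt minus ref) for each cell type."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    ref_pred = predict_fn(ref_seq)
    alt_pred = predict_fn(alt_seq)
    return {cell: alt_pred[cell] - ref_pred[cell] for cell in ref_pred}


ref = "TTTGATATTTCACGTGTTT"  # toy sequence containing both motifs
effects = variant_effect(ref, 5, "C", toy_predict)  # T>C disrupts GATA

# The cell type with the largest absolute change is the most affected.
most_affected = max(effects, key=lambda cell: abs(effects[cell]))
print(most_affected)  # cell_type_A
```

Ranking cell types by the absolute change is one simple way to ask which tissues a variant is most likely to affect; real analyses would apply the same comparison to a trained model's 164 outputs.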

Impact

Basset is widely recognized as one of the pioneering models of genomic deep learning, establishing the CNN paradigm for sequence-to-function prediction that dominated the field from 2016 through 2021. Its demonstration that convolutional filters learn biologically interpretable transcription factor motifs from raw sequence, without any prior knowledge of binding sites, was highly influential and helped build confidence in the mechanistic relevance of deep learning representations in genomics. The Genome Research paper has accumulated nearly a thousand citations. Together with the contemporaneous DeepSEA (2015), it is cited as a foundational reference in later work on regulatory genomic deep learning, including Basenji, Enformer, and dozens of related models. Kelley subsequently moved to Calico Life Sciences, where he led the development of Basenji and co-developed Enformer with collaborators at DeepMind; both built directly on Basset's training paradigm. A key limitation of Basset is its 600 bp input window, which cannot capture the long-range enhancer-promoter interactions that are central to cell-type-specific gene regulation. Basenji addressed this limitation with dilated convolutions, and Enformer with transformer attention.

Citation

Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks

Journal article

Kelley, D. R., Snoek, J., and Rinn, J. L. (2016) Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26(7): 990-999. A preprint was posted to bioRxiv in 2015.

DOI: 10.1101/gr.200535.115

Recent citations

Papers that recently cited this model.

Construction and validation of a phenotypic prediction model for bacterial gentamicin resistance using deep learning with gene sequences
Jun Li, Siyan Xue, Lingxuan Hou, et al.
Microbiology spectrum · Jul 2026 · 0 citations

Neuronal stop-codon readthrough is associated with ribosome pausing and alters protein localization in Drosophila
Toshiharu Ichinose, K. Sakuma, Hiroto Anbo, et al.
bioRxiv · Jul 2026 · 0 citations

Unlocking the Regulatory Genome: Interpreting the Clinical Impact of Noncoding Variants in Genetic Cardiomyopathies.
Aaron Renberg, S. Coppersmith, Adam S. Helms
Circulation Genomic and Precision Medicine · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

The Human Transcription Factors.
Samuel A. Lambert, A. Jolma, L. Campitelli, et al.
Cell · Oct 2018 · 2.8K citations

Deep learning for healthcare: review, opportunities and challenges
Riccardo Miotto, Fei Wang, Shuang Wang, et al.
Briefings Bioinform. · Nov 2018 · 2.6K citations

Opportunities and obstacles for deep learning in biology and medicine
Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, et al.
bioRxiv · May 2017 · 2K citations · Influential

WILDS: A Benchmark of in-the-Wild Distribution Shifts
Pang Wei Koh, Shiori Sagawa, H. Marklund, et al.
International Conference on Machine Learning · Dec 2020 · 1.8K citations

A guide to machine learning for biologists
Joe G. Greener, S. Kandathil, Lewis Moffat, et al.
Nature reviews. Molecular cell biology · Sep 2021 · 1.6K citations

Citations

Total Citations: 955
Influential: 74
References: 69

GitHub

Stars: 268
Forks: 106
Open Issues: 16
Contributors: 2
Last Push: 5y ago
Language: Jupyter Notebook
License: MIT

Fields of citing research

Computer Science: 36%
Biology: 36%
Medicine: 28%
Mathematics: 1%
Engineering: 1%
Environmental Science: 1%
Chemistry: 1%
Agricultural and Food Sciences: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall score: 80 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 66

Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository · Research Paper · Documentation
