UT Southwestern Medical Center
Explainable sequence model for transcription initiation that identifies the minimal set of sequence rules governing human promoter activity at base-pair resolution.
Puffin is an explainable deep learning sequence model for transcription initiation, developed by Kseniia Dudnyk, Chenlai Shi, and Jian Zhou at UT Southwestern Medical Center's Lyda Hill Department of Bioinformatics. Published in Science in April 2024 after initial posting as a bioRxiv preprint in June 2023, Puffin addresses a longstanding challenge in gene regulation: despite decades of research, the sequence rules governing transcription initiation at human promoters remained incompletely understood. Puffin provides the most detailed and mechanistically transparent account to date of how promoter sequences determine transcription start site (TSS) selection and strength.
The central scientific contribution of Puffin is its prioritization of interpretability over raw predictive power. Unlike large-scale "black box" models such as Enformer that predict regulatory activity from sequence with high accuracy but limited mechanistic transparency, Puffin was specifically designed to decompose promoter activity into a minimal, interpretable set of sequence rules. The model achieves this by adopting an additive, position-specific effect framework: transcription initiation at any human promoter is decomposed into contributions from three classes of sequence features — core transcription factor binding motifs, initiator elements that fine-tune transcription start site selection, and trinucleotide composition that captures residual sequence context — each with a distinct, empirically learned position-by-position effect profile.
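In schematic form, this additive framework expresses the predicted initiation signal at each base as a sum of feature contributions. The notation below is illustrative rather than the paper's own:

```latex
% \hat{y}(p): predicted initiation signal at promoter position p
% s_f(i): match score of sequence feature f at position i
% e_f(d): learned positional effect of feature f at offset d
\hat{y}(p) = \sum_{f \in \mathcal{F}} \sum_{i} s_f(i)\, e_f(p - i)
```

Because no cross-feature interaction terms appear, each feature's contribution to the prediction can be read off directly, which is the source of the model's interpretability.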
The model was trained on tens of thousands of annotated human promoters characterized by CAGE (Cap Analysis of Gene Expression) data, enabling it to learn quantitative position-specific effect curves for each sequence feature. The resulting model explains the vast majority of human promoter activity through these simple additive rules, demonstrating that promoter sequence logic, despite the complexity of the transcription initiation machinery, is surprisingly reducible to a compact set of generalizable principles. The work was featured prominently in UT Southwestern's communications as a demonstration of machine learning illuminating fundamental biology.
Puffin uses a convolutional neural network architecture trained on CAGE-seq data from ENCODE and FANTOM5 to predict transcription initiation signals at base-pair resolution. The key design choice that enables interpretability is an additive model structure: rather than allowing arbitrary nonlinear interactions between sequence features, Puffin constrains predicted activity to be a sum of position-specific contributions from learned sequence motifs. Concretely, the model learns a set of position-weight-matrix-like motif detectors (the convolutional filters) together with a corresponding set of position-specific effect curves, which specify how a motif match at each position relative to the TSS contributes to the overall transcription initiation signal. The model also learns initiator elements, short sequence patterns specifically associated with TSS selection, and trinucleotide features that capture local sequence composition. Training used CAGE-seq from hundreds of human cell types and focused on the 500 bp proximal promoter region around annotated TSSs. The model achieves high Pearson correlations with measured CAGE signal on held-out promoters while retaining strict interpretability through the additive constraint. For validation, the authors verified that the learned motifs correspond to known transcription factor binding sites and that the position-specific effect curves match the empirically known positional preferences of promoter elements such as TATA boxes (approximately 30 bp upstream of the TSS), initiator elements (at the TSS itself), and downstream promoter elements (approximately 30 bp downstream).
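As a minimal sketch of this additive structure, assuming toy dimensions and random parameters (none of the filters, effect curves, or sizes below reflect Puffin's actual learned model), the forward computation can be written in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration:
# 2 motif detectors, a 1 kb promoter window, motifs of width 8.
N_MOTIFS, SEQ_LEN, MOTIF_W = 2, 1000, 8

# One-hot encode a random DNA sequence (rows = A, C, G, T).
seq = np.eye(4)[rng.integers(0, 4, SEQ_LEN)].T          # shape (4, SEQ_LEN)

# PWM-like motif detectors, analogous to convolutional filters.
motifs = rng.normal(size=(N_MOTIFS, 4, MOTIF_W))

# Position-specific effect curves: how much a motif match at each
# offset relative to the TSS contributes to the initiation signal.
effects = rng.normal(size=(N_MOTIFS, SEQ_LEN - MOTIF_W + 1))

def motif_scores(seq, motif):
    """Slide one PWM-like filter along the one-hot sequence ('valid' scan)."""
    w = motif.shape[1]
    return np.array([(seq[:, i:i + w] * motif).sum()
                     for i in range(seq.shape[1] - w + 1)])

# Additive constraint: predicted signal is a sum over features of
# (match score at each position) x (that feature's positional effect).
contributions = np.stack([motif_scores(seq, m) * e
                          for m, e in zip(motifs, effects)])
prediction = contributions.sum(axis=0)
```

Because the prediction is a plain sum, removing one feature's contribution changes the output by exactly that contribution, which is what makes per-feature attribution exact rather than approximate.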
Puffin is most directly applicable to promoter biology and gene regulation research where mechanistic understanding of how sequences drive transcription initiation is the goal. Researchers can use the web server or GitHub code to analyze any human promoter sequence, receive a base-pair-resolution map of predicted transcription initiation, and obtain an attribution of how much each sequence feature at each position contributes to the predicted activity. This makes Puffin immediately useful for interpreting promoter mutations — whether naturally occurring variants from population genomics or clinical sequencing, or engineered mutations from promoter mutagenesis experiments. For synthetic biology applications, Puffin's explicit position-specific rules enable rational design of promoters with desired activity levels and TSS architectures by identifying which sequence elements to include and where to place them. The model is also valuable for studying promoter evolution, as the interpretable sequence rules can be used to compare promoter architectures across species and identify conserved versus divergent regulatory elements.
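To illustrate why an additive model makes promoter-variant interpretation so direct, here is a toy example (again with random, illustrative parameters rather than Puffin's): under additivity, a point mutation's effect on the predicted signal decomposes exactly into per-feature contribution changes, and those changes are confined to positions whose motif windows overlap the mutated base.

```python
import numpy as np

def feature_contributions(seq, motifs, effects):
    """Per-feature contribution profiles under a toy additive model."""
    profiles = []
    for motif, effect in zip(motifs, effects):
        w = motif.shape[1]
        # Match score of this PWM-like filter at every valid offset.
        scores = np.array([(seq[:, i:i + w] * motif).sum()
                           for i in range(seq.shape[1] - w + 1)])
        profiles.append(scores * effect)  # scale by positional effect curve
    return np.stack(profiles)

rng = np.random.default_rng(1)
L, W, F = 200, 6, 3                       # toy sequence length, motif width, features
motifs = rng.normal(size=(F, 4, W))
effects = rng.normal(size=(F, L - W + 1))

wt = np.eye(4)[rng.integers(0, 4, L)].T   # wild-type one-hot sequence (A,C,G,T rows)
mut = wt.copy()
mut[:, 100] = np.eye(4)[0]                # point mutation: force base 100 to A

# Additivity: the total change in predicted signal is exactly the sum of
# per-feature changes, so the variant effect is attributable feature by feature.
delta = (feature_contributions(mut, motifs, effects)
         - feature_contributions(wt, motifs, effects))
total_change = delta.sum(axis=0)
```

Only offsets within W - 1 bases upstream of the mutated position can change, since those are the only motif windows that cover it; everything else in the attribution map is untouched.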
Puffin's publication in Science reflects the field's recognition that interpretable, rule-based models can provide biological insights that are inaccessible to black-box deep learning approaches despite the latter's superior predictive accuracy on held-out data. The demonstration that human promoter activity can be largely explained by a compact set of additive, position-specific sequence rules was scientifically significant: it provided quantitative validation for the longstanding hypothesis that transcription initiation is governed by a relatively simple "promoter code" based on the combinatorial arrangement of a small number of core sequence elements. This finding has implications for our understanding of how gene regulation evolved and how regulatory sequence mutations lead to disease. The web server at tss.zhoulab.io has made the model accessible to wet-lab biologists without computational expertise, broadening its adoption beyond the computational biology community. The model represents the Jian Zhou lab's approach to developing explainable AI for regulatory genomics, complementing their earlier work on predicting noncoding variant effects with models such as Sei and Beluga.