UT Southwestern Medical Center
Explainable sequence model for transcription initiation that identifies the minimal set of sequence rules governing human promoter activity at base-pair resolution.
Puffin is an explainable deep learning sequence model for transcription initiation, developed by Kseniia Dudnyk, Chenlai Shi, and Jian Zhou at UT Southwestern Medical Center's Lyda Hill Department of Bioinformatics. Published in Science in April 2024 after initial posting as a bioRxiv preprint in June 2023, Puffin addresses a longstanding challenge in gene regulation: despite decades of research, the sequence rules governing transcription initiation at human promoters remained incompletely understood. Puffin provides the most detailed and mechanistically transparent account to date of how promoter sequences determine transcription start site (TSS) selection and strength.
The central scientific contribution of Puffin is its prioritization of interpretability over raw predictive power. Unlike large-scale "black box" models such as Enformer that predict regulatory activity from sequence with high accuracy but limited mechanistic transparency, Puffin was specifically designed to decompose promoter activity into a minimal, interpretable set of sequence rules. The model achieves this by adopting an additive, position-specific effect framework: transcription initiation at any human promoter is decomposed into contributions from three classes of sequence features — core transcription factor binding motifs, initiator elements that fine-tune transcription start site selection, and trinucleotide composition that captures residual sequence context — each with a distinct, empirically learned position-by-position effect profile.
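In schematic form, this additive framework expresses the predicted initiation signal at each base as a sum of feature contributions. The notation below is illustrative rather than the paper's own:

```latex
% \hat{y}(p): predicted initiation signal at promoter position p
% s_f(i): match score of sequence feature f at position i
% e_f(d): learned positional effect of feature f at offset d
\hat{y}(p) = \sum_{f \in \mathcal{F}} \sum_{i} s_f(i)\, e_f(p - i)
```

Because no cross-feature interaction terms appear, each feature's contribution to the prediction can be read off directly, which is the source of the model's interpretability.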
The model was trained on tens of thousands of annotated human promoters characterized by CAGE (Cap Analysis of Gene Expression) data, enabling it to learn quantitative position-specific effect curves for each sequence feature. The resulting model explains the vast majority of human promoter activity through these simple additive rules, demonstrating that promoter sequence logic, despite the complexity of the transcription initiation machinery, is surprisingly reducible to a compact set of generalizable principles. The work was featured prominently in UT Southwestern's communications as a demonstration of machine learning illuminating fundamental biology.
Puffin uses a convolutional neural network architecture trained on CAGE-seq data from ENCODE and FANTOM5 to predict transcription initiation signals at base-pair resolution. The key design choice that enables interpretability is an additive model structure: rather than allowing arbitrary nonlinear interactions between sequence features, Puffin constrains predicted activity to be a sum of position-specific contributions from learned sequence motifs. Concretely, the model learns a set of position-weight-matrix-like motif detectors (the convolutional filters) together with a corresponding set of position-specific effect curves, which specify how a motif match at each position relative to the TSS contributes to the overall transcription initiation signal. The model also learns initiator elements, short sequence patterns specifically associated with TSS selection, and trinucleotide features that capture local sequence composition. Training used CAGE-seq from hundreds of human cell types and focused on the 500 bp proximal promoter region around annotated TSSs. The model achieves high Pearson correlations with measured CAGE signal on held-out promoters while retaining strict interpretability through the additive constraint. For validation, the authors verified that the learned motifs correspond to known transcription factor binding sites and that the position-specific effect curves match the empirically known positional preferences of promoter elements such as TATA boxes (approximately 30 bp upstream of the TSS), initiator elements (at the TSS itself), and downstream promoter elements (approximately 30 bp downstream).
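As a minimal sketch of this additive structure, assuming toy dimensions and random parameters (none of the filters, effect curves, or sizes below reflect Puffin's actual learned model), the forward computation can be written in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration:
# 2 motif detectors, a 1 kb promoter window, motifs of width 8.
N_MOTIFS, SEQ_LEN, MOTIF_W = 2, 1000, 8

# One-hot encode a random DNA sequence (rows = A, C, G, T).
seq = np.eye(4)[rng.integers(0, 4, SEQ_LEN)].T          # shape (4, SEQ_LEN)

# PWM-like motif detectors, analogous to convolutional filters.
motifs = rng.normal(size=(N_MOTIFS, 4, MOTIF_W))

# Position-specific effect curves: how much a motif match at each
# offset relative to the TSS contributes to the initiation signal.
effects = rng.normal(size=(N_MOTIFS, SEQ_LEN - MOTIF_W + 1))

def motif_scores(seq, motif):
    """Slide one PWM-like filter along the one-hot sequence ('valid' scan)."""
    w = motif.shape[1]
    return np.array([(seq[:, i:i + w] * motif).sum()
                     for i in range(seq.shape[1] - w + 1)])

# Additive constraint: predicted signal is a sum over features of
# (match score at each position) x (that feature's positional effect).
contributions = np.stack([motif_scores(seq, m) * e
                          for m, e in zip(motifs, effects)])
prediction = contributions.sum(axis=0)
```

Because the prediction is a plain sum, removing one feature's contribution changes the output by exactly that contribution, which is what makes per-feature attribution exact rather than approximate.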
Puffin is most directly applicable to promoter biology and gene regulation research where mechanistic understanding of how sequences drive transcription initiation is the goal. Researchers can use the web server or GitHub code to analyze any human promoter sequence, receive a base-pair-resolution map of predicted transcription initiation, and obtain an attribution of how much each sequence feature at each position contributes to the predicted activity. This makes Puffin immediately useful for interpreting promoter mutations — whether naturally occurring variants from population genomics or clinical sequencing, or engineered mutations from promoter mutagenesis experiments. For synthetic biology applications, Puffin's explicit position-specific rules enable rational design of promoters with desired activity levels and TSS architectures by identifying which sequence elements to include and where to place them. The model is also valuable for studying promoter evolution, as the interpretable sequence rules can be used to compare promoter architectures across species and identify conserved versus divergent regulatory elements.
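To illustrate why an additive model makes promoter-variant interpretation so direct, here is a toy example (again with random, illustrative parameters rather than Puffin's): under additivity, a point mutation's effect on the predicted signal decomposes exactly into per-feature contribution changes, and those changes are confined to positions whose motif windows overlap the mutated base.

```python
import numpy as np

def feature_contributions(seq, motifs, effects):
    """Per-feature contribution profiles under a toy additive model."""
    profiles = []
    for motif, effect in zip(motifs, effects):
        w = motif.shape[1]
        # Match score of this PWM-like filter at every valid offset.
        scores = np.array([(seq[:, i:i + w] * motif).sum()
                           for i in range(seq.shape[1] - w + 1)])
        profiles.append(scores * effect)  # scale by positional effect curve
    return np.stack(profiles)

rng = np.random.default_rng(1)
L, W, F = 200, 6, 3                       # toy sequence length, motif width, features
motifs = rng.normal(size=(F, 4, W))
effects = rng.normal(size=(F, L - W + 1))

wt = np.eye(4)[rng.integers(0, 4, L)].T   # wild-type one-hot sequence (A,C,G,T rows)
mut = wt.copy()
mut[:, 100] = np.eye(4)[0]                # point mutation: force base 100 to A

# Additivity: the total change in predicted signal is exactly the sum of
# per-feature changes, so the variant effect is attributable feature by feature.
delta = (feature_contributions(mut, motifs, effects)
         - feature_contributions(wt, motifs, effects))
total_change = delta.sum(axis=0)
```

Only offsets within W - 1 bases upstream of the mutated position can change, since those are the only motif windows that cover it; everything else in the attribution map is untouched.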
Puffin's publication in Science reflects the field's recognition that interpretable, rule-based models can provide biological insights that are inaccessible to black-box deep learning approaches despite the latter's superior predictive accuracy on held-out data. The demonstration that human promoter activity can be largely explained by a compact set of additive, position-specific sequence rules was scientifically significant: it provided quantitative validation for the longstanding hypothesis that transcription initiation is governed by a relatively simple "promoter code" based on the combinatorial arrangement of a small number of core sequence elements. This finding has implications for our understanding of how gene regulation evolved and how regulatory sequence mutations lead to disease. The web server at tss.zhoulab.io has made the model accessible to wet-lab biologists without computational expertise, broadening its adoption beyond the computational biology community. The model represents the Jian Zhou lab's approach to developing explainable AI for regulatory genomics, complementing their earlier work on predicting noncoding variant effects with models such as Sei and Beluga.