Transformer model predicting context-specific epigenomic signals across cell types using DNA sequence and transcription factor activity profiles.
EpiGePT is a pretrained transformer-based language model for predicting context-specific human epigenomic signals from DNA sequence. Developed in the Rui Jiang laboratory at Tsinghua University in collaboration with Stanford University, the model addresses a fundamental challenge in epigenomics: the same genomic sequence gives rise to dramatically different chromatin states across cell types, because the regulatory landscape is shaped by the transcription factors expressed in each cellular context. EpiGePT encodes that context explicitly, enabling cross-cell-type prediction of chromatin accessibility and histone modifications without requiring experimental data from the target cell type.
The core innovation is a dual-input design that combines a one-hot encoded DNA sequence spanning 128 kilobase pairs with a transcription factor profile vector describing the binding status and expression levels of 711 key TFs in the cell type of interest. This context vector is integrated at the token-embedding level, making every layer of the transformer aware of the cellular environment. The model was first released as a bioRxiv preprint in July 2023 and subsequently published in Genome Biology in December 2024.
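The dual-input encoding described above can be sketched in a few lines. This is an illustrative example, not the authors' API: `one_hot_encode` and the variable names are hypothetical, and the TF profile is filled with random scores as a stand-in for real binding/expression values.

```python
import numpy as np

# Hypothetical sketch of EpiGePT's two inputs: a one-hot encoded 128 kbp
# DNA window plus a 711-dim transcription factor context vector.
# Function and variable names here are illustrative, not from the paper.

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """One-hot encode a DNA string; unknown bases (e.g. 'N') become all-zero rows."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        idx = BASES.get(base)
        if idx is not None:
            out[i, idx] = 1.0
    return out

# Toy example: one 128,000 bp window and a random TF activity profile
# standing in for the measured binding/expression scores of 711 key TFs.
rng = np.random.default_rng(0)
seq = "".join(rng.choice(list("ACGT"), size=128_000))
dna = one_hot_encode(seq)                         # shape (128000, 4)
tf_profile = rng.random(711).astype(np.float32)   # shape (711,)

print(dna.shape, tf_profile.shape)
```

Because the TF vector is concatenated into every token rather than appended as a single extra input, swapping in a different cell type's profile changes the prediction for the same DNA window.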
EpiGePT was benchmarked against Enformer — previously the leading sequence-to-function model — on 78 epigenomic tracks across 19 unseen cell types, achieving a mean Pearson correlation coefficient of 0.510 versus Enformer's 0.440, a relative improvement of roughly 16%. For DNase-seq (chromatin accessibility) specifically, EpiGePT reached a mean PCC of 0.710 compared to 0.455 for Enformer, demonstrating the value of explicit cellular context conditioning.
EpiGePT comprises 71.3 million parameters organized into four sequential modules. The sequence module applies five convolutional blocks with max-pooling to one-hot encoded DNA input, reducing the 128 kbp window into a compact sequence feature representation of 256 dimensions per bin. The TF module embeds the 711-dimensional transcription factor profile and concatenates it with the sequence features to form 968-dimensional tokens. The transformer module then processes these tokens through 16 stacked encoder layers (for DNase-seq prediction) or 12 layers (for histone modifications), each with 8 attention heads; this depth allows the self-attention mechanism to capture long-range regulatory interactions across the full genomic window. A final multi-task prediction module uses a fully connected layer to produce simultaneous predictions for all eight epigenomic marks.
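The flow of tensor shapes through the four modules can be traced explicitly. This is shape bookkeeping only, under one stated assumption: the bin size of 128 bp (giving 1,000 tokens per window) is an illustrative choice, since the paper specifies five convolutional blocks with pooling but the exact pooling factors are not reproduced here. The dimensions themselves (256, 711, 968, 8) are taken from the text.

```python
# Illustrative shape bookkeeping for EpiGePT's four modules.
# ASSUMPTION: a 128 bp bin size after the conv/pooling stack; the paper's
# exact pooling factors may differ. Other dimensions follow the text above.

SEQ_LEN = 128_000    # input window, bp
SEQ_FEATS = 256      # per-bin sequence features after the sequence module
TF_DIM = 711         # transcription factor profile dimension
TOKEN_DIM = 968      # token width fed to the transformer (per the text)
N_TASKS = 8          # epigenomic marks predicted jointly

def module_shapes(bin_size: int = 128) -> dict:
    """Trace tensor shapes through the pipeline for one input window."""
    n_bins = SEQ_LEN // bin_size                  # tokens after pooling
    return {
        "one_hot_input": (SEQ_LEN, 4),            # sequence module input
        "sequence_features": (n_bins, SEQ_FEATS),  # conv output per bin
        "tokens": (n_bins, TOKEN_DIM),             # seq features + TF context
        "transformer_output": (n_bins, TOKEN_DIM), # after 12-16 encoder layers
        "predictions": (n_bins, N_TASKS),          # multi-task head output
    }

print(module_shapes())
```

Note how the TF context inflates every token before the transformer runs, so all 16 (or 12) layers of self-attention operate on context-conditioned representations rather than on sequence features alone.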
Training data were drawn from the ENCODE project, covering DNase-seq across 129 cellular contexts (1,175,374 genomic regions) and eight histone modification signals from 104 cell types aligned to the hg38 reference genome. Missing signal tracks across cell types were handled through a masked multi-task training strategy, allowing the model to leverage partially observed data without discarding any cell type. The training scheme treats context-region pairs as independent instances, which substantially increases the effective number of training examples compared to training on genomic regions alone.
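A minimal sketch of what masking unobserved tracks looks like in practice, assuming a mean-squared-error objective — the paper's exact loss form may differ, and `masked_mse` is a hypothetical name:

```python
import numpy as np

# Minimal sketch of a masked multi-task regression loss. ASSUMPTION: an MSE
# objective with a per-track observation mask; the paper's exact loss and
# masking granularity may differ.

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error over observed entries only.

    pred, target: (n_bins, n_tasks) predicted and measured signal values
    mask:         (n_tasks,) 1.0 where this cell type has the assay, else 0.0
    """
    mask = np.broadcast_to(mask, pred.shape)
    err = (pred - target) ** 2 * mask
    denom = mask.sum()
    return float(err.sum() / denom) if denom else 0.0

# A cell type with only 5 of 8 histone-mark tracks measured contributes
# gradient signal on those 5 tracks; the missing 3 are ignored, not imputed.
```

Masking at the loss level is what lets every cell type with at least one assay enter training, which is how the context-region pairing multiplies the effective dataset size.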
EpiGePT is designed for researchers studying gene regulation across diverse cellular contexts. Computational biologists can use it to impute missing epigenomic assays for cell types that lack experimental data, reducing sequencing costs when profiling rare or difficult-to-culture cell populations. The variant scoring capability makes EpiGePT applicable to functional annotation of GWAS hits, particularly for noncoding variants that alter transcription factor binding or chromatin accessibility — EpiGePT achieves a mean auPRC of 0.922 for eQTL classification in lung tissue. Researchers studying enhancer biology can leverage the 3D genome module to predict enhancer-promoter interactions and prioritize candidate regulatory elements for experimental validation. The web server makes the model accessible to wet-lab groups without high-performance computing resources.
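Variant scoring of the kind used for eQTL classification is typically done by in-silico mutagenesis: predict epigenomic signals for the reference and alternate alleles and score the difference. The sketch below illustrates that pattern; `predict_signals` is a placeholder stand-in for the real model's forward pass, and all names here are hypothetical.

```python
import numpy as np

# Hedged sketch of variant scoring by in-silico mutagenesis: predict per-track
# signals for reference and alternate alleles, score the delta. The placeholder
# `predict_signals` stands in for an actual EpiGePT forward pass.

BASES = "ACGT"

def predict_signals(one_hot: np.ndarray) -> np.ndarray:
    """Placeholder model returning one score per epigenomic track (8 here)."""
    weights = np.linspace(0.1, 1.0, 4 * 8).reshape(4, 8)  # fixed toy weights
    return one_hot.mean(axis=0) @ weights                 # shape (8,)

def variant_effect(seq: str, pos: int, alt: str) -> np.ndarray:
    """Score a single-nucleotide variant as alt-allele minus ref-allele prediction."""
    ref = np.zeros((len(seq), 4), dtype=np.float32)
    for i, b in enumerate(seq):
        ref[i, BASES.index(b)] = 1.0
    alt_oh = ref.copy()
    alt_oh[pos] = 0.0                     # clear the reference base...
    alt_oh[pos, BASES.index(alt)] = 1.0   # ...and set the alternate base
    return predict_signals(alt_oh) - predict_signals(ref)

delta = variant_effect("ACGTACGT", pos=3, alt="A")
print(delta.shape)  # one effect size per predicted epigenomic mark
```

Because EpiGePT also takes the TF profile as input, the same variant can be scored under different cellular contexts, which is what makes tissue-specific eQTL annotation possible.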
EpiGePT represents a significant methodological advance in context-aware regulatory genomics modeling. Its explicit treatment of transcription factor activity as a conditioning signal, rather than training a separate model per cell type or conflating all cellular contexts, offers a principled framework for cross-cell-type generalization. The model's performance improvements over Enformer on held-out cell types demonstrate that incorporating cellular context into the architecture itself is more effective than relying on sequence alone. A notable limitation is that predictions are conditioned on TF expression and binding profiles that must be measured experimentally, so the model cannot be applied in fully data-free settings for novel cell types without at least RNA-seq or motif-based TF activity estimates. The model and web server are freely available for non-commercial research use, and the codebase supports fine-tuning on user-provided datasets.
Gao, Z., et al. (2024) EpiGePT: a pretrained transformer-based language model for context-specific human epigenomics. Genome Biology.
DOI: 10.1186/s13059-024-03449-7
Gao, Z., et al. (2023) EpiGePT: a Pretrained Transformer model for epigenomics. bioRxiv.
DOI: 10.1101/2023.07.15.549134