bio.rodeo

DNA & Gene

EpiGePT

Tsinghua University

Transformer model predicting context-specific epigenomic signals across cell types using DNA sequence and transcription factor activity profiles.

Released: 2023
Parameters: 71,300,000

Overview

EpiGePT is a pretrained transformer-based language model for predicting context-specific human epigenomic signals from DNA sequence. Developed in the Rui Jiang laboratory at Tsinghua University in collaboration with Stanford University, the model addresses a fundamental challenge in epigenomics: the same genomic sequence gives rise to dramatically different chromatin states across cell types, because the regulatory landscape is shaped by the transcription factors expressed in each cellular context. EpiGePT encodes that context explicitly as a transcription factor profile, enabling cross-cell-type prediction of chromatin accessibility and histone modifications without requiring measured epigenomic tracks from the target cell type; only the TF profile for that cell type is needed.

The core innovation is a dual-input design that combines a one-hot encoded DNA sequence spanning 128 kilobase pairs with a transcription factor profile vector describing the binding status and expression levels of 711 key TFs in the cell type of interest. This context vector is integrated at the token-embedding level, making every layer of the transformer aware of the cellular environment. The model was first released as a bioRxiv preprint in July 2023 and subsequently published in Genome Biology in December 2024.
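As a rough illustration of the dual-input design, the two inputs can be pictured as a pair of arrays. This is a minimal sketch, not the authors' preprocessing code: the helper name `one_hot_encode`, the A/C/G/T channel order, and the all-zero row for ambiguous bases are assumptions, and the TF profile is filled with random placeholder values.

```python
import numpy as np

SEQ_LEN = 128_000  # 128 kbp input window, as described above
N_TFS = 711        # number of TFs in the context vector

# Assumed channel order; the source does not specify one.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) float32 array; unknown bases such as N stay all-zero."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded

rng = np.random.default_rng(0)
dna = "".join(rng.choice(list("ACGT"), size=SEQ_LEN))  # random stand-in sequence
x_seq = one_hot_encode(dna)                  # shape (128000, 4)
x_tf = rng.random(N_TFS).astype(np.float32)  # placeholder TF profile, shape (711,)
```

The same DNA array would be paired with a different TF profile for each cell type of interest, which is what makes the prediction context-specific.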

EpiGePT was benchmarked against Enformer, previously the leading sequence-to-function model, on 78 epigenomic tracks across 19 unseen cell types. It achieved a mean Pearson correlation coefficient (PCC) of 0.510 versus Enformer's 0.440, an absolute gain of 0.070. For DNase-seq (chromatin accessibility) specifically, EpiGePT reached a mean PCC of 0.710 compared to 0.455 for Enformer, which points to the value of explicit cellular context conditioning.
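For readers unfamiliar with the metric, a mean PCC of this kind is typically computed by correlating predicted and observed signal within each track and then averaging over tracks. The sketch below uses toy numbers only; the helper names `pearson` and `mean_track_pcc` are illustrative, and the exact averaging scheme used in the benchmark is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_track_pcc(observed, predicted):
    """Average the per-track PCC; both arguments map track name -> list of values."""
    return sum(pearson(observed[t], predicted[t]) for t in observed) / len(observed)

# Toy signal values for two tracks over four genomic bins (not real data).
observed = {"DNase": [0.1, 0.4, 0.9, 0.3], "H3K27ac": [1.0, 0.2, 0.5, 0.7]}
predicted = {"DNase": [0.2, 0.5, 0.8, 0.2], "H3K27ac": [0.9, 0.1, 0.6, 0.6]}
score = mean_track_pcc(observed, predicted)
```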

Key Features

  • Context-dependent prediction: A 711-dimensional transcription factor profile vector, encoding both TF binding status and RNA expression, conditions all predictions on the specific cellular environment rather than producing a single universal epigenomic map.
  • Eight epigenomic signal types: The model jointly predicts DNase-seq (chromatin accessibility), CTCF binding, and six histone modifications (H3K27ac, H3K4me3, H3K36me3, H3K27me3, H3K9me3, H3K4me1) in a multi-task learning framework.
  • 3D genome integration: A novel cosine-similarity loss term allows the model to incorporate Hi-C and HiChIP chromatin contact data during training, enabling enhancer-promoter interaction prediction without architectural changes.
  • Wide genomic context: Accepts 128 kbp input windows (1,000 bins of 128 bp), providing sufficient range to capture distal regulatory elements and long-range chromatin interactions.
  • Variant effect prediction: The model scores single-nucleotide variants by computing the difference in predicted epigenomic signal between reference and alternate alleles, supporting eQTL interpretation and pathogenic variant prioritization.
  • Web server access: An online prediction interface at health.tsinghua.edu.cn/epigept allows users to obtain epigenomic predictions for specified genomic regions and cell types without local installation.
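The variant effect feature above amounts to predicting twice and subtracting. The following is a minimal sketch under stated assumptions: `variant_effect` and `toy_predict` are hypothetical names, the alternate-minus-reference sign convention is a choice made here, and `toy_predict` is a GC-content stand-in rather than EpiGePT itself.

```python
from typing import Callable, List

Predictor = Callable[[str], List[float]]

def toy_predict(seq: str, bin_size: int = 4) -> List[float]:
    """Stand-in for a trained model: GC fraction per bin (NOT EpiGePT)."""
    return [
        sum(base in "GC" for base in seq[i:i + bin_size]) / bin_size
        for i in range(0, len(seq) - bin_size + 1, bin_size)
    ]

def variant_effect(predict: Predictor, ref_seq: str, offset: int,
                   ref: str, alt: str) -> List[float]:
    """Per-bin difference (alternate minus reference) in predicted signal for one SNV."""
    if ref_seq[offset] != ref:
        raise ValueError("reference allele does not match the sequence at this offset")
    alt_seq = ref_seq[:offset] + alt + ref_seq[offset + 1:]
    ref_pred = predict(ref_seq)
    alt_pred = predict(alt_seq)
    return [a - r for r, a in zip(ref_pred, alt_pred)]

# A T>G substitution at offset 1 raises the toy signal in the first bin only.
delta = variant_effect(toy_predict, "ATATGCGCATAT", offset=1, ref="T", alt="G")
```

With a real model, `predict` would also take the TF profile of the cell type, so the same variant can receive different scores in different cellular contexts.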

Technical Details

EpiGePT comprises 71.3 million parameters organized into four sequential modules. The sequence module applies five convolutional blocks with max-pooling to one-hot encoded DNA input, reducing the 128 kbp window into a compact sequence feature representation of 256 dimensions per bin. The TF module embeds the 711-dimensional transcription factor profile and concatenates it with the sequence features to form 968-dimensional tokens. The transformer module then processes these tokens through 16 stacked encoder layers (for DNase-seq prediction) or 12 layers (for histone modifications), each with 8 attention heads; this depth allows the self-attention mechanism to capture long-range regulatory interactions across the full genomic window. A final multi-task prediction module uses a fully connected layer to produce simultaneous predictions for all eight epigenomic marks.
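The shapes described above can be checked with a short sketch. Note that 256 + 711 = 967, one short of the stated 968; since 968 divides evenly across 8 attention heads while 967 does not, the sketch assumes a single zero-padded column, but that is an inference and not something the description confirms. It is also an assumption that the TF features are laid out per bin as a (1000, 711) array. All values are random placeholders.

```python
import numpy as np

N_BINS, BIN_BP = 1000, 128
SEQ_DIM, N_TFS = 256, 711
TOKEN_DIM, N_HEADS = 968, 8

rng = np.random.default_rng(0)
# Stand-in for the output of the convolutional sequence module.
seq_features = rng.standard_normal((N_BINS, SEQ_DIM)).astype(np.float32)
# Stand-in for per-bin TF features (assumed layout).
tf_features = rng.random((N_BINS, N_TFS)).astype(np.float32)

concat = np.concatenate([seq_features, tf_features], axis=1)  # (1000, 967)
pad_width = TOKEN_DIM - concat.shape[1]                       # 1 (assumed zero padding)
tokens = np.pad(concat, ((0, 0), (0, pad_width)))             # (1000, 968)

head_dim = TOKEN_DIM // N_HEADS   # 121 features per attention head
window_bp = N_BINS * BIN_BP       # 128,000 bp covered by the 1,000 tokens
```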

Training data were drawn from the ENCODE project, covering DNase-seq across 129 cellular contexts (1,175,374 genomic regions) and all eight epigenomic signals (DNase-seq, CTCF, and six histone modifications) from 104 cell types, aligned to the hg38 reference genome. Missing signal tracks across cell types were handled through a masked multi-task training strategy, allowing the model to leverage partially observed data without discarding any cell type. The training scheme treats context-region pairs as independent instances, which substantially increases the effective number of training examples compared to training on genomic regions alone.
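One common way to realize a masked multi-task objective is to average the error only over entries that were actually measured. The sketch below assumes a mean-squared-error loss and NaN as the marker for missing tracks; neither detail is confirmed by the description above, and `masked_mse` is an illustrative name.

```python
import numpy as np

def masked_mse(pred: np.ndarray, target: np.ndarray, observed: np.ndarray) -> float:
    """Mean squared error over observed (bin, track) entries only."""
    observed = observed.astype(bool)
    n_observed = int(observed.sum())
    if n_observed == 0:
        return 0.0  # nothing measured for this example, so it contributes no loss
    # Replace NaN placeholders so they cannot propagate; they are masked out anyway.
    sq_err = (pred - np.nan_to_num(target)) ** 2
    return float(sq_err[observed].sum() / n_observed)

# Two bins, three tracks; the middle track was never assayed in this cell type.
pred = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
target = np.array([[1.0, np.nan, 1.0], [4.0, np.nan, 4.0]])
mask = ~np.isnan(target)
loss = masked_mse(pred, target, mask)
```

Because the unobserved track is excluded from both the numerator and the denominator, a cell type with only a few assays still contributes a well-scaled training signal.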

Applications

EpiGePT is designed for researchers studying gene regulation across diverse cellular contexts. Computational biologists can use it to impute missing epigenomic assays for cell types that lack experimental data, reducing sequencing costs when profiling rare or difficult-to-culture cell populations. The variant scoring capability makes EpiGePT applicable to functional annotation of GWAS hits, particularly for noncoding variants that alter transcription factor binding or chromatin accessibility; for eQTL classification in lung tissue, EpiGePT achieves a mean auPRC of 0.922. Researchers studying enhancer biology can leverage the 3D genome module to predict enhancer-promoter interactions and prioritize candidate regulatory elements for experimental validation. The web server makes the model accessible to wet-lab groups without high-performance computing resources.
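The cosine-similarity loss listed under Key Features can be sketched as follows. The description does not say which quantities are compared, so this example assumes a bin-by-bin score matrix produced by the model (for instance, attention weights) is compared with a Hi-C or HiChIP contact matrix, and that the loss takes the form 1 minus cosine similarity. Both points, and the function name, are assumptions.

```python
import numpy as np

def cosine_similarity_loss(model_scores, contacts, eps: float = 1e-8) -> float:
    """Return 1 - cosine similarity between two flattened bin-by-bin matrices."""
    a = np.asarray(model_scores, dtype=np.float64).ravel()
    b = np.asarray(contacts, dtype=np.float64).ravel()
    cos = float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + eps)
    return 1.0 - cos

# Toy 3x3 contact map (not real Hi-C data).
contacts = np.array([[1.0, 0.5, 0.0],
                     [0.5, 1.0, 0.5],
                     [0.0, 0.5, 1.0]])
aligned = cosine_similarity_loss(2.0 * contacts, contacts)                # close to 0
unrelated = cosine_similarity_loss(np.eye(3)[::-1] * [1, 0, 1], contacts) # equals 1
```

A loss of this shape is insensitive to the overall scale of either matrix, which is convenient when contact counts and model scores live on different numeric ranges.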

Impact

EpiGePT's main methodological contribution is its explicit treatment of transcription factor activity as a conditioning signal. Compared with training a separate model per cell type or conflating all cellular contexts, this offers a principled framework for cross-cell-type generalization. The model's gains over Enformer on held-out cell types suggest that building cellular context into the architecture is more effective than relying on sequence alone. A notable limitation is that predictions are conditioned on TF expression and binding profiles that must be measured or estimated, so the model cannot be applied to a novel cell type without at least RNA-seq or motif-based TF activity estimates. The model and web server are freely available for non-commercial research use, and the codebase supports fine-tuning on user-provided datasets.

Citations

Gao, Z., et al. (2024). EpiGePT: a pretrained transformer-based language model for context-specific human epigenomics. Genome Biology. DOI: 10.1186/s13059-024-03449-7

Gao, Z., et al. (2023). EpiGePT: a Pretrained Transformer model for epigenomics. bioRxiv (preprint). DOI: 10.1101/2023.07.15.549134

Metrics

GitHub

Stars: 33
Forks: 6
Open Issues: 6
Contributors: 1
Last Push: 10mo ago
Language: Python
License: MIT

Citations

Total Citations: 10
Influential: 1
References: 19

Tags

regulatory genomics, foundation model, chromatin, epigenomics, transcription factors
