BERT-based model pretrained on 15-state ROADMAP chromatin annotations across 127 human cell types to uncover chromatin-state motifs and predict gene expression.
ChromBERT is a BERT-based foundation model that learns the "grammar" of chromatin states across the human genome to uncover recurring chromatin-state motifs. Developed by Seohyun Lee and colleagues in the Nakato lab at the University of Tokyo and first posted to bioRxiv in July 2024 (revised 2026), it adapts the masked-language-model approach of DNABERT to sequences of chromatin-state annotations rather than raw nucleotides.
The model is pretrained on 15-state chromatin annotations for 127 human cell and tissue types from the ROADMAP Epigenomics consortium, treating each genomic position's chromatin state as a token. By combining this transformer pretraining with Dynamic Time Warping (DTW) for clustering, ChromBERT identifies recurring chromatin-state patterns that had not previously been characterized, termed chromatin-state "motifs", and provides representations that can be fine-tuned for downstream genomic tasks.
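The tokenization idea can be illustrated with a minimal sketch. It is not the released ChromBERT preprocessing code: the mapping of the 15 states to the letters A to O, the overlapping k-mer tokens (borrowed from the DNABERT convention), the value k=4, and the independent per-token masking are all assumptions made for illustration, and the function names are hypothetical.

```python
import random
import string

# Assumption: the 15 chromatin states are represented by the letters A..O.
STATE_LETTERS = string.ascii_uppercase[:15]


def encode_states(state_ids):
    """Map 1-based chromatin-state indices (1..15) to a string of letters."""
    for s in state_ids:
        if not 1 <= s <= 15:
            raise ValueError(f"state index out of range: {s}")
    return "".join(STATE_LETTERS[s - 1] for s in state_ids)


def kmer_tokens(sequence, k=4):
    """Split a state string into overlapping k-mers (DNABERT-style tokens)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] for masked-LM training.

    Simplification: each token is masked independently; real k-mer models
    typically mask contiguous spans so neighbours do not leak the answer.
    Returns (masked_tokens, labels), where labels hold the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    # Hypothetical run of per-bin states along one promoter region.
    states = [1, 1, 2, 15]
    seq = encode_states(states)
    print(seq)                    # AABO
    print(kmer_tokens(seq, k=3))  # ['AAB', 'ABO']
```

In this framing, the pretraining objective is to recover the original token at each `[MASK]` position from the surrounding chromatin-state context.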
This entry refers specifically to the University of Tokyo chromatin-state-motif ChromBERT built on ROADMAP annotations; it is distinct from an unrelated regulatory-network model that shares the name in the literature. ChromBERT belongs to the family of epigenomic language models, which extend genomic language modeling from DNA sequence to the chromatin-state layer.
ChromBERT adapts the DNABERT architecture, a BERT-style transformer with masked-language-model pretraining, to chromatin-state-annotated genome sequences. Inputs are 15-state chromatin annotations from 127 ROADMAP cell and tissue types, tokenized along the genome, with pretrained models provided for both promoter and whole-genome regions. The released package includes data-preprocessing utilities, training scripts, attention-based motif detection, and DTW-based clustering for visualization, plus fine-tuned models for gene expression classification and regression. The authors also describe an 18-state system built on roughly 1,699 IHEC cell types, slated for release alongside the corresponding IHEC publication. Code is available under the Apache 2.0 license.
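To make the DTW-based clustering step concrete, the following is a minimal sketch of a DTW distance between two chromatin-state strings. It is not taken from the ChromBERT package: the 0/1 mismatch cost between states and the function name `dtw_distance` are assumptions chosen for illustration, and the letters stand for chromatin states under a hypothetical one-letter encoding.

```python
def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Classic dynamic-programming DTW between two sequences.

    Assumption: states are compared with a 0/1 mismatch cost; any other
    state-to-state cost function can be passed in via `cost`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of: advance in a, advance in b, or both.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


if __name__ == "__main__":
    # The same state order with different run lengths aligns at zero cost.
    print(dtw_distance("AAAB", "AB"))  # 0.0
    # One substituted state costs one mismatch.
    print(dtw_distance("AB", "AC"))    # 1.0
```

Because DTW lets one position align to several, candidate motifs that share the same order of states but differ in how long each state persists end up close together. A pairwise matrix of such distances could then be passed to any standard clustering routine.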
ChromBERT is intended for genomics and epigenomics researchers studying how chromatin-state patterns organize the genome and relate to gene regulation. Its pretrained and fine-tuned models support gene expression prediction, cell-type classification, and analysis of 3D genome features, while its motif-discovery pipeline helps interpret recurring chromatin-state arrangements around promoters and across the whole genome.
By framing chromatin states as a language and applying transformer pretraining, ChromBERT extends genomic language modeling beyond raw DNA to the epigenomic annotation layer, offering an interpretable route to chromatin-state motifs. Its open Apache-licensed code and ROADMAP-based pretrained models lower the barrier for downstream fine-tuning. Because the model is still at the preprint stage, its benchmarking against established epigenomic methods and its wider adoption are still developing.