ChromBERT (chromatin-state motifs)

Chromatin-state language model pretrained on ROADMAP annotations from 127 human cell types to find chromatin-state motifs and predict gene expression.

Released: July 2024

ChromBERT is a BERT-based foundation model that learns the "grammar" of chromatin states across the human genome to uncover recurring chromatin-state motifs. Developed by Seohyun Lee and colleagues in the Nakato lab at the University of Tokyo and first posted to bioRxiv in July 2024 (revised 2026), it adapts the masked-language-model approach of DNABERT to sequences of chromatin-state annotations rather than raw nucleotides.

The model is pretrained on 15-state chromatin annotations derived from 127 human cell and tissue types in the ROADMAP Epigenomics consortium, treating each genomic position's chromatin state as a token. By combining this transformer pretraining with Dynamic Time Warping (DTW) for clustering, ChromBERT identifies previously unrecognized patterns of chromatin states—chromatin-state "motifs"—and provides representations that can be fine-tuned for downstream genomic tasks.
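The tokenization idea above can be sketched in a few lines. This is a minimal illustration, not the released preprocessing code: the letter encoding (states 1 to 15 mapped to A to O), the k-mer size of 4, and the function names `encode_states` and `kmer_tokens` are assumptions introduced here, with the overlapping k-mer step borrowed from DNABERT's tokenization scheme.

```python
from typing import List

# Assumed encoding: chromatin states 1..15 map to letters "A".."O".
# This mapping is hypothetical and chosen only for illustration.
STATE_LETTERS = "ABCDEFGHIJKLMNO"


def encode_states(states: List[int]) -> str:
    """Map 1-based chromatin-state indices (one per genomic bin) to letters."""
    for s in states:
        if not 1 <= s <= len(STATE_LETTERS):
            raise ValueError(f"state {s} is outside 1..{len(STATE_LETTERS)}")
    return "".join(STATE_LETTERS[s - 1] for s in states)


def kmer_tokens(sequence: str, k: int = 4) -> List[str]:
    """Split a letter string into overlapping k-mers with stride 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Six consecutive genomic bins with their (hypothetical) state labels.
bins = [1, 1, 2, 7, 7, 15]
encoded = encode_states(bins)        # "AABGGO"
tokens = kmer_tokens(encoded, k=4)   # ["AABG", "ABGG", "BGGO"]
```

With `k=1`, each genomic position is its own token, which matches the description above most literally; larger `k` gives each token some local context.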

This entry refers specifically to the University of Tokyo chromatin-state-motif ChromBERT built on ROADMAP annotations; it is distinct from an unrelated regulatory-network model that shares the name in the literature. ChromBERT sits alongside epigenomic language models, extending genomic language modeling from DNA sequence to the chromatin-state layer.

Key Features

Chromatin-state language modeling: Treats 15-state chromatin annotations as a token vocabulary and applies BERT-style masked pretraining to learn chromatin-state grammar across the genome.
Motif discovery: Combines learned representations with Dynamic Time Warping clustering to surface recurring chromatin-state motifs, including patterns not previously characterized.
ROADMAP-scale pretraining: Pretrained on chromatin-state annotations from 127 human cell and tissue types in the ROADMAP consortium, covering both promoter and whole-genome regions.
Fine-tunable for downstream tasks: Provides fine-tuned models for gene expression classification and regression, with applicability to cell-type classification and 3D genome feature analysis.
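To make the Dynamic Time Warping step in the motif-discovery feature concrete, the sketch below computes a DTW distance between two chromatin-state strings. It is a textbook dynamic-programming version, not ChromBERT's implementation: the 0/1 mismatch cost between states and the function name `dtw_distance` are assumptions made here, and the released pipeline may use a different local cost or an optimized library.

```python
from typing import Sequence


def dtw_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Classic DTW with an assumed 0/1 mismatch cost between state labels.

    Returns 0.0 for two empty sequences and infinity if exactly one is empty.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays (stretch b[j-1])
                cost[i][j - 1],      # b advances, a stays (stretch a[i-1])
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same pattern at different lengths aligns at zero cost,
# which is why DTW suits motifs whose state runs vary in width.
same_shape = dtw_distance("AAB", "AB")
one_mismatch = dtw_distance("ABC", "ABD")
```

A pairwise matrix of such distances can then be passed to any clustering method that accepts precomputed distances.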

Technical Details

ChromBERT adapts the DNABERT architecture, a BERT-style transformer with masked-language-model pretraining, to genome sequences annotated with chromatin states. Inputs are 15-state chromatin annotations from 127 ROADMAP cell and tissue types, tokenized along the genome, and pretrained models are provided for both promoter and whole-genome regions. The released package includes data-preprocessing utilities, training scripts, attention-based motif detection, and DTW-based clustering for visualization, plus fine-tuned models for gene expression classification and regression. The authors also describe an 18-state system built on roughly 1,699 IHEC cell types, slated for release alongside the corresponding IHEC publication. Code is available under the Apache 2.0 license.
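The masked-language-model objective mentioned above can be illustrated with a simplified masking routine. This is a sketch under stated assumptions rather than the project's training code: the 15% masking rate follows the original BERT recipe, the `[MASK]` string and the function name `mask_tokens` are placeholders chosen here, and the full BERT scheme (which sometimes keeps or randomly replaces selected tokens) and DNABERT's contiguous k-mer masking are both omitted for brevity.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # placeholder mask symbol, assumed for illustration


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Independently mask each token with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where a mask was applied and None elsewhere, so a training loss would
    be computed only at masked positions.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical chromatin-state k-mer tokens.
tokens = ["AABG", "ABGG", "BGGO", "GGOO", "GOOA"]
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

During pretraining, the model would be asked to predict each masked token from the surrounding chromatin-state context.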

Applications

ChromBERT is intended for genomics and epigenomics researchers studying how chromatin-state patterns organize the genome and relate to gene regulation. Its pretrained and fine-tuned models support gene expression prediction, cell-type classification, and analysis of 3D genome features, while its motif-discovery pipeline helps interpret recurring chromatin-state arrangements around promoters and across the whole genome.

Impact

By framing chromatin states as a language and applying transformer pretraining, ChromBERT extends genomic language modeling beyond raw DNA to the epigenomic annotation layer, offering an interpretable route to chromatin-state motifs. Its open Apache-licensed code and ROADMAP-based pretrained models lower the barrier for downstream fine-tuning. As a preprint-stage model, its broader benchmarking and adoption relative to established epigenomic methods continue to develop.

Citation

ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach

Preprint

Lee, S., et al. (2026) ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach. bioRxiv.

DOI: 10.1101/2024.07.25.605219

Recent citations

Papers that recently cited this model.

The cell as a token: high-dimensional geometry in language models and cell embeddings
William Gilpin
Bioinform. · Mar 2025 · Citations: 1

Unmeasured human transcription factor ChIP-seq data shape functional genomics and demand strategic prioritization
Saeko Tahara, Haruka Ozaki
Briefings in Functional Genomics · Jan 2025 · Citations: 2

Top citations

The most-cited papers that cite this model.

Unmeasured human transcription factor ChIP-seq data shape functional genomics and demand strategic prioritization
Saeko Tahara, Haruka Ozaki
Briefings in Functional Genomics · Jan 2025 · Citations: 2

The cell as a token: high-dimensional geometry in language models and cell embeddings
William Gilpin
Bioinform. · Mar 2025 · Citations: 1

Citations

Total Citations: 2
Influential: 0
References: 43

GitHub

Stars: 12
Forks: 6
Open Issues: 0
Contributors: 3
Last Push: 4mo ago
Language: Python
License: Apache-2.0

Fields of citing research

Share of papers citing this model, by field:

Biology: 100%
Medicine: 100%
Computer Science: 50%
Linguistics: 50%

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 86 (Open)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 87

Model Openness Framework: Unclassified (missing required components)

