bio.rodeo
DNA & Gene

ChromBERT (chromatin-state motifs)

University of Tokyo

BERT-based model pretrained on 15-state ROADMAP chromatin annotations across 127 human cell types to uncover chromatin-state motifs and predict gene expression.

Released: July 2024

ChromBERT is a BERT-based foundation model that learns the "grammar" of chromatin states across the human genome to uncover recurring chromatin-state motifs. Developed by Seohyun Lee and colleagues in the Nakato lab at the University of Tokyo and first posted to bioRxiv in July 2024 (revised 2026), it adapts the masked-language-model approach of DNABERT to sequences of chromatin-state annotations rather than raw nucleotides.

The model is pretrained on 15-state chromatin annotations derived from 127 human cell and tissue types in the ROADMAP Epigenomics consortium, treating each genomic position's chromatin state as a token. By combining this transformer pretraining with Dynamic Time Warping (DTW) for clustering, ChromBERT identifies previously unrecognized patterns of chromatin states—chromatin-state "motifs"—and provides representations that can be fine-tuned for downstream genomic tasks.
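
The idea of treating chromatin states as tokens can be made concrete with a minimal sketch. It assumes each of the 15 states is written as one letter (A to O) per genomic bin and that overlapping k-mers serve as tokens, in the style of DNABERT. The letter mapping, the value of `k`, and the function names are illustrative assumptions, not ChromBERT's documented preprocessing.

```python
import string

# Assumption: the 15 chromatin states are encoded as letters A..O, one per bin.
STATE_LETTERS = string.ascii_uppercase[:15]


def encode_states(state_ids):
    """Map 1-based chromatin-state ids (1..15) to a string of letters."""
    for s in state_ids:
        if not 1 <= s <= 15:
            raise ValueError(f"state id out of range: {s}")
    return "".join(STATE_LETTERS[s - 1] for s in state_ids)


def kmer_tokens(sequence, k=4):
    """Split a state string into overlapping k-mers with stride 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical run of per-bin state ids along one genomic region.
bins = [1, 1, 2, 2, 5, 5, 5, 15]
seq = encode_states(bins)
tokens = kmer_tokens(seq, k=4)
print(seq)
print(tokens)
```

Overlapping k-mers give each token some local context, at the cost of a vocabulary that grows as 15^k under this encoding.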

This entry refers specifically to the University of Tokyo chromatin-state-motif ChromBERT built on ROADMAP annotations; it is distinct from an unrelated regulatory-network model that shares the name in the literature. ChromBERT sits alongside other epigenomic language models, extending genomic language modeling from DNA sequence to the chromatin-state layer.

#Key Features

  • Chromatin-state language modeling: Treats 15-state chromatin annotations as a token vocabulary and applies BERT-style masked pretraining to learn chromatin-state grammar across the genome.
  • Motif discovery: Combines learned representations with Dynamic Time Warping clustering to surface recurring chromatin-state motifs, including patterns not previously characterized.
  • ROADMAP-scale pretraining: Pretrained on chromatin-state annotations from 127 human cell and tissue types in the ROADMAP consortium, covering both promoter and whole-genome regions.
  • Fine-tunable for downstream tasks: Provides fine-tuned models for gene expression classification and regression, with applicability to cell-type classification and 3D genome feature analysis.
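
The Dynamic Time Warping step named in the motif-discovery feature can be illustrated with a textbook DTW distance between two chromatin-state strings. This is a minimal sketch: the 0/1 mismatch cost and the choice to compare raw state letters are assumptions made here for illustration, and the entry does not specify which features ChromBERT passes to DTW.

```python
def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Classic dynamic-programming DTW distance between two sequences.

    `cost` is a hypothetical per-element mismatch cost (0 if equal, else 1).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] holds the best alignment cost of a[:i] against b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of: repeat b[j-1], repeat a[i-1], or advance both.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Two state strings that differ only in how long the first state persists.
same_shape = dtw_distance("AAB", "AB")
one_mismatch = dtw_distance("AAB", "AC")
print(same_shape, one_mismatch)
```

Because DTW lets one element align with several, patterns that differ only in how long a state persists score as close, which is the property that makes it a natural fit for clustering motifs of varying length. A pairwise distance matrix built this way could then be passed to any standard clustering routine.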

#Technical Details

ChromBERT adapts the DNABERT architecture, a BERT-style transformer with masked-language-model pretraining, to chromatin-state-annotated genome sequences. Inputs are 15-state chromatin annotations from 127 ROADMAP cell and tissue types, tokenized along the genome; pretrained models are provided for both promoter and whole-genome regions. The released package includes data-preprocessing utilities, training scripts, attention-based motif detection, and DTW-based clustering for visualization, plus fine-tuned models for gene expression classification and regression. The authors also describe an 18-state system built on roughly 1,699 IHEC cell types, slated for release alongside the corresponding IHEC publication. Code is available under the Apache 2.0 license.
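
The masked-language-model objective mentioned above can be sketched in a few lines: hide a fraction of the tokens and keep the originals as prediction targets. This is a simplified illustration, not ChromBERT's training code. The 15% masking rate follows the original BERT recipe, while the `[MASK]` string, the function name, and the example tokens are assumptions made for this sketch.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol, assumed for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified masked-LM objective.

    labels[i] holds the original token where position i was masked, else None.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for p in positions:
        labels[p] = tokens[p]  # target the model must recover
        masked[p] = MASK       # input the model actually sees
    return masked, labels


# Hypothetical chromatin-state k-mer tokens.
example = ["AABB", "ABBE", "BBEE", "BEEE", "EEEO"] * 4
masked, labels = mask_tokens(example)
print(masked)
```

Real pretraining pipelines usually add refinements, such as BERT's rule of replacing only 80% of the selected tokens with the mask symbol, 10% with a random token, and leaving 10% unchanged.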

#Applications

ChromBERT is intended for genomics and epigenomics researchers studying how chromatin-state patterns organize the genome and relate to gene regulation. Its pretrained and fine-tuned models support gene expression prediction, cell-type classification, and analysis of 3D genome features, while its motif-discovery pipeline helps interpret recurring chromatin-state arrangements around promoters and across the whole genome.

#Impact

By framing chromatin states as a language and applying transformer pretraining, ChromBERT extends genomic language modeling beyond raw DNA to the epigenomic annotation layer, offering an interpretable route to chromatin-state motifs. Its open Apache-licensed code and ROADMAP-based pretrained models lower the barrier for downstream fine-tuning. Because the model is still at the preprint stage, it has not yet been broadly benchmarked against established epigenomic methods, and its adoption is still developing.

Tags

chromatin_state_modeling, motif_discovery, gene_expression_prediction, transformer, bert, self_supervised, language_model, transfer_learning, chromatin, epigenomics