GeneBERT

Multi-modal self-supervised transformer for regulatory genomics, pre-trained on DNA sequence together with transcription factor binding matrices.

Released: October 2021

GeneBERT is a multi-modal, self-supervised pre-training framework for regulatory genome modeling, developed by Shentong Mo, Xi Fu, and colleagues including Eric P. Xing at Carnegie Mellon University and collaborating institutions. It was published as an arXiv preprint in October 2021. GeneBERT addresses a limitation of earlier genomic sequence models: they process the DNA sequence of each genomic locus independently, without accounting for how the broader epigenomic context shapes regulatory activity in a cell-type-specific manner. Here that context is the simultaneous binding landscape of many transcription factors across many genomic regions.

The core insight of GeneBERT is that regulatory genome analysis is inherently multi-modal: the functional state of a genomic sequence depends not only on its linear nucleotide composition but also on the two-dimensional matrix of transcription factor binding patterns across all regulatory regions in a given cell type. By treating both modalities — the 1D sequence and the 2D TF-by-region binding matrix — as inputs to a joint pre-training framework, GeneBERT learns representations that are aware of cell-type context and inter-regulatory-element interactions, enabling more accurate predictions on downstream tasks compared to sequence-only approaches.

This multi-modal design draws direct inspiration from BERT's masked language modeling paradigm but extends it to the biological domain in a way that captures the combinatorial, context-dependent nature of transcription factor binding. Pre-training on approximately 17 million genome sequences from ATAC-seq data spanning multiple cell types, GeneBERT learns to reconstruct masked genomic tokens while also attending to the cell-type-specific TF binding context, producing representations that generalize across regulatory prediction tasks including promoter classification, transcription factor binding site prediction, disease risk estimation, and splicing site identification.

Key Features

Multi-modal input design: Simultaneously processes 1D genomic sequences and a 2D matrix of transcription factor binding patterns across regulatory regions, enabling context-aware regulatory prediction that accounts for the combinatorial logic of TF co-binding.
Three self-supervised pre-training tasks: Employs a suite of complementary objectives — masked sequence modeling, masked TF-binding modeling, and cross-modal consistency — to improve representational robustness and generalization across downstream regulatory tasks.
Cell-type-aware regulatory encoding: By conditioning sequence representations on the TF-by-region binding matrix for a specific cell type, the model learns cell-type-specific regulatory grammar that is inaccessible to sequence-only models.
Broad downstream task applicability: Fine-tuned representations transfer effectively to promoter classification, transcription factor binding site prediction, disease risk estimation from noncoding variants, and splicing site prediction — demonstrating generalist regulatory representation learning.
Large-scale ATAC-seq pre-training: Pre-trained on approximately 17 million genome sequences derived from ATAC-seq open-chromatin regions across multiple cell types, providing diverse regulatory sequence coverage spanning a wide range of chromatin states.
BERT-style tokenization for DNA: Adapts the BERT masked language modeling objective to DNA sequences using k-mer tokenization, building on the successful paradigm of DNABERT while extending it to multi-modal inputs.
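
The k-mer tokenization and masking mentioned in the last feature can be illustrated with a minimal sketch. It is not GeneBERT's released code: the function names `kmer_tokenize` and `mask_tokens`, the choice of k = 6, the stride of 1, and the masking probability are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token, named after BERT's convention


def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mers with stride 1 (assumed)."""
    seq = seq.upper()
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


tokens = kmer_tokenize("ACGTACGTAC", k=6)
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(tokens)
print(masked)
```

Adjacent overlapping k-mers share k-1 bases, so masking isolated tokens lets a model copy the answer from its neighbors. DNABERT masks contiguous spans of tokens for that reason; the per-token masking here is kept only for brevity.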

Technical Details

GeneBERT adapts the transformer encoder architecture from BERT for regulatory genome modeling. DNA input sequences are tokenized into overlapping k-mers and embedded into a learned representation space, analogous to word tokens in natural-language BERT. These sequence embeddings are processed by a transformer encoder with multi-head self-attention and feed-forward sublayers.

The key architectural innovation is a second input stream: a 2D matrix of shape (number of transcription factors × number of regulatory regions) representing the TF binding landscape of a specific cell type. This matrix is embedded and fused with the sequence representations via cross-attention or concatenation operations.

Three complementary pre-training tasks were designed to exploit both modalities: (1) masked sequence reconstruction, where a fraction of the k-mer tokens is masked and the model learns to predict them from context, directly analogous to BERT's masked language modeling; (2) masked TF-binding prediction, where entries in the TF-by-region matrix are masked and the model predicts them from the sequence and the remaining binding context; and (3) cross-modal alignment, which encourages consistent representations between the sequence and TF-binding modalities.

Pre-training was conducted on approximately 17 million sequence-region pairs derived from ATAC-seq data spanning multiple human cell types. In benchmark evaluations on downstream regulatory tasks, GeneBERT outperformed sequence-only BERT-based baselines including DNABERT on promoter classification (where it achieved over 90% accuracy on standard benchmarks), TF binding site prediction, and disease risk estimation from GWAS variants, supporting the utility of the multi-modal design.
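
To make the second input stream and its masked-prediction task concrete, here is a small sketch of a TF-by-region matrix with some entries hidden. It is an illustration under stated assumptions, not the authors' implementation: binary 0/1 entries, the sentinel value -1 for hidden cells, the helper names `build_binding_matrix` and `mask_entries`, and the TF and region labels are all hypothetical.

```python
import random


def build_binding_matrix(tf_names, region_ids, bound_pairs):
    """Return a (TFs x regions) matrix as a list of rows.

    Entry [i][j] is 1 if TF i is bound in region j, else 0.
    Binary entries are an assumption made for this sketch.
    """
    bound = set(bound_pairs)
    return [[1 if (tf, r) in bound else 0 for r in region_ids]
            for tf in tf_names]


def mask_entries(matrix, mask_prob=0.15, seed=0, mask_value=-1):
    """Hide a random subset of entries.

    Returns a masked copy and a dict mapping (row, col) of each hidden
    entry to its original value, i.e. the prediction target.
    """
    rng = random.Random(seed)
    masked = [row[:] for row in matrix]  # copy so the input stays intact
    targets = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if rng.random() < mask_prob:
                masked[i][j] = mask_value
                targets[(i, j)] = value
    return masked, targets


# Illustrative labels only; not taken from the GeneBERT dataset.
tfs = ["CTCF", "GATA1", "SPI1"]
regions = ["region_0", "region_1", "region_2", "region_3"]
bound = [("CTCF", "region_0"), ("CTCF", "region_2"), ("GATA1", "region_1")]

matrix = build_binding_matrix(tfs, regions, bound)
masked, targets = mask_entries(matrix, mask_prob=0.25, seed=42)
print(len(matrix), "TFs x", len(matrix[0]), "regions")
```

In a real training loop the hidden cells would be predicted from the sequence embeddings and the visible cells, with the loss computed only over the positions recorded in `targets`.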

Applications

GeneBERT is applicable to any regulatory genomics task where cell-type context improves prediction accuracy. Its primary validated applications include promoter classification — distinguishing promoters from non-regulatory sequences — and transcription factor binding site prediction, where the model's awareness of co-binding patterns enables more accurate identification of cell-type-specific TF occupancy from sequence alone. Disease risk estimation is another key application: by encoding GWAS variant loci with their cell-type-specific regulatory context, GeneBERT produces embeddings that better capture the functional impact of noncoding variants on regulatory activity. Splicing site prediction, while a distinct regulatory mechanism, also benefits from the model's multi-scale sequence representations. The multi-modal pre-training framework is also conceptually extendable to other regulatory modalities, such as incorporating Hi-C chromatin contact maps as a third input stream for promoter-enhancer interaction prediction.

Impact

GeneBERT contributed to a productive period of applying BERT-style pre-training to regulatory genomics sequences, alongside DNABERT, Nucleotide Transformer, and related models. Its multi-modal design, which distinguishes it from purely sequence-based approaches, was an early proposal for integrating epigenomic context into genomic foundation model pre-training, a concept that subsequent multimodal genomic models have revisited and developed. The work demonstrated that self-supervised objectives designed specifically for the combinatorial logic of regulatory biology can yield representations that transfer across regulatory prediction tasks.

A key limitation is that GeneBERT was demonstrated primarily at the scale of ATAC-seq-defined regulatory regions rather than chromosome-scale sequences, which constrains its ability to capture very long-range regulatory interactions. Because the work is an arXiv preprint, it has not undergone formal peer review, and independent replication of the benchmark results would be valuable. Nevertheless, the multi-modal pre-training paradigm and the TF-by-region matrix encoding remain conceptually valuable contributions to the genomic foundation model literature.

Citation

Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types

Preprint

Mo, S., et al. (2021) Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types. arXiv.org.

DOI: 10.48550/arXiv.2110.05231

Recent citations

Papers that recently cited this model.

Large language models in radiogenomics: a comprehensive survey of applications from imaging to genetics
Muhammad Nadeem Cheema, Anam Nazir, Arif O Harmanci, et al.
The Visual Computer · Feb 2026 · Citations: 0

Advancing non-coding RNA annotation with RNA sequence foundation models: structure and function perspectives
Naima Vahab, Sonika Tyagi
BMC Artificial Intelligence · Oct 2025 · Citations: 0

Large Language Models Meet Virtual Cell: A Survey
Krinos Li, Xianglu Xiao, Shenglong Deng, et al.
arXiv.org · Oct 2025 · Citations: 0

Top citations

The most-cited papers that cite this model.

Applications of transformer-based language models in bioinformatics: a survey
Shuang Zhang, Rui Fan, Yuti Liu, et al.
Bioinformatics Advances · Jan 2023 · Citations: 163

Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions
Yuting He, Fuxiang Huang, Xinrui Jiang, et al.
IEEE Reviews in Biomedical Engineering · Apr 2024 · Citations: 134

Large language models and their applications in bioinformatics
O. Sarumi, Dominik Heider
Computational and Structural Biotechnology Journal · Oct 2024 · Citations: 69

The Large Language Models on Biomedical Data Analysis: A Survey
Wei Lan, Zhentao Tang, Mingyang Liu, et al.
IEEE Journal of Biomedical and Health Informatics · Feb 2025 · Citations: 60

LangCell: Language-Cell Pre-training for Cell Identity Understanding
Suyuan Zhao, Jiahuan Zhang, Yizhen Luo, et al.
International Conference on Machine Learning · May 2024 · Citations: 35

Citations

Total Citations: 27
Influential: 1
References: 34

Fields of citing research

Computer Science: 100%
Biology: 81%
Medicine: 48%
Environmental Science: 4%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 18 · Closed
Usability (can I run it?): 9
Reproducibility (can I retrain it?): 14

Model Openness Framework

Unclassified

Missing required components

Resources

Research Paper
