Carnegie Mellon University
Multi-modal self-supervised model pre-trained on regulatory genome sequences and transcription factor binding matrices for cell-type-specific regulatory prediction.
GeneBERT is a multi-modal, self-supervised pre-training framework for regulatory genome modeling, developed by Shentong Mo, Xi Fu, and colleagues including Eric P. Xing at Carnegie Mellon University and collaborating institutions. Published as an arXiv preprint in October 2021, GeneBERT addresses a fundamental limitation of existing genomic sequence models: they process DNA sequences independently for each genomic locus without accounting for how the broader epigenomic context — specifically, the simultaneous binding landscape of many transcription factors across many genomic regions — shapes regulatory activity in a cell-type-specific manner.
The core insight of GeneBERT is that regulatory genome analysis is inherently multi-modal: the functional state of a genomic sequence depends not only on its linear nucleotide composition but also on the two-dimensional matrix of transcription factor binding patterns across all regulatory regions in a given cell type. By treating both modalities — the 1D sequence and the 2D TF-by-region binding matrix — as inputs to a joint pre-training framework, GeneBERT learns representations that are aware of cell-type context and inter-regulatory-element interactions, enabling more accurate predictions on downstream tasks compared to sequence-only approaches.
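As a rough illustration of the two modalities described above, the following minimal sketch pairs a 1D nucleotide string with a 2D TF-by-region matrix. The class name `MultiModalExample`, the helper `region_profile`, the transcription factor names, and the binary 0/1 binding calls are all assumptions made for illustration; they are not taken from the GeneBERT paper or code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MultiModalExample:
    """Hypothetical container for one cell type's paired inputs."""
    sequence: str              # 1D modality: nucleotides of one regulatory region
    binding: List[List[int]]   # 2D modality: rows = TFs, columns = regions (assumed 0/1)

    def shape(self) -> Tuple[int, int]:
        # (number of transcription factors, number of regulatory regions)
        return (len(self.binding), len(self.binding[0]))


def region_profile(example: MultiModalExample, region_index: int) -> List[int]:
    """Return one column of the matrix: which TFs are called bound at a region."""
    return [row[region_index] for row in example.binding]


# Toy data, invented for illustration only.
tf_names = ["CTCF", "GATA1", "SPI1"]
binding = [
    [1, 0, 1, 0],  # CTCF across four regions
    [0, 0, 1, 1],  # GATA1
    [0, 1, 0, 0],  # SPI1
]
example = MultiModalExample(sequence="ACGTTGCA", binding=binding)

print(example.shape())             # (3, 4)
print(region_profile(example, 2))  # [1, 1, 0]
```

Reading the matrix by column gives the co-binding profile of a single region, while reading it by row gives one transcription factor's occupancy across regions; a sequence-only model sees neither.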
This multi-modal design draws direct inspiration from BERT's masked language modeling paradigm but extends it to the biological domain in a way that captures the combinatorial, context-dependent nature of transcription factor binding. Pre-training on approximately 17 million genome sequences from ATAC-seq data spanning multiple cell types, GeneBERT learns to reconstruct masked genomic tokens while also attending to the cell-type-specific TF binding context, producing representations that generalize across regulatory prediction tasks including promoter classification, transcription factor binding site prediction, disease risk estimation, and splicing site identification.
GeneBERT adapts the transformer encoder architecture from BERT for regulatory genome modeling. DNA input sequences are tokenized using overlapping k-mers and embedded into a learned representation space, analogous to word tokens in natural language BERT. These sequence embeddings are processed by a transformer encoder with multi-head self-attention and feed-forward sublayers. The key architectural innovation is the incorporation of a second input stream: a 2D matrix of shape (number of transcription factors × number of regulatory regions) representing the TF binding landscape for a specific cell type, which is embedded and fused with the sequence representations via cross-attention or concatenation operations.
Three complementary pre-training tasks were designed to exploit both modalities: (1) masked sequence reconstruction, where a fraction of the k-mer tokens are masked and the model learns to predict them from context, directly analogous to BERT's masked language modeling; (2) masked TF-binding prediction, where entries in the TF-by-region matrix are masked and the model predicts them from the sequence and remaining binding context; and (3) cross-modal alignment, which encourages consistent representations between the sequence and TF-binding modalities. Pre-training was conducted on approximately 17 million sequence-region pairs derived from ATAC-seq data spanning multiple human cell types.
In benchmark evaluations on downstream regulatory tasks, GeneBERT outperformed sequence-only BERT-based baselines, including DNABERT, on promoter classification (where it achieved over 90% accuracy on standard benchmarks), TF binding site prediction, and disease risk estimation from GWAS variants, supporting the utility of the multi-modal design.
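The overlapping k-mer tokenization and the masked sequence reconstruction objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GeneBERT's implementation: the function names `kmer_tokenize` and `mask_tokens`, the choice of k = 3, the stride of 1, the `[MASK]` string, and the masking rate are all hypothetical defaults chosen for the example.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"  # placeholder token; the actual vocabulary entry is an assumption


def kmer_tokenize(sequence: str, k: int = 3) -> List[str]:
    """Split a DNA string into overlapping k-mers with stride 1."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_tokens(
    tokens: List[str], mask_rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Hide a random subset of tokens.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


tokens = kmer_tokenize("ACGTAC", k=3)
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC']

masked, targets = mask_tokens(tokens, mask_rate=0.25, seed=0)
print(masked)   # one position replaced by '[MASK]'
print(targets)  # {position: original k-mer}
```

Because adjacent k-mers share k − 1 nucleotides, an isolated masked token can largely be inferred from its neighbors; DNABERT, for instance, masks contiguous spans for this reason. The sketch masks independent positions only to keep the example short.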
GeneBERT is applicable to any regulatory genomics task where cell-type context improves prediction accuracy. Its primary validated applications include promoter classification — distinguishing promoters from non-regulatory sequences — and transcription factor binding site prediction, where the model's awareness of co-binding patterns enables more accurate identification of cell-type-specific TF occupancy from sequence alone. Disease risk estimation is another key application: by encoding GWAS variant loci with their cell-type-specific regulatory context, GeneBERT produces embeddings that better capture the functional impact of noncoding variants on regulatory activity. Splicing site prediction, while a distinct regulatory mechanism, also benefits from the model's multi-scale sequence representations. The multi-modal pre-training framework is also conceptually extendable to other regulatory modalities, such as incorporating Hi-C chromatin contact maps as a third input stream for promoter-enhancer interaction prediction.
GeneBERT contributed to a productive period of applying BERT-style pre-training to regulatory genomic sequences, alongside DNABERT, Nucleotide Transformer, and related models. Its multi-modal design, which distinguishes it from purely sequence-based approaches, was an early proposal for integrating epigenomic context into genomic foundation model pre-training, a concept that has been revisited and developed in subsequent multimodal genomic models. The work demonstrated that self-supervised objectives designed specifically for the combinatorial logic of regulatory biology can yield representations that transfer across regulatory prediction tasks. A key limitation is that GeneBERT was demonstrated primarily at the scale of ATAC-seq-defined regulatory regions rather than chromosome-scale sequences, which constrains its ability to capture very long-range regulatory interactions. Because the work is an arXiv preprint, it has not undergone formal peer review, and independent replication of the benchmark results would be valuable. Nevertheless, the multi-modal pre-training paradigm and the TF-by-region matrix encoding remain conceptually valuable contributions to the genomic foundation model literature.