Theis Lab
Deep learning model predicting single-cell read counts from DNA sequence features and cell transition graphs to identify transcriptional regulators.
Decoding the sequence-based logic that drives cell fate transitions is one of the central challenges in regulatory genomics. Chromatin accessibility assays like scATAC-seq reveal which genomic regions are open in each cell state, but they do not by themselves explain which transcription factors are active, how their activities change during differentiation, or how sequence features and cell topology together shape regulatory dynamics. Answering those questions requires modeling that integrates DNA sequence, epigenomic signal, and cell-cell relationships simultaneously. MuBind was developed to address this integrative challenge within a single, unified deep learning architecture.
MuBind was developed by Ignacio L. Ibarra, Jonas Schneeberger, Erkan Erdogan, Linda Redl, Lara Martens, Dominik Klein, Hananeh Aliee, and Fabian J. Theis at the Institute of Computational Biology, Helmholtz Center Munich (Theis Lab). The preprint was posted on bioRxiv in August 2024. The model distinguishes itself from existing sequence-to-activity models by incorporating cell transition graphs — dynamic relationships between cell states derived from RNA-based trajectory models — as a structural prior that guides how motif activities are propagated between neighboring cells during training. This allows MuBind to learn not just which motifs are active in individual cell clusters but how motif activity evolves across the developmental landscape.
The integration of graph structure into a sequence-activity model addresses a fundamental limitation of existing approaches: standard models like ChromBPNet and DeepSEA predict genomic signal from sequence features independently for each genomic locus or cell state, without encoding information about how regulatory programs evolve across a continuous developmental trajectory. MuBind's graph component allows the model to borrow statistical strength across related cell states, improving the identification of TFs that drive transitions between them — precisely the regulators most relevant to developmental and disease biology.
MuBind's architecture consists of three integrated modules. First, a convolutional sequence encoder processes the DNA sequence of each genomic region (ATAC-seq peak or ChIP-seq region) to extract sequence features and learn de novo motif representations. These sequence features are parameterized as convolutional filters that learn position-weight-matrix-like representations of TF binding preferences, analogous to those learned by SELEX-based models. Second, a cell activity module assigns a scalar activity weight to each learned motif in each cell (or cell cluster), representing the effective binding activity of the corresponding TF in that cell state. Third, a graph neural network module takes the cell activity matrix and propagates information across the cell transition graph — where nodes are cells or clusters and edges represent transition relationships — producing activity representations that reflect both local cell state and neighborhood context.
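The three modules described above can be illustrated with a small NumPy sketch. This is not MuBind's implementation: the function names (`one_hot`, `motif_scores`, `propagate`) are hypothetical, the filters are fixed toy matrices rather than learned parameters, and the graph step is a single fixed mean-aggregation over neighbors standing in for the learned graph neural network.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; non-ACGT bases stay all-zero."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def motif_scores(seq, filters):
    """Module 1 (assumed form): scan PWM-like filters of shape
    (n_filters, width, 4) along the sequence and keep the best window score."""
    x = one_hot(seq)
    width = filters.shape[1]
    n_windows = len(seq) - width + 1
    windows = np.stack([x[i:i + width] for i in range(n_windows)])
    scores = np.einsum("nwb,fwb->nf", windows, filters)  # (n_windows, n_filters)
    return scores.max(axis=0)

def propagate(activities, adjacency):
    """Module 3 (simplified stand-in): average each cell's motif activities
    with those of the cells it points to, including itself via a self-loop."""
    a = adjacency + np.eye(adjacency.shape[0])
    a = a / a.sum(axis=1, keepdims=True)
    return a @ activities

# Toy filters that match the 3-mers TGA and GGG exactly.
filters = np.stack([one_hot("TGA"), one_hot("GGG")])
scores = motif_scores("ACGTGACC", filters)

# Module 2 (assumed form): one scalar activity per motif per cell,
# here 3 cells x 2 motifs, on a chain-shaped transition graph 0 -> 1 -> 2.
activities = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
adjacency = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
smoothed = propagate(activities, adjacency)
```

In this sketch the activity of each cell is pulled toward that of its successor in the transition graph, which is one simple way to express the idea of sharing information between neighboring cell states.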
The final read count prediction for a given genomic region in a given cell is computed as a function of the sequence features, the cell's (GNN-updated) motif activities, and a learned baseline. The model is trained to minimize the divergence between predicted and observed read counts from scATAC-seq or bulk ATAC-seq data. Performance was evaluated against PyProBound — a state-of-the-art binding affinity prediction model — on a benchmark of 100 HT-SELEX datasets. MuBind's learned motifs and their relative activities showed high agreement with the ground-truth TF binding specificities (R = 0.81), validating that the sequence learning component produces biologically accurate motif representations. Three biological case studies were presented: pancreatic endocrinogenesis, mouse neurogenesis, and human brain organoids, with motif-pseudotime correlation plots and TF expression data providing independent validation of the identified regulators.
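The read count prediction and training objective described above can be sketched as follows. The text does not specify the functional form or the divergence, so this example assumes a log-linear combination of motif scores, activities, and baseline, and a Poisson negative log-likelihood as the loss. The names `predict_counts` and `poisson_nll` are hypothetical.

```python
import numpy as np

def predict_counts(region_scores, cell_activities, baseline):
    """Assumed log-linear model.
    region_scores:   (n_regions, n_motifs) sequence-derived motif scores
    cell_activities: (n_cells, n_motifs) graph-updated motif activities
    baseline:        (n_regions,) per-region log offset
    Returns expected counts of shape (n_regions, n_cells)."""
    log_rate = region_scores @ cell_activities.T + baseline[:, None]
    return np.exp(log_rate)

def poisson_nll(predicted, observed):
    """Mean Poisson negative log-likelihood, dropping the log(observed!)
    term because it does not depend on the model."""
    return float(np.mean(predicted - observed * np.log(predicted)))

# Toy inputs: 2 regions, 2 motifs, 2 cells.
region_scores = np.array([[3.0, 0.0], [0.0, 2.0]])
cell_activities = np.array([[0.5, 0.0], [0.0, 0.5]])
baseline = np.zeros(2)

predicted = predict_counts(region_scores, cell_activities, baseline)
loss_at_truth = poisson_nll(predicted, predicted)
loss_off_target = poisson_nll(2.0 * predicted, predicted)
```

Under this assumed loss, predictions equal to the observed counts give the lowest value, so `loss_at_truth` is smaller than `loss_off_target`. A real training loop would adjust the filters, activities, and baseline by gradient descent on this quantity.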
MuBind's primary application is the identification of transcriptional regulators driving cell fate transitions from single-cell chromatin accessibility data. In developmental biology, where understanding which TFs orchestrate differentiation is a central question, MuBind provides a data-driven framework to nominate key regulators from scATAC-seq atlases without requiring prior knowledge of the relevant TFs. The model is particularly suited to cell transition analysis — identifying which TFs are most active at decision points between cell states — rather than simply characterizing the motif landscape of stable terminal cell types. In disease contexts including cancer, where epigenomic reprogramming drives oncogenic state transitions, MuBind can identify the sequence-based regulatory logic underlying observed chromatin changes. The model is also useful as a component of multi-omic regulatory analysis pipelines, providing sequence-grounded motif activity estimates that complement TF inference methods such as CellOracle that operate on RNA-level regulon information.
MuBind represents a meaningful advance in the integration of DNA sequence modeling with single-cell regulatory genomics, particularly in its use of cell transition graphs as structural priors that encode developmental dynamics. By demonstrating competitive binding prediction performance against dedicated SELEX models while simultaneously providing single-cell resolved motif activity estimates and developmental trajectory context, MuBind bridges the gap between sequence-level TF binding characterization and cell-level regulatory dynamics analysis. The model's validation on well-studied developmental systems (pancreatic endocrinogenesis, mouse neurogenesis, and human brain organoids), in which it recovers known regulators, lends support to its predictions in less well-characterized biological contexts. Within the broader landscape of single-cell regulatory analysis tools, MuBind complements CellOracle (GRN inference from multi-omics) and the Theis Lab's scGen and CPA (perturbation prediction) by contributing sequence-level mechanistic grounding to the regulatory inference pipeline.