bio.rodeo

Single-cell

CellOracle

Morris Lab

Machine learning framework for inferring cell-type-specific gene regulatory networks from single-cell multi-omics data and simulating transcription factor perturbations in silico.

Released: 2023

Overview

Understanding how transcription factors (TFs) control cell identity is fundamental to developmental biology, regenerative medicine, and cancer research. Yet mapping TF-to-target relationships at the resolution of individual cell states, and then predicting the consequences of TF perturbation without running the experiment, remains a major challenge. CellOracle addresses this problem through a two-stage computational framework: first inferring cell-type-specific gene regulatory networks (GRNs) from single-cell multi-omics data, then using those networks as mechanistic models to simulate how transcription factor perturbations would propagate through gene expression programs.

CellOracle was developed by Kenji Kamimoto, Blerta Stringa, Christy M. Hoffmann, Kunal Jindal, Lilianna Solnica-Krezel, and Samantha A. Morris at Washington University School of Medicine in St. Louis. The work was published in Nature in February 2023. The core biological insight motivating CellOracle is that GRNs are not static but cell-state-specific — the set of active regulatory interactions changes as cells differentiate, and any model of TF perturbation must therefore be conditioned on the specific network configuration of the target cell state.

The method distinguishes itself from earlier TF perturbation approaches by grounding its predictions in network models constrained by regulatory evidence rather than in expression correlations alone. It combines three inputs: chromatin accessibility data (scATAC-seq) to identify which regulatory elements are open, TF motif scanning to determine which factors can bind those elements, and scRNA-seq to estimate the strength of each candidate TF-to-target connection within each cluster. The resulting cell-state-specific GRNs reflect the regulatory landscape of each cell type. These networks are used not to predict absolute expression levels but to propagate perturbation signals: CellOracle computes the expected shift in each target gene's expression given a change in TF activity, producing a simulated gene expression shift vector that can be projected onto single-cell embeddings to visualize cell fate changes.

Key Features

  • Multi-omic GRN inference: CellOracle combines scATAC-seq chromatin accessibility with TF binding motif databases to identify active cis-regulatory elements and the TFs that can bind them, then uses regularized linear regression on scRNA-seq data to estimate TF-to-target-gene interaction weights within each cell state.
  • Cell-state-specific networks: Rather than constructing a single genome-wide GRN, CellOracle builds cluster-wise network configurations that capture how regulatory interactions differ between cell types, enabling perturbation predictions tailored to the specific cell state of interest.
  • In silico TF perturbation simulation: Transcription factor overexpression or knockdown is simulated by modifying the relevant input node in the GRN and propagating the signal through the learned regulatory edges to compute expected downstream expression shifts.
  • Shift vector visualization: Predicted perturbation effects are encoded as high-dimensional gene expression shift vectors that can be projected into low-dimensional embeddings (e.g., UMAP) as arrows indicating the predicted direction of cell identity change.
  • Validation against known biology: Applied to mouse haematopoiesis, human haematopoiesis, and zebrafish embryogenesis, CellOracle correctly predicted the phenotypic outcomes of TF perturbations with established experimental records.
  • Modular multi-omics inputs: The GRN inference pipeline accepts a range of chromatin accessibility inputs including bulk ATAC-seq, scATAC-seq, and even published peak-to-gene regulatory element annotations, making it applicable to datasets with varied data availability.
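The shift vector visualization described in the list above can be illustrated with a small NumPy sketch. It assumes an RNA-velocity-style projection: each cell's simulated shift is compared with the expression differences to its nearest neighbors, and the resulting weights are used to average the directions to those neighbors in the embedding. The function name, the `sigma` temperature, and the neighbor count are illustrative choices, not CellOracle's API or defaults.

```python
import numpy as np

def project_shift_to_embedding(expr, shift, embedding, n_neighbors=10, sigma=0.05):
    """Turn per-cell expression shift vectors into 2D arrows (illustrative sketch).

    expr:      (n_cells, n_genes) observed expression
    shift:     (n_cells, n_genes) simulated expression shift per cell
    embedding: (n_cells, 2) low-dimensional coordinates, e.g. UMAP
    """
    expr = np.asarray(expr, dtype=float)
    shift = np.asarray(shift, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    n_cells = expr.shape[0]
    k = min(n_neighbors, n_cells - 1)
    arrows = np.zeros_like(embedding)
    for i in range(n_cells):
        # k nearest neighbors in the embedding, excluding the cell itself
        dist = np.linalg.norm(embedding - embedding[i], axis=1)
        dist[i] = np.inf
        neighbors = np.argsort(dist)[:k]
        # Pearson correlation between the shift and each neighbor difference
        diffs = expr[neighbors] - expr[i]
        a = shift[i] - shift[i].mean()
        b = diffs - diffs.mean(axis=1, keepdims=True)
        denom = np.linalg.norm(a) * np.linalg.norm(b, axis=1)
        corr = np.divide(b @ a, denom, out=np.zeros(k), where=denom > 0)
        # Softmax over neighbors, minus the uniform baseline
        prob = np.exp(corr / sigma)
        prob = prob / prob.sum() - 1.0 / k
        # Weighted average of unit directions toward the neighbors
        directions = embedding[neighbors] - embedding[i]
        norms = np.linalg.norm(directions, axis=1, keepdims=True)
        unit = np.divide(directions, norms, out=np.zeros_like(directions), where=norms > 0)
        arrows[i] = prob @ unit
    return arrows
```

Subtracting the uniform baseline means a cell whose neighbors all agree equally well with the shift gets no arrow, so arrows appear only where the shift favors one direction over another.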

Technical Details

CellOracle's GRN inference pipeline begins by identifying candidate cis-regulatory elements from scATAC-seq peak data. TF binding site scanning (using motif databases such as JASPAR or CisBP) maps TF motifs to accessible peaks, establishing a set of candidate regulatory connections. A regularized linear regression model (bagging ridge or Bayesian ridge regression) is then fitted within each cluster to estimate the contribution of each candidate TF to each target gene's expression, using the scRNA-seq expression matrix. Edges with low estimated weights are pruned to produce a sparse, biologically interpretable GRN per cell state.
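The regression step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not CellOracle's implementation: it uses plain closed-form ridge regression in place of the bagging or Bayesian variants, and the function name, the `alpha` penalty, and the pruning threshold are hypothetical. The boolean `candidate_mask` stands in for the output of motif scanning on accessible peaks.

```python
import numpy as np

def fit_cluster_grn(expr, candidate_mask, alpha=1.0, weight_threshold=0.05):
    """Fit one ridge regression per target gene using only its candidate TFs.

    expr:           (n_cells, n_genes) expression matrix for ONE cluster
    candidate_mask: (n_genes, n_genes) boolean; mask[i, j] is True when gene i
                    is a candidate regulator of gene j
    Returns W where W[i, j] is the estimated weight of regulator i on target j.
    """
    expr = np.asarray(expr, dtype=float)
    n_genes = expr.shape[1]
    centered = expr - expr.mean(axis=0)
    weights = np.zeros((n_genes, n_genes))
    for target in range(n_genes):
        regulators = np.flatnonzero(candidate_mask[:, target])
        regulators = regulators[regulators != target]  # skip self-regulation
        if regulators.size == 0:
            continue
        X = centered[:, regulators]
        y = centered[:, target]
        # Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y
        gram = X.T @ X + alpha * np.eye(regulators.size)
        weights[regulators, target] = np.linalg.solve(gram, X.T @ y)
    # Prune weak edges to obtain a sparse network
    weights[np.abs(weights) < weight_threshold] = 0.0
    return weights
```

Running this once per cluster, with the same candidate mask but cluster-specific expression, yields one weight matrix per cell state, which is the sense in which the networks are cell-state-specific.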

In silico perturbation proceeds in two steps. First, a perturbation vector is constructed by setting the expression of the target TF to zero (knockout) or to an elevated value (overexpression) in the GRN input. Second, the GRN is applied as a linear function to propagate the perturbation signal: for each target gene, the expected expression shift is the sum, over its regulators, of each edge weight multiplied by that regulator's expression change. Repeating this calculation for a small number of iterations carries the signal from direct targets to indirect ones. This produces a gene expression shift vector for each cell in the dataset. The shift vectors are then converted into arrows in the low-dimensional embedding (typically UMAP) by comparing each cell's simulated shift with the expression differences to its neighboring cells, in a manner similar to RNA velocity visualization. The resulting arrows indicate the predicted direction of cell state transition.

CellOracle was validated on Ikaros and Gata1 knockdown in haematopoiesis, Nanog and Pou5f1/Sox2 perturbation in pluripotency, and multiple TF perturbations in zebrafish embryogenesis, with predictions showing strong concordance with reported experimental phenotypes.
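The propagation step lends itself to a short sketch. The code below assumes a weight matrix `weights` where `weights[i, j]` is the effect of regulator i on target j, holds the perturbed TF at its set value on every iteration, and clips simulated expression at zero. The function name and the default of three iterations are illustrative assumptions rather than a description of the library's internals.

```python
import numpy as np

def simulate_shift(expr, weights, tf_index, perturbed_value, n_propagation=3):
    """Propagate a TF perturbation through a linear GRN (illustrative sketch).

    expr:    (n_cells, n_genes) observed expression
    weights: (n_genes, n_genes); weights[i, j] = effect of regulator i on target j
    Returns the (n_cells, n_genes) simulated expression shift.
    """
    expr = np.asarray(expr, dtype=float)
    perturbed = expr.copy()
    perturbed[:, tf_index] = perturbed_value
    delta_input = perturbed - expr  # non-zero only in the perturbed TF column
    delta = delta_input.copy()
    for _ in range(n_propagation):
        # Push the current shifts one regulatory edge further downstream
        delta = delta @ weights
        # Hold the perturbed TF at the value imposed by the perturbation
        delta[:, tf_index] = delta_input[:, tf_index]
        # Simulated expression cannot go below zero
        delta = np.maximum(expr + delta, 0.0) - expr
    return delta
```

For a chain TF → A → B, one iteration shifts only A, while a second iteration also reaches B, which is why the number of iterations controls how far indirect effects are followed.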

Applications

CellOracle is principally used for two research applications. The first is hypothesis generation in developmental and cell biology: given a single-cell atlas of a tissue or organism, CellOracle can be used to systematically screen TF perturbations and identify those predicted to drive cells toward specific target states, generating a shortlist of candidates for experimental follow-up. This is particularly valuable in direct reprogramming and differentiation protocol development, where identifying the TF combinations that guide progenitors toward a desired cell type is a bottleneck. The second application is mechanistic dissection of known perturbation phenotypes: when a TF knockout produces a known developmental defect, CellOracle can identify which downstream genes and pathways are predicted to be most directly affected, providing mechanistic hypotheses for the observed phenotype. The tool is also used in cancer biology to understand how TF dysregulation contributes to oncogenic cell state transitions.
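The screening use case can be sketched as a ranking problem: simulate each candidate TF perturbation and score how well the simulated shift points from the starting state toward the desired target state. The example below uses cosine similarity on cluster-mean expression as the score. That scoring rule, the function name, and all parameters are assumptions made for illustration, not the scoring scheme CellOracle itself provides.

```python
import numpy as np

def rank_tf_perturbations(mean_expr, weights, tf_indices, target_expr,
                          perturbed_value=0.0, n_propagation=3):
    """Rank TFs by how well their simulated shift points toward a target state.

    mean_expr:   (n_genes,) mean expression of the starting cell state
    weights:     (n_genes, n_genes); weights[i, j] = effect of regulator i on target j
    target_expr: (n_genes,) mean expression of the desired cell state
    Returns a list of (tf_index, cosine_score), best first.
    """
    mean_expr = np.asarray(mean_expr, dtype=float)
    desired = np.asarray(target_expr, dtype=float) - mean_expr
    scores = {}
    for tf in tf_indices:
        delta = np.zeros_like(mean_expr)
        held = perturbed_value - mean_expr[tf]
        delta[tf] = held
        for _ in range(n_propagation):
            delta = delta @ weights  # propagate one step downstream
            delta[tf] = held         # keep the perturbed TF fixed
        denom = np.linalg.norm(delta) * np.linalg.norm(desired)
        scores[tf] = float(delta @ desired / denom) if denom > 0 else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A score near 1 marks a perturbation predicted to move cells toward the target state, a score near -1 marks one predicted to move them away, and the top of the list forms the shortlist for experimental follow-up.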

Impact

CellOracle's publication in Nature in February 2023 brought multi-omic GRN inference and in silico TF perturbation to a broad audience of biologists, establishing a framework that connects single-cell atlas data to mechanistic regulatory biology. Its ability to reproduce established TF perturbation phenotypes in multiple biological systems supports the view that computationally inferred GRNs carry mechanistic information beyond statistical correlation. The shift vector visualization approach has been particularly influential, providing an intuitive way to interpret perturbation predictions in the context of cell state landscapes. CellOracle has been adopted across developmental biology, haematopoiesis research, and stem cell biology, and its documentation and tutorials have made it accessible to wet-lab biologists without deep computational backgrounds. The framework stimulated subsequent work on improving GRN inference accuracy from multi-omic data and on more sophisticated perturbation propagation models.

Sources:

  • GitHub - morris-lab/CellOracle
  • Dissecting cell identity via network inference and in silico gene perturbation | Nature
  • CellOracle Documentation

Tags

gene regulatory network inference, in silico perturbation, transcription factor analysis, graph neural network, self-supervised, transfer learning, single-cell transcriptomics, chromatin, transcription factors

Resources

  • GitHub Repository
  • Research Paper
  • Documentation