bio.rodeo



scMoFormer

Michigan State University

Transformer framework for single-cell multi-omics that predicts cross-modality relationships using heterogeneous graphs of cells, genes, and proteins.

Released: 2023

Overview

scMoFormer is a transformer-based framework for single-cell multi-omics prediction, developed by researchers at Michigan State University and Emory University and published at the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023). The model addresses one of the central challenges in single-cell genomics: predicting the abundance of one molecular modality — such as surface protein levels — from measurements of another, such as RNA gene expression, within the same individual cell.

Prior approaches to this cross-modality prediction problem relied on static interaction graphs processed by graph neural networks (GNNs). Static graphs cannot incorporate information about the downstream prediction task during training, and deeply stacked GNN layers suffer from over-smoothing. scMoFormer replaces this paradigm with an end-to-end transformer architecture that learns dynamic attention weights guided by the target task, enabling it to capture both within-modality structure and between-modality relationships in a single unified framework.

The model demonstrated its practical strength in the NeurIPS 2022 Open Problems in Single Cell Analysis competition on Kaggle, earning a silver medal by ranking 24th out of 1,221 competing teams — the top 2% — without using ensemble methods. This result established scMoFormer as a competitive baseline for multimodal single-cell prediction.

Key Features

  • Heterogeneous graph construction: Builds a multimodal graph with four subgraphs encoding protein-protein, gene-gene, gene-protein, and cell-gene interactions. External biological knowledge from databases such as STRING is incorporated at construction time, grounding the model in established molecular biology.

  • Three modality-specific transformers: Dedicated transformer modules process cells, genes, and proteins independently. The cell transformer uses kernelized (linearized) attention to reduce computational complexity from quadratic to linear in the number of cells. Gene and protein transformers combine GNN blocks with global attention to capture both local neighborhood structure and long-range dependencies.

  • Cross-modality aggregation via GraphSAGE: After modality-specific representations are computed, a GraphSAGE-based message-passing step bridges the three transformers, propagating information across modality boundaries and producing integrated cell embeddings for final prediction.

  • External domain knowledge integration: The framework is explicitly designed to accept prior biological knowledge (e.g., protein interaction networks, gene regulatory relationships) as graph edges, allowing wet-lab domain knowledge to directly inform model structure rather than being learned from scratch.

  • Linearized attention for scalability: The kernelized attention mechanism in the cell transformer enables processing of datasets with tens of thousands of cells without the quadratic memory cost of standard self-attention, making the model practical for real single-cell atlas-scale data.
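The kernelized attention idea can be sketched in a few lines: applying a positive feature map φ to queries and keys lets attention be computed as φ(Q)(φ(K)ᵀV), so the n×n cell-by-cell matrix is never materialized. The feature map (elu + 1) and all names below are illustrative assumptions, not scMoFormer's exact implementation.

```python
# Sketch of kernelized (linear) attention, the mechanism the cell
# transformer uses to avoid quadratic cost in the number of cells.
# The elu(x)+1 feature map is a common choice, assumed here for illustration.
import numpy as np

def feature_map(x):
    # elu(x) + 1: x + 1 for positive x, exp(x) otherwise (always positive)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n * d^2) attention: phi(Q) @ (phi(K)^T @ V), normalized per query."""
    Qp, Kp = feature_map(Q), feature_map(K)
    KV = Kp.T @ V                 # (d, d_v) summary, independent of n
    Z = Qp @ Kp.sum(axis=0)       # per-query normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n_cells, d = 1000, 16
Q = rng.normal(size=(n_cells, d))
K = rng.normal(size=(n_cells, d))
V = rng.normal(size=(n_cells, d))
out = linear_attention(Q, K, V)   # shape (1000, 16); no (n, n) matrix formed
```

Reordering the matrix products is what turns the quadratic cost into a linear one: the (n, n) attention matrix φ(Q)φ(K)ᵀ is replaced by the small (d, d) summary φ(K)ᵀV.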

Technical Details

scMoFormer processes single-cell multi-omics data through a pipeline that begins with graph construction and ends with a linear prediction head for cross-modality output. The heterogeneous graph encodes four types of edges — protein-protein interactions, gene co-expression relationships, gene-protein co-regulation links, and cell-gene measurement associations — and supports Laplacian and random walk positional encodings to capture structural information within each subgraph. Each of the three transformers (cell, gene, protein) is independently parameterized, and their outputs are fused via GraphSAGE aggregation before passing to the final prediction layer.
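The cross-modality fusion step can be illustrated with a toy GraphSAGE-style mean aggregation over cell-gene edges. All sizes, features, weights, and edges below are invented for illustration; the real model derives its edges from the data and from databases such as STRING, and fuses all three modalities.

```python
# Toy sketch of one GraphSAGE-style aggregation step bridging two
# modalities (cells and genes). Shapes and edges are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes, d = 4, 5, 8

# Node features per modality (stand-ins for transformer outputs)
h_cell = rng.normal(size=(n_cells, d))
h_gene = rng.normal(size=(n_genes, d))

# Cell-gene measurement edges as a binary incidence matrix
cell_gene = (rng.random((n_cells, n_genes)) > 0.5).astype(float)

def sage_aggregate(h_self, h_neigh, adj, W_self, W_neigh):
    """GraphSAGE mean aggregation: combine self features with the
    mean of neighbor features, then apply a ReLU nonlinearity."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)  # avoid div by 0
    neigh_mean = (adj @ h_neigh) / deg
    return np.maximum(h_self @ W_self + neigh_mean @ W_neigh, 0.0)

W_self = rng.normal(size=(d, d)) * 0.1
W_neigh = rng.normal(size=(d, d)) * 0.1
# Cell embeddings enriched with gene-side information, shape (4, 8)
h_cell_new = sage_aggregate(h_cell, h_gene, cell_gene, W_self, W_neigh)
```

In the full model, analogous steps run over the protein-protein, gene-gene, and gene-protein subgraphs as well, so each cell embedding aggregates information from all three modalities before the prediction head.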

On two benchmark datasets derived from the NeurIPS Open Problems single-cell competitions, scMoFormer outperformed prior methods across all reported metrics. On the CITE dataset, it achieved an RMSE of 1.627 and a Pearson correlation of 0.886 against competing baselines including BABEL, CMAE, scMM, and scMoGNN; on the GEX2ADT dataset, it reached an RMSE of 0.420 and a Pearson correlation of 0.877 (both benchmarks pose RNA-to-protein prediction tasks). The model supports four prediction directions: GEX-to-ADT, ADT-to-GEX, GEX-to-ATAC, and ATAC-to-GEX, covering the major cross-modality translation tasks in current single-cell assays.
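For reference, the two reported metrics can be computed as follows. This is a generic sketch of RMSE and Pearson correlation, not the competitions' official scoring code.

```python
# RMSE and Pearson correlation as typically used to score
# cross-modality prediction benchmarks (generic implementations).
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between flat prediction arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson(y_true, y_pred):
    """Pearson correlation: covariance normalized by both std devs."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum()))

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # toy ground-truth protein levels
y_pred = np.array([1.1, 1.9, 3.2, 3.8])   # toy model predictions
score_rmse = rmse(y_true, y_pred)
score_corr = pearson(y_true, y_pred)
```

Note that lower is better for RMSE while higher is better for Pearson correlation, which is why the two reported numbers move in opposite directions across methods.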

Applications

scMoFormer is suited for researchers working with CITE-seq, ASAP-seq, or other joint-profiling technologies that simultaneously measure gene expression and surface protein abundance (or chromatin accessibility) in the same cells. A direct application is cost reduction: joint-profiling assays are expensive, and a well-calibrated model can impute unmeasured modalities from cheaper single-modality experiments. The model is also applicable to data integration tasks, where cells measured in separate modality-specific experiments must be aligned into a shared embedding space for downstream clustering or trajectory analysis.

Impact

scMoFormer contributed a key methodological shift in single-cell multi-omics analysis by demonstrating that end-to-end transformers with task-aware attention can outperform GNN-based methods on cross-modality prediction. Its competitive performance in the NeurIPS 2022 Kaggle challenge — achieved without ensemble methods — provided a reproducible benchmark comparison point for subsequent work in the field. The publicly available implementation under the MIT license, maintained by the OmicsML group, has been adopted as a baseline in multimodal single-cell benchmarking studies. A notable limitation is that the model was developed and evaluated primarily on protein-from-RNA and chromatin-from-RNA prediction tasks; its performance on less well-studied modality pairs and on datasets with higher technical noise or sparser coverage has not been as thoroughly characterized.

Citation

Single-Cell Multimodal Prediction via Transformers

Tang, W., Wen, H., Liu, R., Ding, J., Jin, W., Xie, Y., Liu, H., & Tang, J. (2023). Single-Cell Multimodal Prediction via Transformers. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (pp. 2464-2474). ACM.

DOI: 10.1145/3583780.3615061

Metrics

GitHub

Stars: 27
Forks: 3
Open Issues: 2
Contributors: 3
Last Push: 2y ago
Language: Python
License: MIT

Citations

Total Citations: 16
Influential: 0
References: 51

Tags

transformer · foundation model · multi-omics

Resources

GitHub Repository · Research Paper