Single-cell

GREmLN

Chan Zuckerberg Initiative / Columbia University / Chan Zuckerberg Biohub

A graph-signal-processing foundation model that embeds gene regulatory network structure directly into its attention mechanism for parameter-efficient single-cell transcriptomics.

Released: 2025
Parameters: 10,300,000

Overview

Single-cell RNA sequencing has generated a flood of high-dimensional gene expression data, and foundation models have emerged as a natural tool for learning compressed, generalizable representations of cellular state. Yet a core tension persists in this field: transformer architectures were designed for sequentially ordered tokens — words in sentences, amino acids in proteins — while gene expression profiles have no inherent ordering. Genes within a cell exist in a network of regulatory relationships, not a linear sequence, so forcing a rank-order or arbitrary ordering onto them misses the biological structure that actually governs cell behavior.

GREmLN (Gene Regulatory Embedding-based Large Neural model) directly confronts this limitation. Developed by researchers at the Chan Zuckerberg Initiative, Columbia University (including the laboratory of Andrea Califano), and CZ Biohub, and posted as a preprint in July 2025, GREmLN is a transcriptomics foundation model that encodes gene regulatory network (GRN) structure — and protein-protein interaction (PPI) network topology — directly into its self-attention mechanism via graph signal processing. Rather than treating gene tokens as an unordered bag or imposing an arbitrary positional ranking, GREmLN uses the graph Laplacian of biological molecular networks to constrain which genes attend to which, biasing the model toward biologically meaningful long-range regulatory dependencies from the very first layer of pretraining.

The results are striking: GREmLN achieves superior performance on cell type annotation, graph structure understanding tasks, and fine-tuned perturbation prediction using only 10.3 million parameters — less than one-third the parameter count of comparable baselines and roughly one-tenth of the 100-million-parameter scFoundation model. This parameter efficiency, driven by the inductive bias of network structure rather than raw scale, marks GREmLN as a compelling alternative to brute-force scaling in single-cell foundation models. The model is released as part of the CZI Virtual Cell Platform, reflecting CZI's broader commitment to open, reproducible tools for cell biology.

Key Features

  • Graph-signal-processing attention: GREmLN applies a diffusion kernel to the graph Laplacian of a gene regulatory or PPI network, constructing a kernel Gram matrix that modulates self-attention queries. This encodes non-local gene-gene regulatory dependencies directly into the attention computation, structuring information flow according to known biology rather than positional proximity.

  • Chebyshev polynomial approximation for scalability: Computing the full diffusion kernel over tens of thousands of gene nodes is computationally prohibitive. GREmLN uses a Chebyshev polynomial-based approximation of the kernel Gram matrix, enabling the approach to scale to large molecular interaction graphs and long gene token sequences without sacrificing the long-range dependency structure captured by graph diffusion (a minimal sketch of this approximation follows the list).

  • Biologically informed gene embeddings: Because the attention mechanism is grounded in gene regulatory topology, the resulting gene embeddings linearly reconstruct expression profiles and encode regulatory hierarchy. Embeddings reflect the causal network context of each gene, not just its statistical co-expression patterns, providing representations that are interpretable in terms of known regulatory biology.

  • Parameter efficiency through inductive bias: With 10.3 million learnable parameters, GREmLN outperforms models with 30–100M parameters on cell type annotation benchmarks. The model demonstrates that incorporating biological inductive biases accelerates training convergence and reduces the data requirements typically associated with large-scale pretraining.

  • Generalization to unseen cell types: GREmLN sets new benchmarks in predicting unseen cell types by leveraging its graph-structured priors. Where purely data-driven models must interpolate between cell states observed during pretraining, the regulatory network structure guides the model's behavior in novel biological contexts.

  • Support for multiple molecular interaction graphs: The architecture is not limited to a single GRN. The framework can incorporate validated edges from GRN databases, PPI networks, or diffusion kernels derived from any graph of molecular interactions, gaining further accuracy as the quality and coverage of the underlying network improve.

  • Reverse perturbation prediction: Beyond representation learning, GREmLN can be fine-tuned for reverse perturbation tasks — predicting what genetic perturbation would be needed to shift a cell toward a desired transcriptional state — a capability with direct relevance for therapeutic intervention design.
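
As a concrete sketch of the Chebyshev approximation referenced above: the function below applies a diffusion kernel exp(-beta * L) to a gene signal via the standard three-term Chebyshev recurrence, using only sparse matrix-vector products. The expansion order, kernel parameterization, and names are illustrative assumptions rather than GREmLN's released implementation, and a symmetric (symmetrized) Laplacian is assumed.

```python
import numpy as np
import scipy.sparse as sp
from numpy.polynomial import chebyshev as cheb
from scipy.sparse.linalg import eigsh

def apply_diffusion_kernel(L, x, beta=1.0, order=16):
    """Approximate exp(-beta * L) @ x with a Chebyshev expansion.

    L : sparse, symmetric graph Laplacian of the regulatory network (n x n)
    x : gene signal(s), shape (n,) or (n, d)

    Only sparse matvecs are used, so the dense n x n kernel is never built.
    """
    n = L.shape[0]
    # The largest eigenvalue bounds the spectrum, letting us rescale it to
    # [-1, 1], the interval on which Chebyshev polynomials are defined.
    lmax = float(eigsh(L, k=1, return_eigenvectors=False)[0])
    L_tilde = (2.0 / lmax) * L - sp.identity(n, format="csr")

    # Fit Chebyshev coefficients of f(t) = exp(-beta * lmax * (t + 1) / 2)
    # at Chebyshev nodes (exact interpolation with order + 1 points).
    ts = np.cos(np.pi * (np.arange(order + 1) + 0.5) / (order + 1))
    coeffs = cheb.chebfit(ts, np.exp(-beta * lmax * (ts + 1.0) / 2.0), order)

    # Three-term recurrence: T0(A)x = x, T1(A)x = Ax,
    # Tk(A)x = 2A T{k-1}(A)x - T{k-2}(A)x.
    t_prev, t_curr = x, L_tilde @ x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for k in range(2, order + 1):
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + coeffs[k] * t_curr
    return out
```

Applying this operator to the columns of a query matrix yields kernel-smoothed queries without ever materializing the dense n-by-n Gram matrix.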

Technical Details

GREmLN is a transformer encoder with approximately 10.3 million parameters, pretrained on a corpus of 11 million single-cell RNA-seq profiles spanning 19,000 genes from healthy human cells sourced from the CELLxGENE dataset. The pretraining corpus covers 162 cell types across diverse tissues, providing broad coverage of human transcriptomic space without requiring disease-state data.

The key architectural innovation lies in how graph signal processing is integrated into self-attention. Standard self-attention computes query-key-value interactions based solely on the input token embeddings and positional encodings. GREmLN modifies the attention computation by applying a graph diffusion kernel — specifically a matrix-exponential smoothing of the graph Laplacian constructed from the regulatory network — to the query embeddings before computing attention scores. This means attention weights reflect not just the learned similarity between gene representations, but also the regulatory distance between genes in the known biological network. Genes that are tightly co-regulated in the GRN attend more strongly to each other; genes in separate regulatory modules are attenuated. The Chebyshev polynomial approximation allows this operation to be computed efficiently for graphs with thousands of nodes, avoiding the O(n²) cost of dense kernel matrix construction.
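
As a concrete illustration of kernel-modulated attention, here is a minimal single-head PyTorch sketch in which a precomputed Gram matrix (for instance, the Chebyshev-approximated diffusion kernel above) smooths the queries before attention scores are computed. The class name, single-head layout, and query-side-only placement are simplifying assumptions rather than the paper's exact factorization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphDiffusionAttention(nn.Module):
    """Self-attention whose queries are smoothed by a graph diffusion
    kernel over the gene regulatory network (illustrative sketch).

    kernel : (n_genes, n_genes) Gram matrix K ~ exp(-beta * L), with rows
             aligned to the gene token order.
    """
    def __init__(self, d_model: int, kernel: torch.Tensor):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Registered as a buffer: a fixed biological prior, not a weight.
        self.register_buffer("kernel", kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_genes, d_model)
        q = self.q_proj(x)
        # Diffuse queries over the regulatory graph: each gene's query
        # mixes with those of its network neighbors, so attention scores
        # reflect regulatory distance as well as learned similarity.
        q = torch.einsum("ij,bjd->bid", self.kernel, q)
        k, v = self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v
```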

Pretraining uses a masked gene prediction objective analogous to masked language modeling: a subset of gene tokens in each cell's expression profile is masked, and the model is trained to reconstruct the masked gene identities from surrounding expression context and regulatory structure. Training was completed in a single epoch over the 11-million-cell corpus on 8 NVIDIA H100 80GB GPUs in parallel — an unusually fast pretraining run that reflects the accelerated convergence enabled by the network structure inductive biases.
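
A minimal sketch of one such masked-gene-prediction loss step in PyTorch; the masking rate, mask token, and model interface here are assumptions for illustration, not GREmLN's actual training code.

```python
import torch
import torch.nn.functional as F

def masked_gene_loss(model, gene_ids, expr, mask_frac=0.15, mask_token=0):
    """One masked-gene-prediction step, analogous to masked language
    modeling. Hypothetical interface: `model` maps (gene_ids, expr) to
    per-position logits over the gene vocabulary.

    gene_ids : (batch, n_genes) integer gene tokens
    expr     : (batch, n_genes) float expression values for each cell
    """
    mask = torch.rand_like(expr) < mask_frac           # positions to hide
    corrupted = gene_ids.masked_fill(mask, mask_token)
    logits = model(corrupted, expr)                    # (batch, n_genes, vocab)
    # Reconstruct the identities of the masked genes only.
    return F.cross_entropy(logits[mask], gene_ids[mask])
```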

On downstream benchmarks, GREmLN achieves a macro F1 score of 0.929 on human immune cell type annotation, outperforming scGPT (0.924), Geneformer (0.792), and scFoundation (0.879). These results are particularly notable given the order-of-magnitude parameter count difference relative to scFoundation. Performance gains are consistent across graph structure understanding tasks and perturbation prediction after fine-tuning, demonstrating that the regulatory network embedding generalizes across heterogeneous downstream biological questions.
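
For context, macro F1 is the unweighted mean of per-cell-type F1 scores, so rare cell types count as much as abundant ones, a stricter test than raw accuracy on imbalanced atlases. A minimal computation with scikit-learn:

```python
from sklearn.metrics import f1_score

# Toy labels standing in for true vs. predicted cell types.
y_true = ["T cell", "T cell", "B cell", "NK cell"]
y_pred = ["T cell", "B cell", "B cell", "NK cell"]

# average="macro" gives every cell type equal weight in the mean.
print(f1_score(y_true, y_pred, average="macro"))
```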

Applications

GREmLN is well-suited for computational biologists working with single-cell RNA-seq data who want representations that reflect known regulatory biology rather than purely statistical patterns. Its primary use cases include cell type annotation in novel or undercharacterized tissues, where regulatory network priors help the model generalize beyond the cell types observed in the training set.

The in silico perturbation prediction capability is directly applicable to therapeutic target prioritization: by fine-tuning on perturbation screen data (e.g., Perturb-seq datasets) and using the model's regulatory representations, researchers can predict which gene knockdowns or overexpressions are most likely to redirect a pathological cell state toward a healthy one. Beyond individual perturbations, the model's architecture naturally supports reasoning about combinatorial gene interactions, which is particularly valuable for understanding complex regulatory programs in cancer or developmental biology.

GREmLN is also applicable to network biology tasks — predicting gene network centrality, identifying hub regulators, and characterizing transcription factor target sets — because its attention weights encode regulatory hierarchy. The model is released through the CZI Virtual Cell Platform alongside tutorials that guide users from raw count matrices through embedding extraction and downstream analysis.
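
The snippet below is a hypothetical sketch of such a workflow; the `gremln` package name and `GremlnModel` interface are invented for illustration, so consult the Virtual Cell Platform tutorials for the released API.

```python
import anndata as ad
import numpy as np

# Hypothetical package and class names, used only for illustration.
from gremln import GremlnModel

adata = ad.read_h5ad("pbmc_counts.h5ad")   # raw count matrix, cells x genes
model = GremlnModel.load_pretrained()      # 10.3M-parameter checkpoint

# Extract cell embeddings for annotation or clustering downstream.
cell_emb = model.embed_cells(adata)        # (n_cells, d_model)
adata.obsm["X_gremln"] = np.asarray(cell_emb)
```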

Impact

GREmLN represents a meaningful methodological advance in single-cell foundation model design by demonstrating that biological inductive biases — specifically, gene regulatory network topology — can substitute for large parameter counts to achieve competitive or superior performance. This finding challenges the prevailing assumption that larger models trained on more data are always better, and it suggests that the field may be approaching single-cell transcriptomics modeling in an unnecessarily data-hungry way when rich prior knowledge about gene regulation is available and underutilized. The 10.3M-parameter model outperforming 100M-parameter alternatives on key benchmarks is a result that will likely motivate renewed attention to network-aware architectures across the single-cell and genomics foundation model communities.

From a scientific standpoint, the model's ability to embed regulatory relationships into learned representations makes its outputs more interpretable: attention maps can be examined to understand which regulatory relationships the model uses to classify a cell type or predict a perturbation outcome, in contrast to black-box large language model approaches.

A key limitation is that the quality of GREmLN's representations depends on the completeness and accuracy of the underlying gene regulatory network used during training; for organisms or cell types with poorly characterized GRNs, the inductive bias may be less beneficial, or even counterproductive if the network is substantially incorrect. The model is also currently trained only on healthy human cells, so performance on disease-specific transcriptional programs may require additional fine-tuning on disease-relevant data.

As part of the CZI Virtual Cell Platform's open-model ecosystem, GREmLN joins a growing suite of tools designed to make single-cell AI accessible to the broader biology community, and its parameter efficiency makes it particularly deployable in resource-constrained environments.

Tags

cell type annotation · gene network · perturbation prediction · transformer · graph neural network · foundation model · self-supervised · transcriptomics · chromatin

Resources

GitHub Repository · Research Paper · Official Website