Chan Zuckerberg Initiative
A single-cell perturbation model that augments scGPT with gene-level language embeddings from NCBI, UniProt, and Gene Ontology to improve multi-gene perturbation prediction.
Predicting what happens to a cell's transcriptome when a gene is perturbed — knocked out, overexpressed, or chemically inhibited — is one of the core challenges of functional genomics. Large-scale perturbation screens such as Perturb-seq link genetic perturbations to single-cell transcriptomic readouts at scale, but they cannot exhaustively cover all possible perturbations, all cell types, or all combinations of multi-gene interventions. Computational models that can generalize from a subset of observed perturbations to predict the effect of unseen ones are therefore enormously valuable for hypothesis generation and experimental prioritization.
Most single-cell foundation models approach this problem by learning from scRNA-seq count data alone. Models like scGPT are pretrained on tens of millions of transcriptomes, learning rich representations of gene expression patterns across cell types and biological contexts. These representations capture statistical regularities in gene co-expression and cell state transitions, and they transfer well to perturbation prediction when fine-tuned on Perturb-seq data. However, scRNA-seq counts encode the observable outputs of gene regulation, not the underlying molecular mechanisms — and there is a wealth of curated, expert-annotated knowledge about gene function, molecular mechanisms, protein properties, and subcellular localization that is not captured by expression data alone.
scGenePT, developed by Ana-Maria Istrate, Donghui Li, and Theofanis Karaletsos at the Chan Zuckerberg Initiative and posted as a preprint in October 2024, investigates a specific and practically important question: can textual knowledge about genes, encoded as language embeddings, improve the prediction of single-cell perturbation outcomes beyond what scRNA-seq data alone can achieve? The answer is yes — but with important nuances about which types of knowledge help most and in which experimental contexts. scGenePT extends scGPT by injecting gene-level language embeddings, derived from three distinct knowledge sources (NCBI gene descriptions, UniProt protein summaries, and Gene Ontology annotations across molecular function, biological process, and cellular component), directly into the gene representation layer. The experiments demonstrate that this textual grounding adds complementary value to data-driven representations, particularly for predicting the effects of gene pairs that were not individually observed during training.
The model is available through the CZI Virtual Cell Platform and represents an important step toward integrating the structured, curated knowledge encoded in biological databases with the statistical patterns learned from large-scale single-cell omics data.
Language embedding injection at the gene level: Rather than conditioning the model on text at the cell or dataset level, scGenePT injects language embeddings at the level of individual gene tokens. Each gene in the scRNA-seq input receives an additional embedding derived from text descriptions of that gene, allowing the model to combine experimental expression context with curated biological knowledge for every gene simultaneously.
Three complementary knowledge sources: scGenePT tests three distinct text sources — NCBI gene summary descriptions (gene-level information), UniProt protein summaries (protein-level function and biochemistry for protein-coding genes), and Gene Ontology annotations across Molecular Function, Biological Process, and Cellular Component axes. Each source encodes a different facet of gene biology, and their effects on perturbation prediction performance are distinct and informative.
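Before a knowledge source can be embedded, its annotations must be flattened into a single text string per gene. The sketch below shows one plausible way to do this for GO annotations; the function name and templating are illustrative assumptions, not the paper's exact format.

```python
def go_annotation_text(gene, annots):
    """Flatten GO annotations into one description string per gene.

    The exact templating scGenePT uses may differ; this layout is an
    illustrative assumption, not the paper's format.
    """
    axis_labels = {
        "molecular_function": "molecular function",
        "biological_process": "biological process",
        "cellular_component": "cellular component",
    }
    parts = [gene]
    for axis, label in axis_labels.items():
        terms = annots.get(axis, [])
        if terms:  # skip axes with no curated terms for this gene
            parts.append(f"{label}: " + ", ".join(terms))
    return "; ".join(parts)

text = go_annotation_text(
    "BAK1",
    {
        "molecular_function": ["protein heterodimerization activity"],
        "biological_process": ["apoptotic process"],
        "cellular_component": ["mitochondrial outer membrane"],
    },
)
print(text)
```

The resulting string would then be passed to a pretrained text-embedding model, one gene at a time, to produce the fixed gene-level language vectors.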
Subcellular localization improves single-gene perturbation prediction: Among the tested knowledge sources, GO Cellular Component annotations (encoding where in the cell a protein is localized) provide the greatest improvement for predicting single-gene perturbation effects. This finding suggests that knowing where a protein acts constrains the downstream transcriptional consequences of its perturbation in a way that is not captured by co-expression statistics alone.
Protein information improves multi-gene perturbation prediction: For predicting the combined effect of two-gene perturbations where the interaction between perturbed genes matters, UniProt protein summaries — encoding biochemical function, domains, and interaction partners at the molecular level — provide the largest gain. This reflects the fact that combinatorial perturbation effects are often mechanistically mediated by protein-protein interactions and shared pathway membership.
Generalization to unseen gene combinations: scGenePT demonstrates particular value for predicting two-gene perturbation effects when neither gene was observed individually during training. Language embeddings provide prior information about these genes' functions that allows the model to make reasonable predictions in an otherwise zero-shot regime, where a purely data-driven model has no gene-specific signal to draw on.
Built on scGPT whole-human: scGenePT inherits the strong single-cell representations of scGPT, which was pretrained on ~33 million human single-cell transcriptomes, and extends them with language embeddings without retraining from scratch. This makes the language injection a targeted, compute-efficient modification rather than a complete architecture overhaul, with the resulting model containing approximately 51.3 million parameters.
Additive and complementary language signals: The improvement from language embeddings is additive to the improvement from scRNA-seq pretraining — the two sources of information encode genuinely distinct features of gene biology. This complementarity supports the conclusion that language knowledge encodes mechanistic and functional information that is not derivable from co-expression patterns alone.
scGenePT is built on the scGPT architecture, a transformer that adapts GPT-style generative pretraining to non-sequential single-cell data and was pretrained on approximately 33 million human single-cell transcriptomes from the CELLxGENE Discover corpus. The scGPT model processes each cell as a sequence of (gene, expression value) token pairs and is pretrained with a gene expression prediction objective. For perturbation prediction, the model is fine-tuned on Perturb-seq datasets that pair CRISPR perturbations with single-cell RNA-seq readouts.
The scGenePT modification introduces language embeddings at the gene token level. For each gene in the vocabulary, a text representation of that gene is assembled from one of the knowledge sources (NCBI gene card summary, UniProt protein summary, or GO annotation text), and this text is embedded using a pretrained large language model. The resulting gene language embedding is added to the gene's learned representation within the scGPT architecture as an additional prior, allowing the model to access functional annotation knowledge at every position where that gene appears in an expression sequence.
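The injection step described above amounts to adding a (projected) fixed text embedding to each learned gene-token embedding. The sketch below illustrates this in NumPy under assumed dimensions; the variable names, sizes, and linear projection are illustrative placeholders, not the model's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: scGPT gene-token embedding size and the (typically
# larger) text-embedding size produced by the pretrained LLM.
D_MODEL, D_TEXT, N_GENES = 512, 1536, 4

# Learned scGPT gene-token embeddings (one row per gene in the input cell).
gene_tok = rng.normal(size=(N_GENES, D_MODEL)).astype(np.float32)

# Fixed, offline-computed language embeddings for the same genes
# (e.g. from NCBI gene summaries embedded with a pretrained LLM).
gene_lang = rng.normal(size=(N_GENES, D_TEXT)).astype(np.float32)

# A projection maps text embeddings into model space; random weights here
# stand in for a learned layer.
W_proj = rng.normal(scale=0.02, size=(D_TEXT, D_MODEL)).astype(np.float32)

def inject_language(gene_tok, gene_lang, W_proj):
    """Add projected gene-level language embeddings to gene-token embeddings."""
    return gene_tok + gene_lang @ W_proj

h = inject_language(gene_tok, gene_lang, W_proj)
print(h.shape)  # (4, 512): one language-grounded embedding per gene token
```

Because the operation is per-token addition, the language prior is available at every position where a gene appears, without changing the transformer's sequence length or attention pattern.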
The model was evaluated on two benchmark Perturb-seq datasets, from the Norman et al. and Adamson et al. studies: the Norman et al. screen includes both single-gene and two-gene combinatorial CRISPRa perturbations, while the Adamson et al. screen covers single-gene CRISPRi perturbations, both with scRNA-seq readouts in human cell lines. Performance was measured by the Pearson correlation between predicted and observed mean differential expression profiles across held-out perturbations, stratified by whether the perturbed genes appeared in the training set, the validation set, or an out-of-distribution test set.
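The evaluation metric can be made concrete with a small sketch: compute each perturbation's mean differential expression profile (perturbed mean minus control mean, per gene), then correlate predicted and observed profiles. The toy data and variable names below are illustrative, not taken from the benchmark datasets.

```python
import numpy as np

def mean_delta(perturbed, control):
    """Mean differential expression profile: per-gene mean of perturbed
    cells minus per-gene mean of control cells (rows = cells, cols = genes)."""
    return perturbed.mean(axis=0) - control.mean(axis=0)

def pearson_score(pred_profile, obs_profile):
    """Pearson correlation between predicted and observed delta profiles."""
    return float(np.corrcoef(pred_profile, obs_profile)[0, 1])

rng = np.random.default_rng(1)
ctrl = rng.poisson(5.0, size=(200, 50)).astype(float)  # control cells x genes
obs = ctrl + rng.normal(0.0, 0.5, size=ctrl.shape)     # toy perturbed cells
obs[:, :10] += 3.0                                     # perturbation shifts 10 genes

obs_delta = mean_delta(obs, ctrl)
# A hypothetical model's prediction: the true profile plus a little noise.
pred_delta = obs_delta + rng.normal(0.0, 0.2, size=obs_delta.shape)
print(round(pearson_score(pred_delta, obs_delta), 3))
```

Scores near 1 indicate the model recovered the direction and magnitude of the perturbation's transcriptional effect; the benchmark stratifies this score by whether the perturbed genes were seen during training.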
With approximately 51.3 million total parameters (inherited from scGPT whole-human with minimal additional parameters for the language embedding integration layer), scGenePT is computationally accessible for fine-tuning on standard GPU hardware. The language embeddings themselves are generated offline from the pretrained LLM and stored as fixed vectors, meaning no additional LLM inference is required at fine-tuning or inference time for the perturbation prediction task.
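The offline embedding workflow amounts to a one-time precomputation pass that builds a fixed gene-to-vector lookup table. A minimal sketch, in which the deterministic `embed_text` helper is a stand-in for a real pretrained text-embedding model and the gene descriptions are stubs:

```python
import numpy as np

def embed_text(text, dim=64):
    """Toy deterministic stand-in for a pretrained LLM text encoder.

    A real pipeline would call an actual embedding model here, once per
    gene description; this stub only mimics fixed, reproducible output.
    """
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).normal(size=dim).astype(np.float32)

# Gene descriptions would come from NCBI/UniProt/GO dumps; these are stubs.
gene_texts = {
    "KLF1": "Krueppel-like factor 1, an erythroid transcription factor ...",
    "BAK1": "BCL2 antagonist/killer 1, a pro-apoptotic regulator ...",
}

# Embed once, offline; fine-tuning and inference then read fixed vectors
# from this lookup table with no further LLM calls.
embedding_table = {gene: embed_text(desc) for gene, desc in gene_texts.items()}

vec = embedding_table["KLF1"]
print(vec.shape)  # (64,)
```

Because the table is fixed, updating the model's prior knowledge after database revisions requires regenerating these vectors, which is the staleness limitation noted later in this document.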
Key quantitative findings include: NCBI and UniProt text embeddings provide statistically significant improvement in perturbation prediction correlation on held-out single-gene perturbations; GO Cellular Component annotations provide the largest gain specifically for predicting single-gene perturbation effects; UniProt protein summaries provide the largest gain for predicting two-gene combinatorial perturbation effects, particularly for unseen gene pairs; and text embeddings alone (without scRNA-seq pretraining) are substantially inferior to the combined model, confirming that language and expression data are genuinely complementary rather than redundant.
scGenePT is designed for researchers running or interpreting CRISPR perturbation screens who want to computationally predict the transcriptional consequences of genetic interventions before committing to experiments, or who want to extrapolate from a partial screen to unstudied perturbations. The primary use case is perturbation prediction generalization: given a Perturb-seq dataset that covers a subset of possible single-gene or two-gene perturbations in a cell type, scGenePT can predict the expression profiles of unseen perturbations by leveraging both the learned expression context from pretraining and the functional annotation context from gene-level language embeddings. This is particularly useful when experimental resources are limited and researchers need to prioritize the perturbations most likely to produce specific desired effects on gene expression programs.

In target identification workflows, scGenePT can screen candidate gene targets computationally, predicting which knockdowns are most likely to suppress or activate a transcriptional program associated with disease. For combination therapy research, the model's improved prediction of two-gene combinatorial effects offers a path to prioritizing synergistic or antagonistic gene pairs for experimental validation, mitigating the quadratic scaling of exhaustive combinatorial screens.

The model is also relevant as a scientific tool for studying how biological knowledge is encoded in gene expression data versus curated databases; its systematic evaluation of different knowledge sources provides practical guidance for future multimodal model development.
scGenePT addresses a fundamental question in biological foundation model design: are large-scale transcriptomics training corpora sufficient to capture all biologically relevant gene information, or does structured curated knowledge from biological databases add independent value? The finding that language embeddings provide additive and complementary value confirms the latter: scRNA-seq data and biological databases encode genuinely distinct, non-redundant information about gene function. The source-specificity of the gains (subcellular localization for single-gene effects, protein biochemistry for combinatorial effects) provides actionable guidance for future multimodal model development by identifying which types of prior knowledge are most relevant for which types of prediction tasks.

scGenePT also makes a methodological contribution by demonstrating that knowledge injection at the gene token level is a viable and computationally efficient strategy for grounding single-cell language models in curated biology, without requiring joint pretraining from scratch on paired text-expression datasets.

A key limitation is that the model was fine-tuned and evaluated primarily on Perturb-seq datasets in specific human cell lines, and generalization to other cell types, organisms, or perturbation modalities (e.g., drug treatments rather than CRISPR perturbations) has not been systematically established. The text knowledge sources are also static snapshots of database content, meaning the model cannot dynamically update its prior knowledge as new biological information accumulates without recomputing the gene embeddings.

As part of the CZI Virtual Cell Platform ecosystem, scGenePT contributes to a broader research program building towards a comprehensive AI-powered model of cell biology that integrates molecular mechanisms with observable cellular phenotypes.