Framework that converts single-cell gene expression profiles into ranked gene-name sequences, enabling standard LLMs to generate, annotate, and analyze cells.
Cell2Sentence (C2S) is a framework developed at Yale University that bridges single-cell transcriptomics and natural language processing by translating cellular gene expression profiles into plain-text sequences. The central idea is elegantly simple: for each cell, genes are sorted in descending order of expression level, and their names are concatenated into a space-separated string called a "cell sentence." This transformation allows any off-the-shelf large language model (LLM) — without specialized biological architecture — to be fine-tuned on transcriptomic data using standard language modeling objectives.
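The rank-ordering transformation described above can be sketched in a few lines of plain Python. The gene names and counts below are illustrative, and the `top_k` cutoff stands in for truncating a sentence to a fixed number of top-ranked genes; this is a conceptual sketch, not the package's implementation.

```python
def cell_to_sentence(gene_names, counts, top_k=None):
    """Rank genes by expression (descending) and join their names into a
    space-separated "cell sentence". Zero-expression genes are dropped."""
    ranked = sorted(zip(gene_names, counts), key=lambda gc: gc[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return " ".join(name for name, count in ranked if count > 0)

# Toy example with four marker genes and made-up expression values.
genes = ["CD3D", "MS4A1", "LYZ", "GNLY"]
expr = [5.0, 0.0, 12.0, 3.0]
print(cell_to_sentence(genes, expr))  # LYZ CD3D GNLY
```

In practice the counts would come from a normalized expression matrix (e.g. an AnnData `X` row), and ties in expression would be broken by a fixed gene ordering.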
The method was developed by Daniel Levine, Syed Asad Rizvi, Sacha Lévy, David van Dijk, and colleagues at Yale's Department of Computer Science and School of Medicine, with collaborators from EPFL, the University of Pennsylvania, and Google. The work was published at the 41st International Conference on Machine Learning (ICML 2024) in PMLR volume 235. A companion preprint on C2S-Scale extended the approach to models up to 27 billion parameters trained on over 57 million human and mouse cells.
Cell2Sentence occupies a distinct niche among single-cell foundation models. Unlike architectures such as scGPT or Geneformer that were designed from the ground up for transcriptomics, C2S deliberately avoids biology-specific inductive biases. Instead, it exploits the representational knowledge already encoded in language-pretrained models, then fine-tunes them on biological data. This positions C2S as both a practical tool and a conceptual demonstration that the text modality is a viable and powerful substrate for single-cell biology.
In the original C2S framework, GPT-2 variants and Pythia-160m serve as the backbone models, fine-tuned with standard causal language modeling (cross-entropy loss) on cell sentences. Sequence lengths ranged from 100 genes in resource-constrained experiments up to roughly 9,200 tokens for full-length cell sentences. Three main training datasets were used: an immune tissue atlas of 273,502 cells spanning 35 cell types (Conde et al., 2022), a cytokine stimulation dataset of approximately 29,500 cells with 140 combinatorial labels (Dong et al., 2023), and a multi-tissue CellxGene corpus of 37 million cells across 99 human studies.
Benchmark evaluations demonstrated strong performance across tasks. For cell type generation on the immune tissue dataset, C2S (Pythia-160m) achieved a k-NN accuracy of 0.2746 (k=10), outperforming scGen (0.2377), scVI (0.2425), and scGPT (0.1811). On Gromov-Wasserstein distance (a measure of distributional fidelity), C2S scored 54.30 versus 72.02 for scDiffusion. For combinatorial cell label prediction on the cytokine stimulation dataset, C2S reached 63.9% accuracy versus 60.0% for Geneformer and 41.9% for scGPT. Perturbation prediction performance was particularly notable: C2S achieved Pearson R of 0.9734 and Spearman R of 0.9752 for predicting differentially expressed genes, substantially outperforming scGen (Pearson R 0.7187) and scGPT (Pearson R 0.1299). The C2S-Scale extension, built on Gemma-2 backbones at 1B, 2B, and 27B parameters, scales training to the corpus of over 57 million human and mouse cells.
Cell2Sentence is designed for researchers working with single-cell RNA-sequencing data who want to leverage modern LLM infrastructure for biological analysis. Primary applications include automated cell type annotation, replacing manual marker-gene inspection with prompt-based classification, and conditional cell generation, where a researcher specifies a cell type label and the model synthesizes a plausible transcriptomic profile. The perturbation prediction capability is particularly relevant for drug discovery: given a baseline cell sentence, C2S can predict how the expression profile would change following a genetic or chemical perturbation. A more unusual capability is abstract generation, in which the model produces natural language descriptions of a cell's biological context directly from its expression data, enabling downstream reasoning with general-purpose LLMs. The framework integrates with standard Python data science tooling (AnnData, Scanpy) and is available as a PyPI package.
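Prompt-based annotation amounts to wrapping a cell sentence in a text template and letting a fine-tuned model complete it with a label. The template wording below is hypothetical (the cell2sentence package defines its own prompt formats), and the model call is stubbed out; only the string plumbing is shown.

```python
def make_annotation_prompt(cell_sentence, n_genes=100):
    """Build a cell type annotation prompt from the top-ranked genes of a
    cell sentence. Truncating to n_genes keeps the prompt short."""
    genes = cell_sentence.split()[:n_genes]
    return ("The following are the most highly expressed genes of a cell, "
            "in descending order: " + " ".join(genes) + ". Cell type:")

prompt = make_annotation_prompt("LYZ CD3D GNLY MS4A1", n_genes=3)
print(prompt)
# A fine-tuned C2S model would then complete the prompt with a label,
# e.g. model.generate(prompt) returning a string such as "monocyte"
# (illustrative only; not the package's actual API).
```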
Cell2Sentence provided an influential proof of concept that single-cell biology could be reframed as a language modeling problem without sacrificing performance; in several benchmarks it outperformed purpose-built biological foundation models. Its publication at ICML 2024 brought the approach to the attention of the machine learning community, and the C2S-Scale extension demonstrates that the framework benefits from scaling in ways consistent with general language model scaling laws. A notable limitation is that the rank-ordering transformation discards absolute expression magnitude, retaining only relative ordering within a cell; an approximately invertible mapping between gene rank and expression level recovers much of this information, but the round trip remains lossy. Additionally, cell sentences can be long (thousands of tokens), making full-sequence fine-tuning computationally expensive for very large models. Nonetheless, Cell2Sentence established that the linguistic framing of transcriptomics is scientifically productive, inspiring further work on unified text-and-biology models and multi-modal biological LLMs.
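The approximate inverse of the rank transform fits a linear relationship between log rank and log expression, then evaluates that fit to assign an expression value to each rank. The per-cell sketch below (assuming nonzero expression values already sorted in descending order) is a simplified illustration of that idea, not the paper's exact fitting procedure.

```python
import math

def fit_rank_model(expr_sorted):
    """Ordinary least squares fit of log(expression) ~ a + b * log(rank + 1).

    expr_sorted: nonzero expression values sorted descending, i.e. in the
    order their genes appear in the cell sentence."""
    xs = [math.log(r + 1) for r in range(len(expr_sorted))]
    ys = [math.log(e) for e in expr_sorted]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def rank_to_expression(rank, a, b):
    """Approximate expression for the gene at a given rank (0 = highest)."""
    return math.exp(a + b * math.log(rank + 1))
```

Because only the fitted curve is kept, two cells with the same gene ordering but different magnitudes map back to similar profiles, which is exactly the lossy compression noted above.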
Levine, D., et al. (2024). Cell2Sentence: Teaching Large Language Models the Language of Biology. bioRxiv.
DOI: 10.1101/2023.09.11.557287