Framework that converts single-cell gene expression profiles into ranked gene-name sequences, enabling standard LLMs to generate, annotate, and analyze cells.
Cell2Sentence (C2S) is a framework developed at Yale University that bridges single-cell transcriptomics and natural language processing by translating cellular gene expression profiles into plain-text sequences. The central idea is elegantly simple: for each cell, genes are sorted in descending order of expression level, and their names are concatenated into a space-separated string called a "cell sentence." This transformation allows any off-the-shelf large language model (LLM) — without specialized biological architecture — to be fine-tuned on transcriptomic data using standard language modeling objectives.
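The rank-ordering transformation described above can be sketched in a few lines of plain Python. The gene names and counts below are illustrative, and the `top_k` cutoff stands in for truncating a sentence to a fixed number of top-ranked genes; this is a conceptual sketch, not the package's implementation.

```python
def cell_to_sentence(gene_names, counts, top_k=None):
    """Rank genes by expression (descending) and join their names into a
    space-separated "cell sentence". Zero-expression genes are dropped."""
    ranked = sorted(zip(gene_names, counts), key=lambda gc: gc[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return " ".join(name for name, count in ranked if count > 0)

# Toy example with four marker genes and made-up expression values.
genes = ["CD3D", "MS4A1", "LYZ", "GNLY"]
expr = [5.0, 0.0, 12.0, 3.0]
print(cell_to_sentence(genes, expr))  # LYZ CD3D GNLY
```

In practice the counts would come from a normalized expression matrix (e.g. an AnnData `X` row), and ties in expression would be broken by a fixed gene ordering.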
The method was developed by Daniel Levine, Syed Asad Rizvi, Sacha Lévy, David van Dijk, and colleagues at Yale's Department of Computer Science and School of Medicine, with collaborators from EPFL, the University of Pennsylvania, and Google. The work was published at the 41st International Conference on Machine Learning (ICML 2024) in PMLR volume 235. A companion preprint on C2S-Scale extended the approach to models up to 27 billion parameters trained on over 57 million human and mouse cells.
Cell2Sentence occupies a distinct niche among single-cell foundation models. Unlike architectures such as scGPT or Geneformer that were designed from the ground up for transcriptomics, C2S deliberately avoids biology-specific inductive biases. Instead, it exploits the representational knowledge already encoded in language-pretrained models, then fine-tunes them on biological data. This positions C2S as both a practical tool and a conceptual demonstration that the text modality is a viable and powerful substrate for single-cell biology.
In the original C2S framework, GPT-2 variants and Pythia-160m serve as the backbone models, fine-tuned with standard causal language modeling (cross-entropy loss) on cell sentences. Sequence lengths ranged from 100 genes in resource-constrained experiments up to roughly 9,200 tokens for full-length cell sentences. Three main training datasets were used: an immune tissue atlas of 273,502 cells spanning 35 cell types (Conde et al., 2022), a cytokine stimulation dataset of approximately 29,500 cells with 140 combinatorial labels (Dong et al., 2023), and a multi-tissue CellxGene corpus of 37 million cells across 99 human studies.
Benchmark evaluations demonstrated strong performance across tasks. For cell type generation on the immune tissue dataset, C2S (Pythia-160m) achieved a k-NN accuracy of 0.2746 (k=10), outperforming scGen (0.2377), scVI (0.2425), and scGPT (0.1811). On Gromov-Wasserstein distance (a measure of distributional fidelity), C2S scored 54.30 versus 72.02 for scDiffusion. For combinatorial cell label prediction on the cytokine stimulation dataset, C2S reached 63.9% accuracy versus 60.0% for Geneformer and 41.9% for scGPT. Perturbation prediction performance was particularly notable: C2S achieved Pearson R of 0.9734 and Spearman R of 0.9752 for predicting differentially expressed genes, substantially outperforming scGen (Pearson R 0.7187) and scGPT (Pearson R 0.1299). The C2S-Scale extension, built on Gemma-2 backbones at 1B, 2B, and 27B parameters, scales training to the corpus of over 57 million human and mouse cells.
Cell2Sentence is designed for researchers working with single-cell RNA-sequencing data who want to leverage modern LLM infrastructure for biological analysis. Primary applications include automated cell type annotation, replacing manual marker-gene inspection with prompt-based classification, and conditional cell generation, where a researcher specifies a cell type label and the model synthesizes a plausible transcriptomic profile. The perturbation prediction capability is particularly relevant for drug discovery: given a baseline cell sentence, C2S can predict how the expression profile would change following a genetic or chemical perturbation. A more unusual capability is abstract generation, in which the model produces natural language descriptions of a cell's biological context directly from its expression data, enabling downstream reasoning with general-purpose LLMs. The framework integrates with standard Python data science tooling (AnnData, Scanpy) and is available as a PyPI package.
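Prompt-based annotation amounts to wrapping a cell sentence in a text template and letting a fine-tuned model complete it with a label. The template wording below is hypothetical (the cell2sentence package defines its own prompt formats), and the model call is stubbed out; only the string plumbing is shown.

```python
def make_annotation_prompt(cell_sentence, n_genes=100):
    """Build a cell type annotation prompt from the top-ranked genes of a
    cell sentence. Truncating to n_genes keeps the prompt short."""
    genes = cell_sentence.split()[:n_genes]
    return ("The following are the most highly expressed genes of a cell, "
            "in descending order: " + " ".join(genes) + ". Cell type:")

prompt = make_annotation_prompt("LYZ CD3D GNLY MS4A1", n_genes=3)
print(prompt)
# A fine-tuned C2S model would then complete the prompt with a label,
# e.g. model.generate(prompt) returning a string such as "monocyte"
# (illustrative only; not the package's actual API).
```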
Cell2Sentence provided an influential proof of concept that single-cell biology could be reframed as a language modeling problem without sacrificing performance; in several benchmarks it outperformed purpose-built biological foundation models. Its publication at ICML 2024 brought the approach to the attention of the machine learning community, and the C2S-Scale extension demonstrates that the framework benefits from scaling in ways consistent with general language model scaling laws. A notable limitation is that the rank-ordering transformation discards absolute expression magnitude, retaining only relative ordering within a cell; an approximately invertible mapping between gene rank and expression level recovers much of this information, but the round trip remains lossy. Additionally, cell sentences can be long (thousands of tokens), making full-sequence fine-tuning computationally expensive for very large models. Nonetheless, Cell2Sentence established that the linguistic framing of transcriptomics is scientifically productive, inspiring further work on unified text-and-biology models and multi-modal biological LLMs.
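The approximate inverse of the rank transform fits a linear relationship between log rank and log expression, then evaluates that fit to assign an expression value to each rank. The per-cell sketch below (assuming nonzero expression values already sorted in descending order) is a simplified illustration of that idea, not the paper's exact fitting procedure.

```python
import math

def fit_rank_model(expr_sorted):
    """Ordinary least squares fit of log(expression) ~ a + b * log(rank + 1).

    expr_sorted: nonzero expression values sorted descending, i.e. in the
    order their genes appear in the cell sentence."""
    xs = [math.log(r + 1) for r in range(len(expr_sorted))]
    ys = [math.log(e) for e in expr_sorted]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def rank_to_expression(rank, a, b):
    """Approximate expression for the gene at a given rank (0 = highest)."""
    return math.exp(a + b * math.log(rank + 1))
```

Because only the fitted curve is kept, two cells with the same gene ordering but different magnitudes map back to similar profiles, which is exactly the lossy compression noted above.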
Levine, D., et al. (2024). Cell2Sentence: Teaching Large Language Models the Language of Biology. bioRxiv.
DOI: 10.1101/2023.09.11.557287