CellTosg2Sequence is a single-cell large language model that grounds a general-purpose text backbone in structured biomedical knowledge for transcriptomic analysis. It targets a recurring weakness of "cell sentence" approaches, which serialize a cell's gene expression profile into an ordered token sequence that a language model can read: the resulting model has no explicit notion of how genes, pathways, and cell ontologies relate to one another. By injecting that relational structure directly into the model, CellTosg2Sequence aims to make single-cell predictions both more accurate and more biologically interpretable.
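
To make the cell-sentence idea concrete, here is a minimal sketch of the serialization step. It assumes the rank-by-expression ordering commonly used by cell-sentence methods; the exact ordering, truncation length, and tokenization used by CellTosg2Sequence are not stated here, and the function name `cell_to_sentence`, the `top_k` parameter, and the toy counts are hypothetical.

```python
from typing import Dict


def cell_to_sentence(expression: Dict[str, float], top_k: int = 5) -> str:
    """Serialize one cell as gene symbols ordered by decreasing expression.

    Illustrative only: the ordering rule and top_k cutoff are assumptions.
    """
    # Keep only expressed genes; zero counts carry no rank information.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by expression (descending); break ties alphabetically for determinism.
    ranked = sorted(expressed, key=lambda item: (-item[1], item[0]))
    return " ".join(gene for gene, _ in ranked[:top_k])


# Toy expression profile (made-up counts, not real data).
cell = {"CD3E": 12.0, "MS4A1": 0.0, "CD8A": 7.0, "GZMB": 7.0, "ACTB": 30.0}
sentence = cell_to_sentence(cell, top_k=4)
print(sentence)  # ACTB CD3E CD8A GZMB
```

The resulting string is plain text, which is why a general-purpose language model can read it without any architectural change; it also shows why gene-to-gene relationships are absent unless they are supplied separately.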

The central idea is to encode a curated 62,507-node biomedical knowledge graph into a small set of compact "virtual tokens" that are prepended to each cell sentence, conditioning the language model on biological structure before it ever sees the expression data. The name plays on Cell2Sentence, a separate cell-sentence framework from a different group, and the model is a successor to the same lab's earlier OmniCellTOSG / CellTOSG-FM work on text-omics-signaling graphs. It was developed by the Fuhai Li lab at Washington University in St. Louis and released as a bioRxiv preprint on June 22, 2026 under a CC BY license.
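
The prepending step can be sketched as follows. This is a rough illustration under stated assumptions, not the released method: chunked mean pooling stands in for the graph encoder, whose actual design is not described here, and the names `pool_to_virtual_tokens` and `prepend_virtual_tokens` as well as all toy numbers are hypothetical.

```python
from typing import List

Vector = List[float]


def pool_to_virtual_tokens(node_embeddings: List[Vector], num_tokens: int) -> List[Vector]:
    """Compress node embeddings into num_tokens vectors by chunked mean pooling.

    Stand-in for a graph encoder; assumes the node count divides evenly.
    """
    if len(node_embeddings) % num_tokens != 0:
        raise ValueError("node count must be divisible by num_tokens in this sketch")
    chunk = len(node_embeddings) // num_tokens
    dim = len(node_embeddings[0])
    tokens = []
    for start in range(0, len(node_embeddings), chunk):
        group = node_embeddings[start:start + chunk]
        # Average each embedding dimension over the nodes in this chunk.
        tokens.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return tokens


def prepend_virtual_tokens(virtual: List[Vector], cell_tokens: List[Vector]) -> List[Vector]:
    """Place the graph-derived vectors in front of the cell-sentence embeddings."""
    return virtual + cell_tokens


# Toy graph with 6 nodes (the real graph has 62,507) and 2-dimensional embeddings.
nodes = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [5.0, 5.0], [1.0, 1.0]]
virtual = pool_to_virtual_tokens(nodes, num_tokens=3)

# Toy embeddings for a 4-token cell sentence.
cell_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sequence = prepend_virtual_tokens(virtual, cell_tokens)
print(len(sequence))  # 7 positions: 3 virtual tokens followed by 4 cell tokens
```

Because the virtual tokens come first, a causal language model can attend to them from every later position in the cell sentence.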

Key Features

Knowledge-graph virtual tokens: A lightweight heterogeneous graph encoder compresses a 62,507-node biomedical knowledge graph into compact tokens that are prepended to each cell sentence, conditioning the language model on gene, pathway, and ontology relationships.
Qwen2.5-32B backbone with LoRA: Rather than training from scratch, the model adapts a 32B-parameter general-purpose LLM using low-rank adaptation, keeping training requirements lightweight relative to the backbone's size.
Three-stage training curriculum: Stage I anchors the knowledge-graph channel under autoregressive language-model pretraining; Stage II aligns labels via supervised fine-tuning combined with a knowledge-graph-anchored InfoNCE contrastive objective; Stage III applies GRPO reinforcement learning with an ontology-hierarchy reward.
Open-vocabulary cell-type prediction: The reinforcement-learning stage rewards predictions according to their position in the cell-type ontology hierarchy, enabling open-vocabulary annotation rather than classification over a fixed label set.
Single unified checkpoint: All reported results come from one model checkpoint, avoiding per-task specialization and demonstrating multi-task capability from a single set of weights.
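
The low-rank adaptation mentioned in the list above keeps each backbone weight matrix W frozen and learns a rank-r update, giving an effective weight W + (alpha / r) * B * A. The sketch below shows that arithmetic in pure Python; the matrix sizes, rank, and alpha values are illustrative assumptions, not settings reported for CellTosg2Sequence.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where A is (r x in) and B is (out x r)."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_fraction(out_features: int, in_features: int, rank: int) -> float:
    """Share of parameters trained by LoRA relative to the full matrix."""
    return rank * (out_features + in_features) / (out_features * in_features)


# Frozen 2 x 3 weight with a rank-1 adapter (toy values).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b = [[0.5], [0.0]]
effective = lora_effective_weight(w, a, b, alpha=2.0)
print(effective)  # [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0]]

# Hypothetical 4096 x 4096 projection with rank 16.
print(trainable_fraction(4096, 4096, 16))  # 0.0078125
```

In this hypothetical configuration, under 1% of one projection matrix's parameters are trained, which is the sense in which adapting a 32B-parameter backbone can stay lightweight.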

Technical Details

The architecture couples a heterogeneous graph encoder over a 62,507-node biomedical knowledge graph with a Qwen2.5-32B transformer language model fine-tuned via LoRA. Each cell is represented as a cell sentence, and the graph-derived virtual tokens are prepended so that biological structure conditions every prediction. Training proceeds in three stages: autoregressive language-model pretraining to anchor the knowledge-graph channel (Stage I), supervised fine-tuning with a knowledge-graph-anchored InfoNCE contrastive loss to align labels (Stage II), and Group Relative Policy Optimization (GRPO) with an ontology-hierarchy reward for open-vocabulary cell-type prediction (Stage III). The model is trained on single-cell corpora drawn from the Human Cell Atlas and Tahoe-100M. The authors report that the approach outperforms existing baselines while keeping training requirements lightweight, with all results produced from a single unified checkpoint; as a preprint, these benchmarks await peer review.
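
Two of the objectives named above have standard textbook forms, sketched here in pure Python. This shows the generic InfoNCE loss and the group-relative advantage used by GRPO, not the authors' exact formulation: treating the query as a cell representation and the keys as knowledge-graph label embeddings is an assumption, as are the temperature, the use of population standard deviation, and all toy values.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def info_nce(query: Vector, keys: List[Vector], positive_index: int,
             temperature: float = 0.1) -> float:
    """-log softmax of the positive key's similarity among all keys."""
    logits = [dot(query, key) / temperature for key in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled completion's reward against its own group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Assumed setup: one cell embedding scored against three candidate label embeddings.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(query, keys, positive_index=0, temperature=0.5)
loss_misaligned = info_nce(query, keys, positive_index=2, temperature=0.5)
print(loss_aligned < loss_misaligned)  # True

# Rewards for four sampled answers to the same prompt (toy values).
advantages = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
print(advantages)
```

The contrastive loss is small when the cell representation sits closest to its correct label embedding, and the group-relative advantage removes the need for a separate value model by comparing each sampled answer only with its siblings.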

Applications

CellTosg2Sequence targets single-cell transcriptomics workflows, most directly open-vocabulary cell-type annotation, where its ontology-aware reward lets it place cells within a biological hierarchy rather than forcing a choice among predefined labels. Researchers analyzing scRNA-seq data from atlases such as the Human Cell Atlas or large perturbation resources like Tahoe-100M can use the model to obtain annotations grounded in curated pathway and ontology knowledge. The knowledge-graph conditioning is intended to improve interpretability, making it well suited to computational biologists who need predictions that connect back to known gene and pathway relationships.
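
One way to picture an ontology-aware reward is a score that gives full credit for the exact cell type and partial credit for nearby terms in the hierarchy. The sketch below uses a Wu-Palmer-style similarity over a tiny hand-written tree; the reward formula, the `PARENT` table, and the term names are all hypothetical illustrations, since the actual reward is not specified here, and real cell ontologies are directed acyclic graphs rather than simple trees.

```python
from typing import Dict, List, Optional

# Toy child -> parent table (illustrative names, not real ontology identifiers).
PARENT: Dict[str, Optional[str]] = {
    "cell": None,
    "lymphocyte": "cell",
    "T cell": "lymphocyte",
    "CD8-positive T cell": "T cell",
    "B cell": "lymphocyte",
    "epithelial cell": "cell",
}


def path_to_root(term: str) -> List[str]:
    """Return [term, parent, ..., root]; its length is the term's depth."""
    path = []
    current: Optional[str] = term
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def hierarchy_reward(predicted: str, truth: str) -> float:
    """2 * depth(LCA) / (depth(predicted) + depth(truth)); 0.0 if term unknown."""
    if predicted not in PARENT:
        return 0.0  # free-text answer that maps to no known term
    predicted_path = path_to_root(predicted)
    truth_path = path_to_root(truth)
    truth_terms = set(truth_path)
    # The first shared term walking up from the prediction is the lowest common ancestor.
    lca = next(term for term in predicted_path if term in truth_terms)
    return 2 * len(path_to_root(lca)) / (len(predicted_path) + len(truth_path))


truth = "CD8-positive T cell"
for guess in ["CD8-positive T cell", "T cell", "B cell", "epithelial cell", "neuron"]:
    print(guess, round(hierarchy_reward(guess, truth), 3))
```

Under this scoring, answering with the correct parent type earns more than answering with a sibling lineage, which in turn earns more than an unrelated type, so a policy is not penalized equally for every non-exact answer.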

Impact

CellTosg2Sequence is a recent contribution to a fast-moving area at the intersection of large language models and single-cell biology, where cell-sentence methods (such as Cell2Sentence and ChatCell) and graph-augmented foundation models are converging. Its distinctive bet is that injecting a curated biomedical knowledge graph as virtual tokens, then refining behavior with ontology-aware reinforcement learning, can deliver open-vocabulary annotation from a single checkpoint that is lightweight to train and built on an off-the-shelf 32B LLM. As a June 2026 preprint with no released code repository or public model weights at the time of cataloging, its real-world adoption remains to be established, and its reported gains over baselines have not yet been independently validated.

