A T5-based conversational framework that converts scRNA-seq data into cell sentences, enabling cell type annotation, pseudo-cell generation, and drug sensitivity prediction via natural language.
ChatCell is a multimodal framework developed by ZJUNLP that bridges single-cell RNA sequencing (scRNA-seq) analysis and natural language processing, allowing researchers to interact with gene expression data through a conversational interface. Rather than requiring a bespoke bioinformatics pipeline for each analysis task, ChatCell accepts plain-language prompts and returns task-specific answers such as a cell type label or a predicted drug response. This is a significant departure from the command-line workflows that have historically limited single-cell genomics to users with programming expertise.
The central innovation is the Cell2Sentence technique, which converts raw gene expression matrices into ordered sequences of gene tokens — "cell sentences" — that encode a cell's transcriptional identity in a form that a language model can process. This representation is paired with a vocabulary adaptation strategy that enriches the base model with single-cell biology terminology, grounding the linguistic representations in domain-specific meaning. The result is a system that can execute diverse analytical tasks within a single unified sequence-to-sequence framework, introduced in a preprint released in February 2024.
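The conversion described above can be illustrated with a minimal sketch. It assumes that genes are ranked by descending expression, that unexpressed genes are dropped, and that the sentence may be truncated to a fixed number of genes; the function name `cell_to_sentence`, the `top_k` parameter, and the tie-breaking rule are hypothetical choices for illustration, not taken from the ChatCell code base.

```python
from typing import Optional, Sequence


def cell_to_sentence(
    gene_names: Sequence[str],
    counts: Sequence[float],
    top_k: Optional[int] = None,
) -> str:
    """Turn one cell's expression vector into a space-separated gene sentence.

    Assumptions (illustrative, not from the preprint): zero-count genes are
    dropped, ties are broken alphabetically, and top_k optionally truncates.
    """
    if len(gene_names) != len(counts):
        raise ValueError("gene_names and counts must have the same length")

    # Keep only expressed genes.
    expressed = [(gene, value) for gene, value in zip(gene_names, counts) if value > 0]

    # Rank by descending expression; alphabetical order makes ties deterministic.
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))

    if top_k is not None:
        expressed = expressed[:top_k]

    return " ".join(gene for gene, _ in expressed)


genes = ["CD3E", "MS4A1", "GAPDH", "CD19", "ACTB"]
counts = [0, 12, 30, 5, 30]
sentence = cell_to_sentence(genes, counts)
print(sentence)  # ACTB GAPDH MS4A1 CD19
```

In practice the input would be a row of a normalized expression matrix rather than a hand-written list, and the normalization applied before ranking would affect the resulting order.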
ChatCell is built on the T5 (Text-to-Text Transfer Transformer) architecture, an encoder-decoder model that frames all tasks as sequence-to-sequence problems. This design is well-suited to the multi-task setting: the same model can produce a cell type label, a synthesized gene expression profile, or a drug response prediction depending on the prompt template provided at inference time.
Training proceeds in two stages. The first stage is vocabulary adaptation, in which the model is pre-trained to incorporate single-cell lexicon so that gene identifiers map to meaningful token embeddings. The second stage is instruction fine-tuning on the ChatCell-Instructions dataset, which combines scRNA-seq profiles with natural language task descriptions and expected outputs. Input data is first preprocessed from standard scRNA-seq formats into cell sentence representations, then combined with structured prompt templates before being passed to the encoder. The model is available in three parameter scales on HuggingFace; exact parameter counts for each variant are not disclosed in the preprint.
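The step of combining a cell sentence with a prompt template might look like the following sketch. The template wording, the task keys, and the `build_prompt` helper are all hypothetical; the actual templates in the ChatCell-Instructions dataset may differ.

```python
# Hypothetical prompt templates, one per task family described in the text.
TEMPLATES = {
    "annotation": "Identify the cell type of the following cell: {cell_sentence}",
    "generation": "Generate a cell sentence for a {cell_type} cell.",
    "drug_sensitivity": (
        "Predict the response to {drug} for the following cell: {cell_sentence}"
    ),
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill the template for `task` with the given fields.

    Raises ValueError for an unknown task and KeyError if a field required
    by the template is missing.
    """
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return TEMPLATES[task].format(**fields)


prompt = build_prompt("annotation", cell_sentence="ACTB GAPDH MS4A1 CD19")
print(prompt)
```

The resulting string is what would be tokenized and passed to the encoder; the decoder's text output is then interpreted according to the task, for example as a cell type label.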
The preprint reports benchmark results for each constituent task, cell type annotation in particular, against standard baselines. The large variant outperforms the smaller configurations on classification accuracy and generates more biologically plausible pseudo-cells.
ChatCell is designed for researchers working with single-cell transcriptomics who want to reduce reliance on bespoke computational pipelines. Cell type annotation is the most immediate application: users can provide a cell sentence and ask the model to classify the cell, making it useful for annotating large atlases or characterizing novel cell populations. The pseudo-cell generation capability supports data augmentation for downstream machine learning tasks and enables exploration of transcriptional programs associated with specific cell types. Drug sensitivity prediction extends the framework into translational contexts, where understanding how individual cell populations respond to a compound is relevant to both drug discovery and personalized oncology. Because the interface is conversational, interdisciplinary teams — including experimental biologists without deep bioinformatics training — can perform exploratory analyses without writing custom code.
ChatCell contributes to a growing body of work applying large language models to genomics data by demonstrating that the cell sentence representation is an effective bridge between continuous expression measurements and discrete language model inputs. The framework's multi-task design, which handles annotation, generation, and prediction within a single model, distinguishes it from domain-specific tools that address only one task at a time. As of early 2024, the work exists as a preprint and has not yet undergone peer review, so its benchmark comparisons should be interpreted with appropriate caution. A key limitation is that the Cell2Sentence representation discards quantitative expression magnitude in favor of ranked gene ordering, which may lose information relevant to graded biological processes. The approach also inherits the scaling constraints of T5, and performance on very large or highly heterogeneous atlases remains to be established in the literature.
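The loss of expression magnitude can be seen in a small example. The sketch below assumes a plain rank-by-expression encoding; the gene names and values are made up for illustration, and `rank_genes` is a hypothetical helper rather than part of ChatCell.

```python
from typing import Dict, List


def rank_genes(expression: Dict[str, float]) -> List[str]:
    """Return expressed genes ordered by descending expression (ties by name)."""
    ordered = sorted(expression.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, value in ordered if value > 0]


# Two made-up cells with very different expression magnitudes.
cell_a = {"GENE_A": 100.0, "GENE_B": 10.0, "GENE_C": 1.0}
cell_b = {"GENE_A": 3.0, "GENE_B": 2.0, "GENE_C": 1.0}

ranking_a = rank_genes(cell_a)
ranking_b = rank_genes(cell_b)

# Both cells collapse to the same gene order, so a model that sees only the
# ranking cannot tell a steep expression gradient from a shallow one.
print(ranking_a == ranking_b)  # True
```

Whether this matters depends on the task: a categorical label such as cell type may be well captured by gene order, while graded processes could depend on the magnitudes that the ranking removes.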
Fang, Y., Liu, K., Zhang, N., Deng, X., Yang, P., Chen, Z., Tang, X., Gerstein, M., Fan, X., & Chen, H. (2024). ChatCell: Facilitating Single-Cell Analysis with Natural Language. arXiv preprint arXiv:2402.08303.
DOI: 10.48550/arXiv.2402.08303