ChatCell

Conversational T5-based framework that turns scRNA-seq data into cell sentences for cell type annotation and drug sensitivity prediction.

Released: February 2024

ChatCell is a multimodal framework developed by ZJUNLP that bridges single-cell RNA sequencing (scRNA-seq) analysis and natural language processing, allowing researchers to query gene expression data through a conversational interface. Rather than requiring a bespoke bioinformatics pipeline for each analysis task, ChatCell accepts plain-language prompts and returns answers as text, in contrast to the scripted, command-line workflows that single-cell analysis has traditionally required.

At the core of ChatCell is the Cell2Sentence technique, which converts gene expression matrices into sequences of gene names ordered by expression level. These "cell sentences" encode a cell's transcriptional identity in a form that a language model can process. This representation is paired with a vocabulary adaptation strategy that adds single-cell biology terminology to the base model's vocabulary, so that gene names and cell type labels carry domain-specific meaning. The result is a system that handles diverse analytical tasks within a single sequence-to-sequence framework, introduced in a preprint released in February 2024.
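The conversion from expression values to a cell sentence can be illustrated with a minimal sketch. It assumes a single cell is given as a gene-to-count mapping and that the sentence is simply the expressed genes sorted by descending count. The function name `cell_to_sentence`, the `top_k` cutoff, and the alphabetical tie-break are illustrative assumptions, not the preprocessing code shipped with ChatCell.

```python
from typing import Dict


def cell_to_sentence(expression: Dict[str, float], top_k: int = 100) -> str:
    """Turn one cell's gene -> count mapping into a rank-ordered cell sentence."""
    # Drop genes with zero counts: they carry no ranking information.
    expressed = [(gene, count) for gene, count in expression.items() if count > 0]
    # Sort by descending count; break ties alphabetically so output is deterministic.
    ranked = sorted(expressed, key=lambda gc: (-gc[1], gc[0]))
    # Keep only the top_k most highly expressed genes and join them with spaces.
    return " ".join(gene for gene, _ in ranked[:top_k])


# Toy cell with made-up counts (illustrative values only).
cell = {"CD3E": 12.0, "MS4A1": 0.0, "CD3D": 30.0, "IL7R": 7.0, "GAPDH": 30.0}
sentence = cell_to_sentence(cell, top_k=4)
print(sentence)  # CD3D GAPDH CD3E IL7R
```

The output is plain text, which is what allows a text-to-text model to consume it without any change to its input layer.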

Key Features

Cell2Sentence encoding: Converts scRNA-seq count matrices into structured gene token sequences, translating the continuous space of gene expression into discrete cell sentences that a text model can natively process.
Vocabulary adaptation: Extends the T5 vocabulary with single-cell biology terms prior to instruction fine-tuning, ensuring that gene names and cell type labels are treated as first-class tokens rather than decomposed into subword units.
Multi-task conversational interface: Handles cell type annotation, pseudo-cell generation, random cell sentence generation, and drug sensitivity prediction through a single model and a unified natural language prompt format.
Instruction-tuned training: Fine-tuned on the ChatCell-Instructions dataset, derived from three public scRNA-seq datasets (SHARE-seq mouse skin, GSE117872, and GSE149383), giving the model exposure to diverse tissue contexts and cell populations.
Multiple model scales: Released in small, base, and large configurations on HuggingFace, allowing practitioners to select the variant that best matches their computational budget and accuracy requirements.
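The effect of vocabulary adaptation on gene symbols can be illustrated with a toy tokenizer. The sketch below uses greedy longest-match segmentation over a tiny hand-written vocabulary. This is an assumption made for illustration: it is not the SentencePiece tokenizer that T5 uses, and the vocabularies are invented. It only shows why adding gene symbols as whole tokens stops them from being split into fragments.

```python
from typing import List, Set


def tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unmatched characters become 1-char tokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a vocab hit.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Toy stand-in for a general-purpose subword vocabulary (hypothetical).
base_vocab = {"CD", "3", "E", "MS", "4", "A", "1"}
print(tokenize("CD3E", base_vocab))     # ['CD', '3', 'E']

# After adaptation: gene symbols are present as whole tokens.
adapted_vocab = base_vocab | {"CD3E", "MS4A1"}
print(tokenize("CD3E", adapted_vocab))  # ['CD3E']
```

In a real model, each added token also needs a new embedding row, which is why the source describes adaptation as a training stage rather than a lookup-table change.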

Technical Details

ChatCell is built on the T5 (Text-to-Text Transfer Transformer) architecture, an encoder-decoder model that frames all tasks as sequence-to-sequence problems. This design is well-suited to the multi-task setting: the same model can produce a cell type label, a synthesized gene expression profile, or a drug response prediction depending on the prompt template provided at inference time.

Training proceeds in two stages. The first stage is vocabulary adaptation, in which the model is further pre-trained to incorporate a single-cell lexicon so that gene identifiers map to meaningful token embeddings. The second stage is instruction fine-tuning on the ChatCell-Instructions dataset, which pairs scRNA-seq profiles with natural language task descriptions and expected outputs. Input data are first converted from standard scRNA-seq formats into cell sentences, then combined with structured prompt templates before being passed to the encoder. The model is available in three parameter scales on HuggingFace; exact parameter counts for each variant are not disclosed in the preprint.
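How a cell sentence and a prompt template combine into a sequence-to-sequence training pair can be sketched as follows. The template wording, the task keys, and the `InstructionExample` type are hypothetical and were chosen for illustration; the actual phrasing in the ChatCell-Instructions dataset may differ.

```python
from typing import Dict, NamedTuple


class InstructionExample(NamedTuple):
    source: str  # text fed to the encoder
    target: str  # text the decoder is trained to produce


# Hypothetical templates, one per task; the real dataset wording may differ.
TEMPLATES: Dict[str, str] = {
    "annotation": "Identify the cell type of this cell: {cell_sentence}",
    "drug_sensitivity": "Predict the drug response of this cell: {cell_sentence}",
    "pseudo_cell": "Generate a cell sentence for a {cell_type}.",
}


def build_example(task: str, target: str, **fields: str) -> InstructionExample:
    """Fill the task's template and pair it with the expected output text."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    return InstructionExample(TEMPLATES[task].format(**fields), target)


# The same cell sentence serves as input for annotation and as output for generation.
annotate = build_example("annotation", "T cell", cell_sentence="CD3D CD3E IL7R")
generate = build_example("pseudo_cell", "CD3D CD3E IL7R", cell_type="T cell")
print(annotate.source)
print(generate.source)
```

Because every task reduces to a (source text, target text) pair, a single encoder-decoder model can be trained on all of them without task-specific heads.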

Benchmark performance on the constituent tasks — particularly cell type annotation — is reported in the preprint against standard baselines, with the large variant outperforming the smaller configurations on classification accuracy and generating more biologically plausible pseudo-cells.

Applications

ChatCell is designed for researchers working with single-cell transcriptomics who want to reduce reliance on bespoke computational pipelines. Cell type annotation is the most immediate application: users can provide a cell sentence and ask the model to classify the cell, making it useful for annotating large atlases or characterizing novel cell populations. The pseudo-cell generation capability supports data augmentation for downstream machine learning tasks and enables exploration of transcriptional programs associated with specific cell types. Drug sensitivity prediction extends the framework into translational contexts, where understanding how individual cell populations respond to a compound is relevant to both drug discovery and personalized oncology. Because the interface is conversational, interdisciplinary teams — including experimental biologists without deep bioinformatics training — can perform exploratory analyses without writing custom code.

Impact

ChatCell contributes to a growing body of work applying large language models to genomics data by demonstrating that the cell sentence representation can bridge continuous expression measurements and discrete language model inputs. The framework's multi-task design, which handles annotation, generation, and prediction within a single model, distinguishes it from tools that address only one task at a time. As of early 2024, the work exists as a preprint and has not yet undergone peer review, so its benchmark comparisons should be interpreted with caution. A key limitation is that the Cell2Sentence representation discards quantitative expression magnitude in favor of ranked gene ordering, which may lose information relevant to graded biological processes. The approach also inherits the scaling constraints of T5, and performance on very large or highly heterogeneous atlases remains to be established.
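The information loss from rank ordering can be made concrete with a small example. The sketch below assumes a cell is a gene-to-value mapping and that the representation keeps only the order of genes. The gene names and values are invented for illustration.

```python
from typing import Dict, List


def rank_order(expression: Dict[str, float]) -> List[str]:
    """Return gene names sorted by descending expression (ties broken by name)."""
    return [g for g, _ in sorted(expression.items(), key=lambda kv: (-kv[1], kv[0]))]


# Two hypothetical cells: identical ranking, very different magnitudes.
cell_a = {"GENE1": 100.0, "GENE2": 10.0, "GENE3": 1.0}
cell_b = {"GENE1": 3.0, "GENE2": 2.0, "GENE3": 1.0}

print(rank_order(cell_a) == rank_order(cell_b))  # True
```

Both cells map to the same sequence even though GENE1 is tenfold above GENE2 in one and only 1.5-fold in the other, which is the kind of graded difference a rank-only encoding cannot express.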

Citation

ChatCell: Facilitating Single-Cell Analysis with Natural Language

Preprint

Fang, Y., Liu, K., Zhang, N., Deng, X., Yang, P., Chen, Z., Tang, X., Gerstein, M., Fan, X., & Chen, H. (2024). ChatCell: Facilitating Single-Cell Analysis with Natural Language. arXiv preprint arXiv:2402.08303.

DOI: 10.48550/arXiv.2402.08303

Recent citations

Papers that recently cited this model.

Transformative advances in single-cell omics: a comprehensive review of foundation models, multimodal integration and computational ecosystems
T. Yiu, Bin Chen, Haoyu Wang, et al.
Journal of Translational Medicine · Oct 2025 · Citations: 14

LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, et al.
arXiv.org · Oct 2025 · Citations: 2

SCassist: An AI Based Workflow Assistant for Single-Cell Analysis
Vijayaraj Nagarajan, Guangpu Shi, Samyuktha Arunkumar, et al.
bioRxiv · Apr 2025 · Citations: 4

Top citations

The most-cited papers that cite this model.

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Qizhi Pei, Lijun Wu, Kaiyuan Gao, et al.
arXiv.org · Mar 2024 · Citations: 27

Transformative advances in single-cell omics: a comprehensive review of foundation models, multimodal integration and computational ecosystems
T. Yiu, Bin Chen, Haoyu Wang, et al.
Journal of Translational Medicine · Oct 2025 · Citations: 14

scRAG: Hybrid Retrieval-Augmented Generation for LLM-based Cross-Tissue Single-Cell Annotation
Zhiyin Yu, C. Zheng, Chong Chen, et al.
Annual Meeting of the Association for Computational Linguistics · 2025 · Citations: 5 · Influential

SCassist: An AI Based Workflow Assistant for Single-Cell Analysis
Vijayaraj Nagarajan, Guangpu Shi, Samyuktha Arunkumar, et al.
bioRxiv · Apr 2025 · Citations: 4

dnaGrinder: a lightweight and high-capacity genomic foundation model
Qihang Zhao, Chi Zhang, Weixiong Zhang
arXiv.org · Sep 2024 · Citations: 4

Citations

Total Citations: 6
Influential: 1
References: 51

GitHub

Stars: 51
Forks: 8
Open Issues: 0
Contributors: 3
Last Push: 1y ago
Language: Python

HuggingFace

Downloads: 9
Likes: 5
Last Modified: 2y ago
Pipeline: text-generation

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 50%
Chemistry: 17%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Overall score: 19 (Closed)
Usability (can I run it?): 21
Reproducibility (can I retrain it?): 12
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset
