Overview

GPTCelltype is an open-source R package developed by Wenpin Hou (Columbia University) and Zhicheng Ji (Duke University) that automates cell type annotation in single-cell RNA sequencing (scRNA-seq) analysis by querying GPT-4 with marker gene information. Published in Nature Methods in March 2024, the work shows that a large language model trained on general text corpora can assign biologically meaningful cell type labels to transcriptomic clusters, a task that has historically required expert manual curation or curated reference datasets.

Cell type annotation is a bottleneck in virtually every scRNA-seq study. After clustering cells by their transcriptional profiles, researchers must inspect the top differentially expressed genes in each cluster and match them to known markers. This process demands domain knowledge spanning diverse tissues and species, is time-consuming, and introduces subjectivity when different experts interpret ambiguous marker combinations differently. GPTCelltype reframes this as a natural language task: it formats marker genes and tissue context into a structured prompt, submits it to the GPT-4 API, and returns a cell type label that can be directly incorporated into the analysis.

Hou and Ji systematically benchmarked GPTCelltype across ten published scRNA-seq datasets spanning five species (human and mouse, plus non-model mammals, birds, and reptiles) and hundreds of distinct tissue and cell types, including both normal and cancer samples. GPT-4's annotations fully or partially matched expert manual labels in more than 75% of cell types across most datasets and tissues. That level of concordance is competitive with, and in some cases exceeds, dedicated automated annotation tools that require organism-specific reference atlases.

Key Features

Reference-free annotation: GPTCelltype requires no pre-built reference atlas or labeled training data, making it immediately applicable to any tissue, species, or disease context, including organisms with limited published single-cell data.
Seurat integration: The package fits into standard single-cell workflows by accepting the output of Seurat's FindAllMarkers() function as input, requiring minimal changes to existing analysis pipelines. Markers produced by other tools, such as Scanpy, can be supplied as a custom named list of genes.
Optimal marker gene selection: Systematic benchmarking identified the top 10 differentially expressed genes (ranked by two-sided Wilcoxon test p-value) as the optimal input — providing sufficient signal without overwhelming the prompt with noise.
Flexible prompt strategies: Three prompt modes — a basic prompt, a chain-of-thought-inspired prompt that includes reasoning steps, and a repeated-query prompt with majority voting — all yield comparable performance, with the basic prompt recommended for cost efficiency.
API key optional: If no OpenAI API key is provided, GPTCelltype outputs the formatted prompt itself, allowing users to paste it into the ChatGPT web interface — a useful fallback for users without programmatic API access.
Low cost and high throughput: Total API costs for all ten benchmarking studies amounted to less than $0.10, with cost scaling linearly with the number of clusters, making the approach economically accessible even for large atlasing projects.
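
The marker selection and majority-voting ideas in the list above can be sketched in a few lines. The Python below is an illustrative sketch only, not the package's R implementation: the row keys (cluster, gene, p_val), the function names, and the example values are all assumptions made for the illustration.

```python
from collections import Counter, defaultdict


def top_markers(rows, n=10):
    """Return {cluster: [gene, ...]} holding the n smallest-p-value genes per cluster.

    `rows` is an iterable of dicts with the hypothetical keys
    'cluster', 'gene', and 'p_val' (one row per gene per cluster).
    Ties in p-value are broken alphabetically by gene name.
    """
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append((row["p_val"], row["gene"]))
    return {
        cluster: [gene for _, gene in sorted(pairs)[:n]]
        for cluster, pairs in by_cluster.items()
    }


def majority_vote(labels):
    """Return the most frequent label from repeated queries.

    On a tie, the label that appeared first wins.
    """
    return Counter(labels).most_common(1)[0][0]


# Made-up differential expression rows for two clusters.
rows = [
    {"cluster": "0", "gene": "CD3D", "p_val": 1e-50},
    {"cluster": "0", "gene": "CD3E", "p_val": 1e-40},
    {"cluster": "0", "gene": "IL7R", "p_val": 1e-30},
    {"cluster": "1", "gene": "MS4A1", "p_val": 1e-45},
    {"cluster": "1", "gene": "CD79A", "p_val": 1e-60},
]

markers = top_markers(rows, n=2)
vote = majority_vote(["T cell", "T cell", "NK cell"])
print(markers)
print(vote)
```

The default of n=10 mirrors the top 10 genes reported as the best-performing input size; the example uses n=2 only to keep the made-up data short.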

Technical Details

GPTCelltype does not train or fine-tune a model; it uses GPT-4 (specifically the June 13, 2023 API version used in the study) as a zero-shot annotator via the OpenAI API. The core function, gptcelltype(), takes either a Seurat differential expression result or a custom named list of marker genes as input, optionally paired with a tissue name that is included in the prompt to narrow GPT-4's prediction space. The prompt sent to the API follows the pattern: "Identify cell types of [TissueName] cells using the following markers separately for each row. Only provide the cell type name."
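
As a rough illustration of the prompt pattern quoted above, the following Python sketch assembles one prompt row per cluster and falls back to returning the prompt text when no API key is supplied. It is not the package's code: the function names, the api_key and query_model parameters, the handling of a missing tissue name, and the assumption that the model replies with one cell type per line are all hypothetical.

```python
def build_prompt(markers, tissue=None):
    """Format {cluster: [genes]} into a prompt with one marker row per cluster."""
    # Assumption: when no tissue is given, the tissue name is simply left out.
    subject = f"{tissue} cells" if tissue else "cells"
    header = (
        f"Identify cell types of {subject} using the following markers "
        "separately for each row. Only provide the cell type name."
    )
    rows = [", ".join(genes) for genes in markers.values()]
    return "\n".join([header, *rows])


def annotate(markers, tissue=None, api_key=None, query_model=None):
    """Return {cluster: label}, or the prompt text itself when no key is given.

    `query_model` is a hypothetical callable (prompt, api_key) -> reply text
    that stands in for a real API client.
    """
    prompt = build_prompt(markers, tissue)
    if api_key is None or query_model is None:
        # Fallback: hand back the prompt so it can be pasted into a chat UI.
        return prompt
    reply = query_model(prompt, api_key)
    names = [line.strip() for line in reply.splitlines() if line.strip()]
    # Assumption: the reply lists one cell type per line, in cluster order.
    return dict(zip(markers, names))


markers = {"0": ["CD3D", "CD3E"], "1": ["CD79A", "MS4A1"]}

# Without a key, the formatted prompt is returned unchanged.
prompt = annotate(markers, tissue="human PBMC")
print(prompt)


# With a key, a stand-in model shows the parsing step without any network call.
def fake_model(prompt_text, key):
    return "T cell\nB cell"


labels = annotate(markers, "human PBMC", api_key="not-a-real-key", query_model=fake_model)
print(labels)
```

Passing the model call in as a function keeps the sketch runnable offline; in practice that callable would wrap whichever API client is in use.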

The package depends on the R openai library and requires R version 3.5 or higher.

Across the ten benchmark datasets, GPTCelltype achieved greater than 75% full or partial concordance with manual annotations in most datasets and tissues. When asked to discriminate whether a cluster represented a single cell type versus a mixture of types, accuracy was 93%. When asked to distinguish known from unknown (out-of-distribution) cell populations, accuracy reached 99%. Reproducibility across repeated independent API queries was 85%, with a Cohen's kappa of 0.65 (substantial agreement) when comparing annotation consistency across model versions. Performance was robust to 50% random subsampling of marker genes and showed only slight degradation when 75% of input genes were replaced with noise, though accuracy dropped for clusters containing fewer than 10 cells, where differential expression signals are weaker.
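
For readers unfamiliar with the two consistency measures mentioned above, the Python sketch below computes a raw agreement rate and Cohen's kappa for two sets of labels over the same clusters. It uses the standard textbook definition of kappa and is not the study's evaluation code; the labels are invented for the example and do not come from the benchmark.

```python
from collections import Counter


def agreement_rate(a, b):
    """Fraction of positions where the two label lists are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: both runs independently picking the same label.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1:
        # Both runs used a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Invented annotations of the same four clusters from two runs.
run_1 = ["T cell", "T cell", "B cell", "B cell"]
run_2 = ["T cell", "T cell", "B cell", "T cell"]

rate = agreement_rate(run_1, run_2)
kappa = cohens_kappa(run_1, run_2)
print(rate, kappa)
```

In this toy case the runs agree on three of four clusters (0.75), while chance agreement is 0.5, giving a kappa of 0.5. Kappa is lower than raw agreement because it discounts matches that would be expected by chance.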

Applications

GPTCelltype is well-suited to any scRNA-seq study where manual annotation would otherwise require consulting multiple tissue-specific references or engaging domain experts outside the immediate research team. It is particularly valuable for multi-tissue atlas projects, cross-species comparative studies, and cancer datasets where cell type identities may not map cleanly onto healthy-tissue reference atlases. By returning granular labels — for example, distinguishing fibroblasts, osteoblasts, and chondrocytes where a human annotator might assign a generic "stromal cell" label — GPTCelltype can surface biologically meaningful heterogeneity that coarser annotation frameworks miss. Researchers use it as a rapid first-pass annotation step, with human review recommended before downstream analysis, especially in novel or undercharacterized tissue contexts where AI hallucination risk is higher.

Impact

GPTCelltype attracted significant attention as one of the first systematic demonstrations that a general-purpose large language model could perform a specialized computational biology task at expert level without domain-specific fine-tuning. The Nature Methods paper accumulated hundreds of citations following its publication in March 2024, and the approach has influenced a broader discussion about the role of LLMs in bioinformatics annotation tasks. The work also spurred follow-on tools and comparisons: a subsequent benchmarking study in Briefings in Bioinformatics evaluated multiple LLMs for cell type annotation, situating GPTCelltype as a key reference baseline. A notable limitation is that GPT-4's training corpus is undisclosed, making it difficult to determine the basis for specific annotations or to audit for biases in underrepresented tissue types. The risk of confident but incorrect annotations — AI hallucination — means that expert review remains essential, particularly for rare or ambiguous cell populations.

Citation

Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis
