RVQ-Alpha

Single-cell foundation model that tokenizes scRNA-seq into 10 tokens in a Qwen3-4B vocabulary for cell type annotation and perturbation prediction.

Released: April 2026

Parameters: 4 Billion

RVQ-Alpha is a framework for connecting single-cell transcriptomics to large language models (LLMs) by representing each cell as a short sequence of discrete tokens that live directly inside the LLM vocabulary. Single-cell RNA sequencing (scRNA-seq) produces continuous, high-dimensional expression profiles that do not map cleanly onto the discrete token streams LLMs expect. Prior LLM-for-biology approaches either serialize cells into long text "sentences" of ranked gene names or attach continuous embeddings through a separate encoder, both of which inflate sequence length and leave the model prone to hallucinating biological claims that are not grounded in the underlying measurements.

RVQ-Alpha addresses this with a residual vector quantization (RVQ) tokenizer that compresses each cell into a fixed 10-token sequence, embedding the new cell tokens natively in the vocabulary of a Qwen3-4B base model. A single autoregressive model can then interpret existing cell states (for example, annotating cell type) and generate new ones (for example, predicting a post-perturbation profile), with the generated tokens decoded back into expression space by the RVQ decoder.
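The residual quantization step described above can be illustrated with a minimal NumPy sketch. It is not the authors' tokenizer: the latent dimension, the random codebooks, and the function names `rvq_encode` and `rvq_decode` are assumptions made for illustration, and the sketch yields one index per codebook without covering how those indices are arranged into the final 10-token sequence.

```python
import numpy as np

NUM_CODEBOOKS = 8   # from the text: eight residual codebooks
CODEBOOK_SIZE = 32  # from the text: 32 entries per codebook
LATENT_DIM = 16     # assumption: the latent size is not given in the source


def rvq_encode(latent, codebooks):
    """Return one code index per codebook by quantizing successive residuals."""
    residual = latent.copy()
    codes = []
    for codebook in codebooks:  # each codebook has shape (CODEBOOK_SIZE, LATENT_DIM)
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # The next codebook only sees what this one failed to explain.
        residual = residual - codebook[index]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the latent as the sum of the selected codebook entries."""
    return sum(codebook[index] for codebook, index in zip(codebooks, codes))


rng = np.random.default_rng(0)
# Random codebooks with shrinking scale stand in for trained ones (assumption).
codebooks = [
    rng.normal(scale=0.5 ** level, size=(CODEBOOK_SIZE, LATENT_DIM))
    for level in range(NUM_CODEBOOKS)
]
latent = rng.normal(size=LATENT_DIM)  # hypothetical encoder output for one cell

codes = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(latent - reconstruction)))
```

Because each codebook quantizes only the residual left by the previous ones, early indices carry the coarse structure and later indices add finer corrections, which matches the coarse-to-fine behavior the text attributes to the codebooks.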

The model was developed by researchers at Guangzhou National Laboratory and posted to bioRxiv in April 2026. The resulting model is a single Qwen3-4B checkpoint produced by continued pretraining, supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR) on scRNA-seq data.

Key Features

Compact discrete cell tokenization: An eight-codebook residual quantizer (32 entries per codebook) encodes each cell in just 10 tokens, roughly 3.4x fewer than prior discrete tokenization methods. Earlier codebooks capture broad cell identity, and later codebooks refine within-lineage variation.
Vocabulary-native cell tokens: Cell tokens are embedded directly in the LLM vocabulary rather than passed through a separate encoder, letting one autoregressive model both interpret and generate cell states.
Evidence-first reasoning (scCoT-Synth): A teacher-student engine grounds the newly added biological tokens through "evidence-before-conclusion" chain-of-thought, reducing unsupported claims.
Fact-Aware RLVR: A verifiable-reward reinforcement learning stage pairs an ontology-grounded answer judge with saliency-weighted verification of biological claims against the actual expression data.
Generative and discriminative in one model: The same checkpoint supports cell type annotation and autoregressive generation of post-perturbation cell states.
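To make "vocabulary-native" concrete, the sketch below maps quantizer indices to dedicated token ids appended after a base vocabulary. Everything here is hypothetical: the `<cell_L_I>` naming, the choice of one distinct token per (codebook, entry) pair, and the placeholder `base_vocab_size` are assumptions, since the source does not describe the actual token layout.

```python
NUM_CODEBOOKS = 8   # from the text
CODEBOOK_SIZE = 32  # from the text


def cell_token(level, index):
    """Hypothetical surface form for entry `index` of codebook `level`."""
    return f"<cell_{level}_{index}>"


def build_cell_vocab(base_vocab_size):
    """Assign new token ids directly after the base vocabulary (assumed layout)."""
    vocab = {}
    for level in range(NUM_CODEBOOKS):
        for index in range(CODEBOOK_SIZE):
            vocab[cell_token(level, index)] = (
                base_vocab_size + level * CODEBOOK_SIZE + index
            )
    return vocab


def codes_to_token_ids(codes, vocab):
    """Turn one index per codebook into token ids the LLM can consume."""
    return [vocab[cell_token(level, index)] for level, index in enumerate(codes)]


def token_ids_to_codes(token_ids, base_vocab_size):
    """Invert the mapping so generated tokens can be passed to an RVQ decoder."""
    codes = []
    for level, token_id in enumerate(token_ids):
        found_level, index = divmod(token_id - base_vocab_size, CODEBOOK_SIZE)
        if found_level != level:
            raise ValueError(f"token {token_id} is not valid at position {level}")
        codes.append(index)
    return codes


BASE_VOCAB_SIZE = 1000  # placeholder, not the real Qwen3-4B vocabulary size
vocab = build_cell_vocab(BASE_VOCAB_SIZE)
example_codes = [3, 0, 31, 7, 12, 5, 9, 20]
example_ids = codes_to_token_ids(example_codes, vocab)
print(len(vocab), example_ids)
```

Under this assumed layout the vocabulary grows by 8 x 32 = 256 entries, and the same ids serve both as model input (interpretation) and as model output (generation), which is the property the feature list emphasizes.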

Technical Details

RVQ-Alpha is built on the Qwen3-4B transformer (~4 billion parameters). Each cell is quantized by eight residual codebooks of 32 entries each into a fixed 10-token representation whose tokens are added to the LLM vocabulary, which shortens sequences substantially relative to text-based gene-name serialization. Training proceeds in three stages: continued pretraining to integrate the cell tokens, supervised fine-tuning on evidence-first reasoning traces generated by scCoT-Synth, and a Fact-Aware RLVR stage that combines an ontology-grounded answer judge with saliency-weighted claim verification. The model is evaluated on eight held-out, out-of-distribution (OOD) datasets, where it improves OOD generalization and rare-cell recognition. Ablation studies report that evidence-first grounding reduces hallucination more than fivefold relative to baselines.
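One way to picture how an answer judge and saliency-weighted claim verification could combine into a single reward is sketched below. This is an illustrative assumption, not the paper's reward: the partial credit for ontology ancestors, the "high"/"low" claim format, the expression threshold, and the equal weighting are all invented for the example.

```python
def answer_score(predicted, truth, parents):
    """1.0 for an exact match, 0.5 if the prediction is an ontology ancestor."""
    if predicted == truth:
        return 1.0
    node = truth
    while node in parents:  # walk up the (hypothetical) cell ontology
        node = parents[node]
        if node == predicted:
            return 0.5
    return 0.0


def claim_score(claims, expression, saliency, threshold=1.0):
    """Saliency-weighted fraction of 'gene is high/low' claims the data supports."""
    total = 0.0
    supported = 0.0
    for gene, direction in claims:
        weight = saliency.get(gene, 0.0)
        is_high = expression.get(gene, 0.0) >= threshold
        total += weight
        if (direction == "high") == is_high:
            supported += weight
    return supported / total if total > 0 else 0.0


def fact_aware_reward(predicted, truth, parents, claims, expression, saliency,
                      answer_weight=0.5):
    """Blend answer correctness with how well the cited evidence checks out."""
    return (answer_weight * answer_score(predicted, truth, parents)
            + (1.0 - answer_weight) * claim_score(claims, expression, saliency))


# Toy inputs, all hypothetical.
parents = {"CD4 T cell": "T cell", "T cell": "lymphocyte"}
expression = {"CD3E": 2.5, "CD4": 1.8, "MS4A1": 0.0}
saliency = {"CD3E": 0.5, "CD4": 0.3, "MS4A1": 0.2}
claims = [("CD3E", "high"), ("CD4", "high"), ("MS4A1", "high")]

reward = fact_aware_reward("CD4 T cell", "CD4 T cell", parents,
                           claims, expression, saliency)
print(reward)
```

In this toy case the answer is correct but one cited claim (high MS4A1) contradicts the measured expression, so the reward falls short of the maximum. A reward shaped this way penalizes correct answers that rest on unsupported evidence, which is the behavior the evidence-first training is meant to encourage.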

Applications

RVQ-Alpha targets single-cell analysis workflows that benefit from natural-language interaction, including cell type annotation on unseen datasets and prediction of post-perturbation cell states for in-silico screening. By generating reasoning that cites supporting expression evidence, it is positioned for settings where computational biologists need annotations and hypotheses that can be traced back to the underlying measurements rather than accepted as opaque outputs.

Impact

RVQ-Alpha contributes to a fast-growing line of work adapting LLMs to single-cell biology, alongside efforts such as Cell2Sentence, by showing that compact discrete tokenization plus verifiable reinforcement learning can improve out-of-distribution generalization and sharply curb hallucination. Its emphasis on evidence-grounded reasoning and ontology-based reward verification offers a template for making LLM-based cell models more trustworthy. As of the April 2026 preprint, model weights have not been publicly released, which currently limits independent reproduction and downstream adoption.

Citation

RVQ-Alpha: Bridging Single-Cell Transcriptomics and Large Language Models via Discrete Tokenization and Verifiable Reinforcement Learning

Li, G., et al. (2026) RVQ-Alpha: Bridging Single-Cell Transcriptomics and Large Language Models via Discrete Tokenization and Verifiable Reinforcement Learning. bioRxiv.

DOI: 10.64898/2026.04.20.719773

Citations

Total Citations: 8

Influential: 0

References: 0

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 4 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified (restrictive license on core components)
