Single-cell RNA-seq foundation model series pretrained on 50 million human cells with full transcriptome-scale context for cell clustering and perturbation modeling.
AIDO.Cell is a series of single-cell RNA sequencing (scRNA-seq) foundation models developed by GenBio AI as the cellular-scale component of the AIDO (AI-Driven Digital Organism) multiscale platform. Released in November 2024 and presented at the NeurIPS 2024 Workshop on AI for New Drug Modalities, AIDO.Cell addresses a fundamental limitation shared by most prior single-cell foundation models: the inability to represent the full human transcriptome in a single forward pass without truncation or gene selection. By processing all expressed genes from the human transcriptome simultaneously within a long-context transformer architecture, AIDO.Cell learns cell representations that encode the complete transcriptional state of each cell rather than an arbitrarily sampled or truncated subset of gene expression values.
Single-cell RNA-seq data presents unique modeling challenges that distinguish it from other biological sequences. Each observation is a cell represented by expression measurements across up to approximately 20,000 protein-coding genes, where the measurements are count data subject to substantial technical noise from the sequencing process itself — particularly dropout events that cause expressed genes to appear silent, and variable sequencing depth that affects the total count per cell. Prior single-cell foundation models such as Geneformer, scGPT, and scFoundation addressed these challenges through various strategies including rank-value encoding, gene masking objectives, and cell-level normalization. A common limitation, however, was that the full transcriptome was too large to process in a single forward pass at practical batch sizes, requiring truncation to the most highly expressed genes or random gene sampling — both of which discard transcriptional information that may be biologically relevant.
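The depth-variation problem described above can be made concrete with a toy counts matrix. The sketch below applies the conventional counts-per-10k and log1p preprocessing that fixed-normalization pipelines rely on (AIDO.Cell itself sidesteps a fixed scheme via its auto-discretization, described later); the numbers are synthetic illustrations, not real data.

```python
import numpy as np

# Toy counts matrix: 3 cells x 6 genes, with very different sequencing depths.
counts = np.array([
    [ 50,  0, 10,  5,  0,  35],   # shallow cell (~100 total reads)
    [500,  0, 90, 60, 10, 340],   # deep cell (~1000 total reads)
    [ 40,  5,  8,  4,  1,  42],   # shallow cell
], dtype=float)

depth = counts.sum(axis=1, keepdims=True)  # total reads per cell

# Standard library-size normalization: counts-per-10k followed by log1p.
cp10k = counts / depth * 1e4
lognorm = np.log1p(cp10k)

# After normalization every cell has the same total, so expression
# profiles become comparable across sequencing depths.
print(cp10k.sum(axis=1))  # [10000. 10000. 10000.]
```

Note that this rescaling makes cells comparable but also erases the absolute read count, which is exactly the information AIDO.Cell's read-depth-aware objective retains to disambiguate dropout from true silence.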
AIDO.Cell resolves this through a combination of efficient architecture design and a novel auto-discretization strategy that converts continuous gene expression values to discrete tokens, enabling efficient attention computation across the full ~20,000-gene transcriptome. The model series spans four sizes — 3M, 10M, 100M, and 650M parameters — allowing systematic study of scaling behavior in single-cell foundation models. The flagship 100M variant achieves state-of-the-art results across zero-shot clustering, cell-type classification, and perturbation modeling benchmarks, while the 650M variant pushes the performance envelope further at the cost of increased compute requirements. AIDO.Cell is also featured on the Chan Zuckerberg Initiative's Virtual Cells Platform, reflecting broader recognition within the single-cell biology community.
AIDO.Cell uses a BERT-style encoder-only dense transformer architecture with modifications that align with current best practices from large language model development. The feed-forward layers use SwiGLU activation functions, and the normalization uses LayerNorm applied in the pre-normalization configuration. The auto-discretization module is a learned component that maps continuous gene expression values — which in scRNA-seq data span several orders of magnitude and are subject to overdispersion and dropout — to a discrete vocabulary of expression-level tokens. This discretization approach avoids the need to specify a fixed normalization scheme prior to model training and allows the model to learn an optimal binning of expression values jointly with the pretraining objective.
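One way to picture a learned discretization of this kind is soft binning: each continuous expression value is compared against a set of trainable bin centers, and a softmax over the (negative squared) distances mixes a small table of bin embeddings. The sketch below is a hypothetical NumPy illustration of that idea, not GenBio's implementation; the bin centers and embedding table stand in for parameters that would be learned jointly with the pretraining objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned parameters (randomly initialized here for illustration).
n_bins, d_model = 16, 8
bin_centers = np.sort(rng.uniform(0.0, 10.0, size=n_bins))  # trained in practice
bin_embed = rng.normal(size=(n_bins, d_model))              # trained in practice
temperature = 1.0

def auto_discretize(expr):
    """Map a vector of expression values to soft-binned token embeddings."""
    # Squared distance from each value to each bin center: (genes, n_bins)
    d2 = (expr[:, None] - bin_centers[None, :]) ** 2
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # soft bin assignment weights
    return w @ bin_embed                          # (genes, d_model)

tokens = auto_discretize(np.array([0.0, 0.5, 3.2, 9.7]))
print(tokens.shape)  # (4, 8)
```

Because the assignment is soft and the centers are parameters, gradients from the masked-prediction loss can move the effective bin boundaries, which is what allows the binning to be learned rather than fixed in advance.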
The pretraining objective is a read-depth-aware masked gene expression prediction task. A fraction of gene expression values in each cell's transcriptome are masked, and the model is trained to predict the identity (which gene) and expression level of each masked gene given the remaining unmasked transcriptional context. Critically, the masking and prediction are conditioned on the total read count of the cell, which allows the model to distinguish genuine low expression from technical dropout — a notoriously difficult problem in scRNA-seq analysis. Training was distributed across 256 H100 GPUs, requiring approximately three days for the 100M model and eight days for the 650M model.
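The data-preparation side of such an objective can be sketched as follows. This is an illustrative batch-builder under assumed conventions (the sentinel value, mask fraction, and function names are all hypothetical), showing how the total read count survives as an explicit conditioning signal alongside the masked inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

MASK_VALUE = -1.0   # hypothetical sentinel the embedding layer would recognize
mask_frac = 0.15    # assumed masking rate, for illustration only

def make_masked_example(counts):
    """Build one (inputs, mask indices, targets, read depth) training example."""
    n_genes = counts.shape[0]
    n_mask = max(1, int(round(mask_frac * n_genes)))
    idx = rng.choice(n_genes, size=n_mask, replace=False)

    targets = counts[idx].astype(float)   # values the model must reconstruct
    inputs = counts.astype(float)
    inputs[idx] = MASK_VALUE              # hide the selected genes

    read_depth = counts.sum()             # conditioning signal: total reads
    return inputs, idx, targets, read_depth

cell = rng.poisson(lam=2.0, size=100)     # toy transcriptome of 100 genes
inputs, idx, targets, depth = make_masked_example(cell)
print(len(idx), depth == cell.sum())      # 15 True
```

Conditioning on `read_depth` lets the model learn that a zero in a shallow cell is weaker evidence of true silence than a zero in a deeply sequenced one.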
The training corpus consists of 50 million human single-cell RNA-seq profiles drawn from diverse tissue sources, assembled from publicly available datasets with standardized processing. This corpus provides coverage of major human organ systems and cell type categories, including immune cells, epithelial cells, neuronal subtypes, stromal cells, and stem/progenitor populations. The breadth of tissue coverage is important for learning representations that generalize across cell type annotation and clustering tasks without bias toward any single tissue.
Benchmark evaluation covers three primary categories. Zero-shot clustering performance — evaluated by applying AIDO.Cell embeddings to cells without any task-specific fine-tuning and measuring the quality of the resulting clustering against known cell type labels — demonstrates that the model's pretrained representations already capture the major axes of transcriptional variation that define cell identity. Cell-type classification, evaluated with minimal fine-tuning, achieves state-of-the-art accuracy across multiple tissue benchmarks. Perturbation modeling — predicting the transcriptional response to genetic knockouts or drug treatments — reaches top benchmark performance, a particularly challenging task because it requires the model to generalize to interventional scenarios not seen during pretraining. The four model sizes were systematically compared to identify computational scaling asymptotes, with results showing continued improvement from 3M to 100M parameters and diminishing but still positive returns from 100M to 650M parameters.
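The zero-shot clustering protocol can be illustrated end to end with stand-in embeddings: cluster frozen representations without any fine-tuning, then score agreement with known labels using permutation-invariant metrics. The embeddings below are synthetic points around three made-up "cell type" centroids; a real evaluation would substitute AIDO.Cell embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen cell embeddings: three well-separated
# ground-truth cell types, 50 cells each, in a 2-D embedding space.
centroids = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]])
labels = np.repeat([0, 1, 2], 50)  # known cell-type annotations
embeddings = centroids[labels] + rng.normal(scale=0.5, size=(150, 2))

# Zero-shot: cluster the embeddings directly, with no task-specific training.
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# ARI and NMI are the usual label-agnostic clustering scores; both are
# invariant to permutations of the predicted cluster ids.
ari = adjusted_rand_score(labels, pred)
nmi = normalized_mutual_info_score(labels, pred)
print(round(ari, 3), round(nmi, 3))
```

High ARI/NMI under this protocol indicates that the pretrained representation alone separates cell identities, which is the claim the zero-shot clustering benchmarks test.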
The HuggingFace release provides pretrained weights for the AIDO.Cell-3M, AIDO.Cell-10M, and AIDO.Cell-100M checkpoints, and the AIDO.ModelGenerator framework provides standardized fine-tuning workflows for downstream tasks including cell type classification, batch correction, trajectory inference, and perturbation prediction.
AIDO.Cell is relevant to the broad community of researchers working with single-cell transcriptomics data across biology, medicine, and drug discovery. Cell biologists and developmental biologists can apply the model's zero-shot clustering capability to automatically organize large single-cell atlases by transcriptional similarity without requiring predefined cell type labels — a critical step in interpreting data from novel tissues, disease states, or developmental time points not well represented in existing references. Clinical researchers can use AIDO.Cell for cell type annotation in patient samples, where the model's pretrained representations reduce the labeled training data required to accurately classify cell populations in disease-specific contexts. Pharmacologists and drug discovery teams benefit most directly from the perturbation modeling capability: by predicting transcriptional responses to genetic or chemical perturbations before conducting costly experiments, AIDO.Cell enables in silico prioritization of drug targets and mechanistic hypotheses. This is particularly valuable in oncology, where large-scale perturbation datasets, such as CRISPR-Cas9 dropout screens or transcriptomic drug response profiles, can serve as reference data for fine-tuning the model's perturbation predictions. The inclusion of AIDO.Cell on the Chan Zuckerberg Initiative Virtual Cells Platform extends accessibility to the broader cell biology community through a standardized interface, enabling researchers without deep ML expertise to apply the model to their single-cell datasets.
AIDO.Cell represents a methodological advance for single-cell foundation models by demonstrating that full transcriptome-scale context is both computationally feasible and empirically beneficial. The demonstration that processing all ~20,000 genes simultaneously yields better representations than truncated or sampled inputs provides empirical justification for investing in the engineering required to handle long-context single-cell data — a result with implications for how future single-cell foundation models are designed. The read-depth-aware pretraining objective is a biologically motivated design choice that addresses a real confound in scRNA-seq data, and its effectiveness in improving benchmark performance suggests that domain knowledge about technical artifacts should be incorporated more broadly into single-cell model design. As a component of the AIDO multiscale platform, AIDO.Cell is intended to be coupled with AIDO.DNA, AIDO.RNA, and AIDO.Protein to enable cross-scale biological inference — for example, linking genetic variants at the DNA level to transcriptional consequences at the cellular level. The current model is limited to human cells, which is appropriate for biomedical applications but excludes researchers working with model organisms or non-human datasets. The pretraining corpus of 50 million cells, while large by the standards of 2024 single-cell model development, is smaller than the largest available public single-cell atlases, and future versions will likely benefit from expanded training data coverage.