A transfer learning framework that predicts single-cell gene expression from ~200 kb DNA sequences using Enformer embeddings and a lightweight MLP.
seq2cells is a transfer learning framework developed by GSK.ai that predicts gene expression at single-cell resolution directly from DNA sequence. Rather than training a model from scratch on genomic sequence, seq2cells builds on Enformer — a deep learning model pre-trained on bulk epigenomic and transcriptomic data across a ~200 kilobase context window — and adapts its learned representations to resolve expression differences between individual cells. This approach addresses a fundamental limitation of earlier sequence-to-expression models: their reliance on aggregated, bulk measurements that obscure the cell-type-specific regulatory logic encoded in the genome.
The framework was motivated by the observation that genetic variants associated with complex disease typically act through gene expression effects that are specific to particular cell types or activation states. Understanding which variants alter expression, and in which cells, requires models capable of operating at single-cell rather than tissue-average resolution. seq2cells provides a computationally tractable route to this goal by combining a large, expressive DNA encoder with a lightweight cell-state-specific predictor, making it feasible to apply to datasets containing hundreds of thousands of cells.
The preprint was posted to bioRxiv in July 2023 by Ron Schwessinger, Jacob Deasy, Rob T. Woodruff, Stephen Young, and Kim M. Branson, all at GSK.ai.
seq2cells is implemented as a two-stage pipeline. The first stage, seq2emb, passes a ~200 kb genomic window centered on a gene's canonical transcription start site (TSS; Gencode V41, hg38 reference) through the pre-trained Enformer trunk, producing a fixed-dimensional sequence embedding. Enformer itself is a deep convolutional and transformer model pre-trained to predict thousands of epigenomic and transcriptomic tracks from bulk assays. In seq2cells, the Enformer weights are held frozen, and only the second module, emb2cell (a two-layer MLP), is trained on single-cell expression data provided in AnnData format. Training uses early stopping with a patience of 5 epochs and a maximum of 30 epochs.
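The stopping rule can be illustrated with a minimal sketch. It is not the seq2cells training code: `train_with_early_stopping` and the `run_epoch` callback are hypothetical names, and the sketch assumes "patience of 5" means halting after five consecutive epochs without a new best validation loss. Only the values 5 and 30 come from the description above.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 30,
    patience: int = 5,
) -> Tuple[int, float, int]:
    """Train for at most `max_epochs`, stopping early on a validation plateau.

    `run_epoch(epoch)` is a hypothetical callback that trains the MLP for one
    epoch and returns the validation loss. Returns
    (best_epoch, best_loss, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run += 1
        if val_loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            # `patience` consecutive epochs without improvement: stop.
            break
    return best_epoch, best_loss, epochs_run


# Toy validation curve (made-up numbers): improves until epoch 2, then plateaus.
toy_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.76, 0.77]
result = train_with_early_stopping(lambda epoch: toy_losses[epoch])
```

In practice the weights from `best_epoch` would be restored after stopping. Deep learning frameworks typically provide this logic as a ready-made callback.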
The model was validated on three single-cell T cell datasets: a hematopoietic stem cell-focused subset of approximately 30,000 cells, a full T cell development dataset of approximately 250,000 cells, and a CD4 T cell activation dataset of approximately 650,000 cells. Evaluation on held-out genes yielded a cross-gene Pearson correlation of 0.762 and a cross-cell Pearson correlation of 0.285. The gap between these two metrics reflects the intrinsic difficulty of resolving between-cell variation from sequence alone, as much of that variation arises from post-transcriptional and environmental factors not encoded in the genome. Subsequent work (scooby, Nature Methods 2025) that extends the approach to multimodal single-cell profiles reported improved cross-gene correlations of up to 0.87 on shared test genes, providing a useful reference point for seq2cells' performance.
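The difference between the two metrics is easier to see in code. The sketch below assumes the usual definitions: cross-gene correlation compares predicted and observed expression across genes within each cell and averages over cells, while cross-cell correlation compares them across cells within each gene and averages over genes. The exact aggregation used in the preprint may differ. The function names and the toy matrices are illustrative only.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # rows are genes, columns are cells


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def _mean_ignoring_nan(values: Sequence[float]) -> float:
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def cross_gene_correlation(observed: Matrix, predicted: Matrix) -> float:
    """For each cell, correlate across genes; then average over cells."""
    n_cells = len(observed[0])
    per_cell = [
        pearson([row[c] for row in observed], [row[c] for row in predicted])
        for c in range(n_cells)
    ]
    return _mean_ignoring_nan(per_cell)


def cross_cell_correlation(observed: Matrix, predicted: Matrix) -> float:
    """For each gene, correlate across cells; then average over genes."""
    per_gene = [pearson(obs, pred) for obs, pred in zip(observed, predicted)]
    return _mean_ignoring_nan(per_gene)


# Toy data (3 genes x 3 cells, made up): the predictions capture each gene's
# overall expression level but not its pattern of variation between cells.
observed = [[1.0, 2.0, 3.0], [10.0, 12.0, 11.0], [5.0, 4.0, 6.0]]
predicted = [[2.0, 2.2, 1.9], [11.0, 10.8, 11.1], [5.0, 4.9, 5.1]]

cross_gene = cross_gene_correlation(observed, predicted)
cross_cell = cross_cell_correlation(observed, predicted)
```

On this toy input the cross-gene score is close to 1 while the cross-cell score is near zero: a model can rank genes well within a cell yet still miss most of the variation between cells.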
seq2cells is designed for researchers working at the intersection of functional genomics, single-cell biology, and human genetics. Computational biologists can use it to prioritize and interpret non-coding variants from GWAS by predicting their expression consequences at cell-type resolution. Immunologists and cell biologists studying heterogeneous tissues can use the framework to understand which regulatory programs are driven by DNA sequence versus environmental or epigenetic factors. Pharmaceutical researchers can apply variant effect predictions to link disease-associated polymorphisms to specific cell states, informing target identification. Model weights and precomputed Enformer embeddings are available from Zenodo, allowing researchers to skip the computationally expensive embedding step and fine-tune only the MLP on their own single-cell datasets.
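Variant effect prediction with a sequence-to-expression model is commonly done by scoring the reference and alternate alleles and taking the difference. The sketch below illustrates that general pattern only and does not use the seq2cells API: `apply_snv`, `variant_effect`, and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a real model that would return one predicted expression value per cell.

```python
from typing import Callable, List


def apply_snv(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return `sequence` with the base at 0-based `position` changed to `alt`."""
    if sequence[position] != ref:
        raise ValueError(
            f"expected reference base {ref!r} at position {position}, "
            f"found {sequence[position]!r}"
        )
    return sequence[:position] + alt + sequence[position + 1:]


def variant_effect(
    sequence: str,
    position: int,
    ref: str,
    alt: str,
    predict: Callable[[str], List[float]],
) -> List[float]:
    """Per-cell predicted expression change (alternate minus reference)."""
    ref_pred = predict(sequence)
    alt_pred = predict(apply_snv(sequence, position, ref, alt))
    return [a - r for a, r in zip(alt_pred, ref_pred)]


def toy_predict(sequence: str) -> List[float]:
    """Hypothetical stand-in for a trained model: two 'cells' whose output
    depends on G and C counts. A real predictor would embed the full window
    around the TSS and return one value per cell."""
    return [1.0 * sequence.count("G"), 2.0 * sequence.count("C")]


# Score a G>C substitution at position 2 of a short toy sequence.
delta = variant_effect("ACGTACGT", 2, "G", "C", toy_predict)
```

A per-cell vector such as `delta` can then be averaged within annotated cell types to ask in which cell states a variant is predicted to act.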
seq2cells demonstrated that the large-context sequence representations learned by bulk epigenomic models such as Enformer contain sufficient information to resolve gene expression differences at single-cell resolution, a non-obvious result that validated the transfer learning strategy for this domain. As a preprint from an industry lab at GSK.ai, it contributed to a growing literature on sequence-to-expression modeling and helped establish single-cell resolution as a tractable prediction target. The work has directly influenced subsequent methods: the scooby model (Nature Methods, 2025), which extends the framework to jointly predict chromatin accessibility and gene expression in a multimodal single-cell setting, cited seq2cells as a baseline and demonstrated substantial improvements. A key limitation of seq2cells is that its cross-cell correlation (0.285) remains low, reflecting the reality that cell-to-cell gene expression variation is only partially encoded in DNA sequence — the rest is shaped by signaling states, chromatin dynamics, and stochastic factors that lie beyond the scope of a sequence-only model.
Schwessinger, R., et al. (2023) Single-cell gene expression prediction from DNA sequence at large contexts. bioRxiv.
DOI: 10.1101/2023.07.26.550634