seq2cells

Transfer learning framework that predicts single-cell gene expression from ~200 kb DNA sequences using Enformer embeddings and a lightweight MLP.

Released: July 2023

seq2cells is a transfer learning framework developed by GSK.ai that predicts gene expression at single-cell resolution directly from DNA sequence. Rather than training a model from scratch on genomic sequence, seq2cells builds on Enformer — a deep learning model pre-trained on bulk epigenomic and transcriptomic data across a ~200 kilobase context window — and adapts its learned representations to resolve expression differences between individual cells. This approach addresses a fundamental limitation of earlier sequence-to-expression models: their reliance on aggregated, bulk measurements that obscure the cell-type-specific regulatory logic encoded in the genome.

The framework was motivated by the observation that genetic variants associated with complex disease typically act through gene expression effects that are specific to particular cell types or activation states. Understanding which variants alter expression, and in which cells, requires models capable of operating at single-cell rather than tissue-average resolution. seq2cells provides a computationally tractable route to this goal by combining a large, expressive DNA encoder with a lightweight cell-state-specific predictor, making it feasible to apply to datasets containing hundreds of thousands of cells.

The preprint was posted to bioRxiv in July 2023 by Ron Schwessinger, Jacob Deasy, Rob T. Woodruff, Stephen Young, and Kim M. Branson, all at GSK.ai.

Key Features

Large-context DNA encoding: Accepts approximately 200 kilobases of sequence centered on the transcription start site (TSS) of a gene, allowing the model to capture long-range cis-regulatory elements, such as distal enhancers, that fall outside the input of sequence models with shorter context windows.
Modular two-component design: The seq2emb module extracts DNA sequence embeddings from the frozen Enformer trunk, and the emb2cell module — a lightweight two-layer MLP — is trained to map those embeddings to single-cell expression predictions, enabling fast adaptation to new single-cell datasets without retraining the expensive DNA encoder.
In silico variant effect prediction: By substituting reference alleles with alternate alleles in the input sequence, seq2cells predicts how single nucleotide variants (SNVs) alter expression across individual cells, revealing regulatory heterogeneity within broadly defined cell type annotations.
Cross-population variant transfer: The framework supports in silico transfer of predicted variant effects between cell populations, enabling researchers to reason about how a variant characterized in one tissue or activation state might act in another without additional experimental data.
Scalability to large single-cell datasets: Demonstrated on a CD4 T cell activation dataset comprising approximately 650,000 cells, establishing practical scalability to the dataset sizes typical in modern single-cell atlases.
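
As a rough illustration of the in silico variant effect idea in the list above, the following Python sketch swaps a reference allele for an alternate allele in a sequence window and compares per-cell predictions before and after. The encoder and predictor here (toy_encoder, toy_head) are hypothetical stand-ins chosen so the example runs on its own; they are not the Enformer trunk or the trained emb2cell module, and every function name is an assumption made for this example.

```python
from typing import Callable, List, Sequence


def substitute_allele(window: str, offset: int, ref: str, alt: str) -> str:
    """Replace the single reference base at `offset` (0-based) with `alt`."""
    if window[offset] != ref:
        raise ValueError(
            f"expected {ref!r} at offset {offset}, found {window[offset]!r}"
        )
    return window[:offset] + alt + window[offset + 1:]


def toy_encoder(window: str) -> List[float]:
    """Hypothetical stand-in for a frozen DNA encoder: A/C/G/T frequencies."""
    return [window.count(base) / len(window) for base in "ACGT"]


def toy_head(embedding: Sequence[float],
             weights: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for a trained predictor: one linear output per cell."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def variant_effect(window: str, offset: int, ref: str, alt: str,
                   weights: Sequence[Sequence[float]],
                   encoder: Callable[[str], List[float]] = toy_encoder) -> List[float]:
    """Per-cell difference (alternate minus reference) in predicted expression."""
    ref_pred = toy_head(encoder(window), weights)
    alt_pred = toy_head(encoder(substitute_allele(window, offset, ref, alt)), weights)
    return [a - r for a, r in zip(alt_pred, ref_pred)]


# Two toy "cells": the first responds to G content, the second to T content.
cell_weights = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
effects = variant_effect("ACGTACGT", offset=2, ref="G", alt="T", weights=cell_weights)
print(effects)  # [-0.125, 0.125]
```

The same variant yields a different signed effect in each toy cell, which is the kind of per-cell heterogeneity the feature list describes. In a real setting the window would be the ~200 kb sequence around the TSS and the encoder and head would be the actual trained components.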

Technical Details

seq2cells is implemented as a two-stage pipeline. The first stage, seq2emb, passes a ~200 kb genomic window centered on a gene's canonical TSS (GENCODE V41, hg38 reference) through the pre-trained Enformer trunk, producing a fixed-dimensional sequence embedding. Enformer itself is a deep convolutional and transformer model pre-trained to predict thousands of epigenomic and transcriptomic tracks from bulk assays. In seq2cells, the Enformer weights are held frozen, and only the second module, emb2cell (a two-layer MLP), is trained on single-cell expression data provided in AnnData format. Training uses early stopping with a patience of 5 epochs and runs for at most 30 epochs.
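
The stopping rule mentioned above (patience of 5 epochs, at most 30 epochs) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the monitored quantity is taken to be a validation loss where lower is better, which the description above does not specify, and the function name run_with_early_stopping is hypothetical rather than part of the seq2cells codebase.

```python
from typing import Iterable, Tuple


def run_with_early_stopping(val_losses: Iterable[float],
                            patience: int = 5,
                            max_epochs: int = 30) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Assumption: `val_losses` yields one validation loss per epoch and
    lower values are better.
    """
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break  # hard cap on training length
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, last_epoch


# Loss improves until epoch 3, then stalls for five epochs.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6]
print(run_with_early_stopping(losses))  # (3, 8)
```

In this toy trace training halts at epoch 8, so the later improvement at epoch 9 is never seen; that is the usual trade-off of a short patience against training cost.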

The model was validated on three single-cell T cell datasets: a hematopoietic stem cell-focused subset of a T cell development atlas (approximately 30,000 cells), the full T cell development dataset (approximately 250,000 cells), and a CD4 T cell activation dataset (approximately 650,000 cells). Evaluation against held-out genes yielded a cross-gene Pearson correlation of 0.762 and a cross-cell Pearson correlation of 0.285. The gap between these two metrics reflects the intrinsic difficulty of resolving between-cell variation from sequence alone, as much of that variation arises from post-transcriptional and environmental factors not encoded in the genome. Subsequent work (scooby, Nature Methods 2025), which extends the approach to multimodal single-cell profiles, reported improved cross-gene correlations of up to 0.87 on shared test genes, providing a useful reference point for seq2cells' performance.
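
To make the difference between the two metrics concrete, here is a small Python sketch that computes both from a genes x cells matrix of observed and predicted values. It assumes that the cross-gene correlation is computed across genes within each cell and then averaged over cells, and that the cross-cell correlation is computed across cells within each gene and then averaged over genes; the text above does not state the aggregation, so treat that as an assumption. The function names are illustrative.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence

Matrix = List[List[float]]  # rows are genes, columns are cells


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cross_gene_correlation(obs: Matrix, pred: Matrix) -> float:
    """For each cell, correlate across genes; then average over cells."""
    n_cells = len(obs[0])
    return mean(
        pearson([row[c] for row in obs], [row[c] for row in pred])
        for c in range(n_cells)
    )


def cross_cell_correlation(obs: Matrix, pred: Matrix) -> float:
    """For each gene, correlate across cells; then average over genes."""
    return mean(pearson(o, p) for o, p in zip(obs, pred))


# Toy data: predictions rank the genes correctly in every cell,
# but reverse the ordering of cells within every gene.
observed = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
predicted = [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(round(cross_gene_correlation(observed, predicted), 6))  # 1.0
print(round(cross_cell_correlation(observed, predicted), 6))  # -1.0
```

The toy example is deliberately extreme: a model can separate highly and lowly expressed genes well while getting the variation between cells wrong, which is why the two numbers are reported separately.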

Applications

seq2cells is designed for researchers working at the intersection of functional genomics, single-cell biology, and human genetics. Computational biologists can use it to prioritize and interpret non-coding variants from GWAS by predicting their expression consequences at cell-type resolution. Immunologists and cell biologists studying heterogeneous tissues can use the framework to understand which regulatory programs are driven by DNA sequence versus environmental or epigenetic factors. Pharmaceutical researchers can apply variant effect predictions to link disease-associated polymorphisms to specific cell states, informing target identification. Model weights and precomputed Enformer embeddings are available from Zenodo, allowing researchers to skip the computationally expensive embedding step and train only the MLP on their own single-cell datasets.

Impact

seq2cells demonstrated that the large-context sequence representations learned by bulk epigenomic models such as Enformer contain sufficient information to resolve gene expression differences at single-cell resolution, a non-obvious result that validated the transfer learning strategy for this domain. As a preprint from an industry lab at GSK.ai, it contributed to a growing literature on sequence-to-expression modeling and helped establish single-cell resolution as a tractable prediction target. The work has directly influenced subsequent methods: the scooby model (Nature Methods, 2025), which extends the framework to jointly predict chromatin accessibility and gene expression in a multimodal single-cell setting, cited seq2cells as a baseline and demonstrated substantial improvements. A key limitation of seq2cells is that its cross-cell correlation (0.285) remains low, reflecting the reality that cell-to-cell gene expression variation is only partially encoded in DNA sequence — the rest is shaped by signaling states, chromatin dynamics, and stochastic factors that lie beyond the scope of a sequence-only model.

Citation

Single-cell gene expression prediction from DNA sequence at large contexts

Preprint

Schwessinger, R., et al. (2023) Single-cell gene expression prediction from DNA sequence at large contexts. bioRxiv.

DOI: 10.1101/2023.07.26.550634

Recent citations

Papers that recently cited this model.

CREsted: modeling genomic and synthetic cell-type-specific enhancers across tissues and species
Niklas Kempynck, S. De Winter, Casper H. Blaauw, et al.
Nature Methods · Apr 2026 · 0 citations
Decoding exon inclusion in the human brain reveals more divergent splicing mechanisms in neurons than glia
Lieke Michielsen, Justine Hsu, Anoushka Joglekar, et al.
Genome Biology · Feb 2026 · 1 citation
Parameter-efficient fine-tuning enables scalable transfer of regulatory sequence models to novel contexts
Han Yuan, Johannes Linder, David R. Kelley
Genome Biology · Jan 2026 · 1 citation

Top citations

The most-cited papers that cite this model.

Predicting gene expression from DNA sequence using deep learning models
Lucía Barbadilla-Martínez, Noud Klaassen, B. van Steensel, et al.
Nature Reviews Genetics · May 2025 · 56 citations
Deciphering cell types by integrating scATAC-seq data with genome sequences
Yuansong Zeng, Mai Luo, Ningyuan Shangguan, et al.
Nature Computational Science · Apr 2024 · 27 citations
Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects
Xiao-Yong Wang, Fuyi Li, Yiwen Zhang, et al.
Briefings in Bioinformatics · Jul 2024 · 22 citations
scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution
Johannes C. Hingerl, Laura D. Martens, Alexander Karollus, et al.
bioRxiv · Sep 2024 · 11 citations
scooby: modeling multimodal genomic profiles from DNA sequence at single-cell resolution
Johannes C. Hingerl, Laura D. Martens, Alexander Karollus, et al.
Nature Methods · Oct 2025 · 10 citations

Citations

Total Citations: 18

Influential: 1

References: 56

GitHub

Stars: 12

Forks: 2

Open Issues: 0

Contributors: 2

Last Push: 2y ago

Language: Python

License: Apache-2.0

Fields of citing research

Biology: 100%
Computer Science: 94%
Medicine: 76%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

64 (Partial)

Usability (can I run it?): 70

Reproducibility (can I retrain it?): 55

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository · Research Paper · Dataset
