bio.rodeo
DNA & Gene

seq2cells

GSK.ai

Transfer learning framework that predicts single-cell gene expression from ~200 kb DNA sequences using Enformer embeddings and a lightweight MLP.

Released: 2023

Overview

seq2cells is a transfer learning framework developed by GSK.ai that predicts gene expression at single-cell resolution directly from DNA sequence. Rather than training a model from scratch on genomic sequence, seq2cells builds on Enformer — a deep learning model pre-trained on bulk epigenomic and transcriptomic data across a ~200 kilobase context window — and adapts its learned representations to resolve expression differences between individual cells. This approach addresses a fundamental limitation of earlier sequence-to-expression models: their reliance on aggregated, bulk measurements that obscure the cell-type-specific regulatory logic encoded in the genome.

The framework was motivated by the observation that genetic variants associated with complex disease typically act through gene expression effects that are specific to particular cell types or activation states. Understanding which variants alter expression, and in which cells, requires models capable of operating at single-cell rather than tissue-average resolution. seq2cells provides a computationally tractable route to this goal by combining a large, expressive DNA encoder with a lightweight cell-state-specific predictor, making it feasible to apply to datasets containing hundreds of thousands of cells.

The preprint was posted to bioRxiv in July 2023 by Ron Schwessinger, Jacob Deasy, Rob T. Woodruff, Stephen Young, and Kim M. Branson, all at GSK.ai.

Key Features

  • Large-context DNA encoding: Accepts approximately 200 kilobases of sequence centered on the transcription start site (TSS) of a gene, allowing the model to capture long-range cis-regulatory elements, such as distal enhancers, that models with shorter context windows typically miss.
  • Modular two-component design: The seq2emb module extracts DNA sequence embeddings from the frozen Enformer trunk, and the emb2cell module — a lightweight two-layer MLP — is trained to map those embeddings to single-cell expression predictions, enabling fast adaptation to new single-cell datasets without retraining the expensive DNA encoder.
  • In silico variant effect prediction: By substituting reference alleles with alternate alleles in the input sequence, seq2cells predicts how single nucleotide variants (SNVs) alter expression across individual cells, revealing regulatory heterogeneity within broadly defined cell type annotations.
  • Cross-population variant transfer: The framework supports in silico transfer of predicted variant effects between cell populations, enabling researchers to reason about how a variant characterized in one tissue or activation state might act in another without additional experimental data.
  • Scalability to large single-cell datasets: Demonstrated on a CD4 T cell activation dataset comprising approximately 650,000 cells, establishing practical scalability to the dataset sizes typical in modern single-cell atlases.
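The windowing and allele-substitution steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the seq2cells implementation: the window length of 196,608 bp is Enformer's published input size, coordinates are treated as 0-based, and `toy_predict` is a hypothetical stand-in for the real sequence-to-expression model that simply returns two made-up "per-cell" values.

```python
from typing import Callable, List, Tuple

# Assumption: Enformer's published input length (196,608 bp, roughly 200 kb).
WINDOW_BP = 196_608


def tss_window(tss: int, length: int = WINDOW_BP) -> Tuple[int, int]:
    """Return a 0-based, half-open [start, end) window centered on the TSS.

    The start can be negative for a TSS near a chromosome end; a real
    pipeline would need to pad or clip in that case.
    """
    start = tss - length // 2
    return start, start + length


def apply_snv(window_seq: str, window_start: int, pos: int, ref: str, alt: str) -> str:
    """Replace the reference base at 0-based genomic position `pos` with `alt`."""
    offset = pos - window_start
    if not 0 <= offset < len(window_seq):
        raise ValueError("variant lies outside the sequence window")
    if window_seq[offset].upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence")
    return window_seq[:offset] + alt + window_seq[offset + 1:]


def variant_effect(
    predict: Callable[[str], List[float]],
    window_seq: str,
    window_start: int,
    pos: int,
    ref: str,
    alt: str,
) -> List[float]:
    """Per-cell predicted effect: prediction(alt sequence) - prediction(ref sequence)."""
    ref_pred = predict(window_seq)
    alt_pred = predict(apply_snv(window_seq, window_start, pos, ref, alt))
    return [a - r for r, a in zip(ref_pred, alt_pred)]


def toy_predict(seq: str) -> List[float]:
    """Hypothetical predictor: returns [GC fraction, A fraction] as two fake cells."""
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq) / len(seq)
    a = seq.count("A") / len(seq)
    return [gc, a]


# Usage on a tiny 10 bp window so the example stays readable.
start, end = tss_window(1000, length=10)
effects = variant_effect(toy_predict, "ACGTACGTAC", start, 997, "G", "T")
```

Checking the reference allele before substitution is a cheap guard against coordinate or genome-build mismatches, which are a common source of silent errors in variant scoring.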

Technical Details

seq2cells is implemented as a two-stage pipeline. The first stage, seq2emb, passes a ~200 kb genomic window centered on a gene's canonical TSS (Gencode V41, hg38 reference) through the pre-trained Enformer trunk, producing a fixed-dimensional sequence embedding. Enformer itself is a deep convolutional and transformer model pre-trained to predict thousands of epigenomic and transcriptomic tracks from bulk assays. In seq2cells, the Enformer weights are held frozen, and only the second module, emb2cell (a two-layer MLP), is trained on single-cell expression data provided in AnnData format. Training uses early stopping with a patience of 5 epochs and runs for at most 30 epochs.
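As a rough sketch of the second stage, the snippet below shows a two-layer MLP forward pass over a precomputed embedding, plus a training loop that stops after 5 epochs without improvement or 30 epochs in total. It is written in plain Python for readability and is not the project's code. The ReLU hidden activation, the weight layout, and the names `mlp_forward`, `train_with_early_stopping`, and `run_epoch` are all assumptions made for illustration; only the patience and epoch limits come from the description above.

```python
from typing import Callable, List, Tuple


def mlp_forward(
    embedding: List[float],
    w1: List[List[float]],
    b1: List[float],
    w2: List[List[float]],
    b2: List[float],
) -> List[float]:
    """Map one gene's sequence embedding to one predicted value per cell.

    w1 has one row per hidden unit, w2 has one row per cell.
    ReLU is an assumed activation, not a documented detail.
    """
    hidden = [
        max(0.0, sum(x * w for x, w in zip(embedding, row)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(w2, b2)]


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    patience: int = 5,
    max_epochs: int = 30,
) -> Tuple[int, float, int]:
    """Call run_epoch(epoch) until validation loss stalls or max_epochs is hit.

    run_epoch is a hypothetical callback that trains for one epoch and
    returns the validation loss. Returns (best_epoch, best_loss, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run = epoch + 1
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss, epochs_run


# Usage with a scripted loss curve standing in for real training.
losses = [1.0, 0.8, 0.7, 0.75, 0.74, 0.73, 0.72, 0.71, 0.9]
best_epoch, best_loss, epochs_run = train_with_early_stopping(lambda e: losses[e])
```

Because the DNA encoder stays frozen, embeddings can be computed once per gene and reused, so only this small predictor has to be fit for each new single-cell dataset.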

The model was validated on three single-cell datasets covering T cell development and activation: a hematopoietic stem cell-focused subset of approximately 30,000 cells, a full T cell development dataset of approximately 250,000 cells, and a CD4 T cell activation dataset of approximately 650,000 cells. Evaluation on held-out genes yielded a cross-gene Pearson correlation of 0.762 and a cross-cell Pearson correlation of 0.285. The gap between these two metrics reflects the intrinsic difficulty of resolving between-cell variation from sequence alone, as much of that variation arises from post-transcriptional and environmental factors not encoded in the genome. Subsequent work, scooby (Nature Methods, 2025), which extends the approach to multimodal single-cell profiles, reported improved cross-gene correlations of up to 0.87 on shared test genes, providing a useful reference point for seq2cells' performance.
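The difference between the two metrics is easier to see in code. The sketch below assumes one common reading of the terms: cross-gene correlation compares predicted and observed expression across genes within each cell and averages over cells, while cross-cell correlation compares across cells within each gene and averages over genes. The exact averaging used in the paper may differ, and the function names and toy matrices are illustrative only.

```python
from math import sqrt
from typing import List, Sequence

Matrix = List[List[float]]  # rows are genes, columns are cells


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def _mean_ignoring_nan(values: List[float]) -> float:
    valid = [v for v in values if v == v]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid) if valid else float("nan")


def cross_gene_correlation(observed: Matrix, predicted: Matrix) -> float:
    """Correlate across genes within each cell, then average over cells."""
    n_cells = len(observed[0])
    per_cell = [
        pearson([row[c] for row in observed], [row[c] for row in predicted])
        for c in range(n_cells)
    ]
    return _mean_ignoring_nan(per_cell)


def cross_cell_correlation(observed: Matrix, predicted: Matrix) -> float:
    """Correlate across cells within each gene, then average over genes."""
    per_gene = [pearson(obs, pred) for obs, pred in zip(observed, predicted)]
    return _mean_ignoring_nan(per_gene)


# Toy case: predictions rank genes correctly in every cell but reverse
# the ordering of cells within every gene.
observed = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
predicted = [[3.0, 2.0, 1.0], [6.0, 5.0, 4.0], [9.0, 8.0, 7.0]]
cross_gene = cross_gene_correlation(observed, predicted)
cross_cell = cross_cell_correlation(observed, predicted)
```

In this toy case the cross-gene score is perfect while the cross-cell score is perfectly negative, which shows how a model can capture which genes are highly expressed overall without capturing how expression varies between cells.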

Applications

seq2cells is designed for researchers working at the intersection of functional genomics, single-cell biology, and human genetics. Computational biologists can use it to prioritize and interpret non-coding variants from GWAS by predicting their expression consequences at cell-type resolution. Immunologists and cell biologists studying heterogeneous tissues can use the framework to understand which regulatory programs are driven by DNA sequence versus environmental or epigenetic factors. Pharmaceutical researchers can apply variant effect predictions to link disease-associated polymorphisms to specific cell states, informing target identification. Model weights and precomputed Enformer embeddings are available from Zenodo, allowing researchers to skip the computationally expensive embedding step and fine-tune only the MLP on their own single-cell datasets.

Impact

seq2cells demonstrated that the large-context sequence representations learned by bulk epigenomic models such as Enformer contain sufficient information to resolve gene expression differences at single-cell resolution, a non-obvious result that validated the transfer learning strategy for this domain. As a preprint from an industry lab at GSK.ai, it contributed to a growing literature on sequence-to-expression modeling and helped establish single-cell resolution as a tractable prediction target. The work has directly influenced subsequent methods: the scooby model (Nature Methods, 2025), which extends the framework to jointly predict chromatin accessibility and gene expression in a multimodal single-cell setting, cited seq2cells as a baseline and demonstrated substantial improvements. A key limitation of seq2cells is that its cross-cell correlation (0.285) remains low, reflecting the reality that cell-to-cell gene expression variation is only partially encoded in DNA sequence — the rest is shaped by signaling states, chromatin dynamics, and stochastic factors that lie beyond the scope of a sequence-only model.

Citation

Single-cell gene expression prediction from DNA sequence at large contexts

Preprint

Schwessinger, R., et al. (2023) Single-cell gene expression prediction from DNA sequence at large contexts. bioRxiv.

DOI: 10.1101/2023.07.26.550634

Metrics

GitHub

Stars: 12
Forks: 2
Open Issues: 0
Contributors: 2
Last Push: 2y ago
Language: Python
License: Apache-2.0

Citations

Total Citations: 15
Influential: 1
References: 56

Tags

gene expression, variant effect prediction, foundation model

Resources

GitHub Repository, Research Paper, Dataset