Deep learning model that predicts cell-type-specific gene expression from DNA sequence and Hi-C 3D chromatin organization, generalizing to unseen cell types and species without retraining.
Puget is a deep learning model that predicts cell-type-specific gene expression by combining DNA sequence with experimentally measured three-dimensional chromatin organization. Developed by Shu Hang, William Stafford Noble, and colleagues in the Noble Lab at the University of Washington and released as a bioRxiv preprint in November 2025, Puget addresses a persistent limitation of sequence-only expression predictors: because the input DNA sequence at a locus is essentially identical across cell types, models that rely on sequence alone struggle to explain why the same gene is expressed differently in different cellular contexts.
Gene regulation depends not only on the linear arrangement of regulatory elements but also on how the genome folds in three dimensions, bringing distal enhancers into physical contact with their target promoters. Puget makes this 3D context explicit by feeding Hi-C contact maps, which measure genome-wide DNA–DNA proximity, into the model alongside sequence. This pairing lets Puget capture the cell-type-specific looping that determines which enhancers can reach a given gene's promoter, a signal that is invisible to sequence-only architectures such as Enformer and Borzoi.
Puget occupies the same regulatory-genomics niche as Enformer and Chromoformer but differs in two respects: it uses a pretrained Hi-C encoder, and it has been shown to generalize to held-out cell types and across species without retraining. In that sense it behaves as a pretrained foundation model rather than as a per-cell-type regression.
Puget couples two pretrained encoders to a lightweight transformer decoder. One encoder processes DNA sequence; the other processes Hi-C contact matrices and is based on HiCFoundation, a Vision-Transformer masked autoencoder pretrained on hundreds of Hi-C assays. The pretrained encoders are held fixed, and a compact transformer decoder integrates their embeddings to produce cell-type-specific expression predictions. This design keeps the number of trainable parameters small and concentrates learning on the cross-modal integration step.
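The frozen-encoder, trainable-decoder pattern described above can be illustrated with a toy sketch. This is not the authors' implementation: the two encoders here are fixed random projections standing in for the pretrained sequence and Hi-C encoders, a plain linear head stands in for the transformer decoder, and every name, dimension, and data value is hypothetical. The point is only that gradient updates touch the decoder parameters and leave the encoder weights unchanged.

```python
import copy
import random

random.seed(0)  # reproducible toy weights and data

SEQ_DIM, HIC_DIM, EMB_DIM = 4, 6, 3  # arbitrary toy sizes (assumptions)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Frozen stand-in encoders: fixed random projections, NOT the real pretrained models.
seq_encoder_w = rand_matrix(EMB_DIM, SEQ_DIM)
hic_encoder_w = rand_matrix(EMB_DIM, HIC_DIM)
frozen_snapshot = copy.deepcopy((seq_encoder_w, hic_encoder_w))


def embed(seq_feats, hic_feats):
    """Concatenate the sequence embedding and the Hi-C embedding."""
    return matvec(seq_encoder_w, seq_feats) + matvec(hic_encoder_w, hic_feats)


def predict(z, weights, bias):
    """Linear head standing in for the transformer decoder."""
    return sum(w * x for w, x in zip(weights, z)) + bias


def mse(embeddings, targets, weights, bias):
    errors = [predict(z, weights, bias) - y for z, y in zip(embeddings, targets)]
    return sum(e * e for e in errors) / len(errors)


# Synthetic data: the target depends on the Hi-C features only,
# mimicking a signal that sequence alone cannot explain.
samples = []
for _ in range(32):
    seq = [random.random() for _ in range(SEQ_DIM)]
    hic = [random.random() for _ in range(HIC_DIM)]
    samples.append((seq, hic, 0.5 + 2.0 * sum(hic) / HIC_DIM))

# The encoders are frozen, so embeddings can be computed once and cached.
embeddings = [embed(seq, hic) for seq, hic, _ in samples]
targets = [y for _, _, y in samples]

# Only these decoder parameters are trained.
decoder_w = [0.0] * (2 * EMB_DIM)
decoder_b = 0.0
initial_loss = mse(embeddings, targets, decoder_w, decoder_b)

LEARNING_RATE = 0.01
for _ in range(200):  # full-batch gradient descent on mean squared error
    errors = [predict(z, decoder_w, decoder_b) - y for z, y in zip(embeddings, targets)]
    n = len(errors)
    grad_w = [
        2.0 * sum(e * z[j] for e, z in zip(errors, embeddings)) / n
        for j in range(2 * EMB_DIM)
    ]
    grad_b = 2.0 * sum(errors) / n
    decoder_w = [w - LEARNING_RATE * g for w, g in zip(decoder_w, grad_w)]
    decoder_b -= LEARNING_RATE * grad_b

final_loss = mse(embeddings, targets, decoder_w, decoder_b)
```

Because the encoder weights never change, their outputs are cached before training and the loop updates only `decoder_w` and `decoder_b`, which is what keeps the trainable parameter count small in a design of this kind.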
The model was trained on paired Hi-C and RNA-seq data from 36 human and 4 mouse biosamples. Evaluation tested three generalization regimes: held-out genes, held-out biosamples, and human-to-mouse transfer. Relative to a sequence-only baseline, Puget improves cross-biosample Pearson correlation by up to 25% on highly variable genes, and — unlike the sequence-only model — it generalizes to held-out biosamples and across species without retraining. Highly variable genes, which differ most across cell types, are precisely the cases where sequence-only models fail and where the Hi-C signal contributes most.
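One plausible reading of the cross-biosample metric is a per-gene Pearson correlation between observed and predicted expression, computed across biosamples rather than across genes. The preprint's exact definition may differ; the sketch below assumes this reading, and the function names and toy values are purely illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


def cross_biosample_pearson(observed, predicted):
    """Per-gene correlation across biosamples.

    Both arguments map a gene name to a list of expression values,
    one per biosample, in the same biosample order.
    Genes with undefined correlation are skipped.
    """
    per_gene = {}
    for gene, obs in observed.items():
        r = pearson(obs, predicted[gene])
        if r is not None:
            per_gene[gene] = r
    return per_gene


# Toy values for four biosamples (hypothetical, not from the paper).
observed = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],   # varies across biosamples
    "GENE_B": [4.0, 3.0, 2.0, 1.0],   # varies across biosamples
    "GENE_C": [5.0, 5.0, 5.0, 5.0],   # constant, so correlation is undefined
}
predicted = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0],   # tracks the observed trend
    "GENE_B": [1.0, 2.0, 3.0, 4.0],   # predicts the opposite trend
    "GENE_C": [5.1, 4.9, 5.0, 5.2],
}

scores = cross_biosample_pearson(observed, predicted)
mean_score = sum(scores.values()) / len(scores)
```

Under this reading, a gene whose expression barely changes across biosamples contributes little or nothing to the metric, which is consistent with the emphasis on highly variable genes: a model that outputs the same value for every biosample cannot score well on them.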
Puget is aimed at researchers in regulatory genomics, functional genomics, and gene-regulation modeling who need expression predictions that are sensitive to cellular context. Because it generalizes to held-out cell types from their Hi-C maps, it can impute expression for biosamples that have chromatin-conformation data but limited expression profiling, and its in silico perturbation capability lets investigators prioritize candidate enhancer-gene links for experimental follow-up. A practical constraint is that Puget requires Hi-C data as input in addition to sequence, so it is best suited to settings where 3D chromatin maps are already available rather than to purely sequence-driven, genome-wide screens.
Puget demonstrates that incorporating measured 3D genome organization, rather than relying on sequence alone, can meaningfully improve cell-type-specific expression prediction and enable generalization to unseen cell types and species — a long-standing weakness of sequence-only regulatory models. By framing the problem around frozen pretrained sequence and Hi-C encoders with a lightweight trained decoder, it offers a parameter-efficient template for multimodal regulatory modeling. As a November 2025 preprint, its downstream adoption is still emerging, and its reliance on Hi-C inputs narrows its applicability relative to sequence-only predictors. At the time of writing, no public code repository or model weights had been released by the authors.
Hang, S., et al. (2025) Puget predicts gene expression across cell types using sequence and 3D chromatin organization data. bioRxiv.
DOI: 10.1101/2025.11.19.689320