Evo2HiC

A multimodal foundation model that distills Evo 2 (7B) into a compact encoder guided by Hi-C data to predict cell-type-specific 3D genome architecture and epigenomic signals.

Released: November 2025

Evo2HiC is a multimodal foundation model for the integrative analysis of genome sequence and three-dimensional chromatin architecture, developed by Tangqi Fang, Sheng Wang, William Stafford Noble, and colleagues at the University of Washington and released as a bioRxiv preprint in November 2025. It addresses a central bottleneck in 3D genomics: the largest DNA foundation models, such as Evo 2 (7B parameters), capture rich sequence features but are too computationally expensive to apply routinely to chromatin-structure tasks, while specialized Hi-C predictors lack the broad genomic priors that large sequence models provide.

The model's core idea is knowledge distillation guided by structure. Evo2HiC distills the 7-billion-parameter Evo 2 model into a compact encoder, using Hi-C chromatin contact data to steer the distillation so that the features most relevant to 3D genome organization are preserved. The result is a lightweight encoder that retains the predictive power of a much larger sequence model while being efficient enough to run cell-type-specific prediction at scale.
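The idea of a distillation objective steered by Hi-C data can be illustrated with a toy loss. The sketch below is not the authors' objective, which this page does not specify; it assumes a simple weighted sum of a feature-matching term (student versus teacher embeddings) and a contact-prediction term (predicted versus observed Hi-C). The function name `structure_guided_loss`, the projection matrix `proj`, and the weight `alpha` are all hypothetical.

```python
import numpy as np

def structure_guided_loss(student_feats, teacher_feats, proj,
                          pred_contacts, hic_contacts, alpha=0.5):
    """Toy distillation loss: feature matching plus Hi-C supervision.

    student_feats: (n_bins, d_student) embeddings from the compact encoder.
    teacher_feats: (n_bins, d_teacher) embeddings from the large model.
    proj:          (d_student, d_teacher) hypothetical linear projection.
    pred_contacts, hic_contacts: (n_bins, n_bins) contact matrices.
    alpha:         hypothetical weight between the two terms.
    """
    # Map student features into the teacher's space before comparing them.
    projected = student_feats @ proj
    feature_term = np.mean((projected - teacher_feats) ** 2)
    # Penalize disagreement with the observed contact map.
    contact_term = np.mean((pred_contacts - hic_contacts) ** 2)
    return alpha * feature_term + (1.0 - alpha) * contact_term

# Toy usage with random arrays standing in for real embeddings and Hi-C data.
rng = np.random.default_rng(0)
n_bins, d_student, d_teacher = 8, 4, 16
student = rng.normal(size=(n_bins, d_student))
teacher = rng.normal(size=(n_bins, d_teacher))
proj = rng.normal(size=(d_student, d_teacher))
pred = student @ student.T            # toy contact prediction
hic = rng.random((n_bins, n_bins))
hic = (hic + hic.T) / 2               # contact maps are symmetric
loss = structure_guided_loss(student, teacher, proj, pred, hic)
```

Setting `alpha` closer to 0 makes the Hi-C term dominate, which is one simple way to bias a student toward structure-relevant features.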

By jointly representing one-dimensional sequence and two-dimensional contact information, Evo2HiC sits at the intersection of genomic language modeling and chromatin biology. It predicts both Hi-C contact maps and epigenomic profiles, and it generalizes across species in a zero-shot setting, positioning it as a general-purpose tool for studying how genome sequence encodes nuclear architecture.

Key Features

Structure-guided distillation: Compresses the 7B-parameter Evo 2 model into a compact encoder, using Hi-C contact data to guide which sequence features are retained for 3D genome analysis, which lowers computational cost relative to running the full model.
Multimodal sequence-and-structure modeling: A 1D component predicts epigenomic profiles from DNA sequence while a 2D joint sequence-structure component predicts Hi-C contact matrices and performs contact-map resolution enhancement.
Cell-type-specific prediction: Resolves cell-type-specific 3D genome architecture and identifies sequence patterns that drive differences in chromatin organization between cell types.
Cross-species zero-shot generalization: Generalizes to chromatin architecture and epigenomic prediction across 177 species without species-specific retraining.
Hi-C retrieval module: A SigLIP-based embedding module supports retrieval over Hi-C data, linking sequence representations to matching structural contexts.
Open, reproducible release: Apache-2.0 inference code is available on GitHub and pretrained checkpoints are archived on Zenodo, so users can run prediction by loading checkpoints without any retraining.
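For readers unfamiliar with SigLIP, the retrieval feature above rests on a pairwise sigmoid loss over paired embeddings. The sketch below shows that loss and a nearest-neighbour lookup in NumPy. It is an illustration of the general SigLIP formulation, not Evo2HiC's code: the function names, the pairing of sequence embeddings with Hi-C embeddings, and the `temperature` and `bias` values are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sigmoid_pairwise_loss(seq_emb, hic_emb, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: every (i, j) pair is an independent binary problem.

    Row i of seq_emb and row i of hic_emb are assumed to be a matching pair.
    """
    seq_emb = l2_normalize(seq_emb)
    hic_emb = l2_normalize(hic_emb)
    logits = temperature * seq_emb @ hic_emb.T + bias
    labels = 2.0 * np.eye(len(seq_emb)) - 1.0   # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(z)) equals log(1 + exp(-z)); logaddexp keeps it stable.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / len(seq_emb)

def retrieve(query_emb, hic_emb, k=1):
    """Return indices of the k most similar Hi-C embeddings for each query."""
    sims = l2_normalize(query_emb) @ l2_normalize(hic_emb).T
    return np.argsort(-sims, axis=1)[:, :k]

# Toy usage: random vectors stand in for learned embeddings.
rng = np.random.default_rng(0)
hic_bank = rng.normal(size=(5, 8))
queries = hic_bank + 0.01 * rng.normal(size=(5, 8))   # near-duplicates of the bank
top1 = retrieve(queries, hic_bank, k=1)
```

Unlike a softmax contrastive loss, the sigmoid form needs no normalization across the batch, which is the main design point of SigLIP.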

Technical Details

Evo2HiC builds on Evo 2, a genomic foundation model based on the StripedHyena 2 architecture, a convolutional multi-hybrid design that interleaves Hyena convolution operators with attention layers. Rather than running the full 7B model, Evo2HiC distills it into a compact encoder whose distillation objective is shaped by Hi-C contact data, preserving the long-range sequence dependencies that matter for chromatin folding. The architecture has three parts: a 1D DNA-sequence encoder for epigenomic profile prediction, a 2D joint sequence-structure encoder for Hi-C contact-matrix prediction and resolution enhancement, and a SigLIP-based retrieval module for Hi-C embeddings. On Hi-C prediction the model reports a 10.9% improvement in Spearman correlation over Orca, a leading sequence-to-Hi-C baseline, alongside state-of-the-art results across multiple chromatin-analysis tasks. Its cross-species evaluation spans 177 species, demonstrating that the distilled representations transfer beyond the training organisms.
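Because the headline benchmark is a Spearman correlation between predicted and observed contact maps, a small reference implementation may help when reproducing such comparisons. The sketch below assumes the correlation is taken over the upper triangle of a single map with the diagonal excluded; the preprint's exact protocol (for example, stratifying by genomic distance) is not described on this page, and the function names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    start = 0
    for end in range(1, len(values) + 1):
        # Close a run of equal values and assign its mean rank.
        if end == len(values) or sorted_vals[end] != sorted_vals[start]:
            ranks[order[start:end]] = 0.5 * (start + end - 1) + 1.0
            start = end
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

def contact_map_spearman(pred, obs):
    """Compare two square contact maps over the strict upper triangle."""
    iu = np.triu_indices_from(pred, k=1)
    return spearman(pred[iu], obs[iu])

# Toy usage: a noisy copy of a random symmetric map.
rng = np.random.default_rng(0)
obs = rng.random((10, 10))
obs = (obs + obs.T) / 2
pred = obs + 0.05 * rng.normal(size=obs.shape)
rho = contact_map_spearman(pred, obs)
```

Spearman correlation depends only on rank order, so any monotone rescaling of the predicted contacts (such as a log transform) leaves the score unchanged.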

Applications

Evo2HiC is aimed at researchers studying genome organization and gene regulation. Functional and regulatory genomicists can predict cell-type-specific Hi-C contact maps and epigenomic signals directly from sequence, including for cell types or species where experimental Hi-C is unavailable or low-resolution. Its resolution-enhancement capability lets groups upsample sparse contact maps, and its cross-species generalization supports comparative genomics across the 177 species evaluated. Because it identifies cell-type-specific sequence patterns, it can also help interpret how non-coding variation reshapes chromatin architecture, complementing variant-effect workflows. The compact encoder makes these analyses tractable on modest hardware compared with running the full Evo 2 model.
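To experiment with resolution enhancement, one needs a sparse input map paired with a dense target. A common approach in the Hi-C enhancement literature is to thin a high-coverage count matrix by binomial sampling; the sketch below assumes that approach, and it is not confirmed to be the procedure used for Evo2HiC. The function name and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def downsample_contacts(counts, keep_fraction, seed=0):
    """Simulate a low-coverage Hi-C map by keeping each read with a fixed probability.

    counts:        square, symmetric matrix of non-negative integer contact counts.
    keep_fraction: probability that any single contact read is retained.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    # Sample the upper triangle only, so each symmetric pair is thinned once.
    upper = np.triu(counts)
    thinned = rng.binomial(upper, keep_fraction)
    # Mirror the strict upper triangle to restore symmetry.
    return thinned + np.triu(thinned, k=1).T

# Toy usage: build a symmetric count matrix, then keep roughly 1 in 16 reads.
rng = np.random.default_rng(1)
raw = np.triu(rng.poisson(20, size=(6, 6)))
dense = raw + np.triu(raw, k=1).T
sparse = downsample_contacts(dense, keep_fraction=1 / 16)
```

An enhancement model would then be evaluated on how well it recovers `dense` from `sparse`.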

Impact

Evo2HiC demonstrates a practical recipe for transferring the capabilities of very large DNA foundation models into efficient, task-specialized tools: rather than scaling up, it distills down while using an orthogonal data modality (Hi-C) to retain the most relevant features. The reported 10.9% Spearman-correlation gain over Orca and broad cross-species generalization suggest that structure-guided distillation can outperform both standalone Hi-C predictors and naive use of large sequence models. As a recent preprint, its benchmarks await peer review and independent replication, and the distillation approach is tied to the availability and quality of Hi-C training data. Nonetheless, by pairing an open Apache-2.0 codebase with archived checkpoints, Evo2HiC offers the chromatin-biology community an accessible foundation model that bridges genomic language modeling and 3D genome analysis.

Citation

Evo2HiC: a multimodal foundation model for integrative analysis of genome sequence and architecture

Preprint

Fang, T., et al. (2025) Evo2HiC: a multimodal foundation model for integrative analysis of genome sequence and architecture. bioRxiv.

DOI: 10.1101/2025.11.18.689171

Recent citations

Papers that recently cited this model.

ContextTAD: Context-aware boundary learning for TAD calling from Hi-C contact maps
Weicai Long, Yusen Hou, Yanlin Zhang
bioRxiv · May 2026 · Citations: 0

Pitfalls and missing links in current understanding of 4D genomes
Michael Q. Zhang
Quantitative Biology · Mar 2026 · Citations: 1

Citations

Total Citations: 2
Influential: 0
References: 45

GitHub

Stars: 9
Forks: 3
Open Issues: 1
Contributors: 1
Last Push: 6 months ago
Language: Python
License: Apache-2.0

Fields of citing research

Share of papers citing this model, by field:

Biology: 100%
Computer Science: 50%
Environmental Science: 50%

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Overall score: 57 (Partial)
Usability (can I run it?): 83
Reproducibility (can I retrain it?): 41

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository, bioRxiv Preprint, Dataset
