bio.rodeo

© 2026 Pulsatance. All rights reserved.
DNA & Gene · Spatial omics

Evo2HiC

University of Washington

A multimodal foundation model that distills Evo 2 (7B) into a compact encoder guided by Hi-C data to predict cell-type-specific 3D genome architecture and epigenomic signals.

Released: November 2025

Evo2HiC is a multimodal foundation model for the integrative analysis of genome sequence and three-dimensional chromatin architecture, developed by Tangqi Fang, Sheng Wang, William Stafford Noble, and colleagues at the University of Washington and released as a bioRxiv preprint in November 2025. It addresses a central bottleneck in 3D genomics: the largest DNA foundation models, such as Evo 2 (7B parameters), capture rich sequence features but are too computationally expensive to apply routinely to chromatin-structure tasks, while specialized Hi-C predictors lack the broad genomic priors that large sequence models provide.

The model's core idea is knowledge distillation guided by structure. Evo2HiC distills the 7-billion-parameter Evo 2 model into a compact encoder, using Hi-C chromatin contact data to steer the distillation so that the features most relevant to 3D genome organization are preserved. The result is a lightweight encoder intended to retain the task-relevant predictive power of the much larger sequence model while remaining efficient enough to run cell-type-specific prediction at scale.
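As a rough illustration of what a structure-guided distillation objective could look like, the sketch below combines a feature-matching term (student versus teacher features) with a contact-prediction term scored against a Hi-C matrix. Everything here is an assumption made for illustration: the function names, the dot-product contact head, the MSE terms, the weight `alpha`, and the premise that teacher features have already been projected to the student's dimension. The preprint's actual objective may differ.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def predicted_contacts(features: List[Vector]) -> Matrix:
    """Hypothetical contact head: dot product between per-bin feature vectors."""
    return [
        [sum(p * q for p, q in zip(fi, fj)) for fj in features]
        for fi in features
    ]


def distillation_loss(
    student: List[Vector],
    teacher: List[Vector],
    hic: Matrix,
    alpha: float = 0.5,
) -> float:
    """Weighted sum of a feature-matching term and a Hi-C contact term.

    student, teacher: one feature vector per genomic bin (same shapes, assumed).
    hic: observed contact matrix with one row and column per bin.
    alpha: hypothetical weight between the two terms.
    """
    if not (len(student) == len(teacher) == len(hic)):
        raise ValueError("student, teacher, and hic must cover the same bins")
    # Term 1: how closely the student reproduces the teacher's features.
    feature_term = sum(mse(s, t) for s, t in zip(student, teacher)) / len(student)
    # Term 2: how well contacts derived from student features match Hi-C.
    pred = predicted_contacts(student)
    contact_term = sum(mse(p, h) for p, h in zip(pred, hic)) / len(hic)
    return alpha * feature_term + (1.0 - alpha) * contact_term
```

The contact term is what makes the distillation "guided": features that do not help explain the Hi-C matrix contribute nothing to it, so the student has less incentive to keep them.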

By jointly representing one-dimensional sequence and two-dimensional contact information, Evo2HiC sits at the intersection of genomic language modeling and chromatin biology. It predicts both Hi-C contact maps and epigenomic profiles, and it generalizes across species in a zero-shot setting, positioning it as a general-purpose tool for studying how genome sequence encodes nuclear architecture.

Key Features

  • Structure-guided distillation: Compresses the 7B-parameter Evo 2 model into a compact encoder, using Hi-C contact data to guide which sequence features are retained for 3D genome analysis, which reduces computational cost relative to running the full model.
  • Multimodal sequence-and-structure modeling: A 1D component predicts epigenomic profiles from DNA sequence while a 2D joint sequence-structure component predicts Hi-C contact matrices and performs contact-map resolution enhancement.
  • Cell-type-specific prediction: Resolves cell-type-specific 3D genome architecture and identifies sequence patterns that drive differences in chromatin organization between cell types.
  • Cross-species zero-shot generalization: Generalizes to chromatin architecture and epigenomic prediction across 177 species without species-specific retraining.
  • Hi-C retrieval module: A SigLIP-based embedding module supports retrieval over Hi-C data, linking sequence representations to matching structural contexts.
  • Open, reproducible release: Apache-2.0 inference code is available on GitHub and pretrained checkpoints are archived on Zenodo, so users can run prediction by loading checkpoints without any retraining.
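To make the retrieval feature more concrete, here is a minimal sketch of the pairwise sigmoid loss introduced by SigLIP, applied to paired sequence and Hi-C embeddings, followed by a cosine-similarity lookup. The loss formula is the standard SigLIP one; how Evo2HiC actually forms its pairs, and the names `siglip_loss` and `retrieve`, are assumptions for illustration only.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(
    seq_emb: List[Vector],
    hic_emb: List[Vector],
    temperature: float = 10.0,
    bias: float = -10.0,
) -> float:
    """Pairwise sigmoid loss: matching pairs (i == j) are positives, all others negatives."""
    if len(seq_emb) != len(hic_emb) or not seq_emb:
        raise ValueError("need the same non-zero number of embeddings on both sides")
    seq = [l2_normalize(v) for v in seq_emb]
    hic = [l2_normalize(v) for v in hic_emb]
    total = 0.0
    for i, s in enumerate(seq):
        for j, h in enumerate(hic):
            similarity = sum(a * b for a, b in zip(s, h))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * similarity + bias))
    return -total / len(seq)


def retrieve(query: Vector, candidates: List[Vector]) -> int:
    """Return the index of the candidate with the highest cosine similarity."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because each pair is scored independently with a sigmoid rather than through a batch-wide softmax, the loss does not require normalizing over the whole batch, which is the main design point of SigLIP.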

Technical Details

Evo2HiC builds on Evo 2, a genomic foundation model based on the StripedHyena 2 architecture, a convolutional multi-hybrid design that combines input-dependent convolution (Hyena) operators with attention. Rather than running the full 7B model, Evo2HiC distills it into a compact encoder whose distillation objective is shaped by Hi-C contact data, preserving the long-range sequence dependencies that matter for chromatin folding. The architecture has three parts: a 1D DNA-sequence encoder for epigenomic profile prediction, a 2D joint sequence-structure encoder for Hi-C contact-matrix prediction and resolution enhancement, and a SigLIP-based retrieval module for Hi-C embeddings. On Hi-C prediction the model reports a 10.9% improvement in Spearman correlation over Orca, a leading sequence-to-Hi-C baseline, alongside state-of-the-art results across multiple chromatin-analysis tasks. Its cross-species evaluation spans 177 species, which the authors present as evidence that the distilled representations transfer beyond the training organisms.
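Since the headline metric is Spearman correlation between predicted and observed contacts, a small reference implementation may help when reading the reported numbers. The sketch below ranks both inputs (averaging ranks across ties) and computes Pearson correlation on the ranks. The `relative_gain` helper shows one way a percentage improvement could be derived; this page does not say whether the 10.9% figure is relative or absolute, so treat that reading as an assumption.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Percentage improvement of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score * 100.0
```

In practice the two sequences would be the flattened (often distance-stratified) entries of a predicted and an experimental contact map; because only ranks matter, the metric is insensitive to monotone rescaling of contact counts.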

Applications

Evo2HiC is aimed at researchers studying genome organization and gene regulation. Functional and regulatory genomicists can predict cell-type-specific Hi-C contact maps and epigenomic signals directly from sequence, including for cell types or species where experimental Hi-C is unavailable or low-resolution. Its resolution-enhancement capability lets groups upsample sparse contact maps, and its cross-species generalization supports comparative genomics across the 177 species evaluated. Because it identifies cell-type-specific sequence patterns, it can also help interpret how non-coding variation reshapes chromatin architecture, complementing variant-effect workflows. The compact encoder makes these analyses tractable on modest hardware compared with running the full Evo 2 model.
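Resolution in Hi-C refers to bin size, and it can help to see the easy direction of that relationship. The sketch below coarsens a contact matrix by summing counts over `factor` x `factor` blocks of bins, turning a fine-grained map into a lower-resolution one. Resolution enhancement is the inverse problem that a model has to learn. This is a generic illustration with a hypothetical function name, not Evo2HiC's preprocessing.

```python
from typing import List

Matrix = List[List[float]]


def coarsen_contact_map(matrix: Matrix, factor: int) -> Matrix:
    """Sum contact counts over factor x factor blocks of bins.

    matrix: square contact matrix at fine resolution.
    factor: how many fine bins are merged into one coarse bin per axis.
    """
    n = len(matrix)
    if n == 0 or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and non-empty")
    if factor < 1 or n % factor != 0:
        raise ValueError("factor must be positive and divide the matrix size")
    m = n // factor
    coarse = [[0.0] * m for _ in range(m)]
    for i in range(n):
        for j in range(n):
            # Each fine bin pair contributes its counts to one coarse bin pair.
            coarse[i // factor][j // factor] += matrix[i][j]
    return coarse
```

Coarsening preserves the total number of contacts and the symmetry of the map but discards where inside each block the contacts fell, which is exactly the information an enhancement model tries to restore.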

Impact

Evo2HiC demonstrates a practical recipe for transferring the capabilities of very large DNA foundation models into efficient, task-specialized tools: rather than scaling up, it distills down while using an orthogonal data modality (Hi-C) to retain the most relevant features. The reported 10.9% Spearman-correlation gain over Orca and broad cross-species generalization suggest that structure-guided distillation can outperform both standalone Hi-C predictors and naive use of large sequence models. As a recent preprint, its benchmarks await peer review and independent replication, and the distillation approach is tied to the availability and quality of Hi-C training data. Nonetheless, by pairing an open Apache-2.0 codebase with archived checkpoints, Evo2HiC offers the chromatin-biology community an accessible foundation model that bridges genomic language modeling and 3D genome analysis.

Citation

Evo2HiC: a multimodal foundation model for integrative analysis of genome sequence and architecture

Preprint

Fang, T., et al. (2025) Evo2HiC: a multimodal foundation model for integrative analysis of genome sequence and architecture. bioRxiv.

DOI: 10.1101/2025.11.18.689171

Recent citations

Papers that recently cited this model.

  • ContextTAD: Context-aware boundary learning for TAD calling from Hi-C contact maps
    Weicai Long, Yusen Hou, Yanlin Zhang
    bioRxiv · May 2026 · Citations: 0
  • Pitfalls and missing links in current understanding of 4D genomes
    Michael Q. Zhang
    Quantitative Biology · Mar 2026 · Citations: 1

Citations

Total Citations: 2
Influential: 0
References: 45

GitHub

Stars: 9
Forks: 3
Open Issues: 1
Contributors: 1
Last Push: 6mo ago
Language: Python
License: Apache-2.0

Fields of citing research

  • Biology: 100%
  • Computer Science: 50%
  • Environmental Science: 50%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 57 (Partial)
Usability (can I run it?): 83
Reproducibility (can I retrain it?): 41
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

  • GitHub Repository
  • bioRxiv Preprint
  • Dataset