Genomic foundation model trained on 9.3 trillion DNA base pairs spanning all domains of life, with 40B parameters and a 1-million-token context window.
Evo 2 is a biological foundation model developed by the Arc Institute and Stanford University that scales genomic sequence modeling to unprecedented size and scope. Trained on 9.3 trillion DNA base pairs drawn from the OpenGenome2 dataset, a curated atlas spanning bacteria, archaea, and eukaryotes, Evo 2 comes in two sizes: a 7-billion-parameter and a 40-billion-parameter variant. Both versions operate at single-nucleotide resolution with a 1-million-token context window, allowing the model to reason over entire chromosomal regions in a single forward pass.
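Single-nucleotide resolution means each base maps to exactly one token, so a 1-million-token context covers roughly a million base pairs directly. A minimal sketch of such a per-base vocabulary (the token IDs here are illustrative, not Evo 2's actual vocabulary):

```python
# Sketch of single-nucleotide tokenization: one token per base.
# Token IDs are illustrative, not Evo 2's actual vocabulary.
DNA_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = ambiguous base

def tokenize(seq: str) -> list[int]:
    """Map a DNA string to a list of per-nucleotide token IDs."""
    return [DNA_VOCAB[base] for base in seq.upper()]

tokens = tokenize("ACGTN")  # five bases -> five tokens
```

Because there is no subword merging, sequence length in tokens equals sequence length in bases, which is what makes the 1-million-token window directly interpretable as genomic span.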
The original Evo model (2024) demonstrated that a single sequence model could capture biology across molecular to genome scales. Evo 2 extends this vision dramatically: more parameters, far more training data from a broader phylogenetic range, and new capabilities in zero-shot variant effect prediction and controllable genome-scale generation. The model is trained entirely on raw DNA sequence without labels or task-specific supervision, yet it spontaneously learns a rich set of biological features detectable by mechanistic interpretability analyses — including exon-intron boundaries, transcription factor binding sites, protein secondary structure elements, and prophage integration sites.
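Zero-shot variant effect prediction of this kind is typically done by comparing sequence log-likelihoods under the model: a substitution that makes the surrounding sequence less probable is predicted to be more deleterious. A toy sketch of the scheme, where `toy_log_likelihood` is a hypothetical stand-in for the trained model's likelihood:

```python
import math

def toy_log_likelihood(seq: str) -> float:
    """Hypothetical stand-in for a model's sequence log-likelihood.
    Uses a fixed per-base probability table, purely for illustration."""
    logp = {"A": math.log(0.3), "C": math.log(0.3),
            "G": math.log(0.3), "T": math.log(0.1)}
    return sum(logp[b] for b in seq)

def variant_score(ref_seq: str, pos: int, alt_base: str) -> float:
    """Zero-shot variant effect score: log P(alt) - log P(ref).
    More negative means the substitution makes the sequence less
    likely, suggesting a more deleterious variant."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return toy_log_likelihood(alt_seq) - toy_log_likelihood(ref_seq)

# C -> T at position 2 lowers the likelihood under the toy table.
score = variant_score("AACGA", 2, "T")
```

The appeal of this recipe is that it needs no labeled variants: the score falls out of the same likelihood the model was trained to maximize.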
Evo 2 is fully open: model weights, training code, inference code, and the OpenGenome2 training dataset are all publicly released, making it one of the largest and most transparent genomic foundation models available to the research community.
Evo 2 is built on the StripedHyena 2 architecture, an extension of the original Evo's hybrid design that interleaves Hyena convolutional operators with attention layers. This hybrid allows efficient processing of very long sequences (a 1-million-token context window would be computationally prohibitive for a standard dense-attention transformer) while retaining attention's ability to capture long-range dependencies that convolution- or recurrence-only models can miss. The 40B-parameter model is among the largest sequence-level genomic models published to date.
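The interleaving idea can be sketched as a layer schedule: mostly long-convolution blocks, with an attention block inserted every few layers. The 1-in-4 ratio and block names below are illustrative assumptions, not Evo 2's published configuration:

```python
def hybrid_layer_schedule(n_layers: int, attn_every: int = 4) -> list[str]:
    """Build an interleaved hybrid schedule: Hyena-style operator blocks
    with an attention block every `attn_every` layers. The ratio is an
    illustrative assumption, not Evo 2's actual configuration."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "hyena"
        for i in range(n_layers)
    ]

schedule = hybrid_layer_schedule(8)
# Three hyena blocks, then attention, repeated.
```

The design intuition: the convolutional blocks scale near-linearly with sequence length and carry most of the depth, while the sparse attention layers provide precise token-to-token retrieval over the long context.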
Training used the OpenGenome2 dataset, a highly curated collection assembled from public genome databases and spanning the full tree of life: viruses, bacteria, archaea, fungi, plants, and animals. The 9.3 trillion base pairs in this corpus represent a roughly 10-fold expansion in training data scale over its predecessor. Mechanistic interpretability analyses confirm that the model's internal representations encode biologically meaningful features (transcription factor binding motifs, splice-site patterns, protein structural elements) without explicit supervision. On benchmark tasks, including zero-shot prediction of the functional impact of BRCA1 clinical variants, Evo 2 performs competitively with, and in some cases better than, supervised models trained specifically for those tasks.
Evo 2 is broadly applicable to any task that benefits from a deep, sequence-level understanding of genomic DNA. Clinical genetics and functional genomics researchers can use its zero-shot variant scoring to prioritize variants of uncertain significance in genes such as BRCA1 without needing labeled training data. Synthetic biology teams can use the generative capabilities to design novel regulatory sequences, promoter elements, or entire microbial genomes with specified properties. The epigenomic structure generation workflow opens new directions for designing chromatin accessibility and nucleosome positioning patterns in eukaryotic contexts. Evolutionary biologists and comparative genomicists benefit from a model trained across the full phylogenetic range — allowing cross-species analyses in a unified embedding space.
Evo 2 represents a significant advance in the scale and capability of genomic foundation models, and its fully open release distinguishes it from many contemporaneous large-scale biological models. The combination of a 40B parameter architecture, 9.3-trillion base-pair training corpus, and 1-million-token context window sets a new benchmark for what sequence-level genomic models can achieve. The demonstrated inference-time scaling result — where additional compute at inference improves the biological quality of generated sequences — introduces a paradigm borrowed from language model research into genomics for the first time. A notable current limitation is that Evo 2 models DNA sequence only; it does not natively integrate RNA-seq, chromatin accessibility, or protein structural data, meaning that multi-modal genomic analyses still require separate tools. Nonetheless, Evo 2's open release and broad capability profile position it as a foundational resource for the next generation of computational genomics research.
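The inference-time scaling recipe is essentially search: sample several candidate sequences from the generator and keep the one an external scorer rates highest, so additional compute can only improve the best result found. A generic best-of-N sketch, where both the sampler and the GC-content scorer are toy stand-ins for the model and for a trained property predictor:

```python
import random

def sample_sequence(rng: random.Random, length: int = 20) -> str:
    """Hypothetical stand-in for autoregressive sampling from the model."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def gc_content(seq: str) -> float:
    """Toy external scorer: fraction of G/C bases. A real pipeline would
    use a trained predictor of the desired biological property."""
    return sum(b in "GC" for b in seq) / len(seq)

def best_of_n(n: int, seed: int = 0) -> str:
    """Best-of-N search: more samples (more inference compute) can only
    raise the best score found, never lower it."""
    rng = random.Random(seed)
    candidates = [sample_sequence(rng) for _ in range(n)]
    return max(candidates, key=gc_content)

best = best_of_n(64)
```

This is the simplest instance of the paradigm; more elaborate variants guide generation step by step rather than only filtering finished samples.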
Brixi, G., et al. (2025). Genome modeling and design across all domains of life with Evo 2. bioRxiv. DOI: 10.1101/2025.02.18.638918