bio.rodeo
DNA & Gene

Evo 2

Arc Institute

Genomic foundation model trained on 9.3 trillion DNA base pairs spanning all domains of life, with 40B parameters and a 1-million-token context window.

Released: 2025
Parameters: 40,000,000,000

Overview

Evo 2 is a biological foundation model developed by the Arc Institute and Stanford University that scales genomic sequence modeling to unprecedented size and scope. Trained on 9.3 trillion DNA base pairs drawn from the OpenGenome2 dataset, a curated atlas spanning bacteria, archaea, and eukaryotes, Evo 2 is offered at two scales: 7-billion- and 40-billion-parameter variants. Both versions operate at single-nucleotide resolution with a 1-million-token context window, allowing the model to reason over entire chromosomal regions in a single forward pass.
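As a concrete illustration of what single-nucleotide resolution means for input handling, the sketch below maps a DNA string to one token per base and checks it against the 1-million-token context limit. The vocabulary and function names are illustrative assumptions, not Evo 2's actual tokenizer API.

```python
# Illustrative sketch: single-nucleotide tokenization and context-window check.
# The vocabulary and helpers here are hypothetical, not Evo 2's real interface;
# they only show what "1 base = 1 token" implies for a 1M-token context.

MAX_CONTEXT = 1_000_000  # Evo 2's reported context window, in tokens
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # hypothetical token ids

def tokenize(dna: str) -> list[int]:
    """Map each nucleotide to exactly one token id."""
    return [VOCAB[base] for base in dna.upper()]

def fits_in_context(dna: str) -> bool:
    """A sequence fits iff its length in bases is within the context window."""
    return len(dna) <= MAX_CONTEXT
```

Because one base is one token, the context window translates directly into a one-megabase genomic region per forward pass.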

The original Evo model (2024) demonstrated that a single sequence model could capture biology across molecular to genome scales. Evo 2 extends this vision dramatically: more parameters, far more training data from a broader phylogenetic range, and new capabilities in zero-shot variant effect prediction and controllable genome-scale generation. The model is trained entirely on raw DNA sequence without labels or task-specific supervision, yet it spontaneously learns a rich set of biological features detectable by mechanistic interpretability analyses — including exon-intron boundaries, transcription factor binding sites, protein secondary structure elements, and prophage integration sites.

Evo 2 is fully open: model weights, training code, inference code, and the OpenGenome2 training dataset are all publicly released, making it one of the largest and most transparent genomic foundation models available to the research community.

Key Features

  • Massive scale across all domains of life: Trained on 9.3 trillion base pairs from bacteria, archaea, and eukaryotes via the curated OpenGenome2 dataset, giving Evo 2 a breadth of genomic context that no prior model has matched.
  • 1-million-token single-nucleotide context window: The model processes up to one million nucleotides at full resolution in a single pass, enabling reasoning over large genomic regions including regulatory landscapes and multi-gene loci.
  • Zero-shot variant effect prediction: Without any task-specific fine-tuning, Evo 2 accurately predicts the functional consequences of genetic variants — from clinically significant BRCA1 missense mutations to noncoding pathogenic variants — outperforming many supervised approaches.
  • Genome-scale generative design: Evo 2 generates de novo mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater sequence naturalness and biological coherence than prior generative methods.
  • Inference-time scaling for epigenomics: Guided by inference-time search, Evo 2 enables controllable generation of sequences with specified epigenomic structures — the first demonstration of inference-time scaling applied to a biological sequence model.
  • Fully open release: Model weights (7B and 40B), training code, inference code, and the OpenGenome2 dataset are all publicly available, enabling broad community access and reproducibility.
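The zero-shot variant scoring idea above can be sketched as a log-likelihood difference between the alternate and reference sequences. The scoring function below is a stub standing in for the model; the toy probabilities and all names are illustrative assumptions, not Evo 2's published method or API.

```python
# Sketch of zero-shot variant effect scoring by log-likelihood difference.
# The model call is stubbed; in practice the sequence log-likelihood would
# come from a genomic language model such as Evo 2.
import math

def stub_log_likelihood(seq: str) -> float:
    """Stand-in for a model's sequence log-likelihood. The per-base
    probabilities are made up purely for illustration."""
    return sum(math.log(0.3 if base == "T" else 0.25) for base in seq)

def variant_delta_score(ref_window: str, pos: int, alt_base: str,
                        ll=stub_log_likelihood) -> float:
    """Score = logL(alt sequence) - logL(reference sequence).
    A more negative delta suggests the variant makes the sequence less
    'natural' under the model, a proxy for functional disruption."""
    alt_window = ref_window[:pos] + alt_base + ref_window[pos + 1:]
    return ll(alt_window) - ll(ref_window)
```

No labels or fine-tuning enter this computation; the only ingredient is the pretrained model's likelihood, which is what makes the prediction zero-shot.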

Technical Details

Evo 2 is built on the StripedHyena 2 architecture, an extension of the original Evo's hybrid design that interleaves input-dependent convolutional (hyena-style) operators with attention layers. This hybrid design allows efficient, sub-quadratic processing of very long sequences; a 1-million-token context window would be computationally prohibitive for a standard transformer, whose attention cost grows quadratically with sequence length. The interleaved attention layers retain the precise long-range token recall that convolutional or recurrent-only models can miss. The 40B-parameter model is among the largest sequence-level genomic models published to date.
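One way to picture the interleaved hybrid design is as a layer schedule that places a full attention layer at a fixed interval among sub-quadratic sequence-mixing layers. The ratio and labels below are illustrative assumptions, not the published StripedHyena 2 configuration.

```python
# Illustrative layer schedule for a hybrid long-context architecture.
# The attention interval and layer names are assumptions for illustration,
# not Evo 2's actual configuration.

def hybrid_schedule(n_layers: int, attn_every: int = 4) -> list[str]:
    """Use a sub-quadratic hyena-style operator in most layers and insert a
    full attention layer every `attn_every` layers for precise long-range
    token interactions."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "hyena"
        for i in range(n_layers)
    ]
```

Because most layers scale sub-quadratically in sequence length, the overall cost of a forward pass stays tractable at million-token contexts where an all-attention stack would not.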

Training used the OpenGenome2 dataset, a highly curated collection assembled from public genome databases and spanning the full tree of life: viruses, bacteria, archaea, fungi, plants, and animals. At 9.3 trillion base pairs, this corpus is roughly a 30-fold expansion over the approximately 300 billion tokens used to train the original Evo. Mechanistic interpretability analyses reveal that internal model representations spontaneously encode biologically meaningful features (transcription factor binding motifs, splice site patterns, protein structural elements) without explicit supervision. On benchmark tasks, including zero-shot prediction of the functional impact of BRCA1 clinical variants, Evo 2 achieves performance competitive with or superior to supervised models trained specifically for those tasks.

Applications

Evo 2 is broadly applicable to any task that benefits from a deep, sequence-level understanding of genomic DNA. Clinical genetics and functional genomics researchers can use its zero-shot variant scoring to prioritize variants of uncertain significance in genes such as BRCA1 without needing labeled training data. Synthetic biology teams can use the generative capabilities to design novel regulatory sequences, promoter elements, or entire microbial genomes with specified properties. The epigenomic structure generation workflow opens new directions for designing chromatin accessibility and nucleosome positioning patterns in eukaryotic contexts. Evolutionary biologists and comparative genomicists benefit from a model trained across the full phylogenetic range — allowing cross-species analyses in a unified embedding space.
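The variant-prioritization workflow described above can be sketched as a simple ranking: given a zero-shot score per variant, order variants most-disruptive first. The variant identifiers and scores below are hypothetical placeholders, not real model outputs.

```python
# Sketch: prioritizing variants of uncertain significance by zero-shot score.
# Each score stands in for a model log-likelihood delta (alt minus ref);
# more negative means a larger predicted disruption. All values are
# placeholders for illustration.

def prioritize(variant_scores: dict[str, float]) -> list[str]:
    """Return variant ids ordered most-disruptive first, i.e. by ascending
    delta log-likelihood."""
    return sorted(variant_scores, key=variant_scores.get)

scores = {
    "var_A": -4.2,  # hypothetical scores, not real predictions
    "var_B": -0.2,
    "var_C": -1.1,
}
ranked = prioritize(scores)  # ranked[0] is the top candidate for follow-up
```

Because no labeled training data is needed to produce the scores, this kind of triage can be run on genes or variant classes for which curated annotations are scarce.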

Impact

Evo 2 represents a significant advance in the scale and capability of genomic foundation models, and its fully open release distinguishes it from many contemporaneous large-scale biological models. The combination of a 40B parameter architecture, 9.3-trillion base-pair training corpus, and 1-million-token context window sets a new benchmark for what sequence-level genomic models can achieve. The demonstrated inference-time scaling result — where additional compute at inference improves the biological quality of generated sequences — introduces a paradigm borrowed from language model research into genomics for the first time. A notable current limitation is that Evo 2 models DNA sequence only; it does not natively integrate RNA-seq, chromatin accessibility, or protein structural data, meaning that multi-modal genomic analyses still require separate tools. Nonetheless, Evo 2's open release and broad capability profile position it as a foundational resource for the next generation of computational genomics research.

Citation

Genome modeling and design across all domains of life with Evo 2

Preprint

Brixi, G., et al. (2025). Genome modeling and design across all domains of life with Evo 2. bioRxiv.

DOI: 10.1101/2025.02.18.638918

Metrics

GitHub

Stars: 3.8K
Forks: 482
Open Issues: 51
Contributors: 9
Last Push: 1mo ago
Language: Jupyter Notebook
License: Apache-2.0

Citations

Total Citations: 227
Influential: 35
References: 0

Tags

variant effect prediction, foundation model, generative, DNA, genomics

Resources

  • GitHub Repository
  • Research Paper
  • Official Website
  • HuggingFace Model
  • Dataset