A 7B-parameter genomic foundation model using the StripedHyena architecture to model prokaryotic DNA, RNA, and proteins at single-nucleotide resolution with a 131k-token context window.
Evo is a 7-billion-parameter genomic foundation model developed by the Arc Institute in collaboration with Hazy Research and Together AI, described in a Science paper published in November 2024. The model was built to address a fundamental limitation of existing biological language models: their inability to process DNA at the raw nucleotide level across the full range of scales at which genomic information operates, from individual codons to entire bacterial chromosomes. Evo operates at single-nucleotide, byte-level resolution, processing up to 131,072 nucleotides in a single context window, which lets it model the long-range dependencies that govern genome organization and function.
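To make single-nucleotide, byte-level resolution concrete, the sketch below shows the kind of tokenizer this implies: each base maps to one token, so a 131,072-token window spans 131,072 bp. This is an illustrative simplification, not Evo's actual tokenizer code; the `tokenize`/`detokenize` helpers are hypothetical.

```python
# Minimal sketch of byte-level DNA tokenization: one base = one token,
# no k-mers, no BPE merges. Illustrative only; Evo's released tokenizer
# handles vocabulary details (special tokens, padding) differently.
def tokenize(seq: str) -> list[int]:
    return [ord(base) for base in seq.upper()]  # ASCII byte value per base

def detokenize(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

ids = tokenize("ATGCGTAA")
assert detokenize(ids) == "ATGCGTAA"
print(ids)  # [65, 84, 71, 67, 71, 84, 65, 65]
```

Because there is no k-mer vocabulary, every point mutation changes exactly one token, which is what makes single-nucleotide effect scoring possible downstream.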
The central insight motivating Evo is that the three molecular layers of the central dogma — DNA, RNA, and protein — are not independent. Gene regulatory elements, non-coding RNAs, and protein-coding sequences are physically and evolutionarily intertwined within whole genomes. By training on 2.7 million complete prokaryotic and phage genomes (300 billion nucleotide tokens, compiled into a dataset called OpenGenome), Evo learns the statistical relationships across all three modalities simultaneously, without requiring separate tokenization schemes or modality-specific fine-tuning. This enables genuine multimodal zero-shot inference from a single genomic model.
Evo was trained using a next-token prediction objective on raw DNA sequences and evaluated across an unusually diverse set of tasks, including bacterial protein mutation-effect prediction, non-coding RNA fitness, promoter activity, gene essentiality, and generative design of multi-element CRISPR systems and transposable elements. The accompanying work provided the first experimental validation of AI-generated protein-RNA and protein-DNA co-design, a qualitative advance over prior generative biology tools.
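The objective itself is the standard causal language-modeling loss applied to nucleotides: predict token i from tokens 1 through i-1 and average the cross-entropy. A hedged sketch follows; the toy model and the `next_token_loss` helper are illustrative stand-ins, not Evo's training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    # Shift inputs/targets by one: the model predicts each next nucleotide.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# Toy stand-in model over a 4-letter nucleotide vocabulary {A, C, G, T}.
toy = torch.nn.Sequential(torch.nn.Embedding(4, 16), torch.nn.Linear(16, 4))
batch = torch.randint(0, 4, (2, 128))  # two sequences of 128 nt
print(next_token_loss(toy, batch).item())  # roughly log(4) ≈ 1.39 at init
```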
Evo is built on the StripedHyena architecture, which alternates Hyena operators (input-dependent long convolution filters that efficiently capture local and global sequence context) with sparse multi-head attention layers. The stack comprises 32 blocks at a model width of 4,096 dimensions. Hyena layers apply compositions of short and long convolution filters in a data-controlled manner, making them effective at aggregating nucleotide sequences into higher-order motifs while filtering noise. The three attention layers (3 of the 32 blocks, roughly 10%) retain the capacity for precise long-range token interactions that benefit certain genomic signals.
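For intuition about the Hyena operator, here is a toy sketch of a data-controlled long convolution: a filter as long as the sequence, applied via FFT and gated elementwise by an input-dependent projection. Real Hyena operators use implicitly parameterized filters, multiple orders, and short-convolution mixing; every name below is hypothetical and none of it comes from the Evo codebase.

```python
import torch

def hyena_like(x, filt, gate_proj):
    # x: (batch, seq_len, dim); filt: (seq_len, dim) long-convolution filter.
    L = x.shape[1]
    X = torch.fft.rfft(x, n=2 * L, dim=1)      # zero-pad so the conv is causal
    H = torch.fft.rfft(filt, n=2 * L, dim=0)
    y = torch.fft.irfft(X * H, n=2 * L, dim=1)[:, :L]  # linear convolution
    return y * torch.sigmoid(gate_proj(x))     # data-controlled gating

B, L, D = 2, 1024, 64
x = torch.randn(B, L, D)
y = hyena_like(x, torch.randn(L, D) / L, torch.nn.Linear(D, D))
print(y.shape)  # torch.Size([2, 1024, 64])
```

The FFT route is what makes the operator subquadratic in sequence length, O(L log L) versus O(L^2) for dense attention, which is why only 3 of the 32 blocks need full attention.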
Training used the OpenGenome dataset: 2.7 million raw prokaryotic and phage genome sequences totaling 300 billion nucleotide tokens. Eukaryotic and human sequences were intentionally excluded for biosafety reasons. A two-stage training approach was used, first pretraining at shorter context lengths and then extending to 131k tokens, a strategy borrowed from large language model context extension methods. The model was trained in collaboration with Together AI using distributed GPU infrastructure.
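A hedged sketch of that two-stage schedule is below. The stage-1 length of 8,192 matches the separately released 8k-context checkpoint; everything else here is a placeholder, not the published training configuration. Context extension for the attention layers typically also requires adapting the positional encoding, as in LLM rotary-scaling methods.

```python
# Illustrative two-stage context-extension schedule (values are assumptions,
# not the paper's actual hyperparameters).
TRAINING_STAGES = [
    {"name": "pretrain",       "context_len": 8_192},
    {"name": "context-extend", "context_len": 131_072},
]
for stage in TRAINING_STAGES:
    print(f"{stage['name']}: sequences packed to {stage['context_len']:,} tokens")
```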
Reported benchmarks include gene essentiality prediction at 0.90 AUROC on lambda phage data, a mean promoter-activity correlation of 0.43 across independent studies, and zero-shot mutation-effect predictions competitive with ESM-1v and other protein-specific models on bacterial deep mutational scanning datasets.
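The zero-shot protocol behind numbers like these is, at its core, likelihood comparison: score mutant and wild-type sequences under the model and take the difference. A hedged sketch follows, with `model` and `tokenize` as stand-ins for any causal genomic LM; it mirrors the general approach rather than the paper's exact evaluation code.

```python
import torch

@torch.no_grad()
def log_likelihood(model, tokenize, seq: str) -> float:
    # Mean per-nucleotide log-probability of `seq` under a causal LM.
    ids = torch.tensor([tokenize(seq)])
    logp = torch.log_softmax(model(ids[:, :-1]), dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:, None]).squeeze(-1)
    return token_logp.mean().item()

def mutation_effect(model, tokenize, wild_type: str, mutant: str) -> float:
    # > 0: the mutant looks more "natural" to the model; < 0: disfavored.
    return (log_likelihood(model, tokenize, mutant)
            - log_likelihood(model, tokenize, wild_type))
```

Length normalization (the mean rather than the sum) keeps scores comparable when insertions or deletions change sequence length.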
Evo is suited for research at the intersection of genome biology and generative AI. Microbiologists can use the model's zero-shot scoring to prioritize variants in bacterial genetic screens or to predict the fitness consequences of mutations in non-model organisms lacking deep mutational scanning data. Synthetic biologists can use Evo's generative capabilities to design novel CRISPR systems, regulatory circuits, or transposable element-derived delivery vehicles. Genomics researchers can leverage the model's learned representations for tasks such as promoter activity prediction, essential gene identification, or genomic element annotation without labeled training data. Because Evo operates at the whole-genome scale, it is particularly valuable for studying genomic context effects — regulatory interactions, gene synteny, and co-evolutionary constraints — that single-gene or single-molecule models cannot capture.
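For the zero-shot and generative uses above, a minimal loading sketch is shown below. The repository id, the AutoTokenizer support, and the standard `.logits` output convention are assumptions based on the Together AI release; consult the model card for the exact incantation (StripedHyena ships as remote code, hence `trust_remote_code=True`).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed HuggingFace repo id for the 131k-context base model; verify on the hub.
model_id = "togethercomputer/evo-1-131k-base"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

ids = tok("ATGGCGTTAACC", return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(ids)
print(out.logits.shape)  # (1, sequence length in nucleotides, vocab size)
```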
Evo established that long-context, byte-level genomic models using non-transformer architectures can match or exceed domain-specific models trained exclusively on proteins or RNA. Its experimental validation of generative CRISPR and transposon design marked the first time a single language model was used to co-design interacting DNA and protein components, a result that attracted broad attention across the synthetic biology and genomics communities. The model weights are publicly released on HuggingFace via Together AI, lowering the barrier for academic groups to apply large-scale genomic modeling to their research. A limitation of Evo (v1) is its restriction to prokaryotic and phage genomes; eukaryotic biology, including human genetics and gene regulation in higher organisms, falls outside its training distribution. This scope was directly addressed in the subsequent Evo 2 model, which extended training to genomes from all domains of life.
Nguyen, E., et al. (2024). Sequence modeling and design from molecular to genome scale with Evo. Science.
DOI: 10.1126/science.ado9336