DOE Joint Genome Institute / Northwestern University / Johns Hopkins University / University of California, Merced / University of California, Berkeley / Miami University / Illumina
A 4-billion-parameter generative genome foundation model trained on assembled global metagenomes for microbial sequence representation and de novo DNA generation.
GenomeOcean is a generative genome foundation model developed at the U.S. Department of Energy Joint Genome Institute (Lawrence Berkeley National Laboratory) with collaborators at Northwestern University and other institutions. First released as a bioRxiv preprint in early 2025, it targets a long-standing bias in genomic language models: most are trained predominantly on curated reference genomes, leaving the vast "rare biosphere" of low-abundance, uncultured microbes underrepresented. To address this, GenomeOcean is trained directly on large-scale co-assemblies of environmental metagenomes rather than on isolate genomes.
The model is built on the hypothesis that functional genomic sequences occupy an infinitesimally small fraction of all possible DNA, implying that evolution explores a low-dimensional "genomic manifold" shaped by universal biochemical and evolutionary constraints. By learning directly from the global microbiome, GenomeOcean aims to capture that manifold and to both represent existing sequences and generate biologically plausible new ones.
A notable finding from the work is convergence with independently trained models: comparisons with Evo 2 report a strong linear correspondence between the two models' embedding spaces and convergent generative behavior, which the authors interpret as evidence that the genomic manifold is a robust biological property rather than an artifact of any single model.
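One common way to quantify a "linear correspondence" between two embedding spaces is to fit a least-squares linear map from one to the other and report how much variance it explains. The sketch below illustrates that idea only; it is not the authors' analysis. The function name `linear_correspondence_r2` is hypothetical, and the arrays are synthetic stand-ins, not GenomeOcean or Evo 2 embeddings.

```python
import numpy as np


def linear_correspondence_r2(emb_a, emb_b):
    """Fit emb_b ~ [emb_a, 1] @ W by least squares; return the pooled R^2.

    emb_a: (n_sequences, dim_a) embeddings from one model.
    emb_b: (n_sequences, dim_b) embeddings of the same sequences from another model.
    """
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    # Append a constant column so the fitted map is affine (includes a bias).
    a1 = np.hstack([a, np.ones((a.shape[0], 1))])
    w, *_ = np.linalg.lstsq(a1, b, rcond=None)
    residual = b - a1 @ w
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((b - b.mean(axis=0)) ** 2))
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins (assumption: purely illustrative data).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 8))
true_map = rng.normal(size=(8, 5))
emb_b_linear = emb_a @ true_map + 0.01 * rng.normal(size=(200, 5))
emb_b_unrelated = rng.normal(size=(200, 5))

r2_linear = linear_correspondence_r2(emb_a, emb_b_linear)        # close to 1
r2_unrelated = linear_correspondence_r2(emb_a, emb_b_unrelated)  # close to 0
```

This R^2 is computed in-sample. With real, high-dimensional embeddings, the map would normally be fitted on one set of sequences and scored on a held-out set, so that a flexible fit is not mistaken for genuine correspondence.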
GenomeOcean uses a decoder-only transformer trained with a causal language-modeling objective, incorporating FlashAttention-2, grouped-query attention, and rotary positional embeddings. The flagship model has 4 billion parameters; context length was extended from 1,024 to 10,240 tokens (approximately 50 kb of DNA after BPE tokenization). Training data came from six large co-assembled metagenome collections drawn from diverse global habitats.
On reported benchmarks, GenomeOcean achieved an adjusted Rand index of 0.92 for species clustering (versus 0.52 for Evo and 0.81 for tetranucleotide-frequency baselines) and a 99.03% F1 score for detecting artificial sequences (versus 85.12% for DNABERT-2). Embeddings are highly compressible, with most variance captured by a few dozen principal components, consistent with the low-dimensional manifold hypothesis.
Openness is mixed: model weights for the family are distributed on HuggingFace (DOEJGI organization) under a permissive BSD license, but only inference code (not training code) is released on GitHub, and the underlying bioRxiv preprint is licensed CC-BY-NC 4.0 (non-commercial). The HuggingFace model card is brief, and no standalone dataset card is published.
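The species-clustering figures above are adjusted Rand index (ARI) scores. ARI compares predicted cluster assignments with true labels, corrects for chance agreement, and equals 1.0 for a perfect match up to relabeling. As a minimal sketch of how the metric is computed from pair counts (standard library only; the toy labels are invented for illustration and are not benchmark data):

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Count item pairs that fall together in each cell, row, and column
    # of the contingency table.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example: six contigs from two hypothetical species.
species = ["A", "A", "A", "B", "B", "B"]
perfect = adjusted_rand_index(species, [1, 1, 1, 0, 0, 0])    # relabeled but identical
one_error = adjusted_rand_index(species, [0, 0, 1, 1, 1, 1])  # one contig misplaced
```

Because the score depends only on which items are grouped together, cluster IDs do not need to match species names. In practice a library implementation such as scikit-learn's `adjusted_rand_score` would usually be preferred.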
GenomeOcean is aimed at microbial and environmental genomics, where it supports sequence embedding for taxonomic clustering and metagenome analysis, generation of evolutionarily constrained protein-coding sequences, and identification or de novo design of biosynthetic gene clusters relevant to natural-product and enzyme discovery. Its emphasis on uncultured, low-abundance organisms makes it particularly useful for researchers studying the rare biosphere, while its fast generation supports large-scale in silico exploration of sequence space.
By training directly on the global microbiome rather than reference genomes, GenomeOcean broadens the scope of genome foundation models toward the underrepresented majority of microbial diversity. Its reported linear correspondence with the independently developed Evo 2 model is a striking cross-model result, suggesting that distinct architectures trained on different data converge on a shared, low-dimensional representation of functional genomic sequence. Its openness is partial: the model weights are released under a permissive BSD license, but only inference code is published (there is no training code), and the accompanying preprint carries a non-commercial CC-BY-NC 4.0 license. Within those constraints, GenomeOcean contributes both a practical tool for metagenomic analysis and design and a conceptual framework for understanding the structure of genomic sequence space.