bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
DNA & Gene foundation models

GenomeOcean

DOE Joint Genome Institute / Northwestern University / Johns Hopkins University / University of California, Merced / University of California, Berkeley / Miami University / Illumina

A 4-billion-parameter generative genome foundation model trained on assembled global metagenomes for microbial sequence representation and de novo DNA generation.

Released: February 2025
Parameters: 4 billion

GenomeOcean is a generative genome foundation model developed at the U.S. Department of Energy Joint Genome Institute (Lawrence Berkeley National Laboratory) with collaborators at Northwestern University and other institutions. First released as a bioRxiv preprint in early 2025, it targets a long-standing bias in genomic language models: most are trained predominantly on curated reference genomes, leaving the vast "rare biosphere" of low-abundance, uncultured microbes underrepresented. To address this, GenomeOcean is trained directly on large-scale co-assemblies of environmental metagenomes rather than on isolate genomes.

The model is built on the hypothesis that functional genomic sequences occupy an infinitesimally small fraction of all possible DNA, implying that evolution explores a low-dimensional "genomic manifold" shaped by universal biochemical and evolutionary constraints. By learning directly from the global microbiome, GenomeOcean aims to capture that manifold and to both represent existing sequences and generate biologically plausible new ones.

A notable finding from the work is convergence with independently trained models: comparisons with Evo 2 report a strong linear correspondence between the two models' embedding spaces and convergent generative behavior, which the authors interpret as evidence that the genomic manifold is a robust biological property rather than an artifact of any single model.

#Key Features

  • Metagenome-scale training: Trained on roughly 645 Gbp of high-quality contigs derived from about 219 TB of raw metagenomic data spanning oceans, lakes, soils, the human microbiome, and polar habitats, emphasizing rare and uncultured microbes.
  • Generative by design: A decoder-style transformer that can synthesize novel DNA sequences, including protein-coding genes constrained by evolutionary principles, not just embed existing ones.
  • Efficient tokenization and inference: A byte-pair-encoding (BPE) tokenizer with a 4,096-token vocabulary compresses sequences roughly fivefold. Generation is reported to be more than an order of magnitude faster than in comparably sized genome models.
  • Biosynthetic gene cluster design: A fine-tuned variant (4B-bgcFM) can identify candidate biosynthetic gene clusters (BGCs) in genomes and perform zero-shot synthesis of complete, biochemically plausible BGCs.
  • Model family: Released in 100M, 500M, and 4B parameter sizes, plus specialized fine-tunes for BGC modeling and artificial-sequence detection.
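To make the tokenization point concrete, the following is a minimal sketch of byte-pair encoding applied to a DNA string. It is a toy illustration and not GenomeOcean's tokenizer: the function names (`train_bpe`, `apply_merge`, `compression_ratio`), the repetitive example sequence, and the merge budget of 16 are all assumptions chosen for the demo, whereas the real tokenizer uses a 4,096-token vocabulary.

```python
from collections import Counter


def apply_merge(tokens, a, b):
    """Replace each non-overlapping adjacent pair (a, b) with the merged token a+b."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(seq, num_merges):
    """Learn up to `num_merges` merges from one DNA string (toy, single-sequence BPE)."""
    tokens = list(seq)  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not help
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges, tokens


def compression_ratio(seq, tokens):
    """Base pairs per token: higher means stronger compression."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(seq) / len(tokens)


# Hypothetical, highly repetitive sequence; real contigs compress less.
demo_seq = "ATGGCGTACGTTAGC" * 20
demo_merges, demo_tokens = train_bpe(demo_seq, num_merges=16)
print(f"{len(demo_seq)} bp -> {len(demo_tokens)} tokens "
      f"({compression_ratio(demo_seq, demo_tokens):.1f} bp per token)")
```

On real metagenomic contigs the ratio depends on the vocabulary size and on sequence composition. In an autoregressive model, fewer tokens per base pair generally means fewer decoding steps per generated kilobase, which is one way tokenization can affect generation speed.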

#Technical Details

GenomeOcean uses a decoder-only transformer trained with a causal language-modeling objective, incorporating FlashAttention-2, grouped-query attention, and rotary positional embeddings. The flagship model has 4 billion parameters; context length was extended from 1,024 to 10,240 tokens (approximately 50 kb of DNA after BPE tokenization). Training data came from six large co-assembled metagenome collections drawn from diverse global habitats.

On reported benchmarks, GenomeOcean achieved an adjusted Rand index of 0.92 for species clustering (versus 0.52 for Evo and 0.81 for tetranucleotide-frequency baselines) and a 99.03% F1 score for detecting artificial sequences (versus 85.12% for DNABERT-2). Embeddings are highly compressible, with most variance captured by a few dozen principal components, consistent with the low-dimensional manifold hypothesis.

Openness is mixed: model weights for the family are distributed on HuggingFace (DOEJGI organization) under a permissive BSD license, but only inference code (not training code) is released on GitHub, and the underlying bioRxiv preprint is licensed CC-BY-NC 4.0 (non-commercial). The HuggingFace model card is brief, and no standalone dataset card is published.
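For readers unfamiliar with the adjusted Rand index (ARI) used in the species-clustering benchmark above, here is a minimal pure-Python sketch of the standard pair-counting formula. It is not the authors' evaluation pipeline: the function name `adjusted_rand_index` and the toy labels are assumptions for illustration only. An ARI of 1 means the predicted clusters match the true species labels exactly (up to renaming), and values near 0 mean agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: no pairs to compare

    # Contingency counts: items sharing (true label, predicted label).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical example: true species labels vs. cluster IDs from embeddings.
species = ["A", "A", "B", "C"]
clusters = [0, 0, 1, 1]
print(f"ARI = {adjusted_rand_index(species, clusters):.4f}")
```

Because ARI is corrected for chance, it can be compared across methods that produce different numbers of clusters, which makes it a reasonable metric for comparing embedding-based clustering against composition-based baselines.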

#Applications

GenomeOcean is aimed at microbial and environmental genomics, where it supports sequence embedding for taxonomic clustering and metagenome analysis, generation of evolutionarily constrained protein-coding sequences, and identification or de novo design of biosynthetic gene clusters relevant to natural-product and enzyme discovery. Its emphasis on uncultured, low-abundance organisms makes it particularly useful for researchers studying the rare biosphere, while its fast generation supports large-scale in silico exploration of sequence space.

#Impact

By training directly on the global microbiome rather than reference genomes, GenomeOcean broadens the scope of genome foundation models toward the underrepresented majority of microbial diversity. Its reported linear correspondence with the independently developed Evo 2 model is a striking cross-model result, suggesting that distinct architectures trained on different data converge on a shared, low-dimensional representation of functional genomic sequence. Its openness is partial: the model weights are released under a permissive BSD license, but only inference code is published (there is no training code), and the accompanying preprint carries a non-commercial CC-BY-NC 4.0 license. Within those constraints, GenomeOcean contributes both a practical tool for metagenomic analysis and design and a conceptual framework for understanding the structure of genomic sequence space.

Citation

Preprint

DOI: 10.1101/2025.01.30.635558

Openness

bio.rodeo openness: 49 (Partial), open weights, closed recipe
Usability (can I run it?): 87
Reproducibility (can I retrain it?): 14
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

de_novo_design, foundation_model, generative, metagenomics, microbiome, representation_learning, self_supervised, sequence_generation, transformer

Resources

GitHub Repository
bioRxiv Preprint
HuggingFace Model