GenomeOcean

DOE Joint Genome Institute / Northwestern University / Johns Hopkins University / University of California, Merced / University of California, Berkeley / Miami University / Illumina

4B-parameter generative genome foundation model trained on assembled environmental metagenomes for microbial representation and de novo DNA design.

Released: February 2025

Parameters: 4 Billion

GenomeOcean is a generative genome foundation model developed at the U.S. Department of Energy Joint Genome Institute (Lawrence Berkeley National Laboratory) with collaborators at Northwestern University and other institutions. First released as a bioRxiv preprint in early 2025, it targets a long-standing bias in genomic language models: most are trained predominantly on curated reference genomes, leaving the vast "rare biosphere" of low-abundance, uncultured microbes underrepresented. To address this, GenomeOcean is trained directly on large-scale co-assemblies of environmental metagenomes rather than on isolate genomes.

The model is built on the hypothesis that functional genomic sequences occupy an infinitesimally small fraction of all possible DNA, implying that evolution explores a low-dimensional "genomic manifold" shaped by universal biochemical and evolutionary constraints. By learning directly from the global microbiome, GenomeOcean aims to capture that manifold and to both represent existing sequences and generate biologically plausible new ones.

A notable finding from the work is convergence with independently trained models: comparisons with Evo 2 report a strong linear correspondence between the two models' embedding spaces and convergent generative behavior, which the authors interpret as evidence that the genomic manifold is a robust biological property rather than an artifact of any single model.

Key Features

Metagenome-scale training: Trained on roughly 645 Gbp of high-quality contigs derived from about 219 TB of raw metagenomic data spanning oceans, lakes, soils, the human microbiome, and polar habitats, emphasizing rare and uncultured microbes.
Generative by design: A decoder-style transformer that can synthesize novel DNA sequences, including protein-coding genes constrained by evolutionary principles, not just embed existing ones.
Efficient tokenization and inference: A byte-pair-encoding (BPE) tokenizer with a 4,096-token vocabulary compresses sequences roughly fivefold, so each token covers about five bases on average. Generation is reported to be more than an order of magnitude faster than that of comparably sized genome models.
Biosynthetic gene cluster design: A fine-tuned variant (4B-bgcFM) can identify candidate biosynthetic gene clusters (BGCs) in genomes and perform zero-shot synthesis of complete, biochemically plausible BGCs.
Model family: Released in 100M, 500M, and 4B parameter sizes, plus specialized fine-tunes for BGC modeling and artificial-sequence detection.
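
The compression effect of BPE tokenization described above can be illustrated with a toy example. The following is a minimal sketch in pure Python, not GenomeOcean's actual tokenizer: the functions train_bpe and merge_pair, the toy sequence, and the number of merges are all hypothetical, chosen only to show how repeated merging of frequent adjacent pairs shortens a DNA sequence. The final lines reuse two figures from this page (roughly fivefold compression and a 10,240-token context) to estimate how much DNA fits in one context window.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace every adjacent (left, right) pair with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequence, num_merges):
    """Learn up to num_merges BPE merges, starting from single nucleotides."""
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append((left, right))
        tokens = merge_pair(tokens, left, right)
    return merges, tokens


# Hypothetical toy sequence with a repeated motif (not real training data).
toy_sequence = "ATGGCGATGGCGTTAATGGCG" * 4
merges, tokens = train_bpe(toy_sequence, num_merges=8)
compression = len(toy_sequence) / len(tokens)
print(f"{len(toy_sequence)} bases -> {len(tokens)} tokens "
      f"(about {compression:.1f} bases per token)")

# Back-of-the-envelope check using figures quoted on this page:
# ~5 bases per token and a 10,240-token context.
context_tokens = 10_240
bases_per_token = 5
approx_context_bp = context_tokens * bases_per_token
print(f"Approximate context span: {approx_context_bp} bp")
```

A real tokenizer would learn its 4,096-entry vocabulary from a large corpus rather than a single short string, so the ratio obtained on this toy input says nothing about the ratio achieved on metagenomic data.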

Technical Details

GenomeOcean uses a decoder-only transformer trained with a causal language-modeling objective, incorporating FlashAttention-2, grouped-query attention, and rotary positional embeddings. The flagship model has 4 billion parameters; context length was extended from 1,024 to 10,240 tokens (approximately 50 kb of DNA after BPE tokenization). Training data came from six large co-assembled metagenome collections drawn from diverse global habitats.

On reported benchmarks, GenomeOcean achieved an adjusted Rand index of 0.92 for species clustering (versus 0.52 for Evo and 0.81 for tetranucleotide-frequency baselines) and a 99.03% F1 score for detecting artificial sequences (versus 85.12% for DNABERT-2). Embeddings are highly compressible, with most variance captured by a few dozen principal components, consistent with the low-dimensional manifold hypothesis.

Openness is mixed: model weights for the family are distributed on HuggingFace (DOEJGI organization) under a permissive BSD license, but only inference code (not training code) is released on GitHub, and the underlying bioRxiv preprint is licensed CC-BY-NC 4.0 (non-commercial). The HuggingFace model card is brief, and no standalone dataset card is published.
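
For readers unfamiliar with the adjusted Rand index (ARI) used in the species-clustering benchmark above, the following is a minimal pure-Python sketch of the standard pair-counting formula. It is not the authors' evaluation code; the function name adjusted_rand_index and the toy labels are hypothetical and serve only to show how the score compares a predicted clustering with true species labels. An ARI of 1 means the two partitions agree exactly, and values near 0 indicate agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical partitions

    # Pairs of items placed together in both, in truth only, in prediction only.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: six contigs from two species, clustered imperfectly.
true_species = ["A", "A", "A", "B", "B", "B"]
predicted_cluster = [0, 0, 1, 1, 1, 1]
toy_ari = adjusted_rand_index(true_species, predicted_cluster)
print(f"ARI on toy data: {toy_ari:.3f}")
```

Cluster identifiers do not need to match the species names, because the score depends only on which items are grouped together.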

Applications

GenomeOcean is aimed at microbial and environmental genomics, where it supports sequence embedding for taxonomic clustering and metagenome analysis, generation of evolutionarily constrained protein-coding sequences, and identification or de novo design of biosynthetic gene clusters relevant to natural-product and enzyme discovery. Its emphasis on uncultured, low-abundance organisms makes it particularly useful for researchers studying the rare biosphere, while its fast generation supports large-scale in silico exploration of sequence space.

Impact

By training directly on the global microbiome rather than reference genomes, GenomeOcean broadens the scope of genome foundation models toward the underrepresented majority of microbial diversity. Its reported linear correspondence with the independently developed Evo 2 model is a striking cross-model result, suggesting that distinct architectures trained on different data converge on a shared, low-dimensional representation of functional genomic sequence. Openness remains partial: the weights carry a permissive BSD license, but no training code is published and the preprint is licensed CC-BY-NC 4.0 (non-commercial). Within those constraints, GenomeOcean contributes both a practical tool for metagenomic analysis and design and a conceptual framework for understanding the structure of genomic sequence space.

Citation

GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies

Preprint

Zhou, Z., et al. (2025) GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies. bioRxiv.

DOI: 10.1101/2025.01.30.635558

Recent citations

Papers that recently cited this model.

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
Dongxin Ye, Fang Hu, Han Hu, et al.
May 2026 · Citations: 0

Carbon: Decoding the Language of Life
Loubna Ben Allal, Qiuyi Li, Maurizio Fiusco, et al.
bioRxiv · May 2026 · Citations: 0

Asking Back: Interaction-Layer Antidistillation Watermarks
Guang Yang, Amir Ghasemian, Fengchen Liu, et al.
May 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

GENERator: A Long-Context Generative Genomic Foundation Model
Wei Wu, Qiuyi Li, Yuanyuan Zhang, et al.
Feb 2025 · Citations: 38

Computational Limits of Low-Rank Adaptation (LoRA) Fine-Tuning for Transformer Models
Jerry Yao-Chieh Hu, Maojiang Su, En-Jui Kuo, et al.
International Conference on Learning Representations · Jun 2024 · Citations: 36

HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model
Mingqian Ma, Guoqing Liu, Chuan Cao, et al.
arXiv.org · Feb 2025 · Citations: 15

Minimalist Softmax Attention Provably Learns Constrained Boolean Functions
Jerry Yao-Chieh Hu, Xiwen Zhang, Maojiang Su, et al.
arXiv.org · May 2025 · Citations: 9

A comprehensive survey of genome language models in bioinformatics
Liyuan Shu, Jiao Tang, Xiaoyu Guan, et al.
Briefings in Bioinformatics · Jan 2026 · Citations: 8

Citations

Total Citations: 24

Influential: 0

References: 121

GitHub

Stars: 150

Forks: 11

Open Issues: 0

Contributors: 5

Last Push: 18d ago

Language: Python

HuggingFace

Downloads: 1.1K

Likes: 10

Last Modified: 1y ago

Fields of citing research

Computer Science: 95%
Biology: 73%
Medicine: 32%
Environmental Science: 14%
Mathematics: 9%
Engineering: 5%
Chemistry: 5%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 49 (Partial)

Usability (can I run it?): 87

Reproducibility (can I retrain it?): 14

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository
bioRxiv Preprint
HuggingFace Model


