MicroGenomer

470M-parameter microbial genome foundation model trained on 234.5B base pairs for multi-scale genomic representation and trait prediction.

Released: December 2025

Parameters: 470 Million

Microorganisms drive biogeochemical cycles and shape human health and environmental sustainability, yet linking their genome sequences to function and ecology remains difficult at scale. MicroGenomer, released by BGI Research in December 2025, is a microbial genome foundation model designed to learn transferable representations that span from individual genes to whole genomes and on to organism-level ecophysiological traits. It targets a gap left by general-purpose DNA language models, which are trained on broad genomic corpora but are not specialized for the diversity and scale of the microbial world.

The model's central design choice is a three-stage hierarchical training pipeline that moves progressively from broad genomic context to microbe-specific signal to concrete downstream tasks. This curriculum lets a comparatively compact model concentrate capacity on microbial genomics rather than spreading it across all of sequence space. Despite having only 470 million parameters, MicroGenomer reports performance on key tasks competitive with the much larger Evo series of genomic models, which the authors describe as nearly 85 times its size.
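
The size comparison above can be checked with a line of arithmetic. The sketch below only multiplies the two figures quoted in the text (470 million parameters and a ratio of nearly 85); it does not identify which Evo model the authors mean, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the size ratio quoted in the text.
MICROGENOMER_PARAMS = 470e6  # 470 million parameters, as stated on this page
SIZE_RATIO = 85              # "nearly 85 times", as attributed to the authors

# Parameter count implied for the larger comparison model.
implied_params = MICROGENOMER_PARAMS * SIZE_RATIO

print(f"Implied comparison model size: about {implied_params / 1e9:.0f}B parameters")
```

The implied figure of roughly 40 billion parameters is consistent with the "tens of billions" description used elsewhere on this page.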

MicroGenomer is positioned as a practical tool: it produces genome-scale embeddings that feed a set of downstream predictors, and its predictions were tested against real biology. The authors validate trait predictions with wet-lab cultivation of newly isolated strains, reporting that predicted optimal growth conditions agreed well with experimentally measured values.

Key Features

Hierarchical three-stage training: Pre-training on large-scale genomic sequences, domain-specific mid-training on curated microbial marker genes, and task-specific post-training, moving from broad context to focused application.
Compact yet competitive: At 470M parameters, the model reports performance on key tasks competitive with the Evo series (on the order of tens of billions of parameters), demonstrating strong parameter efficiency.
Multi-scale representations: Single-nucleotide tokenization supports gene-scale mutational effect prediction, genome-scale metabolic analysis, and organism-level trait inference from one embedding space.
Ecophysiological trait prediction: Supports downstream tasks such as optimal growth temperature and pH, oxygen tolerance, growth rate, and probiotic classification.
Wet-lab validation: Predicted optimal growth conditions for newly isolated strains showed high concordance with cultivation experiments, indicating the embeddings carry actionable biological signal.
Open weights and code: Model weights and inference code are released under an MIT license.
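
To make "single-nucleotide tokenization" concrete, here is a minimal sketch in which every base becomes exactly one token. The vocabulary and token ids are hypothetical, since this page does not list the model's actual vocabulary, and the released tokenizer may treat ambiguous bases differently.

```python
from typing import Dict, List

# Hypothetical vocabulary: the real token ids are not given on this page.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[MASK]": 1, "[UNK]": 2,
    "A": 3, "C": 4, "G": 5, "T": 6,
}

def tokenize(sequence: str) -> List[int]:
    """Map each nucleotide to one token id (single-nucleotide resolution).

    Anything other than A, C, G, or T (for example N) maps to [UNK];
    that handling is an assumption of this sketch.
    """
    unk = VOCAB["[UNK]"]
    return [VOCAB.get(base, unk) for base in sequence.upper()]

ids = tokenize("ACGTNacgt")
print(ids)
```

Because one token corresponds to one base, a point mutation changes exactly one token, which is what makes gene-scale mutational effect prediction straightforward to express at this resolution.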

Technical Details

MicroGenomer is a 24-layer Transformer encoder with roughly 470 million parameters, operating at single-nucleotide resolution with a context window of 8,192 tokens during pre-training. Pre-training uses the OpenGenome corpus of 234.5 billion nucleotides (approximately 1,080 billion tokens) under a masked language modeling objective. Mid-training then specializes the model on a GTDB-curated microbial marker gene set of about 36 billion nucleotides, comprising coding sequences from roughly 110,000 genomes across 53 archaeal and 120 bacterial marker gene families, with a 2,000-token context. A final task-specific post-training stage adapts the learned representations to individual downstream predictors. The model is evaluated across a broad spectrum of tasks spanning gene-scale mutational effect prediction, genome-scale metabolic model analysis, and ecophysiological trait classification, where it captures phylogenetic structure in species space and reports competitive accuracy against substantially larger baselines such as the Evo series.
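
The pre-training setup described above (fixed-length windows and a masked language modeling objective) can be sketched roughly as follows. Only the 8,192-token window comes from the text. The 15% masking rate, the mask token id, and the -100 "ignore" label are common conventions assumed here for illustration, not details confirmed by the source.

```python
import random
from typing import List, Tuple

MASK_ID = 1          # hypothetical id of the mask token
IGNORE_INDEX = -100  # assumed label for positions that are not scored

def split_into_windows(token_ids: List[int], window: int = 8192) -> List[List[int]]:
    """Cut a long token sequence into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

def mask_tokens(
    token_ids: List[int], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hide a random subset of tokens and keep the originals as labels.

    mask_prob=0.15 is an assumed, BERT-style rate, not a value from the source.
    """
    rng = random.Random(seed)
    inputs: List[int] = []
    labels: List[int] = []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)       # the model sees the mask token here
            labels.append(tok)           # and is trained to recover this base
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)  # unmasked positions are not scored
    return inputs, labels

# Toy sequence of 32 tokens, split into two 16-token windows for readability.
toy = [3, 4, 5, 6] * 8
windows = split_into_windows(toy, window=16)
inputs, labels = mask_tokens(windows[0])
```

The mid-training stage would use the same idea with the shorter 2,000-token context mentioned above, applied to marker gene coding sequences.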

Applications

MicroGenomer is aimed at microbiologists, microbiome researchers, and bioprospecting teams who need to predict organism-level properties directly from genome sequence. Its genome-scale embeddings can support taxonomic and phylogenetic analysis, prediction of growth conditions and oxygen requirements for uncultured or newly isolated organisms, metabolic model analysis, and screening candidate strains for traits such as probiotic potential. By guiding which cultivation conditions to attempt, the model can shorten the costly trial-and-error loop of isolating and characterizing environmental and host-associated microbes.
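
As a rough illustration of how genome-scale embeddings might feed a trait predictor, the sketch below pools per-window vectors into one genome vector and predicts optimal growth temperature from the nearest labelled genomes. The k-nearest-neighbour predictor is a stand-in chosen for brevity, not the authors' post-training method, and all vectors and temperatures are toy values invented for the example.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Vector = List[float]

def mean_pool(window_embeddings: Sequence[Vector]) -> Vector:
    """Average per-window embeddings into one genome-level vector."""
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    n = len(window_embeddings)
    dim = len(window_embeddings[0])
    return [sum(vec[d] for vec in window_embeddings) / n for d in range(dim)]

def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query: Vector, labelled: Sequence[Tuple[Vector, float]], k: int = 2) -> float:
    """Predict a numeric trait as the mean over the k nearest labelled genomes."""
    nearest = sorted(labelled, key=lambda item: euclidean(query, item[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Toy reference set: (genome embedding, optimal growth temperature in deg C).
reference = [
    ([0.0, 0.0], 30.0),
    ([0.2, 0.0], 34.0),
    ([5.0, 5.0], 70.0),
]

# Two toy window embeddings standing in for a newly isolated strain.
genome_embedding = mean_pool([[0.0, 0.2], [0.2, 0.0]])
predicted_temp = knn_predict(genome_embedding, reference, k=2)
print(f"Suggested starting temperature: {predicted_temp:.1f} deg C")
```

In practice the predicted value would serve as a starting point for cultivation experiments rather than a replacement for them.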

Impact

MicroGenomer contributes to a growing class of microbe-focused genomic foundation models and makes a concrete case that careful, domain-specific training curricula can rival brute-force scale: a 470M-parameter model reaching performance comparable to models tens of times larger lowers the compute barrier for genome-scale microbial inference. The combination of open MIT-licensed weights and explicit wet-lab validation strengthens its credibility as a tool rather than a benchmark artifact. As a December 2025 preprint, its broader adoption and independent benchmarking are still emerging, and reported comparisons should be read in that preliminary context.

Citation

MicroGenomer: A Foundation Model for Transferable Microbial Genome Representations Enabling Multi-scale Genomic Understanding and Ecophysiological Trait Prediction

Kang, Q., et al. (2025) MicroGenomer: A Foundation Model for Transferable Microbial Genome Representations Enabling Multi-scale Genomic Understanding and Ecophysiological Trait Prediction. bioRxiv.

DOI: 10.64898/2025.12.28.696777


Citations

Total Citations: 0

Influential: 0

References: 52

GitHub

Stars: 10

Forks: 2

Open Issues: 1

Contributors: 1

Last Push: 5mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 0

Likes: 0

Last Modified: 5mo ago


Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

44 (Partial)

Usability (can I run it?): 83

Reproducibility (can I retrain it?): 9

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository, Research Paper, HuggingFace Model
