A 470M-parameter microbial genome foundation model trained hierarchically on 234.5B bp for multi-scale genomic understanding and ecophysiological trait prediction.
Microorganisms drive biogeochemical cycles and shape human health and environmental sustainability, yet linking their genome sequences to function and ecology remains difficult at scale. MicroGenomer, released by BGI Research in December 2025, is a microbial genome foundation model designed to learn transferable representations that span from individual genes to whole genomes and on to organism-level ecophysiological traits. It targets a gap left by general-purpose DNA language models, which are trained on broad genomic corpora but are not specialized for the diversity and scale of the microbial world.
The model's central design choice is a three-stage hierarchical training pipeline that moves progressively from broad genomic context to microbe-specific signal to concrete downstream tasks. This curriculum lets a comparatively compact model concentrate capacity on microbial genomics rather than spreading it across all of sequence space. Despite having only 470 million parameters, MicroGenomer reports performance on key tasks competitive with the much larger Evo series of genomic models, which the authors describe as nearly 85 times its size.
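The three stages can be written down as a small data structure to make the curriculum easier to compare at a glance. This is only an illustrative sketch: the `Stage` class and its field names are hypothetical, the numbers are the figures reported in this description, and fields are left as `None` where the description does not state a value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stage:
    """One stage of the training curriculum (hypothetical container, not the authors' code)."""
    name: str
    objective: Optional[str]       # None where the description does not state it
    corpus_bp: Optional[float]     # corpus size in nucleotides
    context_tokens: Optional[int]  # context window length in tokens

PIPELINE = [
    # Broad genomic context: OpenGenome corpus, masked language modeling.
    Stage("pre-training", "masked language modeling", 234.5e9, 8192),
    # Microbe-specific signal: GTDB-curated marker genes (about 36 billion nucleotides).
    Stage("mid-training", None, 36e9, 2000),
    # Task-specific adaptation: sizes depend on the downstream task and are not stated.
    Stage("post-training", None, None, None),
]

def corpus_fraction(stage: Stage, reference: Stage) -> float:
    """Size of one stage's corpus relative to another's."""
    return stage.corpus_bp / reference.corpus_bp

mid_vs_pre = corpus_fraction(PIPELINE[1], PIPELINE[0])
```

Under these figures the mid-training corpus is roughly 15% the size of the pre-training corpus, which fits the description of a narrower, more specialized second stage.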
MicroGenomer is positioned as a practical tool: it produces genome-scale embeddings that feed a set of downstream predictors, and its predictions were tested against real biology. The authors validate trait predictions with wet-lab cultivation of newly isolated strains, reporting that predicted optimal growth conditions agreed well with experimentally measured values.
MicroGenomer is a 24-layer Transformer encoder with roughly 470 million parameters, operating at single-nucleotide resolution with a context window of 8,192 tokens during pre-training. Pre-training uses the OpenGenome corpus of 234.5 billion nucleotides (approximately 1,080 billion tokens) under a masked language modeling objective. Mid-training then specializes the model on a GTDB-curated microbial marker gene set of about 36 billion nucleotides, comprising coding sequences from roughly 110,000 genomes across 53 archaeal and 120 bacterial marker gene families, with a 2,000-token context. A final task-specific post-training stage adapts the learned representations to individual downstream predictors. Evaluation spans gene-scale mutational effect prediction, genome-scale metabolic model analysis, and ecophysiological trait classification. Across these tasks the authors report that the model captures phylogenetic structure in species space and reaches accuracy competitive with substantially larger baselines such as the Evo series.
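To make single-nucleotide tokenization and the masked language modeling objective concrete, the sketch below prepares one training example in plain Python. It is not the authors' pipeline: the vocabulary, the 15% masking rate (borrowed from common BERT-style practice), the `-100` ignore label, and all function names are assumptions chosen for illustration. Only the one-token-per-nucleotide scheme and the 8,192-token window come from the description above.

```python
import random

# Hypothetical vocabulary; the description does not specify MicroGenomer's tokenizer.
VOCAB = {"[PAD]": 0, "[MASK]": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}
IGNORE = -100  # assumed label for positions that do not contribute to the loss

def tokenize(seq):
    """Map each nucleotide to one token id; unknown symbols fall back to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]

def windows(ids, size=8192):
    """Split token ids into consecutive, non-overlapping windows of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def mask_tokens(ids, rate=0.15, rng=None):
    """Replace a random subset of tokens with [MASK] and return (inputs, labels).

    Simplified: every selected position becomes [MASK]; the random/keep
    substitutions used by some MLM recipes are omitted.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in ids:
        if rng.random() < rate:
            inputs.append(VOCAB["[MASK]"])
            labels.append(tok)      # the model must recover the original base
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # unmasked positions are not scored
    return inputs, labels

ids = tokenize("ACGT" * 5000)            # 20,000 nucleotides give 20,000 tokens
chunks = windows(ids)                    # windows of 8,192, 8,192 and 3,616 tokens
inputs, labels = mask_tokens(chunks[0])
```

A training step would then ask the encoder to predict `labels` at the masked positions of `inputs`. The same helpers would apply to the mid-training stage by passing `size=2000`.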
MicroGenomer is aimed at microbiologists, microbiome researchers, and bioprospecting teams who need to predict organism-level properties directly from genome sequence. Its genome-scale embeddings can support taxonomic and phylogenetic analysis, prediction of growth conditions and oxygen requirements for uncultured or newly isolated organisms, metabolic model analysis, and screening candidate strains for traits such as probiotic potential. By guiding which cultivation conditions to attempt, the model can shorten the costly trial-and-error loop of isolating and characterizing environmental and host-associated microbes.
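As a rough illustration of how a genome-scale embedding could feed a trait predictor, the sketch below averages per-window vectors into one genome vector and assigns the trait label of the most similar class centroid. The description does not say how MicroGenomer pools embeddings or which predictor heads it uses, so mean pooling, cosine similarity, the nearest-centroid rule, and the toy two-dimensional vectors are all assumptions made for this example.

```python
import math

def mean_pool(window_embeddings):
    """Average per-window vectors into a single genome-level vector."""
    n = len(window_embeddings)
    dim = len(window_embeddings[0])
    return [sum(vec[d] for vec in window_embeddings) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_centroid(genome_vec, centroids):
    """Return the trait label whose centroid is most similar to genome_vec."""
    return max(centroids, key=lambda label: cosine(genome_vec, centroids[label]))

# Made-up vectors standing in for encoder outputs and labeled training centroids.
window_vectors = [[1.0, 0.0], [0.8, 0.2]]
trait_centroids = {"aerobe": [1.0, 0.1], "anaerobe": [0.0, 1.0]}

genome_vec = mean_pool(window_vectors)
predicted = nearest_centroid(genome_vec, trait_centroids)
```

In practice the window vectors would come from the encoder and the centroids from genomes with known traits. A trained classifier or regressor could replace the nearest-centroid rule without changing the overall flow.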
MicroGenomer contributes to a growing class of microbe-focused genomic foundation models and makes a concrete case that careful, domain-specific training curricula can rival brute-force scale: a 470M-parameter model reaching performance comparable to models tens of times larger lowers the compute barrier for genome-scale microbial inference. The combination of open MIT-licensed weights and explicit wet-lab validation strengthens its credibility as a tool rather than a benchmark artifact. As a December 2025 preprint, its broader adoption and independent benchmarking are still emerging, and reported comparisons should be read in that preliminary context.