A 4.7B-parameter Mixture-of-Experts genomic foundation model pretrained on ~1.2 trillion nucleotide tokens from human-associated microbial genomes.
Genos-m is a genomic foundation model purpose-built for the human microbiome, the dense and taxonomically diverse community of microbes that colonize the gut, oral cavity, skin, respiratory tract, and urogenital niches. While general nucleotide language models such as the Nucleotide Transformer, HyenaDNA, and Evo have largely been trained on reference genomes dominated by eukaryotic or broad prokaryotic sequence, Genos-m concentrates its capacity on the bacteria, archaea, and bacteriophages that are most relevant to human health. It was developed by the Genos team at BGI-HangzhouAI (Hangzhou, China) and released as a bioRxiv preprint in May 2026.
The model addresses a practical gap: microbiome research increasingly needs representations that transfer across strains, species, and functional elements without task-specific retraining. Genos-m is designed to be used in frozen-representation mode, meaning its pretrained embeddings can be fed to lightweight downstream heads for regression and classification without fine-tuning the backbone. This makes it well suited to the many small, noisy, heterogeneous datasets typical of microbial functional genomics.
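The frozen-representation workflow described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the per-token embeddings are random placeholders standing in for whatever a frozen backbone would emit, and mean pooling, the ridge head, and the names `mean_pool`, `fit_ridge`, and `predict` are all assumptions chosen for illustration.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (tokens, dim) matrix of per-token embeddings into one (dim,) vector."""
    return token_embeddings.mean(axis=0)

def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression head; only these weights are trained."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T (y - y_mean)
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    return X @ w + b

# Placeholder data: random matrices stand in for frozen backbone outputs (assumption).
rng = np.random.default_rng(0)
dim = 16
sequences = [rng.normal(size=(rng.integers(50, 200), dim)) for _ in range(200)]
X = np.stack([mean_pool(s) for s in sequences])

# Synthetic regression target so the example is self-contained (not real fitness data).
true_w = rng.normal(size=dim)
y = X @ true_w + 0.01 * rng.normal(size=len(X))

# Train the lightweight head on 150 examples, evaluate on the remaining 50.
w, b = fit_ridge(X[:150], y[:150], alpha=1e-3)
pred = predict(X[150:], w, b)
r = np.corrcoef(pred, y[150:])[0, 1]
print(f"held-out Pearson r = {r:.3f}")
```

Because the backbone is never updated, the embeddings can be computed once and cached, and only the small head needs refitting for each new dataset.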
By scaling a sparse Mixture-of-Experts (MoE) architecture to 4.7 billion total parameters while activating only 330 million per token, Genos-m aims to capture the breadth of microbial sequence diversity at a manageable inference cost, positioning it as a microbiome-specialized complement to general-purpose DNA foundation models.
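To make the sparse-activation idea concrete, here is a toy top-k routing layer in NumPy. It is a generic sketch of MoE routing, not the Genos-m implementation: the expert count, the value of `k`, the `tanh` experts, and the function names are hypothetical, and the source does not specify the router design. The last lines compute the active-parameter fraction implied by the figures quoted above.

```python
import numpy as np

def top_k_route(router_logits: np.ndarray, k: int):
    """Pick the k highest-scoring experts per token and softmax-normalize their gates."""
    idx = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    chosen = np.take_along_axis(router_logits, idx, axis=-1)
    chosen = chosen - chosen.max(axis=-1, keepdims=True)  # numerical stability
    gates = np.exp(chosen)
    gates /= gates.sum(axis=-1, keepdims=True)
    return idx, gates

def moe_forward(x, router_w, expert_ws, k):
    """Each token is processed by only k experts; the rest stay idle for that token."""
    logits = x @ router_w                      # (tokens, n_experts)
    idx, gates = top_k_route(logits, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = idx[t, j]
            out[t] += gates[t, j] * np.tanh(x[t] @ expert_ws[e])
    return out, idx

# Toy sizes chosen for illustration only (assumptions, not Genos-m hyperparameters).
rng = np.random.default_rng(0)
tokens, dim, n_experts, k = 8, 4, 16, 2
x = rng.normal(size=(tokens, dim))
router_w = rng.normal(size=(dim, n_experts))
expert_ws = rng.normal(size=(n_experts, dim, dim))
out, idx = moe_forward(x, router_w, expert_ws, k)

# Active fraction implied by the reported totals: 330M of 4.7B parameters per token.
active_fraction = 330e6 / 4.7e9
print(f"active fraction per token: {active_fraction:.1%}")
```

The reported ratio works out to roughly 7% of parameters per token. It need not equal k divided by the number of experts, since attention and embedding parameters are typically active for every token.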
Genos-m is a decoder-style Transformer with sparse Mixture-of-Experts feed-forward layers. It has 4.7 billion parameters in total, of which roughly 330 million are activated per token. Pretraining used approximately 1.2 trillion nucleotide tokens drawn from human-associated microbial genomes, including prokaryotic isolates, metagenome-assembled genomes (MAGs), bacteriophages, and Genome Taxonomy Database (GTDB) reference genomes, together covering 186 phyla, 3,448 families, and 69,056 species. The context window extends to 1 million base pairs. Evaluation was conducted in frozen-representation mode, with no backbone retraining on any benchmark, across eight gene-fitness regression tasks, biosynthetic gene cluster (BGC) classification, whole-genome strain phenotype prediction, and a zero-shot RNA fitness transfer task. Model weights are released under Apache 2.0 as two checkpoints (Genos-m-4.7B and a Megatron variant), and the paper is released under CC BY.
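Many bacterial genomes run to several million base pairs, so a whole genome may not fit in a single 1 Mbp context. The source does not say how longer genomes are handled. The sketch below shows one common option, tiling a genome into windows no longer than the context length. The helper name `tile_windows` and the optional overlap are hypothetical, not part of any released Genos-m tooling.

```python
def tile_windows(genome_length: int, window: int = 1_000_000, overlap: int = 0):
    """Return half-open (start, end) spans, each at most `window` bp, covering the genome."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    spans = []
    start = 0
    while start < genome_length:
        end = min(start + window, genome_length)
        spans.append((start, end))
        if end == genome_length:
            break  # the final window already reaches the end of the genome
        start += step
    return spans

# A 2.5 Mbp genome splits into three windows without overlap.
for start, end in tile_windows(2_500_000):
    print(start, end)
```

Per-window embeddings could then be pooled into a single genome-level vector before a downstream head is applied. That aggregation step is likewise an assumption, not something the source describes.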
Genos-m supports a range of microbiome and microbial genomics workflows: predicting gene fitness effects, classifying biosynthetic gene clusters for natural-product and antibiotic discovery, and predicting strain-level phenotypes from whole genomes. Because it operates from frozen embeddings, researchers can attach simple downstream models to tackle small or imbalanced datasets common in functional microbiology, metagenomics, and translational microbiome studies. Its demonstrated DNA-to-RNA transfer further suggests utility for RNA-level fitness questions without dedicated RNA pretraining.
Genos-m extends the genomic foundation model paradigm into the human microbiome, a domain underserved by general DNA language models despite its centrality to health and disease. By pairing microbiome-focused pretraining with an efficient sparse MoE design, long context, and openly released Apache 2.0 weights, it offers the community a reusable backbone for microbial functional prediction. Because the work is a recent preprint, its benchmarks still await peer review and independent reproduction, and its performance relative to general-purpose models such as Evo on a broader range of tasks remains to be established. Even so, its frozen-representation results across taxonomically diverse tasks point to a promising specialization strategy.
Fang, C., et al. (2026) Genos-m: a foundation model for human-associated microbial genomes. bioRxiv.
DOI: 10.64898/2026.05.21.726868