Genos-m

Mixture-of-Experts genomic foundation model for the human microbiome, with 4.7B parameters pretrained on bacterial, archaeal, and phage genomes.

Released: May 2026

Parameters: 4.7 Billion

Genos-m is a genomic foundation model purpose-built for the human microbiome, the dense and taxonomically diverse community of microbes that colonize the gut, oral cavity, skin, respiratory tract, and urogenital niches. While general nucleotide language models such as the Nucleotide Transformer, HyenaDNA, and Evo have largely been trained on reference genomes dominated by eukaryotic or broad prokaryotic sequence, Genos-m concentrates its capacity on the bacteria, archaea, and bacteriophages that are most relevant to human health. It was developed by the Genos team at BGI-HangzhouAI (Hangzhou, China) and described in a bioRxiv preprint released in May 2026.

The model addresses a practical gap: microbiome research increasingly needs representations that transfer across strains, species, and functional elements without task-specific retraining. Genos-m is designed to be used in frozen-representation mode, meaning its pretrained embeddings can be fed to lightweight downstream heads for regression and classification without fine-tuning the backbone. This makes it well suited to the many small, noisy, heterogeneous datasets typical of microbial functional genomics.
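The frozen-representation workflow described above can be sketched in a few lines. This is a minimal illustration, not the Genos-m API: `placeholder_embed` is a hypothetical stand-in that returns nucleotide frequencies where a real pipeline would return pretrained embeddings, and the ridge-regression head, its regularization strength, and the toy GC-content target are all assumptions chosen to keep the example self-contained.

```python
from typing import List, Sequence

Vector = List[float]


def placeholder_embed(seq: str) -> Vector:
    """Hypothetical stand-in for a frozen backbone (NOT the Genos-m API).

    Returns A/C/G/T frequencies; a real workflow would return the
    pretrained model's embedding for the sequence instead.
    """
    seq = seq.upper()
    n = max(len(seq), 1)
    return [seq.count(base) / n for base in "ACGT"]


def solve_linear(a: List[Vector], b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge_head(x: Sequence[Vector], y: Sequence[float], lam: float = 1e-3) -> Vector:
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * target for row, target in zip(x, y)) for i in range(d)]
    return solve_linear(xtx, xty)


def predict(weights: Vector, features: Vector) -> float:
    """Apply the linear head to one embedding."""
    return sum(w * f for w, f in zip(weights, features))


# Toy task: the "label" is GC content, so a linear head can fit it exactly.
train_seqs = ["AAAA", "CCCC", "GGGG", "TTTT", "ACGT", "AACC", "GGTT"]
train_y = [(s.count("G") + s.count("C")) / len(s) for s in train_seqs]

# The backbone stays fixed; only the small head is fitted.
head = fit_ridge_head([placeholder_embed(s) for s in train_seqs], train_y)
prediction = predict(head, placeholder_embed("ACGGCC"))
```

Only `fit_ridge_head` touches the labels, which is the point of the design: the expensive model is evaluated once per sequence, and each new task costs a small linear solve.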

By scaling a sparse Mixture-of-Experts (MoE) architecture to 4.7 billion total parameters while activating only 330 million per token, Genos-m aims to capture the breadth of microbial sequence diversity at a manageable inference cost, positioning it as a microbiome-specialized complement to general-purpose DNA foundation models.
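The ratio between total and activated parameters quoted above is easy to work out. The snippet below is simple arithmetic on the two figures stated in the text; the variable names are illustrative only.

```python
# Figures taken from the description above.
TOTAL_PARAMS = 4.7e9    # total parameters
ACTIVE_PARAMS = 330e6   # parameters activated per token

# Share of the model that participates in processing any one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# How much larger the stored model is than the per-token active subset.
capacity_ratio = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"active fraction: {active_fraction:.1%}")
print(f"total / active:  {capacity_ratio:.1f}x")
```

In other words, roughly 7% of the parameters are used for each token, so the stored model is about 14 times larger than the subset exercised per token.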

Key Features

Microbiome-specialized pretraining: Trained exclusively on human-associated microbial genomes spanning 186 phyla, 3,448 families, and 69,056 species across five body-site niches, rather than generic reference collections.
Sparse Mixture-of-Experts backbone: 4.7B total parameters with only 330M activated per token, decoupling model capacity from per-token compute.
Long-context modeling: Supports up to 1 million base pairs of context, enabling whole-operon, biosynthetic-gene-cluster, and genome-scale reasoning.
Frozen-representation evaluation: Benchmarked entirely without retraining the backbone, using fixed embeddings plus lightweight heads across diverse tasks.
Cross-modality transfer: Demonstrates zero-shot transfer of fitness prediction from DNA to RNA, indicating broadly useful learned representations.
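The sparse-routing idea behind the Mixture-of-Experts feature can be illustrated with a toy top-k router. This is a generic sketch of MoE routing, not the Genos-m implementation: the source states only the total and activated parameter counts, so the number of experts, the value of `k`, the softmax gate, and the function names `top_k_route` and `moe_forward` are all assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token: Vector, experts: Sequence[Expert],
                gate_logits: Sequence[float], k: int = 2) -> Vector:
    """Run only the selected experts and mix their outputs.

    Experts that are not selected are never called, which is why
    per-token compute depends on k rather than on the expert count.
    """
    out = [0.0] * len(token)
    for index, weight in top_k_route(gate_logits, k):
        expert_out = experts[index](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Four toy experts that simply scale their input by 1, 2, 3, and 4.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical gate scores: experts 1 and 2 are tied for the top spot.
routes = top_k_route([0.0, 5.0, 5.0, -1.0], k=2)
mixed = moe_forward([1.0, 2.0], toy_experts, [0.0, 5.0, 5.0, -1.0], k=2)
```

Adding more experts to `toy_experts` grows the model's capacity without changing how many expert calls `moe_forward` makes per token.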

Technical Details

Genos-m is a decoder-style Transformer with sparse Mixture-of-Experts feed-forward layers. It has 4.7 billion parameters in total, of which roughly 330 million are activated per token. Pretraining used approximately 1.2 trillion nucleotide tokens drawn from human-associated microbial genomes, including prokaryotic isolates, metagenome-assembled genomes (MAGs), bacteriophages, and GTDB reference genomes, covering 186 phyla, 3,448 families, and 69,056 species. The context window extends to 1 million base pairs. Evaluation was conducted in frozen-representation mode, with no backbone retraining on any benchmark, across eight gene-fitness regression tasks, biosynthetic gene cluster (BGC) classification, whole-genome strain phenotype prediction, and a zero-shot RNA fitness transfer task. Model weights are released under Apache 2.0 as two checkpoints (Genos-m-4.7B and a Megatron variant); the paper is licensed under CC BY.
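A sequence longer than the 1 million base pair context window has to be split before it can be embedded. The source does not say how whole genomes were handled, so the following is only one plausible approach under stated assumptions: overlapping windows followed by mean pooling of per-window embeddings. The helper names `iter_windows` and `mean_pool` and the overlap parameter are hypothetical.

```python
from typing import Iterator, List, Sequence

MAX_CONTEXT_BP = 1_000_000  # context window stated in the text


def iter_windows(seq: str, window: int = MAX_CONTEXT_BP,
                 overlap: int = 0) -> Iterator[str]:
    """Yield consecutive windows of at most `window` bases.

    Neighbouring windows share `overlap` bases so that features spanning
    a boundary appear whole in at least one window.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    step = window - overlap
    for start in range(0, len(seq), step):
        yield seq[start:start + window]
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-window embeddings into one fixed-size vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Small windows are used here only so the behaviour is easy to inspect.
overlapping = list(iter_windows("ACGTACGTAC", window=4, overlap=2))
disjoint = list(iter_windows("ACGTACGTAC", window=4))
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Mean pooling discards window order, which may or may not matter for a given phenotype; it is shown here because it yields a fixed-size input for a lightweight head regardless of genome length.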

Applications

Genos-m supports a range of microbiome and microbial genomics workflows: predicting gene fitness effects, classifying biosynthetic gene clusters for natural-product and antibiotic discovery, and predicting strain-level phenotypes from whole genomes. Because it operates from frozen embeddings, researchers can attach simple downstream models to tackle small or imbalanced datasets common in functional microbiology, metagenomics, and translational microbiome studies. Its demonstrated DNA-to-RNA transfer further suggests utility for RNA-level fitness questions without dedicated RNA pretraining.
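For the imbalanced datasets mentioned above, a common first step when fitting a lightweight classification head is to reweight classes by inverse frequency. The sketch below uses the standard "balanced" heuristic, n_samples / (n_classes * class_count); it is a generic technique rather than anything prescribed by the Genos-m authors, and the labels are invented for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and common classes below 1, so
    every class contributes equally to a weighted loss in aggregate.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Invented example: 2 positive gene clusters among 10 candidate regions.
toy_labels = ["bgc"] * 2 + ["non_bgc"] * 8
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights the two positive examples carry the same total weight (2 x 2.5 = 5) as the eight negatives (8 x 0.625 = 5).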

Impact

Genos-m extends the genomic foundation model paradigm into the human microbiome, a domain underserved by general DNA language models despite its centrality to health and disease. By pairing microbiome-focused pretraining with an efficient sparse MoE design, long context, and openly released Apache 2.0 weights, it offers the community a reusable backbone for microbial functional prediction. Because the work is a recent preprint, its benchmarks await peer review and independent reproduction, and its performance relative to general-purpose models such as Evo on broader tasks remains to be established. Even so, its frozen-representation results across taxonomically diverse tasks point to a promising specialization strategy.

Citation

Genos-m: a foundation model for human-associated microbial genomes

Fang, C., et al. (2026) Genos-m: a foundation model for human-associated microbial genomes. bioRxiv.

DOI: 10.64898/2026.05.21.726868


Citations

Total Citations: 0

Influential: 0

References: 64

GitHub

Stars: 26

Forks: 3

Open Issues: 1

Contributors: 2

Last Push: 1 month ago

Language: Shell

License: Apache-2.0

HuggingFace

Downloads: 117

Likes: 1

Last Modified: 2 months ago

Pipeline: text-generation


Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 73 (Open)

Usability (can I run it?): 92

Reproducibility (can I retrain it?): 42

Model Openness Framework: Class III (Open Model)


