BiomeGPT is a transformer-based foundation model for the human gut microbiome, developed at Massachusetts General Hospital (Harvard Medical School) and released as a bioRxiv preprint in January 2026. It addresses a persistent challenge in microbiome science: shotgun metagenomic profiles are high-dimensional, sparse, and compositional, and individual studies are typically too small and too heterogeneous to train robust predictive models. By pretraining a single model across many cohorts and conditions, BiomeGPT aims to learn transferable, species-level representations of microbial community composition that generalize beyond any one disease or dataset.

Conceptually, BiomeGPT brings the foundation-model paradigm (large-scale self-supervised pretraining followed by adaptation to downstream tasks) to taxonomic microbiome data, much as protein and genomic language models have done for sequences. Rather than hand-engineering features for each new study, it treats the relative abundances of microbial species as the model's "vocabulary" and learns contextual relationships among taxa directly from data. It sits alongside other emerging microbiome and metagenomics models but is distinguished by its scale and by its explicit framing as a pretrained, reusable backbone.
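As a rough illustration of how species abundances could serve as a model's vocabulary, the following Python sketch converts one sample's species counts into relative abundances and then into (species ID, abundance bin) tokens. The vocabulary, the bin edges, and the function names are hypothetical choices made for this example; the preprint's actual input encoding is not described here and may differ.

```python
from bisect import bisect_right

# Hypothetical species vocabulary: name -> integer token ID (0 reserved for padding).
VOCAB = {
    "Bacteroides uniformis": 1,
    "Faecalibacterium prausnitzii": 2,
    "Escherichia coli": 3,
    "Akkermansia muciniphila": 4,
}

# Illustrative relative-abundance bin edges (fractions of the community).
BIN_EDGES = [0.001, 0.01, 0.1]


def to_relative_abundance(counts):
    """Scale raw counts so they sum to 1 (the compositional 'closure' step)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample has no reads assigned to any species")
    return {name: value / total for name, value in counts.items()}


def tokenize(counts, vocab=VOCAB, bin_edges=BIN_EDGES):
    """Return (species_id, abundance_bin) pairs for the species present,
    ordered from most to least abundant. Absent or unknown species are skipped."""
    rel = to_relative_abundance(counts)
    present = [(name, frac) for name, frac in rel.items() if frac > 0 and name in vocab]
    present.sort(key=lambda item: item[1], reverse=True)
    return [(vocab[name], bisect_right(bin_edges, frac)) for name, frac in present]


sample = {
    "Bacteroides uniformis": 600,
    "Faecalibacterium prausnitzii": 395,
    "Escherichia coli": 5,
    "Akkermansia muciniphila": 0,
}
tokens = tokenize(sample)
print(tokens)
```

Dropping absent species keeps the token sequence short for sparse profiles, and binning abundances is only one of several plausible ways to present continuous values to a transformer.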

The preprint was authored by Nicholas A. Medearis, Siyao Zhu, and Ali R. Zomorrodi. The work is motivated by the goal of turning the growing catalog of public human gut metagenomes into a shared resource for disease classification and biomarker discovery across many phenotypes at once.

Key Features

Cross-cohort pretraining: Learns from more than 13,300 human gut metagenomes aggregated across many studies, giving it exposure to far more variation than any single-cohort model.
Broad phenotype coverage: Pretraining spans 32 phenotypes — healthy controls plus 31 distinct disease states — so a single model encodes signal relevant to a wide range of conditions.
Species-level representations: Captures relationships among microbial species to produce embeddings of community composition usable for downstream prediction.
Disease classification: Designed to be adapted for classifying disease status from gut microbiome profiles, leveraging shared structure across cohorts.
Biomarker discovery: Supports identification of taxa associated with specific phenotypes, helping prioritize candidate microbial biomarkers.

Technical Details

BiomeGPT is a transformer foundation model pretrained in a self-supervised manner on species-level taxonomic profiles derived from more than 13,300 human gut shotgun metagenomes. The training corpus spans 32 phenotypes, comprising healthy individuals and 31 disease states, allowing the model to learn representations that are not specialized to a single condition. After pretraining, the learned representations are applied to downstream tasks including disease classification and biomarker discovery. As the work is a January 2026 preprint, reported architecture hyperparameters, parameter counts, and benchmark comparisons should be read as preliminary and confirmed against the manuscript; the authors frame the contribution as a reusable backbone for microbiome prediction rather than a single fixed classifier.
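One common way to pretrain a transformer in a self-supervised manner is masked prediction: hide part of each input and train the model to recover it. The sketch below builds such a training example from a tokenized profile. It is an assumption for illustration only: the masking rate, the MASK_BIN sentinel, and the function name are hypothetical, and the preprint should be consulted for the objective BiomeGPT actually uses.

```python
import random

MASK_BIN = -1  # hypothetical sentinel marking a hidden abundance bin


def make_masked_example(tokens, mask_rate=0.15, seed=None):
    """Hide the abundance bin of a random subset of species.

    tokens: list of (species_id, abundance_bin) pairs.
    Returns (inputs, targets): inputs has MASK_BIN at hidden positions;
    targets maps each hidden position to the bin the model should predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one position so every example carries a training signal.
    n_mask = max(1, round(mask_rate * len(tokens)))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    inputs, targets = [], {}
    for pos, (species_id, abundance_bin) in enumerate(tokens):
        if pos in hidden:
            inputs.append((species_id, MASK_BIN))
            targets[pos] = abundance_bin
        else:
            inputs.append((species_id, abundance_bin))
    return inputs, targets


profile = [(1, 3), (2, 3), (3, 1), (4, 0), (5, 2), (6, 1), (7, 2)]
inputs, targets = make_masked_example(profile, mask_rate=0.3, seed=0)
print(inputs)
print(targets)
```

Because the targets come from the profile itself, no disease labels are needed at this stage, which is what allows pretraining to pool samples across cohorts and phenotypes.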

Applications

BiomeGPT is intended for researchers working with human gut metagenomic data who want to classify disease status or surface candidate microbial biomarkers without training a model from scratch for each study. Because it is pretrained across many cohorts and phenotypes, it is especially relevant where individual datasets are small or where investigators wish to compare signal across multiple conditions. Potential beneficiaries include microbiome epidemiologists, translational and clinical researchers exploring gut-disease associations, and computational biologists building downstream diagnostic or stratification pipelines.
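For small cohorts, one lightweight way to reuse a pretrained backbone is to keep it frozen and fit a simple classifier on its sample embeddings. The sketch below applies a nearest-centroid rule to made-up 3-dimensional vectors. The embeddings, labels, and function names are invented for illustration; no public BiomeGPT API or weights are assumed, and the preprint's adaptation procedure may differ (for example, fine-tuning the whole model).

```python
import math


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Made-up embeddings standing in for the output of a frozen pretrained model.
train_embeddings = [
    [0.9, 0.1, 0.0],
    [1.1, -0.1, 0.2],
    [-1.0, 0.8, 0.1],
    [-0.8, 1.2, -0.1],
]
train_labels = ["healthy", "healthy", "disease", "disease"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [0.7, 0.0, 0.1])
print(prediction)
```

A classifier this simple has very few parameters to fit, which is the appeal when a study contributes only a few dozen labeled samples.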

Impact

BiomeGPT extends the foundation-model approach that has reshaped protein and genomic modeling into taxonomic microbiome analysis, offering a pretrained backbone meant to improve generalization across the fragmented landscape of human gut metagenomic studies. As a recent single-institution preprint, its real-world adoption and independent validation remain to be established. A notable practical limitation is openness: at the time of writing, no public code repository or model weights were located. The preprint is released under a CC-BY-NC-ND license and the software license is unspecified, so reuse and reproducibility remain constrained until artifacts are released.

Citation

BiomeGPT: A foundation model for the human gut microbiome
