bio.rodeo

DNA & Gene · Language model

BiomeGPT

Massachusetts General Hospital

A transformer foundation model pretrained on 13,300+ human gut metagenomes across 32 phenotypes to learn species-level microbiome representations for disease classification and biomarker discovery.

Released: January 2026

BiomeGPT is a transformer-based foundation model for the human gut microbiome, developed at Massachusetts General Hospital (Harvard Medical School) and released as a bioRxiv preprint in January 2026. It addresses a persistent challenge in microbiome science: shotgun metagenomic profiles are high-dimensional, sparse, and compositional, and individual studies are typically too small and too heterogeneous to train robust predictive models. By pretraining a single model across many cohorts and conditions, BiomeGPT aims to learn transferable, species-level representations of microbial community composition that generalize beyond any one disease or dataset.
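
To make "sparse and compositional" concrete, here is a minimal sketch that converts per-species read counts into relative abundances (total-sum scaling) and measures how many species are absent from a sample. The species list, the counts, and the function names are illustrative assumptions, not data or code from the preprint.

```python
def to_relative_abundance(counts):
    """Scale raw per-species counts so the profile sums to 1 (total-sum scaling)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no reads")
    return {species: n / total for species, n in counts.items()}


def sparsity(profile):
    """Return the fraction of species with zero abundance in this sample."""
    if not profile:
        raise ValueError("empty profile")
    zeros = sum(1 for value in profile.values() if value == 0)
    return zeros / len(profile)


# Toy counts for a single sample (hypothetical values).
sample_counts = {
    "Bacteroides_uniformis": 600,
    "Faecalibacterium_prausnitzii": 300,
    "Escherichia_coli": 100,
    "Akkermansia_muciniphila": 0,
    "Prevotella_copri": 0,
}

profile = to_relative_abundance(sample_counts)
print(profile)
print("fraction of absent species:", sparsity(profile))
```

Because every profile is forced to sum to 1, an increase in one species necessarily lowers the shares of the others. That is the compositional constraint that makes naive per-feature models hard to transfer across cohorts.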

Conceptually, BiomeGPT brings the foundation-model paradigm (large-scale self-supervised pretraining followed by adaptation to downstream tasks) to taxonomic microbiome data, much as protein and genomic language models have done for sequences. Rather than hand-engineering features for each new study, it treats the relative abundances of microbial species as the model's "vocabulary" and learns contextual relationships among taxa directly from data. This places it alongside other emerging microbiome and metagenomics models, while its scale and its explicit framing as a pretrained, reusable backbone set it apart.

The preprint was authored by Nicholas A. Medearis, Siyao Zhu, and Ali R. Zomorrodi. The work is motivated by the goal of turning the growing catalog of public human gut metagenomes into a shared resource for disease classification and biomarker discovery across many phenotypes at once.

#Key Features

  • Cross-cohort pretraining: Learns from more than 13,300 human gut metagenomes aggregated across many studies, giving it exposure to far more variation than any single-cohort model.
  • Broad phenotype coverage: Pretraining spans 32 phenotypes — healthy controls plus 31 distinct disease states — so a single model encodes signal relevant to a wide range of conditions.
  • Species-level representations: Captures relationships among microbial species to produce embeddings of community composition usable for downstream prediction.
  • Disease classification: Designed to be adapted for classifying disease status from gut microbiome profiles, leveraging shared structure across cohorts.
  • Biomarker discovery: Supports identification of taxa associated with specific phenotypes, helping prioritize candidate microbial biomarkers.

#Technical Details

BiomeGPT is a transformer foundation model pretrained in a self-supervised manner on species-level taxonomic profiles derived from more than 13,300 human gut shotgun metagenomes. The training corpus spans 32 phenotypes, comprising healthy individuals and 31 disease states, allowing the model to learn representations that are not specialized to a single condition. After pretraining, the learned representations are applied to downstream tasks including disease classification and biomarker discovery. As the work is a January 2026 preprint, reported architecture hyperparameters, parameter counts, and benchmark comparisons should be read as preliminary and confirmed against the manuscript; the authors frame the contribution as a reusable backbone for microbiome prediction rather than a single fixed classifier.
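
The page does not specify how BiomeGPT encodes a taxonomic profile for the transformer. As a purely illustrative sketch, assume each detected species becomes a token and its relative abundance is discretized into a small number of bins. The function names, the binning rule, the padding convention, and the toy profile below are all hypothetical and should not be read as the authors' method.

```python
def build_vocab(species_names):
    """Assign each species an integer id; id 0 is reserved for padding."""
    return {name: i + 1 for i, name in enumerate(sorted(species_names))}


def encode_profile(profile, vocab, n_bins=10, max_len=6):
    """Turn a relative-abundance profile into (species_id, abundance_bin) pairs.

    Hypothetical scheme: keep detected species only, order them by decreasing
    abundance, bucket each abundance into 1..n_bins, then truncate or pad.
    """
    detected = [(name, ab) for name, ab in profile.items() if ab > 0 and name in vocab]
    detected.sort(key=lambda item: (-item[1], item[0]))
    tokens = []
    for name, ab in detected[:max_len]:
        # Equal-width bins over [0, 1]; an abundance of exactly 1.0 lands in the top bin.
        abundance_bin = min(n_bins - 1, int(ab * n_bins)) + 1
        tokens.append((vocab[name], abundance_bin))
    padding = [(0, 0)] * (max_len - len(tokens))
    return tokens + padding


# Toy species-level profile (hypothetical values).
toy_profile = {
    "Bacteroides_uniformis": 0.55,
    "Akkermansia_muciniphila": 0.35,
    "Escherichia_coli": 0.07,
    "Prevotella_copri": 0.03,
    "Roseburia_intestinalis": 0.0,
}

vocab = build_vocab(toy_profile)
tokens = encode_profile(toy_profile, vocab)
print(tokens)
```

A self-supervised objective could then hide some of these pairs and ask the model to reconstruct them. Whether BiomeGPT uses masking, binning, or a different objective should be confirmed against the manuscript.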

#Applications

BiomeGPT is intended for researchers working with human gut metagenomic data who want to classify disease status or surface candidate microbial biomarkers without training a model from scratch for each study. Because it is pretrained across many cohorts and phenotypes, it is especially relevant where individual datasets are small or where investigators wish to compare signal across multiple conditions. Potential beneficiaries include microbiome epidemiologists, translational and clinical researchers exploring gut-disease associations, and computational biologists building downstream diagnostic or stratification pipelines.
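
The "no training from scratch" workflow can be sketched as fitting a lightweight classifier on frozen sample embeddings. Since no weights or code were located for BiomeGPT, the two-dimensional vectors below are toy stand-ins for model outputs, and the nearest-centroid classifier is one simple choice assumed for illustration rather than the approach used in the preprint.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Compute one centroid per phenotype label from labelled embeddings."""
    by_label = {}
    for vector, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Toy embeddings standing in for frozen backbone outputs (hypothetical values).
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["healthy", "healthy", "disease", "disease"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.7, 0.3]))
```

In a small cohort, a simple head like this keeps the number of fitted parameters low, which is the practical appeal of reusing a pretrained backbone.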

#Impact

BiomeGPT extends the foundation-model approach that has reshaped protein and genomic modeling into taxonomic microbiome analysis, offering a pretrained backbone meant to improve generalization across the fragmented landscape of human gut metagenomic studies. As a recent single-institution preprint, its real-world adoption and independent validation remain to be established. A notable practical limitation is openness: at the time of writing, no public code repository or model weights were located, and the preprint is released under a CC-BY-NC-ND license with the software license unspecified. This constrains reuse and reproducibility until artifacts are released.

Citation

BiomeGPT: A foundation model for the human gut microbiome

Medearis, N. A., et al. (2026) BiomeGPT: A foundation model for the human gut microbiome. bioRxiv.

DOI: 10.64898/2026.01.05.697599

Recent citations

Papers that recently cited this model.

  • Learning the Language of the Microbiome with Transformers

    Neythen J. Treloar, S. ur-Rehman, Jenny Yang

    bioRxiv · May 2026

    Influential: 0

Citations

Total citations: 1
Influential: 1
References: 0

Fields of citing research

  • Biology: 100%
  • Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
8 (Closed)
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 10
Model Openness Framework: Unclassified
Restrictive license on core components

Resources

Research Paper