Massachusetts General Hospital
A transformer foundation model pretrained on 13,300+ human gut metagenomes across 32 phenotypes to learn species-level microbiome representations for disease classification and biomarker discovery.
BiomeGPT is a transformer-based foundation model for the human gut microbiome, developed at Massachusetts General Hospital (Harvard Medical School) and released as a bioRxiv preprint in January 2026. It addresses a persistent challenge in microbiome science: shotgun metagenomic profiles are high-dimensional, sparse, and compositional, and individual studies are typically too small and too heterogeneous to train robust predictive models. By pretraining a single model across many cohorts and conditions, BiomeGPT aims to learn transferable, species-level representations of microbial community composition that generalize beyond any one disease or dataset.
Conceptually, BiomeGPT brings the foundation-model paradigm — large-scale self-supervised pretraining followed by adaptation to downstream tasks — to taxonomic microbiome data, much as protein and genomic language models have done for sequences. Rather than hand-engineering features for each new study, it treats the relative abundances of microbial species as the model's "vocabulary" and learns contextual relationships among taxa directly from data. This positions it alongside other emerging microbiome and metagenomics models while remaining distinct in its scale and its explicit framing as a pretrained, reusable backbone.
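As a rough illustration of that idea, the sketch below turns one sample's species counts into a token sequence: counts are normalized to relative abundances, undetected species are dropped, and each remaining species is paired with a coarse abundance bin. The `tokenize_profile` helper, the log-scale binning, the vocabulary layout, and the example counts are all assumptions made for this example. The preprint's actual input encoding may differ.

```python
import math

def relative_abundance(counts):
    """Normalize raw per-species counts so that they sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("profile contains no reads")
    return {species: c / total for species, c in counts.items()}

def abundance_bin(fraction, n_bins=10, floor=1e-6):
    """Map a relative abundance to an integer bin on a log10 scale.

    Hypothetical scheme: values at or below `floor` land in bin 0 and
    values near 1.0 land in bin n_bins - 1.
    """
    clipped = min(max(fraction, floor), 1.0)
    scaled = (math.log10(clipped) - math.log10(floor)) / -math.log10(floor)
    return min(int(scaled * n_bins), n_bins - 1)

def tokenize_profile(counts, vocab, n_bins=10):
    """Return (species_id, abundance_bin) pairs for the detected species."""
    rel = relative_abundance(counts)
    tokens = [
        (vocab[species], abundance_bin(frac, n_bins))
        for species, frac in rel.items()
        if frac > 0 and species in vocab  # skip absent or unknown species
    ]
    # Most abundant first; ties broken by species id for determinism.
    return sorted(tokens, key=lambda t: (-t[1], t[0]))

# Made-up counts for a single sample (illustrative only).
counts = {
    "Bacteroides uniformis": 600,
    "Faecalibacterium prausnitzii": 300,
    "Escherichia coli": 100,
    "Akkermansia muciniphila": 0,
}
vocab = {name: i for i, name in enumerate(sorted(counts))}
tokens = tokenize_profile(counts, vocab)
```

Dropping zero-abundance species keeps the sequence short despite the sparsity of real profiles, and binning on a log scale reflects the fact that relative abundances span several orders of magnitude.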
The preprint was authored by Nicholas A. Medearis, Siyao Zhu, and Ali R. Zomorrodi. The work is motivated by the goal of turning the growing catalog of public human gut metagenomes into a shared resource for disease classification and biomarker discovery across many phenotypes at once.
BiomeGPT is a transformer foundation model pretrained in a self-supervised manner on species-level taxonomic profiles derived from more than 13,300 human gut shotgun metagenomes. The training corpus spans 32 phenotypes, comprising healthy individuals and 31 disease states, allowing the model to learn representations that are not specialized to a single condition. After pretraining, the learned representations are applied to downstream tasks including disease classification and biomarker discovery. As the work is a January 2026 preprint, reported architecture hyperparameters, parameter counts, and benchmark comparisons should be read as preliminary and confirmed against the manuscript; the authors frame the contribution as a reusable backbone for microbiome prediction rather than a single fixed classifier.
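The description above does not state which self-supervised objective is used. One common choice for this kind of data is masked reconstruction: hide the abundance of some detected species and train the model to predict the hidden values from the rest of the community. The sketch below shows only the data-preparation step of such a scheme. The `mask_profile` helper, the `<mask>` placeholder, the 15% default, and the example abundances are hypothetical and are not taken from the preprint.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a hidden abundance

def mask_profile(profile, mask_fraction=0.15, seed=None):
    """Hide the abundance of a random subset of detected species.

    Returns (masked_input, targets): masked_input maps every species to its
    abundance or MASK; targets maps each masked species to its true value.
    """
    if not 0 < mask_fraction <= 1:
        raise ValueError("mask_fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Only species that are actually present are candidates for masking.
    present = sorted(sp for sp, a in profile.items() if a > 0)
    n_mask = max(1, round(mask_fraction * len(present))) if present else 0
    hidden = set(rng.sample(present, n_mask))
    masked_input = {sp: (MASK if sp in hidden else a) for sp, a in profile.items()}
    targets = {sp: profile[sp] for sp in hidden}
    return masked_input, targets

# Made-up relative abundances for one sample (illustrative only).
profile = {
    "Bacteroides uniformis": 0.4,
    "Faecalibacterium prausnitzii": 0.3,
    "Escherichia coli": 0.2,
    "Akkermansia muciniphila": 0.1,
    "Prevotella copri": 0.0,
}
masked_input, targets = mask_profile(profile, mask_fraction=0.5, seed=0)
```

A model trained this way never needs disease labels during pretraining, which is what allows profiles from many cohorts and phenotypes to be pooled.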
BiomeGPT is intended for researchers working with human gut metagenomic data who want to classify disease status or surface candidate microbial biomarkers without training a model from scratch for each study. Because it is pretrained across many cohorts and phenotypes, it is especially relevant where individual datasets are small or where investigators wish to compare signal across multiple conditions. Potential beneficiaries include microbiome epidemiologists, translational and clinical researchers exploring gut-disease associations, and computational biologists building downstream diagnostic or stratification pipelines.
BiomeGPT extends the foundation-model approach that has reshaped protein and genomic modeling into taxonomic microbiome analysis, offering a pretrained backbone meant to improve generalization across the fragmented landscape of human gut metagenomic studies. As a recent single-institution preprint, it has yet to see independent validation or established real-world adoption. A notable practical limitation is openness: at the time of writing, no public code repository or model weights were located. The preprint itself is released under a CC-BY-NC-ND license and the software license is unspecified, so reuse and reproducibility remain constrained until artifacts are released.
Medearis, N. A., Zhu, S., and Zomorrodi, A. R. (2026). BiomeGPT: A foundation model for the human gut microbiome. bioRxiv.
DOI: 10.64898/2026.01.05.697599