A family of GPT-2-style causal language models pretrained on 539,000+ microbiome samples, enabling zero- and few-shot transfer across microbiome prediction tasks.
Waypoint is a family of transformer-based foundation models that treat the human microbiome as a "language," learning generalizable representations of microbial community composition through self-supervised pretraining. Developed by Outpost Bio, a London-based company building computational infrastructure for human microbiology, the models were introduced in the May 2026 bioRxiv preprint "Learning the Language of the Microbiome with Transformers." The core idea is to adapt the causal language modeling paradigm that powers large language models to microbiome data, where the "tokens" describe the microbial taxa and abundances that make up a community sample.
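To make the "microbiome as language" idea concrete, the sketch below shows one way a sample's abundance profile could be turned into a token sequence. It is illustrative only: the rank-by-abundance encoding, the `<bos>`/`<eos>` special tokens, the `max_taxa` cutoff, and the function names `sample_to_tokens` and `build_vocab` are assumptions made for this example, not the tokenization described in the preprint.

```python
from typing import Dict, List

# Hypothetical special tokens; the preprint's actual vocabulary is not described here.
BOS, EOS = "<bos>", "<eos>"


def sample_to_tokens(abundances: Dict[str, float], max_taxa: int = 8) -> List[str]:
    """Turn one sample's taxon -> relative-abundance map into a token sequence.

    Assumption: taxa are ordered from most to least abundant, so a taxon's
    position in the sequence carries the abundance signal. Taxa with zero
    abundance are dropped.
    """
    present = [(taxon, value) for taxon, value in abundances.items() if value > 0]
    # Sort by descending abundance, breaking ties alphabetically for determinism.
    present.sort(key=lambda item: (-item[1], item[0]))
    ranked = [taxon for taxon, _ in present[:max_taxa]]
    return [BOS] + ranked + [EOS]


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab: Dict[str, int] = {}
    for sequence in sequences:
        for token in sequence:
            vocab.setdefault(token, len(vocab))
    return vocab


# Toy sample with made-up abundances (not real data).
sample = {
    "g__Bacteroides": 0.42,
    "g__Faecalibacterium": 0.31,
    "g__Escherichia": 0.0,
    "g__Bifidobacterium": 0.27,
}
tokens = sample_to_tokens(sample)
vocab = build_vocab([tokens])
token_ids = [vocab[token] for token in tokens]
```

Under this assumed encoding, the integer ids in `token_ids` are what a decoder-only model would consume, one community sample per sequence.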
Microbiome research has historically relied on classical statistical and machine learning methods trained from scratch for each specific question, such as biome classification, host phenotype prediction, or community dynamics. These approaches struggle to transfer across studies and require labeled data that is often scarce. Waypoint reframes the problem as one of representation learning: by pretraining a single backbone on a very large corpus of unlabeled samples, the model captures statistical regularities of microbial ecosystems that can be reused across many downstream tasks.
By demonstrating zero- and few-shot transfer without retraining the backbone, Waypoint positions itself as an early example of the foundation-model approach applied to community-level microbiome data, complementing protein and genomic language models that operate on individual biological sequences.
Waypoint uses a GPT-2-style decoder-only transformer architecture trained with a self-supervised causal language modeling objective. The released family ranges from approximately 6 million to 170 million parameters. Pretraining uses the Atlas dataset, a corpus of over 539,000 microbiome samples assembled from MGnify. Models are evaluated on the Compass benchmark, which spans eight downstream tasks including biome classification, drug–microbiome interactions, drug degradation, and infant gut development. Across these tasks, the authors report that Waypoint achieves state-of-the-art results relative to the MGM baseline and to classical machine learning methods, with the learned representations transferring in zero- and few-shot settings rather than requiring the backbone to be retrained for each task.
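The causal language modeling objective mentioned above can be sketched in a few lines. This is a generic illustration of next-token cross-entropy, not Waypoint's training code: the function names and the toy logits are assumptions, and a real implementation would compute the same quantity on batched tensors in a deep learning framework.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's vocabulary scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def causal_lm_loss(logits_per_position: List[List[float]], token_ids: List[int]) -> float:
    """Mean next-token cross-entropy for a single sequence.

    logits_per_position[t] holds the scores a decoder-only model emits after
    reading token_ids[: t + 1]; the training target at that position is
    token_ids[t + 1]. The final position has no target and is skipped.
    """
    if len(logits_per_position) != len(token_ids):
        raise ValueError("need exactly one logit vector per input token")
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)


# Toy check: a model that scores all 4 vocabulary entries equally
# should incur a loss of ln(4) per predicted token.
toy_token_ids = [0, 2, 1, 3]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in toy_token_ids]
uniform_loss = causal_lm_loss(uniform_logits, toy_token_ids)
```

Because the targets are simply the input shifted by one position, the objective needs no labels, which is what allows pretraining on a large unlabeled corpus.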
Waypoint targets researchers working with metagenomic and microbiome data who need predictive models that generalize across studies. Potential use cases include classifying the biome of origin for a sample, predicting interactions between drugs and microbial communities, anticipating microbial drug degradation, and tracking developmental trajectories such as infant gut maturation. Because the pretrained backbone supports few-shot transfer, it is especially useful in settings where labeled microbiome data is limited, such as clinical or translational studies. Outpost Bio frames the models as infrastructure for personalized and predictive medicine, where microbiome state informs drug efficacy and nutrient absorption.
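As a rough illustration of few-shot transfer with a frozen backbone, the sketch below fits a nearest-centroid classifier on a handful of labeled embeddings. Everything here is assumed for the example: the embedding vectors are hand-written stand-ins for what a pretrained model might produce, the labels are invented, and nearest-centroid is just one simple choice of lightweight head, not necessarily the procedure used in the preprint.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_centroids(embeddings: Sequence[Sequence[float]],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the support embeddings of each class into one centroid."""
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have the same length")
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def predict(centroids: Dict[str, List[float]], embedding: Sequence[float]) -> str:
    """Return the label whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], embedding))


# Stand-in embeddings: two labeled examples per class (made-up numbers).
support_embeddings = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],   # "gut"
    [0.1, 0.9, 0.2], [0.0, 0.8, 0.3],   # "soil"
]
support_labels = ["gut", "gut", "soil", "soil"]

centroids = fit_centroids(support_embeddings, support_labels)
predicted = predict(centroids, [0.85, 0.15, 0.05])
```

The backbone is never updated in this setup; only the per-class centroids are computed from the few labeled samples, which is why the approach suits studies with scarce labels.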
Waypoint is an early demonstration that the foundation-model paradigm, already transformative for protein and genomic sequences, can be extended to community-level microbiome data. By pairing the large Atlas pretraining corpus with the Compass evaluation suite, the work offers both a model family and a standardized benchmark that others can build on. As of the May 2026 preprint, Outpost Bio states a commitment to open science, but no public GitHub repository or Hugging Face release has been confirmed. The results are also reported in a preprint that has not been peer reviewed, so independent reproduction and external validation remain open. If weights and code are released, Waypoint could lower the barrier to building predictive microbiome models across research and clinical settings.
Treloar, N. J., et al. (2026). Learning the Language of the Microbiome with Transformers. bioRxiv.
DOI: 10.64898/2026.05.02.722381