Waypoint

Microbiome foundation models that treat microbial community composition as a language, enabling zero- and few-shot transfer across prediction tasks.

Released: May 2026

Waypoint is a family of transformer-based foundation models that treat the human microbiome as a "language," learning generalizable representations of microbial community composition through self-supervised pretraining. Developed by Outpost Bio, a London-based company building computational infrastructure for human microbiology, the models were introduced in the May 2026 bioRxiv preprint "Learning the Language of the Microbiome with Transformers." The core idea is to adapt the causal language modeling paradigm that powers large language models to microbiome data, where the "tokens" describe the microbial taxa and abundances that make up a community sample.

Microbiome research has historically relied on classical statistical and machine learning methods trained from scratch for each specific question, such as biome classification, host phenotype prediction, or community dynamics. These approaches struggle to transfer across studies and require labeled data that is often scarce. Waypoint reframes the problem as one of representation learning: by pretraining a single backbone on a corpus of more than 539,000 unlabeled samples, the model captures statistical regularities of microbial ecosystems that can be reused across many downstream tasks.

By demonstrating zero- and few-shot transfer without retraining the backbone, Waypoint positions itself as an early example of the foundation-model approach applied to community-level microbiome data, complementing protein and genomic language models that operate on individual biological sequences.

Key Features

Microbiome-as-language pretraining: Treats microbial community samples as sequences and applies GPT-2-style causal language modeling, learning structure in compositional data through next-token prediction.
Family of model scales: Spans roughly 6M to 170M parameters, letting users trade compute for capacity and study how performance scales with model size.
Zero- and few-shot transfer: A single frozen backbone supports multiple downstream tasks without task-specific retraining, reducing the labeled-data burden for new applications.
Trained on a large microbiome corpus: Pretrained on the Atlas dataset of more than 539,000 samples drawn from MGnify, one of the largest public metagenomics resources.
Standardized evaluation: Benchmarked on Compass, a suite of eight downstream tasks designed to measure generalization across diverse microbiome prediction problems.
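To make the "sample as sequence" idea concrete, the sketch below encodes one community sample as a list of token IDs. It is only an illustration under stated assumptions: the abundance-ranked ordering, the special tokens, and the names `build_vocab` and `sample_to_tokens` are hypothetical and are not taken from the preprint, which may tokenize taxa and abundances differently.

```python
from typing import Dict, List

# Hypothetical special tokens; the preprint's actual vocabulary is not described here.
SPECIAL_TOKENS = ["<bos>", "<eos>", "<unk>"]


def build_vocab(taxa: List[str]) -> Dict[str, int]:
    """Map special tokens and taxon names to integer IDs."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for taxon in sorted(set(taxa)):
        vocab[taxon] = len(vocab)
    return vocab


def sample_to_tokens(
    abundances: Dict[str, float], vocab: Dict[str, int], max_len: int = 512
) -> List[int]:
    """Encode one sample as token IDs, most abundant taxon first (an assumed ordering)."""
    present = [(taxon, a) for taxon, a in abundances.items() if a > 0]
    # Sort by descending abundance; break ties by name so the output is deterministic.
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    ids = [vocab["<bos>"]]
    # Reserve two positions for the <bos> and <eos> markers.
    for taxon, _ in present[: max_len - 2]:
        ids.append(vocab.get(taxon, vocab["<unk>"]))
    ids.append(vocab["<eos>"])
    return ids


vocab = build_vocab(["Bacteroides", "Escherichia", "Faecalibacterium"])
sample = {
    "Faecalibacterium": 0.50,
    "Bacteroides": 0.30,
    "Escherichia": 0.0,   # absent taxa are dropped
    "Novel_genus": 0.20,  # not in the vocabulary, so it maps to <unk>
}
token_ids = sample_to_tokens(sample, vocab)
print(token_ids)  # [0, 5, 3, 2, 1]
```

Ranking by abundance is one simple way to turn unordered compositional data into a sequence; other encodings, such as binned abundance tokens, would fit the same description.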

Technical Details

Waypoint uses a GPT-2-style decoder-only transformer trained with a self-supervised causal language modeling objective: given the tokens seen so far in a sample, the model predicts the next one. The released family ranges from approximately 6 million to 170 million parameters. Pretraining uses the Atlas dataset, a corpus of over 539,000 microbiome samples assembled from MGnify. Models are evaluated on the Compass benchmark, which spans eight downstream tasks including biome classification, drug–microbiome interactions, drug degradation, and infant gut development. Across these tasks, the authors report that Waypoint outperforms the MGM baseline and classical machine learning methods, and that the learned representations transfer in zero- and few-shot settings without retraining the backbone for each task.
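The causal language modeling objective can be written down independently of any particular architecture. The following is a minimal sketch of the standard next-token cross-entropy loss in plain Python; it is not Waypoint's training code, and the function names and toy inputs are illustrative assumptions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def causal_lm_loss(logits: List[List[float]], token_ids: List[int]) -> float:
    """Mean next-token cross-entropy for one sequence.

    logits[t] holds the model's scores after reading tokens 0..t, so it is
    compared against the following token, token_ids[t + 1].
    """
    if len(logits) != len(token_ids):
        raise ValueError("need one logit vector per input token")
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)


# Toy check: a model with uniform scores over a 4-token vocabulary
# should incur a loss of ln(4) per predicted token.
token_ids = [0, 2, 1]
logits = [[0.0] * 4 for _ in token_ids]
loss = causal_lm_loss(logits, token_ids)
print(round(loss, 4))  # 1.3863
```

In a decoder-only transformer, the causal attention mask is what guarantees that `logits[t]` depends only on tokens up to position `t`; the loss itself is the same shifted cross-entropy shown here.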

Applications

Waypoint targets researchers working with metagenomic and microbiome data who need predictive models that generalize across studies. Potential use cases include classifying the biome of origin for a sample, predicting interactions between drugs and microbial communities, anticipating microbial drug degradation, and tracking developmental trajectories such as infant gut maturation. Because the pretrained backbone supports few-shot transfer, it is especially useful in settings where labeled microbiome data is limited, such as clinical or translational studies. Outpost Bio frames the models as infrastructure for personalized and predictive medicine, where microbiome state informs drug efficacy and nutrient absorption.
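One common way to use a frozen backbone in a few-shot setting is to embed a handful of labeled samples and classify new samples by their nearest class centroid. The sketch below shows that pattern with toy 2-D vectors standing in for backbone embeddings. It is an assumption-laden illustration: the source does not specify how Waypoint's few-shot transfer is implemented, and `fit_centroids` and `predict` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(
    embeddings: List[Sequence[float]], labels: List[str]
) -> Dict[str, List[float]]:
    """Average the frozen embeddings of the labeled support samples per class."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}


def predict(centroids: Dict[str, List[float]], embedding: Sequence[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], embedding))


# Toy embeddings standing in for the output of a frozen pretrained backbone.
support_embeddings = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, -0.2]]
support_labels = ["gut", "gut", "soil", "soil"]

centroids = fit_centroids(support_embeddings, support_labels)
prediction = predict(centroids, [0.7, 0.0])
print(prediction)  # gut
```

Only the centroids are fit on the labeled data; the backbone weights are never updated, which is what keeps the labeled-data requirement small.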

Impact

Waypoint is an early demonstration that the foundation-model paradigm, already transformative for protein and genomic sequences, can be extended to community-level microbiome data. By pairing the large Atlas pretraining corpus with the Compass evaluation suite, the work offers both a model family and a standardized benchmark that others can build on. As of the May 2026 preprint, Outpost Bio states a commitment to open science, but no public GitHub repository or Hugging Face release has been confirmed, and the results are reported in a preprint that has not been peer reviewed. Independent reproduction and external validation therefore remain open. If weights and code are released, Waypoint could lower the barrier to building predictive microbiome models across research and clinical settings.

Citation

Learning the Language of the Microbiome with Transformers

Treloar, N. J., et al. (2026) Learning the Language of the Microbiome with Transformers. bioRxiv.

DOI: 10.64898/2026.05.02.722381

Citations

Total Citations: 29

Influential: 2

References: 54

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 23 (Closed)

Usability (can I run it?): 15

Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified (missing required components)

