A pipeline that builds whole-metagenome embeddings directly from raw DNA reads using genomic language models and FAISS k-means clustering, without taxonomic or functional annotation.
MetagenBERT addresses a long-standing bottleneck in microbiome machine learning: turning a metagenome—millions of short DNA reads sampled from a microbial community—into a single fixed-length representation suitable for downstream prediction. The conventional approach first maps reads to reference genomes to build species-abundance tables, which discards information about unknown or unreferenced organisms and depends heavily on the quality of taxonomic databases. MetagenBERT instead generates metagenome embeddings directly from raw DNA reads, with no taxonomic or functional annotation step.
The method, introduced in a January 2026 preprint by Gaspar Roy, Eugeni Belda, Baptiste Hennecart, Yann Chevaleyre, Edi Prifti, and Jean-Daniel Zucker, repurposes pretrained genomic language models—DNABERT-2 and the microbiome-specialized DNABERTMS—as read-level encoders. Per-read embeddings are then aggregated across an entire sample using FAISS-accelerated k-means clustering, producing cluster-abundance vectors that summarize how reads distribute across learned sequence clusters. This yields an annotation-free "fingerprint" of the metagenome.
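The aggregation step described above can be illustrated with a minimal sketch. It assumes the per-read embeddings have already been computed (the values below are tiny hand-made 2-D vectors, not real DNABERT-2 outputs) and swaps FAISS for a small pure-Python Lloyd's k-means so the example stays self-contained. The helper names `fit_kmeans` and `cluster_abundance` are hypothetical and are not taken from the preprint.

```python
import math
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def fit_kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means; a stand-in for FAISS k-means (assumption)."""
    # Deterministic initialisation for this sketch: the first k points.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def cluster_abundance(read_embeddings, centroids):
    """Fraction of one sample's reads assigned to each cluster."""
    counts = Counter(nearest(e, centroids) for e in read_embeddings)
    total = len(read_embeddings)
    return [counts.get(j, 0) / total for j in range(len(centroids))]


# Toy "read embeddings" for two samples (made-up values).
sample_a = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1)]
sample_b = [(5.1, 4.9), (4.9, 5.0), (5.0, 5.2), (0.1, 0.1)]

# Fit clusters on the pooled reads, then summarise each sample separately.
centroids = fit_kmeans(sample_a + sample_b, k=2)
vec_a = cluster_abundance(sample_a, centroids)
vec_b = cluster_abundance(sample_b, centroids)
print(vec_a, vec_b)
```

Each sample ends up as a fixed-length vector whose entries sum to 1, regardless of how many reads it contains. That property is what lets a downstream classifier consume metagenomes of different sequencing depths.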
MetagenBERT is positioned as a step toward foundation-model-style metagenome representation: its reported results indicate that embeddings learned from sequence alone can rival traditional abundance-based features for disease classification.
MetagenBERT combines transformer-based genomic language models (DNABERT-2 and DNABERTMS, both BERT-style encoders) with a FAISS k-means aggregation step. Each read is embedded by the language model, and k-means assigns each read to a cluster; the per-sample cluster abundances form the metagenome's representation. The clustering is fit per dataset rather than learned globally, making it a lightweight, non-parametric aggregation layer. The approach was evaluated on five gut-microbiome disease datasets: Cirrhosis, Type 2 Diabetes, Obesity, Inflammatory Bowel Disease (IBD), and Colorectal Cancer. On these it achieved competitive or superior AUC relative to species-abundance baselines. The cross-cohort "Glob Mcardis" variant, pretrained on the MetaCardis cohort, transferred to other datasets. Performance also remained strong when only 10% of reads were used, suggesting the representation is both data-efficient and generalizable.
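Because the comparison with species-abundance baselines rests on AUC, a small worked example of the metric may help. The sketch below computes ROC AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as half. The labels and scores are made up for illustration and are not results from the preprint; `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up disease labels (1 = case, 0 = control) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here 8 of the 9 case/control pairs are ranked correctly, so the AUC is 8/9. The pairwise loop is quadratic and is meant only to make the definition concrete; a rank-based implementation would be preferable on real cohorts.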
MetagenBERT is intended for researchers building predictive models from shotgun metagenomic sequencing. Its most direct use is classifying disease status, such as cirrhosis, type 2 diabetes, obesity, IBD, or colorectal cancer, from gut microbiome samples. By removing the reference-mapping and annotation steps, it is particularly attractive for communities rich in uncharacterized organisms and for settings where building or maintaining taxonomic databases is impractical. Its robustness to read subsampling also makes it appealing where sequencing depth or compute is constrained.
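Read subsampling of the kind mentioned above can be sketched as follows. This assumes uniform random sampling without replacement, which is an assumption of this example rather than a procedure confirmed by the preprint; `subsample_reads` and the placeholder read identifiers are hypothetical.

```python
import random


def subsample_reads(reads, fraction=0.10, seed=0):
    """Return a uniform random subset holding roughly `fraction` of the reads."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not reads:
        return []
    n_keep = max(1, round(len(reads) * fraction))
    # A dedicated, seeded generator keeps the subset reproducible.
    return random.Random(seed).sample(reads, n_keep)


# Placeholder identifiers standing in for real DNA reads.
reads = [f"read_{i}" for i in range(1000)]
subset = subsample_reads(reads, fraction=0.10)
print(len(subset))
```

Only the retained reads would then need to be embedded, so the encoder cost scales with the sampling fraction.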
MetagenBERT illustrates how genomic foundation models can be repurposed as general-purpose encoders for whole-metagenome representation, moving the field beyond reference-dependent abundance tables. Its competitive results across five disease cohorts and its cross-cohort transfer suggest a viable path toward metagenome foundation models. As a recent preprint, its influence is still emerging, and no public code, weights, or license has been located at the time of writing, which currently limits reproducibility and independent benchmarking.