MetagenBERT

Annotation-free metagenome embedding pipeline that encodes raw DNA reads with genomic language models and pools them via FAISS k-means clustering.

Released: January 2026

MetagenBERT addresses a long-standing bottleneck in microbiome machine learning: turning a metagenome (millions of short DNA reads sampled from a microbial community) into a single fixed-length representation suitable for downstream prediction. The conventional approach first maps reads to reference genomes to build species-abundance tables. That mapping step discards information about unknown or unreferenced organisms and depends heavily on the quality of taxonomic databases. MetagenBERT instead generates metagenome embeddings directly from raw DNA reads, with no taxonomic or functional annotation step.

The method, introduced in a January 2026 preprint by Gaspar Roy, Eugeni Belda, Baptiste Hennecart, Yann Chevaleyre, Edi Prifti, and Jean-Daniel Zucker, repurposes pretrained genomic language models—DNABERT-2 and the microbiome-specialized DNABERTMS—as read-level encoders. Per-read embeddings are then aggregated across an entire sample using FAISS-accelerated k-means clustering, producing cluster-abundance vectors that summarize how reads distribute across learned sequence clusters. This yields an annotation-free "fingerprint" of the metagenome.

MetagenBERT is positioned as a step toward foundation-model-style metagenome representation, demonstrating that embeddings learned from sequence alone can rival traditional abundance-based features for disease classification.

Key Features

Annotation-free metagenome embeddings: Representations are built directly from raw DNA reads, removing the dependency on reference genomes and taxonomic or functional annotation pipelines.
Genomic language-model encoders: Uses pretrained DNABERT-2 and the microbiome-specialized DNABERTMS to embed individual reads, transferring general genomic knowledge to the metagenomic setting.
FAISS k-means aggregation: Read embeddings are clustered with FAISS-accelerated k-means, and the resulting cluster-abundance vectors serve as the per-sample feature representation. The clustering step is fit per dataset, keeping it lightweight.
Cross-cohort transfer: A variant ("MetagenBERT Glob Mcardis") pretrained on the MetaCardis cohort transfers to unseen phenotypes, hinting at foundation-model-style generalization.
Robust to read subsampling: Maintains competitive performance even when using only about 10% of available reads.
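
The read-subsampling setting in the last item can be illustrated with a minimal sketch. It assumes reads are held in memory as a list of strings and are drawn uniformly without replacement; this page does not say how the preprint drew its subsets, so the function name `subsample_reads`, the `fraction` parameter, and the fixed seed are all hypothetical.

```python
import random


def subsample_reads(reads, fraction=0.10, seed=0):
    """Return a uniform random subset of reads, drawn without replacement.

    Hypothetical helper for illustration; not taken from the MetagenBERT preprint.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not reads:
        return []
    # Keep at least one read so a non-empty sample never becomes empty.
    n_keep = max(1, round(len(reads) * fraction))
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(reads, n_keep)


# Toy sample: 1,000 synthetic 20-bp reads (illustrative data only).
_toy_rng = random.Random(42)
reads = ["".join(_toy_rng.choice("ACGT") for _ in range(20)) for _ in range(1000)]

subset = subsample_reads(reads, fraction=0.10)
print(len(subset))
```

Only the retained reads would then be passed to the read-level encoder, which is where the compute saving would come from under these assumptions.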

Technical Details

MetagenBERT combines transformer-based genomic language models (DNABERT-2 and DNABERTMS, both BERT-style encoders) with a FAISS k-means clustering aggregation strategy. Each read is embedded by the language model, and k-means assigns reads to clusters whose abundances form the metagenome's representation; this clustering is fit per dataset rather than learned globally, making it a lightweight, non-parametric aggregation layer. The approach was evaluated on five gut-microbiome disease datasets: Cirrhosis, Type 2 Diabetes, Obesity, Inflammatory Bowel Disease (IBD), and Colorectal Cancer. On these it achieved competitive or superior AUC relative to species-abundance baselines. The cross-cohort "Glob Mcardis" variant, pretrained on the MetaCardis cohort, transferred to other datasets. Performance also remained strong when only about 10% of reads were used. Together, these results suggest that the representation is both data-efficient and generalizable.
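
The aggregation step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: a plain-Python Lloyd's k-means stands in for FAISS, two-dimensional tuples stand in for language-model read embeddings, centroids are fit on the reads pooled from all samples in a dataset, and abundances are expressed as fractions of reads per cluster. The names `fit_kmeans` and `cluster_abundance` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the closest centroid (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def fit_kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means; a small stand-in for FAISS k-means."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic, evenly spaced initialisation, chosen only so that the toy
    # example is reproducible (an assumption, not the preprint's procedure).
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_abundance(sample_embeddings, centroids):
    """Fraction of a sample's reads assigned to each cluster."""
    if not sample_embeddings:
        raise ValueError("sample has no reads")
    counts = [0] * len(centroids)
    for e in sample_embeddings:
        counts[nearest(e, centroids)] += 1
    return [c / len(sample_embeddings) for c in counts]


# Toy "read embeddings" for two samples (illustrative values only).
sample_a = [(0, 0), (0, 1), (1, 0), (10, 10)]
sample_b = [(1, 1), (10, 11), (11, 10), (11, 11)]

centroids = fit_kmeans(sample_a + sample_b, k=2)  # fit once per dataset
print(cluster_abundance(sample_a, centroids))  # [0.75, 0.25]
print(cluster_abundance(sample_b, centroids))  # [0.25, 0.75]
```

Each sample ends up as a fixed-length vector of length k regardless of how many reads it contains, which is what makes the output usable as input to a standard classifier.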

Applications

MetagenBERT is intended for researchers building predictive models from shotgun metagenomic sequencing—most directly, classifying disease status from gut microbiome samples such as cirrhosis, type 2 diabetes, obesity, IBD, and colorectal cancer. By removing the reference-mapping and annotation steps, it is particularly attractive for communities rich in uncharacterized organisms and for settings where building or maintaining taxonomic databases is impractical. Its robustness to read subsampling also makes it appealing where sequencing depth or compute is constrained.

Impact

MetagenBERT illustrates how genomic foundation models can be repurposed as general-purpose encoders for whole-metagenome representation, moving the field beyond reference-dependent abundance tables. Its competitive results across five disease cohorts and its cross-cohort transfer suggest a viable path toward metagenome foundation models. As a recent preprint, its influence is still emerging, and no public code, weights, or license has been located at the time of writing, which currently limits reproducibility and independent benchmarking.

Citation

MetagenBERT: a Transformer-based Architecture using Foundational genomic Large Language Models for novel Metagenome Representation

Preprint

Roy, G., et al. (2026) MetagenBERT: a Transformer-based Architecture using Foundational genomic Large Language Models for novel Metagenome Representation. arXiv.org.

DOI: 10.48550/arXiv.2601.03295

Citations

Total Citations: 0

Influential: 0

References: 38

Openness

bio.rodeo openness: Closed · low usability and reproducibility

22 · Closed

Usability (can I run it?): 15

Reproducibility (can I retrain it?): 14

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
