bio.rodeo

DNA & Gene

Susagi

University of Zurich

A permutation-invariant denoising transformer trained on ~2 million bacterial community samples to learn member-level stability scores and predict microbiome composition dynamics zero-shot.

Released: May 2026

Susagi (Set Unsupervised Assessment of Genetic Imposters) is a "microbiome world model" that learns the rules governing which bacteria can coexist in a community. Rather than predicting the structure of a single organism, it treats a microbiome sample as an unordered set of taxa and asks a self-supervised question: given the members observed in a community, which ones genuinely belong and which are imposters? By learning to separate true members from plausible-looking decoys across millions of samples, the model induces a member-level "stability" score that reflects how well each taxon fits its community context.

The model was developed by Marco Peluso, Janko Tackmann, and Christian von Mering at the University of Zurich (and the SIB Swiss Institute of Bioinformatics), and released as a bioRxiv preprint in May 2026. It addresses a long-standing gap in microbial ecology: traditional machine-learning approaches struggle to generalize across cohorts, biomes, and hosts because community composition is high-dimensional, sparse, and context-dependent. Susagi instead leverages the scale of the MicrobeAtlas reference resource to learn community structure in a way that transfers without retraining.

Its central claim is zero-shot prediction of community composition dynamics. Applied to three settings where conventional ML pipelines underperform (gingivitis dropout prediction, DIABIMMUNE infant trajectory analysis, and cross-country IBS prediction), the learned stability scores predict ecological behavior without any task-specific fine-tuning. This positions Susagi alongside emerging "foundation model" approaches in genomics, but framed around community-level set modeling rather than single-sequence representation.

Key Features

  • Permutation-invariant set encoding: A transformer encoder without positional encodings treats each community as an unordered set, so predictions do not depend on the arbitrary order of taxa in a sample.
  • Imposter (denoising) objective: Training constructs negative "imposter" OTUs that are embedding-space neighbors but absent from the true community, forcing the model to learn which members are contextually plausible.
  • Member-level stability scores: The model outputs a per-taxon scalar that quantifies how well each member fits its community, enabling dropout, colonization, and trajectory analyses.
  • Sequence-grounded inputs via ProkBERT: Taxa are represented by ProkBERT embeddings of SSU rRNA representative sequences, mapping every community member into a shared latent space.
  • Optional metadata conditioning: Text-aware checkpoints incorporate biome and host metadata embeddings alongside DNA, while DNA-only checkpoints operate on sequence information alone.
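The permutation-invariance property in the first bullet can be checked on a toy example. The sketch below is not Susagi's code: it assumes a single attention head with identity query/key/value projections and a hypothetical linear `readout` vector, just to show that when no positional encoding is added, shuffling the input set only shuffles the per-taxon outputs in the same way.

```python
import math
import random

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V and no positional encoding."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) / z
                    for i in range(d)])
    return out

def stability_logits(tokens, readout):
    """One scalar per token: dot product of the attended vector with `readout`."""
    return [sum(r * h for r, h in zip(readout, hidden))
            for hidden in self_attention(tokens)]

rng = random.Random(0)
# Toy community: 5 taxa with 4-dimensional embeddings (illustrative sizes only).
community = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
readout = [rng.gauss(0, 1) for _ in range(4)]  # hypothetical readout weights

order = [3, 0, 4, 1, 2]  # an arbitrary reordering of the same taxa
shuffled = [community[i] for i in order]

base = stability_logits(community, readout)
perm = stability_logits(shuffled, readout)

# perm[k] belongs to the taxon that sat at index order[k] in the original list.
matches = all(
    math.isclose(perm[k], base[order[k]], rel_tol=1e-9, abs_tol=1e-9)
    for k in range(len(order))
)
print(matches)
```

The comparison uses a tolerance because summing the same terms in a different order can change the last few floating-point digits.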

Technical Details

Susagi is a compact transformer encoder operating on sets of embeddings. Each OTU is represented by a 384-dimensional ProkBERT-derived embedding (97% OTU level, from the MicrobeAtlas MAPseq reference of roughly 100,000 representative SSU rRNA sequences); optional text metadata enters as 1536-dimensional embeddings. Type-specific linear layers project these into a shared model space, and a positional-encoding-free encoder produces a per-token stability logit. The training corpus comprises approximately 2 million bacterial community samples from MicrobeAtlas. Four checkpoints are released, spanning two sizes and two modalities: small (d_model 20, 3 layers) and large (d_model 100, 5 layers), each in DNA-only and DNA+text variants. Negatives are generated by sampling imposters numbering roughly one-third of a community's size, with difficulty tuned via minimum cosine-similarity cutoffs. Evaluation spans three settings (gingivitis dropout prediction, DIABIMMUNE infant trajectory analysis, and cross-country IBS prediction), in which the zero-shot stability scores are reported to outperform conventional baselines.
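The negative-sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_imposters`, the rule "similar to at least one true member", and the tiny 2-dimensional embeddings are all hypothetical. Only the one-third ratio and the idea of a minimum cosine-similarity cutoff come from the description above.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sample_imposters(community, embeddings, min_cos, rng, fraction=1 / 3):
    """Pick absent OTUs that resemble at least one true member.

    community:  set of OTU ids observed in the sample
    embeddings: dict mapping every OTU id to its embedding vector
    min_cos:    minimum cosine similarity to some member (higher = harder decoys)
    """
    candidates = [
        otu for otu, vec in embeddings.items()
        if otu not in community
        and max(cosine(vec, embeddings[m]) for m in community) >= min_cos
    ]
    # Roughly one-third of the community size, capped by what is available.
    k = min(len(candidates), max(1, round(len(community) * fraction)))
    return rng.sample(candidates, k)

# Toy 2-dimensional embeddings for six hypothetical OTUs.
embeddings = {
    "otu_a": [1.0, 0.0],
    "otu_b": [0.9, 0.1],
    "otu_c": [0.0, 1.0],
    "otu_d": [0.95, 0.05],  # absent, close to otu_a and otu_b
    "otu_e": [0.1, 0.9],    # absent, close to otu_c
    "otu_f": [-1.0, -0.2],  # absent, far from every member
}
community = {"otu_a", "otu_b", "otu_c"}

imposters = sample_imposters(community, embeddings, min_cos=0.9,
                             rng=random.Random(0))

# Training labels for the denoising task: 1 = true member, 0 = imposter.
labels = {otu: 1 for otu in community}
labels.update({otu: 0 for otu in imposters})
print(imposters, labels)
```

Raising `min_cos` restricts decoys to close embedding-space neighbors, which makes the member-versus-imposter task harder; in this toy setup `otu_f` is never eligible at a cutoff of 0.9.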

Applications

Susagi is aimed at microbial ecologists and computational biologists studying community assembly, stability, and dynamics from amplicon (SSU rRNA) survey data. Because the stability score is produced zero-shot, researchers can probe questions such as which taxa are likely to drop out, which could colonize a community, or how an infant gut microbiome trajectory will unfold, all without assembling a cohort-specific training set. An interactive HuggingFace Space lets users explore predictions, and the released checkpoints support downstream rollout and perturbation experiments. In practice, inference depends on upstream tooling: input taxa must be embedded with ProkBERT and mapped to MicrobeAtlas OTUs, so the model is best suited to workflows already grounded in that reference.
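To make the dropout and colonization questions concrete, the sketch below shows how per-taxon scores could be turned into rankings. It does not call Susagi: `toy_stability` is a hypothetical stand-in (mean cosine similarity to the other members) occupying the place where a real checkpoint's stability scores would go, and the function names and embeddings are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_stability(community, embeddings):
    """Stand-in scorer, NOT the model: mean similarity to the other members."""
    scores = {}
    for otu in community:
        others = [m for m in community if m != otu]
        scores[otu] = sum(cosine(embeddings[otu], embeddings[m])
                          for m in others) / len(others)
    return scores

def rank_dropout_candidates(community, embeddings, score_fn=toy_stability):
    """Members ordered from least to most stable."""
    scores = score_fn(community, embeddings)
    return sorted(scores, key=scores.get)

def rank_colonizers(community, pool, embeddings, score_fn=toy_stability):
    """Score each absent taxon as if it had joined; best-fitting first."""
    fit = {}
    for otu in pool:
        if otu in community:
            continue
        fit[otu] = score_fn(community | {otu}, embeddings)[otu]
    return sorted(fit, key=fit.get, reverse=True)

# Toy 2-dimensional embeddings for hypothetical OTUs.
embeddings = {
    "otu_a": [1.0, 0.0],
    "otu_b": [0.9, 0.1],
    "otu_c": [0.8, 0.3],
    "otu_x": [-0.2, 1.0],   # present, but unlike the other members
    "otu_d": [0.95, 0.05],  # absent, resembles the core members
    "otu_f": [-1.0, -0.2],  # absent, unlike every member
}
community = {"otu_a", "otu_b", "otu_c", "otu_x"}

dropout_order = rank_dropout_candidates(community, embeddings)
colonizer_order = rank_colonizers(community, ["otu_d", "otu_f"], embeddings)
print(dropout_order[0], colonizer_order)
```

Passing the scorer as `score_fn` keeps the ranking logic independent of where the scores come from, so a wrapper around a released checkpoint could be substituted without changing the rest.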

Impact

By reframing microbiome modeling as permutation-invariant set denoising over millions of samples, Susagi offers a route to cross-cohort generalization that has eluded standard supervised pipelines. Its demonstration of zero-shot transfer across gingivitis, infant gut, and IBS datasets suggests that large-scale self-supervised pretraining can capture transferable ecological structure, an idea that could influence how community dynamics are predicted in clinical and environmental microbiome research. As a preprint, its results await peer review and broader independent benchmarking. Notably, while the preprint is released under CC BY, the model weights on GitHub and HuggingFace carry no specified license, which users should consider before redistribution or commercial use.

Citation

Susagi: A Microbiome World Model

Peluso, M., et al. (2026) Susagi: A Microbiome World Model. bioRxiv.

DOI: 10.64898/2026.05.07.723428

Openness

Unclassified
Restrictive license on core components

Tags

community_dynamics_prediction, denoising, metagenomics, microbiome, representation_learning, self_supervised, set_transformer, transformer, zero_shot

Resources

GitHub Repository, Research Paper, HuggingFace Model, Demo, Dataset