bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.

BacPT

University of Florida

Bacterial proteome foundation model that learns contextualized gene and whole-genome representations from tens of thousands of complete genomes.

Released: March 2026

Most protein language models reason about one protein at a time, learning from amino acid sequences in isolation. But in bacteria, gene function is shaped by genomic context: operon structure, neighboring genes, and the broader organization of the proteome all carry functional signal that per-protein models discard. BacPT (Bacterial Proteome Transformer) is a foundation model designed to capture this context by modeling an entire bacterial proteome as an ordered sequence of proteins rather than as independent entities.

Developed by researchers at the University of Florida and posted to bioRxiv in March 2026, BacPT is trained in a self-supervised fashion on tens of thousands of complete bacterial genomes spanning diverse taxa. Instead of predicting masked amino acids, it learns to reconstruct corrupted per-protein ESM2 embeddings from their genomic context, producing contextualized gene embeddings and functionally rich whole-genome representations without task-specific labels.
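The corrupt-then-reconstruct objective described above can be illustrated with a minimal, standard-library sketch. The summary does not specify how embeddings are corrupted or which loss is used, so everything here is an assumption: the function names `corrupt_embeddings` and `reconstruction_loss` are hypothetical, corruption is modeled as zeroing a random 15% of protein vectors, and the loss is a mean squared error over the corrupted positions only.

```python
import random


def corrupt_embeddings(embeddings, mask_fraction=0.15, seed=0):
    """Return (corrupted_copy, masked_indices) for one genome.

    `embeddings` is a list of per-protein vectors in genomic order.
    ASSUMPTION: corruption means replacing a vector with zeros; the
    actual BacPT corruption scheme is not described in this summary.
    """
    rng = random.Random(seed)
    n = len(embeddings)
    n_masked = max(1, round(mask_fraction * n))
    masked = sorted(rng.sample(range(n), n_masked))
    masked_set = set(masked)
    corrupted = [
        [0.0] * len(vec) if i in masked_set else list(vec)
        for i, vec in enumerate(embeddings)
    ]
    return corrupted, masked


def reconstruction_loss(predicted, original, masked):
    """Mean squared error computed over the masked positions only.

    ASSUMPTION: MSE is a stand-in; the paper's loss may differ.
    """
    total, count = 0.0, 0
    for i in masked:
        for p, o in zip(predicted[i], original[i]):
            total += (p - o) ** 2
            count += 1
    return total / count


# Toy genome: 20 proteins with 4-dimensional stand-in embeddings.
toy_genome = [[float(i + j + 1) for j in range(4)] for i in range(20)]
toy_corrupted, toy_masked = corrupt_embeddings(toy_genome)
# A model would map toy_corrupted to predictions; using the corrupted
# input itself as the "prediction" gives the untrained baseline loss.
baseline = reconstruction_loss(toy_corrupted, toy_genome, toy_masked)
```

In a real training loop the transformer would sit between the two functions, taking the corrupted sequence as input and being scored only on how well it restores the hidden vectors from their neighbors.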

The result is a model that sits a level above conventional protein language models: it captures local gene-neighborhood signal and genome-wide organization, which the authors report improve downstream functional predictions in enzymology, secondary metabolism, metabolic trait prediction, and microbial ecology.

#Key Features

  • Proteome-as-sequence modeling: Represents a whole bacterial genome as an ordered series of proteins, letting the model learn gene-interaction and operon-level syntax that single-protein models miss.
  • Contextualized gene embeddings: Produces embeddings for each gene that reflect its genomic neighborhood, sharpening function prediction relative to context-free protein representations.
  • Whole-genome representations: Generates organism-level embeddings that support genome- and trait-level inference, including ecological interaction outcomes.
  • Self-supervised pretraining: Learns by reconstructing corrupted ESM2 protein embeddings, requiring no functional annotations during pretraining.
  • Broad functional transfer: Improves prediction of enzyme activities, biosynthetic gene clusters (BGCs), metabolic traits, and microbe-microbe interactions.
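To make the proteome-as-sequence idea in the list above concrete, the sketch below shows one way to turn gene annotations into an ordered protein sequence and to pull out a gene's immediate neighborhood. The `Gene` record, `order_proteome`, and `neighborhood` are hypothetical names, and ordering by contig and start coordinate is an assumption; the summary says only that proteins are presented in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Minimal, hypothetical annotation record for one protein-coding gene."""
    protein_id: str
    contig: str
    start: int
    strand: str


def order_proteome(genes):
    """Return protein IDs sorted by (contig, start coordinate).

    ASSUMPTION: genomic order is defined by contig name, then start
    position; the real preprocessing may order genes differently.
    """
    ordered = sorted(genes, key=lambda g: (g.contig, g.start))
    return [g.protein_id for g in ordered]


def neighborhood(ordered_ids, index, flank=2):
    """Return the IDs within `flank` positions of `index`, clipped at the ends."""
    lo = max(0, index - flank)
    return ordered_ids[lo:index + flank + 1]


genes = [
    Gene("p3", "contig_2", 10, "+"),
    Gene("p2", "contig_1", 500, "-"),
    Gene("p1", "contig_1", 20, "+"),
]
ordered_ids = order_proteome(genes)
```

A per-protein model sees each entry of `ordered_ids` in isolation; a proteome-level model receives the whole list, so neighbors such as those returned by `neighborhood` can inform each gene's representation.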

#Technical Details

BacPT is a transformer with a hidden size of 480, 10 layers, and 5 attention heads, using relative key-query position embeddings to model dependencies across sequences of up to 5,000 proteins per genome. Each gene is first encoded with ESM2 protein embeddings; the transformer then operates over these per-protein vectors to integrate local and genome-wide context. Training data were curated by clustering protein-family representations across 33,140 bacterial genomes, yielding a final pretraining set of 28,133 complete genomes selected for taxonomic and functional diversity. The pretraining objective reconstructs deliberately corrupted ESM2 embeddings from surrounding genomic context, and the learned representations are evaluated on downstream tasks spanning enzyme activity, BGC detection, metabolic trait prediction, and ecological interaction outcomes.
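The stated hyperparameters imply some simple arithmetic, sketched below. The hidden size, layer count, head count, and 5,000-protein limit come from the description above; the class name `BacPTConfigSketch`, the feed-forward multiplier of 4, and the parameter formula (which ignores biases, layer norms, and any input or output projections) are assumptions, so the totals are a rough order-of-magnitude estimate rather than a reported figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacPTConfigSketch:
    """Hypothetical config holder; only the first four values are from the text."""
    hidden_size: int = 480
    num_layers: int = 10
    num_heads: int = 5
    max_proteins: int = 5000
    ffn_multiplier: int = 4  # ASSUMPTION: conventional 4x feed-forward width

    @property
    def head_dim(self):
        """Per-head width; the hidden size must divide evenly across heads."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def approx_layer_params(self):
        """Weights in one standard transformer layer, excluding biases and norms."""
        d = self.hidden_size
        attention = 4 * d * d  # query, key, value, and output projections
        feed_forward = 2 * d * (self.ffn_multiplier * d)  # up and down projections
        return attention + feed_forward

    def approx_encoder_params(self):
        """Rough weight count across all layers."""
        return self.num_layers * self.approx_layer_params()

    def attention_matrix_entries(self):
        """Entries in one full attention score matrix at maximum length."""
        return self.max_proteins * self.max_proteins


cfg = BacPTConfigSketch()
```

Under these assumptions each head is 96-dimensional and the encoder stack holds roughly 27.6 million weights, which is small next to the per-protein encoder's job. The cost that grows fastest is the attention matrix, which reaches 25 million entries per head per layer at the full 5,000-protein length.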

#Applications

BacPT is suited to microbiologists, genome annotators, and computational biologists working at the scale of complete bacterial genomes. Its contextualized embeddings can support functional annotation of poorly characterized genes, prioritization of biosynthetic gene clusters for natural-product discovery, prediction of metabolic capabilities, and modeling of interactions within microbial communities. Because the representations are learned without task-specific labels, the model can serve as a frozen feature extractor: a lightweight downstream predictor is trained on its embeddings, and the transformer itself does not need to be retrained for each task.
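As a rough illustration of the feature-extractor workflow, the sketch below pools per-gene vectors into one genome-level vector and compares two genomes. No public BacPT code or weights are confirmed, so the inputs are plain lists standing in for model outputs; `mean_pool` and `cosine_similarity` are hypothetical helpers, and mean pooling is an assumption rather than the way BacPT is known to build its whole-genome representations.

```python
import math


def mean_pool(gene_embeddings):
    """Average per-gene vectors into one genome-level vector.

    ASSUMPTION: mean pooling is a common default, used here only as a
    stand-in for however BacPT derives whole-genome representations.
    """
    n = len(gene_embeddings)
    dim = len(gene_embeddings[0])
    return [sum(vec[k] for vec in gene_embeddings) / n for k in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in contextualized gene embeddings for two toy genomes.
genome_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
genome_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
similarity = cosine_similarity(mean_pool(genome_a), mean_pool(genome_b))
```

The same pooled vectors could be passed to any standard classifier or regressor for trait-level prediction, with the upstream model left unchanged.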

#Impact

By extending the foundation-model paradigm from individual proteins to entire proteomes, BacPT targets functional signal that per-protein language models structurally cannot access. Its emphasis on genome-level context positions it within an emerging line of work on whole-genome bacterial language models, offering a reusable representation layer for microbial genomics. As a preprint, its benchmark advantages await peer review and broader independent evaluation, and at the time of writing no public code or weights release has been confirmed.

Tags

function_prediction, genome_representation, enzyme_annotation, transformer, foundation_model, self_supervised, representation_learning, bacterial_genomics, proteomics