Bacterial proteome foundation model that learns contextualized gene and whole-genome representations from tens of thousands of complete genomes.
Most protein language models reason about one protein at a time, learning from amino acid sequences in isolation. But in bacteria, gene function is shaped by genomic context: operon structure, neighboring genes, and the broader organization of the proteome all carry functional signal that per-protein models discard. BacPT (Bacterial Proteome Transformer) is a foundation model designed to capture this context by modeling an entire bacterial proteome as an ordered sequence of proteins rather than as independent entities.
Developed by researchers at the University of Florida and posted to bioRxiv in March 2026, BacPT is trained in a self-supervised fashion on tens of thousands of complete bacterial genomes spanning diverse taxa. Instead of predicting masked amino acids, it learns to reconstruct corrupted per-protein ESM2 embeddings from their genomic context, producing contextualized gene embeddings and functionally rich whole-genome representations without task-specific labels.
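The reconstruction objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The corruption scheme (zeroing selected vectors), the 15% corruption rate, the mean-squared-error loss, and the helper names `corrupt_embeddings` and `reconstruction_loss` are all assumptions made for illustration; the source states only that corrupted ESM2 embeddings are reconstructed from genomic context. Random vectors stand in for real ESM2 embeddings.

```python
import numpy as np


def corrupt_embeddings(embeddings, mask_fraction=0.15, rng=None):
    """Zero out a random subset of per-protein embeddings.

    Zero-masking and the 15% rate are illustrative assumptions; the source
    only says that the ESM2 embeddings are corrupted.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_proteins = embeddings.shape[0]
    n_masked = max(1, int(round(mask_fraction * n_proteins)))
    masked_idx = rng.choice(n_proteins, size=n_masked, replace=False)
    mask = np.zeros(n_proteins, dtype=bool)
    mask[masked_idx] = True
    corrupted = embeddings.copy()
    corrupted[mask] = 0.0
    return corrupted, mask


def reconstruction_loss(predicted, target, mask):
    """Mean squared error computed only at the corrupted positions."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))


# Toy genome: 200 proteins in gene order, each a 480-dim placeholder vector.
rng = np.random.default_rng(0)
genome = rng.normal(size=(200, 480))
corrupted, mask = corrupt_embeddings(genome, rng=rng)

# A trained model would predict the original vectors from genomic context.
# Here the untouched corrupted input serves as a trivial baseline prediction.
baseline_loss = reconstruction_loss(corrupted, genome, mask)
perfect_loss = reconstruction_loss(genome, genome, mask)
```

In this sketch the loss is taken only at corrupted positions, so a model cannot lower it by copying uncorrupted inputs; whether BacPT scores all positions or only corrupted ones is not stated in the source.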
The result is a model that sits one level above conventional protein language models, capturing local gene-neighborhood signal and genome-wide organization that improve a range of downstream functional predictions across enzymology, secondary metabolism, metabolic traits, and microbial ecology.
BacPT is a transformer with a hidden size of 480, 10 layers, and 5 attention heads, using relative key-query position embeddings to model dependencies across sequences of up to 5,000 proteins per genome. Each protein-coding gene is first encoded as an ESM2 protein embedding; the transformer then operates over these per-protein vectors to integrate local and genome-wide context. Training data were curated by clustering protein-family representations across 33,140 bacterial genomes, yielding a final pretraining set of 28,133 complete genomes selected for taxonomic and functional diversity. The pretraining objective reconstructs deliberately corrupted ESM2 embeddings from surrounding genomic context, and the learned representations are evaluated on downstream tasks spanning enzyme activity, biosynthetic gene cluster (BGC) detection, metabolic trait prediction, and ecological interaction outcomes.
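The stated hyperparameters imply a few quantities that are easy to check by hand. The sketch below collects them in a small, hypothetical `BacPTConfig` dataclass; the class and its method names are not from any released code. It assumes dense self-attention over all protein pairs and one learned embedding per signed relative offset within the maximum length; neither detail is confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacPTConfig:
    """Hypothetical container for the hyperparameters quoted in the text."""

    hidden_size: int = 480
    num_layers: int = 10
    num_attention_heads: int = 5
    max_proteins: int = 5000

    @property
    def head_dim(self) -> int:
        """Per-head dimension; the hidden size must split evenly across heads."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    @property
    def num_relative_offsets(self) -> int:
        """Distinct signed offsets between two positions (assumed: no clipping)."""
        return 2 * self.max_proteins - 1

    def attention_scores_per_layer(self, n_proteins: int) -> int:
        """Pairwise attention scores in one layer, assuming dense attention."""
        if not 0 < n_proteins <= self.max_proteins:
            raise ValueError("n_proteins must be between 1 and max_proteins")
        return self.num_attention_heads * n_proteins * n_proteins


config = BacPTConfig()
print(config.head_dim)                          # 96
print(config.num_relative_offsets)              # 9999
print(config.attention_scores_per_layer(5000))  # 125000000
```

Under these assumptions, a genome at the 5,000-protein limit produces 125 million attention scores per layer, which shows why treating each protein as a single token, rather than each amino acid, keeps whole-genome attention tractable.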
BacPT is suited to microbiologists, genome annotators, and computational biologists working at the scale of complete bacterial genomes. Its contextualized embeddings can power functional annotation of poorly characterized genes, prioritization of biosynthetic gene clusters for natural-product discovery, prediction of metabolic capabilities, and modeling of interactions within microbial communities. Because representations are produced unsupervised, the model can be applied as a feature extractor across many downstream tasks without retraining from scratch.
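As a rough illustration of the feature-extractor workflow, the sketch below pools per-gene embeddings into one vector per genome and fits a simple nearest-centroid classifier on top, leaving the embeddings themselves fixed. Everything here is assumed for illustration: mean pooling is only one possible way to obtain a genome-level vector, the helper names are hypothetical, and the "embeddings" are synthetic random vectors with an artificial class shift, not BacPT outputs.

```python
import numpy as np


def genome_embedding(gene_embeddings):
    """Mean-pool per-gene vectors into one genome vector (assumed pooling)."""
    return gene_embeddings.mean(axis=0)


def fit_centroids(features, labels):
    """Compute one centroid per class from fixed genome-level features."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(features, classes, centroids):
    """Assign each genome to the class with the nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


rng = np.random.default_rng(0)


def fake_genome(shift, n_genes):
    """Synthetic stand-in for contextualized gene embeddings of one genome."""
    return rng.normal(loc=shift, size=(n_genes, 480))


# Twenty toy genomes of varying length, with an artificial two-class trait.
labels = np.array([0] * 10 + [1] * 10)
genomes = [
    fake_genome(0.0 if y == 0 else 1.0, int(rng.integers(50, 100)))
    for y in labels
]

features = np.stack([genome_embedding(g) for g in genomes])
classes, centroids = fit_centroids(features, labels)
preds = predict(features, classes, centroids)
```

Only the lightweight classifier is fit in this sketch. In practice the classifier would be evaluated on held-out genomes, and any standard model (logistic regression, gradient boosting) could replace the nearest-centroid rule.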
By extending the foundation-model paradigm from individual proteins to entire proteomes, BacPT targets functional signal that per-protein language models structurally cannot access. Its emphasis on genome-level context positions it within an emerging line of work on whole-genome bacterial language models, offering a reusable representation layer for microbial genomics. As a preprint, its benchmark advantages await peer review and broader independent evaluation, and at the time of writing no public code or weights release has been confirmed.