bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein

ProteomeLM

EPFL

A transformer that reasons over entire proteomes to produce context-aware protein representations for zero-shot protein-protein interaction and gene essentiality prediction.

Released: August 2025
Parameters: 328 Million

Most protein language models reason about one sequence at a time, ignoring the fact that proteins do not function in isolation but as members of a coordinated cellular system. ProteomeLM, developed by the Bitbol Lab at EPFL (Malbranke, Fruet, and Bitbol), takes a different stance: it treats an entire proteome as the unit of input, producing protein representations that are contextualized by every other protein in the organism. This proteome-scale view lets the model capture functional constraints, such as which proteins co-occur, co-vary, and physically partner, that are invisible to single-sequence models.

The model is trained in a self-supervised fashion to reconstruct masked protein embeddings using the surrounding proteomic context. A striking emergent property is that ProteomeLM's attention coefficients encode protein-protein interactions (PPIs) despite the model never seeing interaction labels during training. This enables zero-shot, interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino-acid coevolution methods that depend on building large multiple sequence alignments.
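As a rough illustration of how attention coefficients could be turned into pairwise interaction scores, the sketch below averages attention over heads and over both directions of each protein pair, then ranks the pairs. The input layout (a list of heads, each an n x n matrix), the function names, and the averaging scheme are assumptions made for illustration; they are not the procedure or API documented by the ProteomeLM authors.

```python
from itertools import combinations


def pair_scores_from_attention(attn):
    """Score each unordered protein pair from attention weights.

    attn: list of heads, each an n x n list of lists, where attn[h][i][j]
    is the attention protein i pays to protein j (hypothetical layout).
    Returns {(i, j): score} for i < j, averaged over heads and directions.
    """
    n_heads = len(attn)
    n = len(attn[0])
    scores = {}
    for i, j in combinations(range(n), 2):
        total = sum(head[i][j] + head[j][i] for head in attn)
        scores[(i, j)] = total / (2 * n_heads)
    return scores


def top_pairs(scores, k):
    """Return the k highest-scoring pairs, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy attention for 3 proteins and 2 heads (made-up numbers).
attn = [
    [[0.1, 0.8, 0.1],
     [0.7, 0.2, 0.1],
     [0.3, 0.3, 0.4]],
    [[0.2, 0.6, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.2, 0.4]],
]
scores = pair_scores_from_attention(attn)
best = top_pairs(scores, 1)
print(best, scores[best[0]])
```

Symmetrizing is a design choice here: attention is directional, while a physical interaction is not, so averaging both directions gives one score per unordered pair.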

First released as a bioRxiv preprint in August 2025 and subsequently published in PNAS (2026), ProteomeLM sits at the intersection of protein language modeling and systems biology, extending the foundation-model paradigm from individual proteins to the organization of whole proteomes.

#Key Features

  • Proteome-scale context: Processes the full set of proteins in an organism at once, yielding representations that reflect proteome-wide functional constraints rather than properties of an isolated sequence.
  • Zero-shot PPI detection: Attention coefficients spontaneously capture protein-protein interactions without any interaction supervision, supporting interactome-wide screening orders of magnitude faster than coevolution-based approaches.
  • Cross-taxa generalization: Trained across the tree of life, the model transfers to eukaryotic and prokaryotic proteomes alike, including organisms unseen during training.
  • Supervised task heads: ProteomeLM-PPI combines embeddings and attention coefficients for state-of-the-art supervised PPI prediction, while ProteomeLM-Ess predicts gene essentiality across diverse taxa.
  • Four model sizes: Released as XS, S, M, and L checkpoints (5.66M to 328M parameters), letting users trade off accuracy against compute and memory.
  • Easy access: Distributed through a proteomelm pip package and four HuggingFace checkpoints, with notebooks demonstrating PPI prediction workflows.
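To make the size trade-off concrete, the following back-of-the-envelope sketch estimates the memory needed to hold each checkpoint's weights. It uses the published parameter counts (XS 5.66M, S 36.9M, M 112M, L 328M) and assumes 4 bytes per parameter in float32 and 2 bytes in float16. It counts weights only: activations, optimizer state, and the upstream protein language model that supplies the input embeddings are not included, so real usage will be higher.

```python
# Published parameter counts for the four ProteomeLM checkpoints.
CHECKPOINT_PARAMS = {"XS": 5.66e6, "S": 36.9e6, "M": 112e6, "L": 328e6}

# Assumed storage cost per parameter for two common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mb(n_params, dtype="float32"):
    """Approximate weight memory in megabytes (1 MB = 1e6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


for name, n_params in CHECKPOINT_PARAMS.items():
    fp32 = weight_memory_mb(n_params, "float32")
    fp16 = weight_memory_mb(n_params, "float16")
    print(f"ProteomeLM-{name}: ~{fp32:.0f} MB (float32), ~{fp16:.0f} MB (float16)")
```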

#Technical Details

ProteomeLM is a DistilBERT-style transformer encoder trained from scratch on roughly 32,000 annotated proteomes spanning all domains of life, with orthologous groups drawn from OrthoDB. Rather than consuming raw amino acids, it takes per-protein embeddings from an upstream protein language model (ESM-2 / ESM-C) as input tokens. During training, the model reconstructs masked protein embeddings from their proteomic context by identifying the correct embedding among a candidate list of orthologs. Four checkpoints are released: ProteomeLM-XS (5.66M parameters), S (36.9M), M (112M), and L (328M). The training data is published as a HuggingFace dataset (the OrthoDB-derived ProteomeLM-dataset). On supervised benchmarks, ProteomeLM-PPI achieves state-of-the-art protein-protein interaction prediction across multiple species, and ProteomeLM-Ess generalizes gene-essentiality prediction across taxa.
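The idea of picking the correct embedding out of a list of ortholog candidates can be sketched as a softmax classification over similarity scores. The version below is a simplified illustration, not the authors' loss: the dot-product similarity, the temperature parameter, and all function names are assumptions, and the vectors are toy values.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def candidate_log_probs(predicted, candidates, temperature=1.0):
    """Log-softmax over similarities between a predicted embedding
    and each candidate embedding (numerically stabilized)."""
    logits = [dot(predicted, c) / temperature for c in candidates]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_identification_loss(predicted, candidates, true_index, temperature=1.0):
    """Cross-entropy for identifying the true embedding among candidates."""
    return -candidate_log_probs(predicted, candidates, temperature)[true_index]


# Toy case: the model's output for a masked position, plus three candidates.
predicted = [0.9, 0.1, 0.0]
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_if_true_is_0 = masked_identification_loss(predicted, candidates, 0)
loss_if_true_is_2 = masked_identification_loss(predicted, candidates, 2)
print(loss_if_true_is_0, loss_if_true_is_2)
```

The loss is lowest when the predicted embedding is most similar to the true candidate, which is the behavior a masked-reconstruction objective of this kind is meant to reward.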

#Applications

ProteomeLM is aimed at computational and systems biologists who need fast, genome-wide maps of which proteins interact and which genes are essential. Because it runs zero-shot from a proteome without requiring deep alignments, it is well suited to screening interactomes in non-model organisms, prioritizing candidate complexes for experimental validation, and predicting essential genes to inform antimicrobial or synthetic-biology targets. A follow-up application from the same group fine-tunes ProteomeLM-S into ProteomeLM-HPI to predict host-pathogen interactions, improving discrimination on nine of ten pathogen datasets across viruses and bacteria including HSV-1, Y. pestis, S. enterica, C. trachomatis, and M. tuberculosis.
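When interaction scores are used to prioritize candidates, a common way to quantify discrimination is the area under the ROC curve (AUROC): the probability that a randomly chosen true interaction is scored above a randomly chosen non-interaction. The sketch below computes it by direct pairwise comparison. The choice of AUROC as the metric is an assumption for illustration, and the scores and labels are toy values, not results from ProteomeLM.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positives and negatives.

    scores: predicted interaction scores (higher = more likely to interact).
    labels: 1 for a known interaction, 0 for a non-interaction.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy candidate pairs: two true interactions among five candidates.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
print(f"AUROC = {auroc(toy_scores, toy_labels):.3f}")
```

The quadratic loop is fine for small validation sets; for interactome-wide evaluation a rank-based implementation would be the more practical choice.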

#Impact

By reframing protein representation learning at the scale of the whole proteome, ProteomeLM offers a complementary axis to sequence- and structure-based models and shows that interaction information can emerge from self-supervised training without explicit labels. Its speed advantage over coevolution methods makes interactome-wide screening tractable for many genomes, and the open weights, pip package, and training data lower the barrier to adoption. The rapid emergence of downstream applications such as host-pathogen interaction prediction illustrates its value as a reusable proteome-scale foundation model, though, as with any embedding-driven predictor, results remain hypotheses that benefit from experimental confirmation.

Citations

ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa

Malbranke, C., et al. (2026) ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa. PNAS.

DOI: 10.1073/pnas.2524201123

Preprint

DOI: 10.1101/2025.08.01.668221

Openness

Unclassified
Restrictive license on core components

Tags

foundation_model, gene_essentiality_prediction, host_pathogen_interaction_prediction, protein_protein_interaction_prediction, proteomics, self_supervised, transformer, zero_shot

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset