Most protein language models reason about one sequence at a time, ignoring the fact that proteins do not function in isolation but as members of a coordinated cellular system. ProteomeLM, developed by the Bitbol Lab at EPFL (Malbranke, Fruet, and Bitbol), takes a different stance: it treats an entire proteome as the unit of input, producing protein representations that are contextualized by every other protein in the organism. This proteome-scale view lets the model capture functional constraints, such as which proteins co-occur, co-vary, and physically partner, that are invisible to single-sequence models.
The model is trained in a self-supervised fashion to reconstruct masked protein embeddings using the surrounding proteomic context. A striking emergent property is that ProteomeLM's attention coefficients encode protein-protein interactions (PPIs) despite the model never seeing interaction labels during training. This enables zero-shot, interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino-acid coevolution methods that depend on building large multiple sequence alignments.
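To make the idea of reading interactions off attention concrete, here is a minimal sketch of how a stack of attention maps over the proteins of one proteome could be collapsed into symmetric pairwise scores and ranked. It is an illustration only: the text above does not say how ProteomeLM aggregates its attention coefficients, so the plain mean over layers and heads, the symmetrization, and the function names `attention_to_pair_scores` and `top_pairs` are all assumptions, and the attention tensor is random toy data rather than model output.

```python
import numpy as np


def attention_to_pair_scores(attn):
    """Collapse attention maps of shape (layers, heads, N, N) into an (N, N) score matrix.

    Assumption: a plain mean over layers and heads, followed by symmetrization.
    This is not taken from the ProteomeLM paper or package.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 4 or attn.shape[-1] != attn.shape[-2]:
        raise ValueError("expected attention of shape (layers, heads, N, N)")
    mean_attn = attn.mean(axis=(0, 1))        # average over layers and heads
    scores = 0.5 * (mean_attn + mean_attn.T)  # interactions are unordered pairs
    np.fill_diagonal(scores, 0.0)             # ignore self-attention
    return scores


def top_pairs(scores, k):
    """Return the k highest-scoring unordered pairs as (i, j, score) with i < j."""
    i, j = np.triu_indices(scores.shape[0], k=1)
    order = np.argsort(scores[i, j])[::-1][:k]
    return [(int(i[o]), int(j[o]), float(scores[i[o], j[o]])) for o in order]


# Toy data: 2 layers, 4 heads, 5 proteins, with one artificially strong pair (1, 3).
rng = np.random.default_rng(0)
toy_attn = rng.random((2, 4, 5, 5))
toy_attn[:, :, 1, 3] += 5.0

toy_scores = attention_to_pair_scores(toy_attn)
toy_top = top_pairs(toy_scores, k=3)
print(toy_top)
```

In a real screen the ranked pairs would be candidate interactions to validate, and a learned weighting over heads may well work better than the uniform mean used here.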
First released as a bioRxiv preprint in August 2025 and subsequently published in PNAS (2026), ProteomeLM sits at the intersection of protein language modeling and systems biology, extending the foundation-model paradigm from individual proteins to the organization of whole proteomes.
ProteomeLM is a transformer encoder (DistilBERT-style architecture) trained from scratch on roughly 32,000 annotated proteomes spanning all domains of life, with orthologous groups drawn from OrthoDB. Rather than consuming raw amino acids, the model takes per-protein embeddings from an upstream protein language model (ESM-2 / ESM-C) as input tokens. It then learns to reconstruct masked protein embeddings from their proteomic context by identifying the correct embedding among a candidate list of orthologs. The model is released as the proteomelm pip package and four HuggingFace checkpoints, with notebooks demonstrating PPI prediction workflows. The checkpoints are ProteomeLM-XS (5.66M parameters), S (36.9M), M (112M), and L (328M). The associated training data is published as a HuggingFace dataset (the OrthoDB-derived ProteomeLM-dataset). On supervised benchmarks, ProteomeLM-PPI achieves state-of-the-art protein-protein interaction prediction across multiple species, and ProteomeLM-Ess generalizes gene-essentiality prediction across taxa.
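The training objective described above, recovering a masked protein's embedding by picking it out of a list of ortholog candidates, can be sketched as a small classification loss. This is a hedged illustration rather than the authors' implementation: the cosine-similarity scoring, the `temperature` parameter, the softmax cross-entropy form, and the name `candidate_identification_loss` are assumptions introduced here, and the embeddings are random vectors standing in for ESM-derived ones.

```python
import numpy as np


def candidate_identification_loss(predicted, candidates, true_index, temperature=0.1):
    """Cross-entropy for choosing the true embedding among ortholog candidates.

    predicted:  (dim,) embedding the model outputs for the masked protein.
    candidates: (n_candidates, dim) embeddings of the candidate orthologs.
    Assumption: cosine similarity scaled by a temperature gives the logits.
    Returns (loss, index of the best-scoring candidate).
    """
    predicted = np.asarray(predicted, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    p = predicted / np.linalg.norm(predicted)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (c @ p) / temperature
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum()) # log-softmax
    return float(-log_probs[true_index]), int(np.argmax(log_probs))


# Toy data: 8 hypothetical ortholog candidates in a 16-dimensional space.
rng = np.random.default_rng(1)
dim = 16
ortholog_candidates = rng.normal(size=(8, dim))
target = 2

# A prediction close to the true candidate versus one close to a wrong candidate.
good_prediction = ortholog_candidates[target] + 0.05 * rng.normal(size=dim)
bad_prediction = ortholog_candidates[5] + 0.05 * rng.normal(size=dim)

good_loss, good_choice = candidate_identification_loss(good_prediction, ortholog_candidates, target)
bad_loss, bad_choice = candidate_identification_loss(bad_prediction, ortholog_candidates, target)
print(good_choice, bad_choice, good_loss < bad_loss)
```

Framing reconstruction as a choice among orthologs means the model is rewarded for using proteomic context to tell related proteins apart, not for matching an embedding exactly.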
ProteomeLM is aimed at computational and systems biologists who need fast, genome-wide maps of which proteins interact and which genes are essential. Because it runs zero-shot from a proteome without requiring deep alignments, it is well suited to screening interactomes in non-model organisms, prioritizing candidate complexes for experimental validation, and predicting essential genes to inform antimicrobial or synthetic-biology targets. A follow-up application from the same group fine-tunes ProteomeLM-S into ProteomeLM-HPI to predict host-pathogen interactions, improving discrimination on nine of ten pathogen datasets across viruses and bacteria including HSV-1, Y. pestis, S. enterica, C. trachomatis, and M. tuberculosis.
By reframing protein representation learning at the scale of the whole proteome, ProteomeLM offers a complementary axis to sequence- and structure-based models and shows that interaction information can emerge from self-supervised training without explicit labels. Its speed advantage over coevolution methods makes interactome-wide screening tractable for many genomes, and the open weights, pip package, and training data lower the barrier to adoption. The rapid emergence of downstream applications such as host-pathogen interaction prediction illustrates its value as a reusable proteome-scale foundation model, though, as with any embedding-driven predictor, results remain hypotheses that benefit from experimental confirmation.
Malbranke, C., et al. (2026). ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa. PNAS.
DOI: 10.1073/pnas.2524201123