ProteomeLM

Proteome-scale protein language model whose representations enable zero-shot protein-protein interaction and gene essentiality prediction.

Released: August 2025

Parameters: 328 million (largest checkpoint, ProteomeLM-L)

Most protein language models reason about one sequence at a time, ignoring the fact that proteins do not function in isolation but as members of a coordinated cellular system. ProteomeLM, developed by the Bitbol Lab at EPFL (Malbranke, Fruet, and Bitbol), takes a different stance: it treats an entire proteome as the unit of input, producing protein representations that are contextualized by every other protein in the organism. This proteome-scale view lets the model capture functional constraints, such as which proteins co-occur, co-vary, and physically partner, that are invisible to single-sequence models.

The model is trained in a self-supervised fashion to reconstruct masked protein embeddings using the surrounding proteomic context. A striking emergent property is that ProteomeLM's attention coefficients encode protein-protein interactions (PPIs) despite the model never seeing interaction labels during training. This enables zero-shot, interactome-wide PPI screening that is substantially more accurate, and orders of magnitude faster, than amino-acid coevolution methods that depend on building large multiple sequence alignments.
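To make the idea of reading interactions off attention concrete, the sketch below ranks protein pairs by a symmetrized, head-averaged attention weight. It is a minimal illustration only: the nested-list `attention` layout, the function name `rank_pairs_by_attention`, and the choice to average over heads and over both directions are assumptions made here, not the procedure or API used by ProteomeLM.

```python
from itertools import combinations


def rank_pairs_by_attention(attention, protein_ids):
    """Rank unordered protein pairs by symmetrized attention.

    attention[h][i][j] is a hypothetical attention weight from protein i to
    protein j in head h. Averaging over heads and symmetrizing is an
    illustrative choice, not necessarily what ProteomeLM does.
    """
    n_heads = len(attention)
    n = len(protein_ids)
    scored = []
    for i, j in combinations(range(n), 2):
        # Mean over heads of the two directed weights i->j and j->i.
        total = sum(attention[h][i][j] + attention[h][j][i] for h in range(n_heads))
        scored.append((total / (2 * n_heads), protein_ids[i], protein_ids[j]))
    # Highest score first: these are the candidate interactions.
    return sorted(scored, reverse=True)


# Toy example: 2 heads, 3 proteins (all values are made up).
toy_attention = [
    [[0.0, 0.8, 0.2], [0.7, 0.0, 0.3], [0.5, 0.5, 0.0]],
    [[0.0, 0.6, 0.4], [0.9, 0.0, 0.1], [0.4, 0.6, 0.0]],
]
ranking = rank_pairs_by_attention(toy_attention, ["A", "B", "C"])
top_score, top_a, top_b = ranking[0]
```

In this toy proteome the pair A and B receives the highest score, so it would be the first candidate interaction to examine. No labels are used at any point, which is what makes the screening zero-shot.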

First released as a bioRxiv preprint in August 2025 and subsequently published in PNAS (2026), ProteomeLM sits at the intersection of protein language modeling and systems biology, extending the foundation-model paradigm from individual proteins to the organization of whole proteomes.

Key Features

Proteome-scale context: Processes the full set of proteins in an organism at once, yielding representations that reflect proteome-wide functional constraints rather than properties of an isolated sequence.
Zero-shot PPI detection: Attention coefficients spontaneously capture protein-protein interactions without any interaction supervision, supporting interactome-wide screening orders of magnitude faster than coevolution-based approaches.
Cross-taxa generalization: Trained across the tree of life, the model transfers to eukaryotic and prokaryotic proteomes alike, including organisms unseen during training.
Supervised task heads: ProteomeLM-PPI combines embeddings and attention coefficients for state-of-the-art supervised PPI prediction, while ProteomeLM-Ess predicts gene essentiality across diverse taxa.
Four model sizes: Released as XS, S, M, and L checkpoints (5.66M to 328M parameters), letting users trade off accuracy against compute and memory.
Easy access: Distributed through a proteomelm pip package and four HuggingFace checkpoints, with notebooks demonstrating PPI prediction workflows.
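As a rough guide to the accuracy-versus-memory trade-off between the four sizes, the sketch below estimates the memory needed for the weights alone and picks the largest checkpoint that fits a budget. The parameter counts are the ones reported for the XS, S, M, and L checkpoints; the helper names and the float32 assumption (4 bytes per parameter) are illustrative, and activations or upstream embedding models are not accounted for.

```python
# Parameter counts as stated for the four released checkpoints.
CHECKPOINT_PARAMS = {"XS": 5.66e6, "S": 36.9e6, "M": 112e6, "L": 328e6}


def weight_memory_mb(n_params, bytes_per_param=4):
    """Approximate memory for the weights alone (float32 by default)."""
    return n_params * bytes_per_param / 1e6


def largest_checkpoint_within(budget_mb, bytes_per_param=4):
    """Return the largest checkpoint whose weights fit in budget_mb, or None."""
    fitting = [
        name
        for name, n_params in CHECKPOINT_PARAMS.items()
        if weight_memory_mb(n_params, bytes_per_param) <= budget_mb
    ]
    # The dict is ordered XS < S < M < L, so the last fitting entry is the largest.
    return fitting[-1] if fitting else None


choice = largest_checkpoint_within(500)
```

With a 500 MB budget for float32 weights, the M checkpoint is the largest that fits; the L checkpoint needs roughly 1.3 GB for its weights.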

Technical Details

ProteomeLM is a transformer encoder (DistilBERT-style architecture) trained from scratch on roughly 32,000 annotated proteomes spanning all domains of life, with orthologous groups drawn from OrthoDB. Rather than consuming raw amino acids, the model takes per-protein embeddings from an upstream protein language model (ESM-2 / ESM-C) as input tokens, then learns to reconstruct masked protein embeddings from their proteomic context, identifying the correct embedding among a candidate list of orthologs. The released checkpoints range from ProteomeLM-XS (5.66M parameters) through S (36.9M), M (112M), and L (328M). The associated training data is published as a HuggingFace dataset (the OrthoDB-derived ProteomeLM-dataset). On supervised benchmarks, ProteomeLM-PPI achieves state-of-the-art protein-protein interaction prediction across multiple species, and ProteomeLM-Ess generalizes gene-essentiality prediction across taxa.
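The candidate-identification step can be pictured as a classification over ortholog embeddings. The sketch below scores a predicted embedding against each candidate and computes a softmax cross-entropy for the true one. The use of cosine similarity, the `temperature` parameter, and the function names are assumptions chosen for illustration; the text above does not specify the exact scoring function or loss used in training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def candidate_loss(predicted, candidates, true_index, temperature=0.1):
    """Cross-entropy of picking the true embedding among ortholog candidates.

    Returns (loss, index of the best-scoring candidate). Similarity measure
    and temperature are illustrative assumptions.
    """
    logits = [cosine(predicted, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    loss = log_z - logits[true_index]
    best = max(range(len(logits)), key=logits.__getitem__)
    return loss, best


# Toy example: a 2-dimensional predicted embedding and three candidates.
predicted = [1.0, 0.1]
candidates = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
loss, best = candidate_loss(predicted, candidates, true_index=1)
```

Here the prediction lies closest to the second candidate, so the loss is near zero when that candidate is the true masked embedding and grows if a different candidate is the true one.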

Applications

ProteomeLM is aimed at computational and systems biologists who need fast, genome-wide maps of which proteins interact and which genes are essential. Because it runs zero-shot from a proteome without requiring deep alignments, it is well suited to screening interactomes in non-model organisms, prioritizing candidate complexes for experimental validation, and predicting essential genes to inform antimicrobial or synthetic-biology targets. A follow-up application from the same group fine-tunes ProteomeLM-S into ProteomeLM-HPI to predict host-pathogen interactions, improving discrimination on nine of ten pathogen datasets across viruses and bacteria including HSV-1, Y. pestis, S. enterica, C. trachomatis, and M. tuberculosis.
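When a screen is benchmarked against known interacting and non-interacting pairs, discrimination is commonly summarized by the area under the ROC curve. The sketch below computes it from scores and binary labels using the rank-sum identity. Treating AUROC as the metric is an assumption made for illustration, since the text above does not name the metric behind the reported improvements, and the scores and labels in the example are made up.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a random positive pair outscores a random
    negative pair, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy benchmark: five candidate pairs, two of which truly interact.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of interacting from non-interacting pairs. The pairwise loop is quadratic, which is fine for small benchmarks but would need a sort-based version for interactome-scale evaluation.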

Impact

By reframing protein representation learning at the scale of the whole proteome, ProteomeLM offers a complementary axis to sequence- and structure-based models and shows that interaction information can emerge from self-supervised training without explicit labels. Its speed advantage over coevolution methods makes interactome-wide screening tractable for many genomes, and the open weights, pip package, and training data lower the barrier to adoption. The rapid emergence of downstream applications such as host-pathogen interaction prediction illustrates its value as a reusable proteome-scale foundation model, though, as with any embedding-driven predictor, results remain hypotheses that benefit from experimental confirmation.

Citations

ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa

Malbranke, C., et al. (2026) ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein–protein interactions and gene essentiality across taxa. PNAS.

DOI: 10.1073/pnas.2524201123

ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein-protein interactions and gene essentiality across taxa

Preprint

Malbranke, C., et al. (2025) ProteomeLM: A proteome-scale language model enables accurate and rapid prediction of protein-protein interactions and gene essentiality across taxa. bioRxiv.

DOI: 10.1101/2025.08.01.668221

Citations

Total Citations: 928

Influential: 97

References: 55

GitHub

Stars: 36

Forks: 7

Open Issues: 6

Contributors: 2

Last Push: 1 month ago

Language: Jupyter Notebook

License: Apache-2.0

HuggingFace

Downloads: 43

Likes: 0

Last Modified: 11 months ago

Openness

bio.rodeo openness: Fully open · usable and reproducible

Score: 64 (Partial)

Usability (can I run it?): 69

Reproducibility (can I retrain it?): 54

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository, Research Paper (published article and preprint), HuggingFace Model, Dataset
