bio.rodeo

Protein Words Language Model

Tsinghua University

A hierarchical sequence-based protein representation that encodes proteins as discrete 'words' for zero-shot functional discovery and generative design.

Released: February 2026

Proteins are organized hierarchically: residues assemble into local structural and functional modules that recur across the proteome, much as letters form words. The Protein Words Language Model (internally "ProtWord"), developed in Guangshuo Ou's lab at Tsinghua University's Department of Basic Medical Sciences and posted to bioRxiv in February 2026, operationalizes this analogy. Rather than modeling proteins one residue at a time, it discretizes protein space into a learnable vocabulary derived from the evolutionary record, encoding each protein as a sequence of discrete "words."

This hierarchical, sequence-based view lets the model capture higher-order structural and functional signals that residue-level language models tend to miss, while remaining purely sequence-driven (no experimental structures are required at inference). The authors position the discrete vocabulary as both an analytical lens on protein organization and evolution and a generative substrate for design.

The work is notable for pairing a representation-learning contribution with direct wet-lab validation. The same vocabulary that supports zero-shot functional inference also drives autoregressive generation of synthetic proteins, and both capabilities are tested experimentally rather than left as in-silico benchmarks.

#Key Features

  • Discrete protein vocabulary: Proteins are tokenized into a learned set of recurring "words" via vector quantization, yielding a compact, interpretable alphabet that exposes higher-order organization beyond single residues.
  • Hierarchical, sequence-only representation: The model operates on sequence alone yet recovers module-level structural and functional patterns, making it applicable to uncharacterized proteins without solved structures.
  • Zero-shot functional discovery: Embeddings transfer directly to remote homology detection and mutation effect prediction, and the same zero-shot workflow surfaced ADMAP1, a previously uncharacterized regulator of sperm motility.
  • Autoregressive generative design: A generative model over the word vocabulary produces novel synthetic proteins, demonstrated by designs that retain native fold architecture at low sequence identity.
  • In-vivo validation: Predictions are confirmed experimentally, including a CRISPR-Cas9 mouse knockout and cell-based functional assays.
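
To make the "discrete vocabulary" idea concrete, the following is a minimal sketch of the quantization step in a generic VQ-VAE: each continuous embedding is replaced by the index of its nearest codebook vector. It is not the authors' implementation. The function name `quantize`, the 2-D toy codebook, and the example embeddings are all hypothetical, and the real model's encoder, distance metric, and segmenting of the sequence are not described here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def quantize(embeddings, codebook):
    """Map each embedding vector to the index of its nearest codebook vector."""
    tokens = []
    for vec in embeddings:
        # Nearest neighbour under Euclidean distance; ties go to the lowest index.
        nearest = min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))
        tokens.append(nearest)
    return tokens


# Toy codebook with 3 "words" (hypothetical values, for illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical encoder outputs for four consecutive sequence segments.
embeddings = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.05, 0.0)]

tokens = quantize(embeddings, codebook)
print(tokens)  # the protein is now a short sequence of codebook indices
```

In this toy setup the first and last segments map to the same index, which is the sense in which a "word" can recur within or across proteins.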

#Technical Details

The representation is built with a vector-quantized variational autoencoder (VQ-VAE) that maps protein sequence into a discrete codebook of "words," pretrained on the broad evolutionary diversity of UniRef50. On standard benchmarks for remote homology detection and mutation effect prediction, the learned representation is highly competitive with established residue-level baselines. For generation, an autoregressive model over the word vocabulary was fine-tuned on homologs of the F-actin-severing protein cofilin; it produced synthetic variants that preserved the characteristic cofilin-fold architecture despite sharing less than 60% sequence identity with any known natural protein. Several of these designs disrupted the intracellular actin filament network in cells, consistent with native cofilin activity.

The preprint does not report a parameter count, although the released backbone is named ProtWord-150M. Pretrained weights are released on Zenodo (record 18640019), including the ProtWord-150M backbone, the VQ-VAE codebook (8,192 tokens), and the fine-tuned latent GPT, under a "ProtWord Open RAIL-M" use-restriction license; the accompanying data (evolutionary frequency matrices for 54 species, variant-effect evaluation sets, and CASA tracking data) is released under CC BY 4.0. The GitHub code repository cited in the preprint was not publicly accessible at the time of review, so training and inference code remains unavailable.
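
For readers unfamiliar with sequence-identity figures like the one quoted for the cofilin designs, here is a minimal sketch of one common definition: the share of identical residues among aligned columns where neither sequence has a gap. This is an assumption for illustration; the identity definition and alignment method behind the reported figure are not given here, and other conventions (for example, dividing by alignment length or by the shorter sequence) give different numbers. The function `percent_identity` and both toy sequences are hypothetical and are not real cofilin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0  # nothing to compare
    return 100.0 * matches / compared


# Toy pre-aligned sequences (made up for illustration).
design = "MKV-LSDA"
natural = "MRVGLTDA"

identity = percent_identity(design, natural)
print(f"{identity:.1f}% identity")
```

The sketch expects sequences that have already been aligned; producing the alignment itself is a separate step that it does not cover.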

#Applications

The model serves protein biologists and computational researchers who need to prioritize and characterize uncharacterized proteins or design functional variants from sequence alone. The zero-shot discovery workflow identified ADMAP1 as a regulator of sperm motility, validated by a CRISPR-Cas9 knockout mouse and by immunofluorescence showing co-localization with microtubules and the ciliary marker ARL13B. The generative workflow targets enzyme- and cytoskeleton-modulating design, producing synthetic actin-remodeling proteins that retain F-actin severing activity. Together, these illustrate use cases ranging from functional annotation to de novo protein engineering.

#Impact

By framing proteins as sequences of discrete words, this work advances a hierarchical alternative to residue-level protein language models and demonstrates that such representations can drive genuine biological discovery rather than benchmark gains alone. The combination of a novel ciliary protein discovered and validated in vivo with functional de novo designs validated in cells is an unusually complete sequence-to-phenotype loop for a representation-learning paper. Pretrained weights and evaluation data are released on Zenodo, though the weights carry a use-restriction (RAIL-M) license and the cited code repository was not publicly accessible at the time of review, so fully open reproduction still awaits an accessible implementation. Even so, the approach offers a compelling template for connecting interpretable protein vocabularies to both discovery and design.

Citation

Guo, Z., et al. (2026) Hierarchical latent representations reveal protein organization for functional discovery and design. bioRxiv.

DOI: 10.64898/2026.02.14.705947

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 24 (Closed)
Usability (can I run it?): 11
Reproducibility (can I retrain it?): 39
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

autoregressive, cell_biology, de_novo_design, generative, protein_design, proteomics, representation_learning, self_supervised, variant_effect_prediction, vq_vae, zero_shot

Resources

Research Paper, Dataset, Model Weights (Zenodo)