Protein Words Language Model

Protein language model that encodes sequences as discrete words from a learned vocabulary for zero-shot function inference and protein design.

Released: February 2026

Proteins are organized hierarchically: residues assemble into local structural and functional modules that recur across the proteome, much as letters form words. The Protein Words Language Model (internally "ProtWord"), developed in Guangshuo Ou's lab at Tsinghua University's Department of Basic Medical Sciences and posted to bioRxiv in February 2026, operationalizes this analogy. Rather than modeling proteins one residue at a time, it discretizes protein space into a learnable vocabulary derived from the evolutionary record, encoding each protein as a sequence of discrete "words."

This hierarchical, sequence-based view lets the model capture higher-order structural and functional signals that residue-level language models tend to miss, while remaining purely sequence-driven (no experimental structures are required at inference). The authors position the discrete vocabulary as both an analytical lens on protein organization and evolution and a generative substrate for design.

The work is notable for pairing a representation-learning contribution with direct wet-lab validation. The same vocabulary that supports zero-shot functional inference also drives autoregressive generation of synthetic proteins, and both capabilities are tested experimentally rather than left as in-silico benchmarks.

Key Features

Discrete protein vocabulary: Proteins are tokenized into a learned set of recurring "words" via vector quantization, yielding a compact, interpretable alphabet that exposes higher-order organization beyond single residues.
Hierarchical, sequence-only representation: The model operates on sequence alone yet recovers module-level structural and functional patterns, making it applicable to uncharacterized proteins without solved structures.
Zero-shot functional discovery: Embeddings transfer directly to remote homology detection and mutation effect prediction, and the same zero-shot approach surfaced ADMAP1, a previously uncharacterized regulator of sperm motility.
Autoregressive generative design: A generative model over the word vocabulary produces novel synthetic proteins, demonstrated by designs that retain native fold architecture at low sequence identity.
In-vivo validation: Predictions are confirmed experimentally, including a CRISPR-Cas9 mouse knockout and cell-based functional assays.
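
To make the vector-quantization step in the feature list concrete, here is a minimal sketch of nearest-codebook assignment, the core lookup in any VQ-VAE. It is not the authors' implementation: the function names, the toy two-dimensional codebook, and the example latent vectors are all assumptions made for illustration, and the released codebook is described as holding 8,192 tokens rather than four.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest codebook entry.

    Each returned index plays the role of one discrete "word".
    """
    word_ids = []
    for z in latents:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(z, codebook[k]))
        word_ids.append(nearest)
    return word_ids


# Toy codebook: 4 entries of dimension 2 (hypothetical values).
TOY_CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs, one vector per sequence segment.
toy_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

word_ids = quantize(toy_latents, TOY_CODEBOOK)
print(word_ids)  # [0, 3, 2]
```

In a trained VQ-VAE the codebook entries are learned jointly with the encoder and decoder; the sketch covers only the lookup that turns continuous encoder outputs into a sequence of word indices.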

Technical Details

The representation is built with a vector-quantized variational autoencoder (VQ-VAE) that maps protein sequence into a discrete codebook of "words," pretrained on the broad evolutionary diversity of UniRef50. On standard benchmarks, the learned representation is competitive with established residue-level baselines for remote homology detection and mutation effect prediction.

For generation, an autoregressive model over the word vocabulary was fine-tuned on homologs of the F-actin-severing protein cofilin. It produced synthetic variants that preserved the characteristic cofilin-fold architecture despite sharing less than 60% sequence identity with any known natural protein. Several of these designs disrupted the intracellular actin filament network in cells, consistent with native cofilin activity.

Pretrained weights are released on Zenodo (record 18640019) under a "ProtWord Open RAIL-M" use-restriction license. The release includes the ProtWord-150M backbone, the VQ-VAE codebook (8,192 tokens), and the fine-tuned latent GPT. The preprint itself does not report a parameter count. The accompanying data (evolutionary frequency matrices for 54 species, variant-effect evaluation sets, and CASA tracking data) is released under CC BY 4.0. The GitHub code repository cited in the preprint was not publicly accessible at the time of review, so training and inference code remains unavailable.
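
The sequence-identity criterion quoted above for the cofilin designs (less than 60% identity to any known natural protein) can be sketched as a filter over pre-aligned sequence pairs. This is an assumption-laden illustration, not the preprint's procedure. The alignment method is not described here, so the sketch takes already-aligned strings as input. Identity is computed as matches over non-gap-gap columns, which is only one of several common conventions. The function names and toy sequences are invented.

```python
from typing import Iterable, Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences ('-' marks a gap).

    Convention (an assumption): identical columns divided by all columns
    in which at least one sequence has a residue.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # skip columns that are gaps in both sequences
        columns += 1
        if a == b:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


def max_identity(pairs: Iterable[Tuple[str, str]]) -> float:
    """Highest identity of one design against several aligned natural sequences."""
    return max(percent_identity(design, natural) for design, natural in pairs)


def is_novel(pairs: Iterable[Tuple[str, str]], threshold: float = 60.0) -> bool:
    """True if the design stays below the identity threshold for every natural sequence."""
    return max_identity(pairs) < threshold


# Toy aligned pairs (design, natural); these are made-up strings, not cofilin.
toy_pairs = [
    ("MKTA-LLVG", "MRTAELIVG"),  # 6 of 9 columns identical
    ("MKTALLVG-", "MSSALPVGK"),  # 5 of 9 columns identical
]

print(round(max_identity(toy_pairs), 2))
print(is_novel(toy_pairs))
```

With these toy inputs the design would fail the filter, because its closest natural sequence is about 66.7% identical. In practice the aligned pairs would come from an alignment tool run against a reference database.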

Applications

The model serves protein biologists and computational researchers who need to prioritize and characterize uncharacterized proteins or design functional variants from sequence alone. The zero-shot discovery workflow identified ADMAP1 as a regulator of sperm motility, validated by a CRISPR-Cas9 knockout mouse and by immunofluorescence showing co-localization with microtubules and the ciliary marker ARL13B. The generative workflow targets the design of enzyme- and cytoskeleton-modulating proteins, and produced synthetic actin-remodeling proteins whose effects in cells are consistent with F-actin severing activity. Together, these results illustrate use cases ranging from functional annotation to de novo protein engineering.

Impact

By framing proteins as sequences of discrete words, this work advances a hierarchical alternative to residue-level protein language models and demonstrates that such representations can drive genuine biological discovery rather than benchmark gains alone. The combination of a novel ciliary protein discovered and validated in vivo with functional de novo designs validated in cells is an unusually complete sequence-to-phenotype loop for a representation-learning paper. Pretrained weights and evaluation data are released on Zenodo, though the weights carry a use-restriction (RAIL-M) license and the cited code repository was not publicly accessible at review, so fully open reproduction still awaits an accessible implementation. Even so, the approach offers a compelling template for connecting interpretable protein vocabularies to both discovery and design.

Citation

Hierarchical latent representations reveal protein organization for functional discovery and design

Guo, Z., et al. (2026) Hierarchical latent representations reveal protein organization for functional discovery and design. bioRxiv.

DOI: 10.64898/2026.02.14.705947

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Overall score: 24 (Closed)

Usability (can I run it?): 11

Reproducibility (can I retrain it?): 39

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper

Dataset

Model Weights (Zenodo)
