A hierarchical sequence-based protein representation that encodes proteins as discrete 'words' for zero-shot functional discovery and generative design.
Proteins are organized hierarchically: residues assemble into local structural and functional modules that recur across the proteome, much as letters form words. The Protein Words Language Model (internally "ProtWord"), developed in Guangshuo Ou's lab at Tsinghua University's Department of Basic Medical Sciences and posted to bioRxiv in February 2026, operationalizes this analogy. Rather than modeling proteins one residue at a time, it discretizes protein space into a learnable vocabulary derived from the evolutionary record, encoding each protein as a sequence of discrete "words."
This hierarchical, sequence-based view lets the model capture higher-order structural and functional signals that residue-level language models tend to miss, while remaining purely sequence-driven (no experimental structures are required at inference). The authors position the discrete vocabulary as both an analytical lens on protein organization and evolution and a generative substrate for design.
The work is notable for pairing a representation-learning contribution with direct wet-lab validation. The same vocabulary that supports zero-shot functional inference also drives autoregressive generation of synthetic proteins, and both capabilities are tested experimentally rather than left as in-silico benchmarks.
The representation is built with a vector-quantized variational autoencoder (VQ-VAE) that maps protein sequence into a discrete codebook of "words," pretrained on the broad evolutionary diversity of UniRef50. On standard benchmarks, the learned representation is highly competitive with established residue-level baselines for remote homology detection and mutation effect prediction. For generation, an autoregressive model over the word vocabulary was fine-tuned on homologs of the F-actin-severing protein cofilin; it produced synthetic variants that preserved the characteristic cofilin-fold architecture despite sharing less than 60% sequence identity with any known natural protein. Several of these designs disrupted the intracellular actin filament network in cells, consistent with native cofilin activity. The preprint does not state a parameter count, although the name of the released backbone, ProtWord-150M, suggests a model of roughly 150 million parameters. Pretrained weights are released on Zenodo (record 18640019), including the ProtWord-150M backbone, the VQ-VAE codebook (8,192 tokens), and the fine-tuned latent GPT, under a "ProtWord Open RAIL-M" use-restriction license; the accompanying data (evolutionary frequency matrices for 54 species, variant-effect evaluation sets, and CASA tracking data) is released under CC BY 4.0. The GitHub code repository cited in the preprint was not publicly accessible at the time of review, so training and inference code remains unavailable.
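Because the training and inference code is not available, the discretization step can only be illustrated generically. The sketch below shows the standard VQ-VAE quantization rule, in which each continuous latent vector is assigned the index of its nearest codebook entry under squared Euclidean distance. It is an assumption that ProtWord uses this rule; the function names, the two-dimensional toy latents, and the four-entry toy codebook are hypothetical and stand in for the real encoder output and the released 8,192-token codebook.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    This is the generic VQ-VAE assignment rule, not ProtWord's confirmed
    implementation. Each returned index plays the role of one discrete "word".
    """
    tokens = []
    for z in latents:
        distances = [squared_distance(z, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook with 4 entries (the released codebook has 8,192).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs for three positions along a protein.
toy_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

toy_tokens = quantize(toy_latents, toy_codebook)
print(toy_tokens)  # [0, 3, 2]
```

In this framing, a protein becomes a short sequence of integer tokens, and an autoregressive model of the kind described above would be trained over those tokens rather than over individual residues.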
The model serves protein biologists and computational researchers who need to prioritize and characterize uncharacterized proteins or design functional variants from sequence alone. The zero-shot discovery workflow identified ADMAP1 as a regulator of sperm motility, a finding validated in a CRISPR-Cas9 knockout mouse and by immunofluorescence showing co-localization with microtubules and the ciliary marker ARL13B. The generative workflow targets enzyme- and cytoskeleton-modulating design, producing synthetic actin-remodeling proteins that retain F-actin severing activity. Together, these results illustrate use cases ranging from functional annotation to de novo protein engineering.
By framing proteins as sequences of discrete words, this work advances a hierarchical alternative to residue-level protein language models and demonstrates that such representations can drive genuine biological discovery rather than benchmark gains alone. The combination of a novel ciliary protein discovered and validated in vivo with functional de novo designs validated in cells is an unusually complete sequence-to-phenotype loop for a representation-learning paper. Pretrained weights and evaluation data are released on Zenodo, though the weights carry a use-restriction (RAIL-M) license and the cited code repository was not publicly accessible at review, so fully open reproduction still awaits an accessible implementation. Even so, the approach offers a compelling template for connecting interpretable protein vocabularies to both discovery and design.
Guo, Z., et al. (2026) Hierarchical latent representations reveal protein organization for functional discovery and design. bioRxiv.
DOI: 10.64898/2026.02.14.705947