A 16B-parameter framework for de novo protein design from natural language, converting text descriptions into functional protein sequences via two-stage structure-conditioned generation.
Pinal is a generative framework developed at Westlake University that translates natural language descriptions directly into de novo protein sequences. Rather than requiring a template structure, a family of homologous sequences, or specialized bioinformatics expertise, Pinal accepts plain-text instructions, such as a functional description or a protein family name, and produces novel candidate sequences designed to match that specification. This positions natural language as a programmable interface for biology and could open protein engineering to researchers outside the specialist community.
The core challenge Pinal addresses is the combinatorial complexity of protein sequence space. Direct text-to-sequence generation would require the model to reason about function, fold, and biophysics in a single step. Pinal sidesteps this by decomposing the problem into two stages: it first generates a backbone structure from the text input, then designs an amino acid sequence conditioned jointly on the generated structure and the original language description. Working through a structural intermediate constrains the search to physically plausible conformations while preserving the functional grounding encoded in the text.
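The data flow of this two-stage decomposition can be sketched in a few lines of Python. This is a minimal illustration, not Pinal's actual code or API: `toy_text_to_structure` and `toy_structure_to_sequence` are hypothetical stand-ins for the text-to-structure and sequence-design models, and they return random placeholder values rather than meaningful designs. The point is only the order of conditioning: each backbone is sampled from the text alone, and each sequence is then sampled given both that backbone and the original text.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
Backbone = List[Tuple[float, float, float]]  # one placeholder coordinate per residue


@dataclass
class Design:
    text: str
    backbone: Backbone
    sequence: str


def toy_text_to_structure(text: str, length: int, rng: random.Random) -> Backbone:
    """Stage 1 stand-in (hypothetical): text -> backbone coordinates."""
    return [(rng.random(), rng.random(), rng.random()) for _ in range(length)]


def toy_structure_to_sequence(text: str, backbone: Backbone, rng: random.Random) -> str:
    """Stage 2 stand-in (hypothetical): (backbone, text) -> sequence.

    A real model would condition on both inputs; this stub only uses the
    backbone length so that the output has one residue per position.
    """
    return "".join(rng.choice(AMINO_ACIDS) for _ in backbone)


def design(text: str, n_backbones: int, n_seqs_per_backbone: int,
           length: int, seed: int = 0) -> List[Design]:
    """Sample several backbones per prompt, then several sequences per backbone."""
    rng = random.Random(seed)
    designs = []
    for _ in range(n_backbones):
        backbone = toy_text_to_structure(text, length, rng)
        for _ in range(n_seqs_per_backbone):
            sequence = toy_structure_to_sequence(text, backbone, rng)
            designs.append(Design(text, backbone, sequence))
    return designs


if __name__ == "__main__":
    for d in design("an alcohol dehydrogenase", 2, 3, 12):
        print(d.sequence)
```

Sampling several sequences per backbone, as the nested loop does, is a design choice of this sketch; it simply makes visible that the sequence stage sees both the structure and the text.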
Pinal was developed by a team led by Fajie Yuan at Westlake University, with collaborators from the Hong Kong University of Science and Technology (Guangzhou) and the University of Washington. The preprint was first posted to bioRxiv in August 2024 and substantially revised through 2025, and it reports experimental validation showing wet-lab activity for designed proteins across multiple functional classes.
Pinal comprises four trained components: T2struc-1.2B and T2struc-15B (text-to-structure modules), and SaProt-T and SaProt-O (structure-conditioned sequence design modules, 760M parameters each). The text-to-structure modules are trained on a synthetic corpus of 1.7 billion protein-text pairs constructed to ground functional language in structural context. These modules output backbone coordinate sets, which are passed alongside the original text query to SaProt for sequence generation. Candidate designs are ranked by multiple quality metrics, including pLDDT confidence from structure validation (via AlphaFold or ESMFold), Predicted Aligned Error (PAE), and sequence-text similarity scores computed with a cross-modal alignment model (ProTrek). The full pipeline is implemented in Python 3.8 and distributed with pretrained weights via Hugging Face; dependency installation takes approximately 30 minutes.
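As a rough illustration of how such metrics might be combined, the sketch below filters candidates on structural confidence and then sorts the survivors by text similarity. Everything here is an assumption made for the example: the `Candidate` class, the `rank_candidates` function, the thresholds (pLDDT of at least 70, mean PAE of at most 10), and the sort order are hypothetical and are not taken from the Pinal codebase. The metric values in the usage example are invented, not reported results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    plddt: float            # predicted-structure confidence, 0-100, higher is better
    pae: float              # mean Predicted Aligned Error, lower is better
    text_similarity: float  # sequence-text similarity score, higher is better


def rank_candidates(candidates: List[Candidate],
                    min_plddt: float = 70.0,
                    max_pae: float = 10.0) -> List[Candidate]:
    """Filter on structural confidence, then sort by text similarity.

    The thresholds and tie-breaking order are illustrative assumptions.
    """
    kept = [c for c in candidates if c.plddt >= min_plddt and c.pae <= max_pae]
    # Highest text similarity first; ties broken by higher pLDDT, then lower PAE.
    return sorted(kept, key=lambda c: (-c.text_similarity, -c.plddt, c.pae))


if __name__ == "__main__":
    # Invented values, for demonstration only.
    pool = [
        Candidate("design_a", plddt=85.0, pae=5.0, text_similarity=0.60),
        Candidate("design_b", plddt=65.0, pae=4.0, text_similarity=0.90),
        Candidate("design_c", plddt=90.0, pae=6.0, text_similarity=0.75),
    ]
    for c in rank_candidates(pool):
        print(c.name, c.plddt, c.pae, c.text_similarity)
```

Filtering before sorting keeps a candidate with high text similarity but low structural confidence from reaching the top of the list; a weighted composite score would be an equally reasonable alternative.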
Benchmark comparisons reported in the preprint indicate that Pinal outperforms prior approaches, including the concurrent ESM3 model, on computational metrics for novel protein generation. Critically, functional validation was performed in the wet lab: among proteins designed de novo using only natural language prompts, Pinal produced functional alcohol dehydrogenases (4 of 8 tested showed catalytic activity), a functional fluorescent protein, a PET hydrolase with catalytic turnover, and a metabolic H-protein with 1.7-fold higher measured activity than its natural counterpart.
Pinal is suited to researchers who want to design novel proteins with specified functional properties but lack a close structural or sequence template. Synthetic biologists can specify desired enzymatic activities in plain text and receive candidate sequences for experimental screening, which could shorten the design-build-test cycle. Pharmaceutical researchers can prompt for protein scaffolds with target-binding or catalytic profiles. The framework is particularly valuable where no natural protein family closely matches the desired function, because the generative approach is not constrained by homology. An online server further lowers the barrier for wet-lab scientists who lack the computational infrastructure to run the full pipeline locally.
Pinal represents a meaningful advance in the emerging field of language-guided protein design, demonstrating for the first time that a natural language interface can drive experimental protein discovery across multiple functional classes. Its wet-lab validation, including a designed H-protein with higher measured activity than its natural counterpart, takes it beyond purely computational benchmarks and establishes a credible proof of concept for text-programmable biology. The work is a preprint as of early 2026 and has not yet undergone formal peer review; independent replication will be important for establishing the generality of the results. Key limitations include the reliance on synthetic training data for the text-structure grounding, the computational cost of the 15B-parameter variant, and the inherent difficulty of specifying nuanced biophysical requirements through natural language alone. Nonetheless, Pinal offers a compelling demonstration that large-scale multimodal training can bridge human intent and molecular function.
Dai, F., et al. (2025) Toward De Novo Protein Design from Natural Language. bioRxiv.
DOI: 10.1101/2024.08.01.606258