Overview

Pinal is a generative framework developed at Westlake University that translates natural language descriptions directly into de novo protein sequences. Rather than requiring a template structure, a family of homologous sequences, or specialized bioinformatics expertise, Pinal accepts plain-text instructions, such as a functional description or a protein family name, and produces novel candidate proteins computationally designed to match that specification. This positions natural language as a programmable interface for biology and is a step toward making protein engineering accessible beyond specialist practitioners.

The core challenge Pinal addresses is the combinatorial complexity of protein sequence space. Direct text-to-sequence generation would require the model to simultaneously reason about function, fold, and biophysics in a single step. Pinal sidesteps this by decomposing the problem into two stages: first generating a backbone structure from the text input, then designing an amino acid sequence conditioned jointly on both the generated structure and the original language description. Operating through structural intermediates constrains the search space to physically plausible conformations while preserving the functional grounding encoded in the text.
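The two-stage decomposition described above can be sketched as plain function composition. This is a minimal illustration only: `design_protein`, `Design`, and the two toy stand-in models are hypothetical names invented here, not Pinal's actual API, and the string "backbone" is a placeholder for whatever structural representation the real text-to-structure module emits.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: a text-to-structure model maps a prompt to a backbone,
# and a sequence designer maps (backbone, prompt) to an amino acid sequence.
TextToStructure = Callable[[str], str]
StructureToSequence = Callable[[str, str], str]


@dataclass
class Design:
    prompt: str
    backbone: str   # placeholder for a backbone representation
    sequence: str   # designed amino acid sequence


def design_protein(
    prompt: str,
    text_to_structure: TextToStructure,
    structure_to_sequence: StructureToSequence,
    n_candidates: int = 1,
) -> List[Design]:
    """Stage 1: text -> backbone. Stage 2: (backbone, text) -> sequence."""
    designs = []
    for _ in range(n_candidates):
        backbone = text_to_structure(prompt)
        # The sequence designer sees BOTH the structure and the original text.
        sequence = structure_to_sequence(backbone, prompt)
        designs.append(Design(prompt, backbone, sequence))
    return designs


# Toy stand-ins so the sketch runs without any model weights.
def toy_text_to_structure(prompt: str) -> str:
    return "H" * len(prompt.split())  # one placeholder token per word


def toy_structure_to_sequence(backbone: str, prompt: str) -> str:
    return "A" * len(backbone)  # one residue per backbone token


designs = design_protein(
    "alcohol dehydrogenase",
    toy_text_to_structure,
    toy_structure_to_sequence,
    n_candidates=2,
)
```

The point of the sketch is the data flow: the prompt is consumed twice, once to propose a structure and once more alongside that structure when the sequence is designed.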

Pinal was developed by a team led by Fajie Yuan at Westlake University, with collaborators from the Hong Kong University of Science and Technology (Guangzhou) and the University of Washington. The preprint was first posted to bioRxiv in August 2024 and substantially revised through 2025, with experimental validation demonstrating wet-lab activity for designed proteins across multiple functional classes.

Key Features

Natural language input: Accepts free-form functional descriptions as input, from protein family names to descriptive functional phrases, without requiring sequence templates or structural databases as a starting point.
Two-stage structure-conditioned generation: Text is first used to generate a protein backbone via the T2struc module, and sequences are then designed conditioned on both structure and language using SaProt-T, which steers outputs toward biophysically plausible designs.
15.5-billion-parameter foundation model: The flagship T2struc-15B component contains 15.5 billion parameters and is trained on 1.7 billion synthetic protein-text pairs, grounding functional language in structural and biophysical principles at scale.
Multiple scale options: T2struc-1.2B (1.2 billion parameters) generates ten proteins in approximately one minute on a single A40 GPU, while T2struc-15B (which requires 40 GB or more of GPU memory) offers higher design quality.
Protein redesign support: The SaProt-O component supports sequence editing and redesign tasks, enabling users to refine or optimize existing sequences rather than starting purely from scratch.
Online server access: An interactive web server at denovo-pinal.com provides access to the full pipeline without local installation, broadening accessibility to wet-lab researchers.
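As a rough sanity check on the memory figure quoted for T2struc-15B, the weight footprint can be estimated from the parameter count. The listing does not state the numeric precision used at inference, so the 2-bytes-per-parameter (16-bit) case below is an assumption, and `weight_memory_gb` is a helper invented for this sketch. The estimate covers weights only; activations and other runtime buffers add to it.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# Parameter counts are taken from the feature list above.
MODELS = {"T2struc-1.2B": 1.2e9, "T2struc-15B": 15.5e9}

for name, n_params in MODELS.items():
    half = weight_memory_gb(n_params, 2)  # assumed 16-bit weights
    full = weight_memory_gb(n_params, 4)  # 32-bit weights, for comparison
    print(f"{name}: ~{half:.1f} GB at 16-bit, ~{full:.1f} GB at 32-bit")
```

Under the 16-bit assumption the 15.5-billion-parameter model needs about 31 GB for weights alone, which is consistent with a stated requirement of 40 GB or more once runtime overhead is included.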

Technical Details

Pinal comprises four trained components: T2struc-1.2B and T2struc-15B (text-to-structure modules), and SaProt-T and SaProt-O (structure-conditioned sequence design modules, 760M parameters each). The text-to-structure modules are trained on a synthetic corpus of 1.7 billion protein-text pairs constructed to ground functional language in structural context. These modules output backbone coordinate sets, which are passed alongside the original text query to SaProt for sequence generation. Candidate designs are ranked by multiple quality metrics, including pLDDT confidence from structure validation (via AlphaFold or ESMFold), Predicted Aligned Error (PAE), and sequence-text similarity scores computed using a cross-modal alignment model (ProTrek). The full pipeline is implemented in Python 3.8 and distributed with pretrained weights via Hugging Face; dependency installation takes approximately 30 minutes.
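The ranking step can be illustrated with a small filter-then-sort sketch. The listing names the metrics but not how they are combined, so the scheme here (filter on pLDDT and PAE, then sort by text similarity) is an assumption, as are the threshold values, the `Candidate` class, and the `filter_and_rank` function. The scores are made-up numbers, not outputs of AlphaFold, ESMFold, or ProTrek.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    sequence: str
    plddt: float       # structure confidence, 0-100, higher is better
    pae: float         # predicted aligned error, lower is better
    text_score: float  # sequence-text similarity, higher is better


def filter_and_rank(
    candidates: List[Candidate],
    min_plddt: float = 70.0,  # illustrative threshold, not from the source
    max_pae: float = 10.0,    # illustrative threshold, not from the source
) -> List[Candidate]:
    """Drop low-confidence structures, then sort by text similarity."""
    kept = [c for c in candidates if c.plddt >= min_plddt and c.pae <= max_pae]
    return sorted(kept, key=lambda c: c.text_score, reverse=True)


# Made-up scores for illustration.
candidates = [
    Candidate("SEQ_A", plddt=85.0, pae=5.0, text_score=0.4),
    Candidate("SEQ_B", plddt=60.0, pae=4.0, text_score=0.9),   # fails pLDDT
    Candidate("SEQ_C", plddt=78.0, pae=8.0, text_score=0.7),
    Candidate("SEQ_D", plddt=90.0, pae=15.0, text_score=0.8),  # fails PAE
]

ranked = filter_and_rank(candidates)
```

Filtering on structural confidence before ranking on text similarity keeps a design that matches the prompt well but is predicted to fold poorly from reaching the top of the list; a weighted composite score would be an equally reasonable alternative.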

Benchmark comparisons reported in the preprint indicate that Pinal outperforms prior approaches, including the concurrent ESM3 model, on computational metrics for novel protein generation. Critically, functional validation was performed in the wet lab. Among proteins designed de novo using only natural language prompts, Pinal produced functional alcohol dehydrogenases (4 of 8 tested showed catalytic activity), a functional fluorescent protein, a PET hydrolase with catalytic turnover, and a metabolic H-protein that surpassed its natural counterpart with 1.7-fold higher measured activity.

Applications

Pinal is suited for researchers seeking to design novel proteins with specified functional properties without access to a close structural or sequence template. Synthetic biologists can specify desired enzymatic activities in plain text and receive candidate sequences for experimental screening, dramatically compressing the design-build-test cycle. Pharmaceutical researchers can prompt for protein scaffolds with target-binding or catalytic profiles. The framework is particularly valuable in cases where no natural protein family closely matches the desired function, as the generative approach is not constrained by homology. The online server further lowers the barrier for wet-lab scientists who lack computational infrastructure to run the full pipeline locally.

Impact

Pinal represents a meaningful advance in the emerging field of language-guided protein design, demonstrating for the first time that a natural language interface can drive experimental protein discovery across multiple functional classes. Its wet-lab validation, including a designed enzyme that outperforms its natural counterpart, takes it beyond purely computational benchmarks and establishes a credible proof of concept for text-programmable biology. The work is a preprint as of early 2026 and has not yet undergone formal peer review; independent replication will be important for establishing the generality of the results. Key limitations include the reliance on synthetic training data for the text-structure grounding, the computational cost of the 15B-parameter variant, and the inherent difficulty of specifying nuanced biophysical requirements through natural language alone. Nonetheless, Pinal offers a compelling demonstration that large-scale multimodal training can bridge human intent and molecular function.

Citation

Toward De Novo Protein Design from Natural Language
