Overview

ProteinDT is a multimodal protein design framework developed by Shengchao Liu and collaborators from UC Berkeley, NVIDIA Research, Mila, and other institutions, and published in Nature Machine Intelligence in 2025. It addresses a fundamental challenge in protein engineering: letting researchers specify desired protein properties in natural language and receive valid, functional protein sequences in return, without requiring deep expertise in structural biology or sequence design.

The core innovation is a contrastive alignment between two modalities, natural language descriptions and protein sequences, that places both in a common representation space. Once the modalities are aligned, text descriptions can steer both sequence generation and editing. This contrasts with purely sequence-based or structure-based design tools, which require design goals to be encoded in a domain-specific form rather than stated in natural language.
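To make the idea of contrastive alignment concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss, a common objective for aligning two modalities. It is an illustration under stated assumptions, not ProteinDT's actual training code: the function names, the temperature value, and the toy 2-D embeddings are all hypothetical, and the exact objective used by the framework may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]

def symmetric_info_nce(text_embs, protein_embs, temperature=0.1):
    """Average the text-to-protein and protein-to-text cross-entropy losses.

    Row i of text_embs is assumed to describe row i of protein_embs.
    The temperature of 0.1 is an illustrative choice, not a published setting.
    """
    texts = [l2_normalize(v) for v in text_embs]
    proteins = [l2_normalize(v) for v in protein_embs]
    # logits[i][j] is the scaled cosine similarity between text i and protein j.
    logits = [
        [sum(a * b for a, b in zip(t, p)) / temperature for p in proteins]
        for t in texts
    ]
    n = len(logits)
    text_to_protein = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    protein_to_text = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_protein + protein_to_text)

# Toy 2-D embeddings, made up purely for illustration.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each protein sits near its own text
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # each protein sits near the wrong text

aligned_loss = symmetric_info_nce(text_embs, aligned)
misaligned_loss = symmetric_info_nce(text_embs, misaligned)
```

The loss is small when each description is closest to its own protein and large when the pairs are swapped, which is the pressure that pulls matching text-protein pairs together in the shared space.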

The training dataset, SwissProtCLAP, consists of approximately 441,000 text-protein pairs extracted from the SwissProt subset of UniProt, grounding the model's language understanding in curated scientific annotations covering a wide range of protein functions, stability properties, and binding characteristics.

Key Features

Text-Guided Sequence Generation: Generates novel protein sequences conditioned on natural language descriptions, achieving over 90% accuracy on text-to-protein generation benchmarks across diverse functional categories.
Zero-Shot Protein Editing: Modifies existing protein sequences based on textual instructions across 12 distinct editing tasks, including stability enhancement, structure optimization, and peptide-binding modifications, without task-specific fine-tuning.
ProteinCLAP Alignment: A contrastive learning module (Contrastive Language and Protein Pretraining) aligns protein and text representations in a shared embedding space, so that a representation from one modality can be compared with, or mapped to, the other.
Multi-Task Property Prediction: Predicts protein properties by leveraging joint text-protein representations, achieving superior performance on 4 out of 6 benchmarks spanning structural, stability, and binding properties.
Dual Editing Strategies: Supports both latent interpolation (smooth transitions between representations) and latent optimization (direct optimization toward a target description), giving users flexibility for different design scenarios.
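As a rough illustration of the latent interpolation strategy listed above, the sketch below blends a protein representation with a text representation using a single mixing coefficient. The linear form, the name `interpolate`, and the parameter `theta` are assumptions made for illustration; the framework may use a different scheme, such as spherical interpolation, and the blended vector would still need to be passed to a decoder to obtain a sequence.

```python
def interpolate(protein_rep, text_rep, theta):
    """Linearly blend two representations of equal length.

    theta = 0 returns the protein representation unchanged;
    theta = 1 returns the text representation.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if len(protein_rep) != len(text_rep):
        raise ValueError("representations must have the same dimension")
    return [(1.0 - theta) * p + theta * t for p, t in zip(protein_rep, text_rep)]

# Toy vectors, made up for illustration.
protein_rep = [1.0, 0.0, 2.0]
text_rep = [0.0, 1.0, 0.0]

unchanged = interpolate(protein_rep, text_rep, 0.0)
midpoint = interpolate(protein_rep, text_rep, 0.5)
```

Small values of `theta` keep the edit close to the starting protein, while larger values follow the text description more aggressively.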

Technical Details

ProteinDT is built around a three-stage pipeline. First, ProteinCLAP trains a contrastive alignment between a protein sequence encoder and a text encoder, using paired SwissProtCLAP data to bring semantically related text-protein pairs close in embedding space. Second, a protein facilitator module learns to generate protein sequence representations from text embeddings alone, effectively bridging the gap between language and the protein embedding space learned by ProteinCLAP. Third, a conditional protein decoder translates these representations into full amino acid sequences.
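The data flow at inference time can be sketched as a chain of three callables. Everything below is a hypothetical stand-in: `design_from_text` and the `toy_*` functions are invented for illustration and carry no learned parameters, whereas the real components are trained neural networks. The sketch only shows how the stages connect.

```python
from typing import Callable, List

Vector = List[float]

def design_from_text(
    prompt: str,
    encode_text: Callable[[str], Vector],
    facilitator: Callable[[Vector], Vector],
    decode: Callable[[Vector], str],
) -> str:
    """Chain the stages: text -> text embedding -> protein representation -> sequence."""
    text_embedding = encode_text(prompt)        # text encoder aligned in stage 1
    protein_rep = facilitator(text_embedding)   # stage 2: map text space to protein space
    return decode(protein_rep)                  # stage 3: emit an amino acid sequence

# Toy stand-ins (assumptions, not the real models).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def toy_encode_text(prompt: str) -> Vector:
    """Pretend embedding: one number per word (its length)."""
    return [float(len(word)) for word in prompt.split()]

def toy_facilitator(text_embedding: Vector) -> Vector:
    """Pretend facilitator: a fixed affine map."""
    return [2.0 * x + 1.0 for x in text_embedding]

def toy_decode(protein_rep: Vector) -> str:
    """Pretend decoder: index into the residue alphabet."""
    return "".join(AMINO_ACIDS[int(x) % len(AMINO_ACIDS)] for x in protein_rep)

sequence = design_from_text(
    "thermally stable enzyme", toy_encode_text, toy_facilitator, toy_decode
)
```

Keeping the stages as separate callables mirrors the description above: the facilitator can be trained or swapped independently of the encoder and decoder, because it only needs to map one embedding space to another.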

For editing tasks, ProteinDT operates in the shared latent space rather than at the sequence level. Latent optimization iteratively adjusts a protein's representation toward a target text description, then decodes the modified representation back to a sequence. Benchmarking shows the best hit ratio across 12 zero-shot editing tasks evaluated under 21 distinct evaluation methods. For property prediction, the model uses the aligned cross-modal embeddings as features, outperforming sequence-only baselines on tasks that require an understanding of stability and binding context.
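The latent optimization loop described above can be sketched as gradient ascent on the cosine similarity between a protein representation and a target text embedding. This is a simplified illustration under assumptions: the objective, the step size, the step count, and all function names are hypothetical, and a practical system would likely add a term that keeps the edited representation close to the original before decoding it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_gradient(z, target):
    """Analytic gradient of cosine(z, target) with respect to z."""
    z_norm = math.sqrt(sum(x * x for x in z))
    t_norm = math.sqrt(sum(y * y for y in target))
    sim = cosine(z, target)
    return [t / (z_norm * t_norm) - sim * x / (z_norm ** 2) for x, t in zip(z, target)]

def latent_optimize(protein_rep, text_rep, steps=50, lr=0.5):
    """Move the protein representation toward the text embedding by gradient ascent.

    steps and lr are illustrative defaults, not published hyperparameters.
    """
    z = list(protein_rep)
    for _ in range(steps):
        grad = cosine_gradient(z, text_rep)
        z = [x + lr * g for x, g in zip(z, grad)]
    return z

# Toy vectors, made up for illustration.
start = [1.0, 0.0, 0.5]
target = [0.0, 1.0, 0.5]
edited = latent_optimize(start, target)

similarity_before = cosine(start, target)
similarity_after = cosine(edited, target)
```

After the loop, the edited representation points in nearly the same direction as the target description; in the full framework this vector would then be handed to the decoder to produce the edited sequence.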

Applications

ProteinDT is primarily aimed at drug discovery and protein therapeutics, where researchers need to engineer candidates with specified binding affinities, thermostability profiles, or reduced immunogenicity. By accepting natural language specifications (for example, "high binding affinity to target receptor with improved thermal stability"), the framework lowers the barrier to entry for wet-lab biologists who are not fluent in sequence-based design. It is also applicable to synthetic biology workflows for designing novel enzymes with custom catalytic properties, and to functional annotation tasks where predicted property scores supplement experimental characterization.

Impact

ProteinDT was among the first frameworks to demonstrate that natural language can serve as a practical interface for protein sequence design, influencing subsequent work on multimodal biological foundation models. Published in Nature Machine Intelligence, the work established contrastive text-protein alignment as a viable pretraining strategy and opened a research direction distinct from structure-conditioned design methods. A key limitation is that ProteinDT operates at the sequence level and does not predict or optimize three-dimensional structure directly; users who need structural validation of generated sequences must pair it with a structure prediction tool such as ESMFold or AlphaFold 2. The reliance on SwissProt annotations also means that the model may generalize poorly to protein functions that curated databases cover only thinly.
