Conditioned pLM for Generative IDP Design

LAAS-CNRS / Centre de Biochimie Structurale

Encoder-decoder Transformer that generates intrinsically disordered protein sequences conditioned on target conformational-ensemble descriptors.

Released: April 2026

Intrinsically disordered proteins and protein regions (IDPs/IDRs) lack a stable folded structure and instead populate heterogeneous conformational ensembles. They mediate signaling, phase separation, and molecular recognition, yet their rational design has lagged behind that of folded proteins because there is no single target structure to design toward — the design objective is a distribution of conformations described by ensemble-averaged biophysical properties. This model, developed by Laure Carrière, Alexandre Huyghe, Mátyás Pajkos, Pau Bernadó, and Juan Cortés at LAAS-CNRS (Université de Toulouse) and the Centre de Biochimie Structurale (Montpellier), and posted to bioRxiv in April 2026, addresses that gap with a conditioned generative protein language model.

The approach frames IDR design as the inverse of ensemble prediction: rather than predicting an ensemble from a sequence, it generates amino acid sequences predicted to realize a specified set of ensemble descriptors. The architecture is a Transformer encoder-decoder that maps numerical conformational and physicochemical descriptors (the conditioning input) to disordered sequences (the output). This is explicitly the opposite direction from sequence-to-ensemble methods such as IDPForge, which predict conformational ensembles given a sequence.

A central finding is methodological rather than purely architectural: by training across datasets spanning roughly two orders of magnitude in size, the authors show that accurate control over the target conformational and physicochemical properties emerges only at large data scale. The work therefore frames disordered-protein design as a data-centric problem — "data is the limit" — in which the availability of sequence/ensemble training pairs, not model capacity alone, is the binding constraint.
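A data-scaling study of this kind can be set up with log-spaced training-set sizes and nested subsets, so that each smaller set is contained in the larger ones. The sketch below is illustrative only: the size range (1,000 to 100,000 pairs), the number of points, and the helper names `log_spaced_sizes` and `nested_subsets` are assumptions, not values or code from the preprint.

```python
import random


def log_spaced_sizes(n_min, n_max, n_points):
    """Return n_points training-set sizes spaced evenly on a log scale."""
    if n_points < 2 or n_min <= 0 or n_max <= n_min:
        raise ValueError("need n_points >= 2 and 0 < n_min < n_max")
    ratio = n_max / n_min
    return [round(n_min * ratio ** (i / (n_points - 1))) for i in range(n_points)]


def nested_subsets(n_total, sizes, seed=0):
    """Map each size to a prefix of one shuffled index list (nested subsets)."""
    order = list(range(n_total))
    random.Random(seed).shuffle(order)
    return {n: order[:n] for n in sizes}


# Hypothetical range covering two orders of magnitude.
sizes = log_spaced_sizes(1_000, 100_000, 5)
subsets = nested_subsets(100_000, sizes)
```

Using nested subsets keeps the comparison across scales cleaner, because any change in property control reflects added data rather than a different random sample.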

Key Features

Descriptor-conditioned generation: Sequences are produced conditioned on target ensemble-level descriptors (e.g., chain compaction and physicochemical properties), giving users control over the biophysical behavior of the designed IDR rather than only its identity.
Inverse-direction design: The model performs sequence generation given desired ensemble properties — the complement of sequence-to-ensemble predictors like IDPForge — making it directly usable as a design tool.
Encoder-decoder Transformer: A numeric descriptor encoder feeds a sequence decoder, a clean conditional formulation that turns continuous biophysical targets into discrete amino acid outputs.
Data-scaling analysis: Systematic training across dataset sizes spanning two orders of magnitude quantifies how property control improves with data, an unusually explicit treatment of the data bottleneck in IDR design.
Ensemble-descriptor objective: Because IDPs have no single native fold, the model is supervised against ensemble-averaged properties, matching how disordered proteins are actually characterized experimentally and computationally.
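To make "ensemble-averaged" concrete, the following minimal sketch averages a per-conformer descriptor over an ensemble. The radius of gyration is used here as one common measure of chain compaction; whether the authors use exactly this descriptor, and with which weighting, is an assumption. Coordinates are treated as equal-mass beads and conformers as equally weighted.

```python
import math


def radius_of_gyration(coords):
    """Radius of gyration of one conformer, treating all beads as equal mass."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


def ensemble_average(conformers, descriptor):
    """Unweighted mean of a per-conformer descriptor over the ensemble."""
    values = [descriptor(c) for c in conformers]
    return sum(values) / len(values)


# Toy two-bead conformers with radii of gyration 1.0 and 2.0.
conformers = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
mean_rg = ensemble_average(conformers, radius_of_gyration)
```

A vector of such averages, one entry per descriptor, is the kind of numeric conditioning input the model description refers to.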

Technical Details

The model is a Transformer encoder-decoder that ingests numerical descriptors of a target conformational ensemble and autoregressively generates a disordered amino acid sequence consistent with those descriptors. Training data are built from computed conformational ensembles of disordered regions, leveraging the authors' prior ensemble-modeling expertise (coil-database and AlphaFold-derived ensemble methods from the same groups), with each training pair coupling a sequence to the biophysical descriptors of its ensemble. The headline experiments vary training-set size across roughly two orders of magnitude and evaluate how faithfully generated sequences reproduce the requested conformational and physicochemical properties; the authors report that reliable property control is achieved only at the largest data scales tested, supporting their data-centric conclusion. Because the work is a 2026 preprint, exact parameter counts, released weights, and a public code repository had not been confirmed at the time of cataloging.
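The autoregressive, descriptor-conditioned decoding described above can be sketched as a sampling loop over p(next residue | prefix, descriptors). The code below is not the authors' model: `toy_next_token_logits` is a hypothetical stand-in for a trained decoder, and the descriptor keys `length` and `frac_charged` are invented for illustration. Only the shape of the loop is the point.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHARGED = set("DEKR")
EOS = "<eos>"
VOCAB = list(AMINO_ACIDS) + [EOS]
NEG = -1e9  # effectively zero probability after softmax


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_token_logits(descriptors, prefix):
    """Hypothetical stand-in for a decoder: biases charged residues
    toward the target fraction and emits EOS at the target length."""
    frac = min(max(descriptors["frac_charged"], 1e-6), 1 - 1e-6)
    done = len(prefix) >= descriptors["length"]
    n_neutral = len(AMINO_ACIDS) - len(CHARGED)
    logits = []
    for tok in VOCAB:
        if tok == EOS:
            logits.append(0.0 if done else NEG)
        elif done:
            logits.append(NEG)
        elif tok in CHARGED:
            logits.append(math.log(frac / len(CHARGED)))
        else:
            logits.append(math.log((1 - frac) / n_neutral))
    return logits


def generate(descriptors, next_token_logits, max_len=200, seed=0):
    """Sample one sequence token by token, conditioned on the descriptors."""
    rng = random.Random(seed)
    prefix = []
    for _ in range(max_len):
        probs = softmax(next_token_logits(descriptors, prefix))
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == EOS:
            break
        prefix.append(token)
    return "".join(prefix)


seq = generate({"length": 40, "frac_charged": 0.5}, toy_next_token_logits)
```

In a real system the logits function would be a trained network whose encoder has read the descriptor vector; swapping it in would leave `generate` unchanged.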

Applications

The model is aimed at protein engineers and structural biologists who need to design disordered linkers, spacers, and IDRs with prescribed biophysical behavior — for example tuning chain compaction or other ensemble properties when building multidomain constructs, biosensors, or phase-separation-prone components. By specifying target descriptors instead of a target structure, researchers can generate candidate sequences whose predicted ensembles match a design objective, complementing experimental and simulation-based ensemble characterization in an iterative design loop.
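One step of such an iterative design loop is ranking generated candidates by how closely their predicted descriptors match the targets. The sketch below assumes a dictionary of hypothetical predictor functions; here a trivial sequence-level property (`charged_fraction`) stands in for what would in practice be an ensemble predictor or a simulation.

```python
def charged_fraction(seq):
    """Toy stand-in for a property predictor: fraction of D, E, K, R residues."""
    return sum(1 for aa in seq if aa in "DEKR") / len(seq)


def rank_candidates(candidates, targets, predictors):
    """Sort candidates by mean absolute relative error to the target descriptors."""
    def error(seq):
        errs = [
            abs(predictors[name](seq) - target) / max(abs(target), 1e-9)
            for name, target in targets.items()
        ]
        return sum(errs) / len(errs)

    return sorted(candidates, key=error)


# Hypothetical candidates and target; descriptor name is illustrative.
candidates = ["GSGSGSGSGS", "EKEKGSGSGS", "EKEKEKEKEK"]
ranked = rank_candidates(
    candidates,
    targets={"frac_charged": 0.4},
    predictors={"frac_charged": charged_fraction},
)
```

The top-ranked sequences would then go on to more expensive ensemble characterization, and any mismatches could inform the next round of target descriptors.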

Impact

Most generative protein-design tools target folded structures; this work extends conditioned generative modeling to the disordered proteome, where roughly a third of eukaryotic proteomic content resides and where rational design has been largely out of reach. Its most transferable contribution is the explicit demonstration that descriptor-conditioned IDR design is feasible but data-limited, reframing progress in the field around the construction of large sequence/ensemble datasets. Because the work is a recent preprint without confirmed released weights, its downstream adoption remains to be established. It nonetheless stakes out the inverse-design counterpart to sequence-to-ensemble predictors and offers a clear, data-centric agenda for the emerging area of disordered-protein engineering.

Citation

Generative design of intrinsically disordered proteins based on conditioned protein language models: Data is the limit

Carrière, L., et al. (2026) Generative design of intrinsically disordered proteins based on conditioned protein language models: Data is the limit. bioRxiv.

DOI: 10.64898/2026.04.14.718363

Citations

Total citations: 0

Influential citations: 0

References: 36

Openness

bio.rodeo openness: Closed · low usability and reproducibility

10 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 10

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

Research Paper
