LAAS-CNRS / Centre de Biochimie Structurale
An encoder-decoder Transformer that generates intrinsically disordered protein sequences conditioned on target conformational-ensemble biophysical descriptors.
Intrinsically disordered proteins and protein regions (IDPs/IDRs) lack a stable folded structure and instead populate heterogeneous conformational ensembles. They mediate signaling, phase separation, and molecular recognition, yet their rational design has lagged behind that of folded proteins because there is no single target structure to design toward: the design objective is a distribution of conformations described by ensemble-averaged biophysical properties. This model, developed by Laure Carrière, Alexandre Huyghe, Mátyás Pajkos, Pau Bernadó, and Juan Cortés at LAAS-CNRS (Université de Toulouse) and the Centre de Biochimie Structurale (Montpellier) and posted to bioRxiv in April 2026, addresses that gap with a conditioned generative protein language model.
The approach frames IDR design as the inverse of ensemble prediction: rather than predicting an ensemble from a sequence, it generates amino acid sequences predicted to realize a specified set of ensemble descriptors. The architecture is a Transformer encoder-decoder that maps numerical conformational and physicochemical descriptors (the conditioning input) to disordered sequences (the output). This is explicitly the opposite direction from sequence-to-ensemble methods such as IDPForge, which predict conformational ensembles given a sequence.
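To make the direction of the mapping concrete, the following minimal Python sketch shows how one sequence/descriptor training pair might be laid out for an encoder-decoder: the descriptor vector is the encoder input, and the tokenized sequence, shifted by one position, supplies the decoder input and target. The descriptor names (`radius_of_gyration_nm`, `net_charge_per_residue`), the special tokens, and the helper `make_training_pair` are illustrative assumptions and are not taken from the preprint.

```python
from dataclasses import dataclass

# Hypothetical vocabulary: three special tokens followed by the 20 standard residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


@dataclass
class TrainingPair:
    encoder_input: list[float]   # numerical ensemble descriptors (conditioning)
    decoder_input: list[int]     # <bos> + residue ids (teacher forcing)
    decoder_target: list[int]    # residue ids + <eos> (next-token labels)


def make_training_pair(
    descriptors: dict[str, float],
    descriptor_order: list[str],
    sequence: str,
) -> TrainingPair:
    """Couple a sequence to the descriptors of its ensemble (assumed layout)."""
    # A fixed ordering keeps each descriptor at the same encoder position.
    encoder_input = [float(descriptors[name]) for name in descriptor_order]
    residue_ids = [VOCAB[aa] for aa in sequence]
    return TrainingPair(
        encoder_input=encoder_input,
        decoder_input=[VOCAB["<bos>"]] + residue_ids,
        decoder_target=residue_ids + [VOCAB["<eos>"]],
    )


# Illustrative values only; these are not data from the preprint.
DESCRIPTOR_ORDER = ["radius_of_gyration_nm", "net_charge_per_residue"]
pair = make_training_pair(
    {"radius_of_gyration_nm": 2.1, "net_charge_per_residue": 0.05},
    DESCRIPTOR_ORDER,
    "GSEKR",
)
```

At generation time the same layout would be used in reverse: only `encoder_input` is supplied, and residues are sampled one at a time until `<eos>` is produced.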
A central finding is methodological rather than purely architectural: by training on datasets spanning roughly two orders of magnitude in size, the authors show that accurate control over the target conformational and physicochemical properties emerges only at large data scale. The work therefore frames disordered-protein design as a data-centric problem ("data is the limit") in which the availability of sequence/ensemble training pairs, not model capacity alone, is the binding constraint.
The Transformer encoder-decoder ingests numerical descriptors of a target conformational ensemble and autoregressively generates a disordered amino acid sequence consistent with those descriptors. Training data are built from computed conformational ensembles of disordered regions, drawing on the authors' prior ensemble-modeling work (coil-database and AlphaFold-derived ensemble methods from the same groups); each training pair couples a sequence to the biophysical descriptors of its ensemble. The headline experiments vary training-set size across roughly two orders of magnitude and evaluate how faithfully generated sequences reproduce the requested conformational and physicochemical properties. The authors report that reliable property control is achieved only at the largest data scales tested, which supports their data-centric conclusion. Because the work is a 2026 preprint, exact parameter counts, released weights, and a public code repository were not confirmed at the time of cataloging.
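As a rough illustration of what evaluating property control can involve, the sketch below computes three simple sequence-level physicochemical descriptors for a generated sequence and reports the absolute error against requested targets. The choice of descriptors, the function names, and the example values are assumptions made here for illustration; the preprint's actual descriptor set is not specified in this entry, and conformational descriptors would additionally require an ensemble predictor that is not shown.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_descriptors(sequence: str) -> dict[str, float]:
    """Compute simple physicochemical descriptors of one sequence."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    # Simplification: histidine is treated as neutral.
    positive = sum(aa in "KR" for aa in sequence)
    negative = sum(aa in "DE" for aa in sequence)
    return {
        "fraction_charged": (positive + negative) / n,
        "net_charge_per_residue": (positive - negative) / n,
        "mean_hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in sequence) / n,
    }


def control_errors(
    target: dict[str, float], achieved: dict[str, float]
) -> dict[str, float]:
    """Absolute deviation of each achieved descriptor from its target."""
    return {name: abs(achieved[name] - target[name]) for name in target}


# Illustrative target and generated sequence; not data from the preprint.
target = {"fraction_charged": 0.5, "net_charge_per_residue": 0.0}
achieved = sequence_descriptors("GSEEKKRDSG")
errors = control_errors(target, achieved)
```

Averaging such errors over many generated sequences, separately for models trained at each dataset size, would give one way to track how property control changes with data scale.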
The model is aimed at protein engineers and structural biologists who need to design disordered linkers, spacers, and IDRs with prescribed biophysical behavior, for example by tuning chain compaction or other ensemble properties when building multidomain constructs, biosensors, or phase-separation-prone components. By specifying target descriptors instead of a target structure, researchers can generate candidate sequences whose predicted ensembles match a design objective, complementing experimental and simulation-based ensemble characterization in an iterative design loop.
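One step of such a design loop, ranking generated candidates by how closely their predicted descriptors match the design objective, could be sketched as follows. Everything here is hypothetical: `rank_candidates` is an illustrative helper, and `toy_predictor` is a deliberately trivial stand-in for whatever ensemble predictor or simulation a real workflow would use.

```python
from typing import Callable

Descriptors = dict[str, float]


def rank_candidates(
    candidates: list[str],
    target: Descriptors,
    predict_descriptors: Callable[[str], Descriptors],
    scales: Descriptors,
) -> list[tuple[float, str]]:
    """Sort candidates by scaled Euclidean distance to the target descriptors."""
    scored = []
    for seq in candidates:
        predicted = predict_descriptors(seq)
        # Dividing by a per-descriptor scale keeps descriptors with
        # different units comparable.
        distance = sum(
            ((predicted[name] - target[name]) / scales[name]) ** 2
            for name in target
        ) ** 0.5
        scored.append((distance, seq))
    return sorted(scored)


def toy_predictor(sequence: str) -> Descriptors:
    """Stand-in predictor: composition fractions only (not an ensemble model)."""
    n = len(sequence)
    return {
        "fraction_charged": sum(aa in "DEKR" for aa in sequence) / n,
        "fraction_proline": sequence.count("P") / n,
    }


# Illustrative candidates and target; not data from the preprint.
candidates = ["GSGSGSGSGS", "EKEKPEKPGS", "PPPPPEKGSG"]
target = {"fraction_charged": 0.5, "fraction_proline": 0.2}
scales = {"fraction_charged": 1.0, "fraction_proline": 1.0}
ranked = rank_candidates(candidates, target, toy_predictor, scales)
best_distance, best_sequence = ranked[0]
```

In practice the top-ranked sequences would go on to simulation or experiment, and the measured ensemble properties would inform the next round of targets.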
Most generative protein-design tools target folded structures; this work extends conditioned generative modeling to the disordered proteome, where roughly a third of eukaryotic proteomic content resides and where rational design has been largely out of reach. Its most transferable contribution is the explicit demonstration that descriptor-conditioned IDR design is feasible but data-limited, reframing progress in the field around the construction of large sequence/ensemble datasets. Because it is a recent preprint without confirmed released weights, its downstream adoption remains to be established, but it stakes out the inverse-design counterpart to sequence-to-ensemble predictors and offers a clear, data-centric agenda for the emerging area of disordered-protein engineering.