Autoregressive language model trained on 37 million intrinsically disordered region sequences from the AlphaFold Database, generating IDR sequences conditioned on surrounding structured context.
IDiom is an autoregressive language model purpose-built for designing intrinsically disordered protein regions (IDRs), posted to bioRxiv in mid-April 2026. Trained on 37 million IDR sequences extracted from the AlphaFold Database, IDiom generates IDR sequences conditioned on surrounding structured context using a fill-in-the-middle augmentation strategy that lets it complete a disordered region given the flanking folded domains.
IDiom addresses a major blind spot in modern protein design: structure-based generative models such as RFdiffusion are inherently biased toward folded proteins and struggle to design sequences that lack stable folds, even though IDRs make up roughly 30 percent of the human proteome and play essential roles in signaling, regulation, and condensate formation.
IDiom uses a decoder-only transformer trained autoregressively on the IDR corpus, with a fill-in-the-middle augmentation following the InCoder/Code Llama recipe. This allows the model to generate an IDR with explicit conditioning on both its N-terminal and C-terminal flanking sequences. The training data is selected by thresholding AlphaFold per-residue confidence (pLDDT), with low-confidence stretches treated as likely-disordered segments.
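The two data-preparation ideas described above, pLDDT-based segment selection and fill-in-the-middle rearrangement, can be illustrated with a minimal sketch. This is not the preprint's pipeline: the sentinel tokens, the pLDDT cutoff of 50, the minimum segment length, and the flank width are all assumptions chosen for illustration, and the function names are hypothetical.

```python
# Hypothetical sentinel tokens; the preprint's actual vocabulary may differ.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def low_plddt_segments(plddt, threshold=50.0, min_len=10):
    """Return half-open (start, end) index pairs for runs with pLDDT < threshold.

    threshold and min_len are illustrative assumptions, not values from the preprint.
    """
    segments, start = [], None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i  # a low-confidence run begins here
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(plddt) - start >= min_len:
        segments.append((start, len(plddt)))
    return segments


def fim_example(sequence, start, end, flank=64):
    """Rearrange a sequence into prefix-suffix-middle order.

    A left-to-right model trained on this layout sees both flanks
    before it has to generate the disordered middle.
    """
    prefix = sequence[max(0, start - flank):start]
    suffix = sequence[end:end + flank]
    middle = sequence[start:end]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


# Toy input: two confident flanks around a low-confidence stretch.
seq = "MKTAYIAKQR" + "GSGSGSEESPKKDDSG" + "LVVLAGHWER"
plddt = [90.0] * 10 + [35.0] * 16 + [88.0] * 10

segs = low_plddt_segments(plddt)
print(segs)
print(fim_example(seq, *segs[0]))
```

At inference time the same layout would be truncated after the middle sentinel, so the model completes the IDR given the flanking domains.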
Benchmarks reported in the preprint compare generated sequences with natural IDR distributions on amino-acid composition, charge patterning, hydropathy, and short linear motif occurrence. The model is also evaluated by its perplexity on held-out natural IDRs.
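As a rough illustration of the kinds of sequence-level statistics involved, the sketch below computes amino-acid composition, two simple charge descriptors, mean Kyte-Doolittle hydropathy, and perplexity from per-token log-probabilities. The function names are hypothetical and the preprint's exact metrics are not specified here. In particular, the fraction of charged residues and the net charge per residue summarize overall charge content only and are a stand-in for, not an implementation of, a charge-patterning measure.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def charge_summary(seq):
    """Fraction of charged residues (fcr) and net charge per residue (ncpr).

    Counts K/R as positive and D/E as negative; histidine is ignored.
    """
    pos = sum(seq.count(aa) for aa in "KR")
    neg = sum(seq.count(aa) for aa in "DE")
    n = len(seq)
    return {"fcr": (pos + neg) / n, "ncpr": (pos - neg) / n}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over the 20 standard residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


idr = "GSGSGSEESPKKDDSG"
print(composition(idr))
print(charge_summary(idr))
print(mean_hydropathy(idr))
```

Comparing the distributions of such statistics between generated and natural IDRs gives a first check on whether a generator reproduces the characteristic low hydropathy and high charge content of disordered regions.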
IDiom is suited to protein engineers designing chimeric or modular proteins that require IDR linkers, condensate-forming domains, or signaling-tail regions. It is also useful for synthetic biology applications involving designed intrinsically disordered proteins (IDPs), such as engineered phase-separating proteins or modular signaling scaffolds. The fill-in-the-middle interface fits workflows that combine structure-based scaffold generation (for example, RFdiffusion) with sequence-based IDR completion.
IDiom is presented as the first language model purpose-built for IDP/IDR design and addresses a long-standing gap in the protein-design toolkit. Its release expands the practical reach of generative protein design beyond the structured proteome and complements structure-based generative models by handling the disordered fraction they cannot address.
Liu, J., et al. (2026) Generative design of intrinsically disordered protein regions with IDiom. bioRxiv.
DOI: 10.64898/2026.04.10.717777