bio.rodeo

Protein

IDiom

Chinese Academy of Sciences

Autoregressive language model trained on 37 million intrinsically disordered region sequences from the AlphaFold Database, generating IDR sequences conditioned on surrounding structured context.

Released: 2026

Overview

IDiom is an autoregressive language model purpose-built for designing intrinsically disordered protein regions (IDRs), posted to bioRxiv in mid-April 2026. Trained on 37 million IDR sequences extracted from the AlphaFold Database, IDiom generates IDR sequences conditioned on surrounding structured context using a fill-in-the-middle augmentation strategy that lets it complete a disordered region given the flanking folded domains.

IDiom addresses a major blind spot in modern protein design: structure-based generative models such as RFdiffusion are biased toward stable folds and struggle to design sequences that lack them, even though IDRs make up roughly 30 percent of the human proteome and play essential roles in signaling, regulation, and condensate formation.

Key Features

  • IDR-specialized training corpus: 37M IDR sequences drawn from the AlphaFold Database using pLDDT-based disorder calls.
  • Fill-in-the-middle generation: Trained to fill in IDR sequences conditioned on surrounding folded context, matching the practical use case for IDR design in chimeric or modular proteins.
  • Sequence-only generative model: Operates on sequence alone, without requiring 3D coordinates, making it complementary to structure-based design tools.
  • Captures compositional bias and motifs: Generates sequences with realistic amino-acid composition and short linear motif distributions characteristic of natural IDRs.
  • Open preprint and code: bioRxiv preprint with code release for community use.
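
The pLDDT-based disorder calls mentioned in the first bullet can be pictured as a thresholding pass over per-residue confidence scores. The following is a minimal sketch, not the authors' pipeline: the cutoff of 50, the minimum segment length of 10, and the helper name `find_low_plddt_segments` are all illustrative assumptions rather than values from the preprint.

```python
from typing import List, Tuple


def find_low_plddt_segments(
    plddt: List[float],
    threshold: float = 50.0,  # assumed cutoff, not taken from the preprint
    min_length: int = 10,     # assumed minimum segment length, not from the preprint
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where pLDDT stays below the threshold."""
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current low-confidence run began, if any
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i
        else:
            # A confident residue closes the current run; keep it if long enough.
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None
    # Handle a low-confidence run that extends to the end of the protein.
    if start is not None and len(plddt) - start >= min_length:
        segments.append((start, len(plddt)))
    return segments
```

Each returned range can then be sliced out of the protein sequence, with the residues on either side serving as the structured context.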

Technical Details

IDiom uses a decoder-only transformer trained autoregressively on the IDR corpus, with a fill-in-the-middle augmentation following the InCoder/CodeLlama recipe. This allows the model to produce IDRs with explicit conditioning on N- and C-terminal flanking sequences. The training data is filtered using AlphaFold per-residue confidence (pLDDT) thresholds to identify likely-disordered segments.
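
As a rough illustration of fill-in-the-middle augmentation, the sketch below rearranges a protein so that both flanks precede the disordered region, which lets a left-to-right model see the N- and C-terminal context before generating the IDR. The sentinel token strings, the prefix-suffix-middle ordering, and the function names are assumptions for illustration; the preprint's actual vocabulary and formatting may differ.

```python
# Hypothetical sentinel tokens; IDiom's real vocabulary is not specified here.
PRE, SUF, MID, EOS = "<PRE>", "<SUF>", "<MID>", "<EOS>"


def to_fim_example(sequence: str, idr_start: int, idr_end: int) -> str:
    """Build a training string with the IDR (half-open range) moved to the end."""
    if not 0 <= idr_start < idr_end <= len(sequence):
        raise ValueError("IDR range must lie inside the sequence")
    prefix = sequence[:idr_start]         # N-terminal flank
    middle = sequence[idr_start:idr_end]  # the disordered region to be learned
    suffix = sequence[idr_end:]           # C-terminal flank
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOS}"


def to_fim_prompt(n_flank: str, c_flank: str) -> str:
    """Build an inference prompt; the model would continue after the MID token."""
    return f"{PRE}{n_flank}{SUF}{c_flank}{MID}"
```

At generation time, the text sampled after the middle sentinel (up to the end token) would be spliced back between the two flanks to give the full designed protein.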

Benchmarks reported in the preprint compare generated sequences against natural IDR distributions on amino-acid composition, charge patterning, hydropathy, and short linear motif occurrence, and against held-out natural IDRs on perplexity.
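
Several of these quantities are simple per-sequence statistics. The sketch below shows one common way to compute them, assuming the Kyte-Doolittle hydropathy scale, counting K/R as +1 and D/E as -1 (histidine ignored), and taking perplexity as the exponential of the mean negative log-likelihood. The preprint may use different scales or definitions, and true charge-patterning metrics such as kappa are not implemented here.

```python
import math
from collections import Counter
from typing import Dict, List

# Kyte-Doolittle hydropathy scale (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq: str) -> Dict[str, float]:
    """Fraction of each amino acid present in a non-empty sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def charge_summary(seq: str) -> Dict[str, float]:
    """Fraction of charged residues (fcr) and net charge per residue (ncpr)."""
    pos = sum(seq.count(aa) for aa in "KR")
    neg = sum(seq.count(aa) for aa in "DE")
    return {"fcr": (pos + neg) / len(seq), "ncpr": (pos - neg) / len(seq)}


def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy over a non-empty sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Comparing the distributions of these values between generated and natural IDRs is one way to check that a model reproduces the compositional bias of disordered regions.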

Applications

IDiom is suited for protein engineers designing chimeric or modular proteins where IDR linkers, condensate-forming domains, or signaling-tail regions are required. It is also useful for synthetic biology applications involving designed intrinsically disordered proteins (IDPs), such as engineered phase-separating proteins or modular signaling scaffolds. The fill-in-the-middle interface makes it compatible with workflows that combine structure-based scaffold generation (for example, RFdiffusion) with sequence-based IDR completion.

Impact

IDiom is presented as the first language model purpose-built for IDP/IDR design and addresses a long-standing gap in the protein-design toolkit. Its release expands the practical reach of generative protein design beyond the structured proteome and complements structure-based generative models by covering the disordered fraction they handle poorly.

Citation

Liu, J., et al. (2026) Generative design of intrinsically disordered protein regions with IDiom. bioRxiv.

DOI: 10.64898/2026.04.10.717777

Metrics

Citations

Total Citations: 0
Influential: 0
References: 93

Tags

protein design, intrinsically disordered protein design, sequence generation, transformer, self-supervised, foundation model, protein, intrinsically disordered region

Resources

Research Paper