Category: Protein

Chroma

Generate:Biomedicines

Generative diffusion model for programmable protein design that jointly samples novel structures and sequences, conditioned on symmetry, shape, and natural language.

Released: 2023

Overview

Chroma is a generative model for programmable protein design developed by Generate:Biomedicines and published in Nature in November 2023. It uses a denoising diffusion probabilistic model (DDPM) framework to jointly generate protein backbone coordinates and amino acid sequences in a single unified process, rather than treating structure and sequence prediction as separate sequential steps. This joint generation approach enables Chroma to produce coherent, designable proteins that satisfy complex user-specified constraints simultaneously.

What distinguishes Chroma from prior protein design models is its Bayesian approach to programmable conditioning. External constraints — including symmetry groups, volumetric shape specifications, substructure scaffolds, semantic class labels, and natural language text prompts — are incorporated as likelihood functions that reweight the diffusion sampling trajectory at inference time. This means the base model does not need to be retrained to accommodate new constraint types; conditioning is applied zero-shot by composing the relevant likelihood terms during sampling. Chroma was the first generative protein model to demonstrate natural language conditioning for protein design.

Experimental validation was carried out at substantial scale. Of the 310 Chroma-designed proteins that were synthesized and characterized, those that expressed folded with favorable biophysical properties as measured by circular dichroism. Two designs were solved by X-ray crystallography, and the backbone RMSD to Chroma's predicted structures was approximately 1.0 angstrom, confirming atomic-level agreement between computational prediction and experimental structure.

Key Features

  • Programmable multi-modal conditioning: Designs proteins subject to constraints on symmetry, substructure, volumetric shape, semantic class, and natural language prompts — all within a single unified framework, applied zero-shot via Bayesian reweighting without retraining.
  • Joint structure and sequence generation: Simultaneously generates backbone coordinates and amino acid sequences through a coupled DDPM and conditional random field, producing tightly consistent structure-sequence pairs.
  • Sub-quadratic scaling: A random graph neural network with stochastically sampled long-range connectivity, inspired by fast N-body algorithms, gives O(N log N) scaling in residue count, enabling inference on large proteins and multi-chain assemblies.
  • All-atom output: Generates full all-atom models including sidechain conformations, not just backbone traces, providing ready-to-use structural models.
  • Natural language protein design: CLIP-derived text embeddings allow users to specify desired protein characteristics in plain English, biasing sampling toward matching protein classes without task-specific fine-tuning.
  • Experimental validation at scale: 310 synthesized designs showed high expression and correct folding; two crystal structures confirmed atomic-level accuracy.
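The sub-quadratic connectivity described above can be illustrated with a toy sketch. The function name and the inverse-distance edge kernel below are assumptions for illustration, not the paper's exact sampling scheme; the point is that a local window plus roughly log2(N) stochastic long-range edges per residue yields O(N log N) edges instead of O(N²):

```python
import math
import numpy as np

def sample_random_graph(n, k_local=8, seed=0):
    """Build a sparse residue graph: a dense local window plus about
    log2(n) stochastic long-range edges per residue, drawn with
    probability decaying as 1/distance. Expected edge count is
    O(N log N), versus O(N^2) for a fully connected graph."""
    rng = np.random.default_rng(seed)
    k_long = max(1, math.ceil(math.log2(n)))
    offsets = np.arange(1, n)
    probs = 1.0 / offsets                 # inverse-distance kernel (assumed)
    probs = probs / probs.sum()
    edges = set()
    for i in range(n):
        # dense local window along the chain
        for j in range(max(0, i - k_local), min(n, i + k_local + 1)):
            if j != i:
                edges.add((i, j))
        # a few stochastic long-range hops
        for d in rng.choice(offsets, size=k_long, p=probs):
            edges.add((i, (i + int(d)) % n))
    return edges
```

Resampling these edges at every diffusion step, as the paper describes, lets information propagate globally over the trajectory even though each individual graph is sparse.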

Technical Details

Chroma combines three tightly integrated technical components. The generative backbone is a DDPM defined over protein backbone coordinates, where the forward noising process corrupts structures toward a polymer-ensemble prior that respects the chain statistics of polypeptides rather than a simple isotropic Gaussian. This physically motivated prior improves the quality of samples during reverse diffusion.
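A minimal sketch of that forward process, assuming an ideal Gaussian-chain (random-walk) prior; the function names and the single `alpha_bar` schedule value are illustrative stand-ins, not Chroma's actual parameterization:

```python
import numpy as np

def chain_prior_sample(n, rng, bond_scale=1.0):
    """Ideal Gaussian-chain sample: residue positions are a 3-D random
    walk, so nearby residues stay spatially correlated (unlike
    isotropic Gaussian noise)."""
    steps = rng.normal(scale=bond_scale, size=(n, 3))
    x = np.cumsum(steps, axis=0)
    return x - x.mean(axis=0)            # centre the chain

def forward_diffuse(x0, alpha_bar, rng):
    """Interpolate toward the polymer prior instead of N(0, I):
    x_t = sqrt(a) * x_0 + sqrt(1 - a) * z, with z a chain-prior draw."""
    z = chain_prior_sample(len(x0), rng)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * z
```

At `alpha_bar = 1` the structure is untouched; at `alpha_bar = 0` it has been fully corrupted into a random polymer that still respects chain statistics, which is the property the paper's prior is built to preserve.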

The denoising network is an SE(3)-equivariant random graph neural network (EGNN). At each diffusion step, long-range residue connectivity is sampled stochastically, inspired by O(N log N) fast multipole and N-body algorithms. This architecture handles rotations and translations of input coordinates correctly without data augmentation and scales sub-quadratically in sequence length. Given a backbone, amino acid sequences are sampled from a conditional random field (CRF) whose potentials are predicted by the same graph neural network, tightly coupling structure and sequence.
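As a toy illustration of the sequence-sampling step, the sketch below keeps only the per-residue (unary) logits; Chroma's CRF also carries pairwise potentials between residues, which this simplification drops:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # standard 20 amino acids, alphabetical one-letter codes

def sample_sequence(logits, temperature=1.0, seed=0):
    """Sample one amino acid per residue from network logits of shape
    (n_residues, 20). A simplification of CRF sampling: residues are
    treated as conditionally independent given the backbone."""
    rng = np.random.default_rng(seed)
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)        # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    idx = [rng.choice(20, p=pi) for pi in p]
    return "".join(AA[i] for i in idx)
```

Because the same network predicts both the denoising updates and these potentials, the sampled sequence is conditioned on exactly the backbone the diffusion process produced.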

Conditioning is implemented through Bayesian inference: each constraint type (symmetry group, shape volume, substructure scaffold, semantic label, or CLIP-encoded text prompt) is formulated as a likelihood function that modifies the score function during reverse diffusion. Training used protein structures from the Protein Data Bank (queried 20 March 2022), UniProt 2022_01, and PFAM. De novo generated backbones achieve high self-consistency TM-scores when tested with ProteinMPNN and ESMFold in a round-trip designability evaluation.
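The conditioning mechanism reduces to score addition during reverse diffusion: since log p(x|c) = log p(x) + Σᵢ log p(cᵢ|x) + const, the conditional score is the unconditional score plus each constraint's log-likelihood gradient. Below is a minimal sketch with a toy Gaussian prior and a hypothetical centroid "shape" constraint; both are stand-ins for Chroma's actual terms:

```python
import numpy as np

def composed_score(x, score_prior, constraint_grads):
    """Bayesian conditioning: add the gradient of each constraint's
    log-likelihood to the unconditional score. New constraint types
    compose at inference time with no retraining."""
    s = score_prior(x)
    for grad_log_lik in constraint_grads:
        s = s + grad_log_lik(x)
    return s

# Toy stand-ins (hypothetical, for illustration only):
prior = lambda x: -x                                   # score of N(0, I)
target = np.array([1.0, 0.0, 0.0])                     # desired centroid
shape = lambda x: -(x.mean(axis=0) - target) / len(x)  # grad of -0.5*||mean(x) - target||^2
```

Adding a second constraint is just appending another gradient function to the list, which is the zero-shot composability the paragraph above describes.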

Applications

Chroma is suited to protein engineering workflows where functional or structural constraints must be satisfied simultaneously. Researchers can design entirely novel protein folds and sequences as scaffolds or therapeutic candidates, enforce specific symmetry groups such as cyclic, dihedral, or icosahedral symmetry to produce nanoassemblies and nanocages, or fix a known binding motif or functional site and generate a new protein fold around it. Shape-conditioned design allows proteins to be generated to fit a specified volumetric envelope, and natural language conditioning makes the tool accessible to researchers who wish to explore a protein class without manually specifying geometric constraints. The model is released as an open-source Python package under the Generate Biomedicines Community License, with pretrained weights distributed through the GitHub repository.
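As a toy illustration of symmetry-constrained design, one can replicate a single protomer's coordinates under a cyclic group to form an assembly. This is a geometric simplification (Chroma enforces symmetry through its conditioner framework during sampling, not by post-hoc replication), and the function names are assumptions:

```python
import numpy as np

def rotation_z(theta):
    """Rotation matrix about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def symmetrize_cyclic(protomer, n_sym):
    """Replicate one protomer (shape (n_atoms, 3)) under C_n symmetry
    about the z-axis, returning an (n_sym, n_atoms, 3) assembly."""
    chains = []
    for k in range(n_sym):
        R = rotation_z(2.0 * np.pi * k / n_sym)
        chains.append(protomer @ R.T)
    return np.stack(chains)
```

Dihedral or icosahedral constraints work the same way in principle, with a larger set of group elements generating the assembly.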

Impact

Chroma's publication in Nature marked a significant advance in the programmable protein design field, demonstrating for the first time that a single generative model could accommodate diverse conditioning modalities — including free-text prompts — within a coherent probabilistic framework. The experimental validation campaign, with 310 synthesized designs and two crystal structures confirming near-perfect structural accuracy, provided unusually strong empirical grounding for a de novo design method. The sub-quadratic architecture set a precedent for scalable protein generative models capable of handling large complexes. Limitations include the approximate nature of natural language conditioning, which biases toward well-represented protein classes but does not guarantee precise semantic control; the absence of direct functional optimization for properties such as binding affinity or catalytic activity; and an anchoring of the training distribution to PDB-known folds, which may reduce reliability for truly exotic topologies outside that distribution.

Citations

Illuminating protein space with a programmable generative model

Ingraham, J., et al. (2023). Illuminating protein space with a programmable generative model. Nature. DOI: 10.1038/s41586-023-06728-8

Preprint: Ingraham, J., et al. (2022). bioRxiv. DOI: 10.1101/2022.12.01.518682

Metrics

GitHub

Stars: 812
Forks: 112
Open Issues: 25
Contributors: 4
Last Push: 2y ago
Language: Python
License: Apache-2.0

Tags

protein design, structure prediction, diffusion, graph neural network, generative, natural language

Resources

GitHub Repository
Research Paper
Official Website