Microsoft Research
Sequence-first protein generation framework using discrete diffusion over evolutionary alignments, enabling controllable de novo design without structure.
The dominant paradigm in AI-guided protein design has been structure-first: generate a backbone geometry, then design a sequence to fold into it. This workflow, while powerful, inherits the limitations of the structural universe: models can only generate sequences compatible with three-dimensional forms that appear in or near the Protein Data Bank, a relatively small sample of the vast combinatorial space of possible protein sequences and functions. Intrinsically disordered proteins, which lack a stable three-dimensional structure yet perform essential biological functions, are largely inaccessible to structure-based design frameworks. The same is true for proteins whose function emerges from conformational flexibility rather than a single folded state. EvoDiff, introduced by researchers at Microsoft Research New England in September 2023, proposes a fundamentally different approach: design proteins in sequence space, conditioned on evolutionary information, without any reference to three-dimensional structure.
EvoDiff is a general-purpose discrete diffusion framework for protein sequence and multiple sequence alignment (MSA) generation. By framing protein design as a sequence generation problem conditioned on evolutionary context — rather than a structure completion problem — EvoDiff sidesteps the limitations of the structural universe and gains access to the full diversity of natural sequence space, including disordered regions, flexible loops, and evolutionary innovations that have not been captured in structural databases. The framework is designed to be maximally general: it can generate proteins unconditionally from scratch, condition on evolutionary information encoded in MSAs to design sequences that fit a particular evolutionary family, inpaint functional domains into scaffold sequences, and scaffold structural motifs from sequence data alone without requiring a pre-specified backbone.
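Concretely, the open-source package exposes these modes through pretrained checkpoints. The sketch below loads the 640M-parameter order-agnostic sequence model and samples a protein unconditionally; the checkpoint and function names (`OA_DM_640M`, `generate_oaardm`) follow the public repository's examples, but exact signatures vary across releases and should be treated as assumptions rather than a stable API.

```python
# Minimal usage sketch based on examples in the public EvoDiff repository;
# checkpoint and function names follow its README, exact signatures may vary.
from evodiff.pretrained import OA_DM_640M
from evodiff.generate import generate_oaardm

# Load the 640M-parameter order-agnostic sequence model with its tokenizer.
model, collater, tokenizer, scheme = OA_DM_640M()

# Sample one 100-residue protein unconditionally: start from an all-mask
# sequence and reveal residues one position at a time in random order.
tokenized, generated = generate_oaardm(model, tokenizer, 100,
                                       batch_size=1, device='cpu')
print(generated[0])
```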
Experimental validation across multiple protein classes — including intrinsically disordered mitochondrial targeting signals, metal-binding proteins, and protein-protein interaction binders — demonstrates that EvoDiff-generated sequences fold, express, and exhibit the expected structural and functional properties in the wet lab. This cross-domain experimental validation, combined with the framework's theoretical generality, positions EvoDiff as a complementary approach to structure-based methods that expands the designable sequence space rather than competing for territory within the existing structural paradigm. EvoDiff was authored by Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X. Lu, Nicolo Fusi, Ava P. Amini, and Kevin K. Yang.
EvoDiff is implemented in two main variants with distinct architectures. EvoDiff-Seq trains on 42 million sequences from UniRef50 using the CARP (Convolutional Autoencoding Representations of Proteins) architecture, a dilated convolutional neural network originally developed for protein masked language modeling. Two model sizes are trained for each corruption scheme: 38M and 640M parameters. The 640M-parameter EvoDiff-Seq model is the primary model reported for unconditional generation benchmarks and motif scaffolding tasks. The dilated CNN architecture processes sequences as one-dimensional signals with exponentially increasing receptive fields across layers, capturing both local sequence patterns and long-range dependencies without the quadratic attention cost of transformers at long sequence lengths.
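To make the receptive-field claim concrete, the following is a minimal PyTorch sketch of a stack of dilated 1D convolutions in the spirit of CARP's ByteNet-style blocks; the dimensions, depth, and block structure are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Illustrative stack of dilated 1D convolutions over residue embeddings.

    With kernel size 3 and dilations 1, 2, 4, ..., 2^(L-1), the receptive
    field grows exponentially with depth, so a deep stack sees long-range
    sequence context at linear (not quadratic) cost in sequence length.
    """
    def __init__(self, dim: int = 128, n_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)  # same-length output
            for i in range(n_layers)
        ])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, seq_len) residue embeddings
        for conv in self.layers:
            x = x + self.act(conv(x))  # residual connection per layer
        return x

h = DilatedConvStack()(torch.randn(2, 128, 256))  # -> (2, 128, 256)
```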
EvoDiff-MSA uses the MSA Transformer architecture — originally developed by Meta AI for the ESM model family — to condition generation on multiple sequence alignments. MSAs are subsampled to 512 residue positions and 64 sequences per alignment during training, using either random sequence sampling or diversity-maximizing greedy subsampling. The MSA Transformer's row and column attention operations naturally model both within-sequence and between-sequence dependencies, making it well-suited for learning the joint distribution of sequences within an evolutionary family.
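A minimal sketch of the diversity-maximizing idea, assuming a greedy max-Hamming heuristic that keeps the query row and repeatedly adds the alignment row farthest, on average, from the rows already selected; the exact criterion in the released code may differ.

```python
import numpy as np

def greedy_max_hamming_subsample(msa: np.ndarray, n_keep: int = 64) -> np.ndarray:
    """Greedily pick MSA rows that maximize diversity.

    msa: integer-encoded alignment of shape (n_seqs, n_cols).
    Starts from row 0 (the query) and repeatedly adds the row with the
    largest mean Hamming distance to all rows selected so far.
    """
    chosen = [0]
    remaining = set(range(1, msa.shape[0]))
    while len(chosen) < min(n_keep, msa.shape[0]):
        # Mean fraction of mismatched columns against the selected rows.
        dists = {i: (msa[i] != msa[chosen]).mean() for i in remaining}
        best = max(dists, key=dists.get)
        chosen.append(best)
        remaining.remove(best)
    return msa[chosen]
```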
The forward diffusion process is implemented in two variants. In order-agnostic autoregressive diffusion (OADM), the forward process masks one position at a time in a random order determined by a sampled permutation, and the denoising model learns to predict the identity of each masked token. In discrete denoising diffusion probabilistic models (D3PM), the forward process corrupts sequences using a substitution matrix based on amino acid mutation frequencies, and the denoising model learns to reverse this corruption. Both schemes define a discrete Markov chain over sequence space that converges to a fully corrupted (masked or randomized) sequence at time T, and the generative process runs the chain in reverse, progressively revealing a structured protein sequence from noise.
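A minimal sketch of the OADM reverse process, assuming a denoiser model(x) that returns per-position logits over the amino-acid vocabulary (the model, mask token, and vocabulary here are placeholders): generation starts from an all-mask sequence, visits positions in a random order, and samples each residue conditioned on everything revealed so far.

```python
import torch

def oadm_sample(model, seq_len: int, mask_id: int, vocab_size: int = 20):
    """Order-agnostic reverse diffusion: reveal one position per step.

    model(x) is assumed to return logits of shape (1, seq_len, n_tokens);
    mask_id is the index of the mask token in the model's alphabet.
    """
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)  # fully corrupted start
    for pos in torch.randperm(seq_len).tolist():             # random decoding order
        logits = model(x)[0, pos, :vocab_size]               # condition on revealed residues
        probs = torch.softmax(logits, dim=-1)
        x[0, pos] = torch.multinomial(probs, 1).item()       # sample this residue
    return x[0]
```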
Evaluation combined in silico metrics with wet-lab experiments. Unconditional generation was assessed by perplexity under an external protein language model (ESM-1v), by diversity metrics (Hamming distance and sequence identity clustering), and by structural plausibility, judged by running ESMFold on the generated sequences. For conditional tasks, sequences produced by motif scaffolding were tested for compatibility with the scaffolded motif using ESMFold-predicted structures, and three experimental systems (mitochondrial targeting signals, metal-binding zinc finger motifs, and protein binders) were synthesized, expressed, and characterized. Expression rates and structural validation by circular dichroism confirmed the biological plausibility of EvoDiff generations across all tested systems.
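As one concrete example of the diversity metrics, a short sketch of mean pairwise normalized Hamming distance over a batch of equal-length generated sequences (the paper's exact metric definitions may differ in detail):

```python
import itertools

def mean_pairwise_hamming(seqs: list[str]) -> float:
    """Mean normalized Hamming distance over all sequence pairs.

    Assumes equal-length sequences; 0.0 means an identical set, while
    values near 0.95 approach random for a 20-letter alphabet.
    """
    dists = [
        sum(a != b for a, b in zip(s, t)) / len(s)
        for s, t in itertools.combinations(seqs, 2)
    ]
    return sum(dists) / len(dists)

print(mean_pairwise_hamming(["MKTAYIAK", "MKTAHIAR", "MSTAYLAK"]))
```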
EvoDiff's core application is protein sequence generation for cases where structure-based methods are inadequate or inapplicable. For researchers studying intrinsically disordered proteins — a class that includes many transcription factors, signaling proteins, and disease-relevant aggregating proteins — EvoDiff provides the first generative framework capable of designing sequences in this regime without imposing a folded structure constraint. For protein engineering applications, the MSA conditioning capability enables researchers to generate new members of a protein family that inherit the evolutionary constraints of the family while diversifying at positions that tolerate variation, a strategy that can be used to generate libraries for directed evolution or to explore functional diversity within a natural protein class. The motif scaffolding capability is directly applicable to enzyme design and therapeutic protein engineering, where a defined functional site — a catalytic triad, a metal-binding loop, a receptor-binding interface — needs to be embedded in a new scaffold with desired properties. Because EvoDiff operates without reference to structure, it can be applied to any protein for which an MSA can be computed, extending controllable design to the full breadth of the known protein universe.
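In sequence space, motif scaffolding reduces to inpainting: the motif residues are clamped and the reverse process fills in only the masked scaffold positions. A minimal sketch under the same placeholder-denoiser assumptions as the sampler above (scaffold_motif is a hypothetical helper, not part of the released package):

```python
import torch

def scaffold_motif(model, seq_len: int, motif: dict[int, int],
                   mask_id: int, vocab_size: int = 20):
    """Sequence-space motif scaffolding via OADM inpainting (hypothetical
    helper). motif maps fixed positions to amino-acid token ids; those
    positions are clamped and never resampled, while masked scaffold
    positions are revealed one at a time, each conditioned on the motif
    and on previously revealed residues."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for pos, tok in motif.items():                       # clamp the functional motif
        x[0, pos] = tok
    free = [i for i in range(seq_len) if i not in motif]
    for j in torch.randperm(len(free)).tolist():         # random order over scaffold
        i = free[j]
        probs = torch.softmax(model(x)[0, i, :vocab_size], dim=-1)
        x[0, i] = torch.multinomial(probs, 1).item()
    return x[0]
```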
EvoDiff's primary contribution to the field is demonstrating that sequence-space diffusion, conditioned on evolutionary information, can produce biologically valid proteins across a wider range of functional classes than structure-based methods, and that it can do so with direct experimental validation rather than in silico metrics alone. The framework's successful generation of intrinsically disordered proteins in particular fills a genuine gap in the protein design toolkit, as this class of proteins is both biologically important and systematically excluded from structure-based design approaches. The open-source release on GitHub has enabled the research community to apply and extend the framework, and the comprehensive experimental characterization published alongside the computational methodology sets a rigorous standard for validating generative protein design methods. Subsequent work from other groups has built on the OADM formulation and the evolutionary conditioning approach. Key limitations include the lack of direct structural control during generation (users cannot specify a target fold) and the framework's sequence-only nature, which means that assessing three-dimensional compatibility requires running a separate structure prediction model as a post-hoc filter. Additionally, the 640M-parameter flagship model is smaller than the largest protein language models, and generation quality on highly structured, evolutionarily conserved protein families may lag behind dedicated structure-based design methods for those targets.
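In practice, that post-hoc filter is typically ESMFold or a comparable predictor. A minimal sketch, assuming the fair-esm package with its ESMFold extras installed; reading pLDDT from the PDB B-factor column is standard for ESMFold output, but the 70.0 cutoff is an illustrative choice, not one taken from the paper:

```python
import esm  # fair-esm, installed with its ESMFold extras

# Load ESMFold once; move to GPU if available for realistic runtimes.
folder = esm.pretrained.esmfold_v1().eval()

def mean_plddt(sequence: str) -> float:
    """Fold a sequence and return mean pLDDT (ESMFold writes per-residue
    pLDDT, scaled 0-100, into the PDB B-factor column)."""
    pdb_str = folder.infer_pdb(sequence)
    scores = [float(line[60:66]) for line in pdb_str.splitlines()
              if line.startswith("ATOM")]
    return sum(scores) / len(scores)

# Keep only generations that fold confidently; 70.0 is an illustrative cutoff.
candidates = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]
plausible = [s for s in candidates if mean_plddt(s) > 70.0]
```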