Edit-based flow-matching model that generates variable-length protein variants via learned insertions, deletions, and substitutions on a template sequence.
EvoFlows is a generative protein-sequence model that learns mutational trajectories between evolutionarily related proteins and uses them to propose new variants. Rather than generating a sequence from scratch, it operates on a template sequence and applies a controllable number of edits—insertions, deletions, and substitutions—predicting both which edit to make and where to make it. This framing makes the model a natural fit for protein engineering and lead optimization, where the goal is usually to improve an existing protein rather than to design one de novo.
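To make the edit-based framing concrete, the following is a minimal sketch of what applying a set of insertions, deletions, and substitutions to a template looks like. It is not the EvoFlows implementation: the `Edit` record, the `apply_edits` function, and the convention that positions index the original template are all assumptions chosen for illustration, and the edits here are written by hand rather than predicted by a model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record (not the paper's data structure)."""
    op: str                        # "sub", "ins", or "del"
    pos: int                       # 0-based index into the ORIGINAL template
    residue: Optional[str] = None  # new residue for "sub"/"ins"; unused for "del"


def apply_edits(template: str, edits: List[Edit]) -> str:
    """Apply edits whose positions all refer to the unedited template.

    An "ins" at position p places the residue before template[p];
    p == len(template) appends. If a position is both substituted and
    deleted, the deletion wins (an arbitrary choice for this sketch).
    """
    subs: Dict[int, str] = {}
    dels: Set[int] = set()
    inserts: Dict[int, List[str]] = {}
    n = len(template)
    for e in edits:
        if e.op == "sub" and 0 <= e.pos < n and e.residue:
            subs[e.pos] = e.residue
        elif e.op == "del" and 0 <= e.pos < n:
            dels.add(e.pos)
        elif e.op == "ins" and 0 <= e.pos <= n and e.residue:
            inserts.setdefault(e.pos, []).append(e.residue)
        else:
            raise ValueError(f"invalid edit: {e}")

    out: List[str] = []
    for i in range(n + 1):
        out.extend(inserts.get(i, []))       # residues inserted before position i
        if i < n and i not in dels:
            out.append(subs.get(i, template[i]))
    return "".join(out)


# One substitution, one deletion, and one insertion on a toy template.
variant = apply_edits(
    "MKTAYI",
    [Edit("sub", 1, "R"), Edit("del", 3), Edit("ins", 5, "G")],
)
print(variant)  # MRTYGI
```

Because deletions and insertions are first-class operations, the output length is free to differ from the template length, which is the property that fixed-position masking approaches lack.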
The approach addresses a structural mismatch between most protein language models and the optimization tasks they are applied to. Autoregressive models must regenerate a full sequence; masked language models and discrete diffusion models typically require the mutation locations to be specified in advance; and none of these paradigms naturally support length-changing edits (insertions and deletions) relative to a starting sequence. EvoFlows is built around a variable-length, edit-based formulation specifically to remove those constraints.
EvoFlows was developed by Nicolas Deutschmann, Constance Ferragu, Jonathan D. Ziegler, Shayan Aziznejad, and Eli Bixby at Cradle, an Amsterdam-based protein-engineering company. It was released as an arXiv preprint in March 2026 and presented at the Workshop on Foundation Models for Science at ICLR 2026.
EvoFlows builds on discrete flow matching (DFM) and edit flows, casting protein variant generation as a learned transport between distributions of evolutionarily related sequence pairs. Training data are constructed from pairwise alignments of homologous sequences drawn from UniRef (general proteins) and OAS (antibodies); at inference time the model iteratively samples edits to transform a template into a novel variant. The authors evaluate generated variants with a battery of in-silico metrics (including model-based pseudo-log-likelihood, covariance and mutual-information statistics, BLOSUM-corrected KL divergence, and a spectrum-kernel maximum mean discrepancy), comparing against existing generative baselines. Across diverse protein families, EvoFlows generated variants that stayed consistent with the source family's statistics while reaching greater mutational distance from the template than the baselines. Detailed architecture and hyperparameter settings are provided in the paper's appendices.
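The idea that a pairwise alignment of two homologs defines the edits separating them can be sketched as follows. This is an illustration under stated assumptions, not the authors' data pipeline: the function name `edits_from_alignment`, the `(op, position, residue)` tuple encoding, and the use of `-` as the gap character are all hypothetical, and the aligned strings are supplied by hand rather than computed by an aligner.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap symbol in the aligned strings

# (operation, 0-based position in the ungapped source, residue)
EditTuple = Tuple[str, int, str]


def edits_from_alignment(aligned_src: str, aligned_tgt: str) -> List[EditTuple]:
    """Read off the edits that turn the source into the target.

    Both inputs are rows of one pairwise alignment, so they must have
    equal length. Positions refer to the ungapped source sequence; an
    insertion at position p means "insert before source residue p".
    """
    if len(aligned_src) != len(aligned_tgt):
        raise ValueError("aligned sequences must have the same length")

    edits: List[EditTuple] = []
    src_pos = 0  # index into the ungapped source
    for a, b in zip(aligned_src, aligned_tgt):
        if a == GAP and b == GAP:
            continue                          # empty column, nothing to do
        if a == GAP:
            edits.append(("ins", src_pos, b))  # residue present only in target
        elif b == GAP:
            edits.append(("del", src_pos, ""))  # residue present only in source
            src_pos += 1
        else:
            if a != b:
                edits.append(("sub", src_pos, b))
            src_pos += 1
    return edits


# Toy alignment: source MKTAYI versus target MRTGYI.
pair_edits = edits_from_alignment("MKTA-YI", "MRT-GYI")
print(pair_edits)  # [('sub', 1, 'R'), ('del', 3, ''), ('ins', 4, 'G')]
```

In this toy pair the alignment yields one substitution, one deletion, and one insertion. Edit lists of this kind are one plausible form of supervision for a model that must predict both the edit type and its location.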
EvoFlows targets protein engineering and lead optimization, where teams iteratively improve a known protein—an enzyme, a binder, or a therapeutic antibody—rather than design one from nothing. Its ability to insert and delete residues, not just substitute them, makes it applicable to tasks such as loop remodeling, length variation in antibody CDRs, and broader sequence diversification, while the tunable edit budget lets users balance conservative refinement against more aggressive exploration. The OAS pretraining makes the antibody-engineering setting a particularly natural use case.
EvoFlows extends discrete flow matching to a length-variable, edit-based setting for proteins, filling a gap left by autoregressive, masked-language, and diffusion models that either regenerate whole sequences or assume fixed mutation positions. By aligning the generative process with how protein engineers actually work—editing a template under a controllable mutation budget—it offers a more directly applicable tool for optimization campaigns. As a recent preprint from an industry group, its long-term influence is still emerging, and at the time of writing no public code or model weights have been released, which currently limits independent reproduction and reuse.