ETH Zurich / University of Zurich
A latent flow-matching method that repurposes protein language model embeddings to generate high-fitness protein variants without predictor guidance during sampling.
Engineering proteins for higher fitness (stability, binding affinity, catalytic activity, or expression) is fundamentally a search problem over an astronomically large sequence space in which high-fitness variants are rare. Directed evolution and machine-learning-guided design both attempt to navigate this landscape efficiently, but generative approaches often rely on an external fitness predictor to steer sampling, which ties the quality of the generated sequences to the accuracy and differentiability of that predictor.
CHASE, introduced by Caceres Arroyo and colleagues at ETH Zurich and the University of Zurich in a February 2026 arXiv preprint, takes a different route. Rather than learning a generative model directly over amino-acid sequences, it works in the embedding space of a pretrained protein language model (PLM). The method compresses high-dimensional PLM embeddings into a reduced latent space and then trains a conditional flow-matching model in that latent space, using classifier-free guidance to bias generation toward high-fitness regions. Crucially, this lets CHASE generate candidate variants without querying a separate fitness predictor during the ODE sampling trajectory.
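Because the preprint's code is not released, the following is only a minimal sketch of how one training example for conditional flow matching with classifier-free condition dropout is commonly constructed. It assumes a straight-line path between Gaussian noise and a data latent. The function names, the `p_drop` value, and the path choice are illustrative assumptions, not confirmed details of CHASE.

```python
import random


def fm_training_pair(z1, fitness_label, p_drop=0.1, rng=random):
    """Build one conditional flow-matching training example (generic sketch).

    z1            : latent vector of a real variant (list of floats)
    fitness_label : conditioning signal, e.g. a fitness class (hypothetical)
    p_drop        : probability of hiding the label so the same network also
                    learns an unconditional velocity (assumed value)
    """
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]              # noise endpoint
    t = rng.random()                                    # time in [0, 1)
    zt = [(1 - t) * a + t * b for a, b in zip(z0, z1)]  # point on the straight path
    target = [b - a for a, b in zip(z0, z1)]            # constant velocity of that path
    cond = None if rng.random() < p_drop else fitness_label
    return zt, t, cond, target


def fm_loss(pred_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, target)) / len(target)
```

In a setup like this, a network would be trained to predict `target` from `(zt, t, cond)`. Randomly replacing the label with `None` is what later allows conditional and unconditional predictions to be combined at sampling time.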
By reusing the rich evolutionary and structural priors already captured by a PLM, CHASE aims to make fitness-guided generation both data-efficient and robust, particularly in the common regime where only a small number of labeled variants are available.
CHASE couples a pretrained protein language model with a learned encoder that maps PLM embeddings into a reduced latent space, and a conditional flow-matching network that models the latent distribution. Classifier-free guidance is used at sampling time to shift the generative flow toward high-fitness variants, so that solving the learned ODE yields candidate sequences without a separate, differentiable fitness oracle in the loop. The authors evaluate CHASE on standard protein-fitness optimization benchmarks derived from adeno-associated virus (AAV) capsid and green fluorescent protein (GFP) datasets, reporting improvements over competing methods. They additionally demonstrate that bootstrapping with synthetic data can raise performance when labeled training data are scarce. As a recent preprint, the work does not yet have released code or weights, and exact architectural hyperparameters await the full release.
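To make predictor-free sampling concrete, here is a hedged sketch of classifier-free guidance inside a fixed-step Euler ODE solver. The guidance rule (unconditional velocity plus `w` times the conditional difference), the Euler integrator, the step count, and the toy velocity field `toy_velocity` are all assumptions chosen for illustration. The preprint's actual solver, guidance scale, and network are not specified here.

```python
def guided_velocity(v_model, z, t, cond, w):
    """Combine conditional and unconditional velocities (classifier-free guidance)."""
    v_c = v_model(z, t, cond)   # velocity given the fitness condition
    v_u = v_model(z, t, None)   # velocity with the condition dropped
    return [u + w * (c - u) for c, u in zip(v_c, v_u)]


def sample_euler(v_model, z0, cond, w=2.0, n_steps=50):
    """Integrate dz/dt = guided velocity from t=0 to t=1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt              # last step uses t = 1 - dt, so 1 - t is never zero
        v = guided_velocity(v_model, z, t, cond, w)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def toy_velocity(z, t, cond):
    """Hypothetical stand-in for a trained network.

    Flows straight toward a fixed point: the origin when unconditional,
    (1, 1) when conditioned on the made-up label "high".
    """
    mean = [1.0, 1.0] if cond == "high" else [0.0, 0.0]
    return [(m - zi) / (1.0 - t) for m, zi in zip(mean, z)]


latent = sample_euler(toy_velocity, z0=[0.3, -0.7], cond="high", w=2.0)
```

No fitness predictor is called anywhere in the loop; the only learned component queried is the velocity model. In this toy field, `w = 1` recovers the purely conditional flow, while larger `w` extrapolates past the conditional target and away from the unconditional one. A decoder from latents back to amino-acid sequences would still be required and is not shown.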
CHASE is aimed at protein engineers and computational biologists running machine-learning-guided directed-evolution campaigns, where the goal is to propose a small batch of promising variants for experimental testing. Because it operates in the embedding space of an existing protein language model and does not require a separate fitness predictor at sampling time, it is well suited to settings where labeled fitness measurements are limited, such as early-stage campaigns on a novel target. Typical targets include improving binding, stability, or activity of enzymes, biologics, and engineered proteins such as AAV capsids.
CHASE contributes to a growing line of work that treats protein language model embeddings not just as fixed features but as a generative substrate, and it illustrates how flow matching can be adapted to guided biological sequence design. Its predictor-free sampling and synthetic-data bootstrapping address two practical pain points: dependence on accurate differentiable oracles and scarcity of labeled variants. As a February 2026 preprint without released code or weights, its reported gains on the AAV and GFP benchmarks await peer review and independent reproduction before its practical advantages over established fitness-optimization methods can be fully assessed.