CHASE

Latent flow-matching method that repurposes protein language model embeddings to generate high-fitness protein variants without predictor guidance.

Released: February 2026

Engineering proteins for higher fitness (stability, binding affinity, catalytic activity, or expression) is fundamentally a search problem: the sequence space is astronomically large, and high-fitness variants are rare within it. Directed evolution and machine-learning-guided design both attempt to navigate this landscape efficiently, but generative approaches often rely on an external fitness predictor to steer sampling, which ties the quality of the generated sequences to the accuracy and differentiability of that predictor.

CHASE, introduced by Caceres Arroyo and colleagues at ETH Zurich and the University of Zurich in a February 2026 arXiv preprint, takes a different route. Rather than learning a generative model directly over amino-acid sequences, it works in the embedding space of a pretrained protein language model (PLM). The method compresses high-dimensional PLM embeddings into a reduced latent space and then trains a conditional flow-matching model in that latent space, using classifier-free guidance to bias generation toward high-fitness regions. Crucially, this lets CHASE generate candidate variants without querying a separate fitness predictor during the ODE sampling trajectory.
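To make the terms above concrete, the following is a sketch of the standard linear-path conditional flow-matching objective together with the usual classifier-free guidance rule. It is not taken from the preprint: the interpolation path, the form of the fitness condition c, the null condition, and the guidance scale w are all assumed notation, and the authors' parameterization may differ.

```latex
% Assumed notation: z_1 is the compressed latent of a training variant,
% c its fitness condition, \emptyset the dropped (null) condition,
% and w the guidance scale.
\begin{align}
z_t &= (1 - t)\, z_0 + t\, z_1, \qquad z_0 \sim \mathcal{N}(0, I),\ t \sim \mathcal{U}[0, 1] \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\, z_0,\, (z_1, c)} \left\| v_\theta(z_t, t, c) - (z_1 - z_0) \right\|^2 \\
\tilde{v}_\theta(z, t, c) &= v_\theta(z, t, \emptyset) + w \left[ v_\theta(z, t, c) - v_\theta(z, t, \emptyset) \right] \\
\frac{\mathrm{d}z}{\mathrm{d}t} &= \tilde{v}_\theta(z, t, c), \qquad z(0) = z_0
\end{align}
```

Under this formulation, w = 1 recovers the plain conditional model and w > 1 pushes samples further toward the conditioned region. Every term is an output of the same velocity network, which is why no separate fitness predictor needs to be evaluated while the ODE is integrated from t = 0 to t = 1.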

By reusing the rich evolutionary and structural priors already captured by a PLM, CHASE aims to make fitness-guided generation both data-efficient and robust, particularly in the common regime where only a small number of labeled variants are available.

Key Features

Latent flow-matching over PLM embeddings: CHASE trains a conditional flow-matching model in a compressed latent space derived from protein language model embeddings, rather than directly over discrete sequences.
Predictor-free guided sampling: Using classifier-free guidance, the model steers generation toward high-fitness variants without calling an external fitness predictor during the ODE sampling process.
Embedding compression: High-dimensional PLM embeddings are reduced to a smaller latent space, making the generative dynamics more tractable while retaining the upstream model's learned priors.
Synthetic data bootstrapping: In low-data regimes, the authors show that augmenting training with synthetic data can improve performance.
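The embedding-compression idea in the list above can be illustrated with a linear projection. The sketch below uses PCA as a simple stand-in for the learned encoder; the random array standing in for PLM embeddings, the embedding width of 1280, the latent size of 32, and all function names are assumptions made for illustration, not details from the preprint.

```python
import numpy as np

def fit_linear_compressor(embeddings, latent_dim):
    """Fit a PCA projection (a stand-in for a learned encoder)."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:latent_dim]
    return mean, components

def encode(embeddings, mean, components):
    """Project full-size embeddings into the reduced latent space."""
    return (embeddings - mean) @ components.T

def decode(latents, mean, components):
    """Map latents back to (approximate) full-size embeddings."""
    return latents @ components + mean

rng = np.random.default_rng(0)
# Hypothetical placeholder for per-variant PLM embeddings: 200 variants x 1280 dims.
plm_embeddings = rng.normal(size=(200, 1280))

mean, components = fit_linear_compressor(plm_embeddings, latent_dim=32)
latents = encode(plm_embeddings, mean, components)
reconstructed = decode(latents, mean, components)
print(latents.shape, reconstructed.shape)
```

A generative model trained on `latents` works with 32 coordinates per variant instead of 1280 in this toy setup, which is the tractability benefit the feature list describes. A learned nonlinear encoder, as the source describes, would replace the PCA step.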

Technical Details

CHASE combines three components: a pretrained protein language model, a learned encoder that maps PLM embeddings into a reduced latent space, and a conditional flow-matching network that models the latent distribution. Classifier-free guidance is used at sampling time to shift the generative flow toward high-fitness variants, so that solving the learned ODE yields candidate sequences without a separate, differentiable fitness oracle in the loop.

The authors evaluate CHASE on standard protein-fitness optimization benchmarks derived from adeno-associated virus (AAV) capsid and green fluorescent protein (GFP) datasets, reporting improvements over competing methods. They additionally demonstrate that bootstrapping with synthetic data can raise performance when labeled training data are scarce. As a recent preprint, the work does not yet have released code or weights, and exact architectural hyperparameters await the full release.
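The sampling step described above can be sketched as a fixed-step Euler integration with classifier-free guidance. This is a toy illustration rather than the authors' implementation: `toy_velocity` is a hypothetical analytic stand-in for a trained velocity network, the scalar `cond` stands in for a fitness condition, and decoding latents back to sequences is omitted because the source does not describe it.

```python
import numpy as np

def toy_velocity(z, t, cond):
    """Hypothetical stand-in for a trained velocity network.

    cond=None plays the role of the dropped (unconditional) label.
    The field transports any point to a constant target along a linear path.
    """
    target = np.zeros_like(z) if cond is None else np.full_like(z, cond)
    return (target - z) / (1.0 - t)

def sample_with_cfg(velocity, z0, cond, guidance_scale, num_steps=50):
    """Euler-integrate dz/dt = v_uncond + w * (v_cond - v_uncond) over [0, 1]."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        v_uncond = velocity(z, t, None)
        v_cond = velocity(z, t, cond)
        # Guidance uses only the generative model itself; no fitness predictor is called.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        z = z + dt * v
    return z

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8))  # 4 samples in an assumed 8-dimensional latent space

z_plain = sample_with_cfg(toy_velocity, z0, cond=1.0, guidance_scale=1.0)
z_guided = sample_with_cfg(toy_velocity, z0, cond=1.0, guidance_scale=2.0)
print(z_plain.mean(), z_guided.mean())
```

In this toy field, a guidance scale of 1 lands samples on the conditional target, while a scale of 2 extrapolates past it, away from the unconditional target. That extrapolation is the mechanism by which guidance biases samples toward the conditioned region; in practice the scale would need tuning, since extrapolating too far can leave the region the model was trained on.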

Applications

CHASE is aimed at protein engineers and computational biologists running machine-learning-guided directed-evolution campaigns, where the goal is to propose a small batch of promising variants for experimental testing. Because it operates in the embedding space of an existing protein language model and does not require a separate fitness predictor at sampling time, it is well suited to settings where labeled fitness measurements are limited, such as early-stage campaigns on a novel target. Typical targets include improving binding, stability, or activity of enzymes, biologics, and engineered proteins such as AAV capsids.

Impact

CHASE contributes to a growing line of work that treats protein language model embeddings not just as fixed features but as a generative substrate, and it illustrates how flow-matching can be adapted to guided biological sequence design. Its predictor-free sampling and synthetic-data bootstrapping address two practical pain points: dependence on accurate differentiable oracles and scarcity of labeled variants. Because it is a February 2026 preprint without released code or weights, its reported gains on the AAV and GFP benchmarks await peer review and independent reproduction before its practical advantages over established fitness-optimization methods can be fully assessed.

Citation

Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization

Preprint

Arroyo, A. C., et al. (2026) Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization. arXiv.org.

DOI: 10.48550/arXiv.2602.02425


Citations

Total citations: 0

Influential: 0

References: 40


Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 11 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 14

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper
