Griffith University / Monash University / Fudan University
A Dirichlet flow-matching model that generates family-aware protein sequences by initializing from ancestral-reconstruction lineage priors rather than random noise.
LineageFlow is a generative model for protein sequence design. It aims to produce sequences that are biophysically plausible and recognizable members of a target protein family, while still exploring meaningful within-family diversity. Introduced in a 2026 preprint (arXiv:2605.22252, accepted at ICML 2026) by researchers at Griffith University, Monash University, and Fudan University, it targets a common failure mode of diffusion- and flow-based sequence generators: when sampling starts from uninformative random noise, the output tends either to drift outside the family or to collapse toward over-conserved consensus sequences.
The model's central idea is to replace the random starting distribution with a lineage prior derived from ancestral sequence reconstruction (ASR). Rather than denoising from scratch, LineageFlow begins from a phylogeny-informed ancestral distribution and transports it toward the space of extant (present-day) sequences using a shared Dirichlet flow-matching denoiser. This grounds generation in an evolved scaffold, so the model performs structured mutation from an evolutionarily reasonable starting point instead of inventing a family from noise.
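The contrast between the two starting distributions can be illustrated with a small NumPy sketch. It is not the released implementation: the function names, the `concentration` and `eps` parameters, and the toy posterior are assumptions made for illustration only. Each position is a point on the 20-residue probability simplex, drawn either from a flat Dirichlet (uninformative noise) or from a Dirichlet centered on a hypothetical ASR posterior.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def uniform_noise_start(length, rng):
    """Uninformative start: every position is uniform on the simplex."""
    return rng.dirichlet(np.ones(len(AMINO_ACIDS)), size=length)


def lineage_prior_start(asr_posterior, rng, concentration=50.0, eps=1e-3):
    """Start near an ASR posterior (hypothetical parameterization).

    asr_posterior: (length, 20) array, each row a distribution over residues.
    concentration: assumed knob; larger values stay closer to the posterior.
    """
    alpha = concentration * np.asarray(asr_posterior) + eps
    return np.stack([rng.dirichlet(row) for row in alpha])


def decode(state):
    """Read off the most probable residue at each position."""
    return "".join(AMINO_ACIDS[i] for i in state.argmax(axis=1))


rng = np.random.default_rng(0)

# Toy ASR posterior: 0.9 mass on a made-up ancestral residue per position.
ancestor = "MKTAYIAK"
posterior = np.full((len(ancestor), len(AMINO_ACIDS)), 0.1 / 19)
for pos, aa in enumerate(ancestor):
    posterior[pos, AMINO_ACIDS.index(aa)] = 0.9

noise_state = uniform_noise_start(len(ancestor), rng)
prior_state = lineage_prior_start(posterior, rng)

print("noise start decodes to:", decode(noise_state))
print("prior start decodes to:", decode(prior_state))
```

In this toy setup the lineage-prior start already decodes to the ancestral scaffold, so a denoiser only has to model mutations away from it, whereas the noise start carries no family information at all.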
LineageFlow also introduces "rerouting," a single intermediate-time mutate–select–amplify intervention that steers sampling toward a user-defined objective without requiring per-step predictor guidance. This makes objective-aware generation lightweight while preserving the family-validity benefits of the lineage prior.
LineageFlow couples a shared flow-matching denoiser, trained to transport lineage priors toward extant sequences, with per-family ASR priors computed from multiple sequence alignments. Training and evaluation assets are built from Pfam families: the released pipeline uses cleaned per-family FASTA training sequences, ASR-derived prior files, per-position gap-rate files, and a family-filtering table, with a released checkpoint (lineageflow-rp55.ckpt) trained on RP55 representative sequences. Sampling proceeds over the simplex with a default schedule of roughly 100 base steps and 50 final steps; rerouting runs a small number of rounds (default 3) over a population (default size 8) at an intermediate time, mutating a fraction of positions (default 25%) and selecting by a fitness scorer.
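A mutate-select-amplify loop with the defaults quoted above (3 rounds, population of 8, 25% of positions mutated) might look roughly like the following sketch. Several simplifications are assumptions rather than details from the paper: the loop operates on decoded sequences instead of intermediate simplex states, the `keep` parameter and the choice to let unmutated parents compete with their mutants are hypothetical, and the scorer is a toy stand-in for a real fitness function.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, fraction, rng):
    """Resample a random `fraction` of positions uniformly over residues."""
    n = max(1, round(fraction * len(seq)))
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), n):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def reroute(population, scorer, rounds=3, mutate_fraction=0.25, keep=2, seed=0):
    """One rerouting intervention: mutate, select, amplify.

    `keep` (survivors per round) is a hypothetical parameter.
    """
    rng = random.Random(seed)
    size = len(population)
    for _ in range(rounds):
        # Mutate: propose one variant per member; parents stay in the pool.
        candidates = population + [mutate(s, mutate_fraction, rng) for s in population]
        # Select: keep the top-scoring candidates.
        survivors = sorted(candidates, key=scorer, reverse=True)[:keep]
        # Amplify: copy survivors until the population is back to full size.
        population = [survivors[i % keep] for i in range(size)]
    return population


def toy_scorer(seq):
    """Toy objective (assumption): reward alanine content."""
    return seq.count("A")


start = ["MKTAYIAKQRQISFVK"] * 8  # default population size of 8
rerouted = reroute(start, toy_scorer)
print(max(map(toy_scorer, start)), "->", max(map(toy_scorer, rerouted)))
```

Because the scorer is just a callable from sequence to number, a predicted-foldability or likelihood model could be swapped in without touching the loop. In the method as described, sampling would then resume from the rerouted population for the remaining steps.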
The authors evaluate generated sequences along four axes: family validity via profile-HMM scoring, foldability via mean OmegaFold pLDDT, self-consistency via ESM-IF (inverse-folding) perplexity, and novelty/diversity via MMseqs2-based clustering and identity statistics. Against flow- and diffusion-based baselines, LineageFlow reports improved structural confidence and family validity while retaining sequence diversity, supporting the claim that ancestral initialization yields higher-fidelity, family-faithful samples.
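To make the novelty and diversity axis concrete, the sketch below computes simple identity statistics in pure Python. It is a simplified stand-in, not the MMseqs2-based procedure the authors use: it assumes equal-length (for example, pre-aligned) sequences, and the function names and example sequences are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (align them first)")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def diversity(seqs):
    """One minus the mean pairwise identity within a generated set."""
    pairs = list(combinations(seqs, 2))
    return 1.0 - sum(identity(a, b) for a, b in pairs) / len(pairs)


def novelty(generated, training):
    """Per sequence: one minus identity to its nearest training sequence."""
    return [1.0 - max(identity(g, t) for t in training) for g in generated]


generated = ["ACDE", "ACDF", "ACGG"]
training = ["ACDE", "AAAA"]

print("diversity:", round(diversity(generated), 3))
print("novelty:", novelty(generated, training))
```

High diversity with near-zero novelty would indicate memorized training sequences, while high novelty paired with poor profile-HMM scores would indicate drift out of the family, which is why the axes are read together.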
LineageFlow targets protein engineering and design workflows where the goal is to generate novel-but-valid variants of a known protein family, for example expanding an enzyme or binder family with candidates that fold reliably and retain the family fold. Because rerouting accepts arbitrary fitness scorers, practitioners can bias generation toward measurable objectives such as predicted foldability or model-scored sequence likelihood without retraining. The Pfam-based pipeline makes the approach applicable to any family with sufficient alignment depth for ancestral reconstruction, and generated sequences can be passed directly to the downstream structure-prediction and inverse-folding tools used to triage candidates before wet-lab testing.
LineageFlow contributes to a growing line of work that replaces uninformative noise priors in generative sequence models with biologically structured initializations, here drawing on phylogenetics and ancestral sequence reconstruction. By demonstrating that lineage priors improve family validity and structural confidence over flow- and diffusion-based baselines, and by showing that a single intermediate-time rerouting step can guide generation cheaply, the work offers a practical recipe for family-aware protein design. Because the work is a recent preprint (accepted at ICML 2026), its real-world adoption and experimental validation remain to be established. The reported gains are computational (profile-HMM, OmegaFold pLDDT, ESM-IF perplexity, MMseqs2 diversity), and the method's dependence on alignment-derived ASR priors may limit its applicability to families with sparse or low-quality alignments.