LineageFlow

Griffith University / Monash University / Fudan University

Dirichlet flow-matching model for protein design that generates family-aware sequences from ancestral-reconstruction priors, not random noise.

Released: May 2026

LineageFlow is a generative model for protein sequence design. It aims to produce sequences that are biophysically plausible and recognizable members of a target protein family while still exploring meaningful within-family diversity. Introduced in a 2026 preprint (arXiv:2605.22252, accepted at ICML 2026) by researchers at Griffith University, Monash University, and Fudan University, it addresses a common limitation of diffusion- and flow-based sequence generators: when generation starts from uninformative random noise, samples tend either to drift outside the family or to collapse toward over-conserved consensus sequences.

The model's central idea is to replace the random starting distribution with a lineage prior derived from ancestral sequence reconstruction (ASR). Rather than denoising from scratch, LineageFlow begins from a phylogeny-informed ancestral distribution and transports it toward the space of extant (present-day) sequences using a shared Dirichlet flow-matching denoiser. This grounds generation in an evolved scaffold, so the model performs structured mutation from an evolutionarily reasonable starting point instead of inventing a family from noise.
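The contrast between a noise start and a lineage start can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy posterior, and the `strength` and `pseudocount` parameters are all assumptions made for illustration. It only shows how a per-position starting point on the amino-acid simplex might be drawn either from a flat Dirichlet or from a Dirichlet centred on an ASR posterior.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def toy_asr_posterior(ancestor, confidence=0.9):
    """Hypothetical stand-in for an ASR output: each position puts
    `confidence` on the reconstructed residue and spreads the rest evenly."""
    rest = (1.0 - confidence) / (len(AMINO_ACIDS) - 1)
    return [[confidence if aa == anc else rest for aa in AMINO_ACIDS]
            for anc in ancestor]


def init_from_noise(length, rng):
    """Uninformative start: a flat Dirichlet(1, ..., 1) at every position."""
    return [sample_dirichlet([1.0] * len(AMINO_ACIDS), rng)
            for _ in range(length)]


def init_from_lineage_prior(posterior, rng, strength=50.0, pseudocount=0.1):
    """Lineage start: a Dirichlet centred on each position's ASR posterior.
    `strength` and `pseudocount` are illustrative knobs, not paper values."""
    return [sample_dirichlet([strength * p + pseudocount for p in probs], rng)
            for probs in posterior]


def mass_on_sequence(state, seq):
    """Average probability the simplex state assigns to the residues of `seq`."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return sum(state[pos][index[aa]] for pos, aa in enumerate(seq)) / len(seq)
```

In this toy setting the lineage start already places most of its probability mass on the ancestral residues, whereas the noise start spreads mass almost evenly, which is the intuition behind starting from an evolved scaffold.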

LineageFlow also introduces "rerouting," a single intermediate-time mutate–select–amplify intervention that steers sampling toward a user-defined objective without requiring per-step predictor guidance. This makes objective-aware generation lightweight while preserving the family-validity benefits of the lineage prior.

Key Features

Lineage-prior initialization: Generation starts from ancestral-reconstruction priors rather than random noise, anchoring samples to an evolved scaffold and improving family validity.
Dirichlet flow matching: A flow-matching denoiser operating on the probability simplex over amino acids transports the ancestral prior toward extant sequence distributions.
Rerouting for objective-guided sampling: A single population-based mutate–select–amplify step at an intermediate sampling time steers sequences toward a target objective without per-step gradient or predictor guidance.
Family-aware diversity: Produces sequences that remain recognizable family members while exploring within-family variation, avoiding both family drift and consensus collapse.
Pluggable fitness scorers: Rerouting can be guided by ESM2-150M masked-marginal scores, prior likelihood, or lightweight heuristics, allowing different design objectives to be plugged in.
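To make the idea of a pluggable scorer concrete, here is a minimal sketch of what such an interface could look like. The `Scorer` type alias, both function names, and the hydrophobic-fraction heuristic are hypothetical and are not taken from the released code. The first scorer corresponds loosely to the "prior likelihood" option, the second to a "lightweight heuristic".

```python
import math
from typing import Callable, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed interface: a scorer maps a sequence to a float, higher is better.
Scorer = Callable[[str], float]


def make_prior_likelihood_scorer(prior: Sequence[Sequence[float]],
                                 eps: float = 1e-9) -> Scorer:
    """Build a scorer that returns the mean per-position log-probability
    of a sequence under a per-position prior (rows follow AMINO_ACIDS)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def score(seq: str) -> float:
        if len(seq) != len(prior):
            raise ValueError("sequence length must match the prior length")
        total = sum(math.log(prior[pos][index[aa]] + eps)
                    for pos, aa in enumerate(seq))
        return total / len(seq)

    return score


def hydrophobic_fraction(seq: str) -> float:
    """A deliberately simple heuristic scorer, for illustration only."""
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)
```

Because both functions share the same call signature, either could be handed to a selection step without changing the surrounding sampling code.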

Technical Details

LineageFlow couples a shared flow-matching denoiser, trained to transport lineage priors toward extant sequences, with per-family ASR priors computed from multiple sequence alignments. Training and evaluation assets are built from Pfam families: the released pipeline uses cleaned per-family FASTA training sequences, ASR-derived prior files, per-position gap-rate files, and a family-filtering table, with a released checkpoint (lineageflow-rp55.ckpt) trained on RP55 representative sequences. Sampling proceeds over the simplex with a default schedule of roughly 100 base steps and 50 final steps; rerouting runs a small number of rounds (default 3) over a population (default size 8) at an intermediate time, mutating a fraction of positions (default 25%) and selecting by a fitness scorer.
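The mutate, select, and amplify loop can be sketched on plain sequences as follows. This is a simplified toy under stated assumptions, not the released implementation: in the paper the intervention happens at an intermediate sampling time, whereas this sketch operates on discrete strings. The `keep` parameter and the choice to retain parents as candidates are assumptions; the defaults for rounds (3) and mutation fraction (25%) follow the values quoted above, and the example uses a population of 8.

```python
import random
from typing import Callable, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq: str, fraction: float, rng: random.Random) -> str:
    """Resample a fraction of positions uniformly over the 20 amino acids.
    A resampled position may draw its original residue again."""
    n_sites = max(1, round(fraction * len(seq)))
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), n_sites):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def reroute(population: List[str],
            scorer: Callable[[str], float],
            rounds: int = 3,
            mutation_fraction: float = 0.25,
            keep: int = 2,
            rng: Optional[random.Random] = None) -> List[str]:
    """Toy mutate-select-amplify loop over a fixed-size population."""
    rng = rng or random.Random(0)
    population = list(population)
    size = len(population)
    if not 1 <= keep <= size:
        raise ValueError("keep must be between 1 and the population size")
    for _ in range(rounds):
        # Mutate: propose one variant per member; parents stay as candidates.
        candidates = population + [mutate(s, mutation_fraction, rng)
                                   for s in population]
        # Select: keep the top-scoring candidates.
        survivors = sorted(candidates, key=scorer, reverse=True)[:keep]
        # Amplify: copy survivors back up to the original population size.
        population = [survivors[i % keep] for i in range(size)]
    return population


if __name__ == "__main__":
    seed_rng = random.Random(1)
    start = ["".join(seed_rng.choice(AMINO_ACIDS) for _ in range(20))
             for _ in range(8)]
    # Placeholder objective: count of alanine residues.
    final = reroute(start, lambda s: s.count("A"))
    print(final[0])
```

Keeping parents among the candidates means the best score in the population never decreases from one round to the next, which makes the toy easy to reason about. Whether the actual method is elitist in this way is not stated in the text above.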

The authors evaluate generated sequences along four axes: family validity via profile-HMM scoring, foldability via mean OmegaFold pLDDT, self-consistency via ESM-IF (inverse-folding) perplexity, and novelty/diversity via MMseqs2-based clustering and identity statistics. Against flow- and diffusion-based baselines, LineageFlow reports improved structural confidence and family validity while retaining sequence diversity, supporting the claim that ancestral initialization yields higher-fidelity, family-faithful samples.
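As a rough illustration of the identity statistics mentioned for the diversity axis, the sketch below computes mean pairwise identity over a set of equal-length (for example, aligned) sequences. It is a simplified stand-in and an assumption on my part: the authors use MMseqs2, which performs its own alignment and clustering, and the function names here are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (align them first)")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_identity(seqs: Sequence[str]) -> float:
    """Average identity over all unordered pairs; lower means more diverse."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

A value near 1.0 would suggest consensus collapse, while a very low value alongside poor profile-HMM scores would suggest drift outside the family.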

Applications

LineageFlow targets protein engineering and design workflows where the goal is to generate novel but valid variants of a known protein family, for example expanding an enzyme or binder family with candidates that fold reliably and retain the family fold. Because rerouting accepts arbitrary fitness scorers, practitioners can bias generation toward measurable objectives such as predicted foldability or model-scored sequence likelihood without retraining. The Pfam-based pipeline makes the approach applicable to any family with sufficient alignment depth for ancestral reconstruction, and generated sequences can be passed to the downstream structure-prediction and inverse-folding tools used to triage candidates before wet-lab testing.

Impact

LineageFlow contributes to a growing line of work that replaces uninformative noise priors in generative sequence models with biologically structured initializations, here drawing on phylogenetics and ancestral sequence reconstruction. By demonstrating that lineage priors improve family validity and structural confidence over flow- and diffusion-based baselines, and by showing that a single intermediate-time rerouting step can guide generation cheaply, the work offers a practical recipe for family-aware protein design. Because the work is a recent preprint (accepted at ICML 2026), its real-world adoption and experimental validation remain to be established. Reported gains are purely computational (profile-HMM scores, OmegaFold pLDDT, ESM-IF perplexity, MMseqs2 diversity), and the method's dependence on alignment-derived ASR priors may limit its applicability to families with sparse or low-quality alignments.

Citation

LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation

Preprint

Liang, L., et al. (2026) LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation.

DOI: 10.48550/arXiv.2605.22252


Citations

Total Citations: 152

Influential: 21

References: 80

GitHub

Stars: 3

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 1mo ago

Language: Python


Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Overall: 64 (Partial)

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 26

Model Openness Framework: Unclassified (missing required components)

