bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein

LineageFlow

Griffith University / Monash University / Fudan University

A Dirichlet flow-matching model that generates family-aware protein sequences by initializing from ancestral-reconstruction lineage priors rather than random noise.

Released: May 2026

LineageFlow is a generative model for protein sequence design. It aims to produce sequences that are biophysically plausible and recognizable members of a target protein family while still exploring meaningful within-family diversity. Introduced in a 2026 preprint (arXiv:2605.22252, accepted at ICML 2026) by researchers at Griffith University, Monash University, and Fudan University, it targets a common limitation of diffusion- and flow-based sequence generators: when sampling starts from uninformative random noise, outputs tend either to drift outside the family or to collapse toward over-conserved consensus sequences.

The model's central idea is to replace the random starting distribution with a lineage prior derived from ancestral sequence reconstruction (ASR). Rather than denoising from scratch, LineageFlow begins from a phylogeny-informed ancestral distribution and transports it toward the space of extant (present-day) sequences using a shared Dirichlet flow-matching denoiser. This grounds generation in an evolved scaffold, so the model performs structured mutation from an evolutionarily reasonable starting point instead of inventing a family from noise.
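
The contrast between a noise start and a lineage-prior start can be illustrated with a small sketch. It assumes the ASR prior is available as per-position residue probabilities and draws the starting point from a Dirichlet centred on them; the function names, the `concentration` knob, and the `floor` value are hypothetical and are not taken from the paper or its code.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def noise_start(length, rng):
    """Uninformative start: a flat Dirichlet(1, ..., 1) at every position."""
    flat = [1.0] * len(AMINO_ACIDS)
    return [sample_dirichlet(flat, rng) for _ in range(length)]

def lineage_start(asr_posteriors, rng, concentration=50.0, floor=1e-3):
    """Informed start: a Dirichlet centred on each position's ASR posterior.

    `concentration` and `floor` are illustrative assumptions, not values
    from the LineageFlow release.
    """
    start = []
    for probs in asr_posteriors:
        # Floor the probabilities so every Dirichlet parameter stays positive.
        alphas = [concentration * max(p, floor) for p in probs]
        start.append(sample_dirichlet(alphas, rng))
    return start

def argmax_sequence(points):
    """Decode simplex points to the most probable residue per position."""
    return "".join(
        AMINO_ACIDS[max(range(len(p)), key=p.__getitem__)] for p in points
    )
```

With a sharply peaked posterior (for example 0.95 on the reconstructed residue at each position), decoding `lineage_start(...)` with `argmax_sequence` will almost always return the ancestor itself, whereas decoding `noise_start(...)` returns an essentially arbitrary sequence. That difference is the sense in which the denoiser starts from an evolved scaffold rather than from scratch.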

LineageFlow also introduces "rerouting," a single intermediate-time mutate–select–amplify intervention that steers sampling toward a user-defined objective without requiring per-step predictor guidance. This makes objective-aware generation lightweight while preserving the family-validity benefits of the lineage prior.
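
A mutate-select-amplify loop of this general kind can be sketched as follows. This is a simplified illustration on discrete sequences, not the authors' implementation: the real intervention is applied to intermediate sampler states, and the uniform mutation rule, the `keep` parameter, and the function names here are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, fraction, rng):
    """Resample a random `fraction` of positions uniformly over amino acids."""
    n_mut = max(1, round(fraction * len(seq)))
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)

def reroute(population, fitness, rng, rounds=3, mutate_fraction=0.25, keep=2):
    """Toy mutate-select-amplify over a fixed-size population.

    `keep` (number of survivors per round) is a hypothetical parameter.
    """
    population = list(population)
    size = len(population)
    for _ in range(rounds):
        # Mutate: propose one variant per member; parents stay as candidates.
        candidates = population + [
            mutate(s, mutate_fraction, rng) for s in population
        ]
        # Select: keep the highest-scoring candidates.
        survivors = sorted(candidates, key=fitness, reverse=True)[:keep]
        # Amplify: copy survivors round-robin back up to the original size.
        population = [survivors[i % keep] for i in range(size)]
    return population
```

Because parents remain in the candidate pool, the best fitness in the population never decreases from one round to the next. The scorer is called only inside this loop, which is what distinguishes the approach from guidance schemes that query a predictor at every sampling step.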

#Key Features

  • Lineage-prior initialization: Generation starts from ancestral-reconstruction priors rather than random noise, anchoring samples to an evolved scaffold and improving family validity.
  • Dirichlet flow matching: A flow-matching denoiser operating on the probability simplex over amino acids transports the ancestral prior toward extant sequence distributions.
  • Rerouting for objective-guided sampling: A single intermediate-time mutate–select–amplify step (population-based) steers sequences toward a target objective without per-step gradient or predictor guidance.
  • Family-aware diversity: Produces sequences that remain recognizable family members while exploring within-family variation, avoiding both family drift and consensus collapse.
  • Pluggable fitness scorers: Rerouting can be guided by ESM2-150M masked-marginal scores, prior likelihood, or lightweight heuristics, allowing different design objectives to be plugged in.
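
The pluggable-scorer idea in the last bullet can be pictured as a single "sequence in, number out" interface. The sketch below is an assumption about how such an interface might look, not the released API: it implements a prior-likelihood scorer and one toy heuristic, and an ESM2-based masked-marginal scorer could be wrapped behind the same signature.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: any function from a sequence to a score
# (higher is better) can steer rerouting.
Scorer = Callable[[str], float]

def prior_likelihood_scorer(
    prior: List[Dict[str, float]], floor: float = 1e-6
) -> Scorer:
    """Build a scorer returning the log-likelihood under a per-position prior."""
    def score(seq: str) -> float:
        if len(seq) != len(prior):
            raise ValueError("sequence length must match the prior")
        # Floor unseen residues so the logarithm stays finite.
        return sum(
            math.log(max(column.get(aa, 0.0), floor))
            for aa, column in zip(seq, prior)
        )
    return score

def low_complexity_penalty(seq: str) -> float:
    """Toy heuristic: penalize the share taken by the most frequent residue."""
    most_common = max(seq.count(aa) for aa in set(seq))
    return -most_common / len(seq)
```

Swapping the design objective then amounts to passing a different `Scorer` to the rerouting step, with no change to the sampler or the trained denoiser.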

#Technical Details

LineageFlow couples a shared flow-matching denoiser, trained to transport lineage priors toward extant sequences, with per-family ASR priors computed from multiple sequence alignments. Training and evaluation assets are built from Pfam families. The released pipeline uses cleaned per-family FASTA training sequences, ASR-derived prior files, per-position gap-rate files, and a family-filtering table, and it ships a checkpoint (lineageflow-rp55.ckpt) trained on RP55 representative sequences. Sampling proceeds over the simplex with a default schedule of roughly 100 base steps and 50 final steps. Rerouting runs at an intermediate time for a small number of rounds (default 3) over a population (default size 8); each round mutates a fraction of positions (default 25%) and selects by a fitness scorer.
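
To make the defaults concrete, they can be collected in a small configuration object. The field and method names below are hypothetical and do not mirror the released code; the numeric defaults are the ones quoted above, and the assumption that each population member is scored once per round is made only for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    """Defaults as described in the text; field names are illustrative."""
    base_steps: int = 100
    final_steps: int = 50
    reroute_rounds: int = 3
    population_size: int = 8
    mutate_fraction: float = 0.25

    def positions_mutated(self, length: int) -> int:
        """Number of positions touched per mutation for a given length."""
        return round(self.mutate_fraction * length)

    def candidate_evaluations(self) -> int:
        """Scorer calls per sample, assuming one call per member per round."""
        return self.reroute_rounds * self.population_size
```

Under these assumptions, a 120-residue family member would have 30 positions mutated per round, and the fitness scorer would be called 24 times per sample in total, independent of the number of sampling steps.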

The authors evaluate generated sequences along four axes: family validity via profile-HMM scoring, foldability via mean OmegaFold pLDDT, self-consistency via ESM-IF (inverse-folding) perplexity, and novelty/diversity via MMseqs2-based clustering and identity statistics. Against flow- and diffusion-based baselines, they report improved structural confidence and family validity with sequence diversity retained, which they present as support for the claim that ancestral initialization yields higher-fidelity, family-faithful samples.
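
The identity statistics behind the novelty/diversity axis can be illustrated with a simplified stand-in. The sketch below computes pairwise identity on pre-aligned, equal-length sequences in pure Python; it is not MMseqs2 and does not reproduce the authors' exact metric definitions, and the function names are chosen for this example.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Fraction of matching columns, ignoring columns gapped in both."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)

def diversity_stats(sequences):
    """Mean and max pairwise identity within a generated set (lower mean = more diverse)."""
    ids = [identity(a, b) for a, b in combinations(sequences, 2)]
    return {
        "mean_pairwise_identity": sum(ids) / len(ids),
        "max_pairwise_identity": max(ids),
    }

def nearest_training_identity(generated: str, training) -> float:
    """Highest identity to any training sequence (lower = more novel)."""
    return max(identity(generated, t) for t in training)
```

A generated set that scores well on family validity but shows a mean pairwise identity close to 1.0 would indicate the consensus collapse the model is meant to avoid, which is why validity and diversity are reported together.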

#Applications

LineageFlow targets protein engineering and design workflows where the goal is to generate novel-but-valid variants of a known protein family, for example expanding an enzyme or binder family with candidates that fold reliably and retain the family fold. Because rerouting accepts arbitrary fitness scorers, practitioners can bias generation toward measurable objectives such as predicted foldability or model-scored sequence likelihood without retraining. The Pfam-based pipeline makes the approach applicable to any family with sufficient alignment depth for ancestral reconstruction. Generated sequences can be passed directly to the downstream structure-prediction and inverse-folding tools used to triage candidates before wet-lab testing.

#Impact

LineageFlow contributes to a growing line of work that replaces uninformative noise priors in generative sequence models with biologically structured initializations, here drawing on phylogenetics and ancestral sequence reconstruction. It reports that lineage priors improve family validity and structural confidence over flow- and diffusion-based baselines, and that a single intermediate-time rerouting step can guide generation cheaply, which together amount to a practical recipe for family-aware protein design. The work is a recent preprint (accepted at ICML 2026), so real-world adoption and experimental validation remain to be established. All reported gains are computational (profile-HMM, OmegaFold pLDDT, ESM-IF perplexity, MMseqs2 diversity), and the method's dependence on alignment-derived ASR priors may limit applicability to families with sparse or low-quality alignments.

Citation

Preprint

DOI: 10.48550/arXiv.2605.22252

Openness

Unclassified
Missing required components

Tags

flow_matching, generative_model, protein_design, protein_engineering, protein_family, protein_sequence, protein_sequence_generation, transformer

Resources

GitHub Repository, Research Paper