AlphaFlow

Protein conformational ensemble generator that fine-tunes AlphaFold 2 with flow matching, sampling protein dynamics beyond a single static structure.

Released: February 2024

AlphaFlow is a method for generating protein conformational ensembles by repurposing AlphaFold 2 — originally a deterministic single-state structure predictor — as a flow-based generative model. Developed by Bowen Jing, Bonnie Berger, and Tommi Jaakkola at MIT CSAIL and published at ICML 2024 as an oral presentation, AlphaFlow addresses a fundamental limitation of structure prediction systems: proteins are not rigid objects but dynamic molecules that visit multiple conformations, and those conformational transitions are central to function. Enzyme catalysis, allosteric regulation, signal transduction, and molecular recognition all depend on protein flexibility that a single predicted structure cannot capture.

Traditional computational approaches to conformational sampling, primarily molecular dynamics (MD) simulation, require substantial computational resources and long simulation times to adequately sample equilibrium distributions for even moderately sized proteins. AlphaFold 2 itself offers a partial workaround — subsampling the multiple sequence alignment (MSA) to generate diverse structures — but this approach is limited in the physical realism of the resulting ensembles and does not formally correspond to equilibrium sampling. AlphaFlow takes a different route: it fine-tunes AlphaFold's weights directly under a flow matching training objective, transforming the deterministic predictor into a sequence-conditioned generative model that can draw statistically meaningful samples from the protein conformational distribution.

The same framework was applied to ESMFold to produce a companion model called ESMFlow, which requires no MSA at inference and can therefore generate ensembles for sequences with no known homologs, at substantially lower computational cost than AlphaFlow. Both models are available through the open-source repository with pretrained weights for models trained on the Protein Data Bank and on all-atom MD trajectories from the ATLAS dataset.

Key Features

Flow matching fine-tuning: Adapts AlphaFold's pretrained weights under a flow matching objective that defines a probability path from a prior distribution of noisy coordinates to the target distribution of folded conformations, enabling principled sampling of multiple structures per sequence.
Two training regimes: PDB-trained models learn the distribution of experimentally observed structures (capturing crystallographic and cryo-EM diversity); MD-trained models additionally learn from all-atom molecular dynamics trajectories, capturing thermal fluctuations and transient states at physiological temperatures.
ESMFlow variant: An ESMFold-based counterpart that requires no MSA input, enabling conformational ensemble generation for orphan sequences or in MSA-scarce regimes where AlphaFold's performance would be degraded.
Ensemble quality metrics: Generated ensembles are evaluated against MD ground truth using per-residue RMSF correlations, mean absolute error in pairwise RMSD, and Wasserstein-2 distance of principal component projections. These metrics reflect distributional accuracy beyond single-structure RMSD.
Superior precision-diversity balance: Outperforms AlphaFold with MSA subsampling (a common heuristic for ensemble generation) in jointly achieving high structural accuracy and meaningful conformational diversity on held-out test proteins.
Open weights and code: Pretrained model weights and training code are publicly available, enabling the community to fine-tune on custom MD datasets for specific protein families.
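
As an illustration of the per-residue RMSF correlation named in the metrics above, the following is a minimal sketch and not the evaluation code from the AlphaFlow repository. It assumes both ensembles are NumPy arrays of shape (n_frames, n_residues, 3) that have already been superimposed on a common reference; the function names and the synthetic flexibility profile are hypothetical.

```python
import numpy as np

def rmsf(ensemble: np.ndarray) -> np.ndarray:
    """Per-residue root-mean-square fluctuation.

    ensemble: (n_frames, n_residues, 3) coordinates, assumed pre-aligned.
    """
    mean_pos = ensemble.mean(axis=0)                    # (n_residues, 3)
    sq_dev = ((ensemble - mean_pos) ** 2).sum(axis=-1)  # (n_frames, n_residues)
    return np.sqrt(sq_dev.mean(axis=0))                 # (n_residues,)

def rmsf_pearson(generated: np.ndarray, reference: np.ndarray) -> float:
    """Pearson correlation between the RMSF profiles of two ensembles."""
    return float(np.corrcoef(rmsf(generated), rmsf(reference))[0, 1])

# Synthetic data only: residues fluctuate with a hypothetical per-residue scale.
rng = np.random.default_rng(0)
n_res = 50
scale = np.linspace(0.2, 2.0, n_res)
reference = rng.normal(size=(200, n_res, 3)) * scale[None, :, None]
generated = rng.normal(size=(200, n_res, 3)) * scale[None, :, None]

r = rmsf_pearson(generated, reference)
print(f"RMSF Pearson r = {r:.3f}")
```

A high correlation here only says that the two ensembles agree on which residues are flexible; it does not compare the absolute magnitude of the fluctuations, which is why the other distributional metrics are reported alongside it.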

Technical Details

AlphaFlow is built on AlphaFold 2's 93-million-parameter architecture comprising the Evoformer stack and Structure Module. The flow matching framework defines a conditional probability path between a prior distribution (Gaussian-noised coordinates) and the data distribution (real protein conformations). During training, the model is presented with protein structures at intermediate noise levels and learns to predict the velocity field that denoises coordinates toward real structures. Fine-tuning starts from the AlphaFold 2 checkpoint, adapting the weights under this generative objective while preserving the evolutionary and structural representations learned during the original training.
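
The training idea described above can be sketched with a generic conditional flow matching example. This is an assumption-laden toy, not AlphaFlow's training code: it uses a straight-line path between a plain Gaussian prior sample and a data structure, a mean-squared-error loss on the velocity, and hypothetical function names. It also shows how a network that outputs a denoised structure implies a velocity under that same linear path.

```python
import numpy as np

def make_training_example(x1: np.ndarray, rng: np.random.Generator):
    """Build one (x_t, t, target velocity) triple for a linear probability path.

    x1: (n_residues, 3) coordinates of a real conformation.
    """
    x0 = rng.normal(size=x1.shape)   # prior sample (Gaussian stand-in)
    t = rng.uniform(0.0, 1.0)        # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1    # point on the path from x0 to x1
    target_velocity = x1 - x0        # d(x_t)/dt for the linear path
    return x_t, t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity) -> float:
    """Mean-squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def velocity_from_denoised(x1_hat, x_t, t):
    """Velocity implied by a denoised-structure prediction on the linear path."""
    return (x1_hat - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 3))         # hypothetical 8-residue "structure"
x_t, t, v = make_training_example(x1, rng)

# A perfect denoiser (returns x1 itself) recovers the target velocity.
loss_perfect = flow_matching_loss(velocity_from_denoised(x1, x_t, t), v)
# A trivial predictor that always outputs zero velocity does not.
loss_zero = flow_matching_loss(np.zeros_like(v), v)
print(loss_perfect, loss_zero)
```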

Two versions of the model exist: one trained on the PDB alone, and one further fine-tuned on the ATLAS dataset of all-atom MD simulations of monomeric proteins at 300 K (AlphaFlow-MD). When evaluated on proteins structurally dissimilar from the ATLAS training set, AlphaFlow-MD substantially outperforms both MSA subsampling approaches and AlphaFold with random seeds in residue-wise RMSF Pearson correlation, capturing biologically meaningful flexibility patterns that static predictors miss. The model samples multiple conformations per sequence by drawing independent noise samples at inference and integrating the learned flow for each one; every integration step is a forward pass through the network, so a sample costs several AlphaFold-scale evaluations rather than one. Sampling runs on GPU hardware comparable to standard AlphaFold inference, with ensemble generation feasible in minutes for typical-sized proteins.
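
One common way to draw samples from a flow model is to start from noise and integrate the learned velocity field with fixed-size Euler steps. The sketch below illustrates that generic procedure under stated assumptions; it is not the AlphaFlow sampler. The `toy_denoiser` is a hypothetical stand-in for the network that always returns a fixed target structure, and the step count of 10 is an arbitrary choice for the example.

```python
import numpy as np

def euler_sample(denoiser, shape, n_steps: int, rng: np.random.Generator):
    """Integrate a flow from noise (t=0) to data (t=1) with Euler steps.

    denoiser(x_t, t) must return an estimate of the clean structure x1.
    """
    x = rng.normal(size=shape)                  # start from a prior sample
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x1_hat = denoiser(x, t)                 # one network evaluation per step
        velocity = (x1_hat - x) / (1.0 - t)     # linear-path velocity estimate
        x = x + (t_next - t) * velocity         # Euler update
    return x

# Hypothetical stand-in for the trained network: always predicts `target`.
target = np.arange(30, dtype=float).reshape(10, 3)

def toy_denoiser(x_t, t):
    return target

rng = np.random.default_rng(0)
ensemble = np.stack(
    [euler_sample(toy_denoiser, target.shape, n_steps=10, rng=rng) for _ in range(4)]
)
print(ensemble.shape)
```

Because the toy denoiser ignores its input, every sample collapses onto the same structure. A trained model's prediction depends on the noisy input, so different noise draws would generally lead to different conformations.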

Applications

AlphaFlow is primarily useful in contexts where protein dynamics, not just a single representative structure, are the object of interest. Structural biologists use it to generate conformational priors that aid interpretation of cryo-EM datasets where multiple conformational states co-exist in the particle pool. Biophysicists and computational chemists use AlphaFlow-generated ensembles as starting points for more targeted MD simulations, reducing the cost of equilibration by initializing from physically realistic conformations. Drug discovery teams studying intrinsically disordered proteins, allosteric modulators, or proteins with flexible binding sites benefit from ensemble predictions that reveal cryptic pockets absent in crystal structures. For proteins with no homologs or sparse MSAs, ESMFlow provides conformational sampling without the MSA requirement that AlphaFold demands, extending ensemble modeling to poorly characterized sequence space. Researchers benchmarking or developing new MD force fields can use AlphaFlow ensembles as a fast computational reference for ensemble quality comparison.

Impact

AlphaFlow demonstrated that a deterministic structure predictor of AlphaFold's accuracy could be successfully repurposed as a generative model through flow matching fine-tuning, establishing a methodological blueprint subsequently adopted for related tasks in structural biology. Its ICML 2024 oral presentation status reflects the community's recognition of this contribution at the intersection of machine learning and computational biology. The AlphaFlow paper inspired direct follow-up work, including AlphaFlow-Lit (arXiv 2407.12053), which achieves approximately 47-fold faster sampling by fine-tuning only the lightweight Structure Module while freezing the Evoformer, making ensemble generation substantially more tractable for high-throughput applications. A key limitation is that AlphaFlow's conformational diversity depends on the training ensemble distribution — if the training MD data does not cover certain slow conformational transitions (e.g., large-scale domain motions or disordered-to-ordered transitions), the generated ensembles will not capture those states. Additionally, like AlphaFold, AlphaFlow does not natively model ligand-bound conformational changes or membrane environments.

Citation

AlphaFold Meets Flow Matching for Generating Protein Ensembles

Preprint

Jing, B., Berger, B., & Jaakkola, T. (2024). AlphaFold Meets Flow Matching for Generating Protein Ensembles. International Conference on Machine Learning.

DOI: 10.48550/arXiv.2402.04845

Recent citations

Papers that recently cited this model.

RINAMI: Residue‐attributed interpretable neural network for predicting absolute folding free energy by merging structure and sequence information
Naoki Tomita, G. Chikenji
Protein Science · Jul 2026
Citations: 0

Spectral Diffusion for Protein Dynamics
Hew Phipps, M. Cagiada, S. Villalba, et al.
Jul 2026
Citations: 0

AquaGen: Scaling generative models to molecular dynamics precision on thousands of atoms
Emmanuel Bengio, Sanjeev Raja, Yui Tik Pang, et al.
Jul 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Scalable emulation of protein equilibrium ensembles with generative deep learning
Sarah Lewis, Tim Hempel, José Jiménez-Luna, et al.
bioRxiv · Feb 2025
Citations: 293

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Yiyang Ma, Xingchao Liu, Xiaokang Chen, et al.
Computer Vision and Pattern Recognition · Nov 2024
Citations: 138

Generative Modeling of Molecular Dynamics Trajectories
Bowen Jing, Hannes Stärk, T. Jaakkola, et al.
Neural Information Processing Systems · Sep 2024
Citations: 75 · Influential

Diffusion models in protein structure and docking
Jason Yim, Hannes Stärk, Gabriele Corso, et al.
WIREs Computational Molecular Science · Mar 2024
Citations: 72

AlphaFold3 versus experimental structures: assessment of the accuracy in ligand-bound G protein-coupled receptors
Xin-heng He, Jun-Rui Li, Shi-yi Shen, et al.
Acta Pharmacologica Sinica · Dec 2024
Citations: 69

Citations

Total Citations: 259
Influential: 31
References: 66

GitHub

Stars: 536
Forks: 80
Open Issues: 31
Contributors: 8
Last Push: 4mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 4mo ago

Fields of citing research

Computer Science: 94%
Biology: 77%
Medicine: 34%
Chemistry: 25%
Physics: 14%
Materials Science: 6%
Mathematics: 5%
Engineering: 3%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 79 · Open
Usability (can I run it?): 99
Reproducibility (can I retrain it?): 66

Model Openness Framework: Class III · Open Model

Resources

GitHub Repository · Research Paper · HuggingFace Model
