Distributional Graphormer

Deep learning framework predicting equilibrium distributions of molecular systems, enabling efficient ensemble generation and conformation sampling.

Released: May 2024

Structure prediction — the determination of a protein's three-dimensional shape from its amino acid sequence — has been transformed by deep learning, with systems like AlphaFold 2 routinely producing single-structure predictions at near-experimental accuracy. But a static single structure is not the complete picture of a protein's behavior. Under physiological conditions, proteins exist as dynamic ensembles of conformations, fluctuating between functional states that are thermally accessible at body temperature. The properties that determine whether a drug binds, whether an enzyme is active, or whether a protein aggregates are governed not by any single conformation but by the statistical distribution over all accessible conformations — what statistical mechanics calls the Boltzmann distribution or equilibrium distribution. Computing this distribution accurately requires either extraordinarily long molecular dynamics simulations or extensive experimental characterization, both of which are impractical at the scale required for drug discovery and protein engineering.
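For reference, the Boltzmann distribution assigns each conformation a probability proportional to exp(-E / (k_B T)). The sketch below is a minimal illustration on a hypothetical three-state system. The state energies, the function name `boltzmann_populations`, and the 310 K temperature are assumptions chosen for illustration and are not taken from the DiG paper.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_populations(energies_kcal, temperature_k=310.0):
    """Equilibrium populations p_i proportional to exp(-E_i / (k_B T))."""
    beta = 1.0 / (K_B_KCAL * temperature_k)
    e_min = min(energies_kcal)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    partition_sum = sum(weights)
    return [w / partition_sum for w in weights]


# Hypothetical relative energies (kcal/mol) of three conformational states.
populations = boltzmann_populations([0.0, 1.0, 3.0])
```

Under these assumed energies the lowest-energy state dominates, yet the state 1 kcal/mol higher still holds a non-negligible share of the population, which is why a single structure can miss functionally relevant conformations.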

Distributional Graphormer (DiG), published in Nature Machine Intelligence in 2024 by Microsoft Research, reframes the protein structure problem from single-structure prediction to equilibrium distribution prediction. Rather than producing one structure, DiG generates a statistical ensemble of conformations that approximates the true thermodynamic equilibrium distribution for a given molecular system. This shift in objective — from point prediction to distributional prediction — is significant both scientifically and practically. Scientifically, it connects computational structure prediction to the thermodynamic foundation of molecular biology, acknowledging that function is a property of distributions rather than structures. Practically, it provides computational access to conformational variability information that was previously obtainable only through microsecond-scale molecular dynamics runs or ensemble-based experimental methods such as NMR relaxation or solution X-ray scattering.

DiG is inspired by the thermodynamic process of annealing, in which a disordered high-energy system gradually relaxes to a low-energy ordered state. The framework uses deep neural networks to simulate a diffusion process that progressively transforms a simple initial distribution (a Gaussian) into the complex, multi-modal equilibrium distribution of the target molecular system. Because samples from the learned distribution are generated independently and in parallel, rather than sequentially as in molecular dynamics, DiG can produce diverse conformational ensembles orders of magnitude faster than simulation-based approaches. A preprint of the work appeared in June 2023 (arXiv:2306.05445), ahead of the 2024 journal publication.

Key Features

Equilibrium distribution prediction: DiG predicts the full equilibrium distribution of molecular conformations rather than a single structure, capturing the thermally accessible conformational ensemble that governs macroscopic biological properties such as binding affinity, allostery, and enzyme kinetics.
Graphormer backbone: The diffusion process is parameterized by a Graphormer-based neural network — a transformer architecture with graph-structured molecular representations — that has demonstrated strong performance on molecular property prediction tasks and can generalize across protein families and small molecule classes.
Physics-informed diffusion pre-training (PIDP): For molecular systems where experimental or simulation structural data is scarce, DiG can be pre-trained using physics-based energy functions (force fields) rather than structural observations, allowing the model to learn physically reasonable conformational distributions even in data-poor regimes.
Independent parallel sampling: Unlike molecular dynamics, where each new conformation is generated sequentially from the previous one, DiG generates ensemble members independently in parallel, enabling efficient batch generation of diverse conformational snapshots on GPU hardware.
Multi-domain applicability: DiG is designed as a general molecular framework and has been demonstrated on protein conformation sampling, small-molecule ligand structure generation, catalyst-adsorbate geometry prediction, and property-guided structure optimization — covering both biomolecular and materials science applications.
Density estimation alongside sampling: In addition to generating samples from the equilibrium distribution, DiG provides a density function that assigns relative probabilities to conformations, enabling downstream calculations such as free energy differences between conformational states.
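The density-estimation feature rests on a standard statistical-mechanics relation: the free energy difference between two conformational states follows from the ratio of their equilibrium populations, F_A - F_B = -k_B T ln(p_A / p_B). The minimal sketch below estimates populations by counting state labels in a sampled ensemble. The labels, function names, and temperature are hypothetical, and a model-provided density could in principle replace the simple counting step.

```python
import math
from collections import Counter

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def state_populations(labels):
    """Fraction of ensemble members assigned to each state label."""
    counts = Counter(labels)
    total = len(labels)
    return {state: n / total for state, n in counts.items()}


def free_energy_difference(p_a, p_b, temperature_k=310.0):
    """Return F_A - F_B in kcal/mol from equilibrium populations p_a, p_b."""
    if p_a <= 0.0 or p_b <= 0.0:
        raise ValueError("populations must be positive")
    return -K_B_KCAL * temperature_k * math.log(p_a / p_b)


# Hypothetical state assignments for ten sampled conformations.
ensemble_labels = ["closed"] * 8 + ["open"] * 2
pops = state_populations(ensemble_labels)
delta_f_open_vs_closed = free_energy_difference(pops["open"], pops["closed"])
```

A positive value here means the open state is less populated, and therefore higher in free energy, than the closed state.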

Technical Details

DiG is built on the Graphormer architecture, which was originally developed for molecular property prediction on graph-structured representations of molecules. In Graphormer, atoms are represented as nodes and bonds as edges in a molecular graph, and transformer-style attention is applied with graph-aware bias terms that encode structural information such as shortest-path distances between atoms and edge features. For DiG, this molecular graph encoder serves as the backbone of the denoising network in a diffusion model framework.
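As a rough illustration of the graph-aware bias, the sketch below computes shortest-path distances on a small molecular graph and adds a per-distance scalar to the attention logits before the softmax. It is a toy single-head example: the bias values, the raw scores, and all function names are made up for illustration, whereas in the real model the bias is learned and combined with other encodings.

```python
import math
from collections import deque


def shortest_path_distances(num_atoms, bonds):
    """All-pairs shortest-path distances (in bonds) via BFS; -1 if unreachable."""
    neighbors = [[] for _ in range(num_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = [[-1] * num_atoms for _ in range(num_atoms)]
    for source in range(num_atoms):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[source][v] == -1:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


def biased_attention(scores, dist, bias_by_distance):
    """Row-wise softmax of scores[i][j] + bias_by_distance[dist[i][j]]."""
    weights = []
    for i, row in enumerate(scores):
        logits = [s + bias_by_distance[dist[i][j]] for j, s in enumerate(row)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Four atoms in a connected chain: 0-1-2-3.
spd = shortest_path_distances(4, [(0, 1), (1, 2), (2, 3)])
# Hypothetical bias values: topologically closer atoms get a larger bias.
bias = {0: 0.5, 1: 0.3, 2: 0.0, 3: -0.4}
attn = biased_attention([[0.0] * 4 for _ in range(4)], spd, bias)
```

With all raw scores set to zero, the attention weights depend only on graph distance, which isolates the effect of the structural bias.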

The diffusion process operates in continuous three-dimensional coordinate space: the forward process adds Gaussian noise to atomic coordinates over T steps, transforming a structured conformation into an isotropic Gaussian cloud. The reverse (generative) process is parameterized by the Graphormer backbone, which learns to denoise the corrupted coordinates conditioned on a descriptor of the molecular system — either a chemical graph for small molecules or a protein sequence for protein systems. By conditioning on the molecular descriptor rather than a single reference structure, DiG learns a distribution over conformations rather than a deterministic mapping from noise to a single structure.
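The forward noising step can be sketched as follows. This is a minimal sketch assuming a standard variance-preserving (DDPM-style) formulation with a linear noise schedule, in which a noised sample at step t is sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The schedule values, function names, and example coordinates are assumptions, and DiG's actual schedule and parameterization may differ.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_coordinates(coords, alpha_bar_t, rng):
    """Sample x_t given x_0 in closed form for a list of (x, y, z) tuples."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
        for atom in coords
    ]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
# Hypothetical three-atom conformation (coordinates in arbitrary units).
x0 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.3, 0.0)]
x_mid = noise_coordinates(x0, alpha_bars[499], rng)  # partially noised
x_end = noise_coordinates(x0, alpha_bars[-1], rng)   # nearly pure noise
```

By the final step almost none of the original signal remains, so the reverse process can start from a plain Gaussian sample and rely on the conditioning descriptor to recover plausible conformations.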

For proteins specifically, the conditioning input is the amino acid sequence, which identifies the protein without specifying any particular conformation. The model then generates diverse backbone conformations that are all consistent with that sequence, approximating the thermodynamic ensemble at physiological temperature. Training uses structural data from the PDB for proteins and from molecular dynamics trajectories and experimental databases for small molecules, supplemented by physics-informed pre-training with molecular mechanics force fields where data is limited.

The Physics-Informed Diffusion Pre-training (PIDP) approach is a notable technical contribution: rather than requiring a labeled dataset of (molecule, conformation) pairs, PIDP trains the diffusion model to generate conformations that have low energy under a specified force field. This effectively uses the force field as a teacher signal, allowing DiG to learn conformational preferences for molecular systems with no experimental structure data. Combined with fine-tuning on available structural data, PIDP enables DiG to generalize to new molecular systems more effectively than models trained on structural data alone.
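One way to see how a force field can act as a teacher signal without structural data: for a Boltzmann distribution p(x) proportional to exp(-E(x) / (k_B T)), the score (the gradient of log p) equals the negative energy gradient divided by k_B T, so an energy function alone fixes the quantity that a score-based diffusion model approximates at the data end of the process. The sketch below is not the PIDP training objective. It only evaluates this relation for a hypothetical one-dimensional harmonic bond term, and the force constant, equilibrium length, and function names are assumptions.

```python
K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def harmonic_energy(x, force_constant=100.0, x_eq=1.5):
    """Toy bond-stretch energy 0.5 * k * (x - x_eq)^2 in kcal/mol."""
    return 0.5 * force_constant * (x - x_eq) ** 2


def boltzmann_score(energy_fn, x, temperature_k=310.0, step=1e-5):
    """Score of the Boltzmann density: -dE/dx / (k_B T), via central difference."""
    gradient = (energy_fn(x + step) - energy_fn(x - step)) / (2.0 * step)
    return -gradient / (K_B_KCAL * temperature_k)


score_at_minimum = boltzmann_score(harmonic_energy, 1.5)
score_stretched = boltzmann_score(harmonic_energy, 1.6)
```

The score vanishes at the energy minimum and points back toward it when the bond is stretched, which is the direction a denoising step should move a distorted conformation.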

Benchmark evaluations demonstrated that DiG generates protein conformational ensembles with structural diversity and statistical properties consistent with long molecular dynamics trajectories, while requiring computation time orders of magnitude shorter. On protein conformation sampling benchmarks, DiG produces ensembles that capture the principal conformational transitions observed in MD simulations, including domain motions, loop flexibility, and side-chain rotamer distributions. For ligand structure sampling, DiG generates diverse low-energy conformers with coverage competitive with established conformer generation tools.

Applications

DiG's most immediate application in structural biology is protein conformational ensemble generation for drug discovery. Structure-based drug design campaigns traditionally work from a single representative protein structure, but binding site flexibility can cause a designed inhibitor to lose potency when the protein adopts alternative conformations in solution or in the cell. DiG enables rapid generation of representative conformational ensembles that can be used for ensemble docking — testing a drug candidate against multiple receptor conformations — reducing the false positive rate of structure-based virtual screens. For fragment-based drug discovery and lead optimization, DiG-generated ensembles can identify cryptic binding sites — pockets that only appear in minor conformational states — that would be invisible in a single static structure.
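Ensemble docking can be sketched as a simple aggregation over receptor conformations. The example below is a minimal sketch that keeps the best (lowest) score per ligand across the ensemble, which is only one of several possible aggregation rules (averages and Boltzmann-weighted scores are alternatives). The score values, ligand names, and function names are hypothetical, and the docking program that would produce the scores is not modeled.

```python
def ensemble_docking_score(scores_per_conformation):
    """Best (lowest) docking score across receptor conformations."""
    if not scores_per_conformation:
        raise ValueError("at least one conformation score is required")
    return min(scores_per_conformation)


def rank_ligands(score_table):
    """Ligand names sorted by best score over the ensemble, lowest first."""
    best = {name: ensemble_docking_score(s) for name, s in score_table.items()}
    return sorted(best, key=best.get)


# Hypothetical docking scores (lower is better, arbitrary units),
# one value per receptor conformation in a three-member ensemble.
score_table = {
    "ligand_a": [-6.1, -7.9, -5.4],
    "ligand_b": [-7.0, -7.2, -6.8],
    "ligand_c": [-5.0, -5.5, -9.1],
}
ranking = rank_ligands(score_table)
single_structure_ranking = sorted(score_table, key=lambda n: score_table[n][0])
```

In this toy table, `ligand_c` ranks last against the first conformation alone but first once the whole ensemble is considered, because it scores well only against a conformation that a single static structure would not include.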

In protein engineering, conformational ensemble prediction is valuable for assessing whether an engineered mutation preserves the functional conformational dynamics of the wild-type protein or disrupts them. Researchers designing thermostable enzymes, for example, can use DiG to check whether stabilizing mutations inappropriately rigidify the active site in a way that might impair catalytic function even while improving thermal stability. For allostery-based drug discovery — targeting proteins at sites distant from the active site — DiG's ability to model correlated conformational changes between spatially separated regions provides information that single-structure methods cannot capture. The framework also supports materials science applications: the catalyst-adsorbate sampling capability is relevant for the computational screening of heterogeneous catalysts, where the relevant binding geometry is not a single adsorption pose but a distribution over adsorption configurations.

Impact

Distributional Graphormer represents a conceptual advance in how the protein structure prediction community frames its objectives. The argument that single-structure prediction is an incomplete answer to the structure-function problem is not new — the molecular dynamics community has made this argument for decades — but DiG provides the first deep learning framework that directly targets equilibrium distribution prediction at a scale and speed that is practical for drug discovery and protein engineering workflows. Publication in Nature Machine Intelligence reflects the interdisciplinary significance of the work, bridging the machine learning and structural biology communities around a shared problem formulation. The physics-informed pre-training approach is particularly valuable as it extends the framework to the large fraction of molecular systems that lack extensive experimental or simulation structural data.

Key limitations of the current DiG framework include the difficulty of recovering rare, high-energy conformational states. These states are thermodynamically accessible but carry little weight in the equilibrium distribution, so a diffusion model trained to reproduce the bulk of that distribution tends to undersample them, even though they are often biologically important (for example, the open active-site conformation of a normally closed enzyme). Additionally, the accuracy of DiG's equilibrium distributions depends on the quality and coverage of the training data; for protein families with few known structures or limited MD trajectory data, the generated ensembles may not accurately reflect true conformational behavior. The Graphormer backbone also inherits the computational cost of transformer attention over molecular graphs, which may limit applicability to very large protein systems or complexes without further architectural optimization.

Citation

Zheng, S., et al. (2024) Predicting equilibrium distributions for molecular systems with deep learning. Nature Machine Intelligence.

DOI: 10.1038/s42256-024-00837-3

Recent citations

Papers that recently cited this model.

TPS-Flow: Physics-Guided Flow-Based Generative Modeling of Protein Transition Paths
Kai Xu, Likun Zhao, Yanan Tian, et al.
Journal of Chemical Information and Modeling · Jul 2026
Citations: 0

Knowledge Distillation of a Protein Language Model Yields a Foundational Implicit Solvent Model
J. Airas, Bin Zhang
Journal of Chemical Theory and Computation · Jul 2026
Citations: 0

SteerAF: Distogram-based Steering of AlphaFold2 toward Alternative Conformations
Jiajun Tang, Zefeng Zhu, Song Yang, et al.
bioRxiv · Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

Scalable emulation of protein equilibrium ensembles with generative deep learning
Sarah Lewis, Tim Hempel, José Jiménez-Luna, et al.
bioRxiv · Feb 2025
Citations: 293 · Influential

Structure prediction of alternative protein conformations
P. Bryant
bioRxiv · Sep 2023
Citations: 90

Transferable Boltzmann Generators
Leon Klein, Frank Noé
Neural Information Processing Systems · Jun 2024
Citations: 53

Biophysics-based protein language models for protein engineering
Sam Gelman, B. Johnson, Chase R. Freschlin, et al.
Nature Methods · Sep 2025
Citations: 48

Predicting protein conformational motions using energetic frustration analysis and AlphaFold2
Xingyue Guan, Qian-Yuan Tang, Weitong Ren, et al.
Proceedings of the National Academy of Sciences of the United States of America · Aug 2024
Citations: 47

Citations

Total Citations: 158
Influential: 5
References: 33

GitHub

Stars: 2.5K
Forks: 375
Open Issues: 100
Contributors: 14
Last Push: 1mo ago
Language: Python
License: MIT

Fields of citing research

Computer Science: 92%
Biology: 69%
Medicine: 53%
Chemistry: 29%
Physics: 24%
Materials Science: 8%
Engineering: 5%
Mathematics: 4%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 46 (Partial)
Usability (can I run it?): 66
Reproducibility (can I retrain it?): 19
Model Openness Framework: Unclassified (missing required components)
