BioEmu-1

Generative model that emulates protein equilibrium ensembles, sampling cryptic pockets and unfolded states far faster than molecular dynamics.

Released: December 2024

BioEmu-1 is a biomolecular emulator developed by Microsoft Research that addresses a longstanding challenge in structural biology. Proteins are not static entities but dynamic ensembles of conformations, and sampling that equilibrium distribution with conventional simulation methods is prohibitively expensive at scale. Rather than predicting a single lowest-energy structure, BioEmu-1 generates diverse conformational ensembles intended to approximate the equilibrium distribution of a protein in solution, including cryptic pockets, partially unfolded states, and large-scale domain rearrangements.

The model achieves this by training on three complementary data sources: static structures from the AlphaFold Database, over 200 milliseconds of cumulative molecular dynamics (MD) simulation data spanning thousands of proteins, and experimental protein folding stability measurements. This multi-source training strategy allows BioEmu-1 to learn both the geometric diversity of protein conformational space and the thermodynamic weighting of states within it. Released as a preprint in December 2024, BioEmu-1 represents a significant step toward replacing or augmenting traditional MD simulations for routine research tasks.

Key Features

Ultra-fast ensemble sampling: Generates over 1,000 statistically independent protein conformations per hour on a single GPU, approximately 100,000 times faster than conventional molecular dynamics simulations.
Thermodynamic accuracy: Predicts relative free energies with approximately 1 kcal/mol accuracy compared to millisecond-scale MD simulations and experimental folding stability data.
Functional motion capture: Samples diverse conformational changes including cryptic pocket formation, local unfolding events, and large-scale domain rearrangements that are invisible to static structure prediction.
Multi-source training: Integrates AlphaFold Database structures, 200+ milliseconds of MD trajectories (reweighted using Markov State Models for proper equilibrium distributions), and experimental stability measurements.
Mechanistic interpretability: Provides per-conformation structural data that can reveal causes of mutant destabilization, allosteric pathways, and binding site dynamics.
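
The thermodynamic-accuracy figure above is easier to interpret with the conversion from ensemble populations to a free energy in hand. The following is a minimal sketch, assuming each sampled conformation has already been labeled folded or unfolded. The labeling criterion, the function name `folding_free_energy`, and the 900/100 split are illustrative assumptions, not part of BioEmu-1.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def folding_free_energy(is_folded, temperature_k=300.0):
    """Estimate dG_fold = -RT * ln(p_folded / p_unfolded) from ensemble labels.

    `is_folded` is a sequence of booleans, one per sampled conformation.
    A negative result means the folded state is favored.
    """
    n_folded = sum(1 for flag in is_folded if flag)
    n_unfolded = len(is_folded) - n_folded
    if n_folded == 0 or n_unfolded == 0:
        raise ValueError("need at least one folded and one unfolded sample")
    return -R_KCAL * temperature_k * math.log(n_folded / n_unfolded)


# Hypothetical ensemble of 1,000 samples: 900 folded, 100 unfolded.
labels = [True] * 900 + [False] * 100
dg_fold = folding_free_energy(labels)  # about -1.31 kcal/mol at 300 K
```

At 300 K, RT is about 0.6 kcal/mol, so an error of 1 kcal/mol in a free energy corresponds to roughly a fivefold error in the folded-to-unfolded population ratio.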

Technical Details

BioEmu-1 is a diffusion-based generative model whose architecture builds on DiG (Distributional Graphormer). It is trained with denoising score matching to learn the equilibrium distribution of protein conformations. Sequence information conditions the generative process, and each structure is generated in 30 to 50 denoising steps.

Training followed a multi-stage protocol. In the pretraining phase, the model was trained with denoising score matching on flexible protein structures from the AlphaFold Database. Fine-tuning then proceeded in two parallel tracks: additional denoising score matching on MD trajectories, and Property Prediction Fine-Tuning (PPFT) to align predicted ensemble thermodynamics with experimental folding free energies. MD trajectories were reweighted using Markov State Models to ensure the sampled conformations reflect true equilibrium populations rather than kinetic artifacts. The resulting model runs efficiently on single-GPU hardware, making ensemble generation accessible without high-performance computing infrastructure.
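
The reweighting step described above can be illustrated with a toy Markov State Model. This is a minimal sketch, assuming the trajectories have already been discretized into states and using a plain row-normalized transition matrix with power iteration. It is not the authors' pipeline, and the function names and the four short trajectories are hypothetical.

```python
from collections import Counter


def transition_matrix(dtrajs, n_states, lag=1):
    """Row-normalized transition counts at the given lag time."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in dtrajs:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("a state has no outgoing transitions")
        matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(matrix, n_iter=10000, tol=1e-12):
    """Power iteration for pi satisfying pi = pi @ T."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def frame_weights(dtrajs, pi):
    """Weight each frame so that state s carries total weight pi[s]."""
    counts = Counter(s for traj in dtrajs for s in traj)
    return [[pi[s] / counts[s] for s in traj] for traj in dtrajs]


# Hypothetical data: state 0 = folded, state 1 = unfolded. Three of the four
# short trajectories start unfolded, so raw frame counts over-represent state 1.
dtrajs = [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [1, 0, 0, 0, 0]]
T = transition_matrix(dtrajs, n_states=2)
pi = stationary_distribution(T)        # close to [0.9, 0.1]
weights = frame_weights(dtrajs, pi)
raw_unfolded = sum(t.count(1) for t in dtrajs) / sum(len(t) for t in dtrajs)  # 0.25
```

In this toy case the raw frames are 25% unfolded, but the estimated equilibrium population is about 10%, which is the kind of starting-state bias that reweighting is meant to remove.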

Applications

BioEmu-1 is particularly valuable in drug discovery contexts where transient or cryptic binding pockets must be identified. Such pockets appear only in specific conformational states and are missed by single-structure prediction tools like AlphaFold 2. Medicinal chemists can screen protein-ligand interactions across conformational ensembles to estimate affinity variation, and allosteric sites can be exposed through conformational clustering analysis. In protein engineering, BioEmu-1 supports rational design by predicting how mutations affect the stability landscape and the sampling of functional states, enabling prioritization of variants before expensive wet-lab characterization. Structural biologists and computational researchers can use the model to generate hypotheses about folding mechanisms, study domain motion in multi-domain proteins, and benchmark or supplement traditional MD force fields.
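
One simple way to look for a state-dependent pocket in a sampled ensemble is to track a distance between two atoms that line the candidate site and ask how often it opens. The sketch below assumes the ensemble is available as plain coordinate lists. The function name `open_fraction`, the atom indices, the 10 Å cutoff, and the toy coordinates are all hypothetical, and a real analysis would use a structure-analysis library and a proper pocket-detection method.

```python
import math


def open_fraction(ensemble, i, j, cutoff):
    """Fraction of conformations in which atoms i and j are farther apart than cutoff.

    `ensemble` is a list of conformations; each conformation is a list of
    (x, y, z) coordinates in angstroms.
    """
    if not ensemble:
        raise ValueError("ensemble is empty")
    n_open = sum(
        1 for coords in ensemble if math.dist(coords[i], coords[j]) > cutoff
    )
    return n_open / len(ensemble)


# Toy ensemble with two atoms per conformation (illustrative values only).
toy_ensemble = [
    [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0)],   # closed, 6 A apart
    [(0.0, 0.0, 0.0), (6.5, 0.0, 0.0)],   # closed
    [(0.0, 0.0, 0.0), (0.0, 12.0, 0.0)],  # open, 12 A apart
    [(0.0, 0.0, 0.0), (7.0, 0.0, 0.0)],   # closed
]
fraction = open_fraction(toy_ensemble, 0, 1, cutoff=10.0)  # 0.25
```

A single predicted structure would report only one of these states, whereas the ensemble gives an estimate of how often the open state is populated.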

Impact

BioEmu-1 is one of the first generative models to combine statistical accuracy with the throughput required for proteome-scale conformational analysis. Its 100,000-fold speedup relative to MD simulation could substantially lower the barrier to studying protein dynamics for laboratories without access to dedicated simulation resources. The model's reported accuracy on relative free energies, a long-standing benchmark challenge, suggests that deep learning can stand in for certain categories of physics-based simulation rather than merely complementing them. As a preprint from December 2024, BioEmu-1 has not yet undergone formal peer review, and its performance on proteins outside the training distribution (e.g., membrane proteins, intrinsically disordered regions, or very large complexes) remains to be thoroughly characterized. The release of model weights and code on GitHub facilitates community evaluation and downstream development.

Citation

Scalable emulation of protein equilibrium ensembles with generative deep learning

Preprint

Lewis, S., Hempel, T., Jiménez-Luna, J., Gastegger, M., Xie, Y., Foong, A. Y. K., et al. (2024). Scalable emulation of protein equilibrium ensembles with generative deep learning. bioRxiv.

DOI: 10.1101/2024.12.05.626885

Recent citations

Papers that recently cited this model.

Benchmarking AI Protein Structure Predictors Reveals a Persistent Bias in Multi-State Proteins
Muhui Ye, Yu-Hong Wang, M. Brogi, et al.
bioRxiv · Jul 2026 · 0 citations

TPS-Flow: Physics-Guided Flow-Based Generative Modeling of Protein Transition Paths
Kai Xu, Likun Zhao, Yanan Tian, et al.
Journal of Chemical Information and Modeling · Jul 2026 · 0 citations

From First Principles to Function: How AI Is Reshaping Enzyme Design
Sebastian Lindner, Florence J. Hardy, Donald Hilvert
Biochemistry · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Navigating protein landscapes with a machine-learned transferable coarse-grained model
N. Charron, Klara Bonneau, Aldo S. Pasos-Trejo, et al.
Nature Chemistry · Oct 2023 · 60 citations

A Fourier Space Perspective on Diffusion Models
Fabian Falck, T. Pandeva, Kiarash Zahirnia, et al.
arXiv.org · May 2025 · 42 citations

Beyond static structures: protein dynamic conformations modeling in the post-AlphaFold era
Xinyue Cui, Lingyu Ge, Xia Chen, et al.
Briefings Bioinform. · Jul 2025 · 37 citations

Scalable Equilibrium Sampling with Sequential Boltzmann Generators
Charlie B. Tan, A. Bose, Chen Lin, et al.
International Conference on Machine Learning · Feb 2025 · 34 citations

Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
Michael Plainer, Hao Wu, Leon Klein, et al.
arXiv.org · Jun 2025 · 27 citations

Citations

Total Citations: 340
Influential: 35
References: 119

GitHub

Stars: 855
Forks: 145
Open Issues: 1
Contributors: 13
Last Push: 3d ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 24
Last Modified: 9mo ago

Fields of citing research

Share of papers citing this model.

Computer Science: 82%
Biology: 79%
Medicine: 55%
Chemistry: 37%
Physics: 23%
Materials Science: 9%
Mathematics: 5%
Engineering: 3%

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 71 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 33
Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository · Research Paper · HuggingFace Model · Documentation · Dataset (three links)
