BioEmu-1 is a biomolecular emulator developed by Microsoft Research that addresses a longstanding challenge in structural biology. Proteins are not static entities but dynamic ensembles of conformations, and sampling that equilibrium distribution with conventional simulation methods is computationally prohibitive at scale. Rather than predicting a single lowest-energy structure, BioEmu-1 generates diverse conformational ensembles that approximate the equilibrium distribution of a protein in solution, including cryptic pockets, partially unfolded states, and large-scale domain rearrangements.
The model achieves this by training on three complementary data sources: static structures from the AlphaFold Database, over 200 milliseconds of cumulative molecular dynamics (MD) simulation data spanning thousands of proteins, and experimental protein folding stability measurements. This multi-source training strategy allows BioEmu-1 to learn both the geometric diversity of protein conformational space and the thermodynamic weighting of states within it. Released as a preprint in December 2024, BioEmu-1 represents a significant step toward replacing or augmenting traditional MD simulations for routine research tasks.
BioEmu-1 uses a diffusion-based generative architecture adapted from DiG (Distributional Graphormer) and is trained with denoising score matching to learn the equilibrium distribution of protein conformations. Sequence information conditions the generative process, and each conformational sample is produced through 30 to 50 denoising steps.
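The iterative denoising idea can be illustrated with a deliberately tiny toy, which is not BioEmu-1's actual sampler or network. In the sketch below, a "conformation" is a single number drawn from a Gaussian, the score function is written out analytically instead of being a sequence-conditioned neural network, and samples are generated by Euler integration of a probability-flow ODE over 40 steps. The names `score`, `noise_levels`, and `sample`, the noise schedule, and all constants are assumptions made for illustration.

```python
import math
import random

# Toy target: "conformations" are 1D numbers drawn from N(MU, SIGMA0^2).
MU, SIGMA0 = 2.0, 0.5

def score(x, sigma):
    """Analytic score of the noised target N(MU, SIGMA0^2 + sigma^2).

    In a real model this would be a learned network conditioned on sequence.
    """
    return -(x - MU) / (SIGMA0 ** 2 + sigma ** 2)

def noise_levels(n_steps, sigma_max=50.0, sigma_min=0.01):
    """Geometrically spaced noise levels, giving n_steps denoising updates."""
    ratio = (sigma_min / sigma_max) ** (1.0 / n_steps)
    return [sigma_max * ratio ** i for i in range(n_steps + 1)]

def sample(n_steps=40, rng=random):
    """Euler integration of the probability-flow ODE dx/dsigma = -sigma * score."""
    sigmas = noise_levels(n_steps)
    x = rng.gauss(0.0, sigmas[0])  # start from (almost) pure noise
    for s_cur, s_next in zip(sigmas, sigmas[1:]):
        x += (s_next - s_cur) * (-s_cur * score(x, s_cur))
    return x

rng = random.Random(0)
ensemble = [sample(40, rng) for _ in range(2000)]
```

With a few dozen steps the generated ensemble has roughly the mean and spread of the target distribution, which is the property a diffusion sampler is meant to deliver. A first-order Euler scheme slightly underestimates the spread, and higher-order integrators are a common way to reduce that error.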
Training followed a multi-stage protocol. In the pretraining phase, the model was trained with denoising score matching on flexible protein structures from the AlphaFold Database. Fine-tuning then proceeded in two parallel tracks: additional denoising score matching on MD trajectories, and Property Prediction Fine-Tuning (PPFT) to align predicted ensemble thermodynamics with experimental folding free energies. MD trajectories were reweighted using Markov State Models to ensure the sampled conformations reflect true equilibrium populations rather than kinetic artifacts. The resulting model runs efficiently on single-GPU hardware, making ensemble generation accessible without high-performance computing infrastructure.
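To make the reweighting step concrete, here is a minimal sketch of how a Markov State Model can turn short, non-equilibrium trajectories into equilibrium frame weights. It assumes the frames have already been assigned to discrete states and uses a simple row-normalized transition matrix with power iteration. Real pipelines also involve featurization, clustering, lag-time selection, and usually a reversible estimator, none of which are shown, and the function names and toy trajectories are illustrative rather than taken from the BioEmu-1 code.

```python
def count_matrix(trajs, n_states, lag=1):
    """Count transitions i -> j observed at the given lag across all trajectories."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajs:
        for i, j in zip(traj, traj[lag:]):
            counts[i][j] += 1
    return counts

def transition_matrix(counts):
    """Row-normalize counts (simple non-reversible maximum-likelihood estimate)."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("state with no outgoing transitions")
        matrix.append([c / total for c in row])
    return matrix

def stationary_distribution(T, n_iter=10000, tol=1e-12):
    """Power iteration for the stationary vector pi satisfying pi = pi T."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

def frame_weights(trajs, pi):
    """Weight each frame by pi[state] / (number of frames observed in that state)."""
    n_frames = [0] * len(pi)
    for traj in trajs:
        for s in traj:
            n_frames[s] += 1
    return [[pi[s] / n_frames[s] for s in traj] for traj in trajs]

# Two made-up discretized trajectories over states 0 and 1.
trajs = [[0, 0, 0, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 1, 0]]
pi = stationary_distribution(transition_matrix(count_matrix(trajs, 2)))
weights = frame_weights(trajs, pi)
```

The weights sum to one, and the total weight assigned to each state equals its stationary probability, so averages computed with them estimate equilibrium expectations even though the raw frame counts are biased by where each trajectory started.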
BioEmu-1 is particularly valuable in drug discovery contexts where transient or cryptic binding pockets must be identified. Such pockets appear only in specific conformational states and can be missed by single-structure prediction tools like AlphaFold 2. Medicinal chemists can screen protein-ligand interactions across conformational ensembles to estimate how affinity varies between states, and clustering the sampled conformations can expose candidate allosteric sites. In protein engineering, BioEmu-1 supports rational design by predicting how mutations shift the stability landscape and the sampling of functional states, so that variants can be prioritized before expensive wet-lab characterization. Structural biologists and computational researchers can use the model to generate hypotheses about folding mechanisms, study domain motion in multi-domain proteins, and benchmark or supplement traditional MD force fields.
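As a rough illustration of how a sampled ensemble translates into a stability estimate, the sketch below applies the standard two-state relation ΔG_fold = -RT ln(p_folded / p_unfolded) to per-sample labels and compares a wild type with a mutant. The folded/unfolded criterion (fraction of native contacts with a 0.5 cutoff), the function names, and the example ensembles are all assumptions for illustration and are not taken from the BioEmu-1 code or paper.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def folding_free_energy(q_values, q_folded=0.5, temperature=300.0):
    """Estimate dG_fold = -RT ln(p_folded / p_unfolded) from ensemble samples.

    q_values: one fraction-of-native-contacts value per sampled structure.
    q_folded: hypothetical cutoff; samples with q >= q_folded count as folded.
    """
    n_folded = sum(1 for q in q_values if q >= q_folded)
    n_unfolded = len(q_values) - n_folded
    if n_folded == 0 or n_unfolded == 0:
        raise ValueError("both states must be sampled to estimate a free energy")
    return -R_KCAL * temperature * math.log(n_folded / n_unfolded)

def stability_change(q_wild_type, q_mutant, **kwargs):
    """ddG = dG_fold(mutant) - dG_fold(wild type); positive means destabilizing."""
    return (folding_free_energy(q_mutant, **kwargs)
            - folding_free_energy(q_wild_type, **kwargs))

# Made-up ensembles: 90% folded for the wild type, 50% folded for the mutant.
wild_type = [0.9] * 90 + [0.1] * 10
mutant = [0.9] * 50 + [0.1] * 50
ddg = stability_change(wild_type, mutant)  # in kcal/mol
```

Because the estimate depends on the logarithm of a population ratio, it becomes unreliable when one state is rarely sampled, which is why the function refuses to return a value if either state is empty.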
BioEmu-1 is one of the first generative models to combine statistical accuracy with the throughput required for proteome-scale conformational analysis. Its reported speedup of up to roughly 100,000-fold relative to MD simulation could substantially lower the barrier to studying protein dynamics for laboratories without access to dedicated simulation resources. The model's ability to predict relative free energies, a long-standing benchmark challenge, suggests that deep learning can now substitute for certain categories of physics-based simulation rather than only complementing them. As a preprint from December 2024, BioEmu-1 has not yet undergone formal peer review, and its performance on proteins outside the training distribution (e.g., membrane proteins, intrinsically disordered regions, or very large complexes) remains to be thoroughly characterized. The release of model weights and code on GitHub facilitates community evaluation and downstream development.
Lewis, S., Hempel, T., Jiménez-Luna, J., Gastegger, M., Xie, Y., Foong, A. Y. K., et al. (2024). Scalable emulation of protein equilibrium ensembles with generative deep learning. bioRxiv.
DOI: 10.1101/2024.12.05.626885