A universal all-atom machine-learning force field foundation model with ab initio-level accuracy on solvated biomolecular systems up to ~1,500 atoms.
Machine-learning interatomic potentials (MLIPs) aim to deliver the accuracy of quantum-mechanical methods such as density functional theory (DFT) at a fraction of the cost. For biology, though, a persistent "scale–accuracy gap" remains: the systems that matter (solvated proteins, ions, peptides) are large and heterogeneous, while the high-fidelity quantum data needed to train accurate force fields is easiest to generate for small, gas-phase molecules. Closing that gap requires both the right training data and an architecture that stays accurate as systems grow.
UBio-MolFM, developed by the UBio Team at IQuestLab (IQuest Research) and posted to arXiv in February 2026, is an all-atom machine-learning force field foundation model built specifically for biological systems. It combines a new bio-focused training dataset (UBio-Mol26), a linear-scaling equivariant transformer backbone (E2Former-V2), and a three-stage curriculum-learning protocol, with the explicit goal of delivering ab initio-level fidelity on solvated, biomolecule-scale systems.
The authors report ab initio-level accuracy on biomolecular systems reaching roughly 1,500 atoms across a range of benchmarks — including liquid-water structure, ionic solvation, and peptide folding dynamics — while reproducing realistic molecular-dynamics observables. Code and pretrained checkpoints are released, distinguishing UBio-MolFM from many contemporaneous preprints.
UBio-MolFM pairs the E2Former-V2 backbone, an equivariant transformer whose cost scales linearly with system size, with the UBio-Mol26 dataset, which is generated by a multi-fidelity strategy that combines enumeration with native-environment sampling and covers systems up to ~1,200 atoms. Training follows a three-stage curriculum (energy initialization, force-consistency refinement, and force-focused supervision) intended to stabilize learning and handle energy offsets. The released suite covers training, inference, and molecular-dynamics simulation; it targets Python 3.12 and PyTorch 2.7.0 and reads LMDB, SPICE, and OC20 data formats. The README notes scaling to roughly 100,000 atoms on a single GPU, whereas the reported ab initio-level accuracy benchmarks extend to roughly 1,500 atoms. An accompanying Hugging Face dataset page, UBio-Protein26 (~5 million protein structures), serves as the data card for the released checkpoint.
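To make the idea of a staged curriculum concrete, the sketch below shows one way such a schedule could be expressed as per-stage loss weights. It is an illustration only: the `Stage` class, the weight values, and the mean-centering used to make the energy term insensitive to a constant offset are all assumptions, not the authors' published recipe or the released code's API. Only the three stage names come from the description above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (hypothetical container, not from the release)."""
    name: str
    energy_weight: float
    force_weight: float


# Stage names follow the text; the numeric weights are made up for illustration.
CURRICULUM = (
    Stage("energy_initialization", energy_weight=1.0, force_weight=0.0),
    Stage("force_consistency_refinement", energy_weight=1.0, force_weight=1.0),
    Stage("force_focused_supervision", energy_weight=0.1, force_weight=10.0),
)


def center(values: Sequence[float]) -> List[float]:
    """Subtract the batch mean, removing any constant energy offset.

    Assumption: a constant shift between reference and predicted energies
    should not be penalized. This is one simple way to achieve that.
    """
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def mse(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def stage_loss(stage: Stage,
               e_pred: Sequence[float], e_ref: Sequence[float],
               f_pred: Sequence[float], f_ref: Sequence[float]) -> float:
    """Weighted sum of offset-free energy MSE and force-component MSE.

    Energies are per-structure scalars; forces are flattened components.
    """
    energy_term = mse(center(e_pred), center(e_ref))
    force_term = mse(f_pred, f_ref)
    return stage.energy_weight * energy_term + stage.force_weight * force_term


if __name__ == "__main__":
    # Reference energies are shifted by a constant 10.0 relative to predictions.
    e_pred, e_ref = [1.0, 2.0, 3.0], [11.0, 12.0, 13.0]
    f_pred, f_ref = [0.0, 1.0], [0.0, 3.0]
    for stage in CURRICULUM:
        print(stage.name, stage_loss(stage, e_pred, e_ref, f_pred, f_ref))
```

In this toy batch the energy term vanishes because the two energy sets differ only by a constant, so the loss in the later stages comes entirely from the force error. A real training loop would compute these terms on tensors and may treat offsets quite differently.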
UBio-MolFM targets computational chemists and structural biologists who need accurate, scalable molecular dynamics of biomolecular systems — for example, simulating peptide folding, ion solvation, or protein–solvent interactions where classical force fields lack accuracy and DFT is intractable. Because the model is released with training, inference, and MD tooling, it can be deployed directly to run simulations or fine-tuned on domain-specific quantum data, lowering the barrier to first-principles-quality dynamics for larger biological assemblies.
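For readers unfamiliar with how a learned force field plugs into a simulation, the following minimal sketch shows a velocity Verlet integrator driven by an arbitrary force callback. It does not use the UBio-MolFM tooling: `velocity_verlet`, `harmonic_force`, and `total_energy` are hypothetical names, units are arbitrary, and a harmonic spring stands in for the model. In practice, the callback would wrap whatever inference call the released package provides.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
ForceFn = Callable[[Sequence[float]], Vector]


def velocity_verlet(positions: Sequence[float],
                    velocities: Sequence[float],
                    masses: Sequence[float],
                    force_fn: ForceFn,
                    dt: float,
                    n_steps: int) -> Tuple[Vector, Vector]:
    """Integrate Newton's equations with the velocity Verlet scheme.

    Coordinates are flattened (one entry per degree of freedom), and
    `masses` holds the mass associated with each entry.
    """
    x = list(positions)
    v = list(velocities)
    f = force_fn(x)
    for _ in range(n_steps):
        # Half-step velocity update using the current forces.
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]
        # Full-step position update.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # Recompute forces at the new positions (the expensive model call).
        f = force_fn(x)
        # Second half-step velocity update using the new forces.
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]
    return x, v


def harmonic_force(x: Sequence[float]) -> Vector:
    """Stand-in force field: independent unit springs, F = -x."""
    return [-xi for xi in x]


def total_energy(x: Sequence[float], v: Sequence[float],
                 masses: Sequence[float]) -> float:
    """Kinetic plus potential energy for the harmonic stand-in."""
    kinetic = sum(0.5 * m * vi * vi for m, vi in zip(masses, v))
    potential = sum(0.5 * xi * xi for xi in x)
    return kinetic + potential


if __name__ == "__main__":
    masses = [1.0]
    x, v = velocity_verlet([1.0], [0.0], masses, harmonic_force,
                           dt=0.01, n_steps=1000)
    print("position:", x[0], "energy:", total_energy(x, v, masses))
```

The integrator calls the force function once per step, which is why the per-step cost of the model dominates simulation throughput. Total energy staying close to its initial value is a quick sanity check when swapping in any new force function.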
By coupling a bio-specific multi-fidelity dataset with a linear-scaling equivariant architecture and a force-focused curriculum, UBio-MolFM is a concrete attempt to push machine-learning force fields from small molecules toward solvated, biomolecule-scale simulation. The public release of code, weights, and a large protein dataset makes the work immediately testable by the community. As a February 2026 preprint, its accuracy claims await independent benchmarking and peer review, but the combination of openness and explicit focus on biological scale is a meaningful step for the MLIP field.