UBio-MolFM

Universal all-atom machine-learning force field for molecular dynamics, with ab initio-level accuracy on solvated biomolecules of ~1,500 atoms.

Released: February 2026

Machine-learning interatomic potentials (MLIPs) promise a long-sought goal of molecular simulation: the accuracy of quantum-mechanical methods such as density functional theory (DFT) at a fraction of the cost. For biology, though, a persistent "scale–accuracy gap" remains. The systems that matter (solvated proteins, ions, peptides) are large and heterogeneous, while the high-fidelity quantum data needed to train accurate force fields is easiest to generate for small, gas-phase molecules. Closing that gap requires both the right training data and an architecture that stays accurate as systems grow.

UBio-MolFM, developed by the UBio Team at IQuestLab (IQuest Research) and posted to arXiv in February 2026, is an all-atom machine-learning force field foundation model built specifically for biological systems. It combines a new bio-focused training dataset (UBio-Mol26), a linear-scaling equivariant transformer backbone (E2Former-V2), and a three-stage curriculum-learning protocol, with the explicit goal of delivering ab initio-level fidelity on solvated, biomolecule-scale systems.

The authors report ab initio-level accuracy on biomolecular systems reaching roughly 1,500 atoms across a range of benchmarks — including liquid-water structure, ionic solvation, and peptide folding dynamics — while reproducing realistic molecular-dynamics observables. Code and pretrained checkpoints are released, distinguishing UBio-MolFM from many contemporaneous preprints.
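Liquid-water structure, one of the observables named above, is commonly summarized by the radial distribution function g(r). The sketch below is a generic illustration of how g(r) can be computed from one snapshot in a cubic periodic box. It is not the authors' evaluation code, and the function name, parameters, and O(N^2) pair loop are assumptions chosen for clarity.

```python
import math


def radial_distribution(positions, box_length, r_max, n_bins):
    """Estimate g(r) for one snapshot in a cubic periodic box.

    positions: list of (x, y, z) tuples; r_max should not exceed box_length / 2
    so that the minimum-image convention is valid.
    Returns (bin_centers, g_values).
    """
    n = len(positions)
    dr = r_max / n_bins
    counts = [0] * n_bins

    for i in range(n):
        for j in range(i + 1, n):
            d2 = 0.0
            for a, b in zip(positions[i], positions[j]):
                delta = a - b
                # Minimum-image convention: wrap to the nearest periodic copy.
                delta -= box_length * round(delta / box_length)
                d2 += delta * delta
            r = math.sqrt(d2)
            if r < r_max:
                k = min(int(r / dr), n_bins - 1)
                counts[k] += 2  # the pair contributes to both atoms

    density = n / box_length ** 3
    g_values = []
    for k, count in enumerate(counts):
        # Volume of the spherical shell covered by bin k.
        shell = 4.0 / 3.0 * math.pi * (((k + 1) * dr) ** 3 - (k * dr) ** 3)
        # Normalize by the count expected for an ideal gas at the same density.
        g_values.append(count / (n * density * shell))

    centers = [(k + 0.5) * dr for k in range(n_bins)]
    return centers, g_values
```

In practice g(r) is averaged over many trajectory frames and compared against a reference curve, for example from ab initio MD or experiment.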

Key Features

Bio-focused training data (UBio-Mol26): Built with a multi-fidelity "two-pronged" strategy combining systematic enumeration with sampling of native protein environments, covering systems up to ~1,200 atoms.
Linear-scaling equivariant transformer (E2Former-V2): An equivariant architecture with sparsification and long-/short-range modeling that the authors report achieves roughly 4x higher inference throughput on large systems.
Three-stage curriculum learning: Training proceeds through energy initialization, force-consistency refinement, and force-focused supervision, the last intended to address energy-offset issues.
Ab initio-level accuracy at scale: Validated on liquid water, ionic solvation, and peptide folding for systems up to ~1,500 atoms, with realistic MD observables.
Released code and weights: Implementation (MIT-licensed), a pretrained checkpoint (IQuest-UBio-MolFM-V1), and a protein dataset (UBio-Protein26) are publicly available.
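To make the linear-scaling claim in the list above concrete, the sketch below shows a standard cell-list neighbor search: with a fixed interaction cutoff, each atom is only compared against atoms in adjacent cells, so the work grows roughly linearly with atom count at constant density. This is a generic technique and an assumption about what sparsification can look like. The source does not describe how E2Former-V2 implements it, and `neighbor_pairs` is a hypothetical name.

```python
import math
from collections import defaultdict
from itertools import product


def neighbor_pairs(positions, cutoff):
    """Return sorted (i, j) pairs with i < j whose distance is below `cutoff`.

    Non-periodic illustration: atoms are binned into cubic cells of edge
    `cutoff`, so any pair within the cutoff lies in the same or adjacent cells.
    """
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        cells[key].append(idx)

    offsets = list(product((-1, 0, 1), repeat=3))  # the 27 surrounding cells
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in offsets:
            # dict.get does not create missing keys, so iteration stays safe.
            other = cells.get((cx + dx, cy + dy, cz + dz))
            if not other:
                continue
            for i in members:
                for j in other:
                    if i < j and math.dist(positions[i], positions[j]) < cutoff:
                        pairs.add((i, j))
    return sorted(pairs)
```

The resulting pair list would define the sparse graph on which a model passes messages. A production implementation would add periodic boundaries and run on the GPU.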

Technical Details

UBio-MolFM pairs the E2Former-V2 backbone, an equivariant transformer that scales linearly with system size, with the UBio-Mol26 dataset, which was generated by a multi-fidelity strategy (systematic enumeration plus native-environment sampling) and spans systems up to ~1,200 atoms. Training follows a three-stage curriculum (energy initialization, force-consistency refinement, and force-focused supervision) intended to stabilize learning and handle energy offsets. Reported benchmarks reach ab initio-level accuracy on biomolecular systems near 1,500 atoms.

The released suite supports training, inference, and molecular-dynamics simulation. It targets Python 3.12 and PyTorch 2.7.0 and reads LMDB, SPICE, and OC20 data formats; the README notes scaling to roughly 100,000 atoms on a single GPU. An accompanying HuggingFace dataset, UBio-Protein26 (~5 million protein structures), serves as the data card for the released checkpoint.
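The three-stage curriculum described above can be pictured as a loss whose energy and force weights change per stage. The sketch below is a minimal illustration under stated assumptions: the stage names follow the text, but the weight values, the squared-error form, and the name `curriculum_loss` are hypothetical and not taken from the paper. It also shows one plausible reason force-focused supervision helps with energy offsets: forces are the negative gradient of the energy, so a constant shift in the reference energy leaves the force term unchanged.

```python
# Hypothetical (energy_weight, force_weight) per stage; the source names the
# stages but gives no numerical values.
STAGE_WEIGHTS = {
    "energy_initialization": (1.0, 0.0),
    "force_consistency_refinement": (1.0, 10.0),
    "force_focused_supervision": (0.0, 1.0),
}


def curriculum_loss(stage, e_pred, e_ref, f_pred, f_ref):
    """Weighted squared-error loss for one structure.

    e_pred, e_ref: scalar energies; f_pred, f_ref: flat lists of force components.
    """
    w_energy, w_force = STAGE_WEIGHTS[stage]
    energy_term = (e_pred - e_ref) ** 2
    force_term = sum((p - r) ** 2 for p, r in zip(f_pred, f_ref)) / len(f_ref)
    return w_energy * energy_term + w_force * force_term


# A reference energy shifted by a constant offset, with matching forces:
forces = [0.1, -0.2, 0.3]
offset_penalty = curriculum_loss("energy_initialization", 5.0, 0.0, forces, forces)
force_only = curriculum_loss("force_focused_supervision", 5.0, 0.0, forces, forces)
```

Here `offset_penalty` is nonzero while `force_only` is zero, which is the behavior one would want when mixing data whose energy references disagree.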

Applications

UBio-MolFM targets computational chemists and structural biologists who need accurate, scalable molecular dynamics of biomolecular systems — for example, simulating peptide folding, ion solvation, or protein–solvent interactions where classical force fields lack accuracy and DFT is intractable. Because the model is released with training, inference, and MD tooling, it can be deployed directly to run simulations or fine-tuned on domain-specific quantum data, lowering the barrier to first-principles-quality dynamics for larger biological assemblies.
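For readers new to how a force field drives a simulation, the sketch below shows a velocity Verlet integrator with a pluggable force function. It is a generic illustration, not the released MD tooling. `force_fn` is a placeholder for a call into a trained model, whose actual inference API is not described here, so a harmonic spring stands in for it.

```python
def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Propagate coordinates with the velocity Verlet scheme.

    positions, velocities, masses: flat lists with one entry per coordinate.
    force_fn(positions) -> list of forces, one per coordinate.
    Returns the final (positions, velocities).
    """
    x = list(positions)
    v = list(velocities)
    f = force_fn(x)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
        # Full-step position update.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # Recompute forces at the new positions, then finish the velocity update.
        f = force_fn(x)
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
    return x, v


def harmonic_force(x, k=1.0):
    """Stand-in force field: independent springs anchored at the origin."""
    return [-k * xi for xi in x]


final_x, final_v = velocity_verlet([1.0], [0.0], [1.0], harmonic_force, dt=0.01, n_steps=1000)
```

Swapping `harmonic_force` for a function that evaluates a learned potential is the only change needed in this sketch. Real simulations would also add a thermostat, periodic boundaries, and unit handling.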

Impact

By coupling a bio-specific multi-fidelity dataset with a linear-scaling equivariant architecture and a force-focused curriculum, UBio-MolFM is a concrete attempt to push machine-learning force fields from small molecules toward solvated, biomolecule-scale simulation. The public release of code, weights, and a large protein dataset makes the work immediately testable by the community. As a February 2026 preprint, its accuracy claims await independent benchmarking and peer review, but the combination of openness and explicit focus on biological scale is a meaningful step for the MLIP field.

Citation

Huang, L., et al. (2026). UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems. Preprint, arXiv.org.

DOI: 10.48550/arXiv.2602.17709

Citations

Total Citations: 0
Influential: 0
References: 47

GitHub

Stars: 33
Forks: 5
Open Issues: 1
Contributors: 2
Last Push: 3 months ago
Language: Python

HuggingFace

Downloads: 7
Likes: 3
Last Modified: 3 months ago

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Overall score: 81 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 66
Model Openness Framework: Class III (Open Model)
