bio.rodeo
Small molecule foundation models
Small molecule, Metabolomics

UltraNMR

Hong Kong University of Science and Technology / Hunan University / Institute of Materia Medica, CAMS & PUMC / Xiamen University / Shanghai AI Laboratory

A 120M-parameter foundation model trained on 158 million simulated 1H/13C NMR spectra that adapts simulation-learned representations to real experimental spectra for molecular structure analysis.

Released: June 2026
Parameters: 120 Million

Nuclear magnetic resonance (NMR) spectroscopy is the workhorse technique for determining the structure of small molecules, yet interpreting 1H and 13C spectra remains a slow, expertise-bound task. Machine-learning approaches have been held back by a scarcity of large, well-annotated experimental spectral datasets: high-quality real spectra are expensive to acquire and unevenly distributed across chemical space. UltraNMR addresses this bottleneck by learning from simulated spectra at scale and then transferring those representations to real measurements, a simulation-to-real adaptation strategy that sidesteps the experimental data shortage.

Introduced in a June 2026 preprint, UltraNMR is a 120-million-parameter foundation model trained on 158 million paired simulated 1H and 13C NMR spectra derived from PubChem molecules. Rather than treating spectra as isolated signals, it uses multiple domain-specific pre-training objectives to capture both intra-spectral structure (relationships among peaks within one spectrum) and inter-spectral dependencies (the complementary information shared between proton and carbon spectra of the same molecule). The result is a general-purpose spectral encoder that can be adapted to a range of downstream structure-analysis tasks.
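One generic way to make paired proton and carbon spectra share information is a contrastive objective that pulls the two embeddings of the same molecule together and pushes mismatched pairs apart. The sketch below is a plain symmetric InfoNCE loss written in standard-library Python. It is an illustration of the general idea only, not the authors' pre-training objective: the function names, the temperature value, and the toy embeddings are all assumptions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(h_embs, c_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of h_embs (1H) and row i of c_embs (13C) are assumed to come
    from the same molecule; all other rows act as negatives.
    """
    h = [unit(v) for v in h_embs]
    c = [unit(v) for v in c_embs]
    n = len(h)
    # logits[i][j]: scaled cosine similarity between 1H item i and 13C item j.
    logits = [[dot(h[i], c[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp minus the logit of the true pair.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    h_to_c = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    c_to_h = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (h_to_c + c_to_h)

# Hypothetical 2-d embeddings for two molecules.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the correctly paired batch yields a much lower loss than the batch whose carbon embeddings are swapped, which is the behavior such an objective rewards during training.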

The model was developed by a multi-institution collaboration led by the Hong Kong University of Science and Technology (Guangzhou), together with Hunan University, the Institute of Materia Medica (CAMS & PUMC), Xiamen University, and Shanghai AI Laboratory. It sits alongside other recent spectra-to-structure efforts but is distinguished by its scale and its explicit focus on closing the gap between simulated training data and real experimental spectra.

#Key Features

  • Simulation-to-real adaptation: Pre-training on 158 million simulated spectra followed by adaptation to experimental spectra lets the model overcome the scarcity of annotated real NMR data while retaining strong real-world accuracy.
  • Joint 1H/13C modeling: Domain-specific objectives capture both intra-spectral peak relationships and inter-spectral dependencies between paired proton and carbon spectra, exploiting their complementary structural information.
  • Versatile downstream adaptation: A single pre-trained backbone supports spectral library search, de novo structure elucidation, functional-group identification, and natural-product classification, with the authors reporting state-of-the-art results on each.
  • Large spectral vector library: The authors built a 94-million-molecule NMR spectral vector library, paired with a Faiss index, enabling fast similarity-based retrieval and nearest-neighbor structure search.
  • Validated on real discovery: UltraNMR was used to elucidate the structures of two previously unknown natural products isolated from Chinese herbal medicines, demonstrating utility beyond benchmarks.
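The library search described in the feature list comes down to nearest-neighbor retrieval over spectral embedding vectors. The sketch below shows that idea with a brute-force cosine search in standard-library Python, which returns the same ranking an exact inner-product index would give on unit-normalized vectors. It does not use Faiss or the released library, and the molecule identifiers, vectors, and function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k(query, library, k=2):
    """Return the k library entries most similar to the query, best first.

    library: dict mapping a molecule identifier to its embedding vector.
    """
    q = normalize(query)
    scored = []
    for mol_id, vec in library.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, mol_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(mol_id, score) for score, mol_id in scored[:k]]

# Hypothetical toy library: identifiers and 3-d vectors are made up.
toy_library = {
    "mol_A": [1.0, 0.0, 0.0],
    "mol_B": [0.9, 0.1, 0.0],
    "mol_C": [0.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.05, 0.0], toy_library, k=2)
```

Brute force is linear in library size, so at tens of millions of entries an indexed search such as the Faiss index mentioned above is the practical choice. The scoring logic stays the same.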

#Technical Details

UltraNMR is a transformer-based foundation model with roughly 120 million parameters. Its training corpus comprises 158 million paired simulated 1H and 13C spectra generated from a large set of PubChem molecules; the released SimNMR-PubChem dataset spans roughly 106 million molecules and about 537 GB of metadata, conformers, and spectral vectors. Pre-training combines several domain-specific objectives, including isomer contrastive learning and sequence-to-sequence spectrum-to-structure mapping, to produce representations that transfer to experimental data.

For evaluation, the team released NMRGym, a benchmark of about 270,000 molecules with five downstream tasks: spectral prediction, inverse structure prediction, molecular fingerprint prediction, functional-group classification (22 groups), and toxicity/ADMET prediction. The authors report state-of-the-art performance across these tasks, with downstream heads trained on features from the frozen backbone. The inference package supports both greedy decoding and beam search and builds on the DreaMS spectral framework.
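The difference between the two decoding modes can be shown on a toy example. The sketch below is a generic illustration in standard-library Python, not the UltraNMR inference code: the probability table stands in for a real spectrum-conditioned decoder, and the token names, function names, and beam width are assumptions. It omits end-of-sequence handling and length normalization.

```python
import math

# Hypothetical toy distribution: maps a prefix (tuple of tokens) to
# next-token probabilities. A real decoder would compute these from spectra.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def next_log_probs(prefix):
    """Log-probabilities of each candidate next token given a prefix."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix].items()}

def greedy_decode(steps):
    """Pick the single most likely token at every step."""
    prefix, score = (), 0.0
    for _ in range(steps):
        tok, lp = max(next_log_probs(prefix).items(), key=lambda kv: kv[1])
        prefix, score = prefix + (tok,), score + lp
    return prefix, score

def beam_search(steps, beam_width):
    """Keep the beam_width highest-scoring prefixes at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams  # sorted best first

greedy_seq, greedy_score = greedy_decode(steps=2)
best_seq, best_score = beam_search(steps=2, beam_width=2)[0]
```

In this table greedy decoding commits to "A" because it is the likelier first token and ends with a sequence probability of 0.33, while a beam of width 2 keeps "B" alive and recovers the sequence ("B", "z") with probability 0.36. Beam search costs more computation per step, which is the usual trade-off when generating candidate structures.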

#Applications

UltraNMR targets chemists and analytical scientists who need to assign or elucidate small-molecule structures from NMR data. Practical uses include rapid dereplication and library matching in natural-product discovery, de novo structure proposal for unknown compounds, functional-group screening, and feeding predicted ADMET endpoints into early drug-discovery triage. Its demonstrated elucidation of two new natural products from traditional Chinese medicine illustrates direct value for phytochemistry and metabolomics workflows, where novel scaffolds are common and reference spectra are often unavailable.

#Impact

By showing that a foundation model trained largely on simulated spectra can reach state-of-the-art accuracy on real experimental data, UltraNMR offers a template for overcoming data scarcity in spectroscopy-driven structure analysis. The accompanying SimNMR-PubChem dataset, the 94-million-molecule vector library, and the NMRGym benchmark provide shared resources that lower the barrier for follow-on work in NMR machine learning. As a preprint released in June 2026, its long-term adoption is still unfolding, and licensing terms for the code and released weights remain unspecified at the time of writing, which may temper near-term reuse despite the openly available datasets.

Citation

A large-scale foundation model enables simulation-to-real adaptation for nuclear magnetic resonance-based molecular structure analysis

Preprint

Yang, C., et al. (2026) A large-scale foundation model enables simulation-to-real adaptation for nuclear magnetic resonance-based molecular structure analysis. arXiv.

DOI: 10.48550/arXiv.2606.20756


GitHub

Stars: 0
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 6d ago
Language: Python

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 18d ago


Openness

bio.rodeo openness: Reproducible (reproducible, less usable)
Overall: 43 (Partial)
Usability (can I run it?): 26
Reproducibility (can I retrain it?): 53
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

foundation_model, self_supervised, transfer_learning, transformer

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset, Dataset