A family of SE(3)-equivariant molecular foundation models pretrained on ~70M DFT conformers, reaching state of the art on TDC ADMET and MoleHB property-prediction benchmarks.
Suiren-1.0 is a family of molecular foundation models for organic chemistry, developed by Golab (the SAIS Physics Lab at the Shanghai Academy of Artificial Intelligence for Science) and described in a technical report first posted to arXiv in March 2026. The models are grounded in quantum chemistry: rather than learning from 2D molecular graphs or string representations alone, Suiren is pretrained on large-scale density functional theory (DFT) data so that its representations encode physically meaningful information about energies, forces, and 3D molecular geometry.
The central problem Suiren addresses is the gap between microscopic 3D conformational geometry and the macroscopic, ensemble-averaged properties that matter for downstream tasks such as ADMET prediction in drug discovery. Many practical chemistry workflows operate on 2D graphs or SMILES strings, yet the properties of interest are governed by 3D structure and quantum-mechanical behavior. Suiren bridges this gap through three coordinated variants and a distillation framework that transfers 3D structural knowledge into models that accept 2D inputs.
The family comprises Suiren-Base (a 1.8-billion-parameter equivariant backbone pretrained on quantum-chemical conformers), Suiren-Dimer (continued pretraining on intermolecular-interaction data), and Suiren-ConfAvg (a lightweight distilled model that produces conformation-averaged embeddings from 2D graphs or SMILES). Together they target accurate, transferable molecular property prediction for both single molecules and interacting pairs.
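To make "conformation-averaged" concrete, the following is a minimal Python sketch under stated assumptions: per-conformer embeddings from a 3D teacher are combined using Boltzmann weights computed from relative conformer energies, and a 2D student output is scored against that average with a mean-squared error. The weighting scheme, the loss, the toy numbers, and every function name here (`boltzmann_weights`, `conformer_average`, `distillation_loss`) are illustrative assumptions; the source does not specify how Suiren-ConfAvg averages or distills, and this is not the Suiren API.

```python
import math

KT_KCAL_PER_MOL = 0.592  # approximate k_B * T at 298 K, in kcal/mol


def boltzmann_weights(energies, kt=KT_KCAL_PER_MOL):
    """Normalized Boltzmann weights from conformer energies (kcal/mol)."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def conformer_average(embeddings, weights):
    """Weighted average of per-conformer embedding vectors."""
    dim = len(embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(dim)
    ]


def distillation_loss(student_embedding, target_embedding):
    """Mean-squared error between a student output and the averaged target."""
    n = len(target_embedding)
    return sum(
        (s - t) ** 2 for s, t in zip(student_embedding, target_embedding)
    ) / n


# Toy data (hypothetical): three conformers with relative energies and
# two-dimensional teacher embeddings.
energies = [0.0, 0.5, 2.0]
teacher_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = boltzmann_weights(energies)
target = conformer_average(teacher_embeddings, weights)
loss = distillation_loss([0.5, 0.5], target)
```

Low-energy conformers dominate the average in this sketch, which is one common way to approximate an ensemble-averaged quantity; a uniform average over sampled conformers would be an equally valid assumption to illustrate the idea.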
Suiren-Base is a 1.8-billion-parameter SO(3)-equivariant graph neural network built on an EquiformerV2 backbone augmented with the Equivariant Spherical Transformer (EST). It uses a mixture-of-experts design, reported as 20 layers that each combine S2Activation and EST experts, and is pretrained with Equivariant Masked Position Prediction (EMPP), a self-supervised objective in which atoms are removed and their coordinates are reconstructed conditioned on atom type and target energy. Pretraining draws on the Qo2mol dataset of approximately 70 million DFT conformers; Suiren-Dimer adds continued pretraining on roughly 13.5 million intermolecular-interaction samples. On the MoleHB benchmark, Suiren reports state-of-the-art mean absolute error on 41 of 43 properties, with gains exceeding 20% on more than 20 tasks. On the Therapeutics Data Commons (TDC) ADMET group, it reports top-ranked results on 8 of 18 metrics and second place on 4 more, achieved with a single fixed training configuration rather than per-task hyperparameter tuning.
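The masked-position idea behind EMPP can be sketched in a few lines of Python. This is only an illustration of the data flow (remove an atom, predict its coordinates from the remaining atoms plus the atom type and an energy label, measure the error). The predictor below is a deliberately trivial stand-in that returns the centroid of the remaining atoms, not an equivariant network, and the function names, toy geometry, energy value, and distance-based error are all assumptions rather than details taken from the report.

```python
import math
import random


def mask_atom(atom_types, coords, index):
    """Remove one atom; return its type, its true position, and the context."""
    context_types = atom_types[:index] + atom_types[index + 1:]
    context_coords = coords[:index] + coords[index + 1:]
    return atom_types[index], coords[index], context_types, context_coords


def placeholder_predictor(masked_type, target_energy, context_types, context_coords):
    """Stand-in for the network: predicts the centroid of the remaining atoms.

    A real model would condition on masked_type and target_energy; this
    placeholder ignores them and exists only so the sketch runs end to end.
    """
    n = len(context_coords)
    return tuple(sum(c[k] for c in context_coords) / n for k in range(3))


def position_error(predicted, true):
    """Euclidean distance between predicted and true coordinates."""
    return math.dist(predicted, true)


# Toy water-like geometry in angstroms with a hypothetical energy label.
atom_types = ["O", "H", "H"]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
target_energy = -76.4  # illustrative value only

rng = random.Random(0)  # fixed seed so the masked atom is reproducible
index = rng.randrange(len(atom_types))
masked_type, true_pos, ctx_types, ctx_coords = mask_atom(atom_types, coords, index)
predicted = placeholder_predictor(masked_type, target_energy, ctx_types, ctx_coords)
error = position_error(predicted, true_pos)
```

In an actual pretraining loop the error would be turned into a loss and backpropagated through the model; here it is simply computed once to show what the objective compares.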
Suiren is aimed at computational chemists and drug-discovery teams who need accurate molecular property predictions across the ADMET spectrum (absorption, distribution, metabolism, excretion, and toxicity), as well as researchers studying quantum-chemical properties and intermolecular interactions. Because Suiren-ConfAvg accepts 2D graphs or SMILES, it slots into standard cheminformatics pipelines while retaining 3D-aware structural knowledge, making it practical for virtual screening and lead optimization. Suiren-Dimer extends the family to interaction-dependent tasks such as binding and association behavior between molecular pairs.
Suiren contributes to a growing class of physics-grounded molecular foundation models that learn from DFT-level quantum data rather than relying solely on 2D structure or empirical labels. Its strong results across TDC ADMET and MoleHB, obtained without per-task hyperparameter tuning, suggest that conformer-scale quantum pretraining yields broadly transferable representations for downstream chemistry. The open release of all three checkpoints under a Modified MIT license lowers the barrier to adoption in drug-discovery research. As a technical report, the work has not undergone peer review, and the Qo2mol pretraining corpus has not yet been fully open-sourced, which limits exact reproduction of the pretraining stage.