Microsoft Research AI for Science
Unified science foundation model from Microsoft Research treating molecules, proteins, RNA, DNA, and materials as a shared sequence language for cross-domain generation.
NatureLM is a sequence-based science foundation model developed by Microsoft Research AI for Science and released in February 2025. The central thesis of the work is that entities across disparate scientific domains — small molecules, proteins, RNA, DNA, and materials — can all be represented as sequences, collectively forming what the authors term "the language of nature." By training a single generative model on this unified representation, NatureLM enables tasks that previously required separate specialist systems for each domain.
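To make the shared-sequence idea concrete, the minimal sketch below shows how entities from different domains can be flattened into tagged strings and combined with a natural-language instruction for a single decoder-only model. The domain tags (<mol>, <protein>, and so on) and the build_prompt helper are illustrative assumptions, not NatureLM's actual special tokens or prompt format.

```python
# Illustrative only: domain tags and prompt formatting are assumptions,
# not NatureLM's actual vocabulary.
examples = {
    "small_molecule": "<mol>CC(=O)Oc1ccccc1C(=O)O</mol>",  # aspirin, as SMILES
    "protein": "<protein>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ</protein>",
    "dna": "<dna>ATGGCGTCAGTTACG</dna>",
    "rna": "<rna>AUGGCGUCAGUUACG</rna>",
}

def build_prompt(instruction: str, entity: str) -> str:
    """Combine a natural-language instruction with a tagged sequence so a
    single autoregressive model can condition on both in one context."""
    return f"{instruction}\n{entity}\n"

prompt = build_prompt(
    "Generate a small molecule that binds the following protein:",
    examples["protein"],
)
print(prompt)
```

Once every domain is serialized this way, cross-domain generation reduces to ordinary next-token prediction over a mixed vocabulary.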
The model is trained on 143 billion tokens of curated scientific data spanning biology, chemistry, and materials science, drawn from sequence databases, structural repositories, and scientific literature. This breadth of training allows NatureLM to learn cross-domain relationships that are invisible to domain-specific models — for instance, associations between protein binding sites and the chemical properties of compatible ligands, or between a target sequence and the guide RNAs designed against it.
NatureLM is available in three sizes (1B, 8B, and 46.7B parameters, the largest being a mixture-of-experts model), with performance improving consistently as model size grows. The largest model matches or surpasses state-of-the-art specialist models on multiple benchmarks, despite being a single generalist system.
NatureLM is a GPT-style autoregressive language model built by continued pre-training of an existing large language model on 143 billion tokens of domain-specific scientific data. The training corpus covers small-molecule SMILES strings, protein sequences, RNA and DNA sequences, material structure representations, and accompanying scientific text. Building on a pretrained language model rather than training from scratch lets the model retain general language understanding while acquiring deep scientific domain knowledge.
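The continued pre-training recipe can be sketched with standard Hugging Face tooling. Everything below is an assumption for illustration: the base checkpoint, the corpus file name, and all hyperparameters are placeholders, not NatureLM's published configuration.

```python
# Minimal sketch of continued pre-training on scientific sequence data.
# Base checkpoint, data path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# One record per line: SMILES, protein/RNA/DNA sequences, or plain text.
corpus = load_dataset("text", data_files={"train": "science_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="naturelm-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the causal (next-token) objective used by GPT-style models.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```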
The largest variant uses a 46.7B-parameter mixture-of-experts (MoE) architecture in the 8x7B configuration of the Mixtral family, where a router activates only two of eight expert feed-forward networks per token (roughly 13B active parameters per forward pass). The 1B and 8B dense variants share the same training protocol and data mixture. Benchmark evaluations span intra-domain tasks (molecule generation, protein design, RNA structure-conditioned generation, material property prediction) and cross-domain tasks (protein-to-molecule and protein-to-RNA generation). Across these evaluations, NatureLM-46.7B consistently matches or exceeds specialist models trained exclusively for each subdomain.
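The sparse-routing mechanism behind this efficiency can be shown in a toy PyTorch block. This is a generic Mixtral-style top-2 MoE layer, not NatureLM's implementation; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Mixtral-style sparse MoE block: a router picks the top-2 of 8 expert
    FFNs per token, so only a fraction of total parameters is active.
    Dimensions here are toy values for illustration."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

x = torch.randn(4, 512)
print(MoEFeedForward()(x).shape)  # torch.Size([4, 512])
```

Because only two experts run per token, inference cost tracks the active-parameter count rather than the 46.7B total.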
NatureLM is designed for researchers working at the boundaries of computational biology, medicinal chemistry, and materials science. Drug discovery teams can use it to generate hit compounds for a target protein, optimize leads for ADMET properties, and explore cross-domain hypotheses — all within a single model. Protein engineers can generate de novo sequences or design molecules conditioned on a protein of interest. RNA researchers can generate sequences for specific RNA-binding proteins or design guide RNAs for CRISPR applications. Materials scientists can specify target properties in natural language and receive candidate structures from a model that also understands biological design principles. The instruction-following interface lowers the barrier for wet-lab biologists who want to explore computational hypotheses without needing to configure and coordinate multiple specialist tools.
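A typical instruction-following interaction might look like the sketch below. The repository ID, prompt wording, and sequence tags are all placeholders; consult the official NatureLM release for the actual checkpoint name and instruction format.

```python
# Hypothetical usage sketch: the model ID and prompt format are placeholders,
# not the official NatureLM release artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "microsoft/NatureLM-8B"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = ("Design a small molecule that binds the protein below "
          "and has good oral bioavailability.\n"
          "<protein>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ</protein>\n")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```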
NatureLM represents a significant step toward a general-purpose scientific foundation model, an analogue in the biological and chemical sciences of what large language models have become for text. The work challenges the prevailing assumption that specialist models are necessary for competitive performance in each scientific subdomain, demonstrating that scale and cross-domain training can narrow or close that gap. As a preprint released in February 2025, NatureLM has yet to accumulate the citation record of more established models, but the availability of model weights on HuggingFace and an active project website signal ongoing development. A key limitation is that the model operates on sequence representations and does not natively reason about three-dimensional structure; for tasks requiring explicit structural modeling, it would need to be combined with structure prediction tools such as AlphaFold or RoseTTAFold.
Xia, Y., et al. (2025). NatureLM: Deciphering the Language of Nature for Scientific Discovery. arXiv preprint arXiv:2502.07527.
DOI: 10.48550/arXiv.2502.07527