Microsoft Research AI for Science
Unified science foundation model from Microsoft Research treating molecules, proteins, RNA, DNA, and materials as a shared sequence language for cross-domain generation.
NatureLM is a sequence-based science foundation model developed by Microsoft Research AI for Science and released in February 2025. The central thesis of the work is that entities across disparate scientific domains — small molecules, proteins, RNA, DNA, and materials — can all be represented as sequences, collectively forming what the authors term "the language of nature." By training a single generative model on this unified representation, NatureLM enables tasks that previously required separate specialist systems for each domain.
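To make the shared-sequence idea concrete, the minimal sketch below shows how entities from different domains can be flattened into tagged strings and combined with a natural-language instruction for a single decoder-only model. The domain tags (<mol>, <protein>, and so on) and the build_prompt helper are illustrative assumptions, not NatureLM's actual special tokens or prompt format.

```python
# Illustrative only: domain tags and prompt formatting are assumptions,
# not NatureLM's actual vocabulary.
examples = {
    "small_molecule": "<mol>CC(=O)Oc1ccccc1C(=O)O</mol>",  # aspirin, as SMILES
    "protein": "<protein>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ</protein>",
    "dna": "<dna>ATGGCGTCAGTTACG</dna>",
    "rna": "<rna>AUGGCGUCAGUUACG</rna>",
}

def build_prompt(instruction: str, entity: str) -> str:
    """Combine a natural-language instruction with a tagged sequence so a
    single autoregressive model can condition on both in one context."""
    return f"{instruction}\n{entity}\n"

prompt = build_prompt(
    "Generate a small molecule that binds the following protein:",
    examples["protein"],
)
print(prompt)
```

Once every domain is serialized this way, cross-domain generation reduces to ordinary next-token prediction over a mixed vocabulary.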
The model is trained on 143 billion tokens of curated scientific data spanning biology, chemistry, and materials science, drawn from sequence databases, structural repositories, and scientific literature. This breadth of training allows NatureLM to learn cross-domain relationships that are invisible to domain-specific models — for instance, associations between protein binding sites and the chemical properties of compatible ligands, or between a target sequence and the guide RNAs designed against it.
NatureLM is available in three sizes (1B, 8B, and 46.7B parameters, the largest being a mixture-of-experts model), with performance improving consistently as model size grows. The largest model matches or surpasses state-of-the-art specialist models on multiple benchmarks, despite being a single generalist system.
NatureLM is a GPT-style autoregressive language model built by continued pre-training of an existing large language model on 143 billion tokens of domain-specific scientific data. The training corpus covers small-molecule SMILES strings, protein sequences, RNA and DNA sequences, material structure representations, and accompanying scientific text. Building on a pretrained language model rather than training from scratch lets the model retain general language understanding while acquiring deep scientific domain knowledge.
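The continued pre-training recipe can be sketched with standard Hugging Face tooling. Everything below is an assumption for illustration: the base checkpoint, the corpus file name, and all hyperparameters are placeholders, not NatureLM's published configuration.

```python
# Minimal sketch of continued pre-training on scientific sequence data.
# Base checkpoint, data path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# One record per line: SMILES, protein/RNA/DNA sequences, or plain text.
corpus = load_dataset("text", data_files={"train": "science_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="naturelm-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the causal (next-token) objective used by GPT-style models.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```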
The largest variant uses a 46.7B-parameter mixture-of-experts (MoE) architecture in the 8x7B configuration of the Mixtral family, where a router activates only two of eight expert feed-forward networks per token (roughly 13B active parameters per forward pass). The 1B and 8B dense variants share the same training protocol and data mixture. Benchmark evaluations span intra-domain tasks (molecule generation, protein design, RNA structure-conditioned generation, material property prediction) and cross-domain tasks (protein-to-molecule and protein-to-RNA generation). Across these evaluations, NatureLM-46.7B consistently matches or exceeds specialist models trained exclusively for each subdomain.
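The sparse-routing mechanism behind this efficiency can be shown in a toy PyTorch block. This is a generic Mixtral-style top-2 MoE layer, not NatureLM's implementation; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Mixtral-style sparse MoE block: a router picks the top-2 of 8 expert
    FFNs per token, so only a fraction of total parameters is active.
    Dimensions here are toy values for illustration."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

x = torch.randn(4, 512)
print(MoEFeedForward()(x).shape)  # torch.Size([4, 512])
```

Because only two experts run per token, inference cost tracks the active-parameter count rather than the 46.7B total.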
NatureLM is designed for researchers working at the boundaries of computational biology, medicinal chemistry, and materials science. Drug discovery teams can use it to generate hit compounds for a target protein, optimize leads for ADMET properties, and explore cross-domain hypotheses — all within a single model. Protein engineers can generate de novo sequences or design molecules conditioned on a protein of interest. RNA researchers can generate sequences for specific RNA-binding proteins or design guide RNAs for CRISPR applications. Materials scientists can specify target properties in natural language and receive candidate structures from a model that also understands biological design principles. The instruction-following interface lowers the barrier for wet-lab biologists who want to explore computational hypotheses without needing to configure and coordinate multiple specialist tools.
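A typical instruction-following interaction might look like the sketch below. The repository ID, prompt wording, and sequence tags are all placeholders; consult the official NatureLM release for the actual checkpoint name and instruction format.

```python
# Hypothetical usage sketch: the model ID and prompt format are placeholders,
# not the official NatureLM release artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "microsoft/NatureLM-8B"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

prompt = ("Design a small molecule that binds the protein below "
          "and has good oral bioavailability.\n"
          "<protein>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ</protein>\n")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```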
NatureLM represents a significant step toward a general-purpose scientific foundation model, an analogue in the biological and chemical sciences of what large language models have become for text. The work challenges the prevailing assumption that specialist models are necessary for competitive performance in each scientific subdomain, demonstrating that scale and cross-domain training can narrow or close that gap. As a preprint released in February 2025, NatureLM has yet to accumulate the citation record of more established models, but the availability of model weights on HuggingFace and an active project website signal ongoing development. A key limitation is that the model operates on sequence representations and does not natively reason about three-dimensional structure; for tasks requiring explicit structural modeling, it would need to be combined with structure prediction tools such as AlphaFold or RoseTTAFold.
Xia, Y., et al. (2025). NatureLM: Deciphering the Language of Nature for Scientific Discovery. arXiv preprint arXiv:2502.07527.
DOI: 10.48550/arXiv.2502.07527