Multimodal generative protein language model that jointly reasons over sequence, structure, and function. Trained at up to 98 billion parameters on 2.78 billion proteins.
ESM-3 is a frontier multimodal generative protein language model developed by EvolutionaryScale and released in June 2024. Unlike prior protein language models that reason exclusively over amino acid sequences, ESM-3 jointly models the sequence, three-dimensional structure, and functional annotations of proteins within a unified framework. This design allows the model to be prompted with any combination of the three modalities — for example, specifying a partial structure while leaving the sequence unconstrained — and then generate coherent completions across all tracks simultaneously.
The model represents a significant departure from the ESM lineage developed at Meta AI Research. Where ESM-2 was a sequence-only masked language model with up to 15 billion parameters, ESM-3 was built from the ground up as a generative, multimodal system. At its largest scale, ESM-3 contains 98 billion parameters and was trained with over 10^24 floating-point operations — more compute than any previously known biology foundation model — on a dataset of 2.78 billion proteins spanning the full diversity of life on Earth.
The model's most striking demonstration was the de novo design of esmGFP, a green fluorescent protein sharing only 58% sequence identity with the nearest known natural fluorescent protein. The authors estimate this degree of sequence divergence corresponds to roughly 500 million years of natural evolution, underscoring the model's capacity to explore protein space far beyond the boundaries of observed biology.
ESM-3's architecture is a multi-track transformer in which sequence, structure, and function tokens are processed through a shared stack of transformer blocks. A geometric attention module in the first block enables direct conditioning on atomic coordinates. Structure tokens are derived from backbone (N, Cα, C, O) atomic positions by a separately trained VQ-VAE (vector-quantized variational autoencoder). Function annotations are derived from InterPro domain assignments, themselves produced with libraries of profile hidden Markov models. All three token tracks are fused within a single latent space and trained jointly using a masked language modeling objective: for each protein, a mix of sequence, structure, and function positions is masked, and the model learns to predict the masked values.
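The joint objective can be pictured with a small, self-contained sketch. The PyTorch code below is illustrative only and not the released ESM-3 implementation: vocabulary sizes, layer counts, and masking rates are placeholders, and it omits the geometric attention module and the VQ-VAE structure tokenizer. It shows the core idea of summing per-track embeddings into one shared transformer trunk and applying per-track heads to masked positions.

```python
# Illustrative sketch (not the released ESM-3 code) of a joint
# masked-language-modeling objective over three token tracks.
# All sizes below are arbitrary placeholders.
import torch
import torch.nn as nn

SEQ_VOCAB, STRUCT_VOCAB, FUNC_VOCAB = 33, 4096, 260      # assumed vocab sizes
MASK_SEQ, MASK_STRUCT, MASK_FUNC = 32, 4095, 259         # per-track mask ids
D_MODEL, N_LAYERS, N_HEADS = 256, 4, 8

class MultiTrackMLM(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table per track; embeddings are summed so all
        # tracks share a single residual stream through the transformer.
        self.seq_emb = nn.Embedding(SEQ_VOCAB, D_MODEL)
        self.struct_emb = nn.Embedding(STRUCT_VOCAB, D_MODEL)
        self.func_emb = nn.Embedding(FUNC_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.trunk = nn.TransformerEncoder(layer, N_LAYERS)
        # One output head per track predicts that track's masked tokens.
        self.seq_head = nn.Linear(D_MODEL, SEQ_VOCAB)
        self.struct_head = nn.Linear(D_MODEL, STRUCT_VOCAB)
        self.func_head = nn.Linear(D_MODEL, FUNC_VOCAB)

    def forward(self, seq, struct, func):
        h = self.seq_emb(seq) + self.struct_emb(struct) + self.func_emb(func)
        h = self.trunk(h)
        return self.seq_head(h), self.struct_head(h), self.func_head(h)

def mask_track(tokens, mask_id, rate=0.15):
    """Replace a random subset of positions with the mask token."""
    masked = tokens.clone()
    is_masked = torch.rand_like(tokens, dtype=torch.float) < rate
    masked[is_masked] = mask_id
    return masked, is_masked

def mlm_loss(model, seq, struct, func):
    ce = nn.CrossEntropyLoss()
    seq_in, seq_m = mask_track(seq, MASK_SEQ)
    struct_in, struct_m = mask_track(struct, MASK_STRUCT)
    func_in, func_m = mask_track(func, MASK_FUNC)
    seq_logits, struct_logits, func_logits = model(seq_in, struct_in, func_in)
    # Cross-entropy only on the positions that were masked in each track.
    return (ce(seq_logits[seq_m], seq[seq_m])
            + ce(struct_logits[struct_m], struct[struct_m])
            + ce(func_logits[func_m], func[func_m]))

# Toy batch: 2 proteins of length 64 with random tokens on each track.
seq = torch.randint(0, SEQ_VOCAB - 1, (2, 64))
struct = torch.randint(0, STRUCT_VOCAB - 1, (2, 64))
func = torch.randint(0, FUNC_VOCAB - 1, (2, 64))
print(mlm_loss(MultiTrackMLM(), seq, struct, func).item())
```

The design point visible in the sketch is that a single residual stream carries all three tracks, so an unmasked structure token at a given position can inform the prediction of a masked sequence or function token at the same site.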
Training data was assembled from UniRef, MGnify, and other public databases for sequences; from the Protein Data Bank supplemented with AlphaFold2 and ESMFold predictions for structures; and from profile-HMM function predictions for annotations. The three model scales (1.4B, 7B, and 98B parameters) were trained at progressively higher compute budgets up to 1.07 × 10^24 FLOPs; the 98B model uses approximately 25 times more compute and 60 times more data than ESM-2. On structure prediction benchmarks, ESM-3 outperforms ESMFold but still falls short of AlphaFold2. In generative evaluations, ESM-3 achieves greater than 50% functional success rates when designing variants in distant sequence families.
ESM-3 is suited to a broad range of protein engineering and discovery tasks. Researchers can use the model for inverse folding — generating sequences that fold into a specified backbone — as well as for exploring sequence space around a functional protein to find distant homologs with altered properties. The functional conditioning mechanism supports targeted design toward proteins bearing specific InterPro domains. In drug discovery, the model can be used to generate novel antibody or enzyme scaffolds. In basic research, its multimodal representations provide rich embeddings for downstream tasks including variant effect prediction, contact prediction, and protein family classification. The open 1.4B checkpoint is accessible via the EvolutionaryScale Python SDK and HuggingFace, while the larger proprietary models are available through the EvolutionaryScale API.
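As a concrete starting point, the open 1.4B checkpoint can be prompted through the EvolutionaryScale esm package. The snippet below is a minimal sketch assuming the ESM3.from_pretrained entry point, the ESMProtein and GenerationConfig classes, and the checkpoint name esm3_sm_open_v1 documented at release; the prompt sequence itself is an arbitrary example, and underscores mark masked positions for the model to fill in. Consult the SDK documentation for current identifiers and options.

```python
# Minimal sketch of prompting the open ESM-3 checkpoint with a partially
# masked sequence, then decoding a structure for the completed sequence.
# Assumes the EvolutionaryScale `esm` package; names and defaults may
# differ across SDK versions.
import torch
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ESM3.from_pretrained("esm3_sm_open_v1").to(device)

# Arbitrary example prompt: underscores are masked residues to generate.
prompt = ESMProtein(sequence="MKTAYIAKQR" + "_" * 40 + "GSHHHHHH")

# Iteratively decode the sequence track, then decode the structure track
# conditioned on the completed sequence.
protein = model.generate(
    prompt, GenerationConfig(track="sequence", num_steps=8, temperature=0.7))
protein = model.generate(
    protein, GenerationConfig(track="structure", num_steps=8))

# Write the predicted backbone coordinates to a PDB file.
protein.to_pdb("./designed_protein.pdb")
```

The same prompting pattern extends to the other tracks: supplying structure tokens while leaving the sequence masked yields inverse folding, and adding function annotations steers generation toward proteins bearing specific InterPro domains.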
ESM-3 represents a conceptual shift in protein foundation models from discriminative sequence encoders to generative multimodal systems. Its publication in Science in January 2025 (Hayes et al.) brought broad attention to the capacity of language models to explore protein space at evolutionary timescales, with the esmGFP result serving as a concrete proof of capability. EvolutionaryScale launched alongside the model with a $142 million seed round, signaling significant commercial interest in generative biology. A key limitation is access: only the 1.4B parameter model is fully open-weight, while the more capable 7B and 98B variants require API access through EvolutionaryScale's platform. The model also inherits the standard caveats of language model approaches — generated sequences require experimental validation, and performance on proteins with little training data coverage may be reduced.
Hayes, T., et al. (2025). Simulating 500 million years of evolution with a language model. Science.
DOI: 10.1126/science.ads0018