Protein

RAG-ESM

EPFL

Retrieval-augmented encoder-decoder that conditions pretrained ESM-2 on homologous sequences via cross-attention for improved masked prediction and conditional protein sequence generation.

Released: 2025

Overview

Protein language models such as ESM-2 learn powerful representations of individual protein sequences by training on hundreds of millions of sequences from the evolutionary record. However, they do so by treating each sequence as an independent sample, discarding the rich information available in related sequences — the homologs that share evolutionary history and, in many cases, functional similarity with the query protein. Biologists have long understood that co-evolutionary patterns across homologous sequences encode structural and functional constraints: this is the insight behind multiple sequence alignments (MSAs) and the co-evolutionary statistics that powered protein structure prediction before AlphaFold. RAG-ESM, developed by Damiano Sgarbossa and Anne-Florence Bitbol at the Bitbol Lab at EPFL (École Polytechnique Fédérale de Lausanne), proposes a principled way to inject this homolog-based co-evolutionary information into a pretrained ESM-2 model using a retrieval-augmented generation (RAG) framework, without discarding the powerful pretrained representations or requiring retraining the backbone from scratch.

The key idea of RAG-ESM is to adapt the encoder-decoder transformer paradigm — familiar from natural language processing — to the protein sequence domain. The model consists of two modules that share the weights of the pretrained ESM-2 backbone: an encoder that processes an unmasked homologous context sequence and produces per-residue embeddings, and a decoder that processes the masked query sequence and integrates context information via cross-attention layers inserted between the shared ESM-2 transformer blocks. Because the self-attention and feedforward weights of both encoder and decoder are tied to the same pretrained ESM-2 parameters, the only newly introduced parameters are the cross-attention projection matrices — a small fraction of the total parameter count. The model is trained on pairs of homologous sequences drawn from the OpenProteinSet database, using a masked language modeling objective with a discrete diffusion training scheme that enables the model to generate full sequences through iterative denoising, not just predict individual masked positions.
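
The weight-sharing scheme is easiest to picture in code. Below is a minimal PyTorch-style sketch of the idea described above: one stack of pretrained ESM-2 blocks serves both the encoder and the decoder, and per-block cross-attention layers are the only new modules. Module names, dimensions (those of the smallest ESM-2), and exact insertion points are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class SharedWeightRAGModel(nn.Module):
    """Illustrative sketch of the shared-weight encoder-decoder idea.
    `esm_blocks` stands in for the pretrained ESM-2 transformer blocks
    (each mapping a (batch, length, d_model) tensor to the same shape)."""

    def __init__(self, esm_blocks: nn.ModuleList, d_model: int = 320, n_heads: int = 20):
        super().__init__()
        self.esm_blocks = esm_blocks  # pretrained weights, used by BOTH encoder and decoder
        # The only newly introduced parameters: one cross-attention layer per block.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in esm_blocks]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in esm_blocks])

    def encode(self, context_emb):
        # Encoder: the unmasked homolog passes through the plain ESM-2 stack.
        h = context_emb
        for block in self.esm_blocks:
            h = block(h)
        return h

    def decode(self, query_emb, context_states):
        # Decoder: the masked query passes through the SAME blocks, with
        # cross-attention to the homolog embeddings after each block.
        h = query_emb
        for block, xattn, norm in zip(self.esm_blocks, self.cross_attn, self.norms):
            h = block(h)
            attended, _ = xattn(h, context_states, context_states)  # Q from query, K/V from context
            h = norm(h + attended)
        return h

    def forward(self, query_emb, context_emb):
        return self.decode(query_emb, self.encode(context_emb))
```

Because `esm_blocks` appears once and is reused in both passes, the encoder and decoder stay tied to the same pretrained parameters by construction.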

Published in PRX Life in August 2025 (preprint April 2025), RAG-ESM achieves two main capabilities simultaneously: improved masked amino acid prediction when a homologous context sequence is available, and state-of-the-art conditional protein sequence generation and motif scaffolding among sequence-based models. The work is notable for demonstrating that sequence alignment capabilities — the ability to identify structurally and functionally equivalent positions across divergent homologs — emerge spontaneously in specific cross-attention heads without any explicit alignment supervision, mirroring findings from natural language models where syntactic structure emerges from self-attention patterns.

Key Features

  • Shared-weight encoder-decoder architecture: RAG-ESM reuses the pretrained ESM-2 parameter matrices for both the encoder (processing the homolog context) and the decoder (processing the query), with cross-attention layers as the only newly introduced parameters. This design minimizes the number of parameters that must be learned from scratch, leverages the full power of ESM-2's pretrained representations in both processing streams, and keeps the total model size close to the original ESM-2 backbone.

  • Retrieval-augmented conditioning on homologs: At inference time, one or more homologous sequences are retrieved from a protein sequence database (e.g., UniRef or a custom database appropriate to the target protein family) and provided as context to the encoder. The decoder then generates predictions for the query sequence conditioned on the context embeddings, effectively giving the model access to co-evolutionary constraints without requiring the computational overhead of full MSA construction and processing.

  • Discrete diffusion for conditional sequence generation: Beyond masked residue prediction, RAG-ESM is trained with a discrete diffusion objective that enables generation of complete protein sequences through iterative denoising. Starting from a fully masked or randomly corrupted sequence, the model iteratively fills in residues conditioned on the homolog context, producing novel sequences that sample from the region of sequence space defined by the context protein's family. This makes RAG-ESM directly applicable to conditional protein design tasks.

  • Spontaneous emergence of sequence alignment in cross-attention: A mechanistically interesting finding is that specific cross-attention heads in RAG-ESM learn to align the query sequence with the context sequence — that is, to identify which positions in the context correspond to which positions in the query — without any explicit alignment supervision during training. This emergent alignment capability appears to underlie the model's ability to transfer functional and structural constraints from the context to predictions about the query, and provides interpretable mechanistic insight into how the model uses the retrieved homolog.

  • Substantial perplexity reduction relative to single-sequence ESM-2: When evaluated on masked amino acid prediction with a closely related homolog as context, the ~12M- and ~165M-parameter RAG-ESM models reduce perplexity by approximately 48% and 43%, respectively, compared to their ESM-2 base models (8M and 150M parameters). This demonstrates that the retrieved context provides information genuinely complementary to what ESM-2 captures from single sequences; a sketch of how such a perplexity comparison is computed follows this list.

  • State-of-the-art motif scaffolding among sequence-based models: On motif scaffolding benchmarks — the task of generating a protein that presents a specified functional motif at a target location — RAG-ESM outperforms larger purely sequence-based models, including DPLM (650M parameters), as well as MSA-based generative models such as EvoDiff-MSA. Structure-based methods such as RFdiffusion and the multimodal ESM-3 remain superior on some motifs, a useful reference point for what a purely sequence-based approach can currently achieve.
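
To make the perplexity comparison above concrete: perplexity is the exponential of the mean negative log-likelihood the model assigns to the true residues at masked positions, computed once for single-sequence ESM-2 and once for RAG-ESM given a homolog as context. A minimal sketch follows; the numbers in the comment are hypothetical, chosen only to show what a ~48% reduction looks like.

```python
import torch
import torch.nn.functional as F

def masked_perplexity(logits, targets, mask):
    """Perplexity over masked positions: exp of the mean negative log-likelihood.
    `logits` is (L, vocab), `targets` is (L,) with true residue indices, and
    `mask` is a boolean (L,) tensor marking the masked positions."""
    nll = F.cross_entropy(logits[mask], targets[mask])
    return torch.exp(nll).item()

# Hypothetical illustration: if single-sequence ESM-2 scores a perplexity of 10.0
# on a set of masked positions, a ~48% reduction corresponds to RAG-ESM scoring
# roughly 5.2 on the same positions with a close homolog as context.
```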

Technical Details

RAG-ESM is built on the ESM-2 architecture and is available in two sizes, derived from the 8M-parameter and 150M-parameter ESM-2 base models (yielding RAG-ESM models of roughly 12M and 165M parameters, respectively). Cross-attention layers are inserted after every ESM-2 transformer block in the decoder. Each cross-attention layer uses the standard multi-head attention formulation: the query projections are derived from the decoder's residue representations, while the key and value projections are derived from the encoder's context sequence representations. The projection matrices of these cross-attention layers are initialized randomly and trained with the RAG-ESM objective, while the self-attention and feedforward weights are initialized from pretrained ESM-2 and continue to be updated during training.
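
Since only the cross-attention layers (and their layer norms) are new, it is straightforward to separate them from the shared backbone, for example to count them or give them their own optimizer settings. The helper below reuses the naming of the sketch in the Overview section and is illustrative only; the released code may organize parameters differently.

```python
def count_new_parameters(model):
    """Split parameters into the pretrained ESM-2 backbone and the newly
    introduced cross-attention modules, following the hypothetical
    SharedWeightRAGModel naming used in the earlier sketch."""
    pretrained, new = 0, 0
    for name, p in model.named_parameters():
        if name.startswith(("cross_attn", "norms")):
            new += p.numel()         # randomly initialized, specific to RAG-ESM
        else:
            pretrained += p.numel()  # initialized from pretrained ESM-2, still fine-tuned
    return pretrained, new
```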

Training uses the OpenProteinSet dataset, a large curated collection of protein sequence clusters with associated MSAs. During training, pairs of sequences are sampled from within the same cluster, with one sequence designated as the input query (subjected to random masking) and the closest neighbor by Hamming distance designated as the context sequence. This choice of context selection — nearest neighbor in the training cluster — simulates the retrieval step that will be performed at inference time. The training objective combines standard cross-entropy loss over masked positions with a discrete diffusion loss that enables multi-step generation. An error correction strategy is incorporated during the diffusion denoising process to reduce accumulation of errors across generation steps.
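
A minimal sketch of this pairing step is shown below. It assumes the sequences in a cluster come from the same alignment, so that Hamming distance over equal-length sequences is meaningful; the masking rate and other details are illustrative rather than the paper's exact recipe.

```python
import random

def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions between two equal-length (aligned) sequences."""
    return sum(x != y for x, y in zip(a, b))

def make_training_pair(cluster, mask_token="<mask>", mask_rate=0.15):
    """Build one (masked query, target query, context) example from a cluster of
    homologous, aligned sequences. Assumes the cluster has at least two sequences
    of equal length."""
    qi = random.randrange(len(cluster))
    query = cluster[qi]
    # Context = nearest neighbor of the query within the cluster (Hamming distance),
    # mimicking the retrieval step performed at inference time.
    candidates = [s for i, s in enumerate(cluster) if i != qi and len(s) == len(query)]
    context = min(candidates, key=lambda s: hamming_distance(query, s))
    # Randomly mask query positions for the masked-LM / diffusion objective.
    masked_query = [mask_token if random.random() < mask_rate else aa for aa in query]
    return masked_query, query, context
```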

Parameter efficiency follows directly from the weight sharing: the cross-attention projections are the only newly introduced parameters and make up a small fraction of the total RAG-ESM parameter count, with the majority of parameters shared directly from the pretrained ESM-2 backbone. For sequence generation, inference iterates over multiple denoising steps (the number is configurable), at each step conditioning on both the partially generated sequence and the context homolog. For masked prediction alone (no generation), inference is a single forward pass, equivalent in cost to standard ESM-2 inference plus the marginal overhead of the cross-attention computation.
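
The generation procedure can be pictured as iterative unmasking: start from an all-mask query (apart from any motif residues to scaffold), and at each denoising step commit the most confident predictions while keeping the homolog context fixed. The sketch below assumes a hypothetical `logits_fn(query_tokens, context_tokens)` returning per-position logits; it illustrates the general masked-diffusion recipe rather than the authors' exact sampler, which additionally applies the error-correction strategy mentioned above.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK = len(AA)  # index of the mask token in this toy vocabulary

@torch.no_grad()
def iterative_denoise(logits_fn, context_tokens, length, motif=None, n_steps=20):
    """Conditional generation by iterative unmasking. `logits_fn` stands in for a
    RAG-ESM-like model and is assumed to return a (length, vocab) logits tensor
    for the current query conditioned on the homolog context."""
    tokens = torch.full((length,), MASK, dtype=torch.long)
    for pos, aa in (motif or {}).items():      # motif scaffolding: keep these residues fixed
        tokens[pos] = AA.index(aa)
    per_step = max(1, int((tokens == MASK).sum()) // n_steps)
    for _ in range(n_steps):
        masked = tokens == MASK
        if not masked.any():
            break
        logits = logits_fn(tokens, context_tokens)    # conditioned on the homolog context
        probs = logits[:, :len(AA)].softmax(dim=-1)   # restrict predictions to amino acids
        conf, pred = probs.max(dim=-1)
        conf[~masked] = -1.0                          # only fill still-masked positions
        for pos in conf.topk(per_step).indices.tolist():
            if masked[pos]:
                tokens[pos] = pred[pos]
    # Fill any positions left masked by the step schedule in one final pass.
    masked = tokens == MASK
    if masked.any():
        pred = logits_fn(tokens, context_tokens)[:, :len(AA)].argmax(dim=-1)
        tokens[masked] = pred[masked]
    return "".join(AA[t] for t in tokens.tolist())
```

For masked prediction without generation, the same conditioned forward pass is simply run once, with only the positions of interest masked.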

Applications

RAG-ESM opens several avenues in computational protein design and annotation that were difficult to address with single-sequence language models alone. For protein engineers designing variants within a known protein family, RAG-ESM can generate diverse sequences conditioned on a natural homolog as context, sampling functional sequence space in a family-informed manner that respects co-evolutionary constraints. This is particularly relevant for enzyme engineering, antibody design, and therapeutic protein optimization, where generated variants must remain within the evolutionary envelope of functional sequences. For motif scaffolding — designing a protein scaffold that places a specific functional motif (an active site, a binding epitope, a disulfide bridge) in a structurally compatible context — RAG-ESM's retrieval-augmented generation provides a principled mechanism for sampling scaffolds consistent with the structural constraints implied by homologs that carry similar motifs. Researchers studying protein families with few members can use RAG-ESM in inference mode to impute probable residue identities at ambiguous positions, leveraging context from related proteins to sharpen predictions beyond what single-sequence ESM-2 achieves. The emergent sequence alignment capability of cross-attention heads may also prove useful for protein sequence comparison applications beyond the primary prediction tasks.

Impact

RAG-ESM represents a conceptually clean solution to a recognized limitation of single-sequence protein language models: the discarding of co-evolutionary information that is available in homologs and that has historically been the most powerful signal for understanding protein structure and function. By demonstrating that retrieval augmentation can be grafted onto pretrained ESM-2 models via parameter-efficient cross-attention layers, the work establishes a modular design pattern for incorporating database retrieval into protein language models that other groups may adapt for different retrieval sources (e.g., structural neighbors from AlphaFold DB, functional analogs from enzyme databases). The finding that sequence alignment emerges spontaneously in cross-attention heads connects RAG-ESM to a broader literature on emergent capabilities in large models and provides a mechanistically interpretable account of why the approach works. The paper was published in PRX Life, a journal of the American Physical Society focused on quantitative biology, reflecting the theoretical depth of the mechanistic analysis alongside the practical results. Key limitations include dependence on homolog availability (performance degrades when the retrieved context is evolutionarily distant) and the absence of structure prediction benchmarks, which leaves open how the improved sequence representations translate to structural accuracy. The code and model weights are publicly released under the Bitbol-Lab GitHub organization, enabling direct reuse and extension.

Tags

protein design · de novo design · motif scaffolding · transformer · generative · self-supervised · transfer learning · proteomics

Resources

GitHub Repository
Research Paper