Retrieval-augmented protein encoder models (150M–600M params) that condition on homologous sequences via block-causal multi-sequence attention; a drop-in ESM replacement.
E1 is a family of retrieval-augmented protein encoder models released by Profluent Bio in November 2025. Where most protein language models embed a single query sequence in isolation, E1 conditions its representations on a set of retrieved homologous sequences, bringing the evolutionary signal that profile- and MSA-based methods have long exploited into the architecture of a modern transformer encoder. The result is a model that combines the convenience of a single-pass encoder with the accuracy gains that come from explicitly attending to a query's evolutionary context.
The central problem E1 addresses is the trade-off researchers have faced between two families of protein models. Single-sequence encoders such as the ESM series are fast and easy to deploy but discard the rich coevolutionary information present in a protein's homologs, while MSA-based and structure-prediction systems capture that information at substantially greater computational and engineering cost. E1 narrows this gap by retrieving homologs and fusing them into the encoder through a block-causal multi-sequence attention mechanism, so a single forward pass can either run in standard single-sequence mode or take advantage of retrieved context when it is available.
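The attention pattern described above can be illustrated with a small mask-building sketch. This is not taken from the E1 codebase. It assumes the usual reading of "block-causal": tokens attend bidirectionally within their own sequence and to every earlier sequence in the context, with the query placed last. The function name `block_causal_mask` and the toy lengths are hypothetical.

```python
def block_causal_mask(block_lengths):
    """Build a boolean attention mask for concatenated sequences.

    block_lengths: token counts per sequence, in context order
    (assumed here: retrieved homologs first, query last).
    Returns mask[i][j] == True when token i may attend to token j.
    """
    # Label every token with the index of the sequence (block) it belongs to.
    block_ids = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    # A token sees its own block in full and all earlier blocks, never later ones.
    return [[bi >= bj for bj in block_ids] for bi in block_ids]


# Toy example: two homologs (2 and 3 tokens) followed by a 2-token query.
mask = block_causal_mask([2, 3, 2])

# Passing a single length reproduces ordinary bidirectional attention,
# which corresponds to single-sequence mode.
single = block_causal_mask([4])
```

Under this assumption, the query rows of the mask are fully populated, so the query representation can draw on every retrieved homolog, while the first homolog sees only itself.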
E1 is explicitly designed as a drop-in replacement for the ESM family, lowering the barrier to adoption for teams that already build on ESM embeddings. The models are released in three sizes (E1-150m, E1-300m, and E1-600m) through the Profluent-AI GitHub repository and the Profluent-Bio Hugging Face organization.
E1 is a transformer encoder pretrained with a masked language modeling objective on approximately 4 trillion tokens. Its defining architectural component is block-causal multi-sequence attention, which arranges retrieved homologs as prepended context blocks attended to alongside the query sequence, allowing the encoder to incorporate evolutionary information directly rather than through an external profile. The family comprises three variants (150M, 300M, and 600M parameters), all released with BF16 weights. According to the accompanying preprint, E1 achieves state-of-the-art zero-shot results on two tasks: protein fitness prediction, measured by average Spearman correlation on the substitution assays of the ProteinGym benchmark, and unsupervised contact-map prediction, evaluated on CAMEO. The repository provides notebooks for fitness prediction, site-saturation mutagenesis, and embedding extraction, each with both single-sequence and retrieval-augmented inference modes.
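For readers unfamiliar with the fitness metric mentioned above, the sketch below computes Spearman correlation between predicted scores and measured fitness values, then averages it over assays. It is a minimal pure-Python illustration. The helper names are hypothetical, and the plain unweighted mean in `mean_spearman` is an assumption; the benchmark's own aggregation scheme may differ.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def mean_spearman(assays):
    """Unweighted mean over (predicted, measured) pairs, one pair per assay."""
    return sum(spearman(p, m) for p, m in assays) / len(assays)
```

Because the metric depends only on rank order, a model is rewarded for ordering variants correctly even when its raw scores are on a different scale from the experimental measurements.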
E1 is suited to the protein representation tasks that ESM-class encoders are typically used for, with added accuracy from retrieval where homologs are available. Researchers can use it for zero-shot variant effect and fitness prediction to prioritize mutations for experimental testing, for unsupervised contact prediction to inform structural hypotheses, and as a general-purpose embedding backbone for downstream property prediction and protein engineering workflows. Because E1 is designed as a drop-in ESM replacement, teams with existing ESM-based pipelines for antibody engineering, enzyme design, or variant interpretation can adopt it with minimal changes and opt into retrieval when an evolutionary context is worth the extra cost.
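As an illustration of how zero-shot scores can be used to prioritize mutations, the sketch below ranks candidate substitutions by a log-likelihood ratio between the mutant and wild-type residue at the mutated position. This is a scoring rule commonly used with masked language models; the text above does not state which rule E1's notebooks use, so treat it as an assumption. The per-position log-probabilities are toy values standing in for model output, and all names are hypothetical.

```python
import math


def parse_substitution(mutation):
    """Split a substitution such as 'K2R' into (wild type, 1-based position, mutant)."""
    return mutation[0], int(mutation[1:-1]), mutation[-1]


def score_substitution(log_probs, sequence, mutation):
    """Log-odds of the mutant versus the wild-type residue at one position.

    log_probs: one dict per position mapping residue -> log-probability
    (toy stand-in for what a masked language model would produce).
    """
    wt, pos, mut = parse_substitution(mutation)
    if sequence[pos - 1] != wt:
        raise ValueError(f"{mutation}: expected {wt} at position {pos}")
    site = log_probs[pos - 1]
    return site[mut] - site[wt]


# Toy data only: a 3-residue sequence and made-up probabilities.
sequence = "MKV"
log_probs = [
    {"M": math.log(0.9), "L": math.log(0.1)},
    {"K": math.log(0.5), "R": math.log(0.4), "P": math.log(0.1)},
    {"V": math.log(0.6), "I": math.log(0.4)},
]

candidates = ["K2R", "K2P", "V3I"]
# Higher score means the model considers the substitution more plausible.
ranked = sorted(
    candidates,
    key=lambda m: score_substitution(log_probs, sequence, m),
    reverse=True,
)
```

In a real workflow the log-probabilities would come from the encoder, optionally conditioned on retrieved homologs, and the top-ranked candidates would be the ones carried forward to experimental testing.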
E1 contributes to a broader shift toward bringing evolutionary context back into protein foundation models without paying the full cost of explicit MSA construction, positioning retrieval as a practical middle ground between single-sequence encoders and alignment-based methods. By packaging this capability as an ESM-compatible drop-in, Profluent lowers the switching cost for the large community already built on ESM embeddings, and the open-source code release under Apache-2.0 supports reproduction and extension. A notable caveat concerns licensing: while the code is Apache-2.0, the model weights are distributed under a custom gated clickthrough license (profluent-e1-clickthrough-license) with attribution requirements rather than a standard open-source license, so users should review those terms before commercial deployment. E1 is also distinct from Profluent's generative ProGen lineage: it is an encoder for representation and scoring, not a sequence generator.
Jain, S., et al. (2025) E1: Retrieval-Augmented Protein Encoder Models. bioRxiv.
DOI: 10.1101/2025.11.12.688125