E1

Retrieval-augmented protein encoder models (150M–600M params) that condition on homologous sequences via block-causal multi-sequence attention; a drop-in ESM replacement.

Released: November 2025

E1 is a family of retrieval-augmented protein encoder models released by Profluent Bio in November 2025. Where most protein language models embed a single query sequence in isolation, E1 conditions its representations on a set of retrieved homologous sequences, bringing the evolutionary signal that profile- and MSA-based methods have long exploited into the architecture of a modern transformer encoder. The result is a model that combines the convenience of a single-pass encoder with the accuracy gains that come from explicitly attending to a query's evolutionary context.

The central problem E1 addresses is the trade-off between two families of protein models. Single-sequence encoders such as the ESM series are fast and easy to deploy but discard the coevolutionary information present in a protein's homologs, while MSA-based and structure-prediction systems capture that information at substantially greater computational and engineering cost. E1 narrows this gap by retrieving homologs and fusing them into the encoder through a block-causal multi-sequence attention mechanism, so the same model can run in standard single-sequence mode or, when homologs are available, use the retrieved context in a single forward pass.

E1 is explicitly designed as a drop-in replacement for the ESM family, which lowers the barrier to adoption for teams that already build on ESM embeddings. The models are released in three sizes (E1-150m, E1-300m, and E1-600m) through the Profluent-AI GitHub repository and the Profluent-Bio HuggingFace organization.

Key Features

Retrieval-augmented encoding: E1 retrieves homologous sequences for a query and conditions its representations on them, injecting coevolutionary signal that single-sequence models cannot access while preserving a single encoder forward pass.
Block-causal multi-sequence attention: A custom attention scheme lets the model attend across a block of prepended homologs and the query together. This is the mechanism that fuses retrieved context into the query's contextualized embeddings.
Drop-in ESM replacement: The models expose a masked-token interface and embedding outputs compatible with the ESM family, so existing pipelines can swap in E1 with minimal code changes and optionally enable retrieval.
Three model sizes: E1-150m, E1-300m, and E1-600m span a range of parameter budgets, letting users trade compute for accuracy depending on the task and available hardware.
Zero-shot state-of-the-art: E1 reports state-of-the-art zero-shot performance on protein fitness prediction and contact-map benchmarks without any task-specific fine-tuning.
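
The block-causal multi-sequence attention named in the list above can be pictured as an attention mask over the concatenated homologs and query. The sketch below is illustrative only and is not taken from the E1 code. It assumes that each token attends bidirectionally within its own sequence and to every token of the sequences placed before it, and never to later sequences. The function name `block_causal_mask` and the sequence lengths are hypothetical.

```python
from typing import List


def block_causal_mask(lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not confirmed by the source): attention is bidirectional
    inside a sequence and causal across sequences, so a token sees its own
    block and all earlier blocks.
    """
    # Map every token position to the index of the sequence it belongs to.
    block_of = [b for b, n in enumerate(lengths) for _ in range(n)]
    total = len(block_of)
    return [
        [block_of[j] <= block_of[i] for j in range(total)]
        for i in range(total)
    ]


# Hypothetical layout: two retrieved homologs (2 and 3 tokens), then the query (2 tokens).
mask = block_causal_mask([2, 3, 2])

# Print the mask as a grid: '#' means attention is allowed, '.' means blocked.
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

Under this assumption the query, placed last, sees every retrieved homolog, while the homologs never see the query. With a single sequence the mask is fully bidirectional, which corresponds to the single-sequence mode.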

Technical Details

E1 is a transformer encoder pretrained with a masked language modeling objective on approximately 4 trillion tokens. Its defining architectural component is block-causal multi-sequence attention, which arranges retrieved homologs as prepended context blocks attended to alongside the query sequence, allowing the encoder to incorporate evolutionary information directly rather than through an external profile. The family comprises three variants — 150M, 300M, and 600M parameters — released with BF16 weights. According to the accompanying preprint, E1 achieves state-of-the-art zero-shot results on protein fitness prediction, evaluated by average Spearman correlation on the substitution assays of the ProteinGym benchmark, and on unsupervised contact-map prediction evaluated on CAMEO. The repository provides notebooks for fitness prediction, site-saturation mutagenesis, and embedding extraction, with both single-sequence and retrieval-augmented inference modes.
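
The fitness metric mentioned above, Spearman correlation averaged over assays, can be computed as in the following minimal sketch. It uses only the standard library, and the assay names and numbers are made up for illustration. It does not reproduce the ProteinGym evaluation pipeline, which may aggregate assays differently.

```python
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(assays: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """Average the per-assay Spearman correlation of (predicted, measured)."""
    per_assay = [spearman(pred, meas) for pred, meas in assays.values()]
    return sum(per_assay) / len(per_assay)


# Hypothetical data: model scores and measured fitness for two small assays.
toy_assays = {
    "assay_a": ([0.1, 0.4, 0.9], [1.0, 2.0, 3.0]),
    "assay_b": ([0.2, 0.5, 0.3], [3.0, 1.0, 2.0]),
}
toy_mean = mean_spearman(toy_assays)
```

Because Spearman correlation depends only on rank order, a model's raw scores do not need to be on the same scale as the experimental measurements.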

Applications

E1 is suited to the protein representation tasks that ESM-class encoders are typically used for, with added accuracy from retrieval where homologs are available. Researchers can use it for zero-shot variant effect and fitness prediction to prioritize mutations for experimental testing, for unsupervised contact prediction to inform structural hypotheses, and as a general-purpose embedding backbone for downstream property prediction and protein engineering workflows. Because E1 is designed as a drop-in ESM replacement, teams with existing ESM-based pipelines for antibody engineering, enzyme design, or variant interpretation can adopt it with minimal changes and opt into retrieval when an evolutionary context is worth the extra cost.
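
To make the variant-prioritization use case concrete, the sketch below scores every single substitution of a short sequence, as in a site-saturation mutagenesis scan. It assumes a scoring rule that is common for masked language models: the log-probability of the mutant residue minus that of the wild-type residue at the same position. Whether the E1 notebooks use exactly this rule is an assumption. The `log_probs` table is a hypothetical stand-in for model output, filled here with toy numbers instead of a real E1 call.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def saturation_scores(
    wild_type: str, log_probs: List[Dict[str, float]]
) -> Dict[str, float]:
    """Score each single substitution as log p(mutant) - log p(wild type).

    log_probs[pos][aa] is assumed to hold the model's log-probability of
    residue aa at position pos (hypothetical input, not an E1 API).
    """
    scores: Dict[str, float] = {}
    for pos, wt in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            # Conventional label: wild-type residue, 1-based position, mutant residue.
            scores[f"{wt}{pos + 1}{aa}"] = log_probs[pos][aa] - log_probs[pos][wt]
    return scores


# Toy predictions for the two-residue sequence "MK":
# position 1 strongly prefers M, position 2 has no preference.
peaked = {aa: math.log(0.81 if aa == "M" else 0.01) for aa in AMINO_ACIDS}
uniform = {aa: math.log(1 / 20) for aa in AMINO_ACIDS}
scores = saturation_scores("MK", [peaked, uniform])

# Rank substitutions from most to least favorable under the toy predictions.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A negative score means the toy predictions favor the wild-type residue, so substitutions at the strongly conserved first position rank below those at the unconstrained second position.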

Impact

E1 contributes to a broader shift toward bringing evolutionary context back into protein foundation models without paying the full cost of explicit MSA construction, positioning retrieval as a practical middle ground between single-sequence encoders and alignment-based methods. By packaging this capability as an ESM-compatible drop-in, Profluent lowers the switching cost for the large community already built on ESM embeddings, and the open-source code release under Apache-2.0 supports reproduction and extension. A notable caveat concerns licensing: while the code is Apache-2.0, the model weights are distributed under a custom gated clickthrough license (profluent-e1-clickthrough-license) with attribution requirements rather than a standard open-source license, so users should review those terms before commercial deployment. E1 is also distinct from Profluent's generative ProGen lineage — it is an encoder for representation and scoring, not a sequence generator.

Citation

E1: Retrieval-Augmented Protein Encoder Models

Preprint

Jain, S., et al. (2025) E1: Retrieval-Augmented Protein Encoder Models. bioRxiv.

DOI: 10.1101/2025.11.12.688125

Recent citations

Papers that recently cited this model.

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
Aadyot Bhatnagar, Peter Morch Groth, Ali Madani
Apr 2026
0
Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation
Andrea Giuseppe Di Francesco, Andrea Rubbi, Pietro Liò
Mar 2026
0
Evolutionary profile enhancement improves protein function annotation for remote homologs
Shitong Dai, Jiaqi Luo, Yunan Luo
bioRxiv · Mar 2026
0

Top citations

The most-cited papers that cite this model.

EvoPool: Evolution-Guided Pooling of Protein Language Model Embeddings
Navid NaderiAlizadeh, Rohit Singh
bioRxiv · Feb 2026
3 · Influential
From Words to Amino Acids: Does the Curse of Depth Persist?
Aleena Siji, Amir Mohammad Karimi-Mamaghan, Ferdinand Kapl, et al.
arXiv.org · Feb 2026
2

Citations

Total Citations: 9

Influential: 2

References: 0

GitHub

Stars: 113

Forks: 15

Open Issues: 2

Contributors: 1

Last Push: 3mo ago

Language: Python

HuggingFace

Downloads: 16.9K

Likes: 1

Last Modified: 7mo ago

Fields of citing research

Biology: 100%
Computer Science: 100%
Chemistry: 22%
Medicine: 11%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Overall score: 47 (Partial)

Usability (can I run it?): 71

Reproducibility (can I retrain it?): 17

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model

