Dual-encoder contrastive model that retrieves enzymes for query reactions by matching reaction fingerprints to protein sequence embeddings.
Horizyn-1 is a dual-encoder contrastive-learning model from Dayhoff Labs that matches enzymatic reactions to the proteins capable of catalyzing them. A large fraction of known biochemical reactions are "orphans" with no assigned enzyme, and conversely many sequenced proteins have unknown or only loosely assigned catalytic function. Horizyn-1 tackles this matching problem directly by learning a shared embedding space in which a reaction and its candidate enzymes sit close together, turning enzyme discovery into a retrieval task.
The model encodes reactions as chemical fingerprints and proteins as ProtT5-XL embeddings, then trains the two encoders contrastively on millions of reaction-enzyme pairs so that compatible pairs align in a 512-dimensional space. Given a query reaction, it ranks a database of proteins by predicted catalytic compatibility; the authors report greater than 75% top-100 recall. The primary published account appeared in PNAS in March 2026, following an earlier bioRxiv preprint. Unusually for this class of model, it ships with open code and a hosted inference API.
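The ranking step and the top-k recall metric can be illustrated with a minimal sketch. This is not the Horizyn-1 code: the function names, the 2-dimensional toy embeddings, and the protein and reaction identifiers are all made up for illustration, and the recall definition used here (a query counts as a hit if at least one known enzyme appears in the top k) is an assumption about how such a metric is commonly computed, not a statement of the authors' evaluation protocol.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_proteins(reaction_emb, protein_embs):
    """Return protein ids sorted by descending cosine similarity to the query."""
    q = l2_normalize(reaction_emb)
    scores = {
        pid: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for pid, emb in protein_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def top_k_recall(rankings, true_enzymes, k):
    """Fraction of queries with at least one known enzyme among the top k hits.

    Assumed definition for this sketch; the published protocol may differ.
    """
    hits = sum(
        1
        for qid, ranked in rankings.items()
        if set(ranked[:k]) & true_enzymes[qid]
    )
    return hits / len(rankings)


# Toy embeddings and labels, invented purely for illustration.
proteins = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [0.7, 0.7]}
reactions = {"rxn_a": [0.9, 0.1], "rxn_b": [0.1, 0.9]}
truth = {"rxn_a": {"P1"}, "rxn_b": {"P3"}}

rankings = {rid: rank_proteins(emb, proteins) for rid, emb in reactions.items()}
print(rankings["rxn_a"])                    # ['P1', 'P3', 'P2']
print(top_k_recall(rankings, truth, k=1))   # 0.5
print(top_k_recall(rankings, truth, k=2))   # 1.0
```

In the toy data, the known enzyme for `rxn_b` is ranked second, so it is missed at k=1 and recovered at k=2. The same logic applies at k=100 over a real protein database, where a vectorized or approximate nearest-neighbor index would replace the explicit loop.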
Horizyn-1 is built for reaction-to-enzyme retrieval and screening, not de novo sequence design, which distinguishes it from generative enzyme-design methods and from related discovery tools such as DISCO.
Horizyn-1 uses two MLP-based encoders trained to map their inputs into a shared 512-dimensional, L2-normalized embedding space. Reactions are represented by combined RDKit and DRFP structural fingerprints; proteins are represented by pre-computed ProtT5-XL transformer embeddings. The encoders are aligned with a Maximum Likelihood Noise Contrastive Estimation (MLNCE) loss over millions of reaction-enzyme pairs, so that retrieval reduces to nearest-neighbor search in the joint space. The authors report greater than 75% top-100 recall and performance that scales logarithmically with dataset size, and show that few-shot fine-tuning (fewer than 10 examples) recovers accuracy on underrepresented EC classes. The released implementation (PyTorch Lightning) provides command-line querying against an inference checkpoint (~402 MB) and requires roughly 16 GB of GPU VRAM; the code is distributed under the PolyForm Noncommercial License 1.0.0.
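To make the idea of contrastive alignment concrete, the sketch below computes a generic InfoNCE-style loss over a small batch of paired embeddings. It deliberately does not reproduce the MLNCE loss named above, whose exact form is not given here; InfoNCE is swapped in as a simpler, widely used stand-in. The function names, the temperature of 0.1, and the toy vectors are assumptions for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce_loss(reaction_embs, protein_embs, temperature=0.1):
    """Generic InfoNCE-style loss (a stand-in, not the MLNCE loss).

    Row i of reaction_embs is treated as the positive partner of row i of
    protein_embs; every other protein in the batch acts as a negative.
    """
    r = [l2_normalize(v) for v in reaction_embs]
    p = [l2_normalize(v) for v in protein_embs]
    total = 0.0
    for i, ri in enumerate(r):
        # Cosine similarity to every protein in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(ri, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching pair as the correct class.
        total += log_denominator - logits[i]
    return total / len(r)


# Toy batch: matched pairs point the same way, swapped pairs are orthogonal.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is near zero when each reaction lies closest to its own enzyme and large when the pairing is swapped, which is the pressure that pulls compatible pairs together in the joint space. Because both sides are L2-normalized, the dot product equals cosine similarity, and ranking by it is equivalent to nearest-neighbor search.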
Horizyn-1 serves enzymologists, metabolic engineers, and biocatalysis researchers who need to find candidate enzymes for a reaction of interest — assigning function to orphan reactions, identifying promiscuous enzymes that may act on new substrates, and sourcing catalysts for non-natural transformations such as building non-canonical amino acids via lysine transamination. Because it is a retrieval tool rather than a generator, it fits naturally as a screening front end: rank a protein database for a target reaction, then take top hits forward to experimental testing. The hosted API and open code lower the barrier to integrating it into discovery pipelines.
By framing enzyme discovery as cross-modal retrieval between reaction fingerprints and protein language model embeddings, Horizyn-1 offers a scalable, experimentally validated route to closing the gap between cataloged reactions and the enzymes that catalyze them. Its demonstrations on orphan reactions, promiscuity, and non-natural chemistry, together with publication in PNAS and a release that includes open code and a hosted API, make it a practically usable contribution rather than a benchmark-only result. The principal limitation is scope: it retrieves and screens existing proteins and does not design new enzyme sequences. In addition, its code license restricts commercial use without a separate agreement.