bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein foundation models

Horizyn-1

Dayhoff Labs

Dual-encoder contrastive model that retrieves enzymes for query reactions by matching reaction fingerprints to protein sequence embeddings.

Released: March 2026

Horizyn-1 is a dual-encoder contrastive-learning model from Dayhoff Labs that matches enzymatic reactions to the proteins capable of catalyzing them. A large fraction of known biochemical reactions are "orphan" — they have no assigned enzyme — and conversely many sequenced proteins have unknown or only loosely assigned catalytic function. Horizyn-1 tackles this matching problem directly by learning a shared embedding space in which a reaction and its candidate enzymes sit close together, turning enzyme discovery into a retrieval task.

The model encodes reactions as chemical fingerprints and proteins as ProtT5-XL embeddings, then trains the two encoders contrastively on millions of reaction-enzyme pairs so that compatible pairs align in a 512-dimensional space. Given a query reaction, it ranks a database of proteins by predicted catalytic compatibility; the authors report greater than 75% top-100 recall. The primary paper appeared in PNAS in March 2026, following an earlier bioRxiv preprint. Unusually for this class of model, it ships with public code and a hosted inference API.
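The ranking step described above can be illustrated with a minimal sketch, assuming the reaction and protein embeddings have already been computed. The function names (`l2_normalize`, `rank_proteins`), the protein IDs, and the 3-dimensional toy vectors are hypothetical and are not taken from the Horizyn-1 code; the real embeddings are 512-dimensional.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_proteins(reaction_emb, protein_embs, top_k=100):
    """Return up to top_k (protein_id, score) pairs sorted by cosine similarity."""
    q = l2_normalize(reaction_emb)
    scored = []
    for pid, emb in protein_embs.items():
        p = l2_normalize(emb)
        # On unit vectors the dot product equals cosine similarity.
        scored.append((pid, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (hypothetical values chosen for illustration only).
proteins = {
    "P_A": [0.9, 0.1, 0.0],
    "P_B": [0.0, 1.0, 0.0],
    "P_C": [0.7, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]
hits = rank_proteins(query, proteins, top_k=2)
```

At database scale a brute-force loop like this would normally be replaced by a vectorized or approximate nearest-neighbor index, but the ranking criterion stays the same.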

Horizyn-1 is built for reaction-to-enzyme retrieval and screening, not de novo sequence design. This distinguishes it from generative enzyme-design methods and from related discovery tools such as DISCO.

#Key Features

  • Reaction-to-enzyme retrieval: Given a query reaction, ranks proteins by catalytic compatibility, achieving over 75% top-100 recall against large enzyme databases.
  • Dual-encoder contrastive design: A reaction encoder (RDKit and DRFP fingerprints through an MLP) and a protein encoder (ProtT5-XL embeddings through an MLP) are trained with a Maximum Likelihood Noise Contrastive Estimation (MLNCE) objective to produce aligned 512-dimensional embeddings.
  • Experimentally grounded scope: Validated for orphan reactions, enzyme promiscuity, and non-natural reactions, including lysine transamination for non-canonical amino acids.
  • Few-shot adaptability: Fine-tuning on fewer than 10 examples improves performance on underrepresented reaction classes.
  • Predictable scaling: Performance scales logarithmically with training dataset size.
  • Open and hosted: Released with Python/PyTorch-Lightning code on GitHub and a hosted inference API at horizyn.dayhofflabs.com.
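The top-100 recall figure quoted in the list above can be made concrete with a small sketch. It assumes one common definition of top-k recall: a query reaction counts as a hit if at least one of its known enzymes appears among the first k ranked proteins. The page does not state which definition the authors use, so treat this as an assumption; the function name and the toy IDs are hypothetical.

```python
def top_k_recall(rankings, true_enzymes, k=100):
    """Fraction of query reactions with at least one known enzyme in the top k.

    rankings:     dict mapping reaction id -> list of protein ids, best first
    true_enzymes: dict mapping reaction id -> set of known enzyme ids
    """
    if not rankings:
        return 0.0
    hits = 0
    for reaction_id, ranked_ids in rankings.items():
        known = true_enzymes.get(reaction_id, set())
        if known.intersection(ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)

# Toy data (hypothetical): two query reactions, three ranked proteins each.
rankings = {
    "rxn1": ["P3", "P1", "P7"],
    "rxn2": ["P4", "P5", "P2"],
}
true_enzymes = {"rxn1": {"P1"}, "rxn2": {"P2"}}

recall_at_2 = top_k_recall(rankings, true_enzymes, k=2)
recall_at_3 = top_k_recall(rankings, true_enzymes, k=3)
```

With this toy data, only `rxn1` has a known enzyme in its top 2, while both reactions have one in their top 3.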

#Technical Details

Horizyn-1 uses two MLP-based encoders that map into a shared 512-dimensional, L2-normalized embedding space. Reactions are represented by combined RDKit and DRFP structural fingerprints; proteins are represented by pre-computed ProtT5-XL transformer embeddings. The encoders are aligned with a Maximum Likelihood Noise Contrastive Estimation (MLNCE) loss over millions of reaction-enzyme pairs, so that retrieval reduces to nearest-neighbor search in the joint space. The authors report greater than 75% top-100 recall and logarithmic performance scaling with dataset size, and show that few-shot fine-tuning (fewer than 10 examples) recovers accuracy on underrepresented Enzyme Commission (EC) classes. The released PyTorch Lightning implementation provides command-line querying against an inference checkpoint (~402 MB) and requires roughly 16 GB of GPU VRAM. The code is distributed under the PolyForm Noncommercial License 1.0.0.
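To give a feel for how a contrastive objective pulls matching pairs together, here is a minimal sketch of a generic InfoNCE-style loss with in-batch negatives. This is a stand-in chosen for illustration and is not necessarily the MLNCE objective used by Horizyn-1, whose exact form is not given on this page. The function name, the temperature value, and the toy 2-dimensional embeddings are assumptions; inputs are assumed to be unit-length already.

```python
import math

def info_nce_loss(reaction_embs, protein_embs, temperature=0.07):
    """Mean cross-entropy of matching reaction i to protein i within a batch.

    Both arguments are equal-length lists of unit-length vectors, where
    reaction_embs[i] and protein_embs[i] form a positive pair and every
    other protein in the batch serves as a negative.
    """
    n = len(reaction_embs)
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(reaction_embs[i], protein_embs[j])) / temperature
            for j in range(n)
        ]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy batch (hypothetical): two orthogonal unit vectors.
reactions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(reactions, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss(reactions, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each reaction is closest to its own protein and large when the pairing is swapped, which is the behavior that makes nearest-neighbor retrieval meaningful after training.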

#Applications

Horizyn-1 serves enzymologists, metabolic engineers, and biocatalysis researchers who need to find candidate enzymes for a reaction of interest. Typical uses include assigning function to orphan reactions, identifying promiscuous enzymes that may act on new substrates, and sourcing catalysts for non-natural transformations such as building non-canonical amino acids via lysine transamination. Because it is a retrieval tool rather than a generator, it fits naturally as a screening front end: rank a protein database for a target reaction, then take the top hits forward to experimental testing. The hosted API and public code lower the barrier to integrating it into discovery pipelines.

#Impact

By framing enzyme discovery as cross-modal retrieval between reaction fingerprints and protein language model embeddings, Horizyn-1 offers a scalable, experimentally validated route to closing the gap between cataloged reactions and the enzymes that run them. Its demonstrations on orphan reactions, promiscuity, and non-natural chemistry, together with publication in PNAS and a release that includes code and a hosted API, make it a practically usable contribution rather than a benchmark-only result. The principal limitation is scope: it retrieves and screens existing proteins and does not design new enzyme sequences. In addition, its code license (PolyForm Noncommercial License 1.0.0) restricts commercial use without a separate agreement.

GitHub

Stars: 11
Forks: 1
Open issues: 2
Contributors: 3
Last push: 12d ago
Language: Python

Openness

bio.rodeo openness: 21 (Closed; low usability and reproducibility)
Usability (can I run it?): 21
Reproducibility (can I retrain it?): 13
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

enzyme_reaction_matching, function_annotation, retrieval, transformer, contrastive_learning, representation_learning, enzymology, reactions
