bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Protein

Aiki-XP

Aikium

Leakage-controlled multimodal model predicting within-species relative protein expression across 385 bacterial species, with transfer to unseen phyla.

Released: April 2026
Parameters: 25 Million

Aiki-XP is a multimodal model from Aikium Inc. that predicts the relative expression of a protein within its host bacterium's proteome. Rather than forecasting absolute yield in µg/mL, it ranks candidate genes by per-species z-scored abundance — answering the practical question of which sequences a given organism is likely to express well. This addresses a long-standing bottleneck in heterologous protein production and synthetic biology, where expression levels are notoriously difficult to predict from sequence alone.
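To make the per-species z-scoring concrete, here is a minimal sketch of how a raw abundance table could be standardized within each host. The function name, the record layout, and the toy values are illustrative assumptions; the source does not describe Aikium's actual preprocessing (for example, whether abundances are log-transformed first).

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_species(records):
    """Standardize abundance within each species.

    records: iterable of (species, gene_id, abundance) tuples, with gene_id
    assumed unique across the whole table (a simplifying assumption).
    Returns {gene_id: z_score}.
    """
    by_species = defaultdict(list)
    for species, gene_id, abundance in records:
        by_species[species].append((gene_id, abundance))

    z = {}
    for genes in by_species.values():
        values = [a for _, a in genes]
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation
        for gene_id, a in genes:
            # A species with constant abundance carries no ranking signal.
            z[gene_id] = 0.0 if sigma == 0 else (a - mu) / sigma
    return z


# Toy data (made up): two hosts with very different abundance scales.
records = [
    ("E. coli", "ec_1", 10.0), ("E. coli", "ec_2", 12.0), ("E. coli", "ec_3", 14.0),
    ("B. subtilis", "bs_1", 100.0), ("B. subtilis", "bs_2", 300.0), ("B. subtilis", "bs_3", 200.0),
]
scores = zscore_within_species(records)
```

After standardization the top gene in each host gets the same score even though the raw values differ by more than an order of magnitude, which is why the target is comparable across organisms but says nothing about absolute yield.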

The central methodological contribution is rigorous leakage control. Because homologous proteins recur across bacterial genomes, naive train/test splits leak information and inflate reported performance. Aiki-XP instead groups genes into MMseqs2 sequence clusters and partitions those clusters between training and evaluation, so the model is tested on genuinely novel sequences. Critically, the authors report that all training recipes and hyperparameters were locked before external evaluation, and the model demonstrates transfer to bacterial phyla not seen during training — evidence that it captures generalizable determinants of expression rather than memorizing phylogenetic signal.
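The cluster-level partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it takes a precomputed gene-to-cluster mapping (such as one derived from MMseqs2 output), uses a hypothetical `split_by_cluster` helper, and assigns whole clusters to train or test by hashing the cluster ID. It ignores the operon component of the split that the source mentions.

```python
import hashlib


def split_by_cluster(gene_to_cluster, test_fraction=0.2):
    """Assign every gene in a cluster to the same side of the split.

    gene_to_cluster: dict mapping gene ID -> sequence-cluster ID.
    The hash-based assignment is deterministic, so a cluster always lands
    in the same partition regardless of input order.
    """
    train, test = [], []
    for gene, cluster in sorted(gene_to_cluster.items()):
        digest = hashlib.sha256(cluster.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        (test if bucket < test_fraction else train).append(gene)
    return train, test


# Toy mapping (made up): homologs from different genomes share a cluster ID.
gene_to_cluster = {
    "genomeA_g1": "c1", "genomeB_g7": "c1",
    "genomeA_g2": "c2", "genomeC_g3": "c2",
    "genomeB_g9": "c3", "genomeC_g4": "c4",
}
train_genes, test_genes = split_by_cluster(gene_to_cluster, test_fraction=0.5)

# No cluster appears on both sides, so no homolog pair straddles the split.
train_clusters = {gene_to_cluster[g] for g in train_genes}
test_clusters = {gene_to_cluster[g] for g in test_genes}
```

A random split over genes would, by contrast, routinely place `genomeA_g1` in training and its homolog `genomeB_g7` in test, which is the leakage the clustered split is designed to prevent.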

Posted to bioRxiv in April 2026, Aiki-XP is part of Aikium's broader family of "Aiki" foundation models for protein phenotypes, and operates at pan-bacterial scale across hundreds of species.

#Key Features

  • Within-species expression ranking: Predicts per-species z-scored relative abundance, framing expression as a ranking problem that is robust across organisms with very different baseline expression distributions.
  • Five-modality fusion: Integrates protein identity (ESM-C), coding-sequence composition (HyenaDNA), genome context (Bacformer), operon architecture (Evo-2 7B), and engineered biophysical features (codon usage, disorder, RNA folding) through a compact 25M-parameter fusion head.
  • Leakage-controlled evaluation: A gene-operon split based on MMseqs2 clustering partitions homologs between train and test sets, preventing the phylogenetic leakage that inflates many expression-prediction benchmarks.
  • Cross-phylum transfer: Generalizes to bacterial phyla absent from training, with recipes frozen prior to external evaluation to guard against tuning on the test distribution.
  • Tiered deployment: Five inference tiers (A through D/XP5) add modalities progressively, trading input requirements for accuracy and enabling protein-sequence-only predictions when genomic context is unavailable.

#Technical Details

Aiki-XP is built around a 25M-parameter multimodal fusion head that combines embeddings from several pretrained foundation models with handcrafted biophysical descriptors. Training spans 492,026 genes drawn from 385 bacterial species (with 1,831 host genomes available). Compute is reported as roughly 360 A100-hours for five-fold fusion-head training plus about 1,000 A100-hours to precompute upstream embeddings. On non-conserved (held-out) genes, the full Tier D/XP5 model reaches a Spearman correlation of ρ_nc ≈ 0.59, compared with 0.518 for the protein-only Tier A and 0.509 for an ESM-C 600M baseline. The median absolute error is roughly 0.47 z-scores, and 95% of predictions fall within 1.5 z-scores of the measured value (|Δ| < 1.5). Code is released under Apache 2.0, while model weights and training data are archived on Zenodo (DOI 10.5281/zenodo.19639621, CC-BY 4.0, ~28 GB), with a Python client, Docker images, and a hosted demo for inference.
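For readers who want to compute the same kinds of metrics on their own predictions, the sketch below implements Spearman correlation (with average ranks for ties), median absolute error, and the fraction of predictions within a tolerance, using only the standard library. The function names and toy values are illustrative assumptions and do not reproduce the paper's evaluation code; in practice `scipy.stats.spearmanr` is the usual choice for the correlation.

```python
from statistics import median


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tie group order[i..j]
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def error_summary(pred, true, tol=1.5):
    """Return (median absolute error, fraction of predictions with |error| < tol)."""
    abs_err = [abs(p - t) for p, t in zip(pred, true)]
    return median(abs_err), sum(e < tol for e in abs_err) / len(abs_err)


# Toy z-scores (made up), standing in for measured and predicted values.
measured = [-1.2, -0.3, 0.1, 0.8, 1.9]
predicted = [-0.9, 0.2, -0.1, 0.5, 0.2]
rho = spearman(predicted, measured)
med_abs_err, frac_within = error_summary(predicted, measured)
```

Because Spearman correlation depends only on rank order, it matches the model's stated goal of ranking candidates rather than predicting calibrated absolute values.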

#Applications

Aiki-XP is aimed at researchers and engineers optimizing recombinant protein production, where selecting expression-friendly constructs or host organisms can dramatically reduce trial-and-error at the bench. By ranking candidates before synthesis, it can accelerate protein engineering campaigns, inform choice of expression host, and help prioritize sequence variants in synthetic biology and industrial enzyme workflows. The tiered design lets users apply the model with only a protein sequence or with full genomic and operon context, fitting both early-stage triage and detailed construct design.

#Impact

By foregrounding leakage control and freezing its training recipes before external evaluation, Aiki-XP offers a more honest benchmark for bacterial expression prediction than splits that allow homolog leakage, and its reported cross-phylum transfer suggests the learned signal reflects real determinants of expression. The release of permissively licensed code, weights, training data, a client library, and a live demo lowers the barrier for adoption and reproduction. As a recent preprint, its broader influence on protein engineering practice and downstream tooling remains to be established. Its predictions are explicitly relative rankings rather than calibrated absolute yields, a limitation the authors emphasize.

Tags

protein_expression, protein_engineering, transformer, multimodal_fusion, foundation_model, multimodal, proteomics, genomics