A 16-billion-parameter mixture-of-experts protein language model pretrained on 1.2 trillion amino acids, and the first application of an MoE architecture to protein foundation modeling.
AIDO.Protein is a large-scale protein language model developed by GenBio AI as the protein-domain component of the AIDO (AI-Driven Digital Organism) multiscale platform. Released in November 2024, it introduces the first application of a sparse mixture-of-experts (MoE) architecture to protein language modeling, scaling to 16 billion total parameters while activating only a fraction of those parameters per forward pass. This architectural choice allows AIDO.Protein to achieve parameter counts that would be computationally prohibitive with a dense transformer, enabling richer representations of amino acid sequence space than prior protein language models without proportional increases in training or inference compute.
The protein language model landscape has been defined by dense encoder models in the ESM family (ESM-2 reaching 15B parameters) and autoregressive models such as ProGen2 (up to 6.4B parameters). AIDO.Protein's adoption of the MoE paradigm brings to protein science a scaling strategy that has proven highly effective in natural language processing, where models such as Mixtral and the Switch Transformer achieve strong performance through conditional computation, but that had not previously been systematically explored for biological sequence modeling. MoE models deploy multiple specialized expert sub-networks within each transformer layer and route each input token to a subset of those experts, creating a form of conditional specialization that may be particularly well suited to protein sequences, where different sequence contexts (e.g., active sites versus structural scaffolds, intrinsically disordered regions versus folded domains) benefit from different representational emphases.
AIDO.Protein was pretrained on 1.2 trillion amino acids from UniRef90 and ColabFoldDB, achieving state-of-the-art performance across most tasks in the xTrimoPGLM protein language model benchmark. Notably, on deep mutational scanning fitness prediction — a gold-standard evaluation of protein language model quality — AIDO.Protein achieves approximately 99% of the performance of the best multiple sequence alignment (MSA)-based model while operating solely from single-sequence input, a substantial advance over prior single-sequence approaches. The model has also been adapted for structure-conditioned protein sequence generation tasks, establishing new state-of-the-art results in inverse folding.
AIDO.Protein implements a sparse MoE transformer architecture in which each transformer block contains a standard multi-head self-attention layer and an MoE feed-forward layer in place of the conventional dense feed-forward network. Each MoE layer deploys 8 expert networks, and for each input token a gating network selects 2 experts to process that token, an approach commonly referred to as top-2 routing. The total parameter count of 16 billion includes all expert parameters, but only the 2 active experts per token contribute to any given forward pass, so the number of parameters exercised per token is a small fraction of the total. As a result, training and inference require far less compute per token than a dense model with the same total parameter count would.
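To make the routing concrete, here is a minimal PyTorch sketch of a top-2 MoE feed-forward layer under the 8-expert configuration described above. The hidden dimensions, gating details, and omission of load-balancing losses are illustrative assumptions, not AIDO.Protein's actual implementation.

```python
# Minimal sketch of top-2 mixture-of-experts routing (illustrative only;
# dimensions and gating details are assumptions, not AIDO.Protein's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # routing network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                 # flatten to (n_tokens, d_model)
        gate_logits = self.gate(tokens)                     # (n_tokens, n_experts)
        weights, chosen = gate_logits.topk(self.top_k, -1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the 2
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel():                           # tokens routed to expert e
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape(x.shape)

# Only 2 of the 8 expert FFNs run per token, so per-token compute stays close
# to a single dense FFN while total parameters scale with the expert count.
layer = Top2MoEFeedForward(d_model=64, d_ff=256)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```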
The pretraining corpus of 1.2 trillion amino acids is drawn from two sources: UniRef90, which provides non-redundant representative sequences from UniProtKB clustered at 90% sequence identity, and ColabFoldDB, a large sequence collection assembled for the ColabFold structure-prediction pipeline that merges UniRef with environmental and metagenomic sources. Including ColabFoldDB extends the training distribution well beyond the curated proteome, substantially increasing the coverage of microbial and metagenomic protein families. The training objective is a standard masked language modeling (MLM) approach applied to amino acid sequences, consistent with encoder-only protein language models in the ESM family.
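As a rough illustration of the objective (not GenBio's training code), the sketch below builds a BERT-style MLM batch over a toy amino-acid vocabulary; the 15% masking rate and the always-replace-with-[MASK] corruption are simplifying assumptions.

```python
# Hedged sketch of masked language modeling on amino acids; the vocabulary,
# 15% masking rate, and always-[MASK] corruption are illustrative assumptions.
import torch
import torch.nn.functional as F

AA = "ACDEFGHIKLMNPQRSTVWY"           # 20 standard amino acids
stoi = {a: i for i, a in enumerate(AA)}
MASK_ID = len(AA)                     # extra token id reserved for [MASK]

def mlm_batch(seq: str, mask_rate: float = 0.15):
    ids = torch.tensor([stoi[a] for a in seq])
    masked = torch.rand(len(ids)) < mask_rate
    if not masked.any():              # ensure at least one masked position
        masked[torch.randint(len(ids), (1,))] = True
    labels = ids.clone()
    labels[~masked] = -100            # ignore unmasked positions in the loss
    inputs = ids.clone()
    inputs[masked] = MASK_ID          # hide the selected residues
    return inputs, labels

inputs, labels = mlm_batch("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# An encoder would map `inputs` to per-position logits over the vocabulary;
# the loss is cross-entropy at the masked positions only:
logits = torch.randn(len(inputs), len(AA) + 1)  # stand-in for model output
print(F.cross_entropy(logits, labels, ignore_index=-100).item())
```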
Performance on the xTrimoPGLM benchmark — a comprehensive suite designed specifically for evaluating protein language model capabilities — places AIDO.Protein at or near the top across the majority of tasks. Particularly notable is the deep mutational scanning fitness prediction performance: evaluated across more than 280 assays from the ProteinGym benchmark, AIDO.Protein achieves approximately 99% of the Spearman correlation obtained by the best MSA-based model (which uses computationally expensive multiple sequence alignment as input), while requiring only a single sequence. This result substantially narrows the gap between single-sequence and alignment-based methods — a gap that has previously been one of the strongest arguments for maintaining MSA construction pipelines in protein engineering workflows. The model has also been evaluated on structure-conditioned sequence generation: a variant trained on structure-sequence pairs achieves state-of-the-art performance on the CATH inverse folding benchmark, demonstrating that the MoE backbone generalizes effectively to multimodal protein tasks.
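The standard recipe for this kind of zero-shot scoring with an MLM-style model is the masked-marginal log-odds of the mutant versus wild-type residue at the mutated position, popularized in the ESM line of work. Whether AIDO.Protein's ProteinGym evaluation follows exactly this protocol is an assumption here, but the sketch shows the general pattern:

```python
# Sketch of masked-marginal zero-shot variant scoring (ESM-style protocol);
# whether AIDO.Protein's evaluation uses exactly this recipe is an assumption.
# `model` is any encoder returning per-position logits over the vocabulary.
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
stoi = {a: i for i, a in enumerate(AA)}
MASK_ID = len(AA)

def score_variant(model, seq: str, pos: int, wt: str, mut: str) -> float:
    """log p(mut) - log p(wt) at `pos`, with that residue masked out."""
    assert seq[pos] == wt, "wild-type residue mismatch"
    ids = torch.tensor([stoi[a] for a in seq])
    ids[pos] = MASK_ID                             # hide the site being scored
    with torch.no_grad():
        logits = model(ids.unsqueeze(0))[0, pos]   # (vocab,) logits at the site
    logp = torch.log_softmax(logits, dim=-1)
    return (logp[stoi[mut]] - logp[stoi[wt]]).item()

# Stand-in model for demonstration; a DMS assay is ranked by scoring every
# variant, then comparing against measured fitness with Spearman correlation
# (e.g., scipy.stats.spearmanr).
dummy = lambda ids: torch.randn(1, ids.shape[1], len(AA) + 1)
print(score_variant(dummy, "MKTAYIAKQR", pos=3, wt="A", mut="V"))
```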
The HuggingFace release includes a retrieval-augmented generation (RAG) variant, AIDO.Protein-RAG-3B, which augments sequence representations with retrieved homologous sequences, a hybrid approach that recovers some of the evolutionary context normally supplied by explicit MSA construction without running a full alignment pipeline at inference time.
AIDO.Protein is applicable across the full span of protein engineering and computational biology workflows where accurate sequence representations are valuable. Protein engineers designing novel enzymes, antibodies, or therapeutic proteins can use the model's zero-shot fitness scoring to rank mutant libraries by predicted activity before committing to experimental synthesis and testing, a workflow that reduces the cost of directed evolution campaigns by concentrating experimental effort on the highest-confidence candidates. Structure-conditioned design with the inverse folding variant enables the generation of novel sequences that fold into desired backbone geometries, directly applicable to de novo protein design projects targeting specific binding interfaces or catalytic active site geometries. Drug discovery teams can leverage AIDO.Protein's variant effect predictions to assess the impact of patient mutations on drug target function or to identify escape mutations that might reduce therapeutic efficacy.

Researchers building computational pipelines for proteomics interpretation, functional annotation of uncharacterized proteins, or metagenomic analysis can use the model's embeddings as general-purpose sequence representations that encode structural and functional properties without requiring explicit structure prediction. The AIDO.ModelGenerator integration makes it straightforward to fine-tune AIDO.Protein on task-specific labeled datasets spanning solubility, thermostability, subcellular localization, protein-protein interaction prediction, and other supervised tasks.
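For the embedding use case, a plausible loading pattern via HuggingFace transformers is sketched below; the repository id and the availability of a remote-code masked-LM interface are assumptions, so the official model card and AIDO.ModelGenerator documentation should be treated as authoritative.

```python
# Hedged sketch of extracting per-residue embeddings from the HuggingFace
# release; the repo id and remote-code interface are assumptions, and the
# 16B model will need multi-GPU or offloaded memory in practice.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "genbio-ai/AIDO.Protein-16B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True).eval()

batch = tokenizer(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)
embeddings = out.hidden_states[-1]   # (batch, seq_len, d_model) residue embeddings
```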
AIDO.Protein makes two distinct contributions to the protein language model field. The first is empirical: demonstrating that near-MSA-quality fitness prediction is achievable from single sequences using a sufficiently large and well-architected model. This result has practical significance because MSA computation is computationally expensive and can fail for orphan proteins or novel protein families with few known homologs. If single-sequence models can approach MSA-based performance through scale, a major bottleneck in computational protein engineering workflows is substantially reduced. The second contribution is methodological: establishing the sparse MoE architecture as a viable and effective approach for protein language modeling. Given the success of MoE scaling in NLP, the demonstration that this architecture transfers to protein sequences is likely to stimulate further exploration of MoE-based protein models at even larger scales.

As part of the AIDO platform, AIDO.Protein is positioned for integration with the other AIDO modules, enabling future multimodal workflows that combine DNA, RNA, protein, and cellular context within a coherent computational framework. A current limitation is the absence of native structural conditioning at pretraining time: the model is primarily a sequence-level model with structure-aware variants trained separately, and it does not natively reason over three-dimensional coordinates in the way that structure-aware models like ESM-3 or ProteinMPNN do.