bio.rodeo

© 2026 Pulsatance. All rights reserved.
Language model, DNA & Gene, Protein

OmniGene-4

Huazhong University of Science and Technology

A unified bio-language Mixture-of-Experts foundation model spanning DNA, protein sequence and structure, and biological text, applied across eight task families from a single checkpoint.

Released: May 2026

OmniGene-4 is a unified bio-language foundation model that brings DNA, protein sequence, protein structure, and biological natural-language text into a single generative model. Rather than training a separate specialist model for each modality, its authors continue pretraining a general-purpose large language model so that one fixed checkpoint can be prompted to perform a wide range of biological tasks without per-task retraining. It was developed by researchers at Huazhong University of Science and Technology and released as a bioRxiv preprint in May 2026.

The core idea is to extend a Gemma-class Mixture-of-Experts (MoE) backbone, which has 128 experts per layer and routes each token through only a small subset of them, with a biology-aware vocabulary. Roughly 28,000 new tokens are added to cover DNA byte-pair-encoding (BPE) units, protein BPE units, the Foldseek 3Di structural alphabet, and DSSP secondary-structure symbols, so that nucleotide sequences, amino-acid sequences, and discretized protein structures can be expressed in the same token stream as ordinary text. The model is then continue-pretrained on a roughly 32.5 GB cross-modality corpus and instruction-tuned on about 200,000 examples spanning eight biological task families.
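The vocabulary extension described above can be sketched in a few lines. This is a toy illustration, not the preprint's tokenizer: the `<modality:symbol>` token naming, the four-word base vocabulary, and the one-token-per-symbol alphabets are all assumptions made for the example, standing in for the learned BPE merges and the roughly 28,000 tokens the model actually adds.

```python
# Toy base vocabulary standing in for the pretrained text tokenizer (assumption).
BASE_VOCAB = {"<unk>": 0, "the": 1, "protein": 2, "binds": 3}

# Hypothetical per-modality alphabets, one token per symbol for illustration.
# A real system would use learned BPE merges for DNA and protein sequences.
MODALITY_ALPHABETS = {
    "dna": "ACGT",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
    "3di": "acdefghiklmnpqrstvwy",  # 20 Foldseek 3Di states, lowercased here
    "dssp": "HBEGITS-",             # classic eight-state DSSP codes
}


def modality_tokens(alphabets):
    """Build namespaced tokens such as '<dna:A>' so modalities never collide."""
    return [f"<{name}:{sym}>" for name, symbols in alphabets.items() for sym in symbols]


def extend_vocab(base_vocab, new_tokens):
    """Return a copy of base_vocab with unseen tokens appended at fresh ids."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab


def encode(segments, vocab):
    """Encode (modality, content) segments into one shared id stream."""
    ids = []
    for modality, content in segments:
        if modality == "text":
            pieces = content.split()  # whitespace split as a stand-in for text BPE
        else:
            pieces = [f"<{modality}:{sym}>" for sym in content]
        ids.extend(vocab.get(piece, vocab["<unk>"]) for piece in pieces)
    return ids


VOCAB = extend_vocab(BASE_VOCAB, modality_tokens(MODALITY_ALPHABETS))
IDS = encode(
    [("text", "the protein binds"), ("dna", "ACGT"), ("dssp", "HHE")],
    VOCAB,
)
print(len(VOCAB), IDS)
```

Namespacing keeps the amino-acid letter `A`, the nucleotide `A`, and the English word "a" as distinct tokens, which is the property that lets sequence, structure, and text share one stream. The new token embeddings would still need to be learned, which is the role the text assigns to continued pretraining.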

By unifying modalities under a shared language-model interface, OmniGene-4 joins the emerging class of biological foundation models, alongside efforts in protein and genomic language modeling, that aim for broad, promptable generality rather than narrow task specialization.

#Key Features

  • Unified multimodal vocabulary: Adds ~28,000 biological tokens covering DNA BPE, protein BPE, Foldseek 3Di, and DSSP, so sequence, structure, and text share one token stream.
  • Mixture-of-Experts backbone: Built on a Gemma-class MoE architecture with 128 experts per layer, enabling large effective capacity while activating only a subset of parameters per token.
  • Single checkpoint, many tasks: A fixed checkpoint is applied across eight biological task families without task-specific retraining.
  • Cross-modality pretraining: Continue-pretrained on a ~32.5 GB corpus that jointly spans DNA, protein sequence and structure, and biological text.
  • Instruction-tuned interface: About 200,000 instruction examples let users query the model in natural language across diverse biological tasks.

#Technical Details

OmniGene-4 is a decoder-style Mixture-of-Experts transformer derived from a Gemma-class backbone, with 128 experts per layer and sparse routing so that only a few experts process each token. The vocabulary is expanded by roughly 28,000 biological tokens: DNA and protein byte-pair encodings represent nucleotide and amino-acid sequences, while Foldseek 3Di and DSSP tokens encode protein tertiary-structure states and secondary structure as discrete symbols. Training proceeds in two stages: continued pretraining on a ~32.5 GB cross-modality corpus to align the new tokens with the pretrained language model, followed by instruction tuning on ~200,000 examples drawn from eight task families spanning DNA, protein, and text. The preprint does not specify the total parameter count or the release license, and no public code or weights have been confirmed at the time of writing.
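To make the sparse routing concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not taken from the preprint: the value `TOP_K = 2`, the softmax renormalization over the selected experts, and the toy scaling "experts" are assumptions chosen for illustration; only the figure of 128 experts per layer comes from the description above.

```python
import math

NUM_EXPERTS = 128  # experts per layer, as stated in the text
TOP_K = 2          # assumed for illustration; the source does not give this value


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, router_logits, k):
    """Run only the routed experts on x and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routes


# Toy experts: expert i multiplies its input by (i + 1).
EXPERTS = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Hand-written router scores that favor experts 5 and 17 equally.
LOGITS = [0.0] * NUM_EXPERTS
LOGITS[5] = 2.0
LOGITS[17] = 2.0

OUTPUT, ROUTES = moe_layer([1.0, 2.0], EXPERTS, LOGITS, TOP_K)
print(ROUTES, OUTPUT)
```

Only the selected experts are evaluated, so per-token expert compute scales with k rather than with the total number of experts. Under the assumed k = 2, that would be 2 of 128 experts per layer, or about 1.6% of the expert pool.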

#Applications

OmniGene-4 targets researchers who want a single promptable model rather than a stack of task-specific tools. Because the same checkpoint handles DNA, protein sequence and structure, and biological text, it can support tasks such as variant interpretation, structure-aware protein reasoning, and biological question answering through a natural-language interface. This generality is particularly useful for exploratory workflows where investigators move across modalities and want a consistent entry point, though task-specific specialist models may still outperform it on individual benchmarks.

#Impact

OmniGene-4 contributes to the broader push toward unified, multimodal biological foundation models that collapse DNA, protein, structure, and text into one language-model interface. Its use of a Mixture-of-Experts backbone with an expanded biological vocabulary illustrates a practical route to scaling cross-modality capacity while keeping per-token compute bounded. As a recent preprint without confirmed public code, weights, or a stated license, its real-world adoption and independent validation remain to be established, and claims should be read with appropriate caution pending peer review and a released artifact.

Citation

OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability

Wang, L. (2026) OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability. bioRxiv.

DOI: 10.64898/2026.05.12.724542

Openness

Unclassified
Missing required components

Tags

dna, foundation_model, instruction_following, instruction_tuning, mixture_of_experts, multimodal, proteomics, structure_prediction, transformer, variant_effect_prediction

Resources

Research Paper