Huazhong University of Science and Technology
A unified bio-language Mixture-of-Experts foundation model spanning DNA, protein sequence and structure, and biological text, applied across eight task families from a single checkpoint.
OmniGene-4 is a unified bio-language foundation model that brings DNA, protein sequence, protein structure, and biological natural-language text into a single generative model. Rather than training separate specialist models for each modality, OmniGene-4 continue-pretrains a general-purpose large language model so that one fixed checkpoint can be prompted to perform a wide range of biological tasks without per-task retraining. It was developed by researchers at Huazhong University of Science and Technology and released as a bioRxiv preprint in May 2026.
The core idea is to extend a Gemma-class Mixture-of-Experts (MoE) backbone, which routes each token through a small subset of the 128 experts in each layer, with a biology-aware vocabulary. Roughly 28,000 new tokens are added to cover DNA byte-pair encoding (BPE) units, protein BPE units, the Foldseek 3Di structural alphabet, and DSSP secondary-structure symbols, so that nucleotide sequences, amino-acid sequences, and discretized protein structures can be expressed in the same token stream as ordinary text. The model is then continue-pretrained on a roughly 32.5 GB cross-modality corpus and instruction-tuned on about 200,000 examples spanning eight biological task families.
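The vocabulary extension described above can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the tag format (`<dna:...>`, `<aa:...>`, `<3di:...>`, `<ss:...>`), the tiny base vocabulary, and the handful of sample tokens are hypothetical and are not taken from the preprint, which does not publish its token inventory. The sketch only shows the general mechanism of appending modality-specific tokens to an existing vocabulary so that text and biological symbols share one ID space.

```python
# Toy illustration only: token names, tag format, and sizes are assumptions,
# not the actual OmniGene-4 vocabulary.
BASE_VOCAB = ["<pad>", "<unk>", "the", "protein", "binds", "DNA"]


def extend_vocab(base, new_tokens):
    """Append unseen tokens to a vocabulary.

    Returns the token -> id mapping and the number of tokens actually added.
    """
    token_to_id = {tok: i for i, tok in enumerate(base)}
    added = 0
    for tok in new_tokens:
        if tok not in token_to_id:
            token_to_id[tok] = len(token_to_id)
            added += 1
    return token_to_id, added


# Hypothetical modality-tagged tokens. Real BPE merges would be learned from data.
dna_tokens = [f"<dna:{s}>" for s in ("A", "C", "G", "T", "ACG", "TATA")]
protein_tokens = [f"<aa:{s}>" for s in ("M", "K", "V", "MK")]
# Foldseek 3Di has 20 structural states; lowercase letters are an assumed spelling.
foldseek_3di = [f"<3di:{c}>" for c in "acdefghiklmnpqrstvwy"]
# Eight classic DSSP classes; "C" stands in for the blank coil symbol here.
dssp = [f"<ss:{c}>" for c in "HBEGITSC"]

vocab, n_added = extend_vocab(
    BASE_VOCAB, dna_tokens + protein_tokens + foldseek_3di + dssp
)


def encode(tokens, token_to_id):
    """Map pre-split tokens to ids, falling back to <unk> for unknown tokens."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(t, unk) for t in tokens]


# One stream mixing text, protein sequence, structure, and DNA tokens.
mixed = ["the", "protein", "<aa:MK>", "<3di:d>", "<ss:H>", "binds", "<dna:TATA>"]
ids = encode(mixed, vocab)
print(n_added, len(vocab), ids)
```

In a real model each added token also needs a new embedding row, which starts out untrained; this is one reason a continue-pretraining stage on cross-modality data follows the vocabulary expansion.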
By unifying modalities under a shared language-model interface, OmniGene-4 joins the emerging class of biological foundation models, alongside protein and genomic language models, that aim for broad, promptable generality rather than narrow task specialization.
OmniGene-4 is a decoder-style Mixture-of-Experts transformer derived from a Gemma-class backbone, with 128 experts per layer and sparse routing so that only a few experts process each token. The vocabulary is expanded by roughly 28,000 biological tokens: DNA and protein byte-pair encodings represent nucleotide and amino-acid sequences, while Foldseek 3Di and DSSP tokens encode protein tertiary-structure states and secondary structure as discrete symbols. Training proceeds in two stages: continue-pretraining on a ~32.5 GB cross-modality corpus to align the new tokens with the pretrained language model, followed by instruction tuning on ~200,000 examples drawn from eight task families spanning DNA, protein, and text. The exact total parameter count and the release license are not specified in the available preprint, and no public code or weights have been confirmed at the time of writing.
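Sparse routing of this kind is commonly implemented as top-k gating, and a minimal sketch may make the idea concrete. The value of `TOP_K`, the random router logits, and the choice to apply a softmax over only the selected logits are assumptions: the source states 128 experts per layer but does not say how many are active per token or how the gate weights are normalized. This is a generic illustration of the technique, not the model's actual router.

```python
import math
import random

NUM_EXPERTS = 128  # stated in the text
TOP_K = 8          # assumption: the number of active experts is not given


def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in for the router's output on one token; a real router is a learned layer.
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(router_logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS

for expert, weight in routes:
    print(f"expert {expert:3d}  weight {weight:.3f}")
print(f"fraction of experts active per token: {active_fraction:.4f}")
```

Under these assumed settings each token touches 8 of 128 experts, so the expert computation per token stays fixed as more experts are added; only the total parameter count grows.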
OmniGene-4 targets researchers who want a single promptable model rather than a stack of task-specific tools. Because the same checkpoint handles DNA, protein sequence and structure, and biological text, it can support tasks such as variant interpretation, structure-aware protein reasoning, and biological question answering through a natural-language interface. This generality is particularly useful for exploratory workflows where investigators move across modalities and want a consistent entry point, though task-specific specialist models may still outperform it on individual benchmarks.
OmniGene-4 contributes to the broader push toward unified, multimodal biological foundation models that collapse DNA, protein, structure, and text into one language-model interface. Its use of a Mixture-of-Experts backbone with an expanded biological vocabulary illustrates a practical route to scaling cross-modality capacity while keeping per-token compute bounded. As a recent preprint without confirmed public code, weights, or a stated license, its real-world adoption and independent validation remain to be established, and claims should be read with appropriate caution pending peer review and a released artifact.
Wang, L. (2026) OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability. bioRxiv.
DOI: 10.64898/2026.05.12.724542