Beijing Institute of Genomics / Chinese Academy of Sciences
Self-supervised generative foundation model jointly trained on 91.7M nucleotide sequences and structured annotations spanning 1.076 trillion bases, achieving state-of-the-art results on 23 nucleotide-language benchmarks.
OmniNA is a generative DNA foundation model published in Nucleic Acids Research in April 2026. Developed at the Beijing Institute of Genomics under the Chinese Academy of Sciences, OmniNA is trained on 91.7 million nucleotide sequences spanning 1.076 trillion bases, jointly learning over raw sequences and their structured annotations (gene names, species, functional descriptions, ontology terms). This joint training objective bridges raw sequence modeling with semantic annotation learning in a single foundation model.
OmniNA achieves state-of-the-art results across 23 benchmarks covering sequence detection, species classification, and mutation-effect prediction, outperforming prior DNA language models including DNABERT-2, Nucleotide Transformer, and Caduceus on most evaluated tasks.
OmniNA uses a transformer-based architecture trained with a self-supervised next-token prediction objective over an interleaved sequence+annotation stream. Annotation tokens are introduced into the training corpus alongside nucleotide tokens, allowing the model to learn cross-modal correspondences between sequence and metadata.
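The interleaved training stream can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual tokenizer: the special tokens (`<seq>`, `<ann>`, `<eos>`) and the per-base / per-word token granularity are assumptions for the sketch.

```python
# Sketch of an interleaved sequence+annotation training example (hypothetical
# token scheme; OmniNA's actual vocabulary and delimiters may differ).
def build_example(sequence: str, annotation: str) -> list[str]:
    """Interleave nucleotide tokens with annotation tokens in one stream.

    A causal LM trained with next-token prediction over such streams must
    model both the nucleotide sequence and its annotation, and therefore
    the correspondence between them.
    """
    seq_tokens = list(sequence.upper())        # per-base tokens: A/C/G/T
    ann_tokens = annotation.lower().split()    # word-level annotation tokens
    return ["<seq>"] + seq_tokens + ["<ann>"] + ann_tokens + ["<eos>"]

example = build_example("ATGGCC", "promoter homo sapiens")
# one flat stream: ['<seq>', 'A', 'T', 'G', 'G', 'C', 'C',
#                   '<ann>', 'promoter', 'homo', 'sapiens', '<eos>']
```

Because annotation tokens appear after the sequence in the same stream, the standard next-token objective doubles as a supervised annotation objective with no architectural changes.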
Training used standard transformer infrastructure; hyperparameters, ablations, and full benchmark protocols are detailed in the published paper. Evaluation spans 23 downstream tasks, including TF binding-site detection, promoter classification, splice-site identification, taxonomic classification, and missense variant pathogenicity prediction.
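One common way a language model supports mutation-effect prediction is to score a variant by the log-likelihood ratio of the mutant sequence versus the reference. The sketch below illustrates that scoring scheme only; the independent-base probability table is a toy stand-in for a trained model's conditional probabilities, and the source does not specify that OmniNA uses exactly this formulation.

```python
import math

# Toy stand-in for a trained model's per-base probabilities.
BASE_PROBS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def seq_log_likelihood(seq: str) -> float:
    """Log-likelihood of a sequence under the toy independent-base model."""
    return sum(math.log(BASE_PROBS[base]) for base in seq)

def variant_effect_score(ref_seq: str, pos: int, alt_base: str) -> float:
    """Score a single-base variant as log P(mutant) - log P(reference).

    Negative scores mean the mutant is less likely than the reference,
    a common proxy for deleteriousness in LM-based variant scoring.
    """
    mut_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return seq_log_likelihood(mut_seq) - seq_log_likelihood(ref_seq)
```

With a real foundation model, `seq_log_likelihood` would sum the model's per-token conditional log-probabilities instead of a fixed table.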
OmniNA is suited for genomics researchers building automated annotation pipelines, variant interpretation workflows, and species classification tools. The joint sequence-annotation training is particularly useful when annotation data is sparse and the model must propagate semantic information from related sequences. The generative annotation capability supports rapid functional hypothesis generation for previously uncharacterized sequences.
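Generative annotation of an uncharacterized sequence amounts to conditioning on the nucleotide tokens and decoding annotation tokens. The sketch below shows greedy decoding over that interface; `next_token_probs` is a stub standing in for the trained model's conditional distribution, and the `<seq>`/`<ann>`/`<eos>` tokens are illustrative assumptions.

```python
def next_token_probs(tokens: list[str]) -> dict[str, float]:
    """Stub for a trained model's next-token distribution."""
    if tokens[-1] == "<ann>":
        return {"promoter": 0.9, "<eos>": 0.1}
    return {"<eos>": 1.0}

def annotate(sequence: str, max_new_tokens: int = 8) -> str:
    """Greedily decode annotation tokens conditioned on a nucleotide prompt."""
    tokens = ["<seq>"] + list(sequence.upper()) + ["<ann>"]
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy: pick the argmax token
        if token == "<eos>":
            break
        tokens.append(token)
    # return only the generated annotation, after the <ann> delimiter
    return " ".join(tokens[tokens.index("<ann>") + 1:])
```

In practice one would replace the stub with the model's forward pass and could use sampling or beam search instead of greedy argmax to generate multiple candidate annotations per sequence.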
OmniNA advances the state of the art in DNA foundation modeling by integrating semantic annotation learning into the pretraining objective rather than treating annotation as a downstream prediction target. The 23-benchmark sweep demonstrates broad applicability and competitive performance against narrowly specialized prior models. Published in Nucleic Acids Research with open weights, OmniNA is well-positioned for adoption in academic genomics workflows.
Shen, X., et al. (2026). A foundation model for nucleotide sequences. Nucleic Acids Research.
DOI: 10.1093/nar/gkag083