bio.rodeo
DNA & Gene

OmniNA

Beijing Institute of Genomics / Chinese Academy of Sciences

Self-supervised generative foundation model trained jointly on 91.7M nucleotide sequences (1.076 trillion bases) and their structured annotations, reporting state-of-the-art results on 23 nucleotide-language benchmarks.

Released: April 2026

OmniNA is a generative DNA foundation model published in Nucleic Acids Research in April 2026 and developed at the Beijing Institute of Genomics, Chinese Academy of Sciences. It is trained on 91.7 million nucleotide sequences totaling 1.076 trillion bases and learns jointly from raw sequences and their structured annotations (gene names, species, functional descriptions, ontology terms). This joint objective combines raw sequence modeling and semantic annotation learning in a single foundation model.

OmniNA reports state-of-the-art results across 23 benchmarks covering sequence detection, species classification, and mutation-effect prediction. It outperforms prior DNA language models, including DNABERT-2, Nucleotide Transformer, and Caduceus, on most evaluated tasks.

Key Features

  • Joint sequence-annotation training: Co-trained on nucleotide sequences and their structured metadata, allowing the model to bridge raw sequence and semantic annotation tasks within one set of weights.
  • Massive training corpus: 91.7M sequences and 1.076 trillion bases drawn from broad public genomic resources, with annotation pulled from RefSeq, Ensembl, and ontology databases.
  • State-of-the-art on 23 benchmarks: Tasks include sequence detection (taxonomic and functional), species classification, mutation-effect prediction, and regulatory-element identification.
  • Generative annotation: Can generate plausible functional annotations conditioned on raw sequence inputs.
  • Open access via NAR: Published in Nucleic Acids Research with code and model weights available.
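
As a quick sanity check on the corpus figures above, the two numbers imply an average sequence length of roughly 11.7 kb. The snippet below only performs that division. The variable names are illustrative, and the average says nothing about the actual length distribution, which this page does not report.

```python
# Back-of-the-envelope check using only the figures quoted above.
n_sequences = 91.7e6   # 91.7 million sequences
n_bases = 1.076e12     # 1.076 trillion bases

mean_length = n_bases / n_sequences
print(f"{mean_length:,.0f} bases per sequence on average")
```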

Technical Details

OmniNA uses a transformer-based architecture trained with a self-supervised next-token prediction objective over an interleaved sequence+annotation stream. Annotation tokens are introduced into the training corpus alongside nucleotide tokens, allowing the model to learn cross-modal correspondences between sequence and metadata.
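
The interleaved stream described above can be pictured with a minimal sketch. This is not OmniNA's tokenizer: the one-token-per-base vocabulary, the `<seq>`, `<ann>`, and `<eos>` delimiter tokens, the whitespace splitting of annotation text, and the function names `build_stream` and `next_token_pairs` are all illustrative assumptions. The only point is that nucleotide and annotation tokens share one stream, and that next-token prediction pairs each prefix of that stream with the token that follows it.

```python
# Illustrative only: token names and the tokenization scheme are assumptions,
# not taken from the OmniNA paper or code.

def build_stream(sequence: str, annotation: dict[str, str]) -> list[str]:
    """Interleave nucleotide tokens and annotation tokens in one stream."""
    tokens = ["<seq>"]
    tokens.extend(sequence.upper())           # one token per base (assumed)
    tokens.append("<ann>")
    for field, value in annotation.items():   # e.g. species, gene name
        tokens.append(f"<{field}>")
        tokens.extend(value.split())          # naive whitespace split (assumed)
    tokens.append("<eos>")
    return tokens


def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Return (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


stream = build_stream("ACGT", {"species": "Homo sapiens", "gene": "TP53"})
pairs = next_token_pairs(stream)
print(stream)
print(pairs[4])  # context ends at the last base; the target is "<ann>"
```

Because the annotation tokens follow the sequence tokens in this toy layout, predicting them requires conditioning on the sequence, which is one simple way a single objective can tie the two modalities together.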

Training was performed on standard transformer infrastructure; the published paper provides hyperparameter, ablation, and benchmark details. Evaluation spans 23 downstream tasks, including TF binding-site detection, promoter classification, splice-site identification, taxonomic classification, and missense variant pathogenicity prediction.
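
This page does not state which metrics the 23 benchmarks use. Binary tasks such as promoter or splice-site classification are often scored with accuracy and the Matthews correlation coefficient (MCC), so the sketch below assumes those two. The labels and predictions are made-up toy values, not OmniNA results, and the helper names are hypothetical.

```python
import math

# Toy evaluation helpers for a binary task (1 = positive, 0 = negative).
# The metric choice is an assumption; consult the paper for the real protocol.

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) counts for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # made-up labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # made-up predictions
print(accuracy(y_true, y_pred), mcc(y_true, y_pred))
```

MCC is useful alongside accuracy because it stays informative when the positive and negative classes are imbalanced, which is common for regulatory-element and splice-site datasets.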

Applications

OmniNA is suited for genomics researchers building automated annotation pipelines, variant interpretation workflows, and species classification tools. The joint sequence-annotation training is particularly useful when annotation data is sparse and the model must propagate semantic information from related sequences. The generative annotation capability supports rapid functional hypothesis generation for previously uncharacterized sequences.

Impact

OmniNA advances the state of the art in DNA foundation modeling by integrating semantic annotation learning into the pretraining objective rather than treating annotation as a downstream prediction target. The 23-benchmark sweep demonstrates broad applicability and competitive performance against narrowly specialized prior models. Published in Nucleic Acids Research with open weights, OmniNA is well-positioned for adoption in academic genomics workflows.

Citation

A foundation model for nucleotide sequences.

Shen, X., et al. (2026). A foundation model for nucleotide sequences. Nucleic Acids Research.

DOI: 10.1093/nar/gkag083

Recent citations

Papers that recently cited this model.

  • Benchmarking long-context genome language models on biosynthetic gene clusters
    Keisuke Hirota, Koichi Higashi, Ken Kurokawa, et al.
    bioRxiv · May 2026 · 0 citations

  • GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations
    Yi Shen, Guangshuo Cao, Jianghong Wu, et al.
    bioRxiv · Apr 2026 · 0 citations

Citations

Total Citations: 2
Influential: 0
References: 50

GitHub

Stars: 0
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 5mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 3
Likes: 0
Last Modified: 5mo ago

Fields of citing research

  • Biology: 100%
  • Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 42 (Partial)
Usability (can I run it?): 45
Reproducibility (can I retrain it?): 44
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

dna, foundation_model, genome, genomic_annotation, mutation_effect_prediction, nucleotide, self_supervised, sequence_detection, species_classification, transformer

Resources

GitHub Repository · Research Paper · HuggingFace Model