OryzaG3

700M-parameter DNA language model pretrained on the rice pangenome, serving as a reusable base model for crop genomics and molecular breeding.

Released: May 2026

Parameters: 700 Million

OryzaG3 is a genomic foundation model for rice (Oryza sativa and related species), developed by researchers at Hainan University and released as a bioRxiv preprint in May 2026. While general-purpose genomic language models such as DNABERT-2 and the Nucleotide Transformer are trained across many species, OryzaG3 takes the opposite approach: it concentrates modelling capacity on a single, agriculturally critical crop by pretraining on the rice pangenome. This crop-focused strategy is intended to capture the regulatory and structural sequence patterns most relevant to rice biology and breeding.

The model addresses a practical gap in plant genomics. Rice is a staple food for more than half of the world's population, and large-scale resequencing efforts have produced extensive genomic variation data, but most genomic language models are not specialised for crop sequences and can be computationally expensive to apply at scale. OryzaG3 is positioned as a reusable base model that downstream researchers can fine-tune for specific tasks in crop genomics and molecular breeding rather than training task-specific models from scratch.

By restricting its training corpus to high-quality rice genomes while retaining a 700M-parameter transformer backbone, OryzaG3 aims to match the performance of broader multi-species models on plant benchmarks while offering substantially faster inference, making it more practical for routine use in breeding pipelines.

Key Features

Rice pangenome pretraining: Trained on approximately 59.20 Gb of sequence assembled from 149 high-quality rice genomes, giving the model broad coverage of the structural and allelic diversity present across the rice pangenome rather than a single reference assembly.
Causal language modelling objective: Uses autoregressive (causal) self-supervised pretraining, learning to predict nucleotide context across the genome without task-specific labels.
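3-mer tokenization: Sequences are tokenized into 3-mers (overlapping or non-overlapping; the windowing is not specified here). Over the four bases A, C, G and T this gives 64 possible tokens, which keeps the vocabulary small, and non-overlapping windows would additionally shorten the input threefold relative to single-nucleotide tokens.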
Efficient inference: Reported to achieve roughly 4x faster inference than comparable multi-species genomic models while remaining competitive on accuracy, lowering the cost of applying the model at genome scale.
Reusable base model: Framed explicitly as a foundation for crop genomics, intended to be fine-tuned for downstream prediction tasks such as genomic variant effects and polyadenylation (polyA) site prediction.
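
To make the 3-mer tokenization concrete, here is a minimal Python sketch of both windowing options. It is an illustration only: the vocabulary layout, the UNK_ID fallback, and the function names kmer_tokenize and encode are assumptions made for this example, not the tokenizer described in the preprint.

```python
from itertools import product

# Hypothetical vocabulary: all 64 3-mers over A, C, G, T in lexicographic order.
# The real OryzaG3 vocabulary and special tokens are not given in this summary.
ALPHABET = "ACGT"
VOCAB = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=3))}
UNK_ID = len(VOCAB)  # id 64, assumed fallback for 3-mers containing N etc.


def kmer_tokenize(seq: str, k: int = 3, stride: int = 3) -> list[str]:
    """Split seq into k-mers.

    stride == k gives non-overlapping windows; stride == 1 gives fully
    overlapping windows. A trailing remainder shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(tokens: list[str]) -> list[int]:
    """Map k-mer strings to integer ids, using UNK_ID for unknown k-mers."""
    return [VOCAB.get(tok, UNK_ID) for tok in tokens]


seq = "ATGGCCTAA"
non_overlapping = kmer_tokenize(seq, stride=3)  # ['ATG', 'GCC', 'TAA']
overlapping = kmer_tokenize(seq, stride=1)      # 7 tokens for 9 bases
ids = encode(non_overlapping)                   # [14, 37, 48]
```

Non-overlapping windows cut the token count to about a third of the sequence length, whereas overlapping windows keep roughly one token per base; which of the two the model uses should be checked against the preprint.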

Technical Details

OryzaG3 is a transformer-based DNA language model with approximately 700 million parameters, trained with a causal (autoregressive) language-modelling objective and 3-mer tokenization. Its pretraining corpus comprises roughly 59.20 Gb of sequence drawn from 149 high-quality rice genomes (an average of about 0.4 Gb per genome), capturing pangenome-scale variation rather than a single reference. The authors evaluate the model on the Plants Genomic Benchmark, a suite of plant-focused genomic prediction tasks, and report performance competitive with multi-species genomic foundation models while achieving on the order of 4x faster inference. Reported downstream applications include genomic variant prediction and polyA site prediction. Exact benchmark scores and full architectural hyperparameters are given in the preprint and should be confirmed against the primary source.
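
The causal objective mentioned above can be sketched as the mean negative log-likelihood of each token given the tokens before it. The snippet below is a generic illustration, not the authors' training code: next_token_probs is a hypothetical callable standing in for the network, and uniform_model is a placeholder that ignores its input.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Mean negative log-likelihood of each token given its prefix.

    next_token_probs(prefix) is a hypothetical callable returning a list of
    probabilities over the vocabulary for the token that follows `prefix`.
    The first token has no prefix and is not scored.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = next_token_probs(token_ids[:t])
        total -= math.log(probs[token_ids[t]])
    return total / (len(token_ids) - 1)


def uniform_model(prefix, vocab_size=64):
    # Stand-in for a trained network: assigns equal probability to every token.
    return [1.0 / vocab_size] * vocab_size


# Three arbitrary token ids from an assumed 64-token 3-mer vocabulary.
loss = causal_lm_loss([14, 37, 48], uniform_model)
# An uninformed model over 64 tokens scores log(64) nats per token.
```

Pretraining lowers this quantity below the uniform baseline by learning which 3-mers tend to follow which contexts; no task labels are involved.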

Applications

OryzaG3 targets crop genomics and molecular breeding. As a pretrained base model, it can be fine-tuned to predict the functional consequences of genomic variants, identify regulatory elements such as polyadenylation sites, and provide sequence representations for other rice genome annotation and prediction tasks. The intended beneficiaries are plant geneticists, breeders, and computational biologists who need a rice-specialised model that runs efficiently over large resequencing datasets, supporting marker discovery and the prioritisation of candidate variants in breeding programmes.
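
One common way to use a causal sequence model for variant prioritisation, shown here purely as an illustration, is to compare the log-likelihood of the alternate sequence with that of the reference. This is a zero-shot scoring technique swapped in for the example; the source describes fine-tuning and does not state that the authors score variants this way. The toy_model function and the 4-token vocabulary are hypothetical stand-ins.

```python
import math


def sequence_log_likelihood(token_ids, next_token_probs):
    """Sum of log P(token | prefix) over all tokens after the first."""
    return sum(
        math.log(next_token_probs(token_ids[:t])[token_ids[t]])
        for t in range(1, len(token_ids))
    )


def variant_score(ref_ids, alt_ids, next_token_probs):
    """Log-likelihood ratio of alt versus ref.

    Negative values mean the model finds the alternate sequence less
    expected than the reference.
    """
    return (sequence_log_likelihood(alt_ids, next_token_probs)
            - sequence_log_likelihood(ref_ids, next_token_probs))


def toy_model(prefix, vocab_size=4):
    # Hypothetical stand-in for a trained model: favours repeating the
    # previous token (0.7) and spreads the rest evenly (0.1 each).
    probs = [0.1] * vocab_size
    probs[prefix[-1]] = 0.7
    return probs


ref = [0, 0, 0, 0]  # reference tokens
alt = [0, 0, 1, 0]  # same window with one substituted token
score = variant_score(ref, alt, toy_model)  # negative: alt breaks the repeat
```

In practice such scores would be computed over tokenized windows around each variant and ranked, with fine-tuned predictors used where labelled data exist.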

Impact

OryzaG3 reflects a broader trend toward crop-specific and lineage-specific genomic foundation models, contrasting with the multi-species generalist approach of earlier DNA language models. By demonstrating that a single-crop pangenome pretraining corpus can yield a model competitive with broader baselines while running several times faster, the work argues that specialisation can deliver both accuracy and efficiency for agriculturally important genomes. As a recent preprint, its real-world adoption and downstream influence remain to be established, and the results have not yet undergone peer review. A notable practical limitation is that, as of the preprint, no public code or model weights were linked, which may constrain immediate reuse by the community.

Citation

Yang, L., et al. (2026) OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome. bioRxiv.

DOI: 10.64898/2026.05.22.727045

Citations

Total citations: 0

Influential citations: 0

References: 14

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 19 (Closed)

Usability (can I run it?): 14

Reproducibility (can I retrain it?): 14

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
