A 700M-parameter DNA language model pretrained on the rice pangenome, built as a reusable foundation model for crop genomics and molecular breeding.
OryzaG3 is a genomic foundation model for rice (Oryza sativa and related species), developed by researchers at Hainan University and released as a bioRxiv preprint in May 2026. While general-purpose genomic language models such as DNABERT-2 and the Nucleotide Transformer are trained across many species, OryzaG3 takes the opposite approach: it concentrates modelling capacity on a single, agriculturally critical crop by pretraining on the rice pangenome. This crop-focused strategy is intended to capture the regulatory and structural sequence patterns most relevant to rice biology and breeding.
The model addresses a practical gap in plant genomics. Rice is a staple food for more than half of the world's population, and large-scale resequencing efforts have produced extensive genomic variation data, but most genomic language models are not specialised for crop sequences and can be computationally expensive to apply at scale. OryzaG3 is positioned as a reusable base model that downstream researchers can fine-tune for specific tasks in crop genomics and molecular breeding rather than training task-specific models from scratch.
By restricting its training corpus to high-quality rice genomes while retaining a 700M-parameter transformer backbone, OryzaG3 aims to match the performance of broader multi-species models on plant benchmarks while offering substantially faster inference, making it more practical for routine use in breeding pipelines.
OryzaG3 is a transformer-based DNA language model with approximately 700 million parameters, trained with a causal (autoregressive) language-modelling objective on sequences tokenised as 3-mers. Its pretraining corpus comprises roughly 59.20 Gb of sequence drawn from 149 high-quality rice genomes, so it captures pangenome-scale variation rather than a single reference. The authors evaluate the model on the Plants Genomic Benchmark, a suite of plant-focused genomic prediction tasks, and report performance competitive with multi-species genomic foundation models at roughly 4x faster inference. Reported downstream applications include genomic variant prediction and polyA site prediction. Exact benchmark scores and full architectural hyperparameters are given in the preprint and should be confirmed against that primary source.
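To make the 3-mer tokenisation and causal objective described above concrete, the following is a minimal Python sketch and not the authors' implementation. Several details are assumptions made only for illustration: the summary does not say whether the 3-mers overlap, so the stride is left as a parameter; the 64-entry A/C/G/T vocabulary, the catch-all id for windows containing ambiguous bases, and the helper names `kmer_tokenize`, `encode` and `next_token_pairs` are all hypothetical.

```python
from itertools import product

K = 3  # k-mer length stated in the text

# Assumed vocabulary: all 4**3 = 64 3-mers over A/C/G/T, in lexicographic order.
# The real OryzaG3 vocabulary (special tokens, ambiguity codes) may differ.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}
UNK_ID = len(VOCAB)  # hypothetical catch-all id for windows containing e.g. N


def kmer_tokenize(seq, k=K, stride=K):
    """Split a DNA string into k-mers.

    stride=k yields non-overlapping tokens, stride=1 yields overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped in this sketch.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, stride=K):
    """Map a DNA string to integer token ids using the assumed vocabulary."""
    return [VOCAB.get(tok, UNK_ID) for tok in kmer_tokenize(seq, stride=stride)]


def next_token_pairs(ids):
    """Build (context, target) pairs as used by a causal language-modelling
    objective: each prefix of the sequence is asked to predict the next token."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


if __name__ == "__main__":
    tokens = kmer_tokenize("ATGCGTACC")
    ids = encode("ATGCGTACC")
    print(tokens)
    print(ids)
    for context, target in next_token_pairs(ids):
        print(context, "->", target)
```

With non-overlapping 3-mers, a sequence of n bases becomes about n/3 tokens, which shortens the input the transformer has to process compared with single-nucleotide tokens. With overlapping 3-mers the token count stays close to n.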
OryzaG3 targets crop genomics and molecular breeding. As a pretrained base model, it can be fine-tuned to predict the functional consequences of genomic variants, identify regulatory elements such as polyadenylation sites, and provide sequence representations for other rice genome annotation and prediction tasks. The intended beneficiaries are plant geneticists, breeders, and computational biologists who need a rice-specialised model that runs efficiently over large resequencing datasets, in support of marker discovery and the prioritisation of candidate variants in breeding programmes.
OryzaG3 reflects a broader trend toward crop-specific and lineage-specific genomic foundation models, in contrast to the multi-species generalist approach of earlier DNA language models. By demonstrating that a single-crop pangenome pretraining corpus can yield a model competitive with broader baselines while running several times faster, the work argues that specialisation can deliver both accuracy and efficiency for agriculturally important genomes. Because the work is a recent preprint, its results have not yet undergone peer review, and its real-world adoption and downstream influence remain to be established. A notable practical limitation is that, as of the preprint, no public code or model weights were linked, which may constrain immediate reuse by the community.
Yang, L., et al. (2026) OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome. bioRxiv.
DOI: 10.64898/2026.05.22.727045