bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
DNA & Gene foundation models

PlantGeneAnn

Huazhong Agricultural University

A strand-specific plant genome foundation model for ab initio gene structure annotation, predicting genes, CDSs, and exons at single-nucleotide resolution.

Released: June 2026
Parameters: 200 Million

PlantGeneAnn is a genome foundation model that performs ab initio gene structure annotation of plant genomes directly from DNA sequence, without requiring transcript or homology evidence. Given a genomic sequence, it predicts complete gene structures — including genes, coding sequences (CDSs), and exons — at single-nucleotide resolution on both the forward and reverse strands. It was introduced in a June 2026 bioRxiv preprint by researchers at Huazhong Agricultural University.

Gene structure annotation is a foundational step in plant genomics, yet conventional pipelines depend heavily on RNA-seq, protein homology, and species-specific training, which limits their utility for newly sequenced or under-studied plant species. PlantGeneAnn reframes annotation as a sequence-labeling problem: a pretrained genomic encoder produces per-nucleotide representations that a segmentation head converts into gene-element predictions. The model's explicit strand-specific design lets it distinguish features on the sense and antisense strands rather than collapsing them, an important distinction in compact plant genomes with overlapping or densely packed loci.
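The sequence-labeling framing can be illustrated with a minimal sketch of the final step, where per-nucleotide labels on each strand are collapsed into intervals. The label encoding (`E` for exon, `.` for background) and the function name are assumptions made for illustration and are not taken from the PlantGeneAnn code.

```python
from itertools import groupby


def labels_to_intervals(labels, background="."):
    """Collapse per-nucleotide labels into (start, end, label) runs.

    Coordinates are 0-based and half-open. Bases carrying the
    `background` label are skipped.
    """
    intervals = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label != background:
            intervals.append((pos, pos + length, label))
        pos += length
    return intervals


# Hypothetical per-base calls for a 12 bp window, one track per strand.
forward = list("..EEEE..EE..")  # assumed encoding: 'E' = exon, '.' = background
reverse = list("......EEEE..")

calls = {
    "+": labels_to_intervals(forward),
    "-": labels_to_intervals(reverse),
}
```

Keeping one label track per strand means overlapping features on opposite strands stay separate instead of being merged into a single call.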

A central finding of the work is that annotation quality matters more than raw data volume during training: the variant trained on just nine high-quality model-plant genomes outperformed its counterpart trained on 42 species. PlantGeneAnn builds on the PlantBiMoE architecture from the same research group, but it is a distinct model with its own pretraining data, a new segmentation head, an extended context window, and separately released weights.

#Key Features

  • Ab initio annotation: Predicts gene structures from DNA sequence alone, removing the dependence on transcript evidence or homology that constrains conventional annotation pipelines, which is especially valuable for newly sequenced and non-model plants.
  • Strand-specific prediction: Resolves genes, CDSs, and exons separately on the forward and reverse strands, capturing the directional organization of plant loci.
  • Single-nucleotide resolution: A 1D U-Net segmentation head maps the encoder's per-base representations to fine-grained gene-element labels.
  • Long-context modeling: Supports input sequences up to 49,152 bp, allowing whole gene loci with their introns and flanking regions to be processed in a single pass.
  • Adaptable foundation model: Beyond annotation, the encoder can be fine-tuned to predict additional omic signals such as RNA-seq and ATAC-seq coverage.
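Because the context window is capped at 49,152 bp, a chromosome-scale sequence has to be split before inference. The sketch below shows one plausible tiling scheme. The overlap value and the function name are assumptions, and the released code may handle long inputs differently.

```python
CONTEXT_BP = 49_152  # maximum input length stated for PlantGeneAnn


def tile_windows(seq_len, window=CONTEXT_BP, overlap=4_096):
    """Return 0-based half-open (start, end) windows covering [0, seq_len).

    Consecutive windows share `overlap` bases so that features near a
    window edge are seen with context in at least one window. The
    overlap size here is an illustrative assumption.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if seq_len <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        windows.append((start, end))
        if end >= seq_len:
            break
        start += step
    return windows


# A 100 kb contig needs three overlapping windows.
windows = tile_windows(100_000)
```

Predictions in the overlapping regions would still need to be reconciled, for example by averaging probabilities or by keeping the call from the window in which the base is more central.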

#Technical Details

PlantGeneAnn pairs a 116M-parameter PlantBiMoE encoder — a bidirectional Mamba state-space backbone with a sparse Mixture-of-Experts feedforward design — with a custom 1D U-Net segmentation head, for roughly 200M parameters total and a 49,152 bp context window. Two variants were released: a model-plant version trained on 18 billion tokens from nine high-quality model plant genomes (30 hours on 4 NVIDIA A800-80GB GPUs), and a multi-species version trained on 72 billion tokens from 42 plant genomes drawn from NCBI RefSeq (120 hours on the same hardware). Both were optimized with AdamW (learning rate 1e-4, weight decay 0.01) under a cosine decay schedule and emit per-nucleotide genomic-feature probabilities alongside 1,024-dimensional sequence embeddings.
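The training figures above can be sanity-checked with a short sketch. It assumes a plain cosine decay with no warmup and a minimum learning rate of zero, neither of which is stated here, and it treats the reported hours as wall-clock time on all four GPUs.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    No warmup phase is modeled; that and min_lr=0.0 are assumptions.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def tokens_per_gpu_second(tokens, hours, gpus=4):
    """Average throughput implied by a token budget and a run duration."""
    return tokens / (hours * 3600 * gpus)


# Token budgets and durations as reported for the two variants.
model_plant = tokens_per_gpu_second(18e9, 30)
multi_species = tokens_per_gpu_second(72e9, 120)
```

Under these assumptions both variants imply the same average rate, roughly 41,667 tokens per GPU per second, which is consistent with the multi-species run being a fourfold scale-up of the model-plant run on identical hardware.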

The model was evaluated on a 13-species benchmark spanning rosids, asterids, and monocots, where it surpassed four state-of-the-art baselines across five evaluation levels. In zero-shot variant effect prediction, PlantGeneAnn identified cryptic splice donor sites and premature stop codons in maize and rice without task-specific training. Notably, the nine-species model outperformed the 42-species model, indicating that high-quality annotations contributed more to performance than the volume of pretraining data.
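One common way to do zero-shot variant scoring with a per-base predictor is to compare its outputs on the reference and alternate sequences. The sketch below illustrates that idea only. The scoring rule, the function names, and the toy predictor are all assumptions, since the preprint's actual procedure is not described here.

```python
def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based pos replaced by alt."""
    if len(alt) != 1 or not 0 <= pos < len(seq):
        raise ValueError("expected a single-base substitution inside the sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect_score(predict, seq, pos, alt):
    """Largest per-base change in predicted probability caused by the SNV.

    `predict` is any callable mapping a sequence to one probability
    per base; a real model would be wrapped to fit this interface.
    """
    ref_probs = predict(seq)
    alt_probs = predict(apply_snv(seq, pos, alt))
    return max(abs(a - r) for r, a in zip(ref_probs, alt_probs))


def toy_donor_predictor(seq):
    """Stand-in for a real model: flags the start of every 'GT' with 1.0."""
    return [1.0 if seq[i:i + 2] == "GT" else 0.0 for i in range(len(seq))]


seq = "ACGATCCA"
creates_gt = variant_effect_score(toy_donor_predictor, seq, 3, "T")  # ACGTTCCA
neutral = variant_effect_score(toy_donor_predictor, seq, 6, "G")     # ACGATCGA
```

In this toy case the first substitution creates a new GT dinucleotide, the canonical splice donor motif, and scores high, while the second leaves the predictions unchanged.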

#Applications

PlantGeneAnn is aimed at plant and crop genomics researchers who need accurate gene annotations for genome assemblies that lack the deep transcriptomic and homology support assumed by conventional tools. It is well suited to annotating newly sequenced crops and under-studied species, and its zero-shot variant effect capability can help prioritize candidate functional variants — such as those introducing cryptic splice sites or premature stop codons — in breeding and functional-genomics studies. As an adaptable foundation model, its encoder can also be fine-tuned for related single-nucleotide-resolution tasks like predicting RNA-seq or ATAC-seq signal.

#Impact

PlantGeneAnn extends genome foundation models from representation learning into end-to-end structural annotation, a task historically handled by evidence-based pipelines rather than learned sequence models. Its result that nine carefully annotated genomes can outperform 42 species suggests that training data for plant genomic models should be curated for annotation quality before species breadth. The code is released on GitHub under the MIT license, and both model variants are distributed on Hugging Face, lowering the barrier for agricultural labs to apply foundation-model annotation. The preprint itself is released under a non-commercial license and has not yet been peer reviewed, so its long-term influence on plant genome annotation remains to be established.

Citation

DOI: 10.64898/2026.06.25.733695

GitHub

Stars: 8
Forks: 2
Open Issues: 0
Contributors: 4
Last Push: 2d ago
Language: Jupyter Notebook
License: MIT

HuggingFace

Downloads: 71
Likes: 2
Last Modified: 2d ago
Pipeline: feature-extraction

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 66 (Partial)
Usability (can I run it?): 92
Reproducibility (can I retrain it?): 46
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

dna, foundation_model, gene_structure_annotation, genomics, mixture_of_experts, segmentation, state_space_model, u_net, variant_effect_prediction, zero_shot

Resources

GitHub Repository
Research Paper
HuggingFace Model
HuggingFace Model