bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
DNA & Gene

PlantCAD2

Cornell University

A long-context plant DNA language model (676M params, Mamba2) pretrained on 65 angiosperm genomes for cross-species functional annotation.

Released: April 2026
Parameters: 676 Million

PlantCAD2 is a long-context, plant-specific DNA language model that learns the grammar of flowering-plant genomes at single-nucleotide resolution. It addresses a persistent gap in genomics: while human DNA models have proliferated, most are poorly suited to the deep evolutionary diversity and gene-dense architecture of plant genomes. By training across many species rather than a single reference, PlantCAD2 captures evolutionary conservation signals that flag functional and deleterious sites, enabling annotation and variant interpretation across diverse angiosperms without task-specific labels.

Developed by researchers at Cornell University (Institute for Genomic Diversity, the Kuleshov Group in Computer Science, and Plant Breeding and Genetics), with collaborators at the Open Athena AI Foundation and USDA-ARS, PlantCAD2 is the successor to PlantCAD (PlantCaduceus). It extends that work from a 512 bp context to 8,192 bp and from 16 to 65 genomes, while swapping the underlying sequence backbone to the more efficient Mamba2 architecture. The model was first posted to bioRxiv in August 2025, with a revised version (retitled "PlantCAD2: a DNA foundation model for interpreting genomes across flowering plants") released in April 2026.

Notably, the 676-million-parameter PlantCAD2 surpasses the 7-billion-parameter Evo2 on 10 of 12 zero-shot benchmarks, demonstrating that a smaller, domain-focused model trained on the right evolutionary signal can outperform a much larger general-purpose genome model on plant tasks.

#Key Features

  • Cross-species pretraining: Trained on 65 angiosperm genomes (one representative species per genus, drawn from Phytozome), letting the model learn conservation patterns that generalize across flowering plants rather than memorizing a single reference.
  • Mamba2 backbone: Built on bidirectional Mamba2 state-space blocks that scale linearly with sequence length, retaining reverse-complement equivariance and single-nucleotide tokenization for strand-invariant modeling.
  • Long context: Supports input windows up to 8,192 bp, a 16-fold increase over PlantCAD, capturing longer-range regulatory structure (though conservation performance plateaus near 4,096 bp).
  • Three model sizes: Released as Small (~88M), Medium (~311M), and Large (~676M / 694M) checkpoints, letting users trade compute for accuracy.
  • Zero-shot functional signal: Recovers splice sites, translation initiation/termination sites, and structural-variant effects without fine-tuning, plus LoRA-tuned heads for chromatin accessibility, gene expression, and protein translation.
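The single-nucleotide tokenization and strand symmetry mentioned in the list above can be illustrated with a minimal sketch. The vocabulary ids and the helper names (`tokenize`, `reverse_complement`, `strand_averaged`) are assumptions made for illustration and are not taken from the PlantCAD2 code. Averaging over both strands, as done here, is only a generic post-hoc way to obtain a strand-invariant score; a reverse-complement-equivariant architecture builds that symmetry into its layers instead.

```python
# Illustrative single-nucleotide vocabulary; the real PlantCAD2 token ids may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def tokenize(seq: str) -> list[int]:
    """Map each base to one token id (one token per nucleotide)."""
    return [VOCAB[base] for base in seq.upper()]


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def strand_averaged(score_fn, seq: str) -> float:
    """Average a per-sequence score over the forward and reverse strands."""
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


def toy_score(seq: str) -> float:
    """Toy, strand-dependent score: 1.0 if the sequence starts with A."""
    return float(seq.startswith("A"))


forward = "AACG"
reverse = reverse_complement(forward)
print(tokenize(forward), reverse)
print(strand_averaged(toy_score, forward), strand_averaged(toy_score, reverse))
```

Because taking the reverse complement twice returns the original sequence, the strand-averaged score is the same whichever strand is supplied, even though `toy_score` itself is strand-dependent.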

#Technical Details

PlantCAD2 is pretrained with a masked language modeling objective on overlapping 512–8,192 bp windows (4,096 bp step) extracted from gene-centered regions of 65 genomes. The largest variant (48 layers, hidden dimension 1,536, ~676M parameters) was trained for 28 days on 64 NVIDIA H100 GPUs.

On zero-shot benchmarks it beats Evo2 7B on, among others:

  • Splice-donor prediction: AUROC 0.910 vs 0.741
  • Splice-acceptor prediction: AUROC 0.900 vs 0.738
  • Translation initiation-site recovery in maize: 0.657 vs 0.447
  • Andropogoneae conservation: AUROC 0.725 vs 0.691
  • Structural variant effect prediction: AUPRC 0.841 vs 0.771

Fine-tuned results include cross-species leaf expression (AUROC 0.854) and chromatin accessibility (AUPRC 0.409). Weights are distributed through the kuleshov-group HuggingFace collection, and the training corpus is published as the Angiosperm_65_genomes_8192bp dataset (~3.3M windows, 27 GB).
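To make the windowing scheme concrete, the sketch below tiles a region with 8,192 bp windows at a 4,096 bp step and checks the stated corpus size with simple arithmetic. The function names are hypothetical, and the handling of regions shorter than one window (dropped here) is an assumption; the actual preprocessing pipeline may pad or treat such regions differently.

```python
WINDOW_BP = 8192  # maximum context length stated above
STEP_BP = 4096    # step between consecutive window starts stated above


def window_starts(region_len: int, window: int = WINDOW_BP, step: int = STEP_BP) -> list[int]:
    """Start offsets of full-length, overlapping windows tiled across a region."""
    if region_len < window:
        return []  # assumption: this sketch drops regions shorter than one window
    return list(range(0, region_len - window + 1, step))


def corpus_bases(n_windows: int, window: int = WINDOW_BP) -> int:
    """Total bases stored if every window is full length (overlapping bases counted twice)."""
    return n_windows * window


# A 20 kb region yields three windows; neighbours share window - step = 4,096 bp.
print(window_starts(20_000))

# Rough size check for ~3.3M windows, in billions of bases.
print(corpus_bases(3_300_000) / 1e9)
```

With roughly 3.3M windows of 8,192 bp, the corpus holds about 2.7 × 10^10 bases. At one byte per base, uncompressed, that is about 27 GB, which is consistent with the stated dataset size under that storage assumption.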

#Applications

PlantCAD2 supports cross-species genome annotation, prioritization of regulatory and coding variants, and prediction of splice sites, transcription factor binding sites, chromatin accessibility, and gene expression across crops and wild relatives. Plant breeders and crop genomicists can use its conservation-aware scores to flag deleterious mutations and candidate causal variants in under-annotated species, while functional genomics labs can use embeddings or LoRA-tuned heads to transfer annotation from well-studied model species such as maize, rice, and Arabidopsis to newly sequenced genomes.
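One common way to turn a masked DNA language model into a zero-shot variant score is a log-likelihood ratio between the alternate and reference alleles at a masked position. The sketch below shows that calculation only. The page does not state the exact scoring formula PlantCAD2 uses, so the formula, the function names, and the probability values are all illustrative assumptions, and no model is actually called.

```python
import math


def llr_score(probs: dict[str, float], ref: str, alt: str) -> float:
    """log P(alt) - log P(ref) at a masked site; negative means alt is less likely than ref."""
    return math.log(probs[alt]) - math.log(probs[ref])


def rank_variants(variants: list[tuple[str, dict[str, float], str, str]]) -> list[tuple[str, float]]:
    """Sort (name, probs, ref, alt) records from most to least negative score."""
    scored = [(name, llr_score(probs, ref, alt)) for name, probs, ref, alt in variants]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical model outputs: per-base probabilities predicted at each masked site.
conserved_site = {"A": 0.90, "C": 0.04, "G": 0.03, "T": 0.03}
variable_site = {"A": 0.30, "C": 0.25, "G": 0.25, "T": 0.20}

ranking = rank_variants([
    ("variable_A>G", variable_site, "A", "G"),
    ("conserved_A>G", conserved_site, "A", "G"),
])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Under this scheme, a substitution at a site where the model strongly expects the reference base receives a more negative score and is ranked first as a candidate deleterious variant.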

#Impact

PlantCAD2 advances plant genomics by showing that a compact, evolution-aware DNA foundation model can outperform a general-purpose model an order of magnitude larger on the tasks plant biologists care about, lowering the compute barrier for agricultural research groups. As a multi-species successor to PlantCAD with released weights, training data, and fine-tuning recipes, it provides a practical backbone for crop-improvement pipelines and downstream tools such as genome annotation systems. It has three main limitations: as a preprint, its results still await peer review; the training corpus centers on gene-proximal regions and Phytozome species; and the CC BY-NC license restricts commercial use of the released models.

Tags

variant_effect_prediction, functional_annotation, gene_expression, state_space_model, mamba2, foundation_model, self_supervised, zero_shot, genomics, splicing