A long-context plant DNA language model (676M params, Mamba2) pretrained on 65 angiosperm genomes for cross-species functional annotation.
PlantCAD2 is a long-context, plant-specific DNA language model that learns the grammar of flowering-plant genomes at single-nucleotide resolution. It addresses a persistent gap in genomics: while human DNA models have proliferated, most are poorly suited to the deep evolutionary diversity and gene-dense architecture of plant genomes. By training across many species rather than a single reference, PlantCAD2 captures evolutionary conservation signals that flag functional and deleterious sites, enabling annotation and variant interpretation across diverse angiosperms without task-specific labels.
Developed by researchers at Cornell University (Institute for Genomic Diversity, the Kuleshov Group in Computer Science, and Plant Breeding and Genetics), with collaborators at the Open Athena AI Foundation and USDA-ARS, PlantCAD2 is the successor to PlantCAD (PlantCaduceus). It extends that work from a 512 bp context to 8,192 bp and from 16 to 65 genomes, while swapping the underlying sequence backbone to the more efficient Mamba2 architecture. The model was first posted to bioRxiv in August 2025, with a revised version (retitled "PlantCAD2: a DNA foundation model for interpreting genomes across flowering plants") released in April 2026.
The 676-million-parameter PlantCAD2 surpasses the 7-billion-parameter Evo2 on 10 of 12 zero-shot benchmarks, indicating that a smaller, domain-focused model trained on cross-species evolutionary signal can outperform a general-purpose genome model roughly ten times its size on plant tasks.
PlantCAD2 is pretrained with a masked language modeling objective on overlapping
512–8,192 bp windows (4,096 bp step) extracted from gene-centered regions of 65
genomes. The largest variant (48 layers, hidden dimension 1,536, ~676M
parameters) was trained for 28 days on 64 NVIDIA H100 GPUs. On zero-shot benchmarks
it beats Evo2 7B across splice-donor (AUROC 0.910 vs 0.741) and splice-acceptor
(0.900 vs 0.738) prediction, translation initiation-site recovery in maize (0.657
vs 0.447), Andropogoneae conservation (AUROC 0.725 vs 0.691), and structural
variant effect prediction (AUPRC 0.841 vs 0.771), among others. Fine-tuned
results include cross-species leaf expression (AUROC 0.854) and chromatin
accessibility (AUPRC 0.409). Weights are distributed through the kuleshov-group
HuggingFace collection, and the training corpus is published as the
Angiosperm_65_genomes_8192bp dataset (~3.3M windows, 27 GB).
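The windowing scheme described above (fixed-length windows advanced by a 4,096 bp step, so adjacent 8,192 bp windows share half their bases) can be sketched as follows. This is a minimal illustration only, not the authors' preprocessing code: the function name `sliding_windows` and the choice to drop a trailing partial window are assumptions.

```python
from typing import Iterator, Tuple


def sliding_windows(
    seq: str, window: int = 8192, step: int = 4096
) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for every full-length window of `seq`.

    Assumption: a trailing fragment shorter than `window` is dropped,
    so a sequence shorter than `window` yields nothing.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


# Toy 20,480 bp sequence standing in for a gene-centered region.
region = "ACGT" * 5120
windows = list(sliding_windows(region))
starts = [start for start, _ in windows]
print(len(windows), starts)
```

With a step of half the window length, every base away from the region edges falls in two windows, which is one plausible reason to choose an overlapping layout for training data.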
PlantCAD2 supports cross-species genome annotation, prioritization of regulatory and coding variants, and prediction of splice sites, transcription factor binding sites, chromatin accessibility, and gene expression across crops and wild relatives. Plant breeders and crop genomicists can use its conservation-aware scores to flag deleterious mutations and candidate causal variants in under-annotated species, while functional genomics labs can use embeddings or LoRA-tuned heads to transfer annotation from well-studied model species such as maize, rice, and Arabidopsis to newly sequenced genomes.
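One common way to turn a masked language model into a zero-shot variant score is a log-likelihood ratio: mask the variant position, read the predicted probability of each nucleotide, and compare the alternate allele with the reference allele. The sketch below illustrates that idea under stated assumptions. It is not confirmed to be PlantCAD2's exact scoring procedure, `llr_score` is a hypothetical helper, and the probability table is made-up toy data standing in for real model output.

```python
import math
from typing import Mapping


def llr_score(
    probs: Mapping[str, float], ref: str, alt: str, eps: float = 1e-12
) -> float:
    """Return log P(alt) - log P(ref) at a masked position.

    `probs` maps each nucleotide to the model's predicted probability
    at that position. Negative scores mean the alternate allele is less
    expected than the reference, which is often read as a sign that the
    site is conserved and the variant may be deleterious.
    """
    return math.log(probs[alt] + eps) - math.log(probs[ref] + eps)


# Hypothetical masked-position probabilities (toy values, not model output).
conserved_site = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
neutral_site = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(llr_score(conserved_site, ref="A", alt="T"))  # strongly negative
print(llr_score(neutral_site, ref="A", alt="T"))    # zero: no preference
```

Ranking candidate variants by a score of this kind needs no task-specific labels, which matches the label-free use case described above.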
PlantCAD2 advances plant genomics by showing that a compact, evolution-aware DNA foundation model can outperform a general-purpose model an order of magnitude larger on the tasks plant biologists care about, lowering the compute barrier for agricultural research groups. As a multi-species successor to PlantCAD with released weights, training data, and fine-tuning recipes, it provides a practical backbone for crop-improvement pipelines and downstream tools such as genome annotation systems. Its main limitations are threefold: the work is still a preprint, so results await peer review; the training corpus centers on gene-proximal regions and Phytozome species; and the CC BY-NC license restricts commercial use of the released models.