PlantCAD2

Long-context plant DNA language model, 676M parameters on a Mamba2 backbone, pretrained on 65 angiosperm genomes for cross-species variant annotation.

Released: April 2026

Parameters: 676 Million

PlantCAD2 is a long-context, plant-specific DNA language model that learns the grammar of flowering-plant genomes at single-nucleotide resolution. It addresses a persistent gap in genomics: while human DNA models have proliferated, most are poorly suited to the deep evolutionary diversity and gene-dense architecture of plant genomes. By training across many species rather than a single reference, PlantCAD2 captures evolutionary conservation signals that flag functional and deleterious sites, enabling annotation and variant interpretation across diverse angiosperms without task-specific labels.

Developed by researchers at Cornell University (Institute for Genomic Diversity, the Kuleshov Group in Computer Science, and Plant Breeding and Genetics), with collaborators at the Open Athena AI Foundation and USDA-ARS, PlantCAD2 is the successor to PlantCAD (PlantCaduceus). It extends that work from a 512 bp context to 8,192 bp and from 16 to 65 genomes, while swapping the underlying sequence backbone to the more efficient Mamba2 architecture. The model was first posted to bioRxiv in August 2025, with a revised version (retitled "PlantCAD2: a DNA foundation model for interpreting genomes across flowering plants") released in April 2026.

Notably, the 676-million-parameter PlantCAD2 surpasses the 7-billion-parameter Evo2 on 10 of 12 zero-shot benchmarks, demonstrating that a smaller, domain-focused model trained on the right evolutionary signal can outperform a much larger general-purpose genome model on plant tasks.

Key Features

Cross-species pretraining: Trained on 65 angiosperm genomes (one representative species per genus, drawn from Phytozome), letting the model learn conservation patterns that generalize across flowering plants rather than memorizing a single reference.
Mamba2 backbone: Built on bidirectional Mamba2 state-space blocks that scale linearly with sequence length, retaining reverse-complement equivariance and single-nucleotide tokenization for strand-invariant modeling.
Long context: Supports input windows up to 8,192 bp, a 16-fold increase over PlantCAD, capturing longer-range regulatory structure (though conservation performance plateaus near 4,096 bp).
Three model sizes: Released as Small (~88M), Medium (~311M), and Large (~676M / 694M) checkpoints, letting users trade compute for accuracy.
Zero-shot functional signal: Recovers splice sites, translation initiation/termination sites, and structural-variant effects without fine-tuning; LoRA fine-tuning adds heads for chromatin accessibility, gene expression, and protein translation.
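
To make the reverse-complement equivariance and single-nucleotide tokenization mentioned above concrete, here is a minimal sketch. It is not PlantCAD2's implementation: the token order `ACGT`, the `toy_model` stand-in, and the averaging wrapper are all assumptions for illustration. PlantCAD2 is described as having equivariance built into its backbone, whereas this sketch uses the simpler trick of averaging forward-strand and reverse-strand predictions to obtain the same property for any per-position predictor.

```python
import numpy as np

ALPHABET = "ACGT"  # hypothetical token order: one token per nucleotide
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def tokenize(seq: str) -> np.ndarray:
    """Single-nucleotide tokenization: each base maps to one integer id."""
    return np.array([ALPHABET.index(base) for base in seq])


def toy_model(seq: str) -> np.ndarray:
    """Stand-in predictor (NOT PlantCAD2): per-position probabilities over
    A, C, G, T with shape (len(seq), 4). Deliberately not equivariant."""
    ids = tokenize(seq)
    logits = np.zeros((len(ids), 4))
    logits[1:] += np.eye(4)[ids[:-1]]          # favor the previous base
    logits[:, 0] += 0.1 * np.arange(len(ids))  # arbitrary positional bias
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def rc_output(probs: np.ndarray) -> np.ndarray:
    """Map predictions to the other strand: reverse positions and swap
    A<->T, C<->G. With the order ACGT this is a flip of both axes."""
    return probs[::-1, ::-1]


def rc_averaged(seq: str) -> np.ndarray:
    """Average forward-strand and reverse-strand predictions, which makes
    the combined predictor reverse-complement equivariant."""
    return 0.5 * (toy_model(seq) + rc_output(toy_model(reverse_complement(seq))))


if __name__ == "__main__":
    seq = "ACGTTGCAAG"
    same = np.allclose(rc_averaged(reverse_complement(seq)),
                       rc_output(rc_averaged(seq)))
    print("averaged predictor is RC-equivariant:", same)
```

Equivariance here means that predicting on the reverse-complemented sequence gives the same result as reverse-complementing the predictions, so scores do not depend on which strand a locus happens to be written on.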

Technical Details

PlantCAD2 is pretrained with a masked language modeling objective on overlapping windows of 512 to 8,192 bp (4,096 bp step) extracted from gene-centered regions of the 65 genomes. The largest variant (48 layers, hidden dimension 1,536, ~676M parameters) was trained for 28 days on 64 NVIDIA H100 GPUs.

On zero-shot benchmarks it outperforms Evo2 7B on splice-donor (AUROC 0.910 vs 0.741) and splice-acceptor (0.900 vs 0.738) prediction, translation initiation-site recovery in maize (0.657 vs 0.447), Andropogoneae conservation (AUROC 0.725 vs 0.691), and structural variant effect prediction (AUPRC 0.841 vs 0.771), among others. Fine-tuned results include cross-species leaf expression (AUROC 0.854) and chromatin accessibility (AUPRC 0.409).

Weights are distributed through the kuleshov-group HuggingFace collection, and the training corpus is published as the Angiosperm_65_genomes_8192bp dataset (~3.3M windows, 27 GB).
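
The overlapping-window scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the function names are hypothetical, and the handling of the region tail (adding one final window flush with the end) is a choice made here that the source does not specify.

```python
def window_starts(region_len: int, window: int = 8192, step: int = 4096) -> list[int]:
    """Start offsets of overlapping windows covering a region.

    With window=8192 and step=4096, consecutive windows overlap by 50%.
    """
    if region_len <= window:
        return [0]  # region fits in a single (possibly shorter) window
    starts = list(range(0, region_len - window + 1, step))
    # Assumption: cover any leftover tail with one window aligned to the end.
    if starts[-1] + window < region_len:
        starts.append(region_len - window)
    return starts


def extract_windows(seq: str, window: int = 8192, step: int = 4096) -> list[str]:
    """Slice a sequence into the windows given by window_starts."""
    return [seq[s:s + window] for s in window_starts(len(seq), window, step)]


if __name__ == "__main__":
    region = "ACGT" * 5000  # 20,000 bp toy region
    for start in window_starts(len(region)):
        print(start, start + 8192)
```

Because the step is half the maximum window length, every interior base falls in two windows, so each position is seen with context on both sides at some point during training.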

Applications

PlantCAD2 supports cross-species genome annotation, prioritization of regulatory and coding variants, and prediction of splice sites, transcription factor binding sites, chromatin accessibility, and gene expression across crops and wild relatives. Plant breeders and crop genomicists can use its conservation-aware scores to flag deleterious mutations and candidate causal variants in under-annotated species, while functional genomics labs can use embeddings or LoRA-tuned heads to transfer annotation from well-studied models like maize, rice, and Arabidopsis to newly sequenced genomes.
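
One common way to turn a masked DNA language model into a zero-shot variant score is a log-likelihood ratio between the alternate and reference alleles at a masked site. The sketch below assumes that approach; the source does not state the exact scoring formula PlantCAD2 uses. The probability vector is a made-up example rather than model output, and `llr_score` and the `ACGT` ordering are hypothetical names chosen for illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed order of the per-site probability vector


def llr_score(site_probs, ref: str, alt: str, eps: float = 1e-9) -> float:
    """Log-likelihood ratio log P(alt) - log P(ref) at one masked site.

    site_probs: length-4 probabilities over A, C, G, T that a masked
    language model assigns to the site given its flanking context.
    Strongly negative values mean the model finds the alternate allele
    much less plausible than the reference, as expected at conserved sites.
    """
    probs = np.asarray(site_probs, dtype=float)
    p_ref = probs[ALPHABET.index(ref)]
    p_alt = probs[ALPHABET.index(alt)]
    return float(np.log(p_alt + eps) - np.log(p_ref + eps))


if __name__ == "__main__":
    # Made-up prediction for a site where the model strongly expects 'A'.
    example_probs = [0.90, 0.05, 0.03, 0.02]
    print(llr_score(example_probs, ref="A", alt="G"))
```

Ranking candidate variants by such a score (most negative first) is one way to prioritize putatively deleterious mutations without any task-specific labels.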

Impact

PlantCAD2 advances plant genomics by showing that a compact, evolution-aware DNA foundation model can outperform a general-purpose model an order of magnitude larger on the tasks plant biologists care about, lowering the compute barrier for agricultural research groups. As a multi-species successor to PlantCAD with released weights, training data, and fine-tuning recipes, it provides a practical backbone for crop-improvement pipelines and downstream tools such as genome annotation systems. Its main limitations are threefold: as a preprint, its results await peer review; the training corpus centers on gene-proximal regions and Phytozome species; and the CC BY-NC license restricts commercial use of the released models.

Citation

PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms

Preprint

Zhai, J., et al. (2025) PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms. bioRxiv.

DOI: 10.1101/2025.08.27.672609

Recent citations

Papers that recently cited this model.

Convergent genome- and gene-level constraints shape repeated environmental adaptation in grasses
Sheng-Kai Hsu, Aimee J. Schulz, Charles O. Hale, et al.
bioRxiv · Jun 2026 · 0 citations

Genomic properties representing plant sex chromosome evolution interpreted with genome language models
Takashi Akagi, Hikaru Matsuoka, J. Takayama, et al.
bioRxiv · May 2026 · 0 citations

BOTANIC-0: a series of foundation models for plant genomic data
J. Ogier du Terrail, Tanguy Marchand, V. Cabeli, et al.
bioRxiv · Mar 2026 · 1 citation · Influential

Top citations

The most-cited papers that cite this model.

BOTANIC-0: a series of foundation models for plant genomic data
J. Ogier du Terrail, Tanguy Marchand, V. Cabeli, et al.
bioRxiv · Mar 2026 · 1 citation · Influential

GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Zong-Yan Liu, Ana Berthel, Eric Czech, et al.
bioRxiv · Nov 2025 · 0 citations

Convergent genome- and gene-level constraints shape repeated environmental adaptation in grasses
Sheng-Kai Hsu, Aimee J. Schulz, Charles O. Hale, et al.
bioRxiv · Jun 2026 · 0 citations

Genomic properties representing plant sex chromosome evolution interpreted with genome language models
Takashi Akagi, Hikaru Matsuoka, J. Takayama, et al.
bioRxiv · May 2026 · 0 citations

Citations

Total Citations: 0

Influential: 0

References: 0

GitHub

Stars: 97

Forks: 14

Open Issues: 1

Contributors: 5

Last Push: 9d ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 2.8K

Likes: 2

Last Modified: 10mo ago

Fields of citing research

Biology: 100%
Environmental Science: 100%
Computer Science: 50%
Medicine: 25%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall: 69 · Partial

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 56

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset
