Colorado State University / University of Michigan
A supervised, chromatin-informed foundation model that predicts regulatory activity directly from plant genomic sequence in Arabidopsis and rice.
Sequence-to-function deep learning models have transformed regulatory genomics by learning to predict molecular phenotypes directly from DNA sequence, but the vast majority of this progress has concentrated on human and mammalian genomes. Plant regulatory genomics has remained comparatively underexplored, despite its importance for crop improvement and basic plant biology. Deep-Plant, introduced in a 2026 bioRxiv preprint from researchers at Colorado State University and the University of Michigan, addresses this gap with a supervised foundation model trained to predict chromatin state directly from plant genomic sequence.
Rather than following the self-supervised DNA language model paradigm, in which a model learns from raw sequence alone, Deep-Plant is trained on a large collection of genome-wide functional experiments. This supervised, chromatin-informed pretraining gives the model biological context beyond the sequence itself, which the authors position as a more practical and effective alternative to fine-tuning general-purpose DNA language models for plants. The design follows the spirit of human models such as Enformer, adapted to the data and species of the plant kingdom.
The pretrained chromatin model serves as a reusable backbone that is then fine-tuned for downstream regulatory tasks. Deep-Plant models are released for Arabidopsis thaliana and rice (Oryza sativa), and the authors show that these models also work as a building block for related species such as maize (corn).
Deep-Plant is a supervised sequence-to-function model that operates on fixed 2.5 kb input windows, with sequences center-cropped or padded to length. The pretraining objective predicts chromatin state profiles (derived from DNA accessibility, transcription factor binding, and histone modification assays), and the resulting representation is fine-tuned for gene expression and enhancer activity readouts. The authors report large improvements in speed, accuracy, and interpretability relative to the alternative approach of fine-tuning self-supervised DNA language models on the same plant tasks. Pretrained weights (~9.9 GB across tasks and species) and training data (~26.5 GB) are distributed via Zenodo, and a command-line tool accepts FASTA sequences, genomic loci, or gene identifiers as input. Exact parameter counts and the full architecture specification are given in the configuration files of the code release rather than summarized here.
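The fixed 2.5 kb windowing mentioned above can be illustrated with a minimal Python sketch. It is not taken from the Deep-Plant code release: the function names `fit_to_window` and `one_hot`, the symmetric placement of padding, the use of `N` as the pad character, and the A, C, G, T channel order are all assumptions made for illustration. Only the 2,500 bp window length and the crop-or-pad behavior come from the description.

```python
WINDOW = 2500  # 2.5 kb input window, as stated in the text


def fit_to_window(seq: str, window: int = WINDOW, pad_char: str = "N") -> str:
    """Center-crop a longer sequence, or pad a shorter one, to `window` bases.

    Assumption: padding is split evenly between both ends and uses 'N'.
    The released tool may pad differently.
    """
    seq = seq.upper()
    n = len(seq)
    if n >= window:
        start = (n - window) // 2  # drop equal flanks from both sides
        return seq[start:start + window]
    missing = window - n
    left = missing // 2
    right = missing - left  # any odd leftover base goes on the right
    return pad_char * left + seq + pad_char * right


def one_hot(seq: str) -> list[list[float]]:
    """Encode a sequence as rows of [A, C, G, T]; other symbols become all zeros.

    Assumption: a conventional one-hot layout, chosen here for illustration.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in seq:
        row = [0.0, 0.0, 0.0, 0.0]
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows


if __name__ == "__main__":
    short = fit_to_window("ACGT")          # padded up to 2,500 characters
    long_ = fit_to_window("ACGT" * 1000)   # 4,000 bp cropped to the central 2,500
    print(len(short), len(long_), len(one_hot(short)))
```

Encoding padded positions as all-zero rows keeps them from being read as any real base. Whether the actual model treats padding this way would need to be checked against the code release.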
Deep-Plant is aimed at plant genomicists and crop scientists who need accurate, interpretable predictions of regulatory activity from sequence. Concrete use cases include annotating chromatin state and candidate enhancers across the genome, predicting gene expression from promoter and regulatory sequence, and scoring the likely functional impact of natural or engineered variants. All of these are directly relevant to breeding, trait dissection, and synthetic promoter design. Because the model transfers to related species, researchers studying crops without their own large functional genomics datasets can use the Arabidopsis or rice backbones as a starting point.
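Variant scoring with a sequence-to-function model is commonly done by comparing the prediction for a reference sequence with the prediction for the same sequence carrying the variant. The sketch below shows that generic idea only. How Deep-Plant itself scores variants is not described here, so `apply_snv`, `variant_effect`, and the `predict` callable are hypothetical names, and `gc_fraction` is a toy stand-in for a real model's output rather than anything the model computes.

```python
from typing import Callable


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return `seq` with the base at 0-based position `pos` replaced by `alt`."""
    if not 0 <= pos < len(seq):
        raise IndexError("variant position lies outside the sequence")
    if len(alt) != 1:
        raise ValueError("this sketch handles single-nucleotide variants only")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(seq: str, pos: int, alt: str,
                   predict: Callable[[str], float]) -> float:
    """Score a variant as predict(alternate) minus predict(reference).

    `predict` is a placeholder for any function mapping a sequence to a scalar
    readout, such as a predicted expression or accessibility value.
    """
    return predict(apply_snv(seq, pos, alt)) - predict(seq)


def gc_fraction(seq: str) -> float:
    """Toy stand-in predictor: fraction of G or C bases (not a real model)."""
    return sum(base in "GC" for base in seq) / len(seq)


if __name__ == "__main__":
    # An A-to-G change at position 1 raises the toy GC score of this 4-mer.
    print(variant_effect("AAAA", 1, "G", gc_fraction))
```

In practice the reference and alternate sequences would first be brought to the model's fixed input length, and `predict` would wrap a fine-tuned model rather than a sequence statistic.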
By demonstrating that supervised, chromatin-informed pretraining can outperform the fine-tuning of DNA language models on plant regulatory tasks, Deep-Plant offers the plant genomics community an Enformer-style foundation model tailored to its organisms and data. It helps close the gap between the rapidly advancing human regulatory genomics toolkit and the comparatively under-resourced plant field. As a preprint, its benchmark claims await peer review, and downstream adoption will depend on validation across additional species and assays; the open release of weights, data, and tooling lowers the barrier for the community to build on and test the approach.