An autoregressive nucleotide-and-text foundation model pretrained on ~416B characters from 2,221 eukaryotic species for natural-language-guided conditional generation of DNA and RNA sequences.
GenNA is an autoregressive foundation model that jointly models nucleotide sequences and natural-language annotations, enabling controllable generation of DNA and RNA sequences from text prompts. Developed by researchers at Zhejiang University (affiliated with the Zhejiang Key Laboratory of Multi-omics Precision Diagnosis and Treatment of Liver Diseases at Zhejiang University School of Medicine) and released as a bioRxiv preprint in April 2026, the model addresses a gap in genomic AI: most nucleotide language models excel at representation learning or unconditional sequence prediction but offer little control over the biological function of what they generate.
The model's central idea is to treat genomic and transcriptomic sequences as a language that can be conditioned on functional semantics expressed in natural text. By prepending natural-language prompts describing species identity, gene names, and functional annotations to each training sequence, GenNA learns associations between functional semantics, species context, and the underlying sequence patterns. At inference time, this lets a user specify a desired biological context in plain language and sample nucleotide sequences consistent with it.
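The prompt-prepending idea can be illustrated with a minimal sketch. The field labels, separator, record type, and example values below are assumptions made for illustration; the source does not specify GenNA's actual prompt template.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedSequence:
    """Hypothetical record type; not part of any GenNA release."""
    species: str
    gene: str
    function: str
    sequence: str  # DNA or RNA, written 5' to 3'


def build_prompted_example(rec: AnnotatedSequence, sep: str = "\n") -> str:
    """Prepend a plain-text description to a nucleotide sequence.

    The "Species/Gene/Function" layout is an illustrative assumption,
    not the template used to train GenNA.
    """
    seq = rec.sequence.upper()
    # Accept DNA (T), RNA (U), and the ambiguity code N; reject anything else.
    if not set(seq) <= set("ACGTUN"):
        raise ValueError("unexpected characters in nucleotide sequence")
    prompt = f"Species: {rec.species}. Gene: {rec.gene}. Function: {rec.function}."
    return prompt + sep + seq


example = build_prompted_example(
    AnnotatedSequence(
        species="Arabidopsis thaliana",
        gene="EXAMPLE1",                   # hypothetical gene name
        function="example heat-shock response factor",
        sequence="atggccaagttc",           # arbitrary placeholder, not a real gene
    )
)
print(example)
```

During training the whole string would be modeled left to right, so the sequence characters are always predicted with the description in context. At inference time only the text portion is supplied and the model continues it with nucleotides.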
GenNA sits alongside generative genomic models such as DNAGPT, GENERator, and Evo, but differs in its explicit multimodal coupling of free-text functional guidance with sequence generation across a broad eukaryotic corpus rather than relying solely on special organism tokens or unconditional autoregression.
GenNA is a decoder-only (causal) transformer trained with a next-token prediction objective. The flagship model has approximately 3.6 billion parameters, and a smaller ~0.36B variant is also reported. It uses character-level tokenization and a context window of roughly 20,000 raw characters, enough to cover the full length of most transcripts and many genomic loci. Pretraining draws on a multimodal corpus of approximately 416 billion characters from 2,221 eukaryotic species, in which a natural-language prompt (species identity, gene names, functional annotations) precedes each nucleotide sequence so that generation is conditioned on functional semantics.
To evaluate whether the sequences the model generates and scores respect biological constraints, the authors used perplexity-based analyses. For example, they tested single-nucleotide substitutions and single-base deletions across the untranslated regions (5′ and 3′ UTRs) of protein-coding genes in an independent validation set. They report that the model assigns higher perplexity to sequence–function mismatches, which indicates sensitivity to functionally meaningful changes.
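The perplexity-based analysis can be sketched as follows. This uses the standard definition of perplexity, the exponential of the mean negative log-likelihood per character; the source does not state the exact normalization the authors used. The helper names are hypothetical, and the per-character log-probabilities would come from the model, which is not called here.

```python
import math
from typing import Iterator, List, Tuple

BASES = "ACGT"


def perplexity(logprobs: List[float]) -> float:
    """exp(-(1/N) * sum(log p_i)) over per-character natural-log probabilities."""
    if not logprobs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


def single_substitutions(seq: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, alt_base, mutated_sequence) for every single-base substitution."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield i, alt, seq[:i] + alt + seq[i + 1:]


def single_deletions(seq: str) -> Iterator[Tuple[int, str]]:
    """Yield (position, mutated_sequence) for every single-base deletion."""
    for i in range(len(seq)):
        yield i, seq[:i] + seq[i + 1:]


# Sanity check: a model that is uniform over four bases has perplexity 4.
uniform_logprobs = [math.log(0.25)] * 10
print(round(perplexity(uniform_logprobs), 6))

utr = "ACGT"  # arbitrary placeholder, not a real UTR
print(len(list(single_substitutions(utr))), len(list(single_deletions(utr))))
```

For a sequence of length L over A, C, G, and T, this enumeration produces 3L substitutions and L deletions, each of which would be rescored under the same text prompt and compared with the unmutated reference.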
GenNA is aimed at researchers who want to design or interrogate nucleotide sequences with explicit functional or species context. Synthetic biologists can prompt the model in natural language to generate candidate regulatory elements, coding sequences, or transcripts conditioned on a target species and function, supporting design-of-experiments workflows. Because the model couples text with sequence, it can also be used to score how well a sequence matches a stated function, helping prioritize variants or flag function-disrupting mutations in UTRs and coding regions. Comparative and functional genomics groups benefit from the broad eukaryotic training corpus when working with species that are poorly represented in single-organism models.
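One way to picture the variant-prioritization use case is to rank candidates by how much they raise perplexity relative to a reference under the same prompt. The sketch below takes the scorer as a parameter; `toy_ppl` is a placeholder with no biological meaning that stands in for a real model call, and all names, prompts, and sequences are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (prompt, sequence) -> perplexity; a real scorer would query the model.
PerplexityFn = Callable[[str, str], float]


def rank_variants(
    prompt: str,
    reference: str,
    variants: Dict[str, str],
    ppl_fn: PerplexityFn,
) -> List[Tuple[str, float]]:
    """Sort variants by perplexity increase over the reference, largest first."""
    base = ppl_fn(prompt, reference)
    shifts = [(name, ppl_fn(prompt, seq) - base) for name, seq in variants.items()]
    return sorted(shifts, key=lambda item: item[1], reverse=True)


def toy_ppl(prompt: str, seq: str) -> float:
    """Placeholder scorer (assumption): penalizes 'TT' dinucleotides, ignores the prompt."""
    return 2.0 + seq.count("TT")


ranked = rank_variants(
    prompt="Species: Example species. Function: example 5' UTR.",
    reference="ACGACGACG",
    variants={
        "variant_a": "ACTTCGACG",
        "variant_b": "ACGACGACC",
        "variant_c": "TTGATTACG",
    },
    ppl_fn=toy_ppl,
)
for name, shift in ranked:
    print(name, shift)
```

Passing the scorer in as a function keeps the ranking logic independent of how perplexity is obtained, so the same code would work with any model that exposes per-sequence likelihoods.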
GenNA extends the trend of adapting autoregressive language models to genomics by adding an explicit natural-language control channel over a large multi-species eukaryotic corpus, moving from representation learning and unconditional generation toward instruction-style, function-aware sequence design. Because it is a recent preprint (released under a CC BY license), its long-term adoption and downstream influence remain to be established, and independent benchmarking against established generative genomic models will be needed to show how its capabilities compare. Practical limitations include the context ceiling of roughly 20,000 characters, which limits how much of a very long genomic region can be modeled at once, and the dependence of conditional generation quality on the coverage and accuracy of the natural-language annotations available for a given species or gene.