GenNA

Autoregressive nucleotide-and-text foundation model generating DNA and RNA sequences from natural-language prompts that name species and function.

Released: April 2026

Parameters: 3.6 Billion

GenNA is an autoregressive foundation model that jointly models nucleotide sequences and natural-language annotations, enabling controllable generation of DNA and RNA sequences from text prompts. Developed by researchers at Zhejiang University (affiliated with the Zhejiang Key Laboratory of Multi-omics Precision Diagnosis and Treatment of Liver Diseases at Zhejiang University School of Medicine) and released as a bioRxiv preprint in April 2026, the model addresses a gap in genomic AI: most nucleotide language models excel at representation learning or unconditional sequence prediction but offer little control over the biological function of what they generate.

The model's central idea is to treat genomic and transcriptomic sequences as a language that can be conditioned on functional semantics expressed in natural text. By prepending natural-language prompts describing species identity, gene names, and functional annotations to each training sequence, GenNA learns associations between functional semantics, species context, and the underlying sequence patterns. At inference time, this lets a user specify a desired biological context in plain language and sample nucleotide sequences consistent with it.
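The prompt-prepending idea can be illustrated with a minimal sketch. The field labels, the separator, and the function names below are hypothetical: the source states only that species identity, gene names, and functional annotations are prepended as natural-language text, not the exact template GenNA uses.

```python
def build_prompted_example(species: str, gene: str, annotation: str, sequence: str) -> str:
    """Prepend a plain-text description to a nucleotide sequence.

    The "Species / Gene / Function" template is an assumption made for
    illustration; it is not the template used by the GenNA authors.
    """
    seq = sequence.strip().upper()
    # Accept DNA and RNA letters plus N for unknown bases.
    unexpected = set(seq) - set("ACGTUN")
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    prompt = f"Species: {species}. Gene: {gene}. Function: {annotation}."
    return f"{prompt}\n{seq}"


def build_inference_prompt(species: str, gene: str, annotation: str) -> str:
    """Text-only prefix; a model would continue it with nucleotides."""
    return build_prompted_example(species, gene, annotation, "")


example = build_prompted_example(
    "Homo sapiens", "TP53", "tumor suppressor transcription factor", "atggaggagccg"
)
print(example)
```

During training, a string like `example` would be one sample; at inference time only the text prefix is supplied and the nucleotides are sampled.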

GenNA sits alongside generative genomic models such as DNAGPT, GENERator, and Evo, but differs in its explicit multimodal coupling of free-text functional guidance with sequence generation across a broad eukaryotic corpus rather than relying solely on special organism tokens or unconditional autoregression.

Key Features

Natural-language-guided generation: Text prompts describing species, gene name, and functional annotation are prepended to the sequence, allowing controllable nucleotide generation across tasks ranging from open-ended to highly constrained.
Multimodal nucleotide-and-text training: A joint training strategy lets the model learn sequence patterns while capturing relationships among sequence features, functional semantics, and species context.
Broad eukaryotic coverage: Pretraining spans 2,221 eukaryotic species across major lineages, supporting cross-species generation and generalization.
Genomic and transcriptomic scope: The corpus combines genomic and transcriptomic sequences, so the model spans both DNA-level and RNA-level (transcript) contexts.
Multiple model scales: The model is released in more than one size, including a smaller ~0.36B-parameter variant alongside the ~3.6B-parameter flagship, trading compute against capacity.

Technical Details

GenNA is a decoder-only (causal) transformer trained with a next-token prediction objective. The flagship model has approximately 3.6 billion parameters, with a smaller ~0.36B variant also reported. It uses character-level tokenization and a context window of roughly 20,000 raw characters, sufficient to cover full-length contexts for most transcripts and many genomic loci.

Pretraining draws on a multimodal corpus covering 2,221 eukaryotic species and totaling approximately 416 billion characters. Natural-language prompts (species identity, gene names, functional annotations) precede each nucleotide sequence, so the model conditions generation on functional semantics.

To evaluate whether generated and scored sequences respect biological constraints, the authors used perplexity-based analyses, for example testing single-nucleotide substitutions and single-base deletions across untranslated regions (5′ and 3′ UTRs) of protein-coding genes in an independent validation set. They report that the model assigns higher perplexity to sequence–function mismatches, indicating sensitivity to functionally meaningful changes.
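As a rough sketch of the mechanics described here, the snippet below shows character-level tokenization, a check against the approximate 20,000-character window, and perplexity computed from per-token log-probabilities. The id mapping (Unicode code points) and all function names are assumptions made for illustration; the source does not describe GenNA's vocabulary or tooling.

```python
import math

CONTEXT_CHARS = 20_000  # approximate window reported for GenNA


def char_tokenize(text: str) -> list[int]:
    """One token per character. Using code points as ids is illustrative only."""
    return [ord(ch) for ch in text]


def fits_context(prompt: str, sequence: str, limit: int = CONTEXT_CHARS) -> bool:
    """With character-level tokens, prompt and sequence share one character budget."""
    return len(prompt) + len(sequence) <= limit


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every base is no better than
# uniform guessing over A/C/G/T, which corresponds to a perplexity of about 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```

Lower perplexity means the model finds the sequence more expected given its prompt, which is why a rise in perplexity after a mutation can be read as a signal of a sequence–function mismatch.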

Applications

GenNA is aimed at researchers who want to design or interrogate nucleotide sequences with explicit functional or species context. Synthetic biologists can prompt the model in natural language to generate candidate regulatory elements, coding sequences, or transcripts conditioned on a target species and function, supporting design-of-experiments workflows. Because the model couples text with sequence, it can also be used to score how well a sequence matches a stated function, helping prioritize variants or flag function-disrupting mutations in UTRs and coding regions. Comparative and functional genomics groups benefit from the broad eukaryotic training corpus when working with species that are poorly represented in single-organism models.
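The variant-prioritization use case can be sketched as follows: enumerate single-base substitutions and deletions, score each against the same prompt, and rank by the change in score. The `score` callable stands in for a perplexity function backed by a model; `toy_score` is a placeholder invented for this example and has no biological meaning.

```python
from typing import Callable, Iterator

BASES = "ACGT"


def single_substitutions(seq: str) -> Iterator[tuple[str, str]]:
    """Yield (label, variant) for every single-nucleotide substitution."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield f"{ref}{i + 1}{alt}", seq[:i] + alt + seq[i + 1:]


def single_deletions(seq: str) -> Iterator[tuple[str, str]]:
    """Yield (label, variant) for every single-base deletion."""
    for i, ref in enumerate(seq):
        yield f"del{ref}{i + 1}", seq[:i] + seq[i + 1:]


def rank_variants(
    prompt: str, seq: str, score: Callable[[str, str], float]
) -> list[tuple[str, float]]:
    """Rank variants by score increase over the reference, largest first."""
    base = score(prompt, seq)
    variants = list(single_substitutions(seq)) + list(single_deletions(seq))
    deltas = [(label, score(prompt, var) - base) for label, var in variants]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


def toy_score(prompt: str, seq: str) -> float:
    """Placeholder, NOT a model: higher score when the ATG motif is lost."""
    return 1.0 if "ATG" in seq else 2.0


for label, delta in rank_variants("Species: example.", "ATGC", toy_score)[:3]:
    print(label, delta)
```

In practice `toy_score` would be replaced by a function that returns the model's perplexity for the sequence conditioned on the prompt; the ranking logic stays the same.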

Impact

GenNA extends the trend of adapting autoregressive language models to genomics by adding an explicit natural-language control channel over a large multi-species eukaryotic corpus, moving from representation learning and unconditional generation toward instruction-style, function-aware sequence design. Because GenNA is a recent preprint (released under a CC BY license), its long-term adoption and downstream influence remain to be established, and independent benchmarking against established generative genomic models will be needed to position its capabilities. Practical limitations include the context ceiling of roughly 20,000 characters, which limits how much of a very long genomic region the model can see at once, and the dependence of conditional generation quality on the coverage and accuracy of the natural-language annotations available for a given species or gene.

Citation

GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations

Shen, Y., et al. (2026) GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations. bioRxiv.

DOI: 10.64898/2026.04.22.720063

Citations

Total citations: 0

Influential: 0

References: 43

Openness

bio.rodeo openness: Closed (score 16), low usability and reproducibility

Usability (can I run it?): 15

Reproducibility (can I retrain it?): 0, not reproducible

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
