Gengram

Retrieval-augmented genomic foundation model that gives transformer backbones a hash-based k-mer motif memory for functional genomics tasks.

Released: January 2026

Parameters: 10 Billion

Genomic foundation models (GFMs) learn DNA "grammar" implicitly, spending dense neural computation to approximate the recurring short motifs — transcription-factor binding sites, splice signals, codon patterns — that are the basic vocabulary of functional genomics. Gengram, released by Zhejiang Lab in January 2026, takes a different stance: rather than forcing the network to rediscover these motifs in its weights, it equips the model with an explicit, retrievable memory of short k-mers. The accompanying preprint, "Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram," frames this as giving the model a lookup primitive for genomic "syntax."

Concretely, Gengram is a conditional memory module that stores embeddings for multi-base motifs (k-mers of length 1 to 6) in hash-indexed tables and retrieves them during the forward pass via a genomic-specific hashing scheme. Because the lookup is a hash table access rather than additional attention, it adds explicit motif knowledge with linear time complexity and minimal overhead. The module is architecture-agnostic and slots into existing transformer backbones using multi-head attention (MHA), grouped-query attention (GQA), or multi-head latent attention (MLA).
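As a rough illustration of the lookup described above, the following minimal sketch maps every k-mer (k = 1 to 6) ending at a position to a slot in a per-k table and sums the retrieved vectors. It is not the authors' implementation: the class name KmerMemory, the multiplicative hash, the table size, the embedding width, and the sum aggregation are all assumptions made for illustration, and the tables hold random values where a trained model would hold learned embeddings.

```python
import random

BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumes unambiguous bases only


def kmer_id(kmer: str) -> int:
    """Encode a k-mer as a base-4 integer (two bits per base)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_TO_CODE[base]
    return value


def hash_slot(kmer: str, table_size: int) -> int:
    """Map a k-mer to a table slot with a simple multiplicative hash.

    The odd constant is an arbitrary illustrative choice, not the paper's
    hashing scheme. Collisions are possible when 4**k exceeds table_size.
    """
    return (kmer_id(kmer) * 2654435761) % table_size


class KmerMemory:
    """Hypothetical hash-indexed memory: one embedding table per k-mer length."""

    def __init__(self, dim=8, table_size=1024, max_k=6, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.table_size = table_size
        self.max_k = max_k
        # Random stand-ins for learned embeddings.
        self.tables = {
            k: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(table_size)]
            for k in range(1, max_k + 1)
        }

    def lookup(self, seq: str):
        """Return one vector per position: the sum of the embeddings of all
        k-mers (k = 1..max_k) that end at that position."""
        out = []
        for i in range(len(seq)):
            vec = [0.0] * self.dim
            for k in range(1, self.max_k + 1):
                start = i - k + 1
                if start < 0:
                    break  # not enough left context for this k
                row = self.tables[k][hash_slot(seq[start:i + 1], self.table_size)]
                vec = [a + b for a, b in zip(vec, row)]
            out.append(vec)
        return out


memory = KmerMemory()
vectors = memory.lookup("ACGTACGTACGT")
print(len(vectors), len(vectors[0]))
```

Each position triggers at most six table reads regardless of sequence length, which is where the linear time complexity comes from. In a real backbone the retrieved vectors would be combined with the hidden states; how Gengram does that is not covered by this sketch.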

The flagship release, Gengram-10B, is a 10-billion-parameter model (about 2.87B activated parameters under a mixture-of-experts design) pretrained on roughly 300B tokens. It is distinguished from other large genomic models — including Zhejiang Lab's own Genos and the Evo2 family — by treating motif memory, not just scale, as a first-class modeling element. Weights and code are released openly under Apache 2.0.

Key Features

Explicit motif memory: Hash-indexed lookup tables store learned embeddings for k-mers of length 1 to 6, letting the model directly reference conserved motifs instead of approximating them in dense weights.
Architecture-agnostic module: Gengram integrates into transformer backbones built on MHA, GQA, or MLA attention, so it can augment a range of existing GFMs rather than requiring a bespoke architecture.
Efficient retrieval: The genomic-specific hashing scheme adds motif knowledge with linear time complexity and low computational overhead, avoiding the cost of extra attention layers.
Biologically structured memory: A 21-bp window aligned to DNA helical structure drives local aggregation, and reverse-complement symmetry in the memory embeddings adds interpretability and respects strand symmetry.
Open weights: Gengram-10B and its code are released under the permissive Apache 2.0 license on Hugging Face and GitHub.
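The two structural ideas in the "Biologically structured memory" feature can be sketched in a few lines. The sketch rests on stated assumptions: canonicalizing each k-mer to the lexicographically smaller of itself and its reverse complement is one common way to make both strands share a memory slot, and a centered mean over 21 positions is one simple form of local aggregation (21 bp is about two turns of B-form DNA at roughly 10.5 bp per turn). The source does not say that Gengram uses either mechanism, so the function names and choices below are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmer(kmer: str) -> str:
    """Pick one representative per strand pair so that a k-mer and its
    reverse complement would index the same memory entry (an assumption)."""
    return min(kmer, reverse_complement(kmer))


def window_mean(values, window=21):
    """Centered moving average over `window` positions, truncated at the
    sequence edges. Stands in for the 21-bp local aggregation."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out


print(reverse_complement("GATTACA"))  # TGTAATC
print(canonical_kmer("TGTAATC"))      # GATTACA
print(window_mean([float(i) for i in range(30)])[15])  # 15.0
```

With canonicalization, a motif read from either strand retrieves the same vector, which is the sense in which a memory can respect strand symmetry by construction.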

Technical Details

Gengram-10B uses a mixture-of-experts (MoE) transformer with roughly 2.87B activated parameters out of 10B total, pretrained on about 300B tokens drawn from human genome assemblies and reference sequences, including HPRC Release 2, GRCh38, CHM13, and NCBI RefSeq, with sequencing data contributed by CycloneSEQ. Beyond supplying motif lookups, the Gengram memory module is reported to improve MoE load balancing. Integrating Gengram into state-of-the-art GFMs yields reported gains of up to 14% on several functional genomics tasks. On the reported benchmarks, Gengram-10B scores 0.9832 on multi-species exon classification, 0.9009 on splice-site identification, and 0.7714 on Human OCR (Ensembl), evaluated against baselines such as Genos-10B and Evo2-40B.
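The headline figures imply a few ratios that are useful when comparing Gengram-10B with dense models. The short calculation below uses only the nominal numbers quoted above, which the text itself marks as approximate, so the results are rough as well.

```python
# Nominal figures quoted in the text ("about" and "roughly", so approximate).
TOTAL_PARAMS = 10e9
ACTIVE_PARAMS = 2.87e9
PRETRAIN_TOKENS = 300e9

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
tokens_per_total_param = PRETRAIN_TOKENS / TOTAL_PARAMS
tokens_per_active_param = PRETRAIN_TOKENS / ACTIVE_PARAMS

print(f"active fraction: {active_fraction:.1%}")                  # 28.7%
print(f"tokens per total param: {tokens_per_total_param:.1f}")    # 30.0
print(f"tokens per active param: {tokens_per_active_param:.1f}")  # 104.5
```

In other words, a little under a third of the weights take part in any single forward pass, and the model sees about 30 training tokens per total parameter.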

Applications

Gengram targets functional genomics workflows where short sequence motifs carry disproportionate signal: predicting splice sites and exon structure, scoring open chromatin and regulatory elements, and more generally embedding DNA for downstream classifiers. Because the memory module is architecture-agnostic, model developers can graft it onto existing transformer-based GFMs to recover motif-level accuracy without redesigning the backbone, and the open Apache-2.0 release makes it directly usable by genomics labs and method developers for fine-tuning and benchmarking.

Impact

Gengram contributes to a broader shift in genomic AI away from pure scaling and toward explicit, retrievable biological priors, echoing retrieval-augmentation trends in language modeling. By showing that a lightweight hash-based motif memory can deliver gains of up to 14% and match or approach much larger models such as Evo2-40B on specific tasks, it argues that well-chosen inductive structure can substitute for raw parameter count. The permissive open-weights release lowers the barrier for other groups to adopt or extend the approach. The work is an arXiv preprint and its results await peer review. The reported benchmarks are concentrated on the authors' chosen functional genomics suite, so broader independent evaluation will determine how generally the motif-memory advantage holds.

Citation

Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram

Preprint

Xu, H., et al. (2026) Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram. arXiv.org.

DOI: 10.48550/arXiv.2601.22203

Citations

Total Citations: 0

Influential: 0

References: 36

GitHub

Stars: 51

Forks: 4

Open Issues: 0

Contributors: 2

Last Push: 4mo ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 0

Likes: 2

Last Modified: 5mo ago

Openness

bio.rodeo openness: Fully open (usable and reproducible)

Openness score: 83 (Open)

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 62

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository Research Paper HuggingFace Model
