Retrieval-augmented genomic foundation model that adds an explicit hash-based k-mer motif memory to transformer backbones, gaining up to 14% on functional genomics tasks.
Genomic foundation models (GFMs) learn DNA "grammar" implicitly, spending dense neural computation to approximate the recurring short motifs — transcription-factor binding sites, splice signals, codon patterns — that are the basic vocabulary of functional genomics. Gengram, released by Zhejiang Lab in January 2026, takes a different stance: rather than forcing the network to rediscover these motifs in its weights, it equips the model with an explicit, retrievable memory of short k-mers. The accompanying preprint, "Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram," frames this as giving the model a lookup primitive for genomic "syntax."
Concretely, Gengram is a conditional memory module that stores embeddings for short motifs (k-mers of length 1 to 6) in hash-indexed tables and retrieves them during the forward pass via a genomics-specific hashing scheme. Because each retrieval is a hash-table access rather than additional attention, the module adds explicit motif knowledge at a cost that grows only linearly with sequence length, with minimal overhead. The module is architecture-agnostic and slots into existing transformer backbones that use multi-head attention (MHA), grouped-query attention (GQA), or multi-head latent attention (MLA).
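To make the lookup idea concrete, here is a minimal pure-Python sketch of a hashed k-mer memory. It is not the preprint's implementation. The class name `KmerMemory`, the helper `kmer_id`, the multiplicative hash constant, the table sizes, the choice to sum the retrieved vectors, and the rule of skipping k-mers that contain non-ACGT bases are all assumptions made for illustration.

```python
import random

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_id(kmer):
    """Encode a k-mer as a base-4 integer; return None if it has a non-ACGT base."""
    value = 0
    for base in kmer:
        if base not in BASES:
            return None
        value = value * 4 + BASES[base]
    return value


class KmerMemory:
    """Toy hashed embedding tables, one table per k-mer length (hypothetical layout)."""

    def __init__(self, max_k=6, num_slots=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.max_k = max_k
        self.dim = dim
        # Short k-mers fit in a table of size 4**k; longer ones share num_slots
        # slots, so distinct k-mers may collide (assumed design, not the paper's).
        self.tables = {
            k: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(min(num_slots, 4 ** k))]
            for k in range(1, max_k + 1)
        }

    def slot(self, kmer):
        """Map a k-mer to a row index in the table for its length."""
        kid = kmer_id(kmer)
        if kid is None:
            return None
        # Simple multiplicative hash; the constant is an arbitrary odd number.
        return (kid * 2654435761) % len(self.tables[len(kmer)])

    def lookup(self, sequence):
        """Return one vector per position: the sum of the embeddings of the
        k-mers (k = 1..max_k) that end at that position."""
        sequence = sequence.upper()
        outputs = []
        for end in range(len(sequence)):
            vec = [0.0] * self.dim
            for k in range(1, self.max_k + 1):
                start = end - k + 1
                if start < 0:
                    break  # not enough left context for this k
                idx = self.slot(sequence[start:end + 1])
                if idx is None:
                    continue  # skip k-mers containing ambiguous bases
                row = self.tables[k][idx]
                vec = [v + r for v, r in zip(vec, row)]
            outputs.append(vec)
        return outputs


memory = KmerMemory()
vectors = memory.lookup("ACGTACGTAC")
print(len(vectors), len(vectors[0]))
```

Each position performs at most `max_k` table reads, so the total work is proportional to sequence length times a small constant, which is the linear-cost property described above. In a real model the retrieved vector would presumably be combined with the backbone's hidden state at that position; how Gengram does this is not specified here.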
The flagship release, Gengram-10B, is a 10-billion-parameter model (about 2.87B activated parameters under a mixture-of-experts design) pretrained on roughly 300B tokens. It is distinguished from other large genomic models — including Zhejiang Lab's own Genos and the Evo2 family — by treating motif memory, not just scale, as a first-class modeling element. Weights and code are released openly under Apache 2.0.
Gengram-10B uses a mixture-of-experts transformer with roughly 2.87B activated parameters out of 10B total. Its pretraining corpus of about 300B tokens is drawn from human genome assemblies and reference collections, including HPRC Release 2, GRCh38, CHM13, and NCBI RefSeq, with sequencing data contributed by CycloneSEQ. Beyond supplying motif lookups, the Gengram memory module is reported to improve MoE load balancing. Integrating Gengram into state-of-the-art GFMs yields gains of up to 14% across several functional genomics tasks. On the reported benchmarks, which compare against baselines such as Genos-10B and Evo2-40B, Gengram-10B scores 0.9832 on multi-species exon classification, 0.9009 on splice-site identification, and 0.7714 on Human OCR (Ensembl).
Gengram targets functional genomics workflows where short sequence motifs carry disproportionate signal: predicting splice sites and exon structure, scoring open chromatin and regulatory elements, and more generally embedding DNA for downstream classifiers. Because the memory module is architecture-agnostic, model developers can graft it onto existing transformer-based GFMs to recover motif-level accuracy without redesigning the backbone, and the open Apache-2.0 release makes it directly usable by genomics labs and method developers for fine-tuning and benchmarking.
Gengram contributes to a broader shift in genomic AI away from pure scaling and toward explicit, retrievable biological priors, echoing retrieval-augmentation trends in language modeling. By showing that a lightweight hash-based motif memory can deliver gains of up to 14% and match or approach much larger models such as Evo2-40B on specific tasks, it argues that well-chosen inductive structure can partly substitute for raw parameter count. The permissive open-weights release lowers the barrier for other groups to adopt or extend the approach. The work is an arXiv preprint, so its results await peer review, and the reported benchmarks are concentrated on the authors' chosen functional genomics suite. Broader independent evaluation will determine how generally the motif-memory advantage holds.