bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
DNA & Gene

Gengram

Zhejiang Lab

Retrieval-augmented genomic foundation model that adds an explicit hash-based k-mer motif memory to transformer backbones, gaining up to 14% on functional genomics tasks.

Released: January 2026
Parameters: 10 Billion

Genomic foundation models (GFMs) learn DNA "grammar" implicitly, spending dense neural computation to approximate the recurring short motifs — transcription-factor binding sites, splice signals, codon patterns — that are the basic vocabulary of functional genomics. Gengram, released by Zhejiang Lab in January 2026, takes a different stance: rather than forcing the network to rediscover these motifs in its weights, it equips the model with an explicit, retrievable memory of short k-mers. The accompanying preprint, "Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram," frames this as giving the model a lookup primitive for genomic "syntax."

Concretely, Gengram is a conditional memory module that stores embeddings for short motifs (k-mers of length 1 to 6) in hash-indexed tables and retrieves them during the forward pass via a genomic-specific hashing scheme. Because each lookup is a hash-table access rather than additional attention, it adds explicit motif knowledge at a cost that grows linearly with sequence length and with minimal overhead. The module is architecture-agnostic and slots into existing transformer backbones using multi-head attention (MHA), grouped-query attention (GQA), or multi-head latent attention (MLA).
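To make the lookup idea concrete, here is a minimal sketch in plain Python. It is not Gengram's implementation: the class `KmerMemory`, the helper `kmer_id`, the table size, the embedding width, and the modulo hash are all assumptions chosen for illustration, standing in for the genomic-specific hashing scheme, which this page does not describe in detail. Each position sums the embeddings of the k-mers (k = 1 to 6) that end there, so the work per position is bounded by a constant and the total cost is linear in sequence length.

```python
import random

BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_id(kmer):
    """Encode a k-mer over ACGT as a base-4 integer (unique for a fixed k)."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_TO_INT[base]
    return value


class KmerMemory:
    """Toy motif memory with one embedding table per k-mer length (hypothetical)."""

    def __init__(self, max_k=6, table_size=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.max_k = max_k
        self.dim = dim
        self.tables = {}
        for k in range(1, max_k + 1):
            # Tables are capped at table_size rows, so long k-mers share slots.
            rows = min(4 ** k, table_size)
            self.tables[k] = [
                [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)
            ]

    def slot(self, kmer):
        """Map a k-mer to a row index (modulo hash, an assumed stand-in)."""
        return kmer_id(kmer) % len(self.tables[len(kmer)])

    def lookup(self, seq):
        """Return one vector per position: the sum over k-mers ending there."""
        out = []
        for end in range(1, len(seq) + 1):
            vec = [0.0] * self.dim
            for k in range(1, min(self.max_k, end) + 1):
                kmer = seq[end - k:end]
                if any(base not in BASE_TO_INT for base in kmer):
                    continue  # skip k-mers containing N or other ambiguity codes
                row = self.tables[k][self.slot(kmer)]
                vec = [a + b for a, b in zip(vec, row)]
            out.append(vec)
        return out


memory = KmerMemory()
vectors = memory.lookup("ACGTNACGT")
print(len(vectors), len(vectors[0]))      # 9 positions, 8 dimensions each
print(sum(4 ** k for k in range(1, 7)))   # 5460 distinct k-mers of length 1 to 6
```

Only 5,460 distinct k-mers of length 1 to 6 exist over A, C, G, and T, so in this toy setting the tables stay small. In a real model the retrieved vectors would be learned and merged into the backbone's hidden states, a step this sketch leaves out.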

The flagship release, Gengram-10B, is a 10-billion-parameter model (about 2.87B activated parameters under a mixture-of-experts design) pretrained on roughly 300B tokens. It is distinguished from other large genomic models — including Zhejiang Lab's own Genos and the Evo2 family — by treating motif memory, not just scale, as a first-class modeling element. Weights and code are released openly under Apache 2.0.

#Key Features

  • Explicit motif memory: Hash-indexed lookup tables store learned embeddings for k-mers of length 1 to 6, letting the model directly reference conserved motifs instead of approximating them in dense weights.
  • Architecture-agnostic module: Gengram integrates into transformer backbones built on MHA, GQA, or MLA attention, so it can augment a range of existing GFMs rather than requiring a bespoke architecture.
  • Efficient retrieval: The genomic-specific hashing scheme adds motif knowledge with linear time complexity and low computational overhead, avoiding the cost of extra attention layers.
  • Biologically structured memory: A 21-bp window aligned to DNA helical structure (about two turns of B-form DNA at roughly 10.5 bp per turn) drives local aggregation, and reverse-complement symmetry in the memory embeddings adds interpretability and respects strand symmetry.
  • Open weights: Gengram-10B and its code are released under the permissive Apache 2.0 license on Hugging Face and GitHub.
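The two ideas in the "Biologically structured memory" feature can be illustrated with a short sketch. It rests on assumptions: mapping each k-mer and its reverse complement to one canonical key is just one simple way to obtain strand-symmetric memory entries, and a centered moving average is just one simple form of 21-bp local aggregation. This page does not say whether Gengram does either of these exactly, and the function names below are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmer(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Both strands then share a single memory key (an assumed design).
    """
    return min(kmer, reverse_complement(kmer))


def window_mean(values, window=21):
    """Centered moving average; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


print(reverse_complement("GATTACA"))   # TGTAATC
print(canonical_kmer("TGTAATC"))       # GATTACA
smoothed = window_mean([float(i) for i in range(30)])
print(smoothed[15])                    # 15.0, the mean of positions 5 through 25
```

With canonical keys, a motif read from either strand retrieves the same entry. That is the practical meaning of strand symmetry in this sketch.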

#Technical Details

Gengram-10B uses a mixture-of-experts (MoE) transformer with roughly 2.87B activated parameters out of 10B total. It was pretrained on about 300B tokens drawn from human and reference genome assemblies, including HPRC Release 2, GRCh38, CHM13, and NCBI RefSeq, with sequencing data contributed by CycloneSEQ. Beyond supplying motif lookups, the Gengram memory module is reported to improve MoE load balancing. Integrating Gengram into state-of-the-art GFMs yields gains of up to 14% on several functional genomics tasks. On the reported benchmarks, Gengram-10B scores 0.9832 on multi-species exon classification, 0.9009 on splice-site identification, and 0.7714 on Human OCR (open chromatin regions, Ensembl), evaluated against baselines such as Genos-10B and Evo2-40B.
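As a quick sanity check on the scale figures above, the arithmetic below works out the activated fraction and the tokens-per-parameter ratios. The inputs are the rounded values quoted on this page ("roughly" 2.87B, "about" 300B), so the results are approximate. The variable names are illustrative only.

```python
total_params = 10e9      # total parameters (10B)
active_params = 2.87e9   # activated parameters per token (about 2.87B)
train_tokens = 300e9     # pretraining tokens (about 300B)

active_fraction = active_params / total_params
tokens_per_total_param = train_tokens / total_params
tokens_per_active_param = train_tokens / active_params

print(f"activated fraction: {active_fraction:.1%}")                   # 28.7%
print(f"tokens per total parameter: {tokens_per_total_param:.0f}")    # 30
print(f"tokens per active parameter: {tokens_per_active_param:.1f}")  # 104.5
```

By these rounded figures, a little under 29% of the weights take part in any single forward step, which is the usual MoE trade of a large total capacity against a smaller per-token compute cost.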

#Applications

Gengram targets functional genomics workflows where short sequence motifs carry disproportionate signal: predicting splice sites and exon structure, scoring open chromatin and regulatory elements, and more generally embedding DNA for downstream classifiers. Because the memory module is architecture-agnostic, model developers can graft it onto existing transformer-based GFMs to recover motif-level accuracy without redesigning the backbone, and the open Apache-2.0 release makes it directly usable by genomics labs and method developers for fine-tuning and benchmarking.

#Impact

Gengram contributes to a broader shift in genomic AI away from pure scaling toward explicit, retrievable biological priors, echoing retrieval-augmentation trends in language modeling. By showing that a lightweight hash-based motif memory can deliver double-digit gains and even match or approach much larger models like Evo2-40B on specific tasks, it argues that smart inductive structure can substitute for raw parameter count. The permissive open-weights release lowers the barrier for other groups to adopt or extend the approach. As an arXiv preprint, its results await peer review, and reported benchmarks are concentrated on the authors' chosen functional genomics suite, so broader independent evaluation will determine how generally the motif-memory advantage holds.

GitHub

Stars: 48
Forks: 4
Open Issues: 0
Contributors: 2
Last Push: 2mo ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 0
Likes: 2
Last Modified: 4mo ago

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 83 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 62
Model Openness Framework: Unclassified (missing required components)

Tags

variant_effect_prediction, regulatory_element_prediction, transformer, mixture_of_experts, foundation_model, retrieval_augmented, genomics, dna

Resources

GitHub Repository
Research Paper
HuggingFace Model