Hugging Face / Beijing Zhongguancun Academy / TIGEM
An open autoregressive genomic foundation model (0.5B–8B params) with a 6-mer DNA tokenizer, matching Evo2-7B win rates at far higher throughput.
Carbon (the Carbon Genomic Foundation Model) is an open autoregressive DNA foundation model released in 2026 through a collaboration between Hugging Face, the Zhongguancun Academy in Beijing, and TIGEM (Telethon Institute of Genetics and Medicine) at Università di Napoli Federico II. It is built to model genomic sequence at scale while remaining fully open — model weights, training and inference code, the training corpus, and a technical report are all released under Apache 2.0.
Carbon enters a landscape defined by large genomic models such as Evo 2, and positions itself explicitly as an efficient, fully reproducible alternative. Where many high-performing genomic models rely on specialized hybrid architectures or single-nucleotide tokenization, Carbon takes a deliberately conventional decoder-only transformer and pairs it with a 6-mer DNA tokenizer that shortens sequences roughly sixfold. The result is a model family reported to match the zero-shot quality of larger systems at much higher throughput: the 3B variant reportedly matches the win rate of Evo2-7B at roughly 275x the throughput, and can score the entire human genome on a single GPU in under two days.
The family ships in three sizes: Carbon-500M (a 0.5B-parameter draft model intended for speculative decoding), Carbon-3B (the flagship), and Carbon-8B (the largest variant). All are trained without task-specific labels, learning genomic structure directly from a curated, annotation-aware mixture of sequence data.
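To make the role of a draft model concrete, the following is a minimal sketch of greedy speculative decoding. It is not Carbon's implementation, which the source does not describe: `toy_target` and `toy_draft` are hypothetical stand-ins for a large and a small model, tokens are plain integers, and the verification step is written as a loop where a real system would use one batched forward pass of the large model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding, one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: same output as greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks the proposal position by position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != tok:
                break                  # first mismatch: discard the rest
        out.extend(accepted)
    return out


# Hypothetical toy models (assumptions for illustration only).
def toy_target(prefix: List[int]) -> int:
    return (3 * sum(prefix) + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 5.
    guess = toy_target(prefix)
    return guess if len(prefix) % 5 else (guess + 1) % 7


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [1, 2, 3], max_new=20)
    slow = greedy_decode(toy_target, [1, 2, 3], max_new=20)
    print(fast == slow)
```

The speed-up in practice comes from the target model verifying several drafted tokens in a single pass; the output is unchanged because every kept token is the one the target itself would have produced. Sampling-based variants replace the equality check with a rejection-sampling rule.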
The 6-mer scheme compresses DNA roughly 6x and, because self-attention cost grows quadratically with sequence length, cuts attention cost by up to 36x (6² = 36).
Carbon is a decoder-only autoregressive transformer trained on approximately 1 trillion tokens (~6 trillion DNA base pairs) drawn from the Carbon Pretraining Corpus. The corpus is an annotation-aware mixture: eukaryotic functional genomic regions from RefSeq, spliced mRNA transcripts from OpenGenome2, and roughly 10% prokaryotic genomes from GTDB v220 and IMG/PR. Training emphasizes functional regions over genome background. Two training-time innovations support the coarse tokenizer: a Factorized Nucleotide Supervision (FNS) loss that grants partial credit for near-miss 6-mer predictions, and a base-level inference scheme that marginalizes over tokens to produce per-base probabilities for scoring and generation.
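The tokenizer arithmetic and the idea of marginalizing token probabilities down to bases can be sketched as follows. This is an illustration under stated assumptions, not the released tokenizer or loss: the vocabulary is assumed to be all 4^6 = 4096 6-mers in lexicographic order, special tokens and ambiguous bases are ignored, a trailing remainder shorter than six bases is simply dropped, and `per_base_nll` is a generic per-base loss that shows how partial credit can arise, not the paper's FNS definition.

```python
import itertools

import numpy as np

K = 6
BASES = "ACGT"
# Assumed vocabulary: every 6-mer, in lexicographic order (hypothetical layout).
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=K)]
KMER_TO_ID = {kmer: i for i, kmer in enumerate(KMERS)}


def tokenize(seq):
    """Split a DNA string into non-overlapping 6-mer token ids."""
    usable = len(seq) - len(seq) % K  # drop a trailing partial k-mer
    return [KMER_TO_ID[seq[i:i + K]] for i in range(0, usable, K)]


def per_base_probs(token_probs):
    """Marginalize a (4096,) distribution over 6-mers into a (6, 4) array.

    Row j is the distribution over A, C, G, T for the base at offset j.
    """
    # One axis per base position; axis 0 is the first base of the 6-mer.
    cube = np.asarray(token_probs, dtype=float).reshape((len(BASES),) * K)
    rows = []
    for j in range(K):
        other_axes = tuple(a for a in range(K) if a != j)
        rows.append(cube.sum(axis=other_axes))
    return np.stack(rows)


def per_base_nll(token_probs, target_kmer):
    """Mean negative log-probability of each target base under the marginals."""
    marg = per_base_probs(token_probs)
    idx = [BASES.index(b) for b in target_kmer]
    return float(-np.mean(np.log(marg[np.arange(K), idx])))


def peaked(kmer, mass=0.9):
    """Toy prediction: `mass` on one 6-mer, the rest spread uniformly."""
    probs = np.full(len(KMERS), (1.0 - mass) / len(KMERS))
    probs[KMER_TO_ID[kmer]] += mass
    return probs


seq = "ACGTAC" * 4
ids = tokenize(seq)
print(len(seq), "bases ->", len(ids), "tokens")
print("attention-pair ratio:", len(seq) ** 2 / len(ids) ** 2)

# A prediction one base off the target is penalized less than a distant one.
near = per_base_nll(peaked("ACGTAA"), "ACGTAC")
far = per_base_nll(peaked("TTTTTT"), "ACGTAC")
print("near-miss loss < distant loss:", near < far)
```

A plain cross-entropy over the 4096 tokens would treat both wrong predictions above as equally wrong; scoring at the base level is one way to distinguish them.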
The three variants (500M, 3B, 8B) share the same architectural template — RMSNorm, SwiGLU, RoPE, grouped-query attention, tied embeddings — differing in depth and width. Across eight training-free (zero-shot) tasks, Carbon-8B leads, while Carbon-3B matches Evo2-7B quality at far lower cost. Evaluated tasks include variant effect prediction (BRCA2 variants, ClinVar non-coding variants), sequence recovery, perturbation analyses (CAG/triplet repeat expansion, synonymous-codon substitution), and long-context retrieval (Genome-NIAH at 393 kbp).
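Two of the architectural components named above, RMSNorm and the SwiGLU feed-forward block, are compact enough to sketch in NumPy. The definitions follow the standard formulations of these layers; the dimensions (`d_model = 8`, `d_ff = 16`) and random weights are toy values chosen for illustration and say nothing about Carbon's actual configuration.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square over the last axis, then scale.

    Unlike LayerNorm, no mean is subtracted and there is no bias term.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # toy sizes, not Carbon's
x = rng.normal(size=(3, d_model))  # 3 token positions

h = rms_norm(x, np.ones(d_model))
y = swiglu_ffn(
    h,
    rng.normal(size=(d_model, d_ff)),
    rng.normal(size=(d_model, d_ff)),
    rng.normal(size=(d_ff, d_model)),
)
print(h.shape, y.shape)
```

After `rms_norm` with unit weights, each row has a mean square close to 1, which is the property the layer is designed to enforce.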
Carbon supports a broad range of training-free genomic analyses without fine-tuning. Clinical and functional genomics researchers can use its zero-shot variant scoring to prioritize coding and non-coding variants of uncertain significance, including BRCA2 and ClinVar non-coding cases. Its perturbation-analysis capabilities suit studies of repeat expansion disorders and synonymous-codon effects, while long-context retrieval enables reasoning over large regulatory regions and multi-gene loci. The 500M draft model is intended to accelerate inference through speculative decoding, making genome-scale scoring practical on modest hardware. Because the full corpus, weights, and code are open under Apache 2.0, Carbon is well-suited as a reproducible backbone for downstream research and fine-tuning.
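Zero-shot variant scoring with an autoregressive model is commonly done by comparing the model's log-likelihood of the variant sequence against that of the reference. The sketch below assumes that approach; the source does not spell out Carbon's exact scoring recipe. The `LogLikelihood` interface and `toy_loglik` scorer are hypothetical placeholders for a real model call.

```python
import math
from typing import Callable

# Hypothetical interface: total log-likelihood of a DNA string under some model.
LogLikelihood = Callable[[str], float]


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the single base at 0-based `pos` changed to `alt`."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def variant_score(loglik: LogLikelihood, ref_seq: str,
                  pos: int, ref: str, alt: str) -> float:
    """Delta log-likelihood: log P(variant sequence) - log P(reference sequence).

    More negative values mean the model finds the variant less plausible.
    """
    return loglik(apply_snv(ref_seq, pos, ref, alt)) - loglik(ref_seq)


def toy_loglik(seq: str) -> float:
    # Stand-in scorer with fixed per-base probabilities; replace with a model.
    p = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
    return sum(math.log(p[b]) for b in seq)


if __name__ == "__main__":
    window = "ACGTACGT"
    print(variant_score(toy_loglik, window, pos=1, ref="C", alt="A"))
```

Variants can then be ranked by this score, with the most negative values flagged for follow-up. In practice the window around the variant would be as long as the model's context allows.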
Carbon's central contribution is to show that a conventional transformer paired with an efficient 6-mer tokenizer can match the zero-shot performance of larger, architecturally specialized genomic models at a fraction of the inference cost. The reported 275x throughput advantage of Carbon-3B over Evo2-7B, together with the ability to process the human genome on a single GPU in under two days, substantially lowers the barrier to genome-scale analysis. Its fully open release of weights, code, and the Carbon Pretraining Corpus makes it one of the more transparent genomic foundation models available. Two caveats apply: the work is so far described only in a preprint ("Carbon: Decoding the Language of Life," bioRxiv 2026) rather than a peer-reviewed publication, and the model is DNA-sequence-only, so multi-modal genomic analyses still require complementary tools.
Allal, L. B., et al. (2026). Carbon: Decoding the Language of Life. bioRxiv. DOI: 10.64898/2026.05.22.727119