bio.rodeo

DNA & Gene

Carbon

Hugging Face / Beijing Zhongguancun Academy / TIGEM

An open autoregressive genomic foundation model (0.5B–8B params) with a 6-mer DNA tokenizer, matching Evo2-7B win rates at far higher throughput.

Released: May 2026
Parameters: 8 Billion

Carbon (the Carbon Genomic Foundation Model) is an open autoregressive DNA foundation model released in 2026 through a collaboration between Hugging Face, the Zhongguancun Academy in Beijing, and TIGEM (Telethon Institute of Genetics and Medicine) at Università di Napoli Federico II. It is built to model genomic sequence at scale while remaining fully open — model weights, training and inference code, the training corpus, and a technical report are all released under Apache 2.0.

Carbon enters a landscape defined by large genomic models such as Evo 2 and positions itself explicitly as an efficient, fully reproducible alternative. Where many high-performing genomic models rely on specialized hybrid architectures or single-nucleotide tokenization, Carbon pairs a deliberately conventional decoder-only transformer with a 6-mer DNA tokenizer that shortens sequences about sixfold. According to the authors, the resulting family matches the zero-shot quality of larger systems at much higher throughput: the 3B variant reportedly matches the win rate of Evo2-7B at roughly 275x the throughput and can score the entire human genome on a single GPU in under two days.

The family ships in three sizes: Carbon-500M (a 0.5B draft model intended for speculative decoding), Carbon-3B (the flagship), and Carbon-8B (the largest variant). All are trained without task-specific labels and learn genomic structure directly from a curated, annotation-aware mixture of sequence data.

#Key Features

  • Hybrid English + DNA tokenizer: A BPE tokenizer for English text is combined with a 6-mer DNA tokenizer (4,096 six-base tokens plus metadata tokens, for a total vocabulary of about 155,776), switched mid-sequence via a <dna> tag. The 6-mer scheme shortens DNA sequences roughly 6x and, because self-attention cost grows quadratically with sequence length, cuts attention cost by up to 36x.
  • Deliberately vanilla architecture: A decoder-only transformer using RMSNorm, SwiGLU, RoPE, grouped-query attention, and tied input/output embeddings — chosen for reproducibility and efficiency rather than novelty.
  • Annotation-aware training corpus: Trained on the Carbon Pretraining Corpus, a curated mixture biased toward functional genomic regions rather than raw background sequence.
  • Long-context inference: Trained at 8k tokens (~49 kbp) and extended to 32k tokens (~197 kbp). At inference, Carbon-8B reaches ~786 kbp via 4x YaRN extrapolation; Carbon-3B supports about 2x.
  • Base-level scoring and generation: A marginalization procedure recovers per-base probabilities from the 6-mer model, enabling single-nucleotide variant scoring and generation despite the coarse tokenization.
  • Efficiency at scale: Carbon-3B matches Evo2-7B win rates at ~275x throughput, processing the full human genome on one GPU in under two days.
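
The tokenizer and context-length figures in the list above follow from simple arithmetic. The sketch below is illustrative only: `kmer_tokenize`, `context_in_bp`, and the id mapping are hypothetical and are not Carbon's actual tokenizer API. It assumes non-overlapping 6-mers, ids assigned in A, C, G, T lexicographic order, and that "8k" and "32k" mean 8,192 and 32,768 tokens.

```python
from itertools import product

K = 6
BASES = "ACGT"  # assumed ordering; the real vocabulary order may differ

# All 4**6 = 4,096 possible 6-mers, mapped to hypothetical token ids.
KMER_TO_ID = {"".join(p): i for i, p in enumerate(product(BASES, repeat=K))}


def kmer_tokenize(seq: str, k: int = K) -> list[int]:
    """Split a DNA string into non-overlapping k-mers and map them to ids.

    Trailing bases that do not fill a whole k-mer are dropped here for
    simplicity; a real tokenizer needs an explicit policy for them.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % k
    return [KMER_TO_ID[seq[i:i + k]] for i in range(0, usable, k)]


def context_in_bp(tokens: int, k: int = K) -> int:
    """Base pairs covered by a context window of `tokens` k-mer tokens."""
    return tokens * k


n_kmers = len(KMER_TO_ID)              # 4096 DNA tokens
ids = kmer_tokenize("ACGTACGTACGT")    # 12 bases become 2 tokens
length_ratio = 12 / len(ids)           # 6.0: sequence is 6x shorter
attention_ratio = length_ratio ** 2    # 36.0: quadratic attention cost

train_bp = context_in_bp(8 * 1024)           # 49,152 bp  (~49 kbp)
extended_bp = context_in_bp(32 * 1024)       # 196,608 bp (~197 kbp)
yarn_2x_bp = context_in_bp(2 * 32 * 1024)    # 393,216 bp (~393 kbp)
yarn_4x_bp = context_in_bp(4 * 32 * 1024)    # 786,432 bp (~786 kbp)
```

The 36x figure is an upper bound on attention savings only; layers whose cost is linear in sequence length benefit by about 6x.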

#Technical Details

Carbon is a decoder-only autoregressive transformer trained on approximately 1 trillion tokens (~6 trillion DNA base pairs) drawn from the Carbon Pretraining Corpus. The corpus is an annotation-aware mixture: eukaryotic functional genomic regions from RefSeq, spliced mRNA transcripts from OpenGenome2, and roughly 10% prokaryotic genomes from GTDB v220 and IMG/PR. Training emphasizes functional regions over genome background. Two techniques compensate for the coarse tokenizer: at training time, a Factorized Nucleotide Supervision (FNS) loss grants partial credit for near-miss 6-mer predictions; at inference time, a base-level scheme marginalizes over tokens to produce per-base probabilities for scoring and generation.
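
To make the base-level idea concrete, here is a minimal sketch of how a distribution over 6-mers could be collapsed into per-base probabilities, and how a loss built on those marginals would give partial credit to near misses. It assumes the model emits one probability per 6-mer at each token position. The function names are hypothetical, and `factorized_nll` is only one plausible reading of "partial credit", not the FNS formulation from the technical report.

```python
import math
from itertools import product

K = 6
BASES = "ACGT"
KMERS = ["".join(p) for p in product(BASES, repeat=K)]  # 4,096 entries


def per_base_marginals(kmer_probs: list[float]) -> list[dict[str, float]]:
    """Collapse a distribution over 6-mers into K per-offset base distributions.

    P(base b at offset j) = sum of P(kmer) over all kmers with kmer[j] == b.
    """
    marginals = [dict.fromkeys(BASES, 0.0) for _ in range(K)]
    for kmer, p in zip(KMERS, kmer_probs):
        for j, base in enumerate(kmer):
            marginals[j][base] += p
    return marginals


def factorized_nll(kmer_probs: list[float], target: str) -> float:
    """Hypothetical FNS-style loss: mean per-base negative log-likelihood.

    The floor of 1e-12 only guards against log(0) in this toy example.
    """
    marginals = per_base_marginals(kmer_probs)
    total = sum(math.log(max(marginals[j][b], 1e-12)) for j, b in enumerate(target))
    return -total / K


# Toy prediction: 90% on ACGTAC, 10% on ACGTAA (differs only in the last base).
probs = [0.0] * len(KMERS)
probs[KMERS.index("ACGTAC")] = 0.9
probs[KMERS.index("ACGTAA")] = 0.1

marginals = per_base_marginals(probs)
token_level_nll = -math.log(probs[KMERS.index("ACGTAA")])  # plain 6-mer cross-entropy
base_level_nll = factorized_nll(probs, "ACGTAA")           # credits 5 correct bases
```

In this toy case the true 6-mer is `ACGTAA`. The token-level loss treats the prediction as almost entirely wrong, while the base-level loss is six times smaller because five of the six positions are predicted with certainty.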

The three variants (500M, 3B, 8B) share the same architectural template — RMSNorm, SwiGLU, RoPE, grouped-query attention, tied embeddings — differing in depth and width. Across eight training-free (zero-shot) tasks, Carbon-8B leads, while Carbon-3B matches Evo2-7B quality at far lower cost. Evaluated tasks include variant effect prediction (BRCA2 variants, ClinVar non-coding variants), sequence recovery, perturbation analyses (CAG/triplet repeat expansion, synonymous-codon substitution), and long-context retrieval (Genome-NIAH at 393 kbp).

#Applications

Carbon supports a broad range of training-free genomic analyses without fine-tuning. Clinical and functional genomics researchers can use its zero-shot variant scoring to prioritize coding and non-coding variants of uncertain significance, including BRCA2 and ClinVar non-coding cases. Its perturbation-analysis capabilities suit studies of repeat expansion disorders and synonymous-codon effects, while long-context retrieval enables reasoning over large regulatory regions and multi-gene loci. The 500M draft model is intended to accelerate inference through speculative decoding, making genome-scale scoring practical on modest hardware. Because the full corpus, weights, and code are open under Apache 2.0, Carbon is well-suited as a reproducible backbone for downstream research and fine-tuning.
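
One common way to score variants zero-shot with an autoregressive sequence model is to compare the log-likelihood of the alternate sequence with that of the reference. The sketch below illustrates that general technique; the source does not state that Carbon's scoring works exactly this way. `toy_log_likelihood` is a made-up stand-in for a real model call, and all function names are hypothetical.

```python
import math
from typing import Callable

# Toy stand-in for a model: independent base frequencies (invented values).
TOY_FREQS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def toy_log_likelihood(seq: str) -> float:
    """Log-likelihood of a sequence under the toy frequency model."""
    return sum(math.log(TOY_FREQS[base]) for base in seq)


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the base at 0-based `pos` changed from ref to alt."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def variant_score(
    log_likelihood: Callable[[str], float],
    ref_seq: str,
    pos: int,
    ref: str,
    alt: str,
) -> float:
    """Log-likelihood ratio: log P(alt sequence) - log P(ref sequence)."""
    alt_seq = apply_snv(ref_seq, pos, ref, alt)
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)


# A>C substitution at position 1 of a toy reference window.
score = variant_score(toy_log_likelihood, "GATTACA", 1, "A", "C")
```

Under this convention a negative score means the model finds the alternate allele less likely than the reference. Whether that tracks pathogenicity has to be checked against labeled benchmarks such as the ClinVar and BRCA2 sets mentioned above.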

#Impact

Carbon's central contribution is demonstrating that a conventional transformer paired with an efficient 6-mer tokenizer can match the zero-shot performance of larger, architecturally specialized genomic models at a fraction of the inference cost. The reported 275x throughput advantage of Carbon-3B over Evo2-7B — and the ability to process the human genome on a single GPU in under two days — substantially lowers the barrier to genome-scale analysis. Its fully open release of weights, code, and the Carbon Pretraining Corpus makes it one of the more transparent genomic foundation models available. Notable caveats: the work is so far described in a preprint ("Carbon: Decoding the Language of Life," bioRxiv 2026) rather than a peer-reviewed publication, and the model is DNA-sequence-only, so multi-modal genomic analyses still require complementary tools.
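
As a rough sanity check on the throughput claim, the arithmetic below estimates the sustained rate implied by scoring a human genome in two days. It is a back-of-envelope sketch under stated assumptions: a genome of about 3.1 billion base pairs, a single pass over one strand, and non-overlapping 6-mer tokens. The 550-day figure simply applies the reported 275x ratio uniformly and is not a measured number.

```python
HUMAN_GENOME_BP = 3.1e9          # approximate human genome size (assumption)
SECONDS_PER_DAY = 24 * 60 * 60
K = 6                            # bases per DNA token


def min_throughput_bp_per_s(genome_bp: float, days: float) -> float:
    """Sustained bases per second needed to finish genome_bp within `days`."""
    return genome_bp / (days * SECONDS_PER_DAY)


carbon_bp_per_s = min_throughput_bp_per_s(HUMAN_GENOME_BP, 2.0)  # about 17,900 bp/s
carbon_tokens_per_s = carbon_bp_per_s / K                        # about 3,000 tokens/s

# If a model were uniformly 275x slower, the same job would take this many days.
slower_model_days = 2.0 * 275                                    # 550.0
```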

Citation

Carbon: Decoding the Language of Life

Allal, L. B., et al. (2026) Carbon: Decoding the Language of Life. bioRxiv.

DOI: 10.64898/2026.05.22.727119

GitHub

Stars: 185
Forks: 25

HuggingFace

Downloads: 6K
Likes: 48

Openness

Class II
Open Tooling

Tags

dna, foundation_model, generative, genomics, transformer, variant_effect_prediction

Resources

GitHub Repository, Research Paper, HuggingFace Model, Demo, Dataset