Carbon

Hugging Face / Beijing Zhongguancun Academy / TIGEM

Autoregressive DNA foundation model for variant effect prediction, using 6-mer tokenization to match Evo2-7B win rates at far higher throughput.

Released: May 2026

Parameters: 8 Billion

Carbon (the Carbon Genomic Foundation Model) is an open autoregressive DNA foundation model released in 2026 through a collaboration between Hugging Face, the Zhongguancun Academy in Beijing, and TIGEM (Telethon Institute of Genetics and Medicine) at Università di Napoli Federico II. It is built to model genomic sequence at scale while remaining fully open — model weights, training and inference code, the training corpus, and a technical report are all released under Apache 2.0.

Carbon enters a landscape defined by large genomic models such as Evo 2 and positions itself explicitly as an efficient, fully reproducible alternative. Where many high-performing genomic models rely on specialized hybrid architectures or single-nucleotide tokenization, Carbon pairs a deliberately conventional decoder-only transformer with a 6-mer DNA tokenizer that compresses sequence roughly 6x. The result is a model family that reportedly matches the zero-shot quality of larger systems at much higher throughput: the 3B variant is reported to match the win rate of Evo2-7B at roughly 275x the throughput and to score the entire human genome on a single GPU in under two days.

The family ships in three sizes: Carbon-500M (a 0.5B draft model intended for speculative decoding), Carbon-3B (the 3B flagship), and Carbon-8B (the largest variant, at 8B parameters). All are trained without task-specific labels, learning genomic structure directly from a curated, annotation-aware mixture of sequence data.

Key Features

Hybrid English + DNA tokenizer: A BPE tokenizer for English text is combined with a 6-mer DNA tokenizer (4,096 six-base tokens plus metadata tokens, ~155,776 total vocabulary), switched mid-sequence via a <dna> tag. The 6-mer scheme compresses DNA roughly 6x and, because self-attention cost grows quadratically with sequence length, cuts attention cost by up to 36x.
Deliberately vanilla architecture: A decoder-only transformer using RMSNorm, SwiGLU, RoPE, grouped-query attention, and tied input/output embeddings — chosen for reproducibility and efficiency rather than novelty.
Annotation-aware training corpus: Trained on the Carbon Pretraining Corpus, a curated mixture biased toward functional genomic regions rather than raw background sequence.
Long-context inference: Trained at 8k tokens (~49 kbp) and extended to 32k (~197 kbp); at inference Carbon-8B reaches ~786 kbp via YaRN 4x extrapolation, while Carbon-3B extrapolates roughly 2x (~393 kbp).
Base-level scoring and generation: A marginalization procedure recovers per-base probabilities from the 6-mer model, enabling single-nucleotide variant scoring and generation despite the coarse tokenization.
Efficiency at scale: Carbon-3B matches Evo2-7B win rates at ~275x throughput, processing the full human genome on one GPU in under two days.
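The compression and attention figures in the list above follow from simple counting, which a small sketch can make concrete. The code below is a toy illustration, not Carbon's tokenizer: the lexicographic id assignment, the function names, and the requirement that the input length be a multiple of six are all assumptions made for the example.

```python
from itertools import product

BASES = "ACGT"
K = 6

# Hypothetical vocabulary: every 6-mer in lexicographic order -> integer id.
# The real tokenizer's id layout and metadata tokens are not modeled here.
KMER_TO_ID = {"".join(p): i for i, p in enumerate(product(BASES, repeat=K))}


def encode_6mers(seq):
    """Split a DNA string into non-overlapping 6-mers and map each to an id.

    Assumption: the length is a multiple of 6 and only A/C/G/T occur.
    """
    seq = seq.upper()
    if len(seq) % K != 0:
        raise ValueError("sequence length must be a multiple of 6 in this sketch")
    return [KMER_TO_ID[seq[i:i + K]] for i in range(0, len(seq), K)]


def attention_cost_ratio(n_bases):
    """Pairwise-attention cost with one token per base, relative to 6-mers."""
    n_tokens = n_bases // K
    return (n_bases ** 2) / (n_tokens ** 2)


ids = encode_6mers("ACGTACTTTTTT")      # 12 bases become 2 tokens
ratio = attention_cost_ratio(6000)       # 6,000 bases vs 1,000 tokens
print(len(KMER_TO_ID), ids, ratio)
```

The vocabulary size is 4^6 = 4,096, the token count drops by a factor of six, and the quadratic attention term drops by 6^2 = 36, which is where the "up to 36x" figure comes from.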

Technical Details

Carbon is a decoder-only autoregressive transformer trained on approximately 1 trillion tokens (~6 trillion DNA base pairs) drawn from the Carbon Pretraining Corpus. The corpus is an annotation-aware mixture: eukaryotic functional genomic regions from RefSeq, spliced mRNA transcripts from OpenGenome2, and roughly 10% prokaryotic genomes from GTDB v220 and IMG/PR. Training emphasizes functional regions over genome background. Two techniques support the coarse tokenizer: a training-time Factorized Nucleotide Supervision (FNS) loss that grants partial credit for near-miss 6-mer predictions, and an inference-time scheme that marginalizes over tokens to produce per-base probabilities for scoring and generation.
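The marginalization idea can be sketched in a few lines: a distribution over 4,096 six-base tokens is summed down to six separate 4-way distributions, one per position. The exact form of the FNS loss is not given here, so `factorized_nll` below is only one plausible reading (a sum of per-position negative log marginals), and every name in the block is hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
K = 6
KMERS = ["".join(p) for p in product(BASES, repeat=K)]  # 4,096 six-base tokens


def per_base_marginals(token_probs):
    """Collapse a distribution over 6-mers into six per-position distributions.

    token_probs[i] is the probability assigned to KMERS[i].
    Returns six dicts, one per position, mapping base -> probability.
    """
    marginals = [dict.fromkeys(BASES, 0.0) for _ in range(K)]
    for kmer, p in zip(KMERS, token_probs):
        for pos, base in enumerate(kmer):
            marginals[pos][base] += p
    return marginals


def factorized_nll(token_probs, target_kmer):
    """Illustrative loss (an assumption, not the published FNS formula):
    sum over positions of -log of the marginal probability of the true base."""
    marginals = per_base_marginals(token_probs)
    return -sum(math.log(marginals[pos][b]) for pos, b in enumerate(target_kmer))


# Toy prediction: 0.9 on "AAAAAA", the remaining 0.1 spread over all other tokens.
probs = [0.1 / (len(KMERS) - 1)] * len(KMERS)
probs[KMERS.index("AAAAAA")] = 0.9

near_miss = factorized_nll(probs, "AAAAAC")  # differs from the peak at one base
far_miss = factorized_nll(probs, "CCCCCC")   # differs at all six bases
print(near_miss < far_miss)
```

Under a plain token-level cross-entropy, both targets would receive the same loss because each has the same token probability; the per-position view is what lets a near miss score better than a far miss.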

The three variants (500M, 3B, 8B) share the same architectural template — RMSNorm, SwiGLU, RoPE, grouped-query attention, tied embeddings — differing in depth and width. Across eight training-free (zero-shot) tasks, Carbon-8B leads, while Carbon-3B matches Evo2-7B quality at far lower cost. Evaluated tasks include variant effect prediction (BRCA2 variants, ClinVar non-coding variants), sequence recovery, perturbation analyses (CAG/triplet repeat expansion, synonymous-codon substitution), and long-context retrieval (Genome-NIAH at 393 kbp).
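The context lengths quoted in base pairs, including the 393 kbp retrieval setting, can be checked with a short conversion. This sketch assumes that "8k" and "32k" mean 8,192 and 32,768 tokens and that every token covers exactly six bases; the helper name is hypothetical.

```python
BASES_PER_TOKEN = 6  # one 6-mer token covers six bases


def tokens_to_kbp(n_tokens, extrapolation=1):
    """Convert a token context length (times an extrapolation factor) to kbp."""
    return n_tokens * extrapolation * BASES_PER_TOKEN / 1000


settings = [
    ("training context, 8k tokens", 8192, 1),
    ("extended context, 32k tokens", 32768, 1),
    ("32k tokens with 2x extrapolation", 32768, 2),
    ("32k tokens with 4x extrapolation", 32768, 4),
]
for label, tokens, factor in settings:
    print(f"{label}: ~{tokens_to_kbp(tokens, factor):.0f} kbp")
```

The four results round to 49, 197, 393, and 786 kbp, matching the figures quoted for training, extension, and YaRN extrapolation.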

Applications

Carbon supports a broad range of training-free genomic analyses without fine-tuning. Clinical and functional genomics researchers can use its zero-shot variant scoring to prioritize coding and non-coding variants of uncertain significance, including BRCA2 and ClinVar non-coding cases. Its perturbation-analysis capabilities suit studies of repeat expansion disorders and synonymous-codon effects, while long-context retrieval enables reasoning over large regulatory regions and multi-gene loci. The 500M draft model is intended to accelerate inference through speculative decoding, making genome-scale scoring practical on modest hardware. Because the full corpus, weights, and code are open under Apache 2.0, Carbon is well-suited as a reproducible backbone for downstream research and fine-tuning.
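Zero-shot variant scoring with an autoregressive model is commonly done by comparing the model's log-likelihood of the alternate sequence against the reference. The sketch below assumes that convention; the scoring rule Carbon actually uses is not stated here. The `log_likelihood` callable stands in for a real model, and `toy_log_likelihood` is a made-up position-independent model used only so the example runs.

```python
import math

# Made-up base frequencies for a toy, position-independent "model".
TOY_BASE_PROBS = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}


def toy_log_likelihood(seq):
    """Stand-in for a genomic language model: sum of per-base log probabilities."""
    return sum(math.log(TOY_BASE_PROBS[b]) for b in seq)


def variant_llr(log_likelihood, ref_seq, pos, alt):
    """Log-likelihood ratio of a single-nucleotide variant.

    Negative values mean the model finds the alternate allele less likely
    than the reference in this context.
    """
    if ref_seq[pos] == alt:
        raise ValueError("alternate base must differ from the reference base")
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)


score = variant_llr(toy_log_likelihood, "ATATAT", 0, "C")
print(score)
```

With a real model, `log_likelihood` would wrap a forward pass over the surrounding genomic window, and the resulting scores could be ranked to prioritize variants of uncertain significance.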

Impact

Carbon's central contribution is demonstrating that a conventional transformer paired with an efficient 6-mer tokenizer can match the zero-shot performance of larger, architecturally specialized genomic models at a fraction of the inference cost. The reported 275x throughput advantage of Carbon-3B over Evo2-7B — and the ability to process the human genome on a single GPU in under two days — substantially lowers the barrier to genome-scale analysis. Its fully open release of weights, code, and the Carbon Pretraining Corpus makes it one of the more transparent genomic foundation models available. Notable caveats: the work is so far described in a preprint ("Carbon: Decoding the Language of Life," bioRxiv 2026) rather than a peer-reviewed publication, and the model is DNA-sequence-only, so multi-modal genomic analyses still require complementary tools.
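The "human genome in under two days" claim implies a minimum sustained throughput, which can be estimated with back-of-the-envelope arithmetic. The genome size of about 3.1 billion base pairs is an approximation introduced here, and the calculation assumes a single pass over one strand with no overlapping windows.

```python
HUMAN_GENOME_BP = 3.1e9          # assumption: approximate haploid genome size
SECONDS_IN_TWO_DAYS = 2 * 24 * 3600
BASES_PER_TOKEN = 6              # 6-mer tokenization

# Lower bound on throughput needed to finish one pass within two days.
min_bp_per_s = HUMAN_GENOME_BP / SECONDS_IN_TWO_DAYS
min_tokens_per_s = min_bp_per_s / BASES_PER_TOKEN

print(f"at least {min_bp_per_s:,.0f} bp/s, or {min_tokens_per_s:,.0f} tokens/s")
```

Under these assumptions the bound is roughly 18,000 bases per second, or about 3,000 tokens per second on a single GPU.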

Citation

Carbon: Decoding the Language of Life

Allal, L. B., et al. (2026) Carbon: Decoding the Language of Life. bioRxiv.

DOI: 10.64898/2026.05.22.727119

Citations

Total Citations: 0

Influential: 0

References: 73

GitHub

Stars: 199

Forks: 27

Open Issues: 3

Contributors: 7

Last Push: 1 month ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 9.5K

Likes: 56

Last Modified: 1 month ago

Pipeline: text-generation

Openness

bio.rodeo openness: Fully open (usable and reproducible)

Overall score: 93 (Open)

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 88

Model Openness Framework: Class II (Open Tooling)

Resources

GitHub Repository, Research Paper, HuggingFace Model, Demo, Dataset
