Species-aware DNA embedding model built on DNABERT-2, using contrastive learning to cluster and differentiate genomic sequences by species without labeled data.
DNABERT-S is a DNA embedding model developed at the MAGICS Lab at Northwestern University, designed to generate sequence representations that naturally cluster by species in embedding space. It extends the DNABERT-2 genome foundation model with a specialized contrastive learning framework, enabling researchers to differentiate, classify, and bin genomic sequences without relying on labeled training data or reference genomes for every organism.
The core challenge DNABERT-S addresses is a fundamental misalignment between how general-purpose DNA language models are trained and how they are used in practice. Pre-training objectives like masked language modeling encourage models to learn sequence-level syntax and grammar, but they do not explicitly push representations of sequences from different species apart. As a result, embedding spaces produced by models such as DNABERT-2 or HyenaDNA are not well-organized by species — a critical shortcoming for tasks like metagenomics binning, where the goal is to group millions of unlabeled sequencing reads by their organism of origin. DNABERT-S is purpose-built to close this gap.
The model was first posted as a preprint in February 2024 and subsequently published in Bioinformatics as part of the ISMB 2025 proceedings. It was developed by Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V. Davuluri, Zhong Wang, and Han Liu, with affiliations spanning Northwestern University, UC Merced, and Lawrence Berkeley National Laboratory.
DNABERT-S is initialized from DNABERT-2, a 117-million parameter BERT-style transformer that uses Byte Pair Encoding (BPE) tokenization and Attention with Linear Biases (ALiBi) instead of fixed positional embeddings. This backbone was pre-trained on 32 billion bases from 135 species. DNABERT-S then applies contrastive fine-tuning on top of this foundation using the Curriculum Contrastive Learning (C2LR) protocol, training for three epochs in total (one phase-one epoch, two phase-two epochs) at a learning rate of 3e-6, a batch size of 48, and a SimCLR temperature of 0.05. Training ran for approximately 48 hours on 8 NVIDIA A100 80 GB GPUs.
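The SimCLR-style objective underlying this fine-tuning can be sketched as an InfoNCE loss over paired embeddings. The following is a minimal NumPy illustration of the loss shape only, not the authors' implementation, which additionally uses curriculum phases and hard negatives:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.05):
    """SimCLR-style InfoNCE loss over paired embeddings.

    z1[i] and z2[i] embed two windows drawn from the same genome
    (a positive pair); every other row in the batch serves as a
    negative. Function name and batch layout are illustrative.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                  # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives sit on the diagonal
```

Minimizing this loss pulls the two windows of each positive pair together while pushing apart windows from different genomes, which is what organizes the embedding space by species.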
The training corpus consists of 2 million paired 10,000-bp sequences sampled from 24,951 reference genomes, including approximately 5,000 viral, 5,011 fungal, and 6,402 bacterial genomes. Positive pairs are two windows drawn from the same genome; negative pairs are drawn from different species, with hard negatives mined during phase-one training. Evaluation spans 28 diverse datasets covering clustering, binning, and classification scenarios, including 14 long-read datasets from the CAMI2 benchmark (marine and plant-associated environments) and 9 synthetic datasets at varying species richness.
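Positive-pair construction can be illustrated with a short sketch. The sampling details below (independent uniform window starts, no overlap constraint) are assumptions for illustration, not a description of the published pipeline:

```python
import random

def sample_positive_pair(genome, window=10_000, rng=random):
    """Draw two windows from one genome to form a positive pair.

    Illustrative only: the actual pipeline may constrain overlap
    or coverage when sampling its 10,000-bp windows.
    """
    assert len(genome) >= window, "genome shorter than the window size"
    i = rng.randrange(len(genome) - window + 1)
    j = rng.randrange(len(genome) - window + 1)
    return genome[i:i + window], genome[j:j + window]
```

A negative pair is obtained the same way, with the two windows taken from genomes of different species.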
Across K-Means clustering benchmarks, DNABERT-S achieves a mean Adjusted Rand Index (ARI) of 53.80, compared to 26.47 for the strongest baseline — approximately a 2x improvement. In metagenomics binning on synthetic datasets, the model recovers over 80% of species with an F1 score above 0.5, versus roughly 40% of species on the more realistic CAMI2 marine datasets, consistently outperforming tetranucleotide frequency (TNF) and other embedding baselines, in many cases by more than twofold.
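The clustering benchmark procedure (K-Means over embeddings, scored against ground-truth species labels with ARI) can be reproduced in miniature with scikit-learn. The toy Gaussian embeddings below stand in for real DNABERT-S outputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for sequence embeddings: two well-separated "species" clusters.
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 8)),   # species A embeddings
    rng.normal(3.0, 0.1, size=(50, 8)),   # species B embeddings
])
true_species = [0] * 50 + [1] * 50

# Cluster in embedding space, then score agreement with the true labels.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
ari = adjusted_rand_score(true_species, pred)
```

ARI is invariant to label permutation (it does not matter which cluster is called 0 or 1) and reaches 1.0 only when the predicted partition matches the true species assignment exactly, which makes it a natural score for unsupervised species separation.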
DNABERT-S is particularly well-suited for metagenomics workflows, where shotgun sequencing of environmental samples (soil, ocean water, gut microbiomes) produces millions of reads from hundreds of co-occurring organisms that must be separated before downstream analysis. By replacing fragile reference-alignment steps with embedding-based clustering, DNABERT-S enables binning even in the absence of characterized reference genomes for every species in the sample. Beyond metagenomics, the model supports ecological diversity surveys, pathogen surveillance from clinical or environmental sequencing, and evolutionary analysis requiring species-level separation of unlabeled genomic fragments. Its strong few-shot classification performance also makes it useful for rapid species identification in resource-constrained settings where generating large labeled training sets is impractical.
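An embedding-based binning loop is straightforward to sketch. Here a classical tetranucleotide-frequency vector stands in for the DNABERT-S embedder (which would replace `tnf_embed` in a real workflow), and `bin_reads` is a hypothetical helper name:

```python
from collections import Counter
from itertools import product
import numpy as np
from sklearn.cluster import KMeans

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetranucleotides

def tnf_embed(read):
    """Tetranucleotide-frequency vector: the classical baseline mentioned
    above, used here only as a stand-in for the DNABERT-S embedder."""
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    total = max(sum(counts[k] for k in KMERS), 1)
    return np.array([counts[k] / total for k in KMERS])

def bin_reads(reads, n_bins):
    """Embed each read, then group reads into putative species bins."""
    X = np.vstack([tnf_embed(r) for r in reads])
    labels = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit_predict(X)
    bins = {}
    for read, label in zip(reads, labels):
        bins.setdefault(label, []).append(read)
    return bins
```

The point of DNABERT-S is precisely that swapping the embedding function from TNF to its learned representations makes the resulting bins align far better with actual species boundaries, with no reference genomes required.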
DNABERT-S represents a conceptually important step in the maturation of genomic foundation models: the recognition that pre-training objectives and downstream evaluation objectives can be fundamentally misaligned, and that targeted contrastive fine-tuning can bridge that gap without sacrificing the rich sequence representations learned during pre-training. Its publication in Bioinformatics at ISMB 2025 reflects community recognition of this contribution. The HuggingFace model has accumulated over 288,000 downloads, indicating substantial uptake in metagenomics and environmental genomics research communities. A notable limitation is that the current training data emphasizes bacteria, fungi, and viruses, leaving coverage of eukaryotic organisms — including plants and animals — less thorough. Performance also degrades on very short reads (below ~2,000 bp) and on highly repetitive genomic regions where species-level signal is weak. Future extensions may incorporate eukaryotic training data and adapt the contrastive framework to short-read sequencing platforms such as Illumina.
Zhou, Z., et al. (2025) DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings. Bioinformatics.
DOI: 10.1093/bioinformatics/btaf188