Contrastive promoter-protein pretraining that aligns bacterial promoters with the proteins encoded by their downstream genes to learn regulatory genomics representations.
C3P (Contrastive Promoter-Protein Pretraining) is a self-supervised genome model that learns representations of bacterial regulatory DNA by aligning each promoter with the protein encoded by the gene it controls. Introduced in a May 2026 preprint by Cameron Dufault, Scott Xu, and Alan M. Moses at the University of Toronto (Departments of Computer Science and Cell and Systems Biology), it adapts the CLIP contrastive-learning recipe from vision-language modelling to genomics: a promoter sequence and the protein encoded by its downstream gene form a positive pair, and the model is trained to embed true pairs close together while pushing mismatched pairs apart.
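The pairing objective can be illustrated with a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) loss. This is not the authors' implementation: the function names, the temperature of 0.07 (CLIP's initial value), and the random stand-in embeddings are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(promoter_emb, protein_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of each array is assumed to come from the same gene (a positive
    pair); every other row in the batch serves as a negative.
    The temperature is an assumed value, not one taken from the preprint.
    """
    p = l2_normalize(promoter_emb)
    q = l2_normalize(protein_emb)
    logits = p @ q.T / temperature                # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_p2q = -log_softmax(logits, axis=1)[idx, idx].mean()  # promoter -> protein
    loss_q2p = -log_softmax(logits, axis=0)[idx, idx].mean()  # protein -> promoter
    return 0.5 * (loss_p2q + loss_q2p)

# Toy data: random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
protein = rng.normal(size=(8, 128))
aligned = protein + 0.05 * rng.normal(size=(8, 128))  # promoters near their proteins
mismatched = np.roll(aligned, 1, axis=0)              # every promoter paired with the wrong protein

loss_aligned = clip_style_loss(aligned, protein)
loss_mismatched = clip_style_loss(mismatched, protein)
```

In this toy setting the loss is lower when each promoter embedding sits next to its own protein embedding than when the pairs are shifted by one row, which is the behaviour the training objective rewards.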
The work targets a known weakness of conventional genome language models (gLMs). Most gLMs are trained with masked or autoregressive reconstruction over nucleotide sequence, an objective that struggles on promoters because regulatory DNA is short, noisy, and rapidly evolving, with weak local sequence conservation. By using the comparatively well-conserved protein product as a supervisory signal, C3P sidesteps reconstruction and instead asks what regulatory context tends to accompany a given protein function. This yields promoter embeddings that capture aspects of gene regulation rather than raw sequence statistics.
C3P fits into the regulatory-genomics niche alongside reconstruction-based gLMs such as Nucleotide Transformer and Evo, but distinguishes itself through its cross-modal contrastive objective and its focus on bacterial promoters. The authors report competitive performance at a fraction of the training cost of leading gLMs, positioning C3P as an efficient, specialized alternative for prokaryotic gene-regulation tasks.
C3P uses a dual-encoder architecture. The promoter encoder is a custom transformer (default configuration of 4 layers and 4 attention heads, processing up to 300 bp of upstream non-coding sequence), and the protein encoder is the ESM2 protein language model; both modalities are projected into a shared 128-dimensional embedding space where the contrastive loss is applied. The training data consist of approximately 88 million bacterial promoter-protein pairs extracted from RefSeq genomes. Redundancy is controlled in two ways: genomes are selected for taxonomic diversity by filtering at the genus and species level, and proteins are clustered with MMseqs2. The released checkpoint is labelled C3P_100M, which suggests roughly 100M parameters; the preprint does not report an exact count. On regulatory-annotation benchmarks the authors report multi-fold gains over leading gLMs, and on zero-shot co-regulated gene retrieval C3P shows substantial improvements where reconstruction-based gLMs do not, all at markedly lower training cost.
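The data flow through the two encoders can be traced with a shape-level sketch. Only the 300 bp input limit and the 128-dimensional shared space come from the description above. The one-hot encoding, the choice to keep the bases nearest the gene start, the hidden sizes, the unit-length normalization, and the random matrices standing in for trained encoders and projection heads are all hypothetical.

```python
import numpy as np

MAX_PROMOTER_BP = 300   # input limit stated in the text
SHARED_DIM = 128        # shared embedding size stated in the text
PROMOTER_HIDDEN = 256   # hypothetical width of the promoter transformer
PROTEIN_HIDDEN = 320    # hypothetical width of the protein encoder output

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_promoter(seq, max_len=MAX_PROMOTER_BP):
    """One-hot encode an upstream sequence into a (max_len, 4) array.

    Assumption: when the sequence is longer than max_len, the last max_len
    bases (those closest to the gene start) are kept. Shorter sequences are
    zero-padded on the right; ambiguous bases such as N stay all-zero.
    """
    seq = seq.upper()[-max_len:]
    encoded = np.zeros((max_len, 4))
    for i, base in enumerate(seq):
        j = NUC_INDEX.get(base)
        if j is not None:
            encoded[i, j] = 1.0
    return encoded

def project(features, weight):
    """Linear projection into the shared space, followed by L2 normalization."""
    z = features @ weight
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Random stand-ins for pooled encoder outputs (batch of 2 genes).
promoter_features = rng.normal(size=(2, PROMOTER_HIDDEN))
protein_features = rng.normal(size=(2, PROTEIN_HIDDEN))

# Random stand-ins for the two learned projection heads.
w_promoter = rng.normal(size=(PROMOTER_HIDDEN, SHARED_DIM))
w_protein = rng.normal(size=(PROTEIN_HIDDEN, SHARED_DIM))

z_promoter = project(promoter_features, w_promoter)   # (2, 128)
z_protein = project(protein_features, w_protein)      # (2, 128)
similarity = z_promoter @ z_protein.T                 # (2, 2) cosine similarities

example_input = one_hot_promoter("ACGTN" * 80)        # 400 bp truncated to 300
```

The point of the sketch is that the two encoders may have different internal widths, since each modality gets its own projection head, and only the projected vectors need to share a dimensionality for the contrastive loss to compare them.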
C3P is most useful to microbial genomicists and synthetic-biology researchers working with bacterial regulation. Its promoter embeddings can annotate regulatory elements, predict aspects of gene regulation, and group genes likely under shared control, supporting tasks such as operon and regulon inference, functional annotation of uncharacterized genomes, and prioritizing candidate regulatory sequences for experimental follow-up. Because co-regulated gene retrieval works zero-shot, researchers can mine large genome collections for regulatory relationships without curated training labels, and the embeddings can serve as input features for downstream supervised models in metagenomic and comparative-genomics pipelines.
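Zero-shot retrieval of this kind usually reduces to a nearest-neighbour search in embedding space. The sketch below ranks genes by cosine similarity to a query promoter embedding; the function name, the gene identifiers, and the synthetic embeddings are hypothetical, and the similarity measure is an assumption, not a detail taken from the C3P codebase.

```python
import numpy as np

def retrieve_coregulated(query_idx, embeddings, k=2):
    """Return indices of the k embeddings most cosine-similar to the query.

    The query itself is excluded from the ranking. Results are ordered from
    most to least similar.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf          # never return the query as its own hit
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Synthetic embeddings: two groups of three genes, each clustered around
# its own centroid, standing in for two sets of co-regulated genes.
rng = np.random.default_rng(0)
centroid_a, centroid_b = rng.normal(size=(2, 128))
group_a = centroid_a + 0.1 * rng.normal(size=(3, 128))
group_b = centroid_b + 0.1 * rng.normal(size=(3, 128))
embeddings = np.vstack([group_a, group_b])
gene_ids = ["a1", "a2", "a3", "b1", "b2", "b3"]   # hypothetical identifiers

hits, hit_scores = retrieve_coregulated(0, embeddings, k=2)
hit_names = [gene_ids[i] for i in hits]
```

For collections of millions of genes, the exhaustive dot product would typically be swapped for an approximate nearest-neighbour index, but the ranking logic stays the same.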
C3P offers evidence that cross-modal contrastive learning is a viable and efficient alternative to reconstruction-based pretraining for regulatory DNA, a domain where standard gLMs have underperformed. By extracting supervision from the conserved protein product rather than the variable promoter itself, it reframes how genome models can learn about gene regulation, and the authors report that it does so at a fraction of the training cost of leading gLMs. Because the work is a recent preprint that has not yet been peer-reviewed, its benchmark claims await independent replication, and its scope is presently limited to bacterial promoters. The published Hugging Face model card is currently a stub ("better model card to come soon"); the more complete documentation lives in the GitHub README. Open weights and an end-to-end pipeline lower the barrier for the community to test, extend, and scale the approach.