C3P

Contrastive promoter-protein pretraining that aligns bacterial promoters with the proteins encoded downstream of them to learn regulatory genomics representations.

Released: May 2026

C3P (Contrastive Promoter-Protein Pretraining) is a self-supervised genome model that learns representations of bacterial regulatory DNA by aligning each promoter with the protein encoded by its downstream gene. Introduced in a May 2026 preprint by Cameron Dufault, Scott Xu, and Alan M. Moses at the University of Toronto (Departments of Computer Science and Cell and Systems Biology), it adapts the CLIP contrastive-learning recipe from vision-language modelling to genomics: a promoter sequence and the protein encoded immediately downstream form a positive pair, and the model is trained to embed true pairs close together while pushing mismatched pairs apart.
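
One common way to write a CLIP-style objective of the kind described above is the symmetric InfoNCE loss. The notation below is introduced here for illustration and is an assumption: u_i is the embedding of promoter i, v_j the embedding of protein j, N the batch size, and τ a temperature. The preprint's exact formulation, including how the temperature is set or learned, may differ.

```latex
s_{ij} = \frac{u_i^{\top} v_j}{\lVert u_i \rVert \, \lVert v_j \rVert},
\qquad
\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N}
\left[
  -\log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)}
  \;-\;
  \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)}
\right]
```

The first term asks each promoter to pick out its own protein among the N proteins in the batch; the second asks each protein to pick out its own promoter. Mismatched pairs enter only through the denominators, which is what pushes them apart.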

The work targets a known weakness of conventional genome language models (gLMs). Most gLMs are trained with masked or autoregressive reconstruction over nucleotide sequence, an objective that struggles on promoters because regulatory DNA is short, noisy, and rapidly evolving, with weak local sequence conservation. By using the comparatively well-conserved protein product as a supervisory signal, C3P sidesteps reconstruction and instead asks what regulatory context tends to accompany a given protein function. This yields promoter embeddings that capture aspects of gene regulation rather than raw sequence statistics.

C3P fits into the regulatory-genomics niche alongside reconstruction-based gLMs such as Nucleotide Transformer and Evo, but distinguishes itself through its cross-modal contrastive objective and its focus on bacterial promoters. The authors report competitive performance at a fraction of the training cost of leading gLMs, positioning C3P as an efficient, specialized alternative for prokaryotic gene-regulation tasks.

Key Features

Cross-modal contrastive objective: A CLIP-style loss aligns promoter sequences with representations of the proteins encoded downstream of them, replacing nucleotide reconstruction with a biologically grounded supervisory signal that is more robust to the noisiness of regulatory DNA.
Reusable promoter embeddings: The trained promoter encoder produces general-purpose representations that transfer to downstream regulatory annotation tasks, where the authors report multi-fold improvements over leading gLMs.
Zero-shot co-regulated gene retrieval: Because embeddings reflect regulatory context, C3P can retrieve co-regulated genes without any task-specific labels or experimental data, a capability the authors find gLMs largely lack.
Training efficiency: C3P reaches strong performance at a fraction of the compute used by large gLMs, and scaling analyses suggest further gains are available with more data and parameters.
Open weights and code: A pretrained checkpoint (C3P_100M) is released under an MIT license on Hugging Face, with a full training and inference pipeline on GitHub.

Technical Details

C3P uses a dual-encoder architecture. The promoter encoder is a custom transformer (default configuration of 4 layers and 4 attention heads, processing up to 300 bp of upstream non-coding sequence), and the protein encoder is the ESM2 protein language model; both modalities are projected into a shared 128-dimensional embedding space where the contrastive loss is applied. Training data consists of approximately 88 million bacterial promoter-protein pairs extracted from RefSeq genomes, using taxonomically diverse genomes filtered at the genus and species level and protein clustering (via MMseqs2) to control redundancy. The released checkpoint is labelled C3P_100M, indicating roughly 100M parameters; the preprint does not report an exact count. On regulatory-annotation benchmarks the authors report multi-fold gains over leading gLMs, and on zero-shot co-regulated gene retrieval C3P shows substantial improvements where reconstruction-based gLMs do not, all at markedly lower training cost.
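
The projection-plus-loss step of a dual-encoder setup like this can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random stand-in features, the input widths (256 and 320), and the temperature of 0.07 are all hypothetical; only the shared 128-dimensional space comes from the description above.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(promoter_feats, protein_feats, w_prom, w_prot, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a true pair."""
    u = l2_normalize(promoter_feats @ w_prom)   # promoters in the shared space
    v = l2_normalize(protein_feats @ w_prot)    # proteins in the shared space
    logits = (u @ v.T) / temperature            # pairwise similarity matrix
    diag = np.arange(logits.shape[0])
    # Promoter -> protein: each row should put its mass on the diagonal entry.
    loss_prom_to_prot = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Protein -> promoter: the same requirement, column-wise.
    loss_prot_to_prom = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_prom_to_prot + loss_prot_to_prom)

# Random stand-ins for encoder outputs (hypothetical widths, not the real encoders).
rng = np.random.default_rng(0)
batch, d_prom, d_prot, d_shared = 8, 256, 320, 128
promoter_feats = rng.normal(size=(batch, d_prom))
protein_feats = rng.normal(size=(batch, d_prot))
w_prom = rng.normal(size=(d_prom, d_shared)) / np.sqrt(d_prom)
w_prot = rng.normal(size=(d_prot, d_shared)) / np.sqrt(d_prot)

loss = clip_style_loss(promoter_feats, protein_feats, w_prom, w_prot)
print(f"contrastive loss on random features: {loss:.3f}")
```

In a real training loop the two projection matrices (and the encoders feeding them) would be updated by gradient descent on this loss; NumPy is used here only to keep the arithmetic visible.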

Applications

C3P is most useful to microbial genomicists and synthetic-biology researchers working with bacterial regulation. Its promoter embeddings can annotate regulatory elements, predict aspects of gene regulation, and group genes likely under shared control, supporting tasks such as operon and regulon inference, functional annotation of uncharacterized genomes, and prioritizing candidate regulatory sequences for experimental follow-up. Because co-regulated gene retrieval works zero-shot, researchers can mine large genome collections for regulatory relationships without curated training labels, and the embeddings can serve as input features for downstream supervised models in metagenomic and comparative-genomics pipelines.
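
Zero-shot retrieval of this kind usually reduces to nearest-neighbour search in embedding space. The sketch below assumes promoter embeddings have already been computed and stored as rows of an array; the function name, gene identifiers, and synthetic embeddings are hypothetical, and cosine similarity is an assumed (though common) choice of metric.

```python
import numpy as np

def top_k_neighbors(embeddings, query_index, k=3):
    """Return (index, cosine similarity) for the k promoters closest to the query."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[query_index]     # cosine similarity to every promoter
    sims[query_index] = -np.inf         # exclude the query itself
    order = np.argsort(-sims)[:k]       # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Synthetic 128-d embeddings: two groups of three promoters, each group
# scattered around its own centre to mimic two co-regulated gene sets.
rng = np.random.default_rng(1)
center_a = rng.normal(size=128)
center_b = rng.normal(size=128)
embeddings = np.vstack([
    center_a + 0.1 * rng.normal(size=(3, 128)),
    center_b + 0.1 * rng.normal(size=(3, 128)),
])
gene_ids = ["geneA1", "geneA2", "geneA3", "geneB1", "geneB2", "geneB3"]

hits = top_k_neighbors(embeddings, query_index=0, k=2)
for idx, sim in hits:
    print(gene_ids[idx], round(sim, 3))
```

For genome-scale collections, the dense matrix product would typically be replaced by an approximate nearest-neighbour index, but the ranking logic stays the same.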

Impact

C3P offers evidence that cross-modal contrastive learning is a viable and efficient alternative to reconstruction-based pretraining for regulatory DNA, a domain where standard gLMs have underperformed. By extracting supervision from the conserved protein product rather than the variable promoter itself, it reframes how genome models can learn about gene regulation, and the authors report that it does so at a fraction of the training cost of leading gLMs. Because the work is a recent preprint (not yet peer-reviewed), its benchmark claims await independent replication, and its scope is presently limited to bacterial promoters. The published Hugging Face model card is currently a stub ("better model card to come soon"), with the more complete documentation living in the GitHub README. Open weights and an end-to-end pipeline lower the barrier for the community to test, extend, and scale the approach.

Citation

C3P: Contrastive promoter-protein pretraining yields representations capturing bacterial gene regulation

Preprint

Dufault, C., et al. (2026) C3P: Contrastive promoter-protein pretraining yields representations capturing bacterial gene regulation.

DOI: 10.48550/arXiv.2605.25242

Citations

Total Citations: 0
Influential: 0
References: 55

GitHub

Stars: 1
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 1mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 1mo ago

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 77 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 58

Model Openness Framework: Unclassified (missing required components)
