Specificity Foundation Model that predicts transcription factor-DNA binding specificity from sequence using a physics-derived dual-encoder with symmetric contrastive learning.
tf-SFM is a Specificity Foundation Model (SFM) for predicting transcription factor (TF)–DNA binding specificity directly from sequence. Determining which TF recognizes which genomic sequence is fundamental to understanding gene regulation, and conventional approaches rely on experimental assays such as protein binding microarrays, SELEX, or ChIP-seq. tf-SFM instead frames TF–DNA recognition as a cross-modal matching problem, learning to align cognate protein–DNA pairs in a shared representation space so that likely binding partners can be scored and retrieved from sequence alone.
Developed by the Reddy lab at ETH Zurich and posted as a bioRxiv preprint in June 2026, tf-SFM is one of six models in the SFM family, all built on a single physics-derived dual-encoder architecture. It is the successor to CALM-1, the antibody–antigen specificity model from the same group, and generalizes that contrastive molecular-recognition recipe from immune binding to gene-regulatory binding.
The model encodes TF and DNA sequences with separate encoders and aligns them using a symmetric contrastive objective, pulling true binding pairs together and pushing non-binders apart. This formulation lets tf-SFM transfer knowledge learned across many regulatory contexts into zero-shot predictions on held-out TFs and binding sites.
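The symmetric contrastive objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the embedding shapes, and the fixed `temperature` value are assumptions made for illustration, and the loss shown is the generic CLIP-style symmetric cross-entropy in which row i of each embedding matrix is treated as a true binding pair.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(tf_emb, dna_emb, temperature=0.07):
    """Generic symmetric contrastive loss (illustrative, hypothetical helper).

    tf_emb, dna_emb: arrays of shape (batch, dim); row i of each is assumed
    to be a cognate TF-DNA pair. `temperature` is a fixed scalar here purely
    for illustration.
    """
    logits = l2_normalize(tf_emb) @ l2_normalize(dna_emb).T / temperature
    idx = np.arange(logits.shape[0])
    # TF -> DNA direction: each TF must pick its own DNA among the batch.
    loss_tf_to_dna = -log_softmax(logits, axis=1)[idx, idx].mean()
    # DNA -> TF direction: each DNA must pick its own TF among the batch.
    loss_dna_to_tf = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_tf_to_dna + loss_dna_to_tf)


# Toy check with random stand-in embeddings (not real model outputs).
rng = np.random.default_rng(0)
tf_emb = rng.normal(size=(4, 8))
aligned = symmetric_contrastive_loss(tf_emb, tf_emb)         # pairs match
shuffled = symmetric_contrastive_loss(tf_emb, tf_emb[::-1])  # pairs mismatched
```

In this toy check the correctly paired batch yields a lower loss than the mismatched one, which is the behavior the objective is meant to reward. Averaging the two directions is what makes the loss symmetric: swapping the TF and DNA arguments gives the same value.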
tf-SFM uses the shared SFM architecture: a physics-derived dual-encoder trained with a symmetric contrastive objective and a learned Boltzmann temperature that calibrates similarity scores. The two encoders embed transcription factor and DNA sequences independently, and the contrastive loss aligns cognate pairs while separating mismatches. The model is pretrained on public TF–DNA specificity data and evaluated by zero-shot cross-modal retrieval on held-out pairs, where it reports strong top-k retrieval performance. The same style of benchmark is used across the SFM family to measure how reliably a model recovers true binding partners.
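To make the top-k retrieval evaluation concrete, the sketch below computes a common recall-at-k metric from a TF-by-DNA similarity matrix. The preprint's exact metric definition is not given here, so this is an assumed, generic formulation; the function name `recall_at_k` and the toy similarity values are hypothetical.

```python
import numpy as np


def recall_at_k(similarity, k):
    """Fraction of queries whose true partner is among the k top-scoring candidates.

    similarity[i, j] is the score of TF i against DNA candidate j; the true
    partner of TF i is assumed to be DNA i (the diagonal).
    """
    n = similarity.shape[0]
    # Indices of the k highest-scoring DNA candidates for each TF query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(n)[:, None]).any(axis=1)
    return float(hits.mean())


# Toy similarity matrix (made-up numbers, not model output).
similarity = np.array([
    [0.9, 0.1, 0.3],  # TF 0 ranks its true partner first
    [0.2, 0.4, 0.8],  # TF 1 ranks its true partner second
    [0.5, 0.6, 0.7],  # TF 2 ranks its true partner first
])
recall_1 = recall_at_k(similarity, 1)  # 2 of 3 queries hit at k=1
recall_2 = recall_at_k(similarity, 2)  # all 3 queries hit at k=2
```

The same ranking step can be reused to shortlist candidate binding sites for a single TF of interest by taking one row of the similarity matrix.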
tf-SFM is aimed at regulatory genomics, where identifying or prioritizing TF–DNA interactions from sequence can accelerate the annotation of regulatory elements and the design of synthetic promoters. By scoring and retrieving likely binding partners, it can help generate hypotheses about which factors drive a given regulatory site, triage candidate binding sites for a TF of interest, and complement experimental assays in settings where wet-lab characterization is limited.
tf-SFM extends the contrastive specificity-prediction paradigm established by CALM-1 from antibody–antigen recognition to transcription factor–DNA recognition, demonstrating that a single physics-derived dual-encoder recipe transfers across molecular domains. As one of six SFMs released together, it contributes evidence that cross-modal contrastive learning is a general tool for biological specificity prediction. Its main current limitations are those of a recent preprint: results await peer review and independent benchmarking, and at the time of release no public code or weights repository was available, so reproduction depends on forthcoming artifact releases.
Reddy, S. T. (2026) Vibe Coding Specificity Foundation Models. bioRxiv.
DOI: 10.64898/2026.06.04.730134