A multimodal architecture that couples pretrained DNA, RNA, and protein language models via directional cross-attention following the central dogma to form a unified Virtual Cell Embedding.
The Central Dogma Transformer (CDT) is a mechanism-oriented architecture that tries to model cellular information flow the way molecular biology describes it: DNA is transcribed into RNA, and RNA is translated into protein. Rather than training a single monolithic sequence model, CDT integrates three separate pretrained language models (one each for DNA, RNA, and protein) and connects them with directional cross-attention modules that mirror the central dogma. DNA-to-RNA attention is intended to capture transcriptional regulation, while RNA-to-protein attention captures translational relationships, and the combined signal is distilled into a unified representation the author calls a Virtual Cell Embedding.
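The directional coupling described above can be illustrated with a minimal NumPy sketch. This is not the author's implementation: the embedding width, token counts, random projection weights, single-head attention, residual connection, and mean pooling are all assumptions chosen for illustration, and the random arrays stand in for the outputs of the pretrained DNA, RNA, and protein language models. The sketch also assumes that the downstream modality supplies the queries (RNA attends to DNA, protein attends to RNA), which is one plausible reading of "DNA-to-RNA" and "RNA-to-protein" attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` read from `context`.

    Returns the residual-updated queries and the attention weights
    (one row per query token, one column per context token).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return queries + weights @ v, weights

rng = np.random.default_rng(0)
d = 16  # hypothetical shared embedding width

# Stand-ins for frozen encoder outputs (token counts are arbitrary).
dna = rng.normal(size=(50, d))
rna = rng.normal(size=(8, d))
protein = rng.normal(size=(5, d))

def init_projections():
    """Random query/key/value projections; trainable in a real model."""
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]

# Step 1 (transcription direction): RNA tokens attend to DNA tokens.
rna_ctx, dna_to_rna = cross_attention(rna, dna, *init_projections())

# Step 2 (translation direction): protein tokens attend to the updated RNA.
protein_ctx, rna_to_protein = cross_attention(protein, rna_ctx, *init_projections())

# Assumed pooling: average the final tokens into one vector per cell.
virtual_cell_embedding = protein_ctx.mean(axis=0)
```

Information flows only one way in this sketch: DNA never attends to RNA, and RNA never attends to protein, which is the structural constraint the architecture is built around. The attention matrices `dna_to_rna` and `rna_to_protein` are the kind of object an attention-inspection analysis would examine.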
CDT was developed and released as a single-author preprint by Nobuyuki Ota in January 2026. It is explicitly framed as a proof-of-concept ("CDT v1") and a step toward mechanism-oriented AI for cellular understanding, rather than a production foundation model. The design philosophy contrasts with purely data-driven multimodal models by hard-wiring the directionality of the central dogma into the attention structure, which the author argues yields more interpretable, biologically grounded representations.
The work sits at the intersection of genomic, transcriptomic, and proteomic language modeling, and is positioned as a bridge between single-modality foundation models (such as DNA, RNA, and protein language models) and the emerging goal of integrated "virtual cell" representations.
CDT is a transformer-based multimodal model that wires together three frozen pretrained language models with trainable directional cross-attention layers. In the v1 proof of concept, the RNA and protein embeddings are fixed rather than cell-state-specific, so the learned coupling is concentrated in the cross-attention connectors. The model was validated on CRISPRi enhancer perturbation data from K562 cells, where it predicted perturbation effects with a Pearson correlation of 0.503, about 63% of an estimated theoretical ceiling of r = 0.797 set by cross-experiment variability. Interpretability analyses combined attention inspection with gradient attribution; the gradient analysis surfaced a CTCF binding site that was consistent with Hi-C chromatin contact evidence, supporting the claim that the architecture captures biologically meaningful regulatory signal.
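The "about 63%" figure is simply the ratio of the reported correlation to the estimated ceiling. The short sketch below reproduces that arithmetic and shows how a Pearson correlation between predicted and observed effects is computed. The helper name `pearson_r` and the toy effect values are hypothetical and are not data from the preprint; only the two constants 0.503 and 0.797 come from the text above.

```python
import numpy as np

def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length 1-D sequences."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    p = predicted - predicted.mean()
    o = observed - observed.mean()
    return float((p @ o) / np.sqrt((p @ p) * (o @ o)))

# Values reported in the text.
reported_r = 0.503   # model vs. measured perturbation effects
ceiling_r = 0.797    # estimated ceiling from cross-experiment variability

# Fraction of the attainable correlation reached by the model.
fraction_of_ceiling = reported_r / ceiling_r
print(f"fraction of ceiling: {fraction_of_ceiling:.3f}")

# Toy illustration only: made-up effect sizes, not data from the preprint.
toy_observed = [-0.8, -0.1, -0.5, 0.0, -0.3]
toy_predicted = [-0.6, -0.2, -0.3, -0.1, -0.4]
toy_r = pearson_r(toy_predicted, toy_observed)
```

Reporting performance relative to a noise ceiling matters here because replicate experiments themselves do not agree perfectly, so a correlation of 1.0 is not attainable even in principle.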
CDT is aimed at researchers interested in modeling regulatory information flow across DNA, RNA, and protein within a single framework, particularly for predicting the effects of genomic perturbations such as enhancer CRISPRi screens. Its Virtual Cell Embedding could serve as a feature representation for downstream functional genomics tasks, and its interpretability tooling makes it useful for hypothesis generation about transcriptional regulation, for example locating candidate regulatory elements like CTCF sites. As a v1 prototype it is best suited to methodological exploration rather than turnkey deployment.
CDT contributes a biologically structured alternative to generic multimodal fusion by encoding the directionality of the central dogma directly into model attention. Its early validation on K562 enhancer perturbation data and its interpretability results are promising signals for mechanism-oriented modeling of the cell. However, the work is a single-author preprint with a clearly stated proof-of-concept scope, fixed non-cell-specific embeddings in v1, and no public code, weights, or license located at the time of writing, so its broader adoption and influence remain to be demonstrated.