Multimodal deep-learning framework that detects and localizes DNA lesions directly from native nanopore sequencing, built on the damage-aware LesionBERT foundation model.
DNA lesions (chemically altered or damaged bases arising from oxidation, alkylation, UV exposure, and other insults) are central to mutagenesis, aging, and disease, yet they are difficult to map directly because conventional sequencing chemistries are blind to most non-canonical base modifications. DamageFormer addresses this gap with a multimodal deep-learning framework that detects and localizes DNA lesions directly from native (PCR-free) nanopore sequencing reads, exploiting the subtle perturbations a damaged base imparts on the raw ionic current signal as DNA translocates through the pore.
Developed by Yang, Li, Ma, and Yin at the University of Florida's Department of Health Outcomes & Biomedical Informatics (HOBI / Yin Lab) and posted to bioRxiv in May 2026, DamageFormer pairs a damage-aware genomic foundation model with a dedicated nanopore signal encoder. Its core component, LesionBERT, is fine-tuned from DNABERT-2 using lesion-focused masked-reconstruction objectives so that the language-model representation becomes sensitized to the sequence context surrounding damage. The resulting sequence representation is then fused with raw-signal features to produce per-position lesion predictions.
What distinguishes DamageFormer is its generalization: the model transfers zero-shot to chemically distinct lesion types it never saw during training, reaching an AUROC of 0.99997 on held-out damage chemistries. This suggests the framework learns transferable signatures of structural distortion rather than memorizing individual lesion fingerprints.
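
To make the reported metric concrete, here is a minimal sketch of how an AUROC can be computed from per-position labels and scores, using its pairwise-ranking definition. The function name `auroc` and the toy labels and scores are illustrative assumptions; they are not taken from the DamageFormer code or data.

```python
def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive (lesion)
    position scores higher than a randomly chosen negative one.
    Ties count as half a win. O(P * N) pairwise version, for clarity only."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 3 lesion positions (label 1) and 4 undamaged positions (label 0).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10]
toy_auroc = auroc(labels, scores)  # 11 of the 12 lesion/undamaged pairs are ordered correctly
print(f"toy AUROC = {toy_auroc:.4f}")
```

Read this way, an AUROC of 0.99997 means that roughly 3 in every 100,000 lesion/undamaged pairs are ranked in the wrong order. The figure says nothing by itself about performance at a particular decision threshold or under heavy class imbalance.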
DamageFormer is a two-branch architecture. The sequence branch, LesionBERT, inherits the DNABERT-2 transformer backbone and is adapted via lesion-focused masked-language modeling together with a low-rank (LoRA) adapter for parameter-efficient fine-tuning. The signal branch encodes the raw nanopore current with convolutional layers followed by a bidirectional LSTM, capturing local and sequential dependencies in the translocation trace. An adaptive gating module combines the two modality embeddings, dynamically weighting sequence-context evidence against signal evidence before a prediction head emits lesion calls.
Inference is run through inference_multimodal.py, which loads a trained model from the path given by --checkpoint together with the foundation-model weights in the directory given by --pretrained_dir. On zero-shot evaluation against lesion chemistries excluded from training, the authors report an AUROC of 0.99997.
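
The adaptive gating step described above can be pictured with a small sketch. The source does not give the exact formulation, so the form used here (a sigmoid gate computed from the concatenated embeddings, followed by an element-wise convex combination) is an assumption, as are the names `gated_fusion`, `seq_emb`, and `sig_emb` and the random toy weights. It is written in plain Python rather than in whatever deep-learning framework the released code uses.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(seq_emb, sig_emb, weights, bias):
    """Hypothetical gated fusion of two same-sized embeddings.

    gate  = sigmoid(W @ concat(seq_emb, sig_emb) + b)   (one value per dimension)
    fused = gate * seq_emb + (1 - gate) * sig_emb       (element-wise)
    """
    if len(seq_emb) != len(sig_emb):
        raise ValueError("both embeddings must have the same dimension")
    concat = list(seq_emb) + list(sig_emb)
    gate = [
        sigmoid(sum(w * c for w, c in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]
    fused = [g * s + (1.0 - g) * x for g, s, x in zip(gate, seq_emb, sig_emb)]
    return fused, gate


# Toy inputs: random stand-ins for the sequence-branch and signal-branch embeddings.
rng = random.Random(0)
dim = 4
seq_emb = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
sig_emb = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
weights = [[rng.gauss(0.0, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

fused, gate = gated_fusion(seq_emb, sig_emb, weights, bias)
print("gate :", [round(g, 3) for g in gate])
print("fused:", [round(f, 3) for f in fused])
```

Because each gate value lies strictly between 0 and 1, every fused dimension is an interpolation between the sequence and signal embeddings; in a trained model the gate weights would be learned, letting the network lean on whichever modality is more informative at a given position.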
DamageFormer is aimed at researchers studying genome integrity, DNA-repair biology, mutagenesis, environmental and chemical genotoxicity, and aging, where knowing precisely where lesions occur is essential. Because it reads native nanopore signal rather than amplified DNA, it suits workflows that must preserve fragile modifications, and its zero-shot transfer makes it attractive for surveying novel or uncharacterized damage chemistries without assembling new labeled training sets for each.
DamageFormer demonstrates that pairing a damage-aware genomic foundation model with raw-signal encoding can turn standard nanopore sequencing into a direct DNA-damage assay, and its strong cross-chemistry generalization points toward a single tool that maps diverse lesion types. As a recent (May 2026) bioRxiv preprint, its real-world adoption and independent benchmarking remain to be established, and the near-perfect reported AUROC warrants validation on broader, biologically realistic datasets. The code is released under the MIT license on GitHub; the pretrained weights are stated to be available but are not yet present in the repository tree (which currently contains only source code), and no separate license is specified for the weights themselves.
Yang, Q., et al. (2026) DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing. bioRxiv.
DOI: 10.64898/2026.05.14.725245