DamageFormer

Multimodal framework that detects and localizes DNA lesions from native nanopore signal, built on the damage-aware LesionBERT foundation model.

Released: May 2026

DNA lesions—chemically altered or damaged bases arising from oxidation, alkylation, UV exposure, and other insults—are central to mutagenesis, aging, and disease, yet they are difficult to map directly because conventional sequencing chemistries are blind to most non-canonical base modifications. DamageFormer addresses this gap with a multimodal deep-learning framework that detects and localizes DNA lesions directly from native (PCR-free) nanopore sequencing reads, exploiting the subtle perturbations a damaged base imparts on the raw ionic current signal as DNA translocates through the pore.

Developed by Yang, Li, Ma, and Yin at the University of Florida's Department of Health Outcomes & Biomedical Informatics (HOBI / Yin Lab) and posted to bioRxiv in May 2026, DamageFormer pairs a damage-aware genomic foundation model with a dedicated nanopore signal encoder. Its core component, LesionBERT, is fine-tuned from DNABERT-2 using lesion-focused masked-reconstruction objectives so that the language-model representation becomes sensitive to the sequence context surrounding damage. This sequence representation is then fused with raw-signal features to produce per-position lesion predictions.

What distinguishes DamageFormer is its generalization: the model transfers zero-shot to chemically distinct lesion types it never saw during training, reaching an AUROC of 0.99997 on held-out damage chemistries. This suggests the framework learns transferable signatures of structural distortion rather than memorizing individual lesion fingerprints.
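To make the reported metric concrete, the following minimal sketch computes AUROC from its pairwise definition: the fraction of (lesion, non-lesion) pairs in which the lesion site receives the higher score, with ties counted as half. The function name `auroc` and the toy labels and scores are illustrative assumptions and are not taken from the DamageFormer code or data.

```python
def auroc(labels, scores):
    """Pairwise AUROC: P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    This O(n_pos * n_neg) form is meant for illustration, not large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 = lesion site, 0 = undamaged site.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1]

# 11 of the 12 positive/negative pairs are ordered correctly.
print(auroc(labels, scores))
```

Under this definition, and ignoring ties, an AUROC of 0.99997 corresponds to roughly 3 misordered pairs per 100,000 lesion/non-lesion pairs.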

Key Features

Native nanopore detection: Operates directly on raw, PCR-free nanopore signal, so labile and non-canonical lesions are preserved rather than erased by amplification or bisulfite-style conversion chemistry.
LesionBERT foundation model: A damage-aware genomic encoder fine-tuned from DNABERT-2 with lesion-focused masked-reconstruction objectives plus a LoRA adapter, enabling efficient specialization without retraining the full backbone.
Multimodal adaptive gating: A CNN/BiLSTM signal encoder is fused with the LesionBERT sequence representation through an adaptive gating mechanism that learns how much to weight each modality per position.
Zero-shot cross-chemistry generalization: Detects chemically distinct lesion types absent from training, achieving AUROC 0.99997 and indicating the model captures general damage signatures.
Per-position localization: Beyond binary detection, the framework localizes lesions along the read, supporting fine-grained damage mapping.
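The adaptive gating listed above can be illustrated with a common sigmoid-gate formulation. This is a sketch under stated assumptions rather than the authors' implementation: the source does not give the gating equations, so the function `gated_fusion`, the weight shapes, and the random embeddings below are all hypothetical stand-ins.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(seq_emb, sig_emb, W, b):
    """Fuse two per-position embeddings with a learned gate (assumed formulation).

    seq_emb, sig_emb: arrays of shape (n_positions, d)
    W: gate weights of shape (2 * d, d); b: gate bias of shape (d,)
    Returns (fused, gate), both of shape (n_positions, d).
    """
    concat = np.concatenate([seq_emb, sig_emb], axis=-1)  # (n_positions, 2d)
    gate = sigmoid(concat @ W + b)                        # values in (0, 1)
    # gate near 1 favors the sequence branch, near 0 favors the signal branch.
    fused = gate * seq_emb + (1.0 - gate) * sig_emb
    return fused, gate


# Random stand-ins for LesionBERT and CNN/BiLSTM outputs (illustrative only).
rng = np.random.default_rng(0)
n_positions, d = 5, 8
seq_emb = rng.normal(size=(n_positions, d))
sig_emb = rng.normal(size=(n_positions, d))
W = rng.normal(scale=0.1, size=(2 * d, d))
b = np.zeros(d)

fused, gate = gated_fusion(seq_emb, sig_emb, W, b)
print(fused.shape, float(gate.min()), float(gate.max()))
```

Because the gate is computed from both embeddings at each position, the weighting can differ from one base to the next, which matches the per-position behavior the feature list describes.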

Technical Details

DamageFormer is a two-branch architecture. The sequence branch, LesionBERT, inherits the DNABERT-2 transformer backbone and is adapted via lesion-focused masked-language modeling together with a low-rank (LoRA) adapter for parameter-efficient fine-tuning. The signal branch encodes raw nanopore current with convolutional layers followed by a bidirectional LSTM to capture local and sequential dependencies in the translocation trace. The two modality embeddings are combined by an adaptive gating module that dynamically weights sequence-context versus signal evidence before a prediction head emits lesion calls. Inference is run through inference_multimodal.py, which loads a trained model from a --checkpoint together with the foundation-model weights specified by --pretrained_dir. On zero-shot evaluation against lesion chemistries excluded from training, the framework reports an AUROC of 0.99997.
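The inference entry point described above could be wrapped as follows. Only the script name `inference_multimodal.py` and the `--checkpoint` and `--pretrained_dir` flags come from the description. The helper `build_inference_command` and both paths are hypothetical placeholders, and the script may require further arguments (for example, input reads) that are not documented here, so the repository should be consulted before running it.

```python
import shlex
import sys


def build_inference_command(checkpoint, pretrained_dir, extra_args=()):
    """Assemble the argument list for inference_multimodal.py.

    Only --checkpoint and --pretrained_dir are taken from the description;
    anything passed in extra_args is the caller's responsibility.
    """
    cmd = [
        sys.executable,
        "inference_multimodal.py",
        "--checkpoint", str(checkpoint),
        "--pretrained_dir", str(pretrained_dir),
    ]
    cmd.extend(extra_args)
    return cmd


# Hypothetical paths, shown only to illustrate the shape of the call.
cmd = build_inference_command(
    checkpoint="checkpoints/damageformer.pt",
    pretrained_dir="pretrained/LesionBERT",
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` from the repository root would launch the script. Building the command as a list avoids shell-quoting problems when paths contain spaces.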

Applications

DamageFormer is aimed at researchers studying genome integrity, DNA-repair biology, mutagenesis, environmental and chemical genotoxicity, and aging, where knowing precisely where lesions occur is essential. Because it reads native nanopore signal rather than amplified DNA, it suits workflows that must preserve fragile modifications, and its zero-shot transfer makes it attractive for surveying novel or uncharacterized damage chemistries without assembling new labeled training sets for each.

Impact

DamageFormer demonstrates that pairing a damage-aware genomic foundation model with raw-signal encoding can turn standard nanopore sequencing into a direct DNA-damage assay, and its strong cross-chemistry generalization points toward a single tool that maps diverse lesion types. Because it is a recent (May 2026) bioRxiv preprint, real-world adoption and independent benchmarking remain to be established, and the near-perfect reported AUROC warrants validation on broader, biologically realistic datasets. The code is released under the MIT license on GitHub. The pretrained weights are stated to be available but are not yet present in the repository tree (which currently contains only source code), and no separate license is specified for the weights themselves.

Citation

DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing

Yang, Q., et al. (2026) DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing. bioRxiv.

DOI: 10.64898/2026.05.14.725245

Citations

Total Citations: 0

Influential: 0

References: 90

GitHub

Stars: 1

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 2 months ago

Language: Python

License: MIT

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 45 (Partial)

Usability (can I run it?): 64

Reproducibility (can I retrain it?): 19

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository, Research Paper
