JEPA-DNA

Genomic foundation model training framework whose joint-embedding predictive objective learns to predict functional representations of masked DNA segments rather than the nucleotides themselves.

Released: February 2026

Most genomic foundation models are trained with token-level objectives (masked or autoregressive prediction of nucleotides) that force the model to recover exact sequence content. That signal can over-emphasize low-level token statistics at the expense of a region's functional meaning. JEPA-DNA, described in a February 2026 arXiv preprint from NVIDIA's digital biology group, brings the Joint-Embedding Predictive Architecture (JEPA) idea from representation learning into genomics: instead of reconstructing masked nucleotides, the model predicts the functional representations of masked genomic segments, shifting the learning signal from token recovery to semantic alignment.

JEPA-DNA is not a single new architecture trained from scratch. It is a training framework that combines this joint-embedding predictive objective with conventional generative objectives, and it can be applied as continual training to existing backbones. The authors report that this produces genomic foundation-model checkpoints with improved performance across a broad evaluation suite, and they release the training and benchmarking code under Apache 2.0.
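One way to write down such a combined objective is sketched below. The notation is an assumption made for illustration and is not taken from the paper: f_θ is the context encoder, f_θ̄ the target encoder, g_φ the predictor, M the set of masked positions, sg[·] a stop-gradient, and λ a weighting factor. The squared-distance form and the weighting scheme follow generic JEPA formulations, and the preprint may define its loss differently.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{gen}} + \lambda \, \mathcal{L}_{\text{JEPA}},
\qquad
\mathcal{L}_{\text{JEPA}}
  = \frac{1}{|M|} \sum_{i \in M}
    \left\lVert
      g_\phi\!\big(f_\theta(x_{\setminus M})\big)_i
      - \operatorname{sg}\!\big[f_{\bar\theta}(x)\big]_i
    \right\rVert_2^2
```

Here the generative term is the usual masked or autoregressive nucleotide loss, and the JEPA term penalizes the distance between predicted and target embeddings at the masked positions only.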

Key Features

Joint-embedding predictive objective: Predicts learned representations of masked genomic segments instead of raw nucleotides, emphasizing functional semantics over token recovery.
Hybrid training signal: Combines the JEPA objective with traditional generative objectives within a single training framework.
Applicable to existing backbones: Implemented as continual training, with preconfigured setups for backbones such as DNABERT-2, NTv3, and HyenaDNA.
Broad benchmark gains: Reports improvements across 17 genomic benchmarks, which the authors describe as state-of-the-art among genomic foundation models.
Open code (Apache 2.0): Pretraining and benchmarking code is publicly released.

Technical Details

JEPA-DNA augments genomic pretraining with a joint-embedding predictive objective: a context encoder and a target encoder produce representations, and a predictor learns to map the representation of the masked context to the target encoder's functional embeddings. This objective complements the generative losses rather than replacing them. The released repository provides run_jepa_pretrain.py for pretraining and integrates with a separate GFMBench-API for evaluation, with preconfigured parameter files for the DNABERT-2, NTv3, and HyenaDNA backbones. Training produces context-encoder, target-encoder, and predictor checkpoints. Across 17 genomic benchmarks, the framework is reported to set state-of-the-art results for genomic foundation models. The repository ships code under Apache 2.0 but does not release pretrained checkpoints, and training-data sources are configurable rather than fixed; specific corpora and quantitative results should be confirmed against the paper.
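The interplay of the three components can be illustrated with a toy sketch in plain Python. This is not the repository's code: every function name here is hypothetical, the "encoders" are single linear maps over base frequencies, the target encoder embeds only the masked segment, and the exponential-moving-average (EMA) update of the target encoder is an assumption borrowed from generic JEPA setups rather than something the source confirms. Gradient updates are omitted entirely.

```python
import math

VOCAB = "ACGT"

def identity(n):
    """Build an n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def linear(weights, x):
    """Apply a dense layer (list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def base_frequencies(seq):
    """Toy input featurizer: fraction of A, C, G, T in the sequence."""
    return [seq.count(base) / len(seq) for base in VOCAB]

def ema_update(target_w, context_w, decay):
    """Assumed target-encoder update: target <- decay*target + (1-decay)*context."""
    return [[decay * t + (1.0 - decay) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_w, context_w)]

def jepa_loss(predicted, target):
    """Mean squared error between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def cross_entropy(logits, label):
    """Token-level generative loss for one masked nucleotide."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]

def hybrid_step(seq, start, end, context_w, target_w, predictor_w,
                logits, jepa_weight):
    """Compute generative loss + jepa_weight * embedding-prediction loss."""
    visible = seq[:start] + seq[end:]          # context the model may see
    masked = seq[start:end]                    # segment whose embedding is predicted
    context_emb = linear(context_w, base_frequencies(visible))
    target_emb = linear(target_w, base_frequencies(masked))  # no gradient in practice
    predicted_emb = linear(predictor_w, context_emb)
    # Generative term: score the first masked base from the given logits.
    generative = cross_entropy(logits, VOCAB.index(masked[0]))
    return generative + jepa_weight * jepa_loss(predicted_emb, target_emb)

if __name__ == "__main__":
    eye = identity(4)
    loss = hybrid_step("AAAACCCC", 4, 8, eye, eye, eye,
                       [0.0, 0.0, 0.0, 0.0], jepa_weight=1.0)
    print(f"hybrid loss: {loss:.4f}")
    # After an optimizer step on the context encoder, the target encoder
    # would be nudged toward it (assumed EMA schedule, decay chosen arbitrarily).
    new_target = ema_update(eye, [[0.0] * 4 for _ in range(4)], decay=0.99)
    print(new_target[0])
```

The point of the sketch is the shape of the computation: the embedding-prediction term compares vectors rather than nucleotides, while the generative term still scores tokens, and the two are summed into one training signal.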

Applications

JEPA-DNA is primarily a recipe for improving genomic foundation models, so its main beneficiaries are groups that train or fine-tune DNA models and want stronger, more functionally grounded representations for downstream tasks such as regulatory element classification, variant effect prediction, and other tasks covered by GFMBench-style benchmarks. Because it operates as continual training over existing backbones, teams can upgrade models they already use rather than retraining from scratch.

Impact

JEPA-DNA imports a representation-learning paradigm that has reshaped vision and speech into genomics, arguing that predicting functional embeddings is a better objective than reconstructing tokens for DNA. Its broad reported benchmark gains and open Apache-2.0 code make the approach easy to evaluate, though the absence of released checkpoints means practitioners must run the training themselves. As a February 2026 preprint, its conclusions await peer review and independent replication.

Citation

JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures

Preprint

Larey, A., et al. (2026) JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures. arXiv.org.

DOI: 10.48550/arXiv.2602.17162

Recent citations

Papers that recently cited this model.

MorphologyFM: A Foundation Model for Morphology-Aware Representation Learning from ECG and Pulse Oximetry Waveforms
Saiyang Feng, Yuanyu Zhang, Shi Li. Jul 2026. Citations: 0

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models
Yuanyu Zhang, Shi Li. May 2026. Citations: 0

WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records
R. Dong, Yuanyu Zhang, Shi Li. May 2026. Citations: 0

Top citations

The most-cited papers that cite this model.

Discriminative Representation Learning for Clinical Prediction
Yang Zhang, Lianyi Fan, Sam Lawrence, et al. Mar 2026. Citations: 2

Learning Clinical Representations Under Systematic Distribution Shift
Yuanyu Zhang, Shi Li. Mar 2026. Citations: 2

Uncertainty-Aware Foundation Models for Clinical Data
Qian Zhou, Yuanyu Zhang, Shi Li. Apr 2026. Citations: 1

MorphologyFM: A Foundation Model for Morphology-Aware Representation Learning from ECG and Pulse Oximetry Waveforms
Saiyang Feng, Yuanyu Zhang, Shi Li. Jul 2026. Citations: 0

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models
Yuanyu Zhang, Shi Li. May 2026. Citations: 0

Citations

Total Citations: 8
Influential: 0
References: 41

GitHub

Stars: 13
Forks: 2
Open Issues: 0
Contributors: 1
Last Push: 17h ago
Language: Python
License: Apache-2.0

Fields of citing research

Computer Science: 100%
Medicine: 75%
Biology: 13%
Mathematics: 13%
Engineering: 13%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 54 (Partial)
Usability (can I run it?): 58
Reproducibility (can I retrain it?): 62
Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository
Research Paper


