Genes can be described through many heterogeneous biological "views"—their genomic DNA sequence, transcript sequence, encoded protein sequence, accumulated textual knowledge, and single-cell expression context. In practice, embeddings derived from each of these sources are modality-specific: they live in different latent spaces, are hard to compare, and become unusable when some views are unavailable for a given gene. RepGene, introduced in a June 2026 bioRxiv preprint from BGI Research (Beijing), investigates a narrower but practical question—whether pretrained embeddings from these distinct sources can be organized into a single shared gene representation interface that remains usable under severe missing-modality conditions.

Rather than learning each modality from raw data, RepGene operates on frozen upstream embeddings and learns to fuse them into one latent space via self-supervised cross-view objectives. The authors are explicit that this is a feasibility study: they do not claim a new multimodal learning principle or superiority over all simpler fusion baselines, but instead provide an initial technical instantiation to test whether such a shared interface is workable in a fixed-feature setting. This framing positions RepGene as one of several recent efforts to build reusable, modality-agnostic gene representations, not as a finished foundation model.

Key Features

Five-view unification: Maps genomic sequence, transcript sequence, protein sequence, textual knowledge, and single-cell expression context into one shared latent space, so embeddings from different sources can be compared and reused.
Presence-aware fusion: A fusion mechanism that accounts for which views are actually observed, allowing the model to operate when only a subset of modalities is available at inference time.
Self-supervised cross-view training: Learns the shared interface from frozen upstream embeddings using cross-view objectives, with no labeled supervision in the pretraining stage.
Robustness to missing views: The strongest reported signal is stability under ablated modalities—average performance changes are often limited when a single view is removed, and even single-view inference remains informative in the evaluated benchmarks.
Lightweight single-branch design: Combines modality adapters and a shared encoder in a compact architecture rather than a large per-modality model stack.
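
To make the presence-aware fusion idea above concrete, here is a minimal sketch that assumes a masked mean over linearly adapted views. The summary does not specify the actual fusion rule, so the masked mean, the view names, the embedding sizes, and the random adapters are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical view names and upstream embedding sizes (not taken from the preprint).
VIEW_DIMS = {"dna": 16, "transcript": 12, "protein": 10, "text": 8, "expression": 6}
LATENT_DIM = 4

rng = np.random.default_rng(0)
# One linear adapter per view. Random here for illustration; learned in a real model.
ADAPTERS = {v: rng.normal(size=(d, LATENT_DIM)) / np.sqrt(d) for v, d in VIEW_DIMS.items()}


def fuse(view_embeddings, present):
    """Masked mean over adapted views (an assumed stand-in for presence-aware fusion).

    view_embeddings: dict view -> array (n_genes, view_dim); rows of missing views may hold anything.
    present: bool array (n_genes, n_views), True where the view is observed for that gene.
    """
    views = list(VIEW_DIMS)
    # Project every view into the shared space: (n_genes, n_views, LATENT_DIM).
    projected = np.stack([view_embeddings[v] @ ADAPTERS[v] for v in views], axis=1)
    mask = present[:, :, None]
    counts = mask.sum(axis=1)
    if np.any(counts == 0):
        raise ValueError("every gene needs at least one observed view")
    # np.where instead of multiplication, so NaN placeholders in missing views cannot leak.
    summed = np.where(mask, projected, 0.0).sum(axis=1)
    return summed / counts


n_genes = 3
embeddings = {v: rng.normal(size=(n_genes, d)) for v, d in VIEW_DIMS.items()}
present = np.ones((n_genes, len(VIEW_DIMS)), dtype=bool)
present[1] = [False, False, True, False, False]  # gene 1 has a protein embedding only
fused = fuse(embeddings, present)
print(fused.shape)
```

Because the mask decides which views contribute, a gene with a single observed view still receives a vector in the same latent space as a fully observed gene, which is the property the missing-view evaluations probe.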

Technical Details

RepGene is a lightweight single-branch framework built from four parts: per-modality adapters that project each upstream embedding into a common space, a shared encoder, a presence-aware fusion module, and self-supervised cross-view training objectives. It follows a two-stage protocol—RepGene is first trained self-supervised on frozen upstream embeddings, then evaluated by downstream linear probing on the resulting fixed features (a fixed-checkpoint inference pattern). In the full-modality setting the learned representation is reported as broadly competitive, and it remains informative when only partial modality subsets are observed. The authors caution that results should be read in light of incomplete simple-fusion baselines, limited architectural ablation, benchmark dependence, and possible upstream feature exposure; specific upstream embedding models, dataset scale, and parameter counts are not extractable from the abstract and should be confirmed against the full text. As of this version-1 preprint, no public code or model weights were located.
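
The self-supervised stage described above can be illustrated with a cross-view contrastive loss. The summary does not name the objectives RepGene uses, so the symmetric InfoNCE loss below is an assumed, commonly used stand-in; the function name, temperature, and synthetic inputs are hypothetical.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two views of the same genes.

    z_a, z_b: arrays (n_genes, dim); row i of each is assumed to describe the same gene.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # scaled cosine similarities, shape (n, n)

    def diagonal_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))  # matching gene sits on the diagonal

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))  # synthetic stand-in for adapted view embeddings
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
shuffled = info_nce(z, np.roll(z, 1, axis=0))  # gene rows deliberately mismatched
print(aligned, shuffled)
```

The loss is low when row i of one view is closest to row i of the other and high when rows are mismatched, so minimizing it pulls different views of the same gene together in the shared space.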

Applications

RepGene is aimed at computational biologists who want a single, reusable gene representation that can be queried even when only some biological views exist for a gene—a common situation in heterogeneous gene-annotation and integration pipelines. Because downstream use is frozen-feature linear probing, the learned embeddings can be dropped into lightweight classifiers for tasks such as gene-property prediction without retraining the underlying model. Its missing-modality robustness is most relevant where genomic, transcript, protein, textual, and single-cell information are unevenly available across a gene set.
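
The frozen-feature linear probing mentioned above can be sketched as follows. The data here is synthetic and merely stands in for fixed gene embeddings; the ridge-regression probe, the function names, and the labels are illustrative assumptions, and the probe type used in the preprint is not stated in this summary.

```python
import numpy as np


def fit_linear_probe(features, labels, l2=1e-2):
    """Fit a ridge-regression probe on frozen features; labels are 0 or 1."""
    x = np.hstack([features, np.ones((len(features), 1))])  # append a bias column
    y = 2.0 * np.asarray(labels, dtype=float) - 1.0  # map {0, 1} to {-1, +1}
    # Closed-form ridge solution; the embeddings themselves are never updated.
    return np.linalg.solve(x.T @ x + l2 * np.eye(x.shape[1]), x.T @ y)


def predict(weights, features):
    x = np.hstack([features, np.ones((len(features), 1))])
    return (x @ weights > 0).astype(int)


rng = np.random.default_rng(0)
frozen = rng.normal(size=(200, 8))  # synthetic stand-in for fixed gene embeddings
hidden_direction = rng.normal(size=8)  # hypothetical gene property, linear in the features
labels = (frozen @ hidden_direction > 0).astype(int)

weights = fit_linear_probe(frozen[:150], labels[:150])
accuracy = np.mean(predict(weights, frozen[150:]) == labels[150:])
print(accuracy)
```

Only the small weight vector is fitted, which is why a probe like this is cheap to retrain for each downstream task while the representation stays fixed.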

Impact

RepGene contributes to the growing line of work on unified, modality-agnostic biological representations, and its emphasis on graceful degradation under missing views is a useful framing for real-world integration. Its near-term reach is limited: the authors position it explicitly as a feasibility study and starting point rather than a resolved method, no weights or code were found, and the results await peer review, stronger baselines, broader benchmarks, and leakage-aware validation. As a June 2026 preprint, its claims should be treated as preliminary.

Citation

RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views
