A lightweight self-supervised framework that maps five biological gene views into one shared latent space, designed to stay usable when modalities are missing at inference.
Genes can be described through many heterogeneous biological "views"—their genomic DNA sequence, transcript sequence, encoded protein sequence, accumulated textual knowledge, and single-cell expression context. In practice, embeddings derived from each of these sources are modality-specific: they live in different latent spaces, are hard to compare, and become unusable when some views are unavailable for a given gene. RepGene, introduced in a June 2026 bioRxiv preprint from BGI Research (Beijing), investigates a narrower but practical question—whether pretrained embeddings from these distinct sources can be organized into a single shared gene representation interface that remains usable under severe missing-modality conditions.
Rather than learning each modality from raw data, RepGene operates on frozen upstream embeddings and learns to fuse them into one latent space via self-supervised cross-view objectives. The authors are explicit that this is a feasibility study: they do not claim a new multimodal learning principle or superiority over all simpler fusion baselines, but instead provide an initial technical instantiation to test whether such a shared interface is workable in a fixed-feature setting. This candid framing positions RepGene as one of several recent efforts to build reusable, modality-agnostic gene representations, not as a finished foundation model.
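The summary does not say which cross-view objective RepGene uses. As an illustration only, the sketch below assumes a contrastive InfoNCE-style loss, a common choice for aligning two views of the same item. The function names, the temperature, and the toy vectors are all hypothetical and are not taken from the preprint.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def info_nce(view_a, view_b, temperature=0.1):
    """Mean cross-entropy of picking gene i's view_b embedding among all genes, given its view_a embedding."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(view_a)

# Three genes, two views already projected into a shared 2-d space (made-up numbers).
protein_view = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_view_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Rotating the rows pairs every gene with the wrong text embedding.
text_view_shuffled = [text_view_aligned[1], text_view_aligned[2], text_view_aligned[0]]

loss_aligned = info_nce(protein_view, text_view_aligned)
loss_shuffled = info_nce(protein_view, text_view_shuffled)
```

Under this assumed objective, the loss is low when the two views of each gene sit close together in the shared space and rises when they are mismatched, which is the pressure that would pull separate modalities into one latent space.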
RepGene is a lightweight single-branch framework built from four parts: per-modality adapters that project each upstream embedding into a common space, a shared encoder, a presence-aware fusion module, and self-supervised cross-view training objectives. It follows a two-stage protocol: RepGene is first trained with self-supervision on frozen upstream embeddings, then evaluated by downstream linear probing on the resulting fixed features (the trained checkpoint stays fixed at inference). In the full-modality setting the learned representation is reported as broadly competitive, and it remains informative when only partial modality subsets are observed. The authors caution that results should be read in light of incomplete simple-fusion baselines, limited architectural ablation, benchmark dependence, and possible upstream feature exposure. The specific upstream embedding models, dataset scale, and parameter counts cannot be determined from the abstract and should be confirmed against the full text. As of this version-1 preprint, no public code or model weights had been located.
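To make the adapter and presence-aware fusion steps concrete, here is a minimal sketch under stated assumptions. The actual fusion rule is not described in the summary, so a masked mean over the views that are present stands in for it, and the shared encoder is omitted. The view names, dimensions, random adapter weights, and function names are all hypothetical.

```python
import random

# Hypothetical view names and input widths; the source does not list upstream models or sizes.
VIEW_DIMS = {"dna": 6, "transcript": 5, "protein": 4, "text": 8, "single_cell": 3}
LATENT_DIM = 4

def make_adapter(in_dim, out_dim, rng):
    """Random linear projection standing in for a trained per-modality adapter."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, vector):
    """Apply a linear adapter: one dot product per latent dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def fuse(views, adapters):
    """Average the projected views that are present; None marks a missing view."""
    projected = [
        project(adapters[name], vec) for name, vec in views.items() if vec is not None
    ]
    if not projected:
        raise ValueError("at least one view must be present")
    return [sum(col) / len(projected) for col in zip(*projected)]

rng = random.Random(0)
adapters = {name: make_adapter(dim, LATENT_DIM, rng) for name, dim in VIEW_DIMS.items()}
full = {name: [rng.gauss(0, 1) for _ in range(dim)] for name, dim in VIEW_DIMS.items()}
partial = {**full, "text": None, "single_cell": None}  # two views missing

z_full = fuse(full, adapters)        # all five views contribute
z_partial = fuse(partial, adapters)  # same latent width, fewer contributors
```

Because the average is taken only over present views, the output keeps the same dimensionality whichever subset is available, which is the property a downstream consumer of a shared representation needs.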
RepGene is aimed at computational biologists who want a single, reusable gene representation that can be queried even when only some biological views exist for a gene—a common situation in heterogeneous gene-annotation and integration pipelines. Because downstream use is frozen-feature linear probing, the learned embeddings can be dropped into lightweight classifiers for tasks such as gene-property prediction without retraining the underlying model. Its missing-modality robustness is most relevant where genomic, transcript, protein, textual, and single-cell information are unevenly available across a gene set.
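Frozen-feature linear probing can be sketched as follows: the embeddings are treated as fixed inputs and only a linear classifier is fitted on top. This is an illustrative example, not the authors' evaluation code. The two-cluster toy data stands in for real gene embeddings, and the function names, learning rate, and epoch count are assumptions.

```python
import math
import random

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression by gradient descent; the features themselves are never updated."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-logit)) - y  # predicted probability minus label
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the linear score is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

def sample(center, rng):
    """Draw one toy embedding around a cluster center."""
    return [rng.gauss(c, 0.3) for c in center]

# Toy stand-in for frozen gene embeddings: two well-separated clusters, one per class.
rng = random.Random(0)
features = [sample([1.0, 1.0, 0.0], rng) for _ in range(20)]
features += [sample([-1.0, -1.0, 0.0], rng) for _ in range(20)]
labels = [1] * 20 + [0] * 20

weights, bias = train_linear_probe(features, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(features, labels)) / len(labels)
```

In practice the probe would be scored on held-out genes, not on the training set as in this toy example. The point is only that nothing upstream of the classifier is retrained.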
RepGene contributes to the growing line of work on unified, modality-agnostic biological representations, and its emphasis on graceful degradation under missing views is a useful framing for real-world integration. Its near-term reach is limited: the authors position it explicitly as a feasibility study and starting point rather than a resolved method, no weights or code were found, and the results await peer review, stronger baselines, broader benchmarks, and leakage-aware validation. As a June 2026 preprint, its claims should be treated as preliminary.
Hou, H., et al. (2026) RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views. openRxiv.
DOI: 10.64898/2026.06.11.731512