bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.

RepGene

BGI Research

A lightweight self-supervised framework that maps five biological gene views into one shared latent space, designed to stay usable when modalities are missing at inference.

Released: June 2026

Genes can be described through several heterogeneous biological "views": genomic DNA sequence, transcript sequence, encoded protein sequence, accumulated textual knowledge, and single-cell expression context. In practice, embeddings derived from each source are modality-specific: they live in different latent spaces, are hard to compare, and become unusable when some views are unavailable for a given gene. RepGene, introduced in a June 2026 bioRxiv preprint from BGI Research (Beijing), investigates a narrow but practical question: whether pretrained embeddings from these distinct sources can be organized into a single shared gene representation interface that remains usable under severe missing-modality conditions.

Rather than learning each modality from raw data, RepGene operates on frozen upstream embeddings and learns to fuse them into one latent space via self-supervised cross-view objectives. The authors are explicit that this is a feasibility study: they do not claim a new multimodal learning principle or superiority over all simpler fusion baselines, but instead provide an initial technical instantiation to test whether such a shared interface is workable in a fixed-feature setting. This candid framing positions RepGene as one of several recent efforts to build reusable, modality-agnostic gene representations, not as a finished foundation model.

#Key Features

  • Five-view unification: Maps genomic sequence, transcript sequence, protein sequence, textual knowledge, and single-cell expression context into one shared latent space, so embeddings from different sources can be compared and reused.
  • Presence-aware fusion: A fusion mechanism that accounts for which views are actually observed, allowing the model to operate when only a subset of modalities is available at inference time.
  • Self-supervised cross-view training: Learns the shared interface from frozen upstream embeddings using cross-view objectives, with no labeled supervision in the pretraining stage.
  • Robustness to missing views: The strongest reported signal is stability under ablated modalities—average performance changes are often limited when a single view is removed, and even single-view inference remains informative in the evaluated benchmarks.
  • Lightweight single-branch design: Combines modality adapters and a shared encoder in a compact architecture rather than a large per-modality model stack.
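
The presence-aware idea in the list above can be illustrated with a minimal sketch. The source does not describe RepGene's actual fusion mechanism, so the masked mean below is only a stand-in, and the view names, vector values, and the function `presence_aware_mean` are hypothetical. It assumes every view has already been projected into a common dimension.

```python
from typing import Dict, List, Optional

# Hypothetical names for the five views; not taken from the preprint.
VIEWS = ["dna", "transcript", "protein", "text", "single_cell"]


def presence_aware_mean(views: Dict[str, Optional[List[float]]]) -> List[float]:
    """Average only the views that are actually observed (present and not None)."""
    observed = [views[name] for name in VIEWS if views.get(name) is not None]
    if not observed:
        raise ValueError("at least one view must be observed")
    dim = len(observed[0])
    if any(len(vec) != dim for vec in observed):
        raise ValueError("all observed views must share one dimension")
    # Divide by the number of observed views, not by the total number of views.
    return [sum(vec[i] for vec in observed) / len(observed) for i in range(dim)]


# Toy vectors for illustration only.
full = {
    "dna": [1.0, 0.0],
    "transcript": [0.0, 1.0],
    "protein": [1.0, 1.0],
    "text": [0.0, 0.0],
    "single_cell": [3.0, 3.0],
}
partial = {"dna": [1.0, 0.0], "protein": [1.0, 1.0]}

print(presence_aware_mean(full))     # [1.0, 1.0]
print(presence_aware_mean(partial))  # [1.0, 0.5]
```

Because the divisor is the count of observed views, a gene with two views and a gene with five both yield a vector on the same scale, which is the property that makes partial-modality inference possible in this toy setting.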

#Technical Details

RepGene is a lightweight single-branch framework built from four parts: per-modality adapters that project each upstream embedding into a common space, a shared encoder, a presence-aware fusion module, and self-supervised cross-view training objectives. It follows a two-stage protocol: RepGene is first trained with self-supervision on frozen upstream embeddings, then evaluated by downstream linear probing on the resulting fixed features (a fixed-checkpoint inference pattern). In the full-modality setting the learned representation is reported as broadly competitive, and it remains informative when only partial modality subsets are observed. The authors caution that results should be read in light of incomplete simple-fusion baselines, limited architectural ablation, benchmark dependence, and possible upstream feature exposure. The specific upstream embedding models, dataset scale, and parameter counts cannot be extracted from the abstract and should be confirmed against the full text. As of this version-1 preprint, no public code or model weights were located.
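
To make the adapter and cross-view pieces concrete, here is a rough sketch under stated assumptions. The preprint's actual adapters, encoder, and loss functions are not available from the abstract, so everything here is hypothetical: the upstream dimensions, the randomly initialized linear adapters, and the pairwise cosine agreement term used as a stand-in for a cross-view objective. The shared encoder and fusion module are omitted.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def make_adapter(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Random linear map (out_dim x in_dim); a stand-in for a trained adapter."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(adapter: Matrix, x: Vector) -> Vector:
    """Apply a linear adapter to one upstream embedding."""
    return [sum(w * v for w, v in zip(row, x)) for row in adapter]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero norms


def cross_view_loss(projected: Dict[str, Vector]) -> float:
    """Mean (1 - cosine) over all pairs of observed views for one gene."""
    names = sorted(projected)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        return 0.0  # a single view has nothing to agree with
    return sum(1.0 - cosine(projected[a], projected[b]) for a, b in pairs) / len(pairs)


rng = random.Random(0)
upstream_dims = {"dna": 8, "protein": 6, "text": 4}  # hypothetical sizes
shared_dim = 5                                       # hypothetical shared width

adapters = {name: make_adapter(d, shared_dim, rng) for name, d in upstream_dims.items()}
# Random vectors standing in for frozen upstream embeddings of one gene.
gene_views = {name: [rng.gauss(0.0, 1.0) for _ in range(d)] for name, d in upstream_dims.items()}
projected = {name: project(adapters[name], vec) for name, vec in gene_views.items()}

loss = cross_view_loss(projected)
print(f"cross-view loss for one gene: {loss:.3f}")
```

In a real training loop the adapters would be updated to reduce a term like this across many genes while the upstream embeddings stay frozen; the sketch only evaluates the term once to show the data flow from differently sized inputs into one shared space.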

#Applications

RepGene is aimed at computational biologists who want a single, reusable gene representation that can be queried even when only some biological views exist for a gene, a common situation in heterogeneous gene-annotation and integration pipelines. Because downstream evaluation uses linear probing on frozen features, the learned embeddings can be dropped into lightweight classifiers for tasks such as gene-property prediction without retraining the underlying model. Its missing-modality robustness is most relevant where genomic, transcript, protein, textual, and single-cell information are unevenly available across a gene set.
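
Linear probing on frozen features, as described above, amounts to fitting a small linear classifier while the embeddings themselves never change. The sketch below is a plain logistic-regression probe written with the standard library only; the embeddings and labels are synthetic toy values, and the function names and hyperparameters are illustrative assumptions, since no RepGene weights or code were located.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_linear_probe(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent; features stay fixed."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for i, v in enumerate(x):
                grad_w[i] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 if the linear score is positive, else 0."""
    return int(sum(w * v for w, v in zip(weights, x)) + bias > 0.0)


# Synthetic stand-ins for frozen gene embeddings and a binary gene property.
frozen_embeddings = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
gene_labels = [1, 1, 0, 0]

weights, bias = fit_linear_probe(frozen_embeddings, gene_labels)
predictions = [predict(weights, bias, x) for x in frozen_embeddings]
print(predictions)
```

Only `weights` and `bias` are learned here, which is why a probe of this kind is cheap to fit and why its accuracy is usually read as a measure of what the frozen representation already encodes.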

#Impact

RepGene contributes to the growing line of work on unified, modality-agnostic biological representations, and its emphasis on graceful degradation under missing views is a useful framing for real-world integration. Its near-term reach is limited: the authors position it explicitly as a feasibility study and starting point rather than a resolved method, no weights or code were found, and the results await peer review, stronger baselines, broader benchmarks, and leakage-aware validation. As a June 2026 preprint, its claims should be treated as preliminary.

Citation

RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views

Hou, H., et al. (2026) RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views. openRxiv.

DOI: 10.64898/2026.06.11.731512

Citations

Total Citations: 0
Influential: 0
References: 28

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 22 (Closed)
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 14
Model Openness Framework: Unclassified (missing required components)

Tags

representation_learning, gene_representation, multimodal_fusion, transformer, autoencoder, self_supervised, multimodal, genomics, single_cell
