bio.rodeo

RNA

RfamGen

Kyoto University / Waseda University

A VAE-based generative model that designs novel functional RNA sequences by encoding multiple sequence alignment (MSA) and consensus secondary structure constraints from Rfam families.

Released: 2024

Overview

RfamGen is a deep generative model for the data-efficient design of functional RNA family sequences. Developed by researchers at Kyoto University and Waseda University, the model addresses a core challenge in RNA design: most generative approaches treat sequences as simple strings and discard the structural and evolutionary context encoded in multiple sequence alignments (MSAs). RfamGen instead builds those constraints directly into its architecture by grounding the generative process in covariance models (CMs), probabilistic representations that jointly capture sequence conservation and consensus secondary structure across RNA families.

The model frames RNA design as a learned sampling problem over a continuous latent space. By encoding alignment features derived from CMs into a Variational Autoencoder (VAE), RfamGen learns a semantically structured representation in which nearby points in latent space correspond to sequences with similar functional and structural properties. Novel sequences are generated by sampling from this latent space and decoding through the CM, which constrains outputs to respect the base-pairing and conservation patterns that define a given RNA family.
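The idea of decoding under base-pairing constraints can be illustrated with a toy sampler. This is a minimal sketch, not RfamGen's decoder: the emission tables `SINGLE` and `PAIRED`, the function `sample_sequence`, and the hairpin layout are all hypothetical, and the probabilities are made-up numbers. In the setting described above, such probabilities would presumably be produced by the decoder from a latent vector rather than fixed as constants.

```python
import random

# Toy emission tables (hypothetical numbers, not taken from any real CM).
SINGLE = {"A": 0.4, "C": 0.1, "G": 0.2, "U": 0.3}
PAIRED = {
    ("G", "C"): 0.35, ("C", "G"): 0.35,
    ("A", "U"): 0.10, ("U", "A"): 0.10,
    ("G", "U"): 0.05, ("U", "G"): 0.05,
}


def sample_sequence(length, pairs, rng):
    """Sample one sequence; the two positions of each pair are emitted jointly."""
    seq = [None] * length
    # Paired positions: draw both nucleotides from the joint pair table,
    # so the sampled sequence cannot violate the consensus base pairs.
    for i, j in pairs:
        left, right = rng.choices(list(PAIRED), weights=list(PAIRED.values()))[0]
        seq[i], seq[j] = left, right
    # Unpaired positions: draw each nucleotide independently.
    for k in range(length):
        if seq[k] is None:
            seq[k] = rng.choices(list(SINGLE), weights=list(SINGLE.values()))[0]
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# A 10-nt hairpin: positions 0-2 pair with 9-7, positions 3-6 form the loop.
hairpin_pairs = [(0, 9), (1, 8), (2, 7)]
designed = sample_sequence(10, hairpin_pairs, rng)
```

Because paired positions are drawn jointly, every sampled sequence is compatible with the stem by construction, which is the property a post-hoc filter would otherwise have to enforce.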

RfamGen was validated across 18 diverse RNA families drawn from the Rfam database, each with alignments of at least 10,000 sequences. In a key experimental test, RfamGen-designed ribozyme sequences showed measurable enzymatic activity in quantitative massively parallel assays, while randomly sampled sequences from the same families did not. This direct functional validation distinguishes RfamGen from models evaluated solely on sequence statistics.

Key Features

  • Covariance model VAE: Integrates CM-based encoding directly into the VAE architecture, embedding both sequence conservation and RNA secondary structure constraints into the generative process rather than treating them as post-hoc filters.
  • Structure-aware latent space: The learned latent representation captures functionally relevant sequence variation, enabling smooth interpolation and targeted sampling toward sequences with desired properties.
  • Data-efficient design: Explicit use of MSA and secondary structure information allows the model to learn from fewer examples than purely sequence-based generative models, making it viable for RNA families with limited characterized members.
  • Experimentally validated functionality: Ribozyme sequences generated by RfamGen exhibit catalytic activity measured through quantitative massively parallel assays, providing direct wet-lab evidence of biological utility.
  • Broad RNA family coverage: Demonstrated on 18 distinct RNA families spanning diverse structural classes, including ribozymes, riboswitches, and structural RNAs.
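The "smooth interpolation" mentioned in the feature list can be sketched as a straight-line walk between two latent vectors. This is an illustrative example only: the function name `interpolate`, the two-dimensional vectors, and the choice of linear (rather than, say, spherical) interpolation are assumptions, not details taken from the RfamGen code. Each point on the path would then be passed to a decoder to obtain a sequence.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for s in range(steps):
        t = s / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical 2-D latent codes of two known family members.
path = interpolate([0.0, 2.0], [1.0, 0.0], steps=5)
```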

Technical Details

RfamGen employs a VAE framework in which both the encoder and decoder are built around covariance models. A CM is a profile stochastic context-free grammar: it extends the idea of a profile hidden Markov model with states that emit base pairs, arranged on a tree that follows the consensus fold, so that paired and unpaired positions are modeled explicitly. This representation allows RfamGen to formalize an MSA under the constraint of a known secondary structure, capturing co-evolutionary signals between paired bases that are essential for functional RNA design.
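To make the notion of paired and unpaired consensus positions concrete, the sketch below parses a structure string into base-pair indices and checks whether a sequence is compatible with it. It is a simplification under stated assumptions: it accepts only plain dot-bracket notation with `(`, `)` and `.`, whereas Rfam alignments annotate consensus structure in the richer WUSS notation, and it treats Watson-Crick and G-U wobble pairs as the only allowed pairs. The names `parse_pairs` and `respects_structure` are hypothetical helpers, not part of any library.

```python
# Pairs treated as compatible: Watson-Crick plus G-U wobble.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(dot_bracket):
    """Return sorted (i, j) index pairs for a nested dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported symbol {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def respects_structure(seq, dot_bracket):
    """True if every paired position in the structure holds a compatible pair."""
    if len(seq) != len(dot_bracket):
        return False
    return all((seq[i], seq[j]) in CANONICAL for i, j in parse_pairs(dot_bracket))


pairs = parse_pairs("((..))")                      # a 2-bp stem with a 2-nt loop
ok = respects_structure("GCAAGC", "((..))")        # G-C and C-G close the stem
broken = respects_structure("GCAAGA", "((..))")    # G-A cannot pair
```

A check like this operates after generation; the point of building the CM into the model is that the same constraint is part of the generative process itself.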

During training, the model vectorizes alignment features derived from a target RNA family's CM and learns to map these features to a continuous Gaussian latent space using the VAE objective. Sampling from the prior and decoding through the CM-based decoder generates novel paths through the model, which is equivalent to sampling aligned sequences that respect the structural grammar of the RNA family. Training data is drawn entirely from Rfam, a curated database of non-coding RNA families that provides MSAs, consensus secondary structures, and covariance models.
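For readers unfamiliar with the VAE objective, the following sketch shows its two standard ingredients for a diagonal Gaussian posterior and a standard normal prior: the reparameterized sample and the closed-form KL term. It assumes the textbook formulation; the exact loss used by RfamGen, including any weighting, is not stated here, so the `beta` parameter and the function names are hypothetical. The reconstruction term is passed in as a number because it depends on the decoder.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def negative_elbo(reconstruction_nll, mu, log_var, beta=1.0):
    """Loss to minimize: reconstruction NLL plus (optionally weighted) KL."""
    return reconstruction_nll + beta * kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
z = reparameterize([0.0, 0.0], [0.0, 0.0], rng)    # a draw from N(0, I)
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
loss = negative_elbo(2.0, [1.0], [0.0])
```

The KL term is zero when the posterior equals the prior and grows as the encoder moves the mean or variance away from it, which is what keeps the latent space close to the Gaussian that is later sampled from.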

Evaluation against baseline generative approaches across 18 RNA families showed that RfamGen consistently produced sequences that more closely recapitulated the statistical properties of natural family members. The ribozyme design experiment provided functional confirmation that the latent space encodes biologically meaningful variation beyond sequence-level statistics.

Applications

RfamGen is primarily applicable to RNA synthetic biology and engineering tasks where functional sequence design is the goal. Researchers working on ribozyme engineering can use the model to generate candidate catalytic RNA sequences with predictable structural characteristics, reducing the combinatorial search space before experimental screening. The model is relevant to RNA-based therapeutics development, where novel non-coding RNA sequences with specific structural scaffolds are needed. More broadly, RfamGen provides a framework for exploring the sequence space of any Rfam-catalogued RNA family in a structurally guided way, supporting evolutionary studies of sequence-structure-function relationships and the development of RNA aptamers or regulatory elements for synthetic genetic circuits.

Impact

RfamGen was published in Nature Methods in 2024 and represents one of the first generative models for RNA design to achieve direct experimental functional validation at scale. By demonstrating that a structure-aware VAE can produce ribozymes with genuine catalytic activity, the work establishes a benchmark for what constitutes meaningful success in computational RNA design, shifting evaluation from sequence statistics toward functional assays. The covariance model integration is a methodologically distinct contribution that may influence future generative approaches to structured RNA families. A notable limitation is that RfamGen requires a well-characterized Rfam family with a high-quality CM and a sufficient number of aligned sequences, which constrains its immediate applicability to RNA families that lack deep Rfam coverage or for which no consensus secondary structure exists.

Citation

Deep generative design of RNA family sequences

Sumi, S., Hamada, M. & Saito, H. Deep generative design of RNA family sequences. Nat Methods 21, 435–443 (2024).

DOI: 10.1038/s41592-023-02148-8

Metrics

GitHub

Stars: 40
Forks: 13
Open Issues: 2
Contributors: 1
Last Push: 16d ago
Language: Jupyter Notebook

Citations

Total Citations: 50
Influential: 3
References: 64

Tags

sequence design, variational autoencoder, foundation model

Resources

GitHub Repository
Research Paper