Kyoto University / Waseda University
A VAE-based generative model that designs novel functional RNA sequences by encoding MSA and consensus secondary structure constraints from Rfam families.
RfamGen is a deep generative model for the data-efficient design of functional RNA family sequences. Developed by researchers at Kyoto University and Waseda University, the model addresses a core challenge in RNA design: most generative approaches treat sequences as simple strings and discard the structural and evolutionary context encoded in multiple sequence alignments (MSAs). RfamGen instead builds those constraints directly into its architecture by grounding the generative process in covariance models (CMs), probabilistic representations that jointly capture sequence conservation and consensus secondary structure across RNA families.
The model frames RNA design as a learned sampling problem over a continuous latent space. By encoding alignment features derived from CMs into a Variational Autoencoder (VAE), RfamGen learns a semantically structured representation in which nearby points in latent space correspond to sequences with similar functional and structural properties. Novel sequences are generated by sampling from this latent space and decoding through the CM, which constrains outputs to respect the base-pairing and conservation patterns that define a given RNA family.
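The idea of a continuous latent space can be illustrated with a minimal sketch, assuming a standard normal prior and plain linear interpolation between two latent points. The function names `sample_prior` and `interpolate` and the latent dimension of 4 are hypothetical and chosen only for illustration; they are not taken from the RfamGen code, and the decoding step through the CM is omitted here.

```python
import random


def sample_prior(dim, rng):
    """Draw one latent vector from a standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced points on the line from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


# Hypothetical usage: two prior samples and a 5-point path between them.
rng = random.Random(0)
z_a = sample_prior(4, rng)
z_b = sample_prior(4, rng)
path = interpolate(z_a, z_b, steps=5)
```

In a trained model, each point on such a path would be passed to the decoder. If the latent space is structured as described, neighbouring points should decode to sequences with similar properties.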
RfamGen was validated across 18 diverse RNA families drawn from the Rfam database, each with alignments of at least 10,000 sequences. In a key experimental test, RfamGen-designed ribozyme sequences demonstrated measurable enzymatic activity as assayed by quantitative massively parallel assays, while randomly sampled sequences from the same families did not. This direct functional validation distinguishes RfamGen from models evaluated solely on sequence statistics.
RfamGen employs a VAE framework in which both the encoder and decoder are built around covariance models. A CM is a tree-structured probabilistic model that extends a profile hidden Markov model with RNA secondary structure, so that it represents both the paired and the unpaired positions of an RNA consensus fold. This representation allows RfamGen to formalize an MSA under the constraint of a known secondary structure, capturing co-evolutionary signals between paired bases that are essential for functional RNA design.
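The distinction between paired and unpaired positions can be made concrete with a toy sampler. This is a deliberately simplified sketch, not a covariance model: it assumes a dot-bracket consensus structure, draws the two bases of each pair jointly from a uniform distribution over canonical pairs, and draws unpaired bases uniformly. A real CM learns position-specific emission probabilities and also models insertions and deletions. All names here (`pair_table`, `sample_sequence`, `CANONICAL_PAIRS`) are hypothetical.

```python
import random

# Assumption: only Watson-Crick and G-U wobble pairs are allowed at paired positions.
CANONICAL_PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]
BASES = "ACGU"


def pair_table(structure):
    """Map the index of each '(' to the index of its matching ')' in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs[stack.pop()] = i
        elif ch != ".":
            raise ValueError(f"unexpected character: {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def sample_sequence(structure, rng):
    """Sample a sequence whose paired positions are emitted jointly, unpaired ones independently."""
    seq = [None] * len(structure)
    for i, j in pair_table(structure).items():
        left, right = rng.choice(CANONICAL_PAIRS)  # joint emission for the pair (i, j)
        seq[i], seq[j] = left, right
    for k, ch in enumerate(structure):
        if ch == ".":
            seq[k] = rng.choice(BASES)  # independent emission for an unpaired position
    return "".join(seq)


# Hypothetical usage: a small hairpin with a 4-base-pair stem and a 3-base loop.
structure = "((((...))))"
seq = sample_sequence(structure, random.Random(1))
```

Emitting both bases of a pair in a single draw is what lets a model of this kind represent covariation: the identity of one base constrains its partner, which independent per-position sampling cannot express.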
During training, the model vectorizes alignment features derived from a target RNA family's CM and learns to map these features to a continuous Gaussian latent space using the VAE objective. Sampling from the prior and decoding through the CM-based decoder generates novel paths through the model, which is equivalent to sampling aligned sequences that respect the structural grammar of the RNA family. Training data is drawn entirely from Rfam, a curated database of non-coding RNA families that provides MSAs, consensus secondary structures, and covariance models.
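For readers unfamiliar with the VAE objective, the following is a minimal sketch of its two generic ingredients, assuming a diagonal Gaussian posterior and a standard normal prior: the reparameterized sample and the closed-form KL term. The function names and the `beta` weight are assumptions made for illustration; the exact loss and weighting used by RfamGen are not specified in this summary.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def negative_elbo(recon_nll, mu, logvar, beta=1.0):
    """Reconstruction negative log-likelihood plus the KL term.

    `beta` is a hypothetical weighting knob; beta=1.0 gives the standard VAE loss.
    """
    return recon_nll + beta * kl_to_standard_normal(mu, logvar)


# Hypothetical usage with a 2-dimensional latent and a made-up reconstruction loss.
rng = random.Random(0)
mu, logvar = [1.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, logvar, rng)
loss = negative_elbo(recon_nll=2.0, mu=mu, logvar=logvar)
```

The KL term pulls the encoder's posterior toward the prior, which is what makes sampling from the prior at generation time meaningful. In RfamGen's setting, the reconstruction term would be computed over the CM-derived features rather than over raw sequence characters.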
Evaluation against baseline generative approaches across 18 RNA families showed that RfamGen consistently produced sequences that more closely recapitulate the statistical properties of natural family members. The ribozyme design experiment then provided functional confirmation that the latent space encodes biologically meaningful variation beyond sequence-level statistics.
RfamGen is primarily applicable to RNA synthetic biology and engineering tasks where functional sequence design is the goal. Researchers working on ribozyme engineering can use the model to generate candidate catalytic RNA sequences with predictable structural characteristics, reducing the combinatorial search space before experimental screening. The model is also relevant to the development of RNA-based therapeutics, where novel non-coding RNA sequences with specific structural scaffolds are needed. More broadly, RfamGen provides a framework for exploring the sequence space of any Rfam-catalogued RNA family in a structurally guided way, supporting evolutionary studies of sequence-structure-function relationships and the development of RNA aptamers or regulatory elements for synthetic genetic circuits.
RfamGen was published in Nature Methods in 2024 and represents one of the first generative models for RNA design to achieve direct experimental functional validation at scale. By demonstrating that a structure-aware VAE can produce ribozymes with genuine catalytic activity, the work establishes a benchmark for what constitutes meaningful success in computational RNA design, shifting evaluation from sequence statistics toward functional assays. The covariance model integration is a methodologically distinct contribution that may influence future generative approaches to structured RNA families. A notable limitation is that RfamGen requires a well-characterized Rfam family with a high-quality CM and a sufficient number of aligned sequences, which constrains its immediate applicability to RNA families that lack deep Rfam coverage or for which no consensus secondary structure exists.
Sumi, S., Hamada, M. & Saito, H. Deep generative design of RNA family sequences. Nat Methods 21, 435–443 (2024).
DOI: 10.1038/s41592-023-02148-8