skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on July 15, 2026

Title: How large is the universe of RNA-like motifs? A clustering analysis of RNA graph motifs using topological descriptors
Identifying novel and functional RNA structures remains a significant challenge in RNA motif design and is crucial for developing RNA-based therapeutics. Here we introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% (121 dual graphs) correspond to approximately 200,000 known RNA atomic fragments/substructures (collected in 2021) using the RNA-as-Graphs (RAG) framework. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters. The cluster with the higher percentage of known dual graphs for RNA is defined as the “RNA-like cluster, while the other is considered as “non-RNA-like. The distance between each dual graph and the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. From validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Among the top 15 graphs identified as high-likelihood candidates for novel RNA motifs, 4 were confirmed from the RNA dataset collected in 2022. Significantly, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs (subgraphs preserve pseudoknots and junctions). Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features (e.g., Betti-0 and Betti-1 numbers). These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, perhaps anti-viral therapeutics by subgraph assembly.  more » « less
Award ID(s):
2151777 2330628
PAR ID:
10652721
Author(s) / Creator(s):
;
Editor(s):
Shao, Mingfu
Publisher / Repository:
PLOS
Date Published:
Journal Name:
PLOS Computational Biology
Volume:
21
Issue:
7
ISSN:
1553-7358
Page Range / eLocation ID:
e1013230
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. RNA motif classification is important for understanding structure/function connections and building phylogenetic relationships. Using our coarse-grained RNA-As-Graphs (RAG) representations, we identify recurrent dual graph motifs in experimentally solved RNA structures based on an improved search algorithm that finds and ranks independent RNA substructures. Our expanded list of 183 existing dual graph motifs reveals five common motifs found in transfer RNA, riboswitch, and ribosomal 5S RNA components. Moreover, we identify three motifs for available viral frameshifting RNA elements, suggesting a correlation between viral structural complexity and frameshifting efficiency. We further partition the RNA substructures into 1844 distinct submotifs, with pseudoknots and junctions retained intact. Common modules are internal loops and three-way junctions, and three submotifs are associated with riboswitches that bind nucleotides, ions, and signaling molecules. Together, our library of existing RNA motifs and submotifs adds to the growing universe of RNA modules, and provides a resource of structures and substructures for novel RNA design. 
    more » « less
  2. null (Ed.)
    We introduce Tiered Sampling , a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M , which can be magnitudes smaller than the number of edges. Our methods address the challenging task of counting sparse motifs—sub-graph patterns—that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the designing and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrence of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analytical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs. 
    more » « less
  3. null (Ed.)
    Nucleic acid nanostructures with different chemical compositions have shown utility in biological applications as they provide additional assembly parameters and enhanced stability. The naturally occurring 2′-5′ linkage in RNA is thought to be a prebiotic analogue and has potential use in antisense therapeutics. Here, we report the first instance of DNA/RNA motifs containing 2′-5′ linkages. We synthesized and incorporated RNA strands with 2′-5′ linkages into different DNA motifs with varying number of branch points (a duplex, four arm junction, double crossover motif and tensegrity triangle motif). Using experimental characterization and molecular dynamics simulations, we show that hybrid DNA/RNA nanostructures can accommodate interspersed 2′-5′ linkages with relatively minor effect on the formation of these structures. Further, the modified nanostructures showed improved resistance to ribonuclease cleavage, indicating their potential use in the construction of robust drug delivery vehicles with prolonged stability in physiological conditions. 
    more » « less
  4. Pattern counting in graphs is fundamental to several network sci- ence tasks, and there is an abundance of scalable methods for estimating counts of small patterns, often called motifs, in large graphs. However, modern graph datasets now contain richer structure, and incorporating temporal information in particular has become a key part of network analysis. Consequently, temporal motifs, which are generalizations of small subgraph patterns that incorporate temporal ordering on edges, are an emerging part of the network analysis toolbox. However, there are no algorithms for fast estimation of temporal motifs counts; moreover, we show that even counting simple temporal star motifs is NP-complete. Thus, there is a need for fast and approximate algorithms. Here, we present the first frequency estimation algorithms for counting temporal motifs. More specifically, we develop a sampling framework that sits as a layer on top of existing exact counting algorithms and enables fast and accurate memory-efficient estimates of temporal motif counts. Our results show that we can achieve one to two orders of magnitude speedups over existing algorithms with minimal and controllable loss in accuracy on a number of datasets. 
    more » « less
  5. Recent advances in high-resolution connectomics provide researchers access to accurate reconstructions of vast neuronal circuits and brain networks for the first time. Neuroscientists anticipate analyzing these networks to gain a better understanding of information processing in the brain. In particular, scientists are interested in identifying specific network motifs, i.e., repeating subgraphs of the larger brain network that are believed to be neuronal building blocks. To analyze these motifs, it is crucial to review instances of a motif in the brain network and then map the graph structure to the detailed 3D reconstructions of the involved neurons and synapses. We present Vimo, an interactive visual approach to analyze neuronal motifs and motif chains in large brain networks. Experts can sketch network motifs intuitively in a visual interface and specify structural properties of the involved neurons and synapses to query large connectomics datasets. Motif instances (MIs) can be explored in high-resolution 3D renderings of the involved neurons and synapses. To reduce visual clutter and simplify the analysis of MIs, we designed a continuous focus&context metaphor inspired by continuous visual abstractions [MAAB∗18] that allows the user to transition from the highly-detailed rendering of the anatomical structure to views that emphasize the underlying motif structure and synaptic connectivity. Furthermore, Vimo supports the identification of motif chains where a motif is used repeatedly to form a longer synaptic chain. We evaluate Vimo in a user study with seven domain experts and an in-depth case study on motifs in the central complex (CX) of the fruit fly brain. 
    more » « less