<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Scalable and Interpretable Identification of Minimal Undesignable RNA Structure Motifs with Rotational Invariance</title></titleStmt>
			<publicationStmt>
				<publisher>SPRINGER</publisher>
				<date>04/25/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10673295</idno>
					<idno type="doi"></idno>
					
					<author>Tianshuo Zhou</author><author>Apoorv Malik</author><author>Wei Yu Tang</author><author>David H Mathews</author><author>Liang Huang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[RNA design aims to find a sequence that folds with the highest probability into a designated target structure. However, certain structures are undesignable, meaning no sequence can fold into the target structure under the default (Turner) RNA folding model. Understanding the specific local structures (i.e., “motifs”) that contribute to undesignability is crucial for refining RNA folding models and determining the limits of RNA designability. Despite its importance, this problem has received very little attention, and previous efforts are neither scalable nor interpretable.We develop a new theoretical framework for motif (un-)designability, and design scalable and interpretable algorithms to identify minimal undesignable motifs within a given RNA secondary structure. Our approach establishes motif undesignability by searching for rival motifs, rather than exhaustively enumerating all (partial) sequences that could potentially fold into the motif. Furthermore, we exploit rotational invariance in RNA structures to detect, group and reuse equivalent motifs and to construct a database of unique minimal undesignable motifs. To achieve that, we propose a loop-pair graph representation for motifs and a recursive graph isomorphism algorithm for motif equivalence.Our algorithms successfully identify 24 unique minimal undesignable motifs among 18 undesignable puzzles from the Eterna100 benchmark. Surprisingly, we also find over 350 unique minimal undesignable motifs and 663 undesignable native structures in the ArchiveII dataset, drawn from a diverse set of RNA families.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>1 Introduction 5! 3! 5! 3!</p><p>Fig. <ref type="figure">1</ref>: Going beyond structure undesignability in our previous work <ref type="bibr">[35]</ref>, this work finds minimal undesignable motifs in a given RNA structure (two such motifs shown here in Eterna100 puzzle #52). Boundary and internal pairs are in orange and blue, resp.</p><p>RNA design <ref type="bibr">[34,</ref><ref type="bibr">4,</ref><ref type="bibr">22,</ref><ref type="bibr">12,</ref><ref type="bibr">32]</ref> focuses on identifying one or more RNA sequences capable of folding into a target secondary structure, holding significant promise for applications in RNA-based therapeutics and diagnostics <ref type="bibr">[33]</ref>.</p><p>The significance and complexity of RNA design have garnered widespread attention. Numerous methods, including SAMFEO <ref type="bibr">[34]</ref>, NEMO <ref type="bibr">[22]</ref>, RNAiFold <ref type="bibr">[12]</ref>, NUPACK <ref type="bibr">[32]</ref>, and others <ref type="bibr">[4]</ref>, have been developed to generate sequences based on a given target structure. Despite substantial improvements in empirical design quality, it has been noted that certain puzzles in the widely-used benchmark Eterna100 <ref type="bibr">[2]</ref> are considered undesignable, having never been successfully solved <ref type="bibr">[17]</ref>. However, limited research <ref type="bibr">[1]</ref> has been dedicated to exploring the undesignability of RNA structures. Recently, We introduced RIGEND <ref type="bibr">[35]</ref> to identify undesignable RNA structures in a more general manner by pinpointing rival structures, resulting in the identification of 16 puzzles in Eterna100 deemed undesignable by the unique minimum free energy (MFE) criterion under the standard Turner RNA folding model <ref type="bibr">[21,</ref><ref type="bibr">26]</ref> implemented in ViennaRNA <ref type="bibr">[20]</ref> .</p><p>While e!ective, RIGEND has some limitations in terms of explainability and scalability. While rival structures, which consistently exhibit better folding energy compared to the target structure, can serve as compelling evidence for an undesignable structure, their interpretability is often limited to the entire structure. In practice, the undesignability of an RNA structure typically arises from some specific local region, or structure motif. Identifying these critical motif(s) in a structure could o!er deeper insight into why certain structures resist design, enhancing both interpretability and potential reusability in RNA Design. Table <ref type="table">1</ref>: Critical positions of some loops in Fig. <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5' 3'</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multiloop</head><p>Loop Critical Positions Hairpin&#8594; <ref type="bibr">(12,</ref><ref type="bibr">18)</ref>&#8593; <ref type="bibr">(12,</ref><ref type="bibr">18)</ref>, 13, 17 Bulge&#8594; <ref type="bibr">(10,</ref><ref type="bibr">23)</ref>, <ref type="bibr">(11,</ref><ref type="bibr">19)</ref>&#8593; <ref type="bibr">(10,</ref><ref type="bibr">23)</ref>, <ref type="bibr">(11,</ref><ref type="bibr">19</ref>) Stack&#8594; <ref type="bibr">(3,</ref><ref type="bibr">50)</ref>, <ref type="bibr">(4,</ref><ref type="bibr">49)</ref>&#8593; <ref type="bibr">(3,</ref><ref type="bibr">50)</ref>, (4, 49) Internal&#8594; <ref type="bibr">(29,</ref><ref type="bibr">43)</ref>, <ref type="bibr">(32,</ref><ref type="bibr">39</ref>)&#8593; <ref type="bibr">(29,</ref><ref type="bibr">43)</ref>, <ref type="bibr">(32,</ref><ref type="bibr">39)</ref>, <ref type="bibr">30,</ref><ref type="bibr">42,</ref><ref type="bibr">31,</ref><ref type="bibr">40</ref> External&#8594;(3, 50)&#8593; (3, 50), 2, 51 . . . . . . P indicates two distinct bases x i x j &#8594; {CG, GC, AU, UA, GU, UG} and each index from 1 to n can only be paired once. A secondary structure is pseudoknot-free if there are no two pairs (i, j) &#8594; P and (k, l) &#8594; P such that i &lt; k &lt; j &lt; l. Alternatively, P can be represented as a string y = y 1 y 2 . . . y n , where a pair of indices (i, j) &#8594; P corresponds to y i = "(", y j = ")" and any unpaired index k corresponds to y k = ".". The unpaired indices in y are denoted as unpaired (y) and the set of paired indices in y is denoted as pairs(y), which is equal to P. While some RNA structures in nature contain pseudoknots, we do not consider them in this work as the computational model we use does not allow these. The ensemble of an RNA sequence x is the set of all secondary structures that x can possibly fold into, denoted as Y(x). The free energy #G &#8594; (x, y) is used to characterize the stability of y &#8594; Y(x). The lower the free energy #G &#8594; (x, y), the more stable the secondary structure y for x. In the nearest neighbor energy model <ref type="bibr">[26]</ref>, a secondary structure is decomposed into a collection of loops, where each loop is usually a region enclosed by some base pair(s). Depending on the number of pairs on the boundary, the main types of loops include hairpin loop, internal loop and multiloop, which are bounded by 1, 2 and 3 or more base pairs, respectively. In particular, the external loop is the most outside loop and is bounded by two ends (5 &#8593; and 3 &#8593; ) and other base pair(s). Thus each loop can be identified by a set of pairs. Fig. <ref type="figure">2</ref> showcases an example of secondary structure with loops, and some of the loops are listed in Table <ref type="table">1</ref>.</p><p>The function loops(y) is used to denote the set of loops in a structure y. The free energy of a secondary structure y is the sum of the free energy of each loop, #G &#8594; (x, y) = z&#8595;loops(y) #G &#8594; (x, z), where each term #G &#8594; (x, z) is the energy for one specific loop in loops(y). See Supplementary Section G for detailed energy functions for types of loops in the Turner model <ref type="bibr">[26]</ref>. The energy of each loop is typically determined by the identity of nucleotides on the positions of enclosing pairs and their adjacent mismatch positions, which are named as critical positions in this article. Table <ref type="table">SI</ref> 1 lists the critical positions for all the loops in Fig. <ref type="figure">2</ref> and Table SI 2 shows the indices of critical positions for each type of loops. Additionally, some special hairpins <ref type="bibr">[21]</ref> of unstable triloops and stable tetraloops and hexaloops in Turner model have a separate energy lookup table (See Supplementary Section G.2). When evaluating the energy of a loop, it su"ces to input only the nucleotides on its critical positions, i.e.,</p><p>where critical (z) denotes the critical positions of loop z and x &#8593; critical (z) denotes the nucleotides from x that are "projected" onto critical (z). See Supplementary Section B for the detailed functionality of projection operator. The projection (&#8593;) allows us to focus on the relevant nucleotides for energy evaluation. Table <ref type="table">1</ref> list the critical positions for the example loops.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Undesignability by the Unique MFE Criterion</head><p>The structure with the minimum free energy is the most stable structure in the ensemble. A structure y &#969; is an MFE structure of x if and only if</p><p>RNA design is the inverse problem of RNA folding. Given a target structure y &#969; , RNA design aims to find suitable RNA sequence x such that y &#969; is an MFE structure of x. Here we follow a more strict definition of MFE criterion adopted in some previous studies <ref type="bibr">[6,</ref><ref type="bibr">14,</ref><ref type="bibr">31,</ref><ref type="bibr">27,</ref><ref type="bibr">34]</ref> on the designability of RNA, i.e., x is a correct design if and only if y is the only MFE structure of x, which we call unique MFE (uMFE) criterion to di!erentiate it from the traditional MFE criterion. Formally, uMFE(x) = y &#969; if and only if &#8595;y &#8594; Y(x) and y &#8596; = y &#969; , #G &#8594; (x, y &#969; ) &lt; #G &#8594; (x, y).</p><p>(3) Following previous work <ref type="bibr">[14,</ref><ref type="bibr">35]</ref> on undesignability, we define the undesignability based the uMFE criterion, i.e., the condition in Eq.3 can not be satisfied for any RNA sequence x given a target structure y &#969; . Alternatively, we give the formal definition of undesignability as follows.</p><p>Definition 1. An RNA secondary structure y &#969; is undesignable by uMFE criterion if and only if &#8595;x, &#8600;y</p><p>3 Motif and its Undesignability</p><p>Recent work, RIGEND <ref type="bibr">[35]</ref>, has demonstrated that some structures (puzzles) in Eterna100 are undesignable. For instance, the puzzle "[RNA] Repetitive Seqs. 8/10" in Fig. <ref type="figure">1</ref> is proven undesignable because there exists a rival structures always better a lower free energy than the target structure . However, this level of explainability remains at the global structure level rather than more specific local regions. Ideally, a solution should not only verify that a structure is undesignable but also pinpoint local regions as small as possible within structures that causes the undesignability. We refer to these regions as undesignable motifs.</p><p>Smaller undesignable motifs are easier to be resused to detect undesignability in structures. Thus, the goal is to identify minimal undesignable motifs from structures. For example, our proposed algorithm identified two minimal undesignable motifs in the puzzle "[RNA] Repetitive Seqs. 8/10", highlighted in Fig. <ref type="figure">1</ref>. Previous e!orts <ref type="bibr">[31,</ref><ref type="bibr">30]</ref> of exhaustive search fails at them due to scalability and deficit of motif definition.</p><p>We argue that the most e!ective way to define a motif is by focusing on the composition of loops, aligning with how RNA structures are composed of di!erent loop types (Section 2). In this way, a motif is regarded a full generalization of RNA structure. Our formal definition of motifs is introduced in the following subsection.  Many functions defined for secondary structures can also be applied to motifs. For example, loops(m) represents the set of loops within a motif m, while pairs(m) and unpaired (m) represent the sets of base pairs and unpaired positions, respectively. We define the cardinality of m as the number of loops in m, i.e., card (m) = |loops(m)|. Fig. <ref type="figure">3</ref> illustrates three motifs, m 1 , m 2 , and m 3 , in a structure adapted from the Eterna puzzle "Cat's Toy". These motifs contain 1, 2, and 3 loops, respectively. We also define the length of a motif |m| as the number of bases it contains, which is consistent with the length of a secondary structure |y|.</p><p>Since motifs are defined as sets of loops, we can conveniently use set relations to describe their interactions. A motif m A is a sub-motif of another motif m B if m A is contained within m B , denoted as m A &#8771; m B . For the motifs in Fig. <ref type="figure">3</ref>, we observe the relation</p><p>The entire structure y can be regarded as the largest motif within itself, and accordingly, m &#8771; y signifies that motif m is a part of structure y, with m &#8656; y implying m is strictly smaller than y.</p><p>The loops in a motif m are connected by base pairs. Each base pair in pairs(m) is classified as either an internal pair linking two loops in m or a boundary pair connecting one loop inside m to one outside. These two types of pairs in m are denoted as disjoint sets ipairs(m) and bpairs(m), respectively:</p><p>Utilizing the commonly accepted nearest neighbor model for RNA folding, it becomes evident that certain motifs may be absent from structures folded from RNA sequences. For instance, motif m 3 in Fig. <ref type="figure">3</ref> is considered undesignable, as the removal of its two internal pairs consistently reduces the free energy. This brings us to the definition of an undesignable motif.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Motif Ensemble from Constrained Folding</head><p>The designability of motifs is based on constrained folding. Given a sequence x, a structure in its ensemble y &#8594; Y(x), we can conduct constrained folding by constraining the loops outside m, i.e., c = y \ m. We generalize the concept of (structure) ensemble to motif ensemble as the set of motifs that x can possibly fold into given the constraint c, denoted as M(x, c). Motifs in M(x, y \ m) have the same boundary pairs, i.e.,</p><p>The free energy of a motif m is the sum of the free energy of the loops in m,</p><p>The definitions of MFE and uMFE can also be generalized to motifs via constrained folding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Undesignability of Motif</head><p>The designability and undesignability of motifs by uMFE criterion can be defined based on Def. 4.</p><p>Similarly, we can establish the definition of designable motifs.</p><p>Moreover, if m is undesignable, any motif or structure containing m will be undesignable.</p><p>We can construct a motif m &#8593;&#8593; &#8596; = m by substituting the loops of m &#969; within m with m &#8593;&#8593; such that loops(m</p><p>Corollary 1. If a motif m &#969; is undesignable, then any structure y such that m &#969; &#8771; y is also undesignable.</p><p>Proof. The structure y is the largest motif in y. Thus the correctness of Corollary 1 follows Theorem 1.</p><p>Corollary 2. If a motif m is designable, then any sub-motif m sub &#8656; m is also designable. As a special case, if a structure y is designable, then any motif m within y, i.e., m &#8656; y is also designable.</p><p>Proof. This follows as the contrapositive of Theorem 1, thus completing the proof.</p><p>According to Theorem 1, we can access the undesignability of a big motif by inspecting its sub-motifs. Therefore, it is crucial and valuable to determine the minimality of an undesignable motif. We introduce the concept of minimal undesignable motif. Definition 7. A motif m is a minimal undesignable motif if and only if the two conditions both hold: (1) m is an undesignable motif, and (2) &#8595;m &#8593; &#8656; m, m &#8593; is designable.</p><p>By this definition, the motif m 3 in Fig. <ref type="figure">3</ref> is a minimal undesignable motif because all its proper submotifs are designable. Since the minimality is based on the concept of loops, one fundamental question is that what's the least number of loops a minimal undesignable motif can contain. Therefore, it is worthwhile to prove that any motif composed of a single loop is designable.</p><p>Theorem 2. If a motif m &#969; is composed of one loop, i.e., card (m &#969; ) = 1, then m &#969; is designable.</p><p>Proof is detailed in Supplementary Section F.2. It turns out that it is possible for a two-loop motif to be minimal undesignable (as illustrated in Table <ref type="table">3</ref>). The primary issue is how to identify undesignable motifs. A trivial solution is to enumerate all the (partial) sequences for the target motif m &#969; and check the when m &#969; is a uMFE motif after constrained folding. However, it is impractical for long motifs because of exponentially high time cost. See Supplementary Section D for a detailed discussion. The existing work CountingDesign <ref type="bibr">[31,</ref><ref type="bibr">30]</ref> is better than exhaustively enumerating sequences for each motif one by one, yet in essence it is still an exhaustive enumeration method. As a result, it cannot scale to long motifs and the found undesignable motifs lack interpretability. To provide scalability but also interpretability, we borrow the philosophy of rival structure from RIGEND <ref type="bibr">[35]</ref> and propose to utilize rival motif to establish the undesignability of motifs, which is discussed in the following Section 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Rival Motifs Identification</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Identify Single Rival Motif</head><p>It is possible that there is another motif m &#8593; which always has lower free energy than the target motif m &#969; , if we can find such a rival motif, m &#969; is undesignable. For instance, removing the internal pairs of the motif m 3 highlighted in Fig. <ref type="figure">3</ref> will yield a rival motif m &#8593; that is always energetically favored than m 3 , proving m 3 in Fig. <ref type="figure">3</ref> undesignable by the following theorem. Another example of single rival motif is shown in Fig. <ref type="figure">4b</ref>.</p><p>, then m &#969; is undesignable. The correctness of Theorem 3 follows Def. 5. According to RIGEND <ref type="bibr">[35]</ref>, the energy di!erence of two structures ##G &#8594; is totally decided by their di!erential positions, which can also be applied to two motifs. The condition in Theorem 3 can be written as</p><p>where</p><p>denotes the di!erential positions 1 , and</p><p>denotes the free energy di!erence between m and m &#969; . Therefore, we can verify a single rival motif by inspecting possible nucleotides on only these di!erential positions. Refer to the RIGEND <ref type="bibr">[35]</ref> for details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Identify Multiple Rival Motifs</head><p>For a rival motif m &#8593; satisfying Theorem 3, we have pairs(m &#8593; ) &#8656; pairs(m &#969; ) (proven in Supplementary Section E). However, if any motif m &#8593; satisfying pairs(m &#8593; ) &#8656; pairs(m &#969; ) is not a qualified rival motif, more rival motifs will be required as evidence for undesignability.</p><p>The right side of Fig. <ref type="figure">4</ref> shows an example of target motif with multiple rival motifs. Identifying a set of multiple rival motifs is more complicated than finding a single rival motif. We present a high-level Algorithm 1 to elucidate the fundamental procedures involved in identifying rival motifs, providing readers with an overview of the essential steps. In Algorithm 1 there are three parameters M, N, and K. They limit the number of di!erential positions, rival structures and sampled sequences, preventing the algorithm running forever. The overall complexity is O(NM + NK|m| 3 ). We omit the intricate details for conciseness, and encourage readers to consult the literature of RIGEND <ref type="bibr">[35]</ref> for a comprehensive description. Fig. <ref type="figure">4</ref>: Example of target motif and rival motif(s). The target motif 4a is from the structure in Fig. <ref type="figure">1</ref>, the target motif 4c is from Eterna100 puzzle "Mat -Elements &amp; Sections" as plotted in Table Fig. <ref type="figure">3</ref>.</p><p>if |M rival | &gt; N then return unkown &#969; early stop 5:</p><p>if m &#8594;&#8594; = uMFE(x, y \ m &#969; ) then return designable &#969; found successful uMFE design 8:</p><p>m &#8594;&#8594; &#8595; MFE(x) &#969; one of MFE structures in case of multiple 9:</p><p>if 4 |!(m &#8594; ,m &#969; )| &lt; M then M rival &#8595; M rival &#8593; {m &#8594;&#8594; } &#969; limit the size of di!erential positions 10: return undesignable 5 Rotational Invariance</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Invariance of Motif Energy</head><p>In the nearest-neighbor model, the free energy change of a loop is independent of its absolute position or specific orientation within the structure. As a result, the free energy of motifs adheres to both translational invariance and rotational invariance, though not to symmetrical invariance. Fig. <ref type="figure">5</ref> shows two groups of equivalent motifs found in the Eterna100 puzzles, demonstrating rotational invariance. The motifs m a , m b , and m c come from puzzles with IDs #60, #81, and #88, respectively, while m d and m e are from the puzzle with ID #86 (refer to Table <ref type="table">3</ref> for detailed structures with highlighted motifs). Among them, m a and m b can be rotated into each other, as can m d and m e . However, despite both containing three bulge loops, m b and m c cannot be rotated into one another.</p><p>To identify and represent such rotational equivalence, we propose a noval representation for motifs, referred to as loop-pair graph.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Loop-Pair Graph</head><p>Definition 8. A loop-pair graph for a pseudoknot-free RNA secondary structure y is a weighted undirected graph G(y) = &#8660;V (y), E(y)&#8598; where each node v &#8594; V is either a loop in y or a pair in y or the pseudo-pair node r (representing 5 &#8593; and 3 &#8593; ends rather than a base pair), (i.e., V (y) = loops(y) &#8657; pairs(y) &#8657; {r}) and each edge e = (u, v, w) &#8594; E(y) connects one loop node and one pair node, with the edge weight w denoting the number of unpaired bases of the segment of the loop between the connected pair and next pair according to the direction from 5 &#8593; to 3 &#8593; . For each loop node v, there is an ordered list N (v) of neighbor nodes.</p><p>The loop-pair graphs of the structure and motifs in Fig. <ref type="figure">3</ref> is shown in Fig. <ref type="figure">6c</ref>. We also show Fig. <ref type="figure">6a</ref> and Fig. <ref type="figure">6b</ref> for comparison. The advantages of loop-pair graphs include: 5 ' e5 e' e3 e' (a) Polymer graph B E S M H S M S H S H (b) Loop graph &#8733; B p E r p S p M p H p S p M p S p H p S p H 4 4 4 0 0 0 2 2 9 6 0 0 3 3 0 0 4 8 0 0 4 (c) Loop-pair graph &#969; rotate 5: return t 1. Bijective. A loop-pair graph contains all necessary information about loops, base pairs, and unpaired bases to fully reconstruct the original structure. In contrast, loop-graph in Fig. <ref type="figure">6b</ref> or RNA-as-graph representations <ref type="bibr">[16,</ref><ref type="bibr">11,</ref><ref type="bibr">10,</ref><ref type="bibr">36]</ref> cannot recover the original structure due to missing information about unpaired bases or helices, respectively. ignore helix stackings and can not recover the original structure. 2. Abstract. Loop-pair graphs emphasize the connections and topology of loops and base pairs within motifs, while abstracting away the finer details of the backbone structure. While other representations <ref type="bibr">[5,</ref><ref type="bibr">19,</ref><ref type="bibr">18]</ref> also provide abstraction of structures, they serve for di!erent applications. 3. Compact. The number of unpaired bases is encoded as edge weights, which makes the representation more space-e"cient than the polymer graph in Fig. <ref type="figure">6a</ref>.</p><p>RNA secondary structures are inherently recursive, making loop-pair graphs singly connected, essentially forming a tree. Any motif m &#8771; y corresponds to an induced subgraph of G(y), containing the loop nodes for each loop in loops(m), the pair nodes for each pair in pairs(m), and all edges connecting these nodes. To assess the uniqueness of motifs under rotational isomorphism, we recursively rotate the graph representation of a motif using a boundary pair node as a pivot, as described in Algorithm 2 (linear complexity).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Bottom-Up Scan of Motif Designability within RNA Structures</head><p>An ideal algorithm should be capable of identifying all minimal undesignable motifs within a given secondary structure. It is important to note that the sub-motifs of M always possess smaller cardinalities compared to M. Therefore, a straightforward approach involves enumerating all motifs in the structure according to their cardinality and determining whether each motif is designable or undesignable, as outlined in Algorithm 3. This algorithm assumes the existence of an oracle function, DECIDE(m), which returns whether m is designable or undesignable. While such an ideal function DECIDE(m) theoretically exists (exhaustive search Algorithm 3 Fully Bottom-Up Framework for Identifying Minimal Undesignable Motif</p><p>1: function BottomUpMotifDesignabilityScan(y) &#969; input a secondary structure y 2: M miniundesignable &#8595; &#8843; &#969; a set to store identified minimal undesignable motifs 3: for all non-singleton m &#8600; y in increasing order of card (m) do &#969; card (m) = 2, 3, . . . , |loops(y)| 4: if &#8771;m &#8594; &#8599; M miniundesignable and m &#8594; &#8656; m then continue &#969; undesignable but not minimal 5: designablity &#8595; Decide(m) &#969; either designable or undesignable 6: if designablity = undesignable then M miniundesignable &#8595; M miniundesignable &#8593; {m} 7: return M miniundesignable Algorithm 4 FastMotif (see Fig. 7 for an example of powerset) 1: global: M miniundesignable , M designable &#969; Global (minimal) undesignable and designable motifs 2: function FastMotif(y) &#969; Input is a structure 3: for all loop node u &#8599; G(y) do 4: for all non-empty s &#8599; 2 N (u) in increasing order of |s| do &#969; 2 N (u) : powerset of N (u);|s| = 1, 2, . . . , |N (u)| 5: m &#8595; {u} &#8593; s &#969; motif m has loop u and its neighbors in s, so card (m) &#8658; 2 6: if m &#8599; M miniundesignable then continue &#969; check every rotated version of m; already identified 7: else 8: if &#8771;m &#8594; &#8599; M miniundesignable and m &#8594; &#8656; m then continue &#969; undesignable but not minimal 9: designablity &#8595; RivalMotifSearch(m) &#969; return designable or undesignable or unkown 10: if designablity = designable then M designable &#8595; M designable &#8593; {m} 11: else if designablity = undesignable then 12: if &#8657;m sub &#8656; m, m sub &#8599; M designable then M miniundesignable &#8595; M miniundesignable &#8593; {m} &#969; minimal</p><p>could achieve this), it may not be practical due to its high complexity. Nevertheless, it serves as a conceptual foundation for developing more practical algorithms.</p><p>Theorem 5. Given a secondary structure y, Algorithm 3 outputs a set M miniundesignable containing all and only the minimal undesignable motifs in y.</p><p>The proof is conducted by induction (detailed in Supplementary Section F.1). The total number of motifs in a structure can grow exponentially with cardinalities, making Algorithm 3 impractical for large structures. However, empirical observations suggest minimal undesignable motifs typically involve a small number of loops. To address this, we propose a variant of Algorithm 3, referred to as FastMotif (Algorithm 4), designed to identify as many minimal undesignable motifs as possible within a given structure y, while maintaining computational e"ciency. In particular, we limit our evaluation to motifs composed of a loop and any (nonempty) subset of its neighboring loops (Fig. <ref type="figure">7</ref>). This approach o!ers several key advantages:</p><p>1. Each motif has a limited number of loops, making undesignability easier to decide. 2. For a loop v with k neighbor loops, i.e., |N (v)| = k, the size of N (v)'s power set 2 N (v) , excluding &#8843;, is 2 k &#8659; 1. Since most loops have 2 or 3 neighbors, the number of motifs considered remains relatively small, while still covering most of the relevant small motifs. 3. Each loop and its neighbor loops can be seen as a small subgraph on the loop-pair graph G(y). Enumerating motifs in the power set would be equivalent to running Algorithm 3 on a local subgraph.</p><p>To enhance performance, we exclude motifs with more than 3 neighbor loops. The complexity of FastMotif is determined by the product of the number of motifs scanned and the complexity of Algorithm 1, making it polynominal. Additionally, FastMotif can be adapted to scan larger motifs. In Sec.8. We incorporated an undesignable substructure in Eterna puzzles ("Chicken feet" and "Mutated chickenfeet".) from RIGEND <ref type="bibr">[35]</ref> and proved it is a minimal undesignable motif.</p><p>7 Related Work \ &#8843;, i.e., {l 1 }, {l 2 }, {l 3 }, {l 1 , l 2 }, {l 2 , l 3 }, {l 3 , l 1 }, {l 1 , l 2 , l 3 }, only {l 2 , l 3 } and {l 1 , l 2 , l 3 }, when combined with l, are undesignable, but only {l, l 2 , l 3 } is minimal undesignable (see Tab. 3 for Eterna #57).</p><p>To the best of our knowledge, CountingDesign <ref type="bibr">[31,</ref><ref type="bibr">30]</ref> is the existing method that has investigated undesignable motifs. However, it exhaustively enumerates and folds all RNA sequences for each motif group, identifying designable motifs and taking their complement to find undesignable ones. As a result, it is only </p><p>Fig. <ref type="figure">8</ref>: A minimal undesignable motif with two rotational variants (top) found in 5S rRNA and SRP families.</p><p>applicable to short motifs, reporting motifs of up to 14 nucleotides. Another drawback is the limited interpretability of the undesignable motifs, as CountingDesign identifies them by taking the complement of designable motifs rather than directly characterizing the undesignable motifs themselves. Moreover, Count-ingDesign defines motifs as rooted trees starting from a base pair, making it unable to handle external loops in RNA structures. Additionally, it overlooks rotational invariance of motifs, leading to redundancy in the identified undesignable motif set. See also Sec. 8.3 and Supplementary Sec. C for e"ciency comparisons.</p><p>8 Empirical Results</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Settings</head><p>We applied our algorithm FastMotif to two public RNA structure benchmarks: Eterna100 <ref type="bibr">[2]</ref> and ArchiveII <ref type="bibr">[8]</ref>. Eterna100 consists of 100 structures<ref type="foot">foot_0</ref> , artificially designed by human experts, and serves as a primary benchmark for RNA design. ArchiveII, comprising native RNA structures, spans 10 families <ref type="bibr">[13,</ref><ref type="bibr">25,</ref><ref type="bibr">9,</ref><ref type="bibr">7,</ref><ref type="bibr">24,</ref><ref type="bibr">38]</ref> of naturally occurring RNA and is used to evaluate RNA folding <ref type="bibr">[15]</ref>. <ref type="foot">3</ref> Following prior work in RNA design <ref type="bibr">[12,</ref><ref type="bibr">22,</ref><ref type="bibr">4,</ref><ref type="bibr">34]</ref> and undesignability <ref type="bibr">[35,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">1]</ref>, the RNA folding model is ViennaRNA (v2.5.1) <ref type="bibr">[20]</ref>. Our code was written in C++, utilizing a 3.40 GHz Intel Xeon E3-1231 CPU and 32 GB memory. Parallelization was achieved by OpenMP for 8 CPU cores. The parameters during rival motif search (Alg. 1) are set as M = 10 10 , N = 10 5 , K = 100. Our source code is available at <ref type="url">https://github.com/shanry/RNA-Undesign</ref>. 8.2 Undesignable Structures and Unique Minimal Undesignable Motifs 5' 3'</p><p>Fig. <ref type="figure">9</ref>: Undesignable tRNA structure/motif. Table <ref type="table">2</ref> summarizes the statistics of undesignable structures and (minimal) undesignable motifs, as well as the running time, for both Eterna puzzles and native structures from ArchiveII. Among Eterna100 puzzles, we found 24 unique minimal undesignable motifs (35 occurrences), resulting in 18 undesignable puzzles (Table . 3). This result is stronger than that of RIGEND, which identified 16 undesignable puzzles along with one rival (sub-)structure for each puzzle.</p><p>The structures in ArchiveII, on the other hand, are high-quality native structures, which intuitively should be designable. Surprisingly, there are over 1,000 occurrences of undesignable motifs from almost all ArchiveII families (except for Group II Intron). In total, we found 331 unique minimal undesignable motifs in ArchiveII, some of them shared across families, and 663 (17%) undesignable structures. For example, Fig. <ref type="figure">8</ref> shows a minimal undesignable motifs shared across two families and Fig. <ref type="figure">9</ref> shows the only undesignable tRNA structure and motif. See Supplementary Section H for more motif examples in 16S &amp; 23S rRNA structures. We suspect the large number of undesignable structures and motifs are due to the energy model (Vienna 2) being imperfect and pseudoknots playing a role in designability which is beyond our work.</p><p>Interestingly, no motifs are shared between Eterna100 and ArchiveII, and in total we found 355 unique minimal undesignable motifs, which can be browsed on our web server <ref type="url">http://linearfold.org/motifs</ref>. Users can even upload a new structure and the server will find minimal undesignable motifs in it on the fly. 7) 3 5' 3' 5' 3' 60: Mat -Elements &amp; Sections 1 5' 3' 61: Chicken feet 2 5' 3' 67: Simple Single Bond 1 5' 3' 72: Loop next to a Multiloop 1 5' 3' 78: Mat -Lot 2-2 B 1 5' 3' 80: Spiral of 5's 1 5' 3' 81: Campfire 1 5' 3' 86: Methaqualone C16H14N2O 3 5' 3' 87: Cat's Toy 2 2 5' 3'</p><p>88: Zigzag Semicircle 1</p><p>91: Thunderbolt</p><p>3 5' 3' 92: Mutated chicken feet 3 5' 3' 5' 3' 96: Cesspool</p><p>8 5' 3' 5' 3' 99: Shooting Star 1 5' 3'</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">E!ciency</head><p>From Table <ref type="table">2</ref>, we can see the algorithm costs only a few seconds or minutes to scan an entire structure. Such e"ciency is much superior to the previous work CountingDesign. As a comparison, identifying undesignable motifs of length 39, the average length of minimal undesignable motifs identified in our work, would take thousands of years per motif using CountingDesign (Fig. <ref type="bibr">SI 2)</ref>. We also apply our rival motif search algorithm (Algorithm 1) to all motifs up to length 14 to compare with CountingDesign. We successfully identified all the minimal undesignable motifs (total: 4561; unique: 1805) in 0.7 hours, whereas CountingDesign takes more than a week. See Supplementary Section C for detailed comparison with CountingDesign.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">Conclusions and Future Work</head><p>We introduced a theoretical framework for loop-based motifs, and fast algorithms with a loop-pair graph representation to identify unique minimal undesignable motifs in RNA structures. By searching for rival motifs, the undesignability of motifs can be e"ciently confirmed and explicitly explained. Future exploration could involve implementing DFS/BFS-based algorithms to search for a broader range of undesignable motifs.</p><p>The results with ArchiveII suggest the current thermodynamic parameters are deficient. We hypothesize that improvements in parameterization <ref type="bibr">[37,</ref><ref type="bibr">28]</ref> could be made, particularly from the perspectives of undesignability. Future work could involve comparing the sets of minimal undesignable motifs using alternative parameter sets beyond those implemented in the ViennaRNA package, including comprehensive parameterizations that account for coaxial stacking or that are informed by experimentally known structures <ref type="bibr">[29,</ref><ref type="bibr">3,</ref><ref type="bibr">23]</ref>.</p><p>A Loops in RNA Structures Table SI 1: Critical positions of loops in Fig. 2 Loop Type Critical Positions Closing Pairs Mismatches (Unpaired) External (3, 50) 2, 51 Stack (3, 50), (4, 49) Stack (4, 49), (5, 48) Multi (5, 48), (9, 24), (28, 44) 4, 49, 8, 25, 27, 45 Stack (9, 24), (10, 23) Bulge (10, 23), (11, 19) Stack (11, 19), (12, 18) Hairpin (12, 18) 13, 17 Stack (28, 44), (29, 43) Internal (29, 43), (32, 39) 30, 42, 31, 40 Stack (32, 39), (33, 38) Hairpin (33, 38) 34, 37</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multiloop</head><p>Table SI 2: Critical positions for each type of loops under Turner model implemented in ViennaRNA Special hairpins [21] of triloops, tetraloops and hexaloops are not considered here Loop Type Critical Positions Closing Pairs Mismatches External (i1, j1), (i2, j2), . . . , (i k , j k x &#8595; map() 3: for i in I do 4:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Projection and Intersection</head><p>x[i] &#8595; xi &#969; Project the i-th nucleotide to index i 5: return x ii T. Zhou et al. 4 6 8 10 12 14 Motif Length 0 20 40 60 80 100 120 140 160 Running Time (hours) CountingDesign FastMotif (This Work) Cumulative Runtime CountingDesign 1.17 weeks FastMotif 0.7 hours Fig. SI 2: Running time comparison between FastMotif (this work) and CountingDesign for evaluating motifs of lengths up to 14. Algorithm 6 Contraint Intersection C &#8593; = Intersection(C 1 , C 2 ) 1: function Intersection(C1, C2) &#969; C1, C2 are sets of constraints 2: (I1, X1) &#8595; C1 &#969; I contains critical positions and X is a set of nucleotides compositions 3: (I2, X2) &#8595; C2 4: I &#8594; &#8595; I1 &#8660; I2 5: if I &#8594; = &#8843; then &#969; No overlapping positions; return original constraints 6: return C1, C2 7: X&#8594; 1 &#8595; {x &#8598; I &#8594; | x &#8599; X1} 8: X&#8594; 2 &#8595; {x &#8598; I &#8594; | x &#8599; X2} 9: for x &#8599; X1 do &#969; Remove nucleotides compositions from X1 that is not in X2 10: if x &#8598; I &#8594; / &#8599; X&#8594; 2 then X1 &#8595; X1 \ {x} 11: for x &#8599; X2 do &#969; Remove nucleotides compositions from X2 that is not in X1 12: if x &#8598; I &#8594; / &#8599; X&#8594; 1 then X2 &#8595; X2 \ {x} 13: C &#8594; 1 &#8595; (I1, X1) 14: C &#8594; 2 &#8595; (I2, X2) 15: return C &#8594; 1 &#8593; C &#8594; 2 &#969; Return updated constraints</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Comparison with CountingDesign</head><p>We compared FastMotif with the original CountingDesign <ref type="bibr">[31]</ref> program<ref type="foot">foot_3</ref> to identify all undesignable motifs up to 14 nucleotides in length under the same setting (3.40 GHz Intel Xeon E3-1231 CPU and 32 GB of memory, with parallelization of 8 CPU cores). Fig. SI 2 shows the total running times of the two methods for identifying undesignable motifs of di!erent lengths. Both methods found all all the undesignable motifs (total: 4561; unique: 1805) and the designable motifs up to length of 14. However, the time cost of CountingDesign increases exponentially with motif lengths, highlighting its impracticality for longer motifs.</p><p>In contrast, FastMotif only need 0.7 hours. More importantly, FastMotif identified a set of rival motifs for each undesignable motif, which is explainable and helpful for further understanding RNA folding. Therefore, FastMotif demonstrates significant advantages in terms of not only scalability but also interpretability. 5' 3' Fig. SI 32: F.ananassa, 16S rRNA 5' 3' Fig. SI 33: F.ananassa, 16S rRNA 5' 3' Fig. SI 34: G.intestinalis, 16S rRNA 5' 3' Fig. SI 35: H.sapiens, 16S rRNA 5' 3' Fig. SI 36: H.sapiens, 16S rRNA 5' 3' Fig. SI 37: H.sapiens.mito, 16S rRNA 5' 3' Fig. SI 38: H.volcanii, 16S rRNA xii T. Zhou et al. 5' 3' Fig. SI 40: P.occultum, 16S rRNA 5' 3' Fig. SI 41: P.staleyi, 16S rRNA 5' 3' Fig. SI 42: S.cerevisiae, 16S rRNA 5' 3' Fig. SI 43: T.gondii, 16S rRNA 5' 3' Fig. SI 44: A.pyrophilus, 16S rRNA 5' 3' Fig. SI 45: T.maritima, 16S rRNA 5' 3' Fig. SI 46: Z.mays, 16S rRNA Supplementary Information xiii 5' 3' Fig. SI 47: B.burgdorferi, 16S rRNA 5' 3' Fig. SI 48: B.subtilis, 16S rRNA 5' 3' Fig. SI 49: C.elegans, 16S rRNA 5' 3' Fig. SI 50: C.psittaci, 16S rRNA 5' 3' Fig. SI 51: C.reinhardtii.chloro, 16S rRNA 5' 3' Fig. SI 52: C.reinhardtii.mito, 16S rRNA 5' 3' Fig. SI 53: D.melanogaster, 16S rRNA 5' 3' Fig. SI 54: E.coli, 16S rRNA xiv T. Zhou et al. H.3 23S rRNA Motifs and Structures 5' 3' Fig. SI 55: B.subtilis, 23S rRNA 5' 3' Fig. SI 56: B.subtilis, 23S rRNA 5' 3' Fig. SI 57: B.subtilis, 23S rRNA 5' 3' Fig. SI 58: E.coli, 23S rRNA Supplementary Information xv 5' 3' Fig. SI 59: E.coli, 23S rRNA 5' 3' Fig. SI 60: E.coli, 23S rRNA 5' 3' Fig. SI 61: H.pylori, 23S rRNA 5' 3' Fig. SI 62: H.pylori, 23S rRNA xvi T. Zhou et al. 5' 3' Fig. SI 63: S.aureus, 23S rRNA 5' 3' Fig. SI 64: S.aureus, 23S rRNA 5' 3' Fig. SI 65: S.aureus, 23S rRNA 5' 3' Fig. SI 66: T.thermophilus, 23S rRNA</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>We used the 21 structures that do not have successfully designed sequences by unique MFE criterion</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>Pseudoknots are removed before running our algorithm (tRNA, 5S rRNA, and SRP do not contain pseudoknots).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>This motif is proven undesignable if we ignore heuristic energies of special hairpins.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_3"><p>https://gitlab.com/htyao/countingdesign</p></note>
		</body>
		</text>
</TEI>
