Abstract. Motivation: Predicting the secondary structure of a ribonucleic acid (RNA) sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results: We present a novel alternative O(n^3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5′-to-3′) direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models. Availability and implementation: Our source code is available at https://github.com/LinearFold/LinearFold, and our webserver is at http://linearfold.org (sequence limit: 100 000 nt). Supplementary information: Supplementary data are available at Bioinformatics online.
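The left-to-right scan with beam pruning described in this abstract can be illustrated with a small sketch. It is not the LinearFold implementation: it assumes a toy Nussinov-style score (count of base pairs) instead of a thermodynamic or learned model, and the names `fold_left_to_right`, `PAIRS`, `beam` and `min_loop` are hypothetical. The ranking used for pruning (inside score plus the best score of the prefix before the pending open base) is likewise a simplification chosen for illustration.

```python
# Toy left-to-right folding with optional beam pruning (pair-count score).
# Assumption: allowed pairs are Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fold_left_to_right(seq, beam=None, min_loop=3):
    """Return the best pair count found for seq.

    states[j] maps i -> best pair count inside seq[i:j]. For i > 0 the base
    seq[i-1] is a pending (not yet closed) opening base. With beam=None the
    search is exhaustive (cubic time); with a fixed beam, only the top
    `beam` states survive at each position, so the result is approximate.
    """
    n = len(seq)
    states = [dict() for _ in range(n + 1)]
    states[0][0] = 0
    for j in range(n + 1):
        cur = states[j]
        if beam is not None and len(cur) > beam:
            # Rank by inside score plus best score of the prefix before the
            # pending open base; state 0 is always kept so an answer exists.
            def rank(i):
                prefix = states[i - 1].get(0, 0) if i > 0 else 0
                return cur[i] + prefix
            keep = set(sorted(cur, key=rank, reverse=True)[:beam]) | {0}
            cur = {i: cur[i] for i in keep}
            states[j] = cur
        if j == n:
            break
        nxt = states[j + 1]
        for i, s in cur.items():
            # skip: seq[j] stays unpaired
            if nxt.get(i, -1) < s:
                nxt[i] = s
            # pop: seq[j] closes the pending open base seq[i-1]
            if i > 0 and j - i >= min_loop and (seq[i - 1], seq[j]) in PAIRS:
                for k, s_left in states[i - 1].items():
                    cand = s_left + s + 1
                    if nxt.get(k, -1) < cand:
                        nxt[k] = cand
        # push: seq[j] becomes a pending open base for a new inner span
        nxt.setdefault(j + 1, 0)
    return states[n][0]
```

Calling `fold_left_to_right(seq)` runs the exhaustive search, while `fold_left_to_right(seq, beam=100)` bounds the number of states kept per position. Because pruning only discards candidates, the beam result can never exceed the exhaustive one under this toy score.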
RNA language models predict mutations that improve RNA function
Structured RNA lies at the heart of many central biological processes, from gene expression to catalysis. RNA structure prediction is not yet possible due to a lack of high-quality reference data associated with organismal phenotypes that could inform RNA function. We present GARNET (Gtdb Acquired RNa with Environmental Temperatures), a new database for RNA structural and functional analysis anchored to the Genome Taxonomy Database (GTDB). GARNET links RNA sequences to experimental and predicted optimal growth temperatures of GTDB reference organisms. Using GARNET, we develop sequence- and structure-aware RNA generative models, with overlapping triplet tokenization providing optimal encoding for a GPT-like model. Leveraging hyperthermophilic RNAs in GARNET and these RNA generative models, we identify mutations in ribosomal RNA that confer increased thermostability to the Escherichia coli ribosome. The GTDB-derived data and deep learning models presented here provide a foundation for understanding the connections between RNA sequence, structure, and function.
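As a rough illustration of what overlapping triplet tokenization could look like, the sketch below splits a sequence into 3-mers with a stride of one nucleotide. The function name `overlapping_triplets` and its parameters are hypothetical; the abstract does not specify the actual tokenizer, vocabulary, or special tokens used with the GPT-like model.

```python
def overlapping_triplets(seq, k=3, stride=1):
    """Split seq into overlapping k-mers (k=3, stride=1 by assumption).

    Each token shares k - stride characters with its neighbour, so a
    sequence of length n yields n - k + 1 tokens when stride is 1.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = overlapping_triplets("GGGAAAUCC")
# tokens == ["GGG", "GGA", "GAA", "AAA", "AAU", "AUC", "UCC"]
```

With a four-letter alphabet this scheme gives at most 64 distinct triplet tokens, and each nucleotide away from the sequence ends appears in three consecutive tokens.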
- Award ID(s): 2002182
- PAR ID: 10585163
- Publisher / Repository: Springer Nature
- Date Published:
- Journal Name: Nature Communications
- Volume: 15
- Issue: 1
- ISSN: 2041-1723
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- RNAs are often studied in nonnative sequence contexts to facilitate structural studies. However, seemingly innocuous changes to an RNA sequence may perturb the native structure and generate inaccurate or ambiguous structural models. To facilitate the investigation of native RNA secondary structure by selective 2′ hydroxyl acylation analyzed by primer extension (SHAPE), we engineered an approach that couples minimal enzymatic steps to RNA chemical probing and mutational profiling (MaP) reverse transcription (RT) methods, a process we call template switching and mutational profiling (Switch-MaP). In Switch-MaP, RT templates and additional library sequences are added postprobing through ligation and template switching, capturing reactivities for every nucleotide. For a candidate SAM-I riboswitch, we compared RNA structure models generated by the Switch-MaP approach to those of traditional primer-based MaP, including RNAs with or without appended structure cassettes. Primer-based MaP masked reactivity data in the 5′ and 3′ ends of the RNA, producing ambiguous ensembles inconsistent with the conserved SAM-I riboswitch secondary structure. Structure cassettes enabled unambiguous modeling of an aptamer-only construct but introduced nonnative interactions in the full-length riboswitch. In contrast, Switch-MaP provided reactivity data for all nucleotides in each RNA and enabled unambiguous modeling of secondary structure, consistent with the conserved SAM-I fold. Switch-MaP is a straightforward alternative approach to primer-based and cassette-based chemical probing methods that precludes primer masking and the formation of alternative secondary structures due to nonnative sequence elements.
- Abstract. Direct RNA nanopore sequencing allows for the identification of full-length RNAs with a ∼10% error rate consisting of mismatches and small deletions. These errors are thought to be randomly distributed and structure-independent since RNA/cDNA duplexes are generated to prevent RNA structure formation prior to sequencing. When analyzing citrus yellow vein associated virus (CY1) reads during infection of Nicotiana benthamiana, viral (+/-) foldback RNAs (i.e., viral plus [+]-strands joined to [-]-strands) showed significantly higher error rates (mismatches and deletions) in the 5ʹ (+)RNA portion with errors that were relatively evenly distributed, while errors in the attached (-)RNA portion were less frequent and unevenly distributed. Non-foldback CY1 (+)RNAs from infected plants also showed an uneven distribution of errors, which correlated with errors in in vitro transcribed CY1 (+)RNA reads in both position and frequency. Hotspot errors in non-foldback CY1 (+)RNA and (-)RNA reads only weakly correlated, and hotspots were frequently located 5ʹ of known structural elements. Since nanopore sequencing is also used to identify RNA modifications, which depend on base-specific sequencing errors, algorithms for RNA modification detection were also examined for bias. We found that multiple programs predicted RNA modifications in in vitro transcribed CY1 RNA at the same positions and with similar confidence levels as with in planta CY1 RNA. These data suggest that direct RNA sequencing contains inherent error biases that may be associated with post-translocation RNA folding and low sequence complexity, and therefore extrapolations based on sequencing error require special consideration.
- Abstract. Chemical probing technologies enable high-throughput examination of diverse structural features of RNA, including local nucleotide flexibility, RNA secondary structure, protein and ligand binding, through-space interaction networks, and multistate structural ensembles. Deep understanding of RNA structure–function relationships typically requires evaluating a system under structure- and function-altering conditions, linking these data with additional information, and visualizing multilayered relationships. Current platforms lack the broad accessibility, flexibility and efficiency needed to iterate on integrative analyses of these diverse, complex data. Here, we share the RNA visualization and graphical analysis toolset RNAvigate, a straightforward and flexible Python library that automatically parses 21 standard file formats (primary sequence annotations, per- and internucleotide data, and secondary and tertiary structures) and outputs 18 plot types. RNAvigate enables efficient exploration of nuanced relationships between multiple layers of RNA structure information and across multiple experimental conditions. Compatibility with Jupyter notebooks enables nonburdensome, reproducible, transparent and organized sharing of multistep analyses and data visualization strategies. RNAvigate simplifies and accelerates discovery and characterization of RNA-centric functions in biology.
- Some RNA viruses package their genomes with extraordinary selectivity, assembling protein capsids around their own viral RNA while excluding nearly all host RNA. How the assembling proteins distinguish viral RNA from host RNA is not fully understood, but RNA structure is thought to play a key role. To test this idea, we perform in-cellulo packaging experiments using bacteriophage MS2 coat proteins and a variety of RNA molecules in Escherichia coli. In each experiment, plasmid-derived RNA molecules with a specified sequence compete against the cellular transcriptome for packaging by plasmid-derived coat proteins. Following this competition, we quantify the total amount and relative composition of the packaged RNA using electron microscopy, interferometric scattering microscopy, and high-throughput sequencing. By systematically varying the input RNA sequence and measuring changes in packaging outcomes, we are able to directly test competing models of selective packaging. Our results rule out a longstanding model in which selective packaging requires the well-known translational repressor (TR) stem-loop, and instead support more recent models in which selectivity emerges from the collective interactions of multiple coat proteins and multiple stem-loops distributed across the RNA molecule. These findings establish a framework for studying and understanding selective packaging in a range of natural viruses and virus-like particles, and lay the groundwork for engineering synthetic systems that package specific RNA cargoes.

