Riboswitches are conserved structural ribonucleic acid (RNA) sensors that are mainly found to regulate a large number of genes/operons in bacteria. Presently, >50 bacterial riboswitch classes have been discovered, but only the thiamine pyrophosphate riboswitch class is detected in a few eukaryotes like fungi, plants and algae. One of the most important challenges in riboswitch research is to discover existing riboswitch classes in eukaryotes and to understand the evolution of bacterial riboswitches. However, traditional search methods for riboswitch detection have failed to detect eukaryotic riboswitches besides just one class and any distant structural homologs of riboswitches. We developed a novel approach based on inverse RNA folding that attempts to find sequences that match the shape of the target structure with minimal sequence conservation based on key nucleotides that interact directly with the ligand. Then, to support our matched candidates, we expanded the results into a covariance model representing similar sequences preserving the structure. Our method transforms a structure-based search into a sequence-based search that considers the conservation of secondary structure shape and ligand-binding residues. This method enables us to identify a potential structural candidate in fungi that could be the distant homolog of bacterial purine riboswitches. Further, phylogenomic analysismore »
- ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; more »
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- Nucleic Acids Research
- Page Range or eLocation-ID:
- D212 to D220
- Sponsoring Org:
- National Science Foundation
More Like this
A computational approach for the identification of distant homologs of bacterial riboswitches based on inverse RNA folding
Abstract Motivation Predicting the secondary structure of an ribonucleic acid (RNA) sequence is useful in many applications. Existing algorithms [based on dynamic programming] suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results We present a novel alternative O(n3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5′-to-3′) direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both ofmore »
COI Metabarcoding of Zooplankton Species Diversity for Time-Series Monitoring of the NW Atlantic Continental ShelfMarine zooplankton are rapid-responders and useful indicators of environmental variability and climate change impacts on pelagic ecosystems on time scales ranging from seasons to years to decades. The systematic complexity and taxonomic diversity of the zooplankton assemblage has presented significant challenges for routine morphological (microscopic) identification of species in samples collected during ecosystem monitoring and fisheries management surveys. Metabarcoding using the mitochondrial Cytochrome Oxidase I (COI) gene region has shown promise for detecting and identifying species of some – but not all – taxonomic groups in samples of marine zooplankton. This study examined species diversity of zooplankton on the Northwest Atlantic Continental Shelf using 27 samples collected in 2002-2012 from the Gulf of Maine, Georges Bank, and Mid-Atlantic Bight during Ecosystem Monitoring (EcoMon) Surveys by the NOAA NMFS Northeast Fisheries Science Center. COI metabarcodes were identified using the MetaZooGene Barcode Atlas and Database ( https://metazoogene.org/MZGdb ) specific to the North Atlantic Ocean. A total of 181 species across 23 taxonomic groups were detected, including a number of sibling and cryptic species that were not discriminated by morphological taxonomic analysis of EcoMon samples. In all, 67 species of 15 taxonomic groups had ≥ 50 COI sequences; 23 species had >1,000 COImore »
Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification) combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhopers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, “GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates”, we selected the new open-source platform TaxonWorks as the cyberinfrastructure. In the scope of the project, the original “3i World Auchenorrhyncha Database” was imported into TaxonWorks. At the present time, TaxonWorks has many tools to automatically import nomenclature, citations, and specimen based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and complexity of the data in the database required a custom migration script, which is still probably the most efficient solution for importing datasets with long development history. At the moment, the World Auchenorrhyncha Database comprehensively covers nomenclature of the group and includes data on 70 validmore »
Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.
We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide verymore »
Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.
Availability and implementation
pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.
Supplementary data are available at Bioinformatics online.