Title: RNAcentral 2021: secondary structure integration, improved sequence search and new member databases
Abstract RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that provides a single access point to 44 RNA resources and >18 million ncRNA sequences from a wide range of organisms and RNA types. RNAcentral now also includes secondary (2D) structure information for >13 million sequences, making RNAcentral the world’s largest RNA 2D structure database. The 2D diagrams are displayed using R2DT, a new 2D structure visualization method that uses consistent, reproducible and recognizable layouts for related RNAs. The sequence similarity search has been updated with a faster interface featuring facets for filtering search results by RNA type, organism, source database or any keyword. This sequence search tool is available as a reusable web component, and has been integrated into several RNAcentral member databases, including Rfam, miRBase and snoDB. To allow for a more fine-grained assignment of RNA types and subtypes, all RNAcentral sequences have been annotated with Sequence Ontology terms. The RNAcentral database continues to grow and provide a central data resource for the RNA community. RNAcentral is freely available at
Journal Name:
Nucleic Acids Research
Page Range or eLocation-ID:
D212 to D220
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Riboswitches are conserved structural ribonucleic acid (RNA) sensors that are mainly found to regulate a large number of genes/operons in bacteria. Presently, >50 bacterial riboswitch classes have been discovered, but only the thiamine pyrophosphate riboswitch class is detected in a few eukaryotes like fungi, plants and algae. One of the most important challenges in riboswitch research is to discover existing riboswitch classes in eukaryotes and to understand the evolution of bacterial riboswitches. However, traditional search methods for riboswitch detection have failed to detect eukaryotic riboswitches besides just one class and any distant structural homologs of riboswitches. We developed a novel approach based on inverse RNA folding that attempts to find sequences that match the shape of the target structure with minimal sequence conservation based on key nucleotides that interact directly with the ligand. Then, to support our matched candidates, we expanded the results into a covariance model representing similar sequences preserving the structure. Our method transforms a structure-based search into a sequence-based search that considers the conservation of secondary structure shape and ligand-binding residues. This method enables us to identify a potential structural candidate in fungi that could be the distant homolog of bacterial purine riboswitches. Further, phylogenomic analysis and evolutionary distribution of this structural candidate indicate that the most likely point of origin of this structural candidate in these organisms is associated with the loss of traditional purine riboswitches. The computational approach could be applicable to other domains and problems in RNA research.
  2. Abstract Motivation Predicting the secondary structure of a ribonucleic acid (RNA) sequence is useful in many applications. Existing algorithms [based on dynamic programming] suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results We present a novel alternative O(n^3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5′-to-3′) direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models. Availability and implementation Our source code is available at, and our webserver is at (sequence limit: 100 000nt). Supplementary information Supplementary data are available at Bioinformatics online.
  3. Marine zooplankton are rapid-responders and useful indicators of environmental variability and climate change impacts on pelagic ecosystems on time scales ranging from seasons to years to decades. The systematic complexity and taxonomic diversity of the zooplankton assemblage has presented significant challenges for routine morphological (microscopic) identification of species in samples collected during ecosystem monitoring and fisheries management surveys. Metabarcoding using the mitochondrial Cytochrome Oxidase I (COI) gene region has shown promise for detecting and identifying species of some – but not all – taxonomic groups in samples of marine zooplankton. This study examined species diversity of zooplankton on the Northwest Atlantic Continental Shelf using 27 samples collected in 2002-2012 from the Gulf of Maine, Georges Bank, and Mid-Atlantic Bight during Ecosystem Monitoring (EcoMon) Surveys by the NOAA NMFS Northeast Fisheries Science Center. COI metabarcodes were identified using the MetaZooGene Barcode Atlas and Database ( ) specific to the North Atlantic Ocean. A total of 181 species across 23 taxonomic groups were detected, including a number of sibling and cryptic species that were not discriminated by morphological taxonomic analysis of EcoMon samples. In all, 67 species of 15 taxonomic groups had ≥ 50 COI sequences; 23 species had >1,000 COI sequences. Comparative analysis of molecular and morphological data showed significant correlations between COI sequence numbers and microscopic counts for 5 of 6 taxonomic groups and for 5 of 7 species with >1,000 COI sequences for which both types of data were available. Multivariate statistical analysis showed clustering of samples within each region based on both COI sequence numbers and EcoMon counts, although differences among the three regions were not statistically significant.
The results demonstrate the power and potential of COI metabarcoding for identification of species of metazoan zooplankton in the context of ecosystem monitoring.
  4. Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification), combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhoppers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, “GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates”, we selected the new open-source platform TaxonWorks as the cyberinfrastructure. In the scope of the project, the original “3i World Auchenorrhyncha Database” was imported into TaxonWorks. At the present time, TaxonWorks has many tools to automatically import nomenclature, citations, and specimen-based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and the complexity of the data in the database required a custom migration script, which is still probably the most efficient solution for importing datasets with a long development history. At the moment, the World Auchenorrhyncha Database comprehensively covers nomenclature of the group and includes data on 70 valid families, 6,816 valid genera, 47,064 valid species as well as synonymy and subsequent combinations (Fig. 1). In addition, many taxon records include the original citation, bibliography, type information, etymology, etc. The bibliography of the group includes 37,579 sources, about 1/3 of which are associated with PDF files.
Species have distribution records, either derived from individual specimens or as country and state level asserted distribution, as well as biological associations indicating host plants, predators, and parasitoids. Observation matrices in TaxonWorks are designed to handle morphological data associated with taxa or specimens. The matrices may be used to automatically generate interactive identification keys and taxon descriptions. They can also be downloaded to be imported, for example, into Lucid builder, or to perform phylogenetic analysis using an external application. At the moment there are 36 matrices associated with the project. The observation matrix from the GoLife project covers 798 taxa by 210 descriptors (most of which are qualitative multi-state morphological descriptors) (Fig. 2). Illustrations are provided for 9,886 taxa and organized in the specialized image matrix, and can be used as a pictorial key for determination of species and taxa of a higher rank. For the phylogenetic analysis, a dataset was constructed for 730 terminal taxa and >160,000 nucleotide positions obtained using anchored hybrid enrichment of genomic DNA for a sample of leafhoppers from the subfamily Deltocephalinae and outgroups. The probe kit targets leafhopper genes, as well as some bacterial genes (endosymbionts and plant pathogens transmitted by leafhoppers). The maximum likelihood analyses of concatenated nucleotide and amino acid sequences as well as coalescent gene tree analysis yielded well-resolved phylogenetic trees (Cao et al. 2022). Raw sequence data have been uploaded to the Sequence Read Archive on GenBank. Occurrence and morphological data, as well as diagnostic images, for voucher specimens have been incorporated into TaxonWorks. Data in TaxonWorks can be exported in raw format, accessed via an Application Programming Interface (API), or shared with external data aggregators like Catalogue of Life, GBIF, iDigBio.
  5. Abstract Motivation

    Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.


    We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences.

    Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.

    Availability and implementation

    pufferfish is written in C++11, is open source, and is available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.
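
To make the idea of a hashing-based colored k-mer index more concrete, the following is a minimal sketch in Python. It is not pufferfish's data structure: it uses an ordinary dictionary rather than minimum perfect hashing, does not compact the graph into unitigs, and ignores reverse complements. The function names, the reference names and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_index(references, k):
    """Map each k-mer to the set of reference names ("colors") containing it.

    Assumption: a plain hash map stands in for the succinct index
    described in the abstract.
    """
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return dict(index)


def query_colors(index, read, k):
    """Count, per reference, how many k-mers of the read that reference contains."""
    hits = defaultdict(int)
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return dict(hits)


# Hypothetical toy references; real inputs would be whole genomes.
references = {"refA": "ACGTACGGA", "refB": "TTACGGATT"}
index = build_colored_index(references, k=4)

shared_hits = query_colors(index, "ACGGA", k=4)   # k-mers present in both references
unique_hits = query_colors(index, "ACGTA", k=4)   # k-mers present only in refA
```

A per-reference k-mer count of this kind is roughly the "k-mer presence" signal that the abstract contrasts with coverage by chains of consistent unique maximal matches. The sketch keeps every k-mer explicitly, which is the space cost the described index is designed to reduce.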
