skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Improved functions for nonlinear sequence comparison using SEEKR
SEquence Evaluation throughk-mer Representation (SEEKR) is a method of sequence comparison that uses sequence substrings calledk-mers to quantify the nonlinear similarity between nucleic acid species. We describe the development of new functions within SEEKR that enable end-users to estimateP-values that ascribe statistical significance to SEEKR-derived similarities, as well as visualize different aspects ofk-mer similarity. We apply the new functions to identify chromatin-enriched lncRNAs that containXIST-like sequence features, and we demonstrate the utility of applying SEEKR on lncRNA fragments to identify potential RNA-protein interaction domains. We also highlight ways in which SEEKR can be applied to augment studies of lncRNA conservation, and we outline the best practice of visualizing RNA-seq read density to evaluate support for lncRNA annotations before their in-depth study in cell types of interest.  more » « less
Award ID(s):
2228805
PAR ID:
10578059
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
RNA
Date Published:
Journal Name:
RNA
Volume:
30
Issue:
11
ISSN:
1355-8382
Page Range / eLocation ID:
1408 to 1421
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Long noncoding RNA (lncRNA) genes outnumber protein coding genes in the human genome and the majority remain uncharacterized. A major difficulty in generalizing understanding of lncRNA function is the dearth of gross sequence conservation, both for lncRNAs across species and for lncRNAs that perform similar functions within a species. Machine learning based methods which harness vast amounts of information on RNAs are increasingly used to impute certain biological characteristics. This includes interactions with proteins that are important mediators of RNA function, thus enabling the generation of knowledge in contexts for which experimental data are lacking. Here, we applied a natural language-based machine learning approach that enabled us to identify RNA binding protein interactions in lncRNA transcripts, using only RNA sequence as an input. We found that this predictive method is a powerful approach to infer conserved binding across species as distant as human and opossum, even in the absence of sequence conservation, thus informing on sequence-function relationships for these poorly understood RNAs. 
    more » « less
  2. Background: Long non-coding Ribonucleic Acids (lncRNAs) can be localized to different cellular compartments, such as the nuclear and the cytoplasmic regions. Their biological functions are influenced by the region of the cell where they are located. Compared to the vast number of lncRNAs, only a relatively small proportion have annotations regarding their subcellular localization. It would be helpful if those few annotated lncRNAs could be leveraged to develop predictive models for localization of other lncRNAs. Methods: Conventional computational methods use q-mer profiles from lncRNA sequences and train machine learning models such as support vector machines and logistic regression with the profiles. These methods focus on the exact q-mer. Given possible sequence mutations and other uncertainties in genomic sequences and their role in biological function, a consideration of these variabilities might improve our ability to model lncRNAs and their localization. Thus, we build on inexact q-mers and use machine learning/deep learning techniques to study three specific problems in lncRNA subcellular localization, namely, prediction of lncRNA localization using inexact q-mers, the issue of whether lncRNA localization is cell-type-specific, and the notion of switching (lncRNA) genes. Results: We performed our analysis using data on lncRNA localization across 15 cell lines. Our results showed that using inexact q-mers (with q = 6) can improve the lncRNA localization prediction performance compared to using exact q-mers. Further, we showed that lncRNA localization, in general, is not cell-line-specific. We also identified a category of LncRNAs which switch cellular compartments between different cell lines (we call them switching lncRNAs). These switching lncRNAs complicate the problem of predicting lncRNA localization using machine learning models, showing that lncRNA localization is still a major challenge. 
    more » « less
  3. Abstract Motivation Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), because of the large numbers of available RNA-seq samples and of k-mers/sequences in each sample, computing the full similarity matrix using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) has memory and runtime challenges; this makes direct representative set selection infeasible with limited computing resources. Results We developed a novel computational method called ‘hierarchical representative set selection’ to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection can achieve summarization quality close to that of direct representative set selection, while largely reducing runtime and memory requirements of computing the full similarity matrix (up to 8.4× runtime reduction and 5.35× memory reduction for 10 000 and 12 000 samples respectively that could be practically run with direct subset selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases like the SRA. Availability and implementation The code is available at https://github.com/Kingsford-Group/hierrepsetselection and https://github.com/Kingsford-Group/jellyfishsim. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. Abstract How the noncoding genome affects cellular functions is a key biological question. A particular challenge is to distinguish the effects of noncoding DNA elements from long noncoding RNAs (lncRNAs) that coincide at the same loci. Here, we identified the flowering‐associated intergenic lncRNA (FLAIL) inArabidopsisthrough early floweringflailmutants. Expression ofFLAILRNA from a different chromosomal location in combination with strand‐specific RNA knockdown characterizedFLAILas a trans‐acting RNA molecule.FLAILdirectly binds to differentially expressed target genes that control flowering via RNA–DNA interactions through conserved sequence motifs.FLAILinteracts with protein and RNA components of the spliceosome to affect target mRNA expression through co‐transcriptional alternative splicing (AS) and linked chromatin regulation. In the absence ofFLAIL, splicing defects at the direct FLAIL target flowering gene LACCASE 8 (LAC8) correlated with reduced mRNA expression. Double mutant analyses support a model whereFLAIL‐mediated splicing of LAC8 promotes its mRNA expression and represses flowering. Our study suggests lncRNAs as accessory components of the spliceosome that regulate AS and gene expression to impact organismal development. 
    more » « less
  5. Abstract MotivationK-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. ResultsWe derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementationA python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less