Title: Predicting conserved functional interactions for long noncoding RNAs via deep learning
Long noncoding RNA (lncRNA) genes outnumber protein-coding genes in the human genome, and the majority remain uncharacterized. A major difficulty in generalizing our understanding of lncRNA function is the dearth of gross sequence conservation, both for lncRNAs across species and for lncRNAs that perform similar functions within a species. Machine learning-based methods that harness vast amounts of information on RNAs are increasingly used to impute certain biological characteristics. These include interactions with proteins that are important mediators of RNA function, thus enabling the generation of knowledge in contexts for which experimental data are lacking. Here, we applied a natural language-based machine learning approach that enabled us to identify RNA-binding protein interactions in lncRNA transcripts using only RNA sequence as input. We found that this predictive method is a powerful approach for inferring conserved binding across species as distant as human and opossum, even in the absence of sequence conservation, thus informing on sequence-function relationships for these poorly understood RNAs.
Award ID(s):
2243562 2243666
PAR ID:
10574080
Author(s) / Creator(s):
;
Publisher / Repository:
Frontiers Media
Date Published:
Journal Name:
Frontiers in RNA Research
Volume:
2
ISSN:
2813-7116
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Long noncoding RNAs (lncRNAs) are relatively novel RNAs with increasingly prominent roles in regulating gene expression, mainly in the nucleus but more recently in regions such as the mitochondrion. This study explores how lncRNAs interact with polynucleotide phosphorylase (PNPase), a protein that regulates RNA import into the mitochondrion. Machine learning identified several RNA structural features that improved lncRNA binding to PNPase, which may be useful in targeting RNA therapeutics to the mitochondrion. 
  2. Simon, Anne E. (Ed.)
    ABSTRACT Long noncoding RNAs (lncRNAs) of virus origin accumulate in cells infected by many positive-strand (+) RNA viruses to bolster viral infectivity. Their biogenesis mostly utilizes exoribonucleases of host cells that degrade viral genomic or subgenomic RNAs in the 5′-to-3′ direction until being stalled by well-defined RNA structures. Here, we report a viral lncRNA that is produced by a novel replication-dependent mechanism. This lncRNA corresponds to the last 283 nucleotides of the turnip crinkle virus (TCV) genome and hence is designated tiny TCV subgenomic RNA (ttsgR). ttsgR accumulated to high levels in TCV-infected Nicotiana benthamiana cells when the TCV-encoded RNA-dependent RNA polymerase (RdRp), also known as p88, was overexpressed. Both (+) and (−) strand forms of ttsgR were produced in a manner dependent on the RdRp functionality. Strikingly, templates as short as ttsgR itself were sufficient to program ttsgR amplification, as long as the TCV-encoded replication proteins p28 and p88 were provided in trans. Consistent with its replicational origin, ttsgR accumulation required a 5′ terminal carmovirus consensus sequence (CCS), a sequence motif shared by genomic and subgenomic RNAs of many viruses phylogenetically related to TCV. More importantly, introducing a new CCS motif elsewhere in the TCV genome was alone sufficient to cause the emergence of another lncRNA. Finally, abolishing ttsgR by mutating its 5′ CCS gave rise to a TCV mutant that failed to compete with wild-type TCV in Arabidopsis. Collectively, our results unveil a replication-dependent mechanism for the biogenesis of viral lncRNAs, thus suggesting that multiple mechanisms, individually or in combination, may be responsible for viral lncRNA production. 
IMPORTANCE Many positive-strand (+) RNA viruses produce long noncoding RNAs (lncRNAs) during the process of cellular infections and mobilize these lncRNAs to counteract antiviral defenses, as well as coordinate the translation of viral proteins. Most viral lncRNAs arise from 5′-to-3′ degradation of longer viral RNAs being stalled at stable secondary structures. Here, we report a viral lncRNA that is produced by the replication machinery of turnip crinkle virus (TCV). This lncRNA, designated ttsgR, shares the terminal characteristics with TCV genomic and subgenomic RNAs and overaccumulates in the presence of moderately overexpressed TCV RNA-dependent RNA polymerase (RdRp). Furthermore, templates of similar size to ttsgR are readily replicated by TCV replication proteins (p28 and RdRp) provided from nonviral sources. In summary, this study establishes an approach for uncovering low-abundance viral lncRNAs and characterizes a replicating TCV lncRNA. Similar investigations on human-pathogenic (+) RNA viruses could yield novel therapeutic targets. 
  3. Background: Long non-coding Ribonucleic Acids (lncRNAs) can be localized to different cellular compartments, such as the nuclear and the cytoplasmic regions. Their biological functions are influenced by the region of the cell where they are located. Compared to the vast number of lncRNAs, only a relatively small proportion have annotations regarding their subcellular localization. It would be helpful if those few annotated lncRNAs could be leveraged to develop predictive models for localization of other lncRNAs. Methods: Conventional computational methods use q-mer profiles from lncRNA sequences and train machine learning models such as support vector machines and logistic regression with the profiles. These methods focus on the exact q-mer. Given possible sequence mutations and other uncertainties in genomic sequences and their role in biological function, a consideration of these variabilities might improve our ability to model lncRNAs and their localization. Thus, we build on inexact q-mers and use machine learning/deep learning techniques to study three specific problems in lncRNA subcellular localization, namely, prediction of lncRNA localization using inexact q-mers, the issue of whether lncRNA localization is cell-type-specific, and the notion of switching (lncRNA) genes. Results: We performed our analysis using data on lncRNA localization across 15 cell lines. Our results showed that using inexact q-mers (with q = 6) can improve the lncRNA localization prediction performance compared to using exact q-mers. Further, we showed that lncRNA localization, in general, is not cell-line-specific. We also identified a category of lncRNAs that switch cellular compartments between different cell lines (we call them switching lncRNAs). These switching lncRNAs complicate the problem of predicting lncRNA localization using machine learning models, showing that lncRNA localization is still a major challenge. 
  4. Abstract Previously, we have shown that apoplastic wash fluid (AWF) purified from Arabidopsis leaves contains small RNAs (sRNAs). To investigate whether these sRNAs are encapsulated inside extracellular vesicles (EVs), we treated EVs isolated from Arabidopsis leaves with the protease trypsin and RNase A, which should degrade RNAs located outside EVs but not those located inside. These analyses revealed that apoplastic RNAs are mostly located outside and are associated with proteins. Further analyses of these extracellular RNAs (exRNAs) revealed that they include both sRNAs and long noncoding RNAs (lncRNAs), including circular RNAs (circRNAs). We also found that exRNAs are highly enriched in the posttranscriptional modification N6-methyladenine (m6A). Consistent with this, we identified a putative m6A-binding protein in AWF, GLYCINE-RICH RNA-BINDING PROTEIN 7 (GRP7), as well as the sRNA-binding protein ARGONAUTE2 (AGO2). These two proteins coimmunoprecipitated with lncRNAs, including circRNAs. Mutation of GRP7 or AGO2 caused changes in both the sRNA and lncRNA content of AWF, suggesting that these proteins contribute to the secretion and/or stabilization of exRNAs. We propose that exRNAs located outside of EVs mediate host-induced gene silencing, rather than RNA located inside EVs. 
  5. SEquence Evaluation through k-mer Representation (SEEKR) is a method of sequence comparison that uses sequence substrings called k-mers to quantify the nonlinear similarity between nucleic acid species. We describe the development of new functions within SEEKR that enable end-users to estimate P-values that ascribe statistical significance to SEEKR-derived similarities, as well as visualize different aspects of k-mer similarity. We apply the new functions to identify chromatin-enriched lncRNAs that contain XIST-like sequence features, and we demonstrate the utility of applying SEEKR on lncRNA fragments to identify potential RNA-protein interaction domains. We also highlight ways in which SEEKR can be applied to augment studies of lncRNA conservation, and we outline the best practice of visualizing RNA-seq read density to evaluate support for lncRNA annotations before their in-depth study in cell types of interest. 