Title: Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
Small ribonucleic acid (sRNA) sequences are noncoding RNA (ncRNA) sequences, 50–500 nucleotides long, that play an important role in regulating transcription and translation within a bacterial cell. Identifying sRNA sequences within an organism’s genome is therefore essential to understanding the impact of these RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we treated sRNA prediction as an imbalanced binary classification problem, in which the minority positive class (sRNAs) must be distinguished from the majority negative class, and performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in the Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with a conformity test for Benford’s law. Third, we applied six traditional classification algorithms to the sRNA features and assessed classification performance with seven metrics, varying the positive-to-negative instance ratio and using stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than classification with either an individual feature or a single feature group in terms of the area under the precision-recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without optimized hyperparameter values, performed better than the other five algorithms with their specific optimal parameter settings. In future work, we plan to apply XGBoost to the large body of published sRNAs in bacterial genomes and compare its classification performance with that of recent machine learning models.
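The evaluation setup described in the abstract (stratified folds, AUPR on imbalanced labels) can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names `average_precision` and `stratified_folds` are hypothetical, AUPR is approximated by the step-wise average precision, tied scores are assumed not to occur, and the fold assignment is a simple unshuffled round-robin within each class.

```python
from typing import Dict, List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a positive instance (sRNA), 0 for a negative one.
    scores: classifier scores; higher means "more likely positive".
    Assumes scores contain no ties (an illustrative simplification).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank instances from the highest score to the lowest.
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            # Precision at this rank; each positive adds a recall step of 1/n_pos.
            total += true_pos / rank
    return total / n_pos


def stratified_folds(labels: Sequence[int], k: int) -> List[int]:
    """Return a fold index per instance so each fold keeps the class ratio.

    Instances of each class are dealt round-robin into k folds (no shuffling).
    """
    seen: Dict[int, int] = {}
    folds: List[int] = []
    for y in labels:
        count = seen.get(y, 0)
        folds.append(count % k)
        seen[y] = count + 1
    return folds


# Toy imbalanced data: 2 positives among 10 instances (hypothetical scores).
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print("average precision:", average_precision(toy_labels, toy_scores))

# 1:9 positive-to-negative ratio split into 10 stratified folds.
ratio_labels = [1] * 20 + [0] * 180
fold_ids = stratified_folds(ratio_labels, k=10)
print("positives in fold 0:",
      sum(1 for y, f in zip(ratio_labels, fold_ids) if f == 0 and y == 1))
```

In this sketch a classifier that ranks instances at random is expected to score close to the positive prevalence, so the same AUPR value means different things at different imbalance ratios; that is one reason to report the positive-to-negative ratio alongside the metric.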
Award ID(s):
2050232
PAR ID:
10390399
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Bioinformatics and Biology Insights
Volume:
16
ISSN:
1177-9322
Page Range / eLocation ID:
117793222211183
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Bacteria contain a diverse set of RNAs to provide tight regulation of gene expression in response to environmental stimuli. Bacterial small RNAs (sRNAs) work in conjunction with protein cofactors to bind complementary mRNA sequences in the cell, leading to up‐ or downregulation of protein synthesis. In vivo imaging of sRNAs can aid in understanding their spatiotemporal dynamics in real time, which inspires new ways to manipulate these systems for a variety of applications including synthetic biology and therapeutics. Current methods for sRNA imaging are quite limited in vivo and do not provide real‐time information about fluctuations in sRNA levels. Herein, we describe our efforts toward the development of an RNA‐based fluorescent biosensor for bacterial sRNA both in vitro and in vivo. We validated these sensors for three different bacterial sRNAs in Escherichia coli and demonstrated that the designs provide a bright, sequence‐specific signal output in response to exogenous and endogenous RNA targets.
  2. Parkhill, Julian (Ed.)
    ABSTRACT RNA transcripts are potential therapeutic targets, yet bacterial transcripts have uncharacterized biodiversity. We developed an algorithm for transcript prediction called tp.py and used it to predict transcripts (mRNA and other RNAs) in Escherichia coli K12 and E2348/69 strains (Bacteria:gamma-Proteobacteria), Listeria monocytogenes strains Scott A and RO15 (Bacteria:Firmicute), Pseudomonas aeruginosa strains SG17M and NN2 (Bacteria:gamma-Proteobacteria), and Haloferax volcanii (Archaea:Halobacteria). From >5 million E. coli K12 and >3 million E. coli E2348/69 newly generated Oxford Nanopore Technologies direct RNA sequencing reads, 2,487 K12 mRNAs and 1,844 E2348/69 mRNAs were predicted, with the K12 mRNAs containing more than half of the predicted E. coli K12 proteins. While the number of predicted transcripts varied by strain based on the amount of sequence data used, across all strains examined, the predicted average size of the mRNAs was 1.6–1.7 kbp, while the median sizes of the 5′- and 3′-untranslated regions (UTRs) were 30–90 bp. Given the lack of bacterial and archaeal transcript annotation, most predictions were of novel transcripts, but we also predicted many previously characterized mRNAs and ncRNAs, including post-transcriptionally generated transcripts and small RNAs associated with pathogenesis in the E. coli E2348/69 LEE pathogenicity islands. We predicted small transcripts in the 100–200 bp range as well as >10 kbp transcripts for all strains, with the longest transcript for two of the seven strains being the nuo operon transcript, and for another two strains it was a phage/prophage transcript. This quick, easy, and reproducible method will facilitate the presentation of transcripts and UTR predictions alongside coding sequences and protein predictions in bacterial genome annotation as important resources for the research community. IMPORTANCE Our understanding of bacterial and archaeal genes and genomes is largely focused on proteins since there have only been limited efforts to describe bacterial/archaeal RNA diversity. This contrasts with studies on the human genome, where transcripts were sequenced prior to the release of the human genome over two decades ago. We developed software for the quick, easy, and reproducible prediction of bacterial and archaeal transcripts from Oxford Nanopore Technologies direct RNA sequencing data. These predictions are urgently needed for more accurate studies examining bacterial/archaeal gene regulation, including regulation of virulence factors, and for the development of novel RNA-based therapeutics and diagnostics to combat bacterial pathogens, like those with extreme antimicrobial resistance.
  3. Abstract Bacteria use a multi-layered regulatory strategy to precisely and rapidly tune gene expression in response to environmental cues. Small RNAs (sRNAs) form an important layer of gene expression control and most act post-transcriptionally to control translation and stability of mRNAs. We have shown that at least five different sRNAs in Escherichia coli regulate the cyclopropane fatty acid synthase (cfa) mRNA. These sRNAs bind at different sites in the long 5’ untranslated region (UTR) of cfa mRNA and previous work suggested that they modulate RNase E-dependent mRNA turnover. Recently, the cfa 5’ UTR was identified as a site of Rho-dependent transcription termination, leading us to hypothesize that the sRNAs might also regulate cfa transcription elongation. In this study we find that a pyrimidine-rich region flanked by sRNA binding sites in the cfa 5’ UTR is required for premature Rho-dependent termination. We discovered that both the activating sRNA RydC and repressing sRNA CpxQ regulate cfa primarily by modulating Rho-dependent termination of cfa transcription, with only a minor effect on RNase E-mediated turnover of cfa mRNA. A stem-loop structure in the cfa 5’ UTR sequesters the pyrimidine-rich region required for Rho-dependent termination. CpxQ binding to the 5’ portion of the stem increases Rho-dependent termination whereas RydC binding downstream of the stem decreases termination. These results reveal the versatile mechanisms sRNAs use to regulate target gene expression at transcriptional and post-transcriptional levels and demonstrate that regulation by sRNAs in long UTRs can involve modulation of transcription elongation. Importance Bacteria respond to stress by rapidly regulating gene expression. Regulation can occur through control of messenger RNA (mRNA) production (transcription elongation), stability of mRNAs, or translation of mRNAs. Bacteria can use small RNAs (sRNAs) to regulate gene expression at each of these steps, but we often do not understand how this works at a molecular level. In this study, we find that sRNAs in Escherichia coli regulate gene expression at the level of transcription elongation by promoting or inhibiting transcription termination by a protein called Rho. These results help us understand new molecular mechanisms of gene expression regulation in bacteria.
  4. Abstract Trans-species RNA interference (RNAi) occurs naturally when small RNAs (sRNAs) silence genes in species different from their origin. This phenomenon has been observed between plants and various organisms including fungi, animals and other plant species. Understanding the mechanisms used in natural cases of trans-species RNAi, such as sRNA processing and movement, will enable more effective development of crop protection methods using host-induced gene silencing (HIGS). Recent progress has been made in understanding the mechanisms of cell-to-cell and long-distance movement of sRNAs within individual plants. This increased understanding of endogenous plant sRNA movement may be translatable to trans-species sRNA movement. Here, we review diverse cases of natural trans-species RNAi focusing on current theories regarding intercellular and long-distance sRNA movement. We also touch on trans-species sRNA evolution, highlighting its research potential and its role in improving the efficacy of HIGS. 
  5. Abstract Several protein families participate in the biogenesis and function of small RNAs (sRNAs) in plants. Those with primary roles include Dicer-like (DCL), RNA-dependent RNA polymerase (RDR), and Argonaute (AGO) proteins. Protein families such as double-stranded RNA-binding (DRB), SERRATE (SE), and SUPPRESSION OF SILENCING 3 (SGS3) act as partners of DCL or RDR proteins. Here, we present curated annotations and phylogenetic analyses of seven sRNA pathway protein families performed on 196 species in the Viridiplantae (aka green plants) lineage. Our results suggest that the RDR3 proteins emerged earlier than RDR1/2/6. RDR6 is found in filamentous green algae and all land plants, suggesting that the evolution of RDR6 proteins coincides with the evolution of phased small interfering RNAs (siRNAs). We traced the origin of the 24-nt reproductive phased siRNA-associated DCL5 protein back to the American sweet flag (Acorus americanus), the earliest diverged, extant monocot species. Our analyses of AGOs identified multiple duplication events of AGO genes that were lost, retained, or further duplicated in subgroups, indicating that the evolution of AGOs is complex in monocots. The results also refine the evolution of several clades of AGO proteins, such as AGO4, AGO6, AGO17, and AGO18. Analyses of nuclear localization signal sequences and catalytic triads of AGO proteins shed light on the regulatory roles of diverse AGOs. Collectively, this work generates a curated and evolutionarily coherent annotation for gene families involved in plant sRNA biogenesis/function and provides insights into the evolution of major sRNA pathways. 