Abstract Small proteins (SPs) are typically defined as eukaryotic proteins shorter than 100 amino acids and prokaryotic proteins shorter than 50 amino acids. Historically, they were disregarded because of the arbitrary size thresholds used to define proteins. However, recent research has revealed the existence of many SPs and their crucial roles. Despite this, the identification of SPs and the elucidation of their functions are still in their infancy. To pave the way for future SP studies, we briefly introduce the limitations of and advancements in experimental techniques for SP identification. We then provide an overview of available computational tools for SP identification, their constraints, and their evaluation. Additionally, we highlight existing resources for SP research. This survey aims to stimulate further exploration of SPs and encourage the development of more sophisticated computational tools for SP identification in prokaryotes and microbiomes.
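The size-based definition in the abstract can be illustrated with a minimal sketch. The function name `is_small_protein` and the domain labels are hypothetical, introduced only for illustration; the only values taken from the abstract are the two length thresholds (shorter than 100 amino acids for eukaryotes, shorter than 50 for prokaryotes).

```python
# Hypothetical helper illustrating the length-based definition of small proteins.
# The thresholds come from the abstract; the names are assumptions.
LENGTH_LIMITS = {
    "eukaryote": 100,   # eukaryotic SPs: shorter than 100 amino acids
    "prokaryote": 50,   # prokaryotic SPs: shorter than 50 amino acids
}


def is_small_protein(sequence: str, domain: str) -> bool:
    """Return True if an amino acid sequence falls under the SP length threshold."""
    if domain not in LENGTH_LIMITS:
        raise ValueError(f"unknown domain: {domain!r}")
    # "Shorter than" is read as a strict inequality; empty sequences are rejected.
    return 0 < len(sequence) < LENGTH_LIMITS[domain]
```

A length filter like this only reflects the conventional definition. It says nothing about whether a short open reading frame is actually translated, which is the harder identification problem the survey addresses.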
PSPI: A deep learning approach for prokaryotic small protein identification
Small proteins (SPs) are pivotal in various cellular functions such as immunity, defense, and communication. Despite their significance, their identification is still in its infancy. Existing computational tools are tailored to specific eukaryotic species, leaving only a few options for SP identification in prokaryotes. In addition, these existing tools still have suboptimal performance in SP identification. To fill this gap, we introduce PSPI, a deep learning-based approach designed specifically for predicting prokaryotic SPs. We showed that PSPI had a high accuracy in predicting generalized sets of prokaryotic SPs and sets specific to the human metagenome. Compared with three existing tools, PSPI was faster and showed greater precision, sensitivity, and specificity not only for prokaryotic SPs but also for eukaryotic ones. We also observed that the incorporation of (n,k)-mers greatly enhances the performance of PSPI, suggesting that many SPs may contain short linear motifs. The PSPI tool, which is freely available at https://www.cs.ucf.edu/∼xiaoman/tools/PSPI/, will be useful for identifying prokaryotic SPs, and it can be trained to identify other types of SPs as well.
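For readers unfamiliar with the three metrics named in the comparison, the following is a minimal sketch of their standard definitions from a binary confusion matrix. It is not taken from PSPI; the function name and the counts in the usage note are hypothetical.

```python
# Standard binary-classification metrics; not PSPI's own evaluation code.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, sensitivity, and specificity from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 when a metric is undefined (no samples in the denominator).
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),    # predicted SPs that are true SPs
        "sensitivity": ratio(tp, tp + fn),  # true SPs that were recovered
        "specificity": ratio(tn, tn + fp),  # non-SPs that were correctly rejected
    }
```

For example, with made-up counts of 8 true positives, 2 false positives, 9 true negatives, and 1 false negative, `classification_metrics(8, 2, 9, 1)` gives a precision of 0.8, a sensitivity of 8/9, and a specificity of 9/11.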
- Award ID(s):
- 2015838
- PAR ID:
- 10640670
- Publisher / Repository:
- Frontiers Media SA
- Date Published:
- Journal Name:
- Frontiers in Genetics
- Volume:
- 15
- ISSN:
- 1664-8021
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract Database peptide search is the primary computational technique for identifying peptides from mass spectrometry (MS) data. Graphics processing unit (GPU) computing is now ubiquitous in the current generation of high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in the existing GPU-accelerated methods, resulting in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU-acceleration of the modern database peptide search algorithms on supercomputers. Our experimentation shows that GiCOPS exhibits a 1.2× to 5× speed improvement over its CPU-only predecessor, HiCOPS, and over 10× improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics including computations per second, occupancy rate, memory workload, branch efficiency and shared memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bounded algorithmic pipelines can also be extended to accelerate the existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at: https://github.com/pcdslab/gicops.
-
Bacteroides, the prominent bacteria in the human gut, play a crucial role in degrading complex polysaccharides. Their abundance is influenced by phages belonging to the Crassvirales order. Despite identifying over 600 Crassvirales genomes computationally, only a few have been successfully isolated. Continued efforts in isolation of more Crassvirales genomes can provide insights into phage-host evolution and infection mechanisms. We focused on wastewater samples, as potential sources of phages infecting various Bacteroides hosts. Sequencing, assembly, and characterization of isolated phages revealed 14 complete genomes belonging to three novel Crassvirales species infecting Bacteroides cellulosilyticus WH2. These species, Kehishuvirus sp. ‘tikkala’ strain Bc01, Kolpuevirus sp. ‘frurule’ strain Bc03, and ‘Rudgehvirus jaberico’ strain Bc11, spanned two families and three genera, displaying a broad range of virion productions. Upon testing all successfully cultured Crassvirales species and their respective bacterial hosts, we discovered that they do not exhibit co-evolutionary patterns with their bacterial hosts. Furthermore, we observed variations in gene similarity, with greater shared similarity observed within genera. However, despite belonging to different genera, the three novel species shared a unique structural gene that encodes the tail spike protein. When investigating the relationship between this gene and host interaction, we discovered evidence of purifying selection, indicating its functional importance. Moreover, our analysis demonstrated that this tail spike protein binds to the TonB-dependent receptors present on the bacterial host surface. Combining these observations, our findings provide insights into phage-host interactions and present three Crassvirales species as an ideal system for controlled infectivity experiments on one of the most dominant members of the human enteric virome.
-
Fraser, Claire M. (Ed.) ABSTRACT Metagenomics is a powerful method for interpreting the ecological roles and physiological capabilities of mixed microbial communities. Yet, many tools for processing metagenomic data are neither designed to consider eukaryotes nor are they built for an increasing amount of sequence data. EukHeist is an automated pipeline to retrieve eukaryotic and prokaryotic metagenome-assembled genomes (MAGs) from large-scale metagenomic sequence data sets. We developed the EukHeist workflow to specifically process large amounts of both metagenomic and/or metatranscriptomic sequence data in an automated and reproducible fashion. Here, we applied EukHeist to the large-size fraction data (0.8–2,000 µm) from Tara Oceans to recover both eukaryotic and prokaryotic MAGs, which we refer to as TOPAZ (Tara Oceans Particle-Associated MAGs). The TOPAZ MAGs consisted of >900 environmentally relevant eukaryotic MAGs and >4,000 bacterial and archaeal MAGs. The bacterial and archaeal TOPAZ MAGs expand upon the phylogenetic diversity of likely particle- and host-associated taxa. We use these MAGs to demonstrate an approach to infer the putative trophic mode of the recovered eukaryotic MAGs. We also identify ecological cohorts of co-occurring MAGs, which are driven by specific environmental factors and putative host-microbe associations. These data together add to a number of growing resources of environmentally relevant eukaryotic genomic information. Complementary and expanded databases of MAGs, such as those provided through scalable pipelines like EukHeist, stand to advance our understanding of eukaryotic diversity through increased coverage of genomic representatives across the tree of life.
IMPORTANCE Single-celled eukaryotes play ecologically significant roles in the marine environment, yet fundamental questions about their biodiversity, ecological function, and interactions remain. Environmental sequencing enables researchers to document naturally occurring protistan communities, without culturing bias, yet metagenomic and metatranscriptomic sequencing approaches cannot separate individual species from communities. To more completely capture the genomic content of mixed protistan populations, we can create bins of sequences that represent the same organism (metagenome-assembled genomes [MAGs]). We developed the EukHeist pipeline, which automates the binning of population-level eukaryotic and prokaryotic genomes from metagenomic reads. We show exciting insight into what protistan communities are present and their trophic roles in the ocean. Scalable computational tools, like EukHeist, may accelerate the identification of meaningful genetic signatures from large data sets and complement researchers’ efforts to leverage MAG databases for addressing ecological questions, resolving evolutionary relationships, and discovering potentially novel biodiversity.
-
Abstract Conservation of migratory species exhibiting wide‐ranging and multidimensional behaviors is challenged by management efforts that only utilize horizontal movements or produce static spatial–temporal products. For the deep‐diving, critically endangered eastern Pacific leatherback turtle, tools that predict where turtles have high risks of fisheries interactions are urgently needed to prevent further population decline. We incorporated horizontal–vertical movement model results with spatial–temporal kernel density estimates and threat data (gear‐specific fishing) to develop monthly maps of spatial risk. Specifically, we applied multistate hidden Markov models to a biotelemetry data set (n = 28 leatherback tracks, 2004–2007). Tracks with dive information were used to characterize turtle behavior as belonging to 1 of 3 states (transiting, residential with mixed diving, and residential with deep diving). Recent fishing effort data from Global Fishing Watch were integrated with predicted behaviors and monthly space‐use estimates to create maps of relative risk of turtle–fisheries interactions. Drifting (pelagic) longline fishing gear had the highest average monthly fishing effort in the study region, and risk indices showed this gear to also have the greatest potential for high‐risk interactions with turtles in a residential, deep‐diving behavioral state. Monthly relative risk surfaces for all gears and behaviors were added to South Pacific TurtleWatch (SPTW) (https://www.upwell.org/sptw), a dynamic management tool for this leatherback population. These modifications will refine SPTW's capability to provide important predictions of potential high‐risk bycatch areas for turtles undertaking specific behaviors. Our results demonstrate how multidimensional movement data, spatial–temporal density estimates, and threat data can be used to create a unique conservation tool. These methods serve as a framework for incorporating behavior into similar tools for other aquatic, aerial, and terrestrial taxa with multidimensional movement behaviors.

