Objectives: This study introduces MetaBIDx, a computational method designed to enhance species prediction in metagenomic environments. The method addresses the challenge of accurate species identification in complex microbiomes, which is due to the large number of generated reads and the ever-expanding number of bacterial genomes. Bacterial identification is essential for disease diagnosis and tracing outbreaks associated with microbial infections. Methods: MetaBIDx utilizes a modified Bloom filter for efficient indexing of reference genomes and incorporates a novel strategy for reducing false positives by clustering species based on their genomic coverages by identified reads. The approach was evaluated and compared with several well-established tools across various datasets. Precision, recall, and F1-score were used to quantify the accuracy of species prediction. Results: MetaBIDx demonstrated superior performance compared to other tools, especially in terms of precision and F1-score. The application of clustering based on approximate coverages significantly improved precision in species identification, effectively minimizing false positives. We further demonstrated that other methods can also benefit from our approach to removing false positives by clustering species based on approximate coverages. Conclusion: With a novel approach to reducing false positives and the effective use of a modified Bloom filter to index species, MetaBIDx represents an advancement in metagenomic analysis. The findings suggest that the proposed approach could also benefit other metagenomic tools, indicating its potential for broader application in the field. The study lays the groundwork for future improvements in computational efficiency and the expansion of microbial databases. 
                        more » 
                        « less   
                    
                            
                            DL-TODA: A Deep Learning Tool for Omics Data Analysis
                        
                    
    
            Metagenomics is a technique for genome-wide profiling of microbiomes; this technique generates billions of DNA sequences called reads. Given the multiplication of metagenomic projects, computational tools are necessary to enable the efficient and accurate classification of metagenomic reads without needing to construct a reference database. The program DL-TODA presented here aims to classify metagenomic reads using a deep learning model trained on over 3000 bacterial species. A convolutional neural network architecture originally designed for computer vision was applied for the modeling of species-specific features. Using synthetic testing data simulated with 2454 genomes from 639 species, DL-TODA was shown to classify nearly 75% of the reads with high confidence. The classification accuracy of DL-TODA was over 0.98 at taxonomic ranks above the genus level, making it comparable with Kraken2 and Centrifuge, two state-of-the-art taxonomic classification tools. DL-TODA also achieved an accuracy of 0.97 at the species level, which is higher than 0.93 by Kraken2 and 0.85 by Centrifuge on the same test set. Application of DL-TODA to the human oral and cropland soil metagenomes further demonstrated its use in analyzing microbiomes from diverse environments. Compared to Centrifuge and Kraken2, DL-TODA predicted distinct relative abundance rankings and is less biased toward a single taxon. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1553211
- PAR ID:
- 10403813
- Date Published:
- Journal Name:
- Biomolecules
- Volume:
- 13
- Issue:
- 4
- ISSN:
- 2218-273X
- Page Range / eLocation ID:
- 585
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent from taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldomly applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies. Importance Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.more » « less
- 
            Abstract Background In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read. Results Existing tools for the read classification problem are often alignment or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a “screen”) of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read’s similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy. Conclusions The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. We then compare these approaches to existing alignment, index and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at https://github.com/arun96/sketching .more » « less
- 
            Abstract High taxonomic diversity in non-industrial human gut microbiomes is often interpreted as beneficial; however, it is unclear if taxonomic diversity engenders ecological resilience (i.e. community stability and metabolic continuity). We estimate resilience through genus and species-level richness, phylogenetic diversity, and evenness in short-chain fatty acid (SCFA) production among a global gut metagenome panel of 12 populations (n = 451) representing industrial and non-industrial lifestyles, including novel metagenomic data from Burkina Faso (n = 90). We observe significantly higher genus-level resilience in non-industrial populations, while SCFA production in industrial populations is driven by a few phylogenetically closely related species (belonging toBacteroidesandClostridium), meaning industrial microbiomes have low resilience potential. Additionally, database bias obfuscates resilience estimates, as we were 2–5 times more likely to identify SCFA-encoding species in industrial microbiomes compared to non-industrial. Overall, we find high phylogenetic diversity, richness, and evenness of bacteria encoding SCFAs in non-industrial gut microbiomes, signaling high potential for resilience in SCFA production, despite database biases that limit metagenomic analysis of non-industrial populations.more » « less
- 
            Deep learning (DL) models have demonstrated state-of-the-art performance in the classification of diagnostic imaging in oncology. However, DL models for medical images can be compromised by adversarial images, where pixel values of input images are manipulated to deceive the DL model. To address this limitation, our study investigates the detectability of adversarial images in oncology using multiple detection schemes. Experiments were conducted on thoracic computed tomography (CT) scans, mammography, and brain magnetic resonance imaging (MRI). For each dataset we trained a convolutional neural network to classify the presence or absence of malignancy. We trained five DL and machine learning (ML)-based detection models and tested their performance in detecting adversarial images. Adversarial images generated using projected gradient descent (PGD) with a perturbation size of 0.004 were detected by the ResNet detection model with an accuracy of 100% for CT, 100% for mammogram, and 90.0% for MRI. Overall, adversarial images were detected with high accuracy in settings where adversarial perturbation was above set thresholds. Adversarial detection should be considered alongside adversarial training as a defense technique to protect DL models for cancer imaging classification from the threat of adversarial images.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    