Title: RFtest: A Robust and Flexible Community-Level Test for Microbiome Data Powerfully Detects Phylogenetically Clustered Signals
Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test that uses the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to assess whether microbial communities are associated with the outcome variables.
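The idea of a permutation test built on a forest's generalization error can be illustrated with a minimal sketch. This is not the authors' RFtest implementation. It assumes a binary outcome, uses the out-of-bag (OOB) error as the estimate of generalization error, and replaces full decision trees with one-split stumps so that the example needs only the Python standard library. All function names (predict, fit_stump, oob_error, rf_permutation_test, make_toy_data) and parameters (n_trees, mtry, n_perm, shift) are hypothetical.

```python
import random

def predict(stump, x):
    """Classify one sample with a (feature, threshold, sign) stump."""
    j, thr, sign = stump
    return 1 if sign * (x[j] - thr) > 0 else 0

def fit_stump(X, y, rows, feats):
    """Among candidate features, pick the class-mean midpoint split
    with the fewest errors on the bootstrap rows."""
    best = None
    for j in feats:
        v0 = [X[i][j] for i in rows if y[i] == 0]
        v1 = [X[i][j] for i in rows if y[i] == 1]
        if not v0 or not v1:
            return None  # bootstrap sample contains a single class
        m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
        stump = (j, (m0 + m1) / 2, 1 if m1 >= m0 else -1)
        err = sum(predict(stump, X[i]) != y[i] for i in rows)
        if best is None or err < best[0]:
            best = (err, stump)
    return best[1]

def oob_error(X, y, n_trees=30, mtry=2, seed=0):
    """Out-of-bag misclassification rate of a bagged stump ensemble
    (a simplified stand-in for a random forest)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = [[0, 0] for _ in range(n)]  # OOB votes for class 0 / class 1
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(range(p), min(mtry, p))       # random feature subset
        stump = fit_stump(X, y, rows, feats)
        if stump is None:
            continue
        in_bag = set(rows)
        for i in range(n):
            if i not in in_bag:
                votes[i][predict(stump, X[i])] += 1
    scored = [i for i in range(n) if sum(votes[i]) > 0]
    if not scored:
        return 0.5  # no OOB votes at all; treat as chance level
    wrong = sum((votes[i][1] > votes[i][0]) != (y[i] == 1) for i in scored)
    return wrong / len(scored)

def rf_permutation_test(X, y, n_perm=99, seed=0, **kw):
    """Return (observed OOB error, permutation p-value).
    Small error is evidence of association, so the p-value counts
    permutations whose error is at most the observed one."""
    rng = random.Random(seed)
    observed = oob_error(X, y, seed=seed, **kw)
    hits = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any link between X and y
        if oob_error(X, y_perm, seed=seed, **kw) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def make_toy_data(n=40, p=4, shift=4.0, seed=1):
    """Toy data: only feature 0 differs between the two classes."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0, 1) + (shift * y[i] if j == 0 else 0.0)
          for j in range(p)] for i in range(n)]
    return X, y
```

Calling rf_permutation_test(*make_toy_data(), mtry=3) returns the observed OOB error together with a p-value of the form (hits + 1) / (n_perm + 1). The added 1 in numerator and denominator keeps the p-value strictly positive, which is a common convention for permutation tests. A real analysis would swap in a full random forest and, for microbiome data, features that reflect the phylogenetic structure.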
Award ID(s):
2113360
PAR ID:
10324492
Journal Name:
Frontiers in Genetics
Volume:
12
ISSN:
1664-8021
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Passive ultra high frequency (UHF) radio frequency identification (RFID) tags have the potential to find ubiquitous use in indoor object tracking, localization, and contact tracing. We propose a machine learning-based method for RFID indoor localization using a pattern reconfigurable UHF RFID reader antenna array. The received signal strength indicator (RSSI) values (from 10,000 tags) recorded at the reader antenna units are used as features to evaluate the machine learning models with a train-test split of 75%-25%. The training and testing data are generated by a wireless ray tracing simulator. Five machine learning models are compared: random forest regressor, decision tree regressor, Nu support vector regressor, k-nearest neighbors regressor, and kernel ridge regressor. The random forest regressor has the lowest localization error in terms of both average Euclidean distance (AED) and root-mean-square error (RMSE). For the random forest regressor, localization error results show that 90% of the tags are within 1 meter of their true position, and 67% are within 50 cm of their true position, based on Euclidean distance.
  2. The microbiota has proved to be a critical factor in many diseases, and researchers have been using microbiome data for disease prediction. However, models trained on one independent microbiome study may not be easily applicable to other independent studies due to the high level of variability in microbiome data. In this study, we developed a method for improving the generalizability and interpretability of machine learning models for predicting three different diseases (colorectal cancer, Crohn’s disease, and immunotherapy response) using nine independent microbiome datasets. Our method involves combining a smaller dataset with a larger dataset, and we found that using at least 25% of the target samples in the source data resulted in improved model performance. We identified random forest as our top model and employed feature selection to identify common and important taxa for disease prediction across the different studies. Our results suggest that this leveraging scheme is a promising approach for improving the accuracy and interpretability of machine learning models for predicting diseases based on microbiome data.
  3. The Human Microbiome Project was a research programme that successfully identified associations between microbial species and healthy or diseased individuals. However, a major challenge identified was the absence of model systems for studying host–microbiome interactions, which would increase our capacity to uncover molecular interactions, understand organ-specificity and discover new microbiome-altering health interventions. Caenorhabditis elegans has been a pioneering model organism for over 70 years but was largely studied in the absence of a microbiome. Recently, ecological sampling of wild nematodes has uncovered a large amount of natural genetic diversity as well as a slew of associated microbiota. The field has now explored the interactions of C. elegans with its associated gut microbiome, a defined and non-random microbial community, highlighting its suitability for dissecting host–microbiome interactions. This core microbiome is being used to study the impact of host genetics, age and stressors on microbiome composition. Furthermore, single microbiome species are being used to dissect molecular interactions between microbes and the animal gut. Being amenable to health-altering genetic and non-genetic interventions, C. elegans has emerged as a promising system to generate and test new hypotheses regarding host–microbiome interactions, with the potential to uncover novel paradigms relevant to other systems. This article is part of the theme issue ‘Sculpting the microbiome: how host factors determine and respond to microbial colonization’.
  4. Summary: Habitat fragmentation is a leading cause of biodiversity and ecosystem function loss in the Anthropocene. Despite the importance of plant–microbiome interactions to ecosystem productivity, we have limited knowledge of how fragmentation affects microbiomes and even less knowledge of its consequences for microbial interactions with plants. Combining field surveys, microbiome sequencing, manipulative experiments, and random forest models, we investigated fragmentation legacy effects on soil microbiomes in imperiled pine rocklands, tested how compositional shifts across 14 fragmentation‐altered soil microbiomes affected performance and resource allocation of three native plant species, and identified fragmentation‐responding microbial families underpinning plant performance. Legacies of habitat fragmentation were associated with significant changes in microbial diversity and composition (across three of four community axes). Experiments showed plants often strongly benefited from the microbiome’s presence, but fragmentation‐associated changes in microbiome composition also significantly affected plant performance and resource allocation across all seven metrics examined. Finally, random forest models identified ten fungal and six bacterial families important for plant performance that changed significantly with fragmentation. Our findings not only support the existence of significant fragmentation effects on natural microbiomes, but also demonstrate for the first time that fragmentation‐associated changes in microbiomes can have meaningful consequences for native plant performance and investment.
  5. Greene, Casey S. (Ed.)
    ABSTRACT UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another (beta diversity). Striped UniFrac recently added the ability to split the problem into many independent subproblems, exhibiting nearly linear scaling but suffering from memory contention. Here, we adapt UniFrac to graphics processing units using OpenACC, enabling greater than 1,000× computational improvement, and apply it to 307,237 samples, the largest 16S rRNA V4 uniformly preprocessed microbiome data set analyzed to date. IMPORTANCE UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another. Here, we adapt UniFrac to operate on graphics processing units, enabling a 1,000× computational improvement. To highlight this advance, we perform what may be the largest microbiome analysis to date, applying UniFrac to 307,237 16S rRNA V4 microbiome samples preprocessed with Deblur. These scaling improvements turn UniFrac into a real-time tool for common data sets and unlock new research questions as more microbiome data are collected. 