Title: RFtest: A Robust and Flexible Community-Level Test for Microbiome Data Powerfully Detects Phylogenetically Clustered Signals
Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose the “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test that uses the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to assess whether microbial communities are associated with the outcome variables.
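The abstract describes RFtest only at a high level (a permutation test whose statistic is the random forest generalization error). The following is a minimal sketch of that general idea, not the authors' implementation. The function names `permutation_test` and `loo_1nn_error` are hypothetical, and the statistic used here is a stand-in: a leave-one-out 1-nearest-neighbour error rate replaces the random forest error so that the example runs with the Python standard library alone.

```python
import random


def loo_1nn_error(X, y):
    """Stand-in statistic (assumption, not RFtest's): leave-one-out
    1-nearest-neighbour misclassification rate on feature rows X, labels y."""
    errors = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue
            d = sum((a - b) ** 2 for a, b in zip(xi, xj))
            if d < best_d:
                best_j, best_d = j, d
        errors += y[best_j] != y[i]
    return errors / len(X)


def permutation_test(X, y, error_fn, n_perm=199, seed=0):
    """Permutation p-value for 'X is associated with y'.

    A low prediction error signals association, so the p-value is the share
    of label permutations whose error is at most the observed error (with
    the usual +1 correction so that p is never exactly zero).
    """
    rng = random.Random(seed)
    observed = error_fn(X, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any X-y link while keeping label counts
        if error_fn(X, y_perm) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: two well-separated groups, so the observed error is 0.
X = [[0.1 * i] for i in range(10)] + [[5.0 + 0.1 * i] for i in range(10)]
y = [0] * 10 + [1] * 10
observed_error, p_value = permutation_test(X, y, loo_1nn_error)
```

To move closer to the setup the abstract describes, `error_fn` could instead return a random forest error estimate, for example one minus the out-of-bag score of a scikit-learn `RandomForestClassifier` fitted with `oob_score=True`. Whether RFtest uses out-of-bag error specifically is an assumption here; the abstract says only "generalization error".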
Award ID(s):
2113360
NSF-PAR ID:
10324492
Journal Name:
Frontiers in Genetics
Volume:
12
ISSN:
1664-8021
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Passive ultra high frequency (UHF) radio frequency identification (RFID) tags have the potential to find ubiquitous use in indoor object tracking, localization, and contact tracing. We propose a machine learning-based method for RFID indoor localization using a pattern reconfigurable UHF RFID reader antenna array. The received signal strength indicator (RSSI) values (from 10,000 tags) recorded at the reader antenna units are used as features to evaluate the machine learning models with a train-test split of 75%-25%. The training and testing data is generated by a wireless ray tracing simulator. Five machine learning models: random forest regressor, decision tree regressor, Nu support vector regressor, k nearest regressor, and kernel ridge regressor are compared. Random forest regressor has the lowest localization error both in terms of average Euclidean distance (AED) and root-mean-square error (RMSE). For random forest regressor, localization error results show that 90% of the tags are within 1 meter of their true position, and 67% are within 50 cm of their true position based on Euclidean distance. 
  2. Introduction: Recent studies found that wearable exoskeletons can reduce physical effort and fatigue during squatting. In particular, subject-specific assistance helped to significantly reduce physical effort, shown by reduced metabolic cost, using human-in-the-loop optimization of the exoskeleton parameters. However, measuring metabolic cost using respiratory data has limitations, such as long estimation times, presence of noise, and user discomfort. A recent study suggests that foot contact forces can address those challenges and be used as an alternative metric to the metabolic cost to personalize wearable robot assistance during walking. Methods: In this study, we propose that foot center of pressure (CoP) features can be used to estimate the metabolic cost of squatting using a machine learning method. Five subjects’ foot pressure and metabolic cost data were collected as they performed squats with an ankle exoskeleton at different assistance conditions in our prior study. In this study, we extracted statistical features from the CoP squat trajectories and fed them as input to a random forest model, with the metabolic cost as the output. Results: The model predicted the metabolic cost with a mean error of 0.55 W/kg on unseen test data, with a high correlation (r = 0.89, p < 0.01) between the true and predicted cost. The features of the CoP trajectory in the medial-lateral direction of the foot (xCoP), which relate to ankle eversion-inversion, were found to be important and highly correlated with metabolic cost. Conclusion: Our findings indicate that increased ankle eversion (outward roll of the ankle), which reflects a suboptimal squatting strategy, results in higher metabolic cost. Higher ankle eversion has been linked with the etiology of chronic lower limb injuries. Hence, a CoP-based cost function in human-in-the-loop optimization could offer several advantages, such as reduced estimation time, injury risk mitigation, and better user comfort. 
  3. Abstract

    Skin is the largest mammalian organ and the first defensive barrier against the external environment. The skin and fur of mammals can host a wide variety of ectoparasites, many of which are phylogenetically diverse, specialized, and specifically adapted to their hosts. Among hematophagous dipteran parasites, volatile organic compounds (VOCs) are known to serve as important attractants, leading parasites to compatible sources of blood meals. VOCs have been hypothesized to be mediated by host‐associated bacteria, which may thereby indirectly influence parasitism. Host‐associated bacteria may also influence parasitism directly, as has been observed in interactions between animal gut microbiota and malarial parasites. Hypotheses relating bacterial symbionts and eukaryotic parasitism have rarely been tested among humans and domestic animals, and to our knowledge have not been tested in wild vertebrates. In this study, we used Afrotropical bats, hematophagous ectoparasitic bat flies, and haemosporidian (malarial) parasites vectored by bat flies as a model to test the hypothesis that the vertebrate host microbiome is linked to parasitism in a wild system. We identified significant correlations between bacterial community composition of the skin and dipteran ectoparasite prevalence across four major bat lineages, as well as striking differences in skin microbial network characteristics between ectoparasitized and nonectoparasitized bats. We also identified links between the oral microbiome and presence of malarial parasites among miniopterid bats. Our results support the hypothesis that microbial symbionts may serve as indirect mediators of parasitism among eukaryotic hosts and parasites.

     
  4. Abstract

    Viruses of the phylum Nucleocytoviricota, often referred to as “giant viruses,” are prevalent in various environments around the globe and play significant roles in shaping eukaryotic diversity and activities in global ecosystems. Given the extensive phylogenetic diversity within this viral group and the highly complex composition of their genomes, taxonomic classification of giant viruses, particularly incomplete metagenome-assembled genomes (MAGs), can present a considerable challenge. Here we developed TIGTOG (Taxonomic Information of Giant viruses using Trademark Orthologous Groups), a machine learning-based approach to predict the taxonomic classification of novel giant virus MAGs based on profiles of protein family content. We applied a random forest algorithm to a training set of 1531 quality-checked, phylogenetically diverse Nucleocytoviricota genomes using pre-selected sets of giant virus orthologous groups (GVOGs). The classification models were predictive of viral taxonomic assignments with a cross-validation accuracy of 99.6% at the order level and 97.3% at the family level. We found that no individual GVOGs or genome features significantly influenced the algorithm’s performance or the models’ predictions, indicating that classification predictions were based on a comprehensive genomic signature, which reduced the necessity of a fixed set of marker genes for taxonomic assigning purposes. Our classification models were validated with an independent test set of 823 giant virus genomes with varied genomic completeness and taxonomy and demonstrated an accuracy of 98.6% and 95.9% at the order and family level, respectively. Our results indicate that protein family profiles can be used to accurately classify large DNA viruses at different taxonomic levels and provide a fast and accurate method for the classification of giant viruses. This approach could easily be adapted to other viral groups.

     
  5. Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution, providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or other model fitting technique such as multiple linear regression to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross sensitivities with nontarget pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood. 
We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse data set. We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multisite approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting and confirm prior results that transfer is a significant source of both bias and standard error. Linear regression, on the other hand, although it exhibits relatively high error, does not degrade much in transfer. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration to lower the cost of training and better tolerate transfer. We contribute a new neural network architecture model termed split-NN that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional two- and four-layer neural networks, and random forest models. 
Depending on the training configuration, compared to random forest the split-NN method reduced error 0 %–11 % for NO2 and 6 %–13 % for O3. 