As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyzing a large survey which captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there exists a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more more »
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- Regional Environmental Change
- Springer Science + Business Media
- Sponsoring Org:
- National Science Foundation
More Like this
In harm’s way: Non-migration decisions of people at risk of slow-onset coastal hazards in BangladeshAbstract Non-migration is an adaptive strategy that has received little attention in environmental migration studies. We explore the leveraging factors of non-migration decisions of communities at risk in coastal Bangladesh, where exposure to both rapid- and slow-onset natural disasters is high. We apply the Protection Motivation Theory (PMT) to empirical data and assess how threat perception and coping appraisal influences migration decisions in farming communities suffering from salinization of cropland. This study consists of data collected through quantitative household surveys ( n = 200) and semi-structured interviews from four villages in southwest coastal Bangladesh. Results indicate that most respondents are unwilling to migrate, despite better economic conditions and reduced environmental risk in other locations. Land ownership, social connectedness, and household economic strength are the strongest predictors of non-migration decisions. This study is the first to use the PMT to understand migration-related behaviour and the findings are relevant for policy planning in vulnerable regions where exposure to climate-related risks is high but populations are choosing to remain in place.
Is smart water meter temporal resolution a limiting factor to residential water end-use classification? A quantitative experimental analysis
Water monitoring in households provides occupants and utilities with key information to support water conservation and efficiency in the residential sector. High costs, intrusiveness, and practical complexity limit appliance-level monitoring via sub-meters on every water-consuming end use in households. Non-intrusive machine learning methods have emerged as promising techniques to analyze observed data collected by a single meter at the inlet of the house and estimate the disaggregated contribution of each water end use. While fine temporal resolution data allow for more accurate end-use disaggregation, there is an inevitable increase in the amount of data that needs to be stored and analyzed. To explore this tradeoff and advance previous studies based on synthetic data, we first collected 1 s resolution indoor water use data from a residential single-point smart water metering system installed at a four-person household, as well as ground-truth end-use labels based on a water diary recorded over a 4-week study period. Second, we trained a supervised machine learning model (random forest classifier) to classify six water end-use categories across different temporal resolutions and two different model calibration scenarios. Finally, we evaluated the results based on three different performance metrics (micro, weighted, and macro F1 scores). Our findings showmore »
Geochemistry and Multiomics Data Differentiate Streams in Pennsylvania Based on Unconventional Oil and Gas ActivityDenef, Vincent J. (Ed.)ABSTRACT Unconventional oil and gas (UOG) extraction is increasing exponentially around the world, as new technological advances have provided cost-effective methods to extract hard-to-reach hydrocarbons. While UOG has increased the energy output of some countries, past research indicates potential impacts in nearby stream ecosystems as measured by geochemical and microbial markers. Here, we utilized a robust data set that combines 16S rRNA gene amplicon sequencing (DNA), metatranscriptomics (RNA), geochemistry, and trace element analyses to establish the impact of UOG activity in 21 sites in northern Pennsylvania. These data were also used to design predictive machine learning models to determine the UOG impact on streams. We identified multiple biomarkers of UOG activity and contributors of antimicrobial resistance within the order Burkholderiales . Furthermore, we identified expressed antimicrobial resistance genes, land coverage, geochemistry, and specific microbes as strong predictors of UOG status. Of the predictive models constructed ( n = 30), 15 had accuracies higher than expected by chance and area under the curve values above 0.70. The supervised random forest models with the highest accuracy were constructed with 16S rRNA gene profiles, metatranscriptomics active microbial composition, metatranscriptomics active antimicrobial resistance genes, land coverage, and geochemistry ( n = 23). The models identified themore »
RFtest: A Robust and Flexible Community-Level Test for Microbiome Data Powerfully Detects Phylogenetically Clustered SignalsRandom forest is considered as one of the most successful machine learning algorithms, which has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test using the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to ascertain whether microbial communities are associated or not with the outcome variables.
Background Heart failure is a leading cause of mortality and morbidity worldwide. Acute heart failure, broadly defined as rapid onset of new or worsening signs and symptoms of heart failure, often requires hospitalization and admission to the intensive care unit (ICU). This acute condition is highly heterogeneous and less well-understood as compared to chronic heart failure. The ICU, through detailed and continuously monitored patient data, provides an opportunity to retrospectively analyze decompensation and heart failure to evaluate physiological states and patient outcomes. Objective The goal of this study is to examine the prevalence of cardiovascular risk factors among those admitted to ICUs and to evaluate combinations of clinical features that are predictive of decompensation events, such as the onset of acute heart failure, using machine learning techniques. To accomplish this objective, we leveraged tele-ICU data from over 200 hospitals across the United States. Methods We evaluated the feasibility of predicting decompensation soon after ICU admission for 26,534 patients admitted without a history of heart failure with specific heart failure risk factors (ie, coronary artery disease, hypertension, and myocardial infarction) and 96,350 patients admitted without risk factors using remotely monitored laboratory, vital signs, and discrete physiological measurements. Multivariate logistic regression andmore »