Title: Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests
Abstract: As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyze a large survey that captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there is a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm affect migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-method approaches may help to shed light on factors contributing to migration and non-migration.
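The covariate-ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code and uses no survey data: the data are synthetic, the column names (`signal`, `noise_1`, ...) are hypothetical, the trees are depth-one stumps on a fixed threshold grid to keep the example short, and permutation importance is an assumed stand-in for whichever importance measure the paper used.

```python
import random

THRESHOLDS = [i / 10 for i in range(1, 10)]  # fixed split grid (a simplification)

def majority(labels):
    """Most common 0/1 label; ties and empty lists fall back to fixed defaults."""
    return int(sum(labels) * 2 >= len(labels)) if labels else 0

def fit_stump(X, y, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in THRESHOLDS:
            left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def fit_forest(X, y, n_trees=25, mtry=3, seed=0):
    """Bagged stumps, each grown on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of rows
        feats = rng.sample(range(p), mtry)           # random subset of covariates
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = sum(r if x[f] > t else l for f, t, l, r in forest)
    return int(votes * 2 > len(forest))

def accuracy(forest, X, y):
    return sum(predict(forest, xi) == yi for xi, yi in zip(X, y)) / len(y)

def permutation_importance(forest, X, y, names, seed=1):
    """Accuracy drop when one column is shuffled (scored on the training rows for brevity)."""
    rng = random.Random(seed)
    base = accuracy(forest, X, y)
    scores = {}
    for j, name in enumerate(names):
        col = [xi[j] for xi in X]
        rng.shuffle(col)
        X_perm = [xi[:j] + [v] + xi[j + 1:] for xi, v in zip(X, col)]
        scores[name] = base - accuracy(forest, X_perm, y)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Synthetic data: only the first column drives the (hypothetical) outcome.
data_rng = random.Random(42)
names = ["signal", "noise_1", "noise_2", "noise_3"]
X = [[data_rng.random() for _ in names] for _ in range(120)]
y = [int(xi[0] > 0.5) for xi in X]

ranking = permutation_importance(fit_forest(X, y), X, y, names)
for name, score in ranking.items():
    print(f"{name}: {score:.3f}")
```

A real analysis of roughly 2000 covariates would use a full random forest implementation and score importance on held-out or out-of-bag rows. The ranking only says which variables matter for prediction, not in which direction, which is why the abstract pairs it with a survival model.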
Abstract: Non-migration is an adaptive strategy that has received little attention in environmental migration studies. We explore the leveraging factors of non-migration decisions of communities at risk in coastal Bangladesh, where exposure to both rapid- and slow-onset natural disasters is high. We apply the Protection Motivation Theory (PMT) to empirical data and assess how threat perception and coping appraisal influence migration decisions in farming communities suffering from salinization of cropland. This study consists of data collected through quantitative household surveys (n = 200) and semi-structured interviews from four villages in southwest coastal Bangladesh. Results indicate that most respondents are unwilling to migrate, despite better economic conditions and reduced environmental risk in other locations. Land ownership, social connectedness, and household economic strength are the strongest predictors of non-migration decisions. This study is the first to use the PMT to understand migration-related behaviour, and the findings are relevant for policy planning in vulnerable regions where exposure to climate-related risks is high but populations are choosing to remain in place.
Carrico, Amanda; Donato, Katharine M.
(ICPSR - Interuniversity Consortium for Political and Social Research)
The Bangladesh Environment and Migration Survey (BEMS) collects detailed retrospective information about migration trips in southwest Bangladesh, including the first, last, and second-to-last trips to internal destinations, India, and other international destinations. BEMS collects information about the year, origin, destination, and duration of all trips. Furthermore, BEMS includes information on migration and livelihood histories, socioeconomic conditions, agricultural resources and practices, disasters and perceptions about environment, and self-reported health.
Dataset 1 is a household-level file with information about household composition, economic and migratory activity of household members, land ownership/usage, business ownership, household environmental perceptions, environmental conditions, agricultural activities, and physical and psychological health/well-being of household members. Dataset 2 is an individual-level file containing details of internal and international migration trips, as well as measures of economic and social activity during those trips. It also contains information provided by household heads, spouses, and other migrants in the household. Dataset 3 is an individual-level data file that provides general demographic information and brief migration history for each member of a surveyed household. It also includes health information for the head of household and spouse.
The purpose of the Bangladesh Environment and Migration Survey (BEMS) is to understand patterns and processes of contemporary internal and international migration in Bangladesh. The project derives from a multi-disciplinary research effort that will generate data on the characteristics and behavior of Bangladeshi migrants and non-migrants and the communities in which they live, and examine whether and how environmental stressors (e.g., salinity, riverbank erosion) affect patterns of migration in this region.
The household ethnosurvey is administered to self-identified household heads and spouses in randomly selected households. After gathering social, demographic, and economic information on households and their members, interviewers will collect basic information on each person's first, second-to-last, and last (or most recent) internal and international migration trips. From household heads and spouses, they will compile migration histories and administer a detailed series of questions about a selection of these trips, focusing on economic livelihoods, methods of moving, connections to other migrants, and use of health and school services. In addition to detailed migration histories, the BEMS will collect information about household wealth, physical conditions of households and communities, and perceptions of environmental conditions. It will also gather some self-reported health information about household members, such as recent illnesses, use of health services, height and weight, and diet. The BEMS is closely modeled on the sampling design and ethnosurvey used in the Mexican Migration Project. The BEMS data were collected in 2019 in 20 research sites, from a random sample of 200 households in each site. BEMS data include a total of 4,000 households in communities broadly covering the southwest region of Bangladesh.
Households in southwest Bangladesh.
Smallest Geographic Unit: Administrative region
For more information about this study, please visit the ISEE Bangladesh project website.
Zhang, Lujun; Wang, Yanshan; Chen, Jingwen; Chen, Jun
(Frontiers in Genetics)
Random forest is considered one of the most successful machine learning algorithms and has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test using the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to ascertain whether microbial communities are associated with the outcome variables.
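The idea of a permutation test built on a prediction-error statistic can be sketched as follows. This is not the RFtest implementation: to keep the example self-contained, the leave-one-out error of a 1-nearest-neighbour classifier stands in for the random forest generalization error, and the data, function names, and parameters are all hypothetical.

```python
import random

def loo_error(X, y):
    """Leave-one-out error of a 1-nearest-neighbour classifier.

    Stand-in for a forest's generalization error; any error estimate fits here.
    """
    errors = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        errors += y[nearest] != yi
    return errors / len(y)

def permutation_p_value(X, y, error_fn, n_perm=199, seed=0):
    """Compare the observed error with errors obtained after shuffling the outcome."""
    rng = random.Random(seed)
    observed = error_fn(X, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)                      # break any X-y association
        if error_fn(X, y_perm) <= observed:      # a low error is the extreme direction
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0

# Synthetic example: two groups separated along the first feature only.
data_rng = random.Random(7)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(4.0 * label, 1.0), data_rng.gauss(0.0, 1.0)] for label in y]

observed, p = permutation_p_value(X, y, loo_error)
print(f"observed error = {observed:.3f}, permutation p-value = {p:.3f}")
```

Because the null distribution comes from reshuffled outcomes, the test needs no distributional assumptions about the error statistic. The price is refitting the predictor once per permutation.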
Romero, Julian D; Feijoo-Garcia, Miguel A; Nanda, Gaurav; Newell, Brittany; Magana, Alejandra J
(Big Data and Cognitive Computing)
Leung, Carson
(Ed.)
Examining the effectiveness of machine learning techniques in analyzing engineering students’ decision-making processes through topic modeling during simulation-based design tasks is crucial for advancing educational methods and tools. This study presents a comparative analysis of different supervised and unsupervised machine learning techniques for topic modeling, along with human validation. It contributes by evaluating the effectiveness of these techniques in identifying nuanced topics within the argumentation framework and by improving computational methods for assessing students’ abilities and performance levels based on their informed decisions. The study examined the decision-making processes of engineering students as they participated in a simulation-based design challenge. During this task, students were prompted to use an argumentation framework to articulate their claims, evidence, and reasoning by recording their informed design decisions in a design journal. The study combined qualitative and computational methods to analyze the students’ design journals and ensured the accuracy of the findings through the researchers’ review and interpretations of the results. Different machine learning models, including random forest, SVM, and K-nearest neighbors (KNN), were tested for multilabel regression, using preprocessing techniques such as TF-IDF, GloVe, and BERT embeddings. Additionally, hyperparameter optimization and model interpretability were explored, along with models like RNNs with LSTM, XGBoost, and LightGBM. The results demonstrate that both supervised and unsupervised machine learning models effectively identified nuanced topics within the argumentation framework used during the design challenge of designing a zero-energy home for a Midwestern city using a CAD/CAE simulation platform. Notably, XGBoost exhibited superior predictive accuracy in estimating topic proportions, highlighting its potential for broader application in engineering education.
Chatter, a self-excited vibration phenomenon, is a critical challenge in high-speed machining operations, affecting tool life, product surface quality, and overall process efficiency. While machine learning models trained on simulated data have shown promise in detecting chatter, their real-world applicability remains uncertain due to discrepancies between simulated and actual machining environments. The primary goal of this study is to bridge the gap between simulation-based machine learning models and real-world applications by developing and validating a Random Forest-based chatter detection system. This research focuses on improving manufacturing efficiency through reliable chatter detection by integrating Operational Modal Analysis (OMA), Receptance Coupling Substructure Analysis (RCSA), and Transfer Learning (TL). The study applies a Random Forest classification model trained on over 140,000 simulated machining datasets and uses OMA, RCSA, and TL to adapt the model for real-world operational data. The model is validated against 1600 real-world machining datasets, achieving an accuracy of 86.1%, with strong precision and recall scores. The results demonstrate the model’s robustness and potential for practical implementation in industrial settings, while highlighting challenges such as sensor noise and variability in machining conditions. This work advances the use of predictive analytics in machining processes, offering a data-driven solution to improve manufacturing efficiency through more reliable chatter detection.
Best, Kelsea, Gilligan, Jonathan, Baroud, Hiba, Carrico, Amanda, Donato, Katharine, and Mallick, Bishawjit. Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests. Regional Environmental Change 22.2 Web. doi:10.1007/s10113-022-01915-1.
Best, Kelsea, Gilligan, Jonathan, Baroud, Hiba, Carrico, Amanda, Donato, Katharine, & Mallick, Bishawjit. Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests. Regional Environmental Change, 22 (2). https://doi.org/10.1007/s10113-022-01915-1
Best, Kelsea, Gilligan, Jonathan, Baroud, Hiba, Carrico, Amanda, Donato, Katharine, and Mallick, Bishawjit. "Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests". Regional Environmental Change 22 (2). Country unknown/Code not available: Springer Science + Business Media. https://doi.org/10.1007/s10113-022-01915-1. https://par.nsf.gov/biblio/10368154.
@article{osti_10368154,
place = {Country unknown/Code not available},
title = {Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests},
url = {https://par.nsf.gov/biblio/10368154},
DOI = {10.1007/s10113-022-01915-1},
abstractNote = {Abstract As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyzing a large survey which captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there exists a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm impact migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-methods approaches may help to shed light on factors contributing to migration and non-migration.},
journal = {Regional Environmental Change},
volume = {22},
number = {2},
publisher = {Springer Science + Business Media},
author = {Best, Kelsea and Gilligan, Jonathan and Baroud, Hiba and Carrico, Amanda and Donato, Katharine and Mallick, Bishawjit},
}