Title: Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests
Abstract

As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyze a large survey that captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there is a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm impact migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-methods approaches may help to shed light on factors contributing to migration and non-migration.
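The survival-analysis step mentioned above treats each household's time to first migration as a duration that is right-censored when the household had not migrated by the time of the survey. The abstract does not say which survival model was used, so the following is only a minimal sketch of one standard estimator (Kaplan-Meier) on made-up data; the function name `kaplan_meier` and the eight example households are hypothetical and do not come from the study.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one event.

    durations: time until the event or until censoring, one entry per household.
    observed:  True if the event (first migration) occurred, False if censored.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of at-risk households that did not migrate at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical data: years until first migration for eight households.
# migrated=False means the household had not migrated by the survey date.
years = [2, 3, 3, 5, 6, 6, 8, 10]
migrated = [True, True, False, True, False, True, True, False]

curve = kaplan_meier(years, migrated)
for t, s in curve:
    print(f"year {t}: estimated share not yet migrated = {s:.3f}")
```

Censored households still count in the risk set up to their censoring time, which is why they cannot simply be dropped from the analysis.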

 
Award ID(s):
1716909
NSF-PAR ID:
10368154
Publisher / Repository:
Springer Science + Business Media
Journal Name:
Regional Environmental Change
Volume:
22
Issue:
2
ISSN:
1436-3798
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Non-migration is an adaptive strategy that has received little attention in environmental migration studies. We explore the leveraging factors of non-migration decisions of communities at risk in coastal Bangladesh, where exposure to both rapid- and slow-onset natural disasters is high. We apply the Protection Motivation Theory (PMT) to empirical data and assess how threat perception and coping appraisal influence migration decisions in farming communities suffering from salinization of cropland. This study consists of data collected through quantitative household surveys (n = 200) and semi-structured interviews from four villages in southwest coastal Bangladesh. Results indicate that most respondents are unwilling to migrate, despite better economic conditions and reduced environmental risk in other locations. Land ownership, social connectedness, and household economic strength are the strongest predictors of non-migration decisions. This study is the first to use the PMT to understand migration-related behaviour and the findings are relevant for policy planning in vulnerable regions where exposure to climate-related risks is high but populations are choosing to remain in place.
  2. Abstract

    Water monitoring in households provides occupants and utilities with key information to support water conservation and efficiency in the residential sector. High costs, intrusiveness, and practical complexity limit appliance-level monitoring via sub-meters on every water-consuming end use in households. Non-intrusive machine learning methods have emerged as promising techniques to analyze observed data collected by a single meter at the inlet of the house and estimate the disaggregated contribution of each water end use. While fine temporal resolution data allow for more accurate end-use disaggregation, there is an inevitable increase in the amount of data that needs to be stored and analyzed. To explore this tradeoff and advance previous studies based on synthetic data, we first collected 1 s resolution indoor water use data from a residential single-point smart water metering system installed at a four-person household, as well as ground-truth end-use labels based on a water diary recorded over a 4-week study period. Second, we trained a supervised machine learning model (random forest classifier) to classify six water end-use categories across different temporal resolutions and two different model calibration scenarios. Finally, we evaluated the results based on three different performance metrics (micro, weighted, and macro F1 scores). Our findings show that data collected at 1- to 5-s intervals allow for better end-use classification (weighted F-score higher than 0.85), particularly for toilet events; however, certain water end uses (e.g., shower and washing machine events) can still be predicted with acceptable accuracy even at coarser resolutions, up to 1 min, provided that these end-use categories are well represented in the training dataset. Overall, our study provides insights for further water sustainability research and widespread deployment of smart water meters.
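The three averaging schemes named above (micro, weighted, and macro F1) can give different pictures when end-use classes are unevenly represented. The sketch below is an illustration on invented labels, not the study's data or code; the helper `f1_scores` and the ten example events are hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Compute micro, macro, and weighted F1 for a multi-class problem."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        # Micro: pool all counts, so frequent classes dominate.
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Macro: unweighted mean, so every class counts equally.
        "macro": sum(per_class.values()) / len(labels),
        # Weighted: mean of per-class F1 weighted by true-class frequency.
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


# Hypothetical end-use labels for ten water-use events.
y_true = ["toilet"] * 4 + ["shower"] * 2 + ["faucet"] * 4
y_pred = ["toilet", "toilet", "toilet", "faucet",
          "shower", "faucet",
          "faucet", "faucet", "faucet", "toilet"]

scores = f1_scores(y_true, y_pred)
print(scores)
```

For single-label classification, micro F1 equals overall accuracy, while macro F1 drops more sharply when a rare class is predicted poorly.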

     
  3. Abstract

    Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique.

    To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics.

    We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration.

    Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
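To make the discrimination-versus-calibration distinction concrete, the sketch below computes one metric of each kind for a rare species: the true skill statistic (sensitivity + specificity - 1) and the Brier score (mean squared error of the predicted probabilities). The data, the 0.5 threshold, and the function names are assumptions for illustration only and are not taken from the study.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def true_skill_statistic(y_true, probs, threshold=0.5):
    """Sensitivity + specificity - 1 after thresholding the probabilities."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, probs):
        predicted_present = p >= threshold
        if y == 1 and predicted_present:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_present:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Hypothetical rare species: present at 2 of 8 reference sites (prevalence 0.25).
presence = [1, 1, 0, 0, 0, 0, 0, 0]
predicted = [0.8, 0.4, 0.3, 0.1, 0.6, 0.2, 0.1, 0.1]

brier = brier_score(presence, predicted)
tss = true_skill_statistic(presence, predicted)
print(f"Brier score: {brier:.3f}, TSS: {tss:.3f}")
```

The Brier score uses the probabilities directly, whereas the true skill statistic depends on the chosen threshold, so a model can rank well on one and poorly on the other.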

     
  4. Denef, Vincent J. (Ed.)
    ABSTRACT Unconventional oil and gas (UOG) extraction is increasing exponentially around the world, as new technological advances have provided cost-effective methods to extract hard-to-reach hydrocarbons. While UOG has increased the energy output of some countries, past research indicates potential impacts in nearby stream ecosystems as measured by geochemical and microbial markers. Here, we utilized a robust data set that combines 16S rRNA gene amplicon sequencing (DNA), metatranscriptomics (RNA), geochemistry, and trace element analyses to establish the impact of UOG activity in 21 sites in northern Pennsylvania. These data were also used to design predictive machine learning models to determine the UOG impact on streams. We identified multiple biomarkers of UOG activity and contributors of antimicrobial resistance within the order Burkholderiales. Furthermore, we identified expressed antimicrobial resistance genes, land coverage, geochemistry, and specific microbes as strong predictors of UOG status. Of the predictive models constructed (n = 30), 15 had accuracies higher than expected by chance and area under the curve values above 0.70. The supervised random forest models with the highest accuracy were constructed with 16S rRNA gene profiles, metatranscriptomics active microbial composition, metatranscriptomics active antimicrobial resistance genes, land coverage, and geochemistry (n = 23). The models identified the most important features within those data sets for classifying UOG status. These findings identified specific shifts in gene presence and expression, as well as geochemical measures, that can be used to build robust models to identify impacts of UOG development.

    IMPORTANCE The environmental implications of unconventional oil and gas extraction are only recently starting to be systematically recorded. Our research shows the utility of microbial communities paired with geochemical markers to build strong predictive random forest models of unconventional oil and gas activity and the identification of key biomarkers. Microbial communities, their transcribed genes, and key biomarkers can be used as sentinels of environmental changes. Slight changes in microbial function and composition can be detected before chemical markers of contamination. Potential contamination, specifically from biocides, is especially concerning due to its potential to promote antibiotic resistance in the environment. Additionally, as microbial communities facilitate the bulk of nutrient cycling in the environment, small changes may have long-term repercussions. Supervised random forest models can be used to identify changes in those communities, greatly enhance our understanding of what such impacts entail, and inform environmental management decisions.
  5. The Bangladesh Environment and Migration Survey (BEMS) collects detailed retrospective information about migration trips in southwest Bangladesh, including the first, last, and second-to-last trips to internal destinations, India, and other international destinations. BEMS collects information about the year, origin, destination, and duration of all trips. Furthermore, BEMS includes information on migration and livelihood histories, socioeconomic conditions, agricultural resources and practices, disasters and perceptions about the environment, and self-reported health.

    Dataset 1 is a household-level file with information about household composition, economic and migratory activity of household members, land ownership/usage, business ownership, household environmental perceptions, environmental conditions, agricultural activities, and physical and psychological health/well-being of household members. Dataset 2 is an individual-level file containing details of internal and international migration trips, as well as measures of economic and social activity during those trips. It also contains information provided by household heads, spouses, and other migrants in the household. Dataset 3 is an individual-level data file that provides general demographic information and brief migration history for each member of a surveyed household. It also includes health information for the head of household and spouse.

    The purpose of the Bangladesh Environment and Migration Survey (BEMS) is to understand patterns and processes of contemporary internal and international migration in Bangladesh. The project derives from a multi-disciplinary research effort that will generate data on the characteristics and behavior of Bangladeshi migrants and non-migrants and the communities in which they live, and examine whether and how environmental stressors (e.g., salinity, riverbank erosion) affect patterns of migration in this region. The household ethnosurvey is administered to self-identified household heads and spouses in randomly selected households. After gathering social, demographic, and economic information on households and their members, interviewers will collect basic information on each person's first, second-to-last, and last (or most recent) internal and international migration trips. From household heads and spouses, they will compile migration histories and administer a detailed series of questions about a selection of these trips, focusing on economic livelihoods, methods of moving, connections to other migrants, and use of health and school services. In addition to detailed migration histories, the BEMS will collect information about household wealth, physical conditions of households and communities, and perceptions of environmental conditions. It will also gather some self-reported health information about household members, such as recent illnesses, use of health services, height and weight, and diet. The BEMS is closely modeled on the sampling design and ethnosurvey used in the Mexican Migration Project. The BEMS data were collected in 2019 at 20 research sites, from a random sample of 200 households in each site. BEMS data include a total of 4,000 households in communities broadly covering the southwest region of Bangladesh.

    Households in southwest Bangladesh. Smallest Geographic Unit: Administrative region

    For more information about this study, please visit the ISEE Bangladesh project website.

     