Title: Improved Ensemble Predictive Modeling Techniques for Linked Social Media and Survey Data Sets Subject to Mismatch Error
Modern predictive modeling tools, such as random forests (and related ensemble methods), have become almost ubiquitous in research applications involving innovative combinations of survey methodology and data science. However, an important potential flaw in the widespread application of these methods has not received sufficient research attention to date. Researchers at the junction of computer and survey science frequently leverage linked data sets to study relationships between variables, where the techniques used to link two (or more) data sets may be probabilistic and non-deterministic in nature. If frequent mismatch errors occur when linking two (or more) data sets, the commonly desired outputs of predictive modeling tools describing relationships between variables in the linked data sets (e.g., variable importance, confusion matrices, RMSE, etc.) may be negatively affected, and the true predictive performance of these tools may not be realized. We demonstrate a new methodology based on mixture modeling that is designed to adjust modern predictive modeling tools for the presence of mismatch errors in a linked data set. We evaluate the performance of this new methodology in an application involving the use of observed Twitter/X activity measures and predicted socio-demographic features of Twitter/X users to accurately predict linked measures of political ideology that were collected in a designed survey, where respondents were asked for consent to link any Twitter/X activity data to their survey responses (exactly, based on Twitter/X handles). We find that the new methodology, which we have implemented in R, is able to largely recover results that would have been seen prior to the introduction of mismatch errors in the linked data set.
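To make the mixture-modeling idea concrete, the following is a minimal sketch in Python. It is not the authors' R implementation, and it does not adjust a random forest. It assumes a simple linear regression in which a mismatched record's outcome is unrelated to its predictor, and fits a two-component mixture (correct match vs. mismatch) by expectation-maximization. All function names, the simulated data, and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_linked_data(n=2000, mismatch_rate=0.3, seed=0):
    """Simulate y = 1 + 2x + noise, then shuffle y among a random subset
    of records to mimic mismatch error in a linked file (assumed setup)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
    mismatched = rng.random(n) < mismatch_rate
    idx = np.flatnonzero(mismatched)
    y_linked = y.copy()
    y_linked[idx] = y[rng.permutation(idx)]  # wrong outcomes for these rows
    return x, y_linked, mismatched


def normal_pdf(resid, var):
    """Density of a zero-mean normal with variance `var` at `resid`."""
    return np.exp(-0.5 * resid**2 / var) / np.sqrt(2.0 * np.pi * var)


def fit_mixture_regression(x, y, n_iter=200):
    """EM for a two-component mixture:
    correct match: y | x ~ N(b0 + b1 * x, sigma2)
    mismatch:      y ~ N(mu, tau2), independent of x
    Returns (beta, estimated mismatch rate, posterior match probabilities)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # naive fit as a start
    sigma2 = np.var(y - X @ beta)
    mu, tau2 = y.mean(), y.var()  # marginal of y, held fixed
    alpha = 0.5                   # initial guess for the mismatch rate
    for _ in range(n_iter):
        # E-step: posterior probability that each record is a correct match
        f_match = (1.0 - alpha) * normal_pdf(y - X @ beta, sigma2)
        f_mis = alpha * normal_pdf(y - mu, tau2)
        w = f_match / (f_match + f_mis)
        # M-step: weighted least squares, weighted variance, mixing weight
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        sigma2 = np.sum(w * (y - X @ beta) ** 2) / np.sum(w)
        alpha = 1.0 - w.mean()
    return beta, alpha, w


if __name__ == "__main__":
    x, y_linked, mismatched = simulate_linked_data()
    naive_slope = np.polyfit(x, y_linked, 1)[0]
    beta, alpha, w = fit_mixture_regression(x, y_linked)
    print("naive slope:", naive_slope)
    print("mixture-adjusted slope:", beta[1])
    print("estimated mismatch rate:", alpha)
```

In this toy setting the naive fit is attenuated toward zero because mismatched records carry no signal, while the mixture fit down-weights records that look like mismatches. Extending the same weighting idea to ensemble learners is the subject of the paper itself and is not attempted here.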
Award ID(s):
2120318
PAR ID:
10647991
Author(s) / Creator(s):
; ;
Publisher / Repository:
GESIS - Leibniz Institute for the Social Sciences
Date Published:
Journal Name:
methods, data, analyses (mda)
ISSN:
2190-4936
Subject(s) / Keyword(s):
modern predictive modeling; ensemble methods; record linkage; mismatch error; mixture modeling; linked survey and social media data
Format(s):
Medium: X Size: 17 pages Other: application/pdf
Size(s):
17 pages
Sponsoring Org:
National Science Foundation
More Like this
  1. Stochastic Watershed Models (SWMs) are emerging tools in hydrologic modeling used to propagate uncertainty into model predictions by adding samples of model error to deterministic simulations. One of the most promising uses of SWMs is uncertainty propagation for hydrologic simulations under climate change. However, a core challenge is that the historical predictive uncertainty may not correctly characterize the error distribution under future climate. For example, the frequency of physical processes (e.g., snow accumulation and melt) may change under climate change, and so too may the frequency of errors associated with those processes. In this work, we explore for the first time non‐stationarity in hydrologic model errors under climate change in an idealized experimental design. We fit one hydrologic model to historical observations, and then fit a second model to the simulations of the first, treating the first model as the true hydrologic system. We then force both models with climate change impacted meteorology and investigate changes to the error distribution between the models. We develop a hybrid machine learning method that maps model state variables to predictive errors, allowing for non‐stationary error distributions based on changes in the frequency of model states. We find that this procedure provides an internally consistent methodology to overcome stationarity assumptions in error modeling and offers an important advance for implementing SWMs under climate change. We test this method on three hydrologically distinct watersheds in California (Feather River, Sacramento River, Calaveras River), finding that the hybrid model performs best in larger and less flashy basins.
  2. Evaluating whether hydrological models are right for the right reasons demands reproducible model benchmarking and diagnostics that evaluate not just statistical predictive model performance but also internal processes. Such model benchmarking and diagnostic efforts will benefit from standardized methods and ready-to-use toolkits. Using the Jupyter platform, this work presents HydroBench, a model-agnostic benchmarking tool consisting of three sets of metrics: 1) common statistical predictive measures, 2) hydrological signature-based process metrics, including a new time-linked flow duration curve and 3) information-theoretic diagnostics that measure the flow of information among model variables. As a test case, HydroBench was applied to compare two model products (calibrated and uncalibrated) of the National Hydrologic Model - Precipitation Runoff Modeling System (NHM-PRMS) at the Cedar River watershed, WA, United States. Although the uncalibrated model has the highest predictive performance, particularly for high flows, the signature-based diagnostics showed that the model overestimates low flows and poorly represents the recession processes. Elucidating why low flows may have been overestimated, the information-theoretic diagnostics indicated a higher flow of information from precipitation to snowmelt to streamflow in the uncalibrated model compared to the calibrated model, where information flowed more directly from precipitation to streamflow. This test case demonstrated the capability of HydroBench in process diagnostics and model predictive and functional performance evaluations, along with their tradeoffs. Having such a model benchmarking tool not only provides modelers with a comprehensive model evaluation system but also provides an open-source tool that can further be developed by the hydrological community. 
  3. The advent of the information age has revolutionized data collection and has led to a rapid expansion of available data sources. Methods of data integration are indispensable when a question of interest cannot be addressed using a single data source. Record linkage (RL) is at the forefront of such data integration efforts. Incentives for sharing linked data for secondary analysis have prompted the need for methodology accounting for possible errors at the RL stage. Mismatch error is a common consequence resulting from the use of nonunique or noisy identifiers at that stage. In this paper, we present a framework to enable valid postlinkage inference in the secondary analysis setting in which only the linked file is given. The proposed framework covers a variety of statistical models and can flexibly incorporate information about the underlying RL process. We propose a mixture model for linked records whose two components reflect distributions conditional on match status, i.e. correct or false match. Regarding inference, we develop a method based on composite likelihood and the expectation-maximization algorithm that is implemented in the R package pldamixture. Extensive simulations and case studies involving contemporary RL applications corroborate the effectiveness of our framework.
  4. Flooding occurs at different scales and unevenly affects urban populations based on the broader social, ecological, and technological system (SETS) characteristics particular to cities. As hydrological models improve in spatial scale and account for more mechanisms of flooding, there is a continuous need to examine the relationships between flood exposure and SETS drivers of flood vulnerability. In this study, we related fine-scale measures of future flood exposure (the First Street Foundation’s Flood Factor and estimated change in chance of extreme flood exposure) to SETS indicators like building age, poverty, and historical redlining, at the parcel and census block group (CBG) scales in Portland, OR, Phoenix, AZ, Baltimore, MD, and Atlanta, GA. We used standard regression models and accounted for spatial bias in relationships. The results show that flood exposure was more often correlated with SETS variables at the parcel scale than at the CBG scale, indicating scale dependence. However, these relationships were often inconsistent among cities, indicating place-dependence. We found that marginalized populations were significantly more exposed to future flooding at the CBG scale. Combining newly-available, high-resolution future flood risk estimates with SETS data available at multiple scales offers cities a new set of tools to assess the exposure and multi-dimensional vulnerability of populations. These tools will better equip city managers to proactively plan and implement equitable interventions to meet evolving hazard exposure.
  5. Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose mutational effect transfer learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics. We fine-tune METL on experimental sequence–function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering. 