Abstract. Background: Behavior and health are inextricably linked. As a result, continuous wearable sensor data offer the potential to predict clinical measures. However, interruptions in the data collection occur, which create a need for strategic data imputation. Objective: The objective of this work is to adapt a data generation algorithm to impute multivariate time series data. This will allow us to create digital behavior markers that can predict clinical health measures. Methods: We created a bidirectional time series generative adversarial network to impute missing sensor readings. Values are imputed based on relationships between multiple fields and multiple points in time, for single time points or larger time gaps. From the complete data, digital behavior markers are extracted and are mapped to predicted clinical measures. Results: We validate our approach using continuous smartwatch data for n = 14 participants. When reconstructing omitted data, we observe an average normalized mean absolute error of 0.0197. We then create machine learning models to predict clinical measures from the reconstructed, complete data with correlations ranging from r = 0.1230 to r = 0.7623. This work indicates that wearable sensor data collected in the wild can be used to offer insights into a person's health in natural settings.
Robust and Scalable Bayes via a Median of Subset Posterior Measures
We propose a novel approach to Bayesian analysis that is provably robust to outliers in the data and often has computational advantages over standard methods. Our technique is based on splitting the data into non-overlapping subgroups, evaluating the posterior distribution given each independent subgroup, and then combining the resulting measures. The main novelty of our approach is the proposed aggregation step, which is based on the evaluation of a median in the space of probability measures equipped with a suitable collection of distances that can be quickly and efficiently evaluated in practice. We present both theoretical and numerical evidence illustrating the improvements achieved by our method.
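The split, evaluate, and aggregate steps described above can be illustrated with a deliberately simplified sketch. It is not the paper's method: the abstract describes a median taken in a space of probability measures under suitable distances, whereas this sketch assumes a Gaussian location model with known covariance and a flat prior, collapses each subset posterior to its mean, and takes the ordinary Euclidean geometric median of those means using Weiszfeld iterations. All function names are hypothetical.

```python
import math
import random


def split_into_subsets(data, m):
    """Partition the data into m non-overlapping subsets (round-robin)."""
    return [data[i::m] for i in range(m)]


def subset_posterior_mean(subset):
    """Posterior mean under the assumed model (Gaussian location, known
    covariance, flat prior), which is just the subset's sample mean."""
    n = len(subset)
    dim = len(subset[0])
    return [sum(p[k] for p in subset) / n for k in range(dim)]


def geometric_median(points, iters=200, eps=1e-9):
    """Euclidean geometric median via Weiszfeld's fixed-point iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise average of the points.
    est = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(math.dist(est, p), eps) for p in points]
        total = sum(weights)
        new = [sum(w * p[k] for w, p in zip(weights, points)) / total
               for k in range(dim)]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est


def median_of_subset_posteriors(data, m):
    """Split, summarize each subset posterior by its mean, then aggregate."""
    subsets = split_into_subsets(data, m)
    return geometric_median([subset_posterior_mean(s) for s in subsets])


# Synthetic illustration: 1000 clean points around (1, -2) plus 5 gross outliers.
rng = random.Random(0)
clean = [[rng.gauss(1.0, 1.0), rng.gauss(-2.0, 1.0)] for _ in range(1000)]
outliers = [[1000.0, 1000.0]] * 5
data = clean + outliers

full_mean = subset_posterior_mean(data)               # pulled toward outliers
robust = median_of_subset_posteriors(data, m=20)      # stays near (1, -2)
```

In this toy setting the five outliers contaminate at most five of the twenty subsets, so the median of the subset summaries stays close to the true location while the full-data mean is dragged far away. Each subset can also be processed independently, which is where the computational advantage mentioned in the abstract would come from.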
- Award ID(s):
- 1663870
- PAR ID:
- 10059384
- Date Published:
- Journal Name:
- Journal of Machine Learning Research
- Volume:
- 18
- ISSN:
- 1533-7928
- Page Range / eLocation ID:
- 1-40
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Cao, Jason Xinyu; Ge, Ying-En (Ed.) This study explores household-level evacuation decision-making in response to Hurricane Laura, in a context where hurricane risk reduction measures contradicted COVID-19 risk reduction measures. Data were collected using a mail-based survey approach from households along the coast of Texas and Louisiana to explore drivers of and barriers to evacuation, including COVID-19 measures such as negative affect, risk perceptions, protective actions, and exposure. Testing for direct and indirect effects among the drivers of and barriers to evacuation, we find that many of our COVID-19 measures did not have a direct effect on evacuation but did have indirect effects through other factors. We also found evidence of both direct and indirect relationships with regard to more conventional drivers of evacuation found in the literature. We close with a discussion of the limitations and implications of this study.
-
Algorithmic fairness research has mainly focused on adapting learning models to mitigate discrimination based on protected attributes, yet understanding inherent biases in training data remains largely unexplored. Quantifying these biases is crucial for informed data engineering, as data mining and model development often occur separately. We address this by developing an information-theoretic framework to quantify the marginal impacts of dataset features on the discrimination bias of downstream predictors. We postulate a set of desired properties for candidate discrimination measures and derive measures that (partially) satisfy them. Distinct sets of these properties align with distinct fairness criteria like demographic parity or equalized odds, which we show can be in disagreement and not simultaneously satisfied by a single measure. We use the Shapley value to determine individual features’ contributions to overall discrimination, and prove its effectiveness in eliminating redundancy. We validate our measures through a comprehensive empirical study on numerous real-world and synthetic datasets. For synthetic data, we use a parametric linear structural causal model to generate diverse data correlation structures. Our analysis provides empirically validated guidelines for selecting discrimination measures based on data conditions and fairness criteria, establishing a robust framework for quantifying inherent discrimination bias in data
-
Emotions provide critical information regarding a person's health and well-being. Therefore, the ability to track emotion and patterns in emotion over time could provide new opportunities in measuring health longitudinally. This is of particular importance for individuals with bipolar disorder (BD), where emotion dysregulation is a hallmark symptom of increasing mood severity. However, measuring emotions typically requires self-assessment, a willful action outside of one's daily routine. In this paper, we describe a novel approach for collecting real-world natural speech data from daily life and measuring emotions from these data. The approach combines a novel data collection pipeline and validated robust emotion recognition models. We describe a deployment of this pipeline that included parallel clinical and self-report measures of mood and self-reported measures of emotion. Finally, we present approaches to estimate clinical and self-reported mood measures using a combination of passive and self-reported emotion measures. The results demonstrate that both passive and self-reported measures of emotion contribute to our ability to accurately estimate mood symptom severity for individuals with BD.
-
Artificial intelligence-based prostate cancer (PCa) detection models have been widely explored to assist clinical diagnosis. However, these trained models may generate erroneous results, particularly on datasets that fall outside the training distribution. In this paper, we propose an approach to tackle this so-called out-of-distribution (OOD) data problem. Specifically, we devise an end-to-end unsupervised framework to estimate uncertainty values for cases analyzed by a previously trained PCa detection model. Our PCa detection model takes bpMRI scans as input, and through our proposed approach we identify OOD cases that are likely to generate degraded performance due to data distribution shifts. The proposed OOD framework consists of two parts. First, an autoencoder-based reconstruction network is proposed, which learns discrete latent representations of in-distribution data. Second, the uncertainty is computed using a perceptual loss that measures the distance between original and reconstructed images in the feature space of a pre-trained PCa detection network. The effectiveness of the proposed framework is evaluated on seven independent data collections with a total of 1,432 cases. The performance of the pre-trained PCa detection model is significantly improved by excluding cases with high uncertainty.

