A growing body of research shows that longitudinal passive sensing data from smartphones and wearable devices can capture daily behavior signals for human behavior modeling, such as depression detection. Most prior studies build and evaluate machine learning models using data collected from a single population. However, to ensure that a behavior model can work for a larger group of users, its generalizability needs to be verified on multiple datasets from different populations. We present the first work evaluating cross-dataset generalizability of longitudinal behavior models, using depression detection as an application. We collect multiple longitudinal passive mobile sensing datasets with over 500 users from two institutes over a two-year span, leading to four institute-year datasets. Using the datasets, we closely re-implement and evaluate nine prior depression detection algorithms. Our experiment reveals the lack of model generalizability of these methods. We also implement eight recently popular domain generalization algorithms from the machine learning community. Our results indicate that these methods also do not generalize well on our datasets, with barely any advantage over the naive baseline of guessing the majority. We then present two new algorithms with better generalizability. Our new algorithm, Reorder, significantly and consistently outperforms existing methods on most cross-dataset generalization setups. However, the overall advantage is incremental and still has great room for improvement. Our analysis reveals that individual differences (both within and between populations) may play the most important role in the cross-dataset generalization challenge. Finally, we provide an open-source benchmark platform, GLOBEM (short for Generalization of Longitudinal BEhavior Modeling), to consolidate all 19 algorithms. GLOBEM can support researchers in using, developing, and evaluating different longitudinal behavior modeling methods.
We call for researchers' attention to model generalizability evaluation for future longitudinal human behavior modeling studies.
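As a rough illustration of the cross-dataset setup and the majority-guess baseline mentioned in the abstract, the following minimal Python sketch scores a majority-class baseline on each held-out dataset. It is not the GLOBEM implementation: the leave-one-dataset-out protocol, the function names, the dataset keys, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in `labels`."""
    return Counter(labels).most_common(1)[0][0]


def leave_one_dataset_out(datasets):
    """Score a majority-class baseline on each held-out dataset.

    `datasets` maps a dataset name to a list of binary labels.
    The baseline is fit on all other datasets and then applied to the
    held-out one. Returns {held_out_name: baseline accuracy}.
    """
    scores = {}
    for held_out, test_labels in datasets.items():
        train_labels = [
            y for name, ys in datasets.items() if name != held_out for y in ys
        ]
        guess = majority_label(train_labels)
        scores[held_out] = sum(y == guess for y in test_labels) / len(test_labels)
    return scores


# Hypothetical toy labels (1 = depressed, 0 = not); not real study data.
toy = {
    "inst1_year1": [0, 0, 1, 0],
    "inst1_year2": [0, 1, 0, 0],
    "inst2_year1": [1, 1, 0, 1],
    "inst2_year2": [0, 0, 0, 1],
}
scores = leave_one_dataset_out(toy)
```

In this toy example the baseline does poorly on the one dataset whose label balance differs from the others, which is the kind of population shift a cross-dataset evaluation is meant to expose.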
This content will become publicly available on June 9, 2026
CRoP: Context-wise Robust Static Human-Sensing Personalization
Advances in deep learning and the Internet of Things have led to diverse human sensing applications. However, distinct patterns in human sensing, influenced by various factors or contexts, challenge the performance of generic neural network models due to natural distribution shifts. To address this, personalization tailors models to individual users. Yet most personalization studies overlook intra-user heterogeneity across contexts in sensory data, limiting intra-user generalizability. This limitation is especially critical in clinical applications, where limited data availability hampers both generalizability and personalization. Notably, intra-user sensing attributes are expected to change due to external factors such as treatment progression, further complicating the challenges. To address the intra-user generalization challenge, this work introduces CRoP, a novel static personalization approach. CRoP leverages off-the-shelf pre-trained models as generic starting points and captures user-specific traits through adaptive pruning on a minimal sub-network while allowing generic knowledge to be incorporated in the remaining parameters. CRoP demonstrates superior personalization effectiveness and intra-user robustness across four human-sensing datasets, including two from real-world health domains, underscoring its practical and social impact. Additionally, to support CRoP's generalization ability and design choices, we provide empirical justification through gradient inner product analysis, ablation studies, and comparisons against state-of-the-art baselines.
- PAR ID: 10657056
- Publisher / Repository: ACM
- Date Published:
- Journal Name: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
- Volume: 9
- Issue: 2
- ISSN: 2474-9567
- Page Range / eLocation ID: 1 to 34
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Chen, Chi-Hua (Ed.) Mobile sensing data processed using machine learning models can passively and remotely assess mental health symptoms from the context of patients’ lives. Prior work has trained models using data from single longitudinal studies, collected from demographically homogeneous populations, over short time periods, using a single data collection platform or mobile application. The generalizability of model performance across studies has not been assessed. This study presents a first analysis to understand if models trained using combined longitudinal study data to predict mental health symptoms generalize across current publicly available data. We combined data from the CrossCheck (individuals living with schizophrenia) and StudentLife (university students) studies. In addition to assessing generalizability, we explored if personalizing models to align mobile sensing data, and oversampling less-represented severe symptoms, improved model performance. Leave-one-subject-out cross-validation (LOSO-CV) results were reported. Two symptoms (sleep quality and stress) had similar question-response structures across studies and were used as outcomes to explore cross-dataset prediction. Models trained with combined data were more likely to be predictive (significant improvement over predicting training data mean) than models trained with single-study data. Expected model performance improved if the distance between training and validation feature distributions decreased using combined versus single-study data. Personalization aligned each LOSO-CV participant with training data, but only improved predicting CrossCheck stress. Oversampling significantly improved severe symptom classification sensitivity and positive predictive value, but decreased model specificity. Taken together, these results show that machine learning models trained on combined longitudinal study data may generalize across heterogeneous datasets.
We encourage researchers to disseminate collected de-identified mobile sensing and mental health symptom data, and further standardize data types collected across studies to enable better assessment of model generalizability.
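The leave-one-subject-out protocol and the "predict the training data mean" reference point described in this abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the study's code: the function names, the use of mean absolute error as the metric, and the toy subject IDs and scores are all hypothetical.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for leave-one-subject-out CV."""
    for subject in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        yield subject, train_idx, test_idx


def mean_baseline_mae(subject_ids, targets):
    """Mean absolute error per held-out subject when predicting the training mean."""
    errors = {}
    for subject, train_idx, test_idx in loso_splits(subject_ids):
        train_mean = sum(targets[i] for i in train_idx) / len(train_idx)
        errors[subject] = sum(
            abs(targets[i] - train_mean) for i in test_idx
        ) / len(test_idx)
    return errors


# Hypothetical toy data: two self-reported stress scores per subject.
subjects = ["a", "a", "b", "b", "c", "c"]
stress = [1.0, 3.0, 2.0, 2.0, 4.0, 0.0]
mae = mean_baseline_mae(subjects, stress)
```

A learned model would count as predictive in this sense only if its held-out error is reliably lower than the per-subject values this baseline produces.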
-
The paper presents AIIM, an Artificial Intelligence (AI) enabled personalIzation Management software for human-in-the-loop, human-in-the-plant Learning enabled systems (LES). AIIM can be integrated with LES software to help a human user achieve safe and effective operation under dynamically changing contexts. AIIM consists of: a) an AI technique to derive the model coefficients of a physics-guided surrogate model from operational data shared following privacy norms, and b) continuous model conformance to identify key changes in LES operational behavior that may jeopardize safety. We demonstrate two capabilities of AIIM, personalization and unknown error detection, through case studies that span a significant breadth of dynamic context change scenarios including: a) involuntary change in user context such as medication-induced glucose metabolism change in automated insulin delivery (AID), b) actuation failure such as cartridge blockage in AID, c) latent sensor error in aviation, and d) unknown coding error in autonomous car software patches. We compare AIIM personalization with human-in-the-loop and self-adaptive model-predictive control design in real-life and simulation settings, to show safe and improved diabetes management.
-
Stress affects physical and mental health, and wearable devices have been widely used to detect daily stress through physiological signals. However, these signals vary due to factors such as individual differences and health conditions, making it difficult to generalize machine learning models. To address these challenges, we present Human Heterogeneity Invariant Stress Sensing (HHISS), a domain generalization approach designed to find consistent patterns in stress signals by removing person-specific differences. This helps the model perform more accurately across new people, environments, and stress types not seen during training. Its novelty lies in a technique called person-wise sub-network pruning intersection, which focuses on features shared across individuals, alongside preventing overfitting by leveraging continuous labels during training. The present study focuses on people with opioid use disorder (OUD), a group in which stress responses can change dramatically depending on the presence of opioids in their system, including daily timed medication for OUD (MOUD). Since stress often triggers cravings, a model that can adapt well to these changes could support better OUD rehabilitation and recovery. We tested HHISS on seven different stress datasets: four that we collected ourselves and three public datasets. Four are from lab setups, one is from a controlled real-world driving setting, and two are from real-world in-the-wild field settings with no constraints. The present study is the first known to evaluate how well a stress detection model works across such a wide range of data. Results show HHISS consistently outperformed state-of-the-art baseline methods, proving both effective and practical for real-world use. Ablation studies, empirical justifications, and runtime evaluations confirm HHISS's feasibility and scalability for mobile stress sensing in sensitive real-world applications.
-
Recent advancements in large language models have spurred significant developments in Time Series Foundation Models (TSFMs). These models claim great promise in performing zero-shot forecasting without the need for specific training, leveraging the extensive "corpus" of time-series data they have been trained on. Forecasting is crucial in predictive building analytics, presenting substantial untapped potential for TSFMs in this domain. However, time-series data are often domain-specific and governed by diverse factors such as deployment environments, sensor characteristics, sampling rate, and data resolution, which complicates the generalizability of these models across different contexts. Thus, while language models benefit from the relative uniformity of text data, TSFMs face challenges in learning from heterogeneous and contextually varied time-series data to ensure accurate and reliable performance in various applications. This paper seeks to understand how recently developed TSFMs perform in the building domain, particularly concerning their generalizability. We benchmark these models on three large datasets related to indoor air temperature and electricity usage. Our results indicate that TSFMs exhibit marginally better performance compared to statistical models on unseen sensing modality and/or patterns. Based on the benchmark results, we also provide insights for improving future TSFMs on building analytics.
