

Title: Model-Based Clustering of High-Dimensional Longitudinal Data via Regularization
Abstract

In this paper, we propose a model-based clustering method for high-dimensional longitudinal data via regularization. This study was motivated by the Trial of Activity in Adolescent Girls (TAAG), which aimed to examine multilevel factors related to changes in physical activity by following a cohort of 783 girls over 10 years from adolescence to early adulthood. Our goal is to identify the intrinsic grouping of subjects with similar physical activity trajectories and the most relevant predictors within each group. Previous analyses conducted clustering and variable selection in two steps, whereas our new method performs both tasks simultaneously. Within each cluster, a linear mixed-effects model (LMM) is fitted with a doubly penalized likelihood to induce sparsity for parameter estimation and effect selection. The large-sample joint properties are established, allowing the dimensions of both fixed and random effects to increase at an exponential rate of the sample size, with a general class of penalty functions. Assuming subjects are drawn from a Gaussian mixture distribution, model effects and cluster labels are estimated via a coordinate descent algorithm nested inside the Expectation-Maximization (EM) algorithm. The Bayesian Information Criterion (BIC) is used to determine the optimal number of clusters and the values of the tuning parameters. Our numerical studies show that the new method performs satisfactorily and can accommodate complex data with multilevel and/or longitudinal effects.
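The estimation scheme described in the abstract (coordinate descent nested inside EM, with BIC for model selection) can be illustrated with a deliberately simplified sketch. This is not the authors' implementation. It assumes a mixture of ordinary linear regressions with a single L1 penalty on the fixed effects, omits random effects and the second penalty entirely, and does not rescale the penalty by the cluster variance. All function names (`soft_threshold`, `weighted_lasso_cd`, `em_mixture_lasso`, `bic`) are hypothetical and chosen for illustration only.

```python
import math
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator used in L1 coordinate descent."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def weighted_lasso_cd(X, y, w, lam, beta, n_sweeps=5):
    """Coordinate descent for 0.5*sum_i w_i*(y_i - x_i'b)^2 + lam*sum_j |b_j|."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    for _ in range(n_sweeps):
        for j in range(p):
            num, den = 0.0, 0.0
            for i in range(n):
                # prediction with coordinate j left out (partial residual)
                pred = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                num += w[i] * X[i][j] * (y[i] - pred)
                den += w[i] * X[i][j] ** 2
            beta[j] = soft_threshold(num, lam) / den if den > 0 else 0.0
    return beta

def normal_pdf(r, sigma2):
    return math.exp(-0.5 * r * r / sigma2) / math.sqrt(2 * math.pi * sigma2)

def em_mixture_lasso(X, y, K, lam, n_iter=30, seed=0):
    """EM for a K-component Gaussian mixture of L1-penalized regressions."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # random soft initialisation of the cluster responsibilities
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(K)]
        s = sum(row)
        resp.append([v / s for v in row])
    betas = [[0.0] * p for _ in range(K)]
    sigma2 = [1.0] * K
    pis = [1.0 / K] * K
    loglik = 0.0
    for _ in range(n_iter):
        # M-step: weighted penalized regression per cluster (inner coordinate descent)
        for k in range(K):
            w = [resp[i][k] for i in range(n)]
            nk = max(sum(w), 1e-12)
            pis[k] = nk / n
            betas[k] = weighted_lasso_cd(X, y, w, lam, betas[k])
            rss = sum(w[i] * (y[i] - dot(X[i], betas[k])) ** 2 for i in range(n))
            sigma2[k] = max(rss / nk, 1e-6)
        # E-step: update responsibilities and the observed-data log-likelihood
        loglik = 0.0
        for i in range(n):
            dens = [pis[k] * normal_pdf(y[i] - dot(X[i], betas[k]), sigma2[k])
                    for k in range(K)]
            tot = sum(dens)
            if tot <= 0.0:  # numerical underflow: fall back to uniform weights
                resp[i] = [1.0 / K] * K
                loglik += math.log(1e-300)
            else:
                resp[i] = [d / tot for d in dens]
                loglik += math.log(tot)
    return betas, sigma2, pis, resp, loglik

def bic(loglik, betas, n):
    """BIC with df = nonzero coefficients + K variances + (K - 1) mixing weights."""
    K = len(betas)
    nonzero = sum(1 for b in betas for v in b if v != 0.0)
    return -2.0 * loglik + (nonzero + 2 * K - 1) * math.log(n)
```

In this sketch one would fit the model over a grid of `K` and `lam` values and keep the combination with the smallest `bic` value, mirroring the selection strategy stated in the abstract. Because EM can converge to local optima, several random seeds would normally be tried.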

 
Award ID(s):
1934962
NSF-PAR ID:
10485826
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
79
Issue:
2
ISSN:
0006-341X
Page Range:
761-774
Sponsoring Org:
National Science Foundation
More Like this
  1. Harezlak, Jaroslaw (Ed.)
    We examined multi-level factors related to the longitudinal physical activity trajectories of adolescent girls to determine the important predictors of physical activity. The Trial of Activity in Adolescent Girls (TAAG) Maryland site recruited participants at age 14 (n = 566) and followed up with these girls at age 17 (n = 553) and age 23 (n = 442). Individual, social, and perceived environmental factors were assessed by questionnaire; body mass index was measured at age 14 and age 17, and self-reported at age 23. Neighborhood factors were assessed by geographic information systems. The outcome, moderate-to-vigorous physical activity (MVPA) minutes in a day, was assessed from accelerometers. A mixture of linear mixed-effects models with double penalization on fixed effects and random effects was used to simultaneously identify the intrinsic grouping of participants with similar physical activity trajectory patterns and the most relevant predictors within the groups. Three clusters of participants were identified. Two hundred and forty participants were clustered as “maintainers” and had consistently low MVPA over time; 289 participants were clustered as “decreasers” who had decreasing MVPA over time; 39 participants were grouped as “increasers” and had increasing MVPA over time. Each of the three clusters has its own cluster-specific factors identified using the clustering method, indicating that each cluster has unique characteristics.
  2. Summary

    A new approach to clustering multivariate data, based on a multilevel linear mixed model, is proposed. A key feature of the model is that observations from the same cluster are correlated, because they share cluster-specific random effects. The inclusion of cluster-specific random effects allows parsimonious departure from an assumed base model for cluster mean profiles. This departure is captured statistically via the posterior expectation, or best linear unbiased predictor. One of the parameters in the model is the true underlying partition of the data, and the posterior distribution of this parameter, which is known up to a normalizing constant, is used to cluster the data. The problem of finding partitions with high posterior probability is not amenable to deterministic methods such as the EM algorithm. Thus, we propose a stochastic search algorithm that is driven by a Markov chain that is a mixture of two Metropolis–Hastings algorithms—one that makes small scale changes to individual objects and another that performs large scale moves involving entire clusters. The methodology proposed is fundamentally different from the well-known finite mixture model approach to clustering, which does not explicitly include the partition as a parameter, and involves an independent and identically distributed structure.

     
  3. Aim/Purpose: The research reported here aims to demonstrate a method by which novel applications of qualitative data in quantitative research can resolve ceiling effect tensions for educational and psychological research.
    Background: Self-report surveys and scales are essential to graduate education and social science research. Ceiling effects reflect the clustering of responses at the highest response categories resulting in non-linearity, a lack of variability which inhibits and distorts statistical analyses. Ceiling effects in stress reported by students can negatively impact the accuracy and utility of the resulting data.
    Methodology: A longitudinal sample example from graduate engineering students’ stress, open-ended critical events, and their early departure from doctoral study considerations demonstrate the utility and improved accuracy of adjusted stress measures to include open-ended critical event responses. Descriptive statistics are used to describe the ceiling effects in stress data and adjusted stress data. The longitudinal stress ratings were used to predict departure considerations in multilevel modeling ANCOVA analyses and demonstrate improved model predictiveness.
    Contribution: Combining qualitative data from open-ended responses with quantitative survey responses provides an opportunity to reduce ceiling effects and improve model performance in predicting graduate student persistence. Here, we present a method for adjusting stress scale responses by incorporating coded critical events based on the Taxonomy of Life Events, the application of this method in the analysis of stress responses in a longitudinal data set, and potential applications.
    Findings: The resulting process more effectively represents the doctoral student experience within statistical analyses. Stress and major life events significantly impact engineering doctoral students’ departure considerations.
    Recommendations for Practitioners: Graduate educators should be aware of students’ life events and assist students in managing graduate school expectations while maintaining progress toward their degree.
    Recommendation for Researchers: Integrating coded open-ended qualitative data into statistical models can increase the accuracy and representation of the lived student experience. The new approach improves the accuracy and presentation of students’ lived experiences by incorporating qualitative data into longitudinal analyses. The improvement assists researchers in correcting data with ceiling effects for use in longitudinal analyses.
    Impact on Society: The method described here provides a framework to systematically include open-ended qualitative data in which ceiling effects are present.
    Future Research: Future research should validate the coding process in similar samples and in samples of doctoral students in different fields and master’s students.
  4. SUMMARY

    Differences between P- and S-wave models have been frequently used as evidence for the presence of large-scale compositional heterogeneity in the Earth's mantle. Our two-step machine learning (ML) analysis of 28 P- and S-wave global tomographic models reveals that, on a global scale, such differences are for the most part not intrinsic and could be reduced by changing the models in their respective null spaces. In other words, P- and S-wave images of mantle structure are not necessarily distinct from each other. Thus, a purely thermal explanation for large-scale seismic structure is sufficient at present; significant mantle compositional heterogeneities do not need to be invoked. We analyse 28 widely used tomographic models based on various theoretical approximations ranging from ray theory (e.g. UU-P07 and MIT-P08) and Born scattering (e.g. DETOX) to full-waveform techniques (e.g. CSEM and GLAD). We apply Varimax principal component analysis to reduce tomography model dimensionality by 83 percent, while preserving relevant information (94 percent of the original variance), followed by hierarchical clustering (HC) analysis using Ward's method to quantitatively categorize all models into hierarchical groups based on similarities. We found two main tomography model clusters: Cluster 1, which we called ‘Pure P wave’, is composed of six P-wave models that only use longitudinal body wave phases (e.g. P, PP and Pdiff); and Cluster 2, which we called ‘Mixed’, includes both P- and S-wave models. P-wave models in the ‘Mixed’ cluster use inversion methods that include inputs from other geophysical and geological data sources, and this causes them to be more similar to S-wave models than Pure P-wave models without significant loss of fitness to P-wave data. Given that inclusion of new data classes and seismic phases in more recent tomographic models significantly changes imaged seismic structure, our ML assessment of global tomography model similarity may improve selection of appropriate P- and S-wave models for future global tomography comparative studies.

     
  5. Abstract

    Objective. This paper presents data-driven solutions to address two challenges in the problem of linking neural data and behavior: 1) unsupervised analysis of behavioral data and automatic label generation from behavioral observations, and 2) extraction of subject-invariant features for the development of generalized neural decoding models. Approach. For behavioral analysis and label generation, an unsupervised method, which employs an autoencoder to transform behavior data into a cluster-friendly feature space, is presented. The model iteratively refines the assigned clusters with soft clustering assignment loss, and gradually improves the learned feature representations. To address subject variability in decoding neural activity, adversarial learning in combination with a long short-term memory-based adversarial variational autoencoder (LSTM-AVAE) model is employed. By using an adversary network to constrain the latent representations, the model captures shared information among subjects' neural activity, making it suitable for cross-subject transfer learning. Main results. The proposed approach is evaluated using cortical recordings of Thy1-GCaMP6s transgenic mice obtained via widefield calcium imaging during a motivational licking behavioral experiment. The results show that the proposed model achieves an accuracy of 89.7% in cross-subject neural decoding, outperforming other well-known autoencoder-based feature learning models. These findings suggest that incorporating an adversary network eliminates subject dependency in representations, leading to improved cross-subject transfer learning performance, while also demonstrating the effectiveness of LSTM-based models in capturing the temporal dependencies within neural data. Significance. Results demonstrate the feasibility of the proposed framework in unsupervised clustering and label generation of behavioral data, as well as achieving high accuracy in cross-subject neural decoding, indicating its potential for relating neural activity to behavior.

     