Title: Model-Based Clustering of High-Dimensional Longitudinal Data via Regularization
Abstract: We propose a model-based clustering method for high-dimensional longitudinal data via regularization. This study was motivated by the Trial of Activity in Adolescent Girls (TAAG), which aimed to examine multilevel factors related to the change of physical activity by following up a cohort of 783 girls over 10 years from adolescence to early adulthood. Our goal is to identify the intrinsic grouping of subjects with similar patterns of physical activity trajectories and the most relevant predictors within each group. The previous analyses conducted clustering and variable selection in two steps, while our new method can perform the tasks simultaneously. Within each cluster, a linear mixed-effects model (LMM) is fitted with a doubly penalized likelihood to induce sparsity for parameter estimation and effect selection. The large-sample joint properties are established, allowing the dimensions of both fixed and random effects to increase at an exponential rate of the sample size, with a general class of penalty functions. Assuming subjects are drawn from a Gaussian mixture distribution, model effects and cluster labels are estimated via a coordinate descent algorithm nested inside the Expectation-Maximization (EM) algorithm. The Bayesian Information Criterion (BIC) is used to determine the optimal number of clusters and the values of tuning parameters. Our numerical studies show that the new method has satisfactory performance and is able to accommodate complex data with multilevel and/or longitudinal effects.
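The estimation scheme described in the abstract (coordinate descent nested inside EM, with BIC for model choice) can be illustrated with a much simpler stand-in. The sketch below is not the paper's method: it assumes a mixture of ordinary linear regressions with an L1 penalty on the fixed effects only, with no random effects and no second penalty. All function names, the degrees-of-freedom count used for BIC, and the random initialization are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso_cd(X, y, w, lam, beta, n_sweeps=50):
    """Coordinate descent for 0.5 * sum_i w_i (y_i - x_i'beta)^2 + lam * ||beta||_1."""
    beta = beta.copy()
    r = y - X @ beta                      # current residual
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            xj = X[:, j]
            denom = np.sum(w * xj * xj)
            if denom == 0.0:
                continue
            r = r + xj * beta[j]          # partial residual without coordinate j
            rho = np.sum(w * xj * r)
            beta[j] = soft_threshold(rho, lam) / denom
            r = r - xj * beta[j]
    return beta

def em_mixture_lasso(X, y, K, lam, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture of L1-penalized linear regressions.

    Simplified assumption: independent observations, fixed effects only.
    Returns (betas, hard cluster labels, BIC).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    resp = rng.dirichlet(np.ones(K), size=n)   # random soft initial labels
    betas = np.zeros((K, p))
    sigma2 = np.full(K, np.var(y))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # M-step: penalized update of each component via coordinate descent
        for k in range(K):
            w = resp[:, k]
            betas[k] = weighted_lasso_cd(X, y, w / sigma2[k], lam, betas[k])
            res = y - X @ betas[k]
            sigma2[k] = max(np.sum(w * res**2) / max(w.sum(), 1e-12), 1e-6)
            pi[k] = w.mean()
        # E-step: posterior cluster probabilities, computed in log space
        logp = np.stack(
            [np.log(pi[k] + 1e-300)
             - 0.5 * np.log(2 * np.pi * sigma2[k])
             - 0.5 * (y - X @ betas[k])**2 / sigma2[k]
             for k in range(K)],
            axis=1,
        )
        m = logp.max(axis=1, keepdims=True)
        loglik_i = m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))
        resp = np.exp(logp - loglik_i[:, None])
    # BIC with an assumed df: nonzero coefficients + K variances + (K - 1) weights
    df = np.count_nonzero(betas) + K + (K - 1)
    bic = -2.0 * loglik_i.sum() + df * np.log(n)
    return betas, resp.argmax(axis=1), bic
```

In this simplified setting, the number of clusters and the penalty level would be chosen by running `em_mixture_lasso` over a grid of `K` and `lam` values and keeping the pair with the smallest BIC.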
Award ID(s):
1934962
PAR ID:
10485826
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
79
Issue:
2
ISSN:
0006-341X
Format(s):
Medium: X
Size(s):
p. 761-774
Sponsoring Org:
National Science Foundation
More Like this
  1. Harezlak, Jaroslaw (Ed.)
    We examined multi-level factors related to the longitudinal physical activity trajectories of adolescent girls to determine the important predictors for physical activity. The Trial of Activity in Adolescent Girls (TAAG) Maryland site recruited participants at age 14 (n = 566) and followed up with these girls at age 17 (n = 553) and age 23 (n = 442). Individual, social factors and perceived environmental factors were assessed by questionnaire; body mass index was measured at age 14 and age 17, and self-reported at age 23. Neighborhood factors were assessed by geographic information systems. The outcome, moderate-to-vigorous physical activity (MVPA) minutes in a day, was assessed from accelerometers. A mixture of linear mixed-effects models with double penalization on fixed effects and random effects was used to identify the intrinsic grouping of participants with similar physical activity trajectory patterns and the most relevant predictors within the groups simultaneously. Three clusters of participants were identified. Two hundred and forty participants were clustered as “maintainers” and had consistently low MVPA over time; 289 participants were clustered as “decreasers” who had decreasing MVPA over time; 39 participants were grouped as “increasers” and had increasing MVPA over time. Each of the three clusters has its own cluster-specific factors identified using the clustering method, indicating that each cluster has unique characteristics.
  2. Summary: With the rapid development of techniques to measure brain activity and structure, statistical methods for analyzing modern brain-imaging data play an important role in the advancement of science. Imaging data that measure brain function are usually multivariate high-density longitudinal data and are heterogeneous across both imaging sources and subjects, which leads to various statistical and computational challenges. In this article, we propose a group-based method to cluster a collection of multivariate high-density longitudinal data via a Bayesian mixture of smoothing splines. Our method assumes each multivariate high-density longitudinal trajectory is a mixture of multiple components with different mixing weights. Time-independent covariates are assumed to be associated with the mixture components and are incorporated via logistic weights of a mixture-of-experts model. We formulate this approach under a fully Bayesian framework using Gibbs sampling where the number of components is selected based on a deviance information criterion. The proposed method is compared to existing methods via simulation studies and is applied to a study on functional near-infrared spectroscopy, which aims to understand infant emotional reactivity and recovery from stress. The results reveal distinct patterns of brain activity, as well as associations between these patterns and selected covariates.
  3. How to cluster event sequences generated via different point processes is an interesting and important problem in statistical machine learning. To solve this problem, we propose and discuss an effective model-based clustering method based on a novel Dirichlet mixture model of a special but significant type of point process, the Hawkes process. The proposed model generates the event sequences with different clusters from the Hawkes processes with different parameters, and uses a Dirichlet distribution as the prior distribution of the clusters. We prove the identifiability of our mixture model and propose an effective variational Bayesian inference algorithm to learn our model. An adaptive inner iteration allocation strategy is designed to accelerate the convergence of our algorithm. Moreover, we investigate the sample complexity and the computational complexity of our learning algorithm in depth. Experiments on both synthetic and real-world data show that the clustering method based on our model can learn structural triggering patterns hidden in asynchronous event sequences robustly and achieve superior performance on clustering purity and consistency compared to existing methods.
  4. Modern neural recording techniques allow neuroscientists to observe the spiking activity of many neurons simultaneously. Although previous work has illustrated how activity within and between known populations of neurons can be summarized by low-dimensional latent vectors, in many cases what determines a unique population may be unclear. Neurons differ in their anatomical location, but also in their cell types and response properties. Moreover, multiple distinct populations may not be well described by a single low-dimensional, linear representation. To tackle these challenges, we develop a clustering method based on a mixture of dynamic Poisson factor analyzers (DPFA) model, with the number of clusters treated as an unknown parameter. To analyze the DPFA model, we propose a novel Markov chain Monte Carlo (MCMC) algorithm to efficiently sample its posterior distribution. Validating our proposed MCMC algorithm with simulations, we find that it can accurately recover the true clustering and latent states and is insensitive to the initial cluster assignments. We then apply the proposed mixture of DPFA model to multi-region experimental recordings, where we find that the proposed method can identify novel, reliable clusters of neurons based on their activity, and may, thus, be a useful tool for neural data analysis.
  5. Abstract: We construct and validate the selection function of the MARD-Y3 galaxy cluster sample. This sample was selected through optical follow-up of the 2nd ROSAT faint source catalogue with Dark Energy Survey year 3 data. The selection function is modelled by combining an empirically constructed X-ray selection function with an incompleteness model for the optical follow-up. We validate the joint selection function by testing the consistency of the constraints on the X-ray flux–mass and richness–mass scaling relation parameters derived from different sources of mass information: (1) cross-calibration using South Pole Telescope Sunyaev-Zel'dovich (SPT-SZ) clusters, (2) calibration using number counts in X-ray, in optical and in both X-ray and optical while marginalizing over cosmological parameters, and (3) other published analyses. We find that the constraints on the scaling relation from the number counts and SPT-SZ cross-calibration agree, indicating that our modelling of the selection function is adequate. Furthermore, we apply a largely cosmology independent method to validate selection functions via the computation of the probability of finding each cluster in the SPT-SZ sample in the MARD-Y3 sample and vice versa. This test reveals no clear evidence for MARD-Y3 contamination, SPT-SZ incompleteness or outlier fraction. Finally, we discuss the prospects of the techniques presented here to limit systematic selection effects in future cluster cosmological studies.