skip to main content


Title: Scalable multiple changepoint detection for functional data sequences
Abstract

We propose the multiple changepoint isolation (MCI) method for detecting multiple changes in the mean and covariance of a functional process. We first introduce a pair of projections to represent the variability “between” and “within” the functional observations. We then present an augmented fused lasso procedure to split the projections into multiple regions robustly. These regions act to isolate each changepoint away from the others so that the powerful univariate CUSUM statistic can be applied region‐wise to identify the changepoints. Simulations show that our method accurately detects the number and locations of changepoints under many different scenarios. These include light and heavy tailed data, data with symmetric and skewed distributions, sparsely and densely sampled changepoints, and mean and covariance changes. We show that our method outperforms a recent multiple functional changepoint detector and several univariate changepoint detectors applied to our proposed projections. We also show that MCI is more robust than existing approaches and scales linearly with sample size. Finally, we demonstrate our method on a large time series of water vapor mixing ratio profiles from atmospheric emitted radiance interferometer measurements.

 
more » « less
Award ID(s):
1830312 2124576
NSF-PAR ID:
10363907
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Environmetrics
Volume:
33
Issue:
2
ISSN:
1180-4009
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In this paper, we consider sequentially estimating the density of univariate data. We utilize Pólya trees to develop a statistical process control (SPC) methodology. Our proposed methodology monitors the distribution of the sequentially observed data and detects when the generating density differs from an in‐control standard. We also propose an approximation that merges the probability mass of multiple possible changepoints to curb computational complexity while maintaining the accuracy of the monitoring procedure. We show in simulation experiments that our approach is capable of quickly detecting when a changepoint has occurred while controlling the number of false alarms, and performs well relative to competing methods. We then use our methodology to detect changepoints in high‐frequency foreign exchange (Forex) return data.

     
    more » « less
  2. Abstract

    Statistical bias correction techniques are commonly used in climate model projections to reduce systematic biases. Among the several bias correction techniques, univariate linear bias correction (e.g., quantile mapping) is the most popular, given its simplicity. Univariate linear bias correction can accurately reproduce the observed mean of a given climate variable. However, when performed separately on multiple variables, it does not yield the observed multivariate cross‐correlation structure. In the current study, we consider the intrinsic properties of two candidate univariate linear bias‐correction approaches (simple linear regression and asynchronous regression) in estimating the observed cross‐correlation between precipitation and temperature. Two linear regression models are applied separately on both the observed and the projected variables. The analytical solution suggests that two candidate approaches simply reproduce the cross‐correlation from the general circulation models (GCMs) in the bias‐corrected data set because of their linearity. Our study adopts two frameworks, based on the Fisherz‐transformation and bootstrapping, to provide 95% lower and upper confidence limits (referred as the permissible bound) for the GCM cross‐correlation. Beyond the permissible bound, raw/bias‐corrected GCM cross‐correlation significantly differs from those observed. Two frameworks are applied on three GCMs from the CMIP5 multimodel ensemble over the coterminous United States. We found that (a) the univariate linear techniques fail to reproduce the observed cross‐correlation in the bias‐corrected data set over 90% (30–50%) of the grid points where the multivariate skewness coefficient values are substantial (small) and statistically significant (statistically insignificant) from zero; (b) the performance of the univariate linear techniques under bootstrapping (Fisherz‐transformation) remains uniform (non‐uniform) across climate regions, months, and GCMs; (c) grid points, where the observed cross‐correlation is statistically significant, witness a failure fraction of around 0.2 (0.8) under the Fisherz‐transformation (bootstrapping). The importance of reproducing cross‐correlations is also discussed along with an enquiry into the multivariate approaches that can potentially address the bias in yielding cross‐correlations.

     
    more » « less
  3. Summary

    We propose a fast penalized spline method for bivariate smoothing. Univariate P-spline smoothers are applied simultaneously along both co-ordinates. The new smoother has a sandwich form which suggested the name ‘sandwich smoother’ to a referee. The sandwich smoother has a tensor product structure that simplifies an asymptotic analysis and it can be fast computed. We derive a local central limit theorem for the sandwich smoother, with simple expressions for the asymptotic bias and variance, by showing that the sandwich smoother is asymptotically equivalent to a bivariate kernel regression estimator with a product kernel. As far as we are aware, this is the first central limit theorem for a bivariate spline estimator of any type. Our simulation study shows that the sandwich smoother is orders of magnitude faster to compute than other bivariate spline smoothers, even when the latter are computed by using a fast generalized linear array model algorithm, and comparable with them in terms of mean integrated squared errors. We extend the sandwich smoother to array data of higher dimensions, where a generalized linear array model algorithm improves the computational speed of the sandwich smoother. One important application of the sandwich smoother is to estimate covariance functions in functional data analysis. In this application, our numerical results show that the sandwich smoother is orders of magnitude faster than local linear regression. The speed of the sandwich formula is important because functional data sets are becoming quite large.

     
    more » « less
  4. Abstract

    Climate changepoint (homogenization) methods abound today, with a myriad of techniques existing in both the climate and statistics literature. Unfortunately, the appropriate changepoint technique to use remains unclear to many. Further complicating issues, changepoint conclusions are not robust to perturbations in assumptions; for example, allowing for a trend or correlation in the series can drastically change changepoint conclusions. This paper is a review of the topic, with an emphasis on illuminating the models and techniques that allow the scientist to make reliable conclusions. Pitfalls to avoid are demonstrated via actual applications. The discourse begins by narrating the salient statistical features of most climate time series. Thereafter, single- and multiple-changepoint problems are considered. Several pitfalls are discussed en route and good practices are recommended. While most of our applications involve temperatures, a sea ice series is also considered.

    Significance Statement

    This paper reviews the methods used to identify and analyze the changepoints in climate data, with a focus on helping scientists make reliable conclusions. The paper discusses common mistakes and pitfalls to avoid in changepoint analysis and provides recommendations for best practices. The paper also provides examples of how these methods have been applied to temperature and sea ice data. The main goal of the paper is to provide guidance on how to effectively identify the changepoints in climate time series and homogenize the series.

     
    more » « less
  5. Abstract Background

    In Alzheimer’s Diseases (AD) research, multimodal imaging analysis can unveil complementary information from multiple imaging modalities and further our understanding of the disease. One application is to discover disease subtypes using unsupervised clustering. However, existing clustering methods are often applied to input features directly, and could suffer from the curse of dimensionality with high-dimensional multimodal data. The purpose of our study is to identify multimodal imaging-driven subtypes in Mild Cognitive Impairment (MCI) participants using a multiview learning framework based on Deep Generalized Canonical Correlation Analysis (DGCCA), to learn shared latent representation with low dimensions from 3 neuroimaging modalities.

    Results

    DGCCA applies non-linear transformation to input views using neural networks and is able to learn correlated embeddings with low dimensions that capture more variance than its linear counterpart, generalized CCA (GCCA). We designed experiments to compare DGCCA embeddings with single modality features and GCCA embeddings by generating 2 subtypes from each feature set using unsupervised clustering. In our validation studies, we found that amyloid PET imaging has the most discriminative features compared with structural MRI and FDG PET which DGCCA learns from but not GCCA. DGCCA subtypes show differential measures in 5 cognitive assessments, 6 brain volume measures, and conversion to AD patterns. In addition, DGCCA MCI subtypes confirmed AD genetic markers with strong signals that existing late MCI group did not identify.

    Conclusion

    Overall, DGCCA is able to learn effective low dimensional embeddings from multimodal data by learning non-linear projections. MCI subtypes generated from DGCCA embeddings are different from existing early and late MCI groups and show most similarity with those identified by amyloid PET features. In our validation studies, DGCCA subtypes show distinct patterns in cognitive measures, brain volumes, and are able to identify AD genetic markers. These findings indicate the promise of the imaging-driven subtypes and their power in revealing disease structures beyond early and late stage MCI.

     
    more » « less