Abstract Phylodynamics is an area of population genetics that uses genetic sequence data to estimate past population dynamics. Modern state‐of‐the‐art Bayesian nonparametric methods for recovering population size trajectories of unknown form use either change‐point models or Gaussian process priors. Change‐point models suffer from computational issues when the number of change‐points is unknown and needs to be estimated. Gaussian process‐based methods lack local adaptivity and cannot accurately recover trajectories that exhibit features such as abrupt changes in trend or varying levels of smoothness. We propose a novel, locally adaptive approach to Bayesian nonparametric phylodynamic inference that has the flexibility to accommodate a large class of functional behaviors. Local adaptivity results from modeling the log‐transformed effective population size a priori as a horseshoe Markov random field, a recently proposed statistical model that blends together the best properties of the change‐point and Gaussian process modeling paradigms. We use simulated data to assess model performance, and find that our proposed method results in reduced bias and increased precision when compared to contemporary methods. We also use our models to reconstruct past changes in genetic diversity of human hepatitis C virus in Egypt and to estimate population size changes of ancient and modern steppe bison. These analyses show that our new method captures features of the population size trajectories that were missed by the state‐of‐the‐art methods.
On Functional Processes with Multiple Discontinuities
Abstract We consider the problem of estimating multiple change points for a functional data process. There are numerous examples in science and finance in which the process of interest may be subject to sudden changes in the mean. Process data that are not in the close vicinity of any change point can be analysed by the usual nonparametric smoothing methods. However, the data close to change points, which contain the most pertinent information about structural breaks, need to be handled with special care. This paper considers a half-kernel approach that addresses inference on the total number, locations and jump sizes of the changes. Convergence rates and asymptotic distributional results for the proposed procedures are thoroughly investigated. Simulations are conducted to examine the performance of the approach, and a number of real data sets are analysed to provide an illustration.
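The half-kernel idea mentioned in the abstract can be illustrated in a much simpler setting than the paper's. The sketch below is not the authors' procedure: it assumes scalar (not functional) observations, a single jump in the mean, a triangular one-sided kernel, and a fixed bandwidth `h`. All function names are hypothetical. At each candidate location it smooths using only the data to the left and only the data to the right, and takes the location where the two one-sided estimates differ most as the estimated change point.

```python
def one_sided_mean(ts, ys, t, h, side):
    """Kernel-weighted mean of ys using only points on one side of t.

    side="left" uses points with t - h <= ti < t;
    side="right" uses points with t <= ti <= t + h.
    Returns None when no point carries positive weight.
    """
    num = 0.0
    den = 0.0
    for ti, yi in zip(ts, ys):
        u = (ti - t) / h
        if side == "left":
            inside = -1.0 <= u < 0.0
        else:
            inside = 0.0 <= u <= 1.0
        if inside:
            w = 1.0 - abs(u)  # triangular weight: an illustrative assumption
            num += w * yi
            den += w
    return num / den if den > 0 else None


def jump_profile(ts, ys, h):
    """Right-minus-left one-sided estimates at every design point."""
    profile = []
    for t in ts:
        left = one_sided_mean(ts, ys, t, h, "left")
        right = one_sided_mean(ts, ys, t, h, "right")
        if left is None or right is None:
            continue  # skip points where one side has no usable data
        profile.append((t, right - left))
    return profile


def estimate_single_jump(ts, ys, h):
    """Location and signed size of the largest one-sided discrepancy."""
    return max(jump_profile(ts, ys, h), key=lambda p: abs(p[1]))


if __name__ == "__main__":
    # Toy data: mean 0 before t = 0.5, mean 2 from t = 0.5 onwards.
    ts = [i / 200 for i in range(200)]
    ys = [0.0 if t < 0.5 else 2.0 for t in ts]
    print(estimate_single_jump(ts, ys, h=0.1))
```

Handling an unknown number of jumps, as the abstract describes, would need an additional rule for deciding which local maxima of the profile are genuine; that step is not attempted here.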
- Award ID(s): 1916226
- PAR ID: 10398620
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: Journal of the Royal Statistical Society Series B: Statistical Methodology
- Volume: 84
- Issue: 3
- ISSN: 1369-7412
- Format(s): Medium: X Size: p. 933-972
- Size(s): p. 933-972
- Sponsoring Org: National Science Foundation
More Like this
-
Online health communities (OHCs) provide free, open, and well-resourced platforms for patients, family members, and others to discuss illnesses, express feelings, and connect with others. Linguistic analysis of OHC posts can assist in better understanding disease conditions as well as monitoring the emotional and mental status of patients and those who are closely related. Many existing OHC linguistic analyses are limited by focusing on individual words. There are a handful of cooccurrence network analyses, which have multiple methodological limitations. In this article we analyze posts that are publicly available at the LUNGevity Foundation’s Lung Cancer Support Community (LCSC). The analyzed data contains 21,028 posts published between April 2018 and February 2022. For word cooccurrence network analysis, we develop a two-part latent space model, which advances from the existing ones by accommodating network weights. Further, we consider the scenario where there are change points in time, networks remain the same between two change points but differ on the two sides of a change point, and the number and locations of change points are unknown. A penalized fusion approach is developed to data-dependently determine change points and estimate networks. In data analysis multiple change points are identified, which reflect significant changes in lung cancer patients’ and their close affiliates’ emotional/mental status and mostly align with the changes in COVID-19. The obtained network structures and other findings are also sensible.
-
Abstract Without imposing prior distributional knowledge underlying multivariate time series of interest, we propose a nonparametric change-point detection approach to estimate the number of change points and their locations along the temporal axis. We develop a structural subsampling procedure such that the observations are encoded into multiple sequences of Bernoulli variables. A maximum likelihood approach in conjunction with a newly developed searching algorithm is implemented to detect change points on each Bernoulli process separately. Then, aggregation statistics are proposed to collectively synthesize change-point results from all individual univariate time series into consistent and stable location estimations. We also study a weighting strategy to measure the degree of relevance for different subsampled groups. Simulation studies are conducted and show that the proposed change-point methodology for multivariate time series has favorable performance compared with currently available state-of-the-art nonparametric methods under various settings with different degrees of complexity. Real data analyses are finally performed on categorical, ordinal, and continuous time series taken from the fields of genetics, climate, and finance.
-
Abstract We introduce a novel geometry‐oriented methodology, based on the emerging tools of topological data analysis, into the change‐point detection framework. The key rationale is that change points are likely to be associated with changes in geometry behind the data‐generating process. While the applications of topological data analysis to change‐point detection are potentially very broad, in this paper, we primarily focus on integrating topological concepts with the existing nonparametric methods for change‐point detection. In particular, the proposed new geometry‐oriented approach aims to enhance detection accuracy of distributional regime shift locations. Our simulation studies suggest that integration of topological data analysis with some existing algorithms for change‐point detection leads to consistently more accurate detection results. We illustrate our new methodology in application to two closely related environmental time series data sets, the ice phenology of Lake Baikal and the North Atlantic Oscillation indices, in a research query for a possible association between their estimated regime shift locations.
-
Abstract We consider inference problems for high-dimensional (HD) functional data with a dense number of T repeated measurements taken for a large number of p variables from a small number of n experimental units. The spatial and temporal dependence, high dimensionality, and dense number of repeated measurements pose theoretical and computational challenges. This paper has two aims; our first aim is to solve the theoretical and computational challenges in testing equivalence among covariance matrices from HD functional data. The second aim is to provide computationally efficient and tuning-free tools with guaranteed stochastic error control. The weak convergence of the stochastic process formed by the test statistics is established under the “large p, large T, and small n” setting. If the null is rejected, we further show that the locations of the change points can be estimated consistently. The estimator's rate of convergence is shown to depend on the data dimension, sample size, number of repeated measurements, and signal-to-noise ratio. We also show that our proposed computation algorithms can significantly reduce the computation time and are applicable to real-world data with a large number of HD-repeated measurements (e.g., functional magnetic resonance imaging (fMRI) data). Simulation results demonstrate both the finite sample performance and computational effectiveness of our proposed procedures. We observe that the empirical size of the test is well controlled at the nominal level, and the locations of multiple change points can be accurately identified. An application to fMRI data demonstrates that our proposed methods can identify event boundaries in the preface of the television series Sherlock. Code to implement the procedures is available in an R package named TechPhD.
