- NSF-PAR ID:
- 10155761
- Date Published:
- Journal Name:
- The American Statistician
- ISSN:
- 0003-1305
- Page Range / eLocation ID:
- 1 to 21
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
null (Ed.)Abstract Motivation The standard bootstrap method is used throughout science and engineering to perform general-purpose non-parametric resampling and re-estimation. Among the most widely cited and widely used such applications is the phylogenetic bootstrap method, which Felsenstein proposed in 1985 as a means to place statistical confidence intervals on an estimated phylogeny (or estimate ‘phylogenetic support’). A key simplifying assumption of the bootstrap method is that input data are independent and identically distributed (i.i.d.). However, the i.i.d. assumption is an over-simplification for biomolecular sequence analysis, as Felsenstein noted. Results In this study, we introduce a new sequence-aware non-parametric resampling technique, which we refer to as RAWR (‘RAndom Walk Resampling’). RAWR consists of random walks that synthesize and extend the standard bootstrap method and the ‘mirrored inputs’ idea of Landan and Graur. We apply RAWR to the task of phylogenetic support estimation. RAWR’s performance is compared to the state-of-the-art using synthetic and empirical data that span a range of dataset sizes and evolutionary divergence. We show that RAWR support estimates offer comparable or typically superior type I and type II error compared to phylogenetic bootstrap support. We also conduct a re-analysis of large-scale genomic sequence data from a recent study of Darwin’s finches. Our findings clarify phylogenetic uncertainty in a charismatic clade that serves as an important model for complex adaptive evolution. Availability and implementation Data and software are publicly available under open-source software and open data licenses at: https://gitlab.msu.edu/liulab/RAWR-study-datasets-and-scripts.more » « less
-
Discrete-event simulation models generate random variates from input distributions and compute outputs according to the simulation logic. The input distributions are typically fitted to finite real-world data and thus are subject to estimation errors that can propagate to the simulation outputs: an issue commonly known as input uncertainty (IU). This paper investigates quantifying IU using the output confidence intervals (CIs) computed from bootstrap quantile estimators. The standard direct bootstrap method has overcoverage due to convolution of the simulation error and IU; however, the brute-force way of washing away the former is computationally demanding. We present two new bootstrap methods to enhance direct resampling in both statistical and computational efficiencies using shrinkage strategies to down-scale the variabilities encapsulated in the CIs. Our asymptotic analysis shows how both approaches produce tight CIs accounting for IU under limited input data and simulation effort along with the simulation sample-size requirements relative to the input data size. We demonstrate performances of the shrinkage strategies with several numerical experiments and investigate the conditions under which each method performs well. We also show advantages of nonparametric approaches over parametric bootstrap when the distribution family is misspecified and over metamodel approaches when the dimension of the distribution parameters is high.
History: Accepted by Bruno Tuffin, Area Editor for Simulation.
Funding: This work was supported by the National Science Foundation [CAREER CMMI-1834710, CAREER CMMI-2045400, DMS-1854659, and IIS-1849280].
Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information ( https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2022.0044 ) as well as from the IJOC GitHub software repository ( https://github.com/INFORMSJoC/2022.0044 ). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/ .
-
Statistical resampling methods are widely used for confidence interval placement and as a data perturbation technique for statistical inference and learning. An important assumption of popular resampling methods such as the standard bootstrap is that input observations are identically and independently distributed (i.i.d.). However, within the area of computational biology and bioinformatics, many different factors can contribute to intra-sequence dependence, such as recombination and other evolutionary processes governing sequence evolution. The SEquential RESampling (“SERES”) framework was previously proposed to relax the simplifying assumption of i.i.d. input observations. SERES resampling takes the form of random walks on an input of either aligned or unaligned biomolecular sequences. This study introduces the first application of SERES random walks on aligned sequence inputs and is also the first to demonstrate the utility of SERES as a data perturbation technique to yield improved statistical estimates. We focus on the classical problem of recombination-aware local genealogical inference. We show in a simulation study that coupling SERES resampling and re-estimation with recHMM, a hidden Markov model-based method, produces local genealogical inferences with consistent and often large improvements in terms of topological accuracy. We further evaluate method performance using an empirical HIV genome sequence dataset.more » « less
-
We consider inference for the parameters of a linear model when the covariates are random and the relationship between response and covariates is possibly non-linear. Conventional inference methods such as z intervals perform poorly in these cases. We propose a double bootstrap-based calibrated percentile method, perc-cal, as a general-purpose CI method which performs very well relative to alternative methods in challenging situations such as these. The superior performance of perc-cal is demonstrated by a thorough, full-factorial design synthetic data study as well as a data example involving the length of criminal sentences. We also provide theoretical justification for the perc-cal method under mild conditions. The method is implemented in the R package "perccal", available through CRAN and coded primarily in C++, to make it easier for practitioners to use.more » « less
-
Summary A within-cluster resampling method is proposed for fitting a multilevel model in the presence of informative cluster size. Our method is based on the idea of removing the information in the cluster sizes by drawing bootstrap samples which contain a fixed number of observations from each cluster. We then estimate the parameters by maximizing an average, over the bootstrap samples, of a suitable composite loglikelihood. The consistency of the proposed estimator is shown and does not require that the correct model for cluster size is specified. We give an estimator of the covariance matrix of the proposed estimator, and a test for the noninformativeness of the cluster sizes. A simulation study shows, as in Neuhaus & McCulloch (2011), that the standard maximum likelihood estimator exhibits little bias for some regression coefficients. However, for those parameters which exhibit nonnegligible bias, the proposed method is successful in correcting for this bias.