

Title: Applications of the Fractional-Random-Weight Bootstrap
For several decades, the resampling-based bootstrap has been widely used for computing confidence intervals (CIs) for applications where no exact method is available. However, there are many applications where the resampling bootstrap method cannot be used. These include situations where the data are heavily censored due to the success response being a rare event, situations where there is insufficient mixing of successes and failures across the explanatory variable(s), and designed experiments where the number of parameters is close to the number of observations. These three situations all have in common that there may be a substantial proportion of the resamples where it is not possible to estimate all of the parameters in the model. This article reviews the fractional-random-weight bootstrap method and demonstrates how it can be used to avoid these problems and construct CIs in a way that is accessible to statistical practitioners. The fractional-random-weight bootstrap method is easy to use and has advantages over the resampling method in many challenging applications.
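The core idea can be illustrated with a short sketch. Instead of resampling observations with replacement (which can drop every rare success from a replicate), each replicate draws continuous positive weights that sum to n and refits the model by weighted maximum likelihood; scaled Dirichlet(1, ..., 1) weights are used here as one common choice. The data, model, and helper function below are hypothetical illustrations, not the article's case studies.

```python
# Minimal sketch of a fractional-random-weight bootstrap CI for a logistic
# regression with a rare success event. Data and model are hypothetical.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical data: 100 observations, only a handful of successes.
n = 100
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-3.0 + 1.0 * x)))        # rare event: ~5% successes
y = rng.binomial(1, p)
X = np.column_stack([np.ones(n), x])

def weighted_logistic_fit(X, y, w):
    """Weighted logistic MLE: minimize the weight-scaled negative log-likelihood."""
    def nll(beta):
        eta = X @ beta
        return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))
    return minimize(nll, x0=np.zeros(X.shape[1]), method="BFGS").x

# Fractional-random-weight bootstrap: continuous positive weights that sum to n
# (Dirichlet(1,...,1) scaled by n), so every observation keeps a nonzero weight
# and the rare successes are never lost from a replicate.
B = 500
boot_slopes = np.empty(B)
for b in range(B):
    w = rng.dirichlet(np.ones(n)) * n
    boot_slopes[b] = weighted_logistic_fit(X, y, w)[1]

ci = np.percentile(boot_slopes, [2.5, 97.5])
print("95% percentile CI for the slope:", ci)
```

With an ordinary resampling bootstrap, a non-trivial fraction of replicates would contain no successes at all and the slope could not be estimated; with fractional weights every replicate retains all observations.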
Award ID(s):
1904165 1838271
NSF-PAR ID:
10155761
Author(s) / Creator(s):
Date Published:
Journal Name:
The American Statistician
ISSN:
0003-1305
Page Range / eLocation ID:
1 to 21
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Abstract

    Motivation: The standard bootstrap method is used throughout science and engineering to perform general-purpose non-parametric resampling and re-estimation. Among the most widely cited and widely used such applications is the phylogenetic bootstrap method, which Felsenstein proposed in 1985 as a means to place statistical confidence intervals on an estimated phylogeny (or estimate ‘phylogenetic support’). A key simplifying assumption of the bootstrap method is that input data are independent and identically distributed (i.i.d.). However, the i.i.d. assumption is an over-simplification for biomolecular sequence analysis, as Felsenstein noted.

    Results: In this study, we introduce a new sequence-aware non-parametric resampling technique, which we refer to as RAWR (‘RAndom Walk Resampling’). RAWR consists of random walks that synthesize and extend the standard bootstrap method and the ‘mirrored inputs’ idea of Landan and Graur. We apply RAWR to the task of phylogenetic support estimation. RAWR’s performance is compared to the state-of-the-art using synthetic and empirical data that span a range of dataset sizes and evolutionary divergence. We show that RAWR support estimates offer comparable or typically superior type I and type II error rates compared to phylogenetic bootstrap support. We also conduct a re-analysis of large-scale genomic sequence data from a recent study of Darwin’s finches. Our findings clarify phylogenetic uncertainty in a charismatic clade that serves as an important model for complex adaptive evolution.

    Availability and implementation: Data and software are publicly available under open-source software and open data licenses at: https://gitlab.msu.edu/liulab/RAWR-study-datasets-and-scripts.
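    For reference, here is a minimal sketch of the standard (i.i.d.) phylogenetic bootstrap that RAWR generalizes: alignment columns are resampled with replacement and each replicate would then be re-analyzed. The toy alignment and the omitted tree-estimation step are placeholders, not part of the RAWR study.

```python
# Sketch of Felsenstein's i.i.d. column bootstrap on a toy alignment.
import numpy as np

rng = np.random.default_rng(0)

# Toy alignment: 4 taxa x 12 sites (rows = taxa, columns = sites).
alignment = np.array([list("ACGTACGTACGT"),
                      list("ACGTACGAACGT"),
                      list("ACGAACGTACGA"),
                      list("ACAAACGTACGA")])

n_sites = alignment.shape[1]
replicates = []
for _ in range(100):
    cols = rng.integers(0, n_sites, size=n_sites)   # i.i.d. column resampling
    replicates.append(alignment[:, cols])

# Each replicate would be passed to a tree-estimation method, and bipartition
# frequencies across replicates give the bootstrap support values.
```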
  2. Abstract

    Despite the wide application of meta‐analysis in ecology, some of the traditional methods used for meta‐analysis may not perform well given the type of data characteristic of ecological meta‐analyses.

    We reviewed published meta‐analyses on the ecological impacts of global climate change, evaluating the number of replicates used in the primary studies (n_i) and the number of studies or records (k) that were aggregated to calculate a mean effect size. We used the results of the review in a simulation experiment to assess the performance of conventional frequentist and Bayesian meta‐analysis methods for estimating a mean effect size and its uncertainty interval.

    Our literature review showed that n_i and k were highly variable, distributions were right‐skewed and were generally small (median n_i = 5, median k = 44). Our simulations show that the choice of method for calculating uncertainty intervals was critical for obtaining appropriate coverage (close to the nominal value of 0.95). When k was low (<40), 95% coverage was achieved by a confidence interval (CI) based on the t distribution that uses an adjusted standard error (the Hartung–Knapp–Sidik–Jonkman, HKSJ), or by a Bayesian credible interval, whereas bootstrap or z distribution CIs had lower coverage. Despite the importance of the method to calculate the uncertainty interval, 39% of the meta‐analyses reviewed did not report the method used, and of the 61% that did, 94% used a potentially problematic method, which may be a consequence of software defaults.

    In general, for a simple random‐effects meta‐analysis, the performance of the best frequentist and Bayesian methods was similar for the same combinations of factors (k and mean replication), though the Bayesian approach had higher than nominal (>95%) coverage for the mean effect when k was very low (k < 15). Our literature review suggests that many meta‐analyses that used z distribution or bootstrapping CIs may have overestimated the statistical significance of their results when the number of studies was low; more appropriate methods need to be adopted in ecological meta‐analyses.
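    As a concrete illustration of why the interval construction matters, the sketch below contrasts a z-based (Wald) interval with the HKSJ interval for a random-effects mean, using the DerSimonian–Laird moment estimator of the between-study variance for simplicity; the effect sizes and variances are simulated stand-ins, not data from the review.

```python
# z-based vs. HKSJ confidence intervals for a random-effects meta-analytic mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k = 12                                     # small number of studies
yi = rng.normal(0.3, 0.4, size=k)          # hypothetical study effect sizes
vi = rng.uniform(0.02, 0.2, size=k)        # hypothetical sampling variances

# DerSimonian-Laird estimate of the between-study variance tau^2
w = 1.0 / vi
y_fixed = np.sum(w * yi) / np.sum(w)
Q = np.sum(w * (yi - y_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)

# Random-effects weights and pooled mean
wr = 1.0 / (vi + tau2)
mu = np.sum(wr * yi) / np.sum(wr)

# z-based (Wald) interval: often too narrow when k is small
se_z = np.sqrt(1.0 / np.sum(wr))
ci_z = mu + np.array([-1, 1]) * stats.norm.ppf(0.975) * se_z

# HKSJ interval: adjusted standard error with a t_{k-1} quantile
q = np.sum(wr * (yi - mu) ** 2) / (k - 1)
se_hksj = np.sqrt(q / np.sum(wr))
ci_hksj = mu + np.array([-1, 1]) * stats.t.ppf(0.975, df=k - 1) * se_hksj

print("z CI:   ", ci_z)
print("HKSJ CI:", ci_hksj)
```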

     
  3. Statistical resampling methods are widely used for confidence interval placement and as a data perturbation technique for statistical inference and learning. An important assumption of popular resampling methods such as the standard bootstrap is that input observations are independent and identically distributed (i.i.d.). However, within the area of computational biology and bioinformatics, many different factors can contribute to intra-sequence dependence, such as recombination and other evolutionary processes governing sequence evolution. The SEquential RESampling (“SERES”) framework was previously proposed to relax the simplifying assumption of i.i.d. input observations. SERES resampling takes the form of random walks on an input of either aligned or unaligned biomolecular sequences. This study introduces the first application of SERES random walks on aligned sequence inputs and is also the first to demonstrate the utility of SERES as a data perturbation technique to yield improved statistical estimates. We focus on the classical problem of recombination-aware local genealogical inference. We show in a simulation study that coupling SERES resampling and re-estimation with recHMM, a hidden Markov model-based method, produces local genealogical inferences with consistent and often large improvements in terms of topological accuracy. We further evaluate method performance using an empirical HIV genome sequence dataset.
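    To make the random-walk idea concrete, here is a deliberately simplified, hypothetical resampler over aligned site indices: the walk keeps neighboring (dependent) sites together rather than drawing columns i.i.d. The step rule, reversal probability, and function name are illustrative assumptions, not the published SERES algorithm.

```python
# Illustrative random-walk resampling of aligned site indices (not the
# published SERES algorithm; parameters are placeholders).
import numpy as np

def random_walk_resample(n_sites, walk_length=None, reverse_prob=0.05, rng=None):
    """Resample site indices by walking left/right along the alignment,
    occasionally reversing direction, so neighboring sites stay together."""
    rng = rng or np.random.default_rng()
    walk_length = walk_length or n_sites
    idx = [int(rng.integers(0, n_sites))]          # random starting site
    direction = 1 if rng.random() < 0.5 else -1
    while len(idx) < walk_length:
        if rng.random() < reverse_prob:            # occasional direction reversal
            direction *= -1
        nxt = idx[-1] + direction
        if nxt < 0 or nxt >= n_sites:              # reflect at the alignment ends
            direction *= -1
            nxt = idx[-1] + direction
        idx.append(nxt)
    return np.array(idx)

# Example: one replicate's column indices for a 1,000-site alignment.
cols = random_walk_resample(1000, rng=np.random.default_rng(42))
```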
  4.
    SUMMARY Horizontal slowness vector measurements using array techniques have been used to analyse many Earth phenomena from lower mantle heterogeneity to meteorological event location. While providing observations essential for studying much of the Earth, slowness vector analysis is limited by the necessary and subjective visual inspection of observations. Furthermore, it is challenging to determine the uncertainties caused by limitations of array processing such as array geometry, local structure, noise and their effect on slowness vector measurements. To address these issues, we present a method to automatically identify seismic arrivals and measure their slowness vector properties with uncertainty bounds. We do this by bootstrap sampling waveforms, thereby also creating random subarrays, then using linear beamforming to measure the coherent power at a range of slowness vectors. For each bootstrap sample, we take the top N peaks from each power distribution as the slowness vectors of possible arrivals. The slowness vectors of all bootstrap samples are gathered and the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is used to identify arrivals as clusters of slowness vectors. The mean of slowness vectors in each cluster gives the slowness vector measurement for that arrival and the distribution of slowness vectors in each cluster gives the uncertainty estimate. We tuned the parameters of DBSCAN using a data set of 2489 SKS and SKKS observations at a range of frequency bands from 0.1 to 1 Hz. We then present examples at higher frequencies (0.5–2.0 Hz) than the tuning data set, identifying PKP precursors, and at lower frequencies (0.04–0.06 Hz), identifying multipathing in surface waves. While we use a linear beamforming process, this method can be implemented with any beamforming process such as cross-correlation beamforming or phase-weighted stacking. This method allows for much larger data sets to be analysed without visual inspection of data. Phenomena such as multipathing, reflections or scattering can be identified automatically in body or surface waves and their properties analysed with uncertainties.
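    The clustering step can be sketched as follows: slowness-vector peaks pooled across bootstrap samples are grouped with DBSCAN, and each cluster's mean and spread give the arrival's measurement and its uncertainty. The synthetic peak values and the eps/min_samples settings below are placeholders, not the tuned parameters from the SKS/SKKS data set.

```python
# Cluster bootstrap slowness-vector peaks into arrivals with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)

# Pretend each of 200 bootstrap samples yielded peaks scattered around two
# true arrivals (east/north slowness components), plus a few noise peaks.
arrival_a = rng.normal([4.0, 1.0], 0.15, size=(200, 2))
arrival_b = rng.normal([6.5, -2.0], 0.25, size=(200, 2))
noise = rng.uniform(-10, 10, size=(15, 2))
peaks = np.vstack([arrival_a, arrival_b, noise])

# DBSCAN: points with at least `min_samples` neighbours within `eps` form a
# cluster; sparse noise peaks are labelled -1 and ignored.
labels = DBSCAN(eps=0.5, min_samples=20).fit_predict(peaks)

for lab in sorted(set(labels) - {-1}):
    cluster = peaks[labels == lab]
    mean = cluster.mean(axis=0)            # slowness-vector measurement
    std = cluster.std(axis=0)              # uncertainty estimate
    print(f"arrival {lab}: slowness ~ {mean}, spread ~ {std}")
```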
  5. Quantifying uncertainty in forest assessments is challenging because of the number of sources of error and the many possible approaches to quantify and propagate them. The uncertainty in allometric equations has sometimes been represented by propagating uncertainty only in the prediction of individuals, but at large scales with large numbers of trees uncertainty in model fit is more important than uncertainty in individuals. We compared four different approaches to representing model uncertainty: a formula for the confidence interval, Monte Carlo sampling of the slope and intercept of the regression, bootstrap resampling of the allometric data, and a Bayesian approach. We applied these approaches to propagating model uncertainty at four different scales of tree inventory (10 to 10,000 trees) for four study sites with varying allometry and model fit statistics, ranging from a monocultural plantation to a multi-species shrubland with multi-stemmed trees. We found that the four approaches to quantifying uncertainty in model fit were in good agreement, except that bootstrapping resulted in higher uncertainty at the site with the fewest trees in the allometric data set (48), because outliers could be represented multiple times or not at all in each sample. The uncertainty in model fit did not vary with the number of trees in the inventory to which it was applied. In contrast, the uncertainty in predicting individuals was higher than model fit uncertainty when applied to small numbers of trees, but became negligible with 10,000 trees. The importance of this uncertainty source varied with the forest type, being largest for the shrubland, where the model fit was poorest. Low uncertainties were observed where model fit was high, as was the case in the monoculture plantation and in the subtropical jungle where hundreds of trees contributed to the allometric model. In all cases, propagating uncertainty only in the prediction of individuals would underestimate allometric uncertainty. It will always be most correct to include both uncertainty in predicting individuals and uncertainty in model fit, but when large numbers of individuals are involved, as in the case of national forest inventories, the contribution of uncertainty in predicting individuals can be ignored. When the number of trees is small, as may be the case in forest manipulation studies, both sources of allometric uncertainty are likely important and should be accounted for.
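    A simplified sketch of two of the uncertainty sources compared above, applied to a hypothetical log-log biomass allometry: Monte Carlo draws of the fitted slope and intercept represent model-fit uncertainty, and adding residual error per tree represents uncertainty in predicting individuals; the data, model form, and inventory sizes are invented for illustration.

```python
# Propagating allometric model-fit vs. individual-prediction uncertainty to a
# summed inventory biomass (hypothetical log-log allometry).
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical allometric calibration data: 48 destructively sampled trees.
dbh = rng.uniform(5, 50, size=48)
log_biomass = -2.0 + 2.4 * np.log(dbh) + rng.normal(0, 0.3, size=48)

# OLS fit of log(biomass) on log(dbh), with the coefficient covariance matrix
# needed for Monte Carlo sampling of the slope and intercept.
X = np.column_stack([np.ones_like(dbh), np.log(dbh)])
beta, *_ = np.linalg.lstsq(X, log_biomass, rcond=None)
resid = log_biomass - X @ beta
sigma2 = resid @ resid / (len(dbh) - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)

def total_biomass_cv(inventory_dbh, n_draws=2000, individual_error=True):
    """Relative uncertainty (CV) of summed biomass under the chosen error sources."""
    Xi = np.column_stack([np.ones_like(inventory_dbh), np.log(inventory_dbh)])
    totals = np.empty(n_draws)
    for d in range(n_draws):
        b = rng.multivariate_normal(beta, cov_beta)            # model-fit draw
        log_pred = Xi @ b
        if individual_error:                                   # per-tree residual draw
            log_pred = log_pred + rng.normal(0, np.sqrt(sigma2), size=len(inventory_dbh))
        totals[d] = np.sum(np.exp(log_pred))
    return totals.std() / totals.mean()

# With 10 trees the per-tree errors matter; with 10,000 they average out and
# model-fit uncertainty dominates.
for n_trees in (10, 10_000):
    inv = rng.uniform(5, 50, size=n_trees)
    print(n_trees, "trees | model fit only:",
          round(total_biomass_cv(inv, individual_error=False), 3),
          "| both sources:", round(total_biomass_cv(inv), 3))
```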