Abstract The Patterson F- and D-statistics are commonly used measures for quantifying population relationships and for testing hypotheses about demographic history. These statistics make use of allele frequency information across populations to infer different aspects of population history, such as population structure and introgression events. Inclusion of related or inbred individuals can bias such statistics, which may often lead to the filtering of such individuals. Here, we derive statistical properties of the F- and D-statistics, including their biases due to the inclusion of related or inbred individuals, their variances, and their corresponding mean squared errors. Moreover, for those statistics that are biased, we develop unbiased estimators and evaluate the variances of these new quantities. Comparisons of the new unbiased statistics to the originals demonstrate that our newly derived statistics often have lower error across a wide population parameter space. Furthermore, we apply these unbiased estimators using several global human populations with the inclusion of related individuals to highlight their application on an empirical dataset. Finally, we implement these unbiased estimators in the open-source software package funbiased for easy application by the scientific community.
Statistical Analysis with Linked Data
Summary Computerised Record Linkage methods help us combine multiple data sets from different sources when a single data set with all necessary information is unavailable or when data collection on additional variables is time-consuming and extremely costly. Linkage errors are inevitable in the linked data set because of the unavailability of error‐free unique identifiers. Even a small number of linkage errors can lead to substantial bias and increased variability in estimating parameters of a statistical model. In this paper, we propose a unified theory for statistical analysis with linked data. Our proposed method, unlike the ones available for secondary data analysis of linked data, exploits record linkage process data as an alternative to taking a costly sample to evaluate error rates from the record linkage procedure. A jackknife method is introduced to estimate bias, covariance matrix and mean squared error of our proposed estimators. Simulation results are presented to evaluate the performance of the proposed estimators that account for linkage errors.
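The summary above mentions a jackknife method for estimating bias and variability. As a rough illustration of the general idea only, the following is a minimal sketch of the textbook delete-one jackknife for a generic estimator. It is not the linkage-error-adjusted procedure of the paper; the function name `jackknife` and the returned dictionary keys are hypothetical choices made for this example.

```python
import numpy as np

def jackknife(data, estimator):
    """Textbook delete-one jackknife (illustrative; not the paper's estimator).

    `estimator` is any function mapping a sample to a scalar or vector.
    Returns the full-sample estimate, the jackknife bias and variance
    estimates, and the bias-corrected estimate.
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    theta_full = estimator(data)
    # Recompute the estimator with each observation left out in turn.
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    loo_mean = loo.mean(axis=0)
    bias = (n - 1) * (loo_mean - theta_full)
    variance = (n - 1) / n * np.sum((loo - loo_mean) ** 2, axis=0)
    return {
        "estimate": theta_full,
        "bias": bias,
        "variance": variance,
        "corrected": theta_full - bias,
    }

# Usage: the plug-in variance (divisor n) is biased; the jackknife
# correction recovers the usual unbiased sample variance (divisor n - 1).
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])
result = jackknife(x, np.var)
```

An estimate of mean squared error can then be formed, under the usual approximation, as the jackknife variance plus the squared jackknife bias.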
- Award ID(s):
- 1758808
- PAR ID:
- 10078219
- Publisher / Repository:
- Wiley-Blackwell
- Date Published:
- Journal Name:
- International Statistical Review
- Volume:
- 87
- Issue:
- S1
- ISSN:
- 0306-7734
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Summary The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under‐coverage in the big data source to represent the population of interest and measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
Abstract Understanding the nature and origin of errors in satellite precipitation products is important for applications and product improvement. Here we propose a new error decomposition scheme incorporating precipitation event (continuous rainy periods) information to characterize satellite errors. Under this framework, the errors are attributed to the inaccuracies in event occurrence, timing (event start/end time), and intensity. The Integrated MultisatellitE Retrieval for Global Precipitation Measurement (IMERG) is used as our test product to apply the method over CONUS. The above‐listed factors contribute approximately 30%, 20%, and 50% to the total bias, respectively. Significant asymmetry exists in the temporal distribution of biases throughout events: early event endings cause threefold more precipitation amount bias than late event beginnings, while early event beginnings cause fourfold more bias than late event endings. Dominant contributors vary across seasons and regions. The proposed error decomposition provides insight into sources of error for improved retrievals.
Abstract Phenology is one of the most immediate responses to global climate change, but data limitations have made examining phenology patterns across greater taxonomic, spatial and temporal scales challenging. One significant opportunity is leveraging rapidly increasing data resources from digitized museum specimens and community science platforms, but this assumes reliable statistical methods are available to estimate phenology using presence‐only data. Estimating the onset or offset of key events is especially difficult with incidental data, as lower data densities occur towards the tails of an abundance distribution. The Weibull distribution has been recognized as an appropriate distribution to estimate phenology based on presence‐only data, but Weibull‐informed estimators are only available for onset and offset. We describe the mathematical framework for a new Weibull‐parameterized estimator of phenology appropriate for any percentile of a distribution and make it available in an R package, phenesse. We use simulations and empirical data on open flower timing and first arrival of monarch butterflies to quantify the accuracy of our estimator and other commonly used phenological estimators for 10 phenological metrics: onset, mean and offset dates, as well as the 1st, 5th, 10th, 50th, 90th, 95th and 99th percentile dates. Root mean squared errors and mean bias of the phenological estimators were calculated for different patterns of abundance and observation processes. Results show a general pattern of decay in performance of estimates when moving from mean estimates towards the tails of the seasonal abundance curve, suggesting that onset and offset continue to be the most difficult phenometrics to estimate. However, with simple phenologies and enough observations, our newly developed estimator can provide useful onset and offset estimates. This is especially true for the start of the season, when incidental observations may be more common. Our simulation demonstrates the potential of generating accurate phenological estimates from presence‐only data and guides the best use of estimators. The estimator that we developed, phenesse, is the least biased and has the lowest estimation error for onset estimates under most simulated and empirical conditions examined, improving the robustness of these estimates for phenological research.
Summary We study quantile trend filtering, a recently proposed method for nonparametric quantile regression, with the goal of generalizing existing risk bounds for the usual trend-filtering estimators that perform mean regression. We study both the penalized and the constrained versions, of order $r \geqslant 1$, of univariate quantile trend filtering. Our results show that both the constrained and the penalized versions of order $r \geqslant 1$ attain the minimax rate up to logarithmic factors, when the $(r-1)$th discrete derivative of the true vector of quantiles belongs to the class of bounded-variation signals. Moreover, we show that if the true vector of quantiles is a discrete spline with a few polynomial pieces, then both versions attain a near-parametric rate of convergence. Corresponding results for the usual trend-filtering estimators are known to hold only when the errors are sub-Gaussian. In contrast, our risk bounds are shown to hold under minimal assumptions on the error variables. In particular, no moment assumptions are needed and our results hold under heavy-tailed errors. Our proof techniques are general, and thus can potentially be used to study other nonparametric quantile regression methods. To illustrate this generality, we employ our proof techniques to obtain new results for multivariate quantile total-variation denoising and high-dimensional quantile linear regression.