Title: Computing Accurate Probabilistic Estimates of One-D Entropy from Equiprobable Random Samples
We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) and Kernel Density (KD) methods. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. Whereas BC and KD each require optimal tuning of a hyper-parameter whose value varies with sample size and shape of the pdf, QS requires specification only of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantiles is a fixed fraction of the sample size (empirically determined to be ~0.25–0.35), and that this value is relatively insensitive to distributional form or sample size. This provides a clear advantage over BC and KD, since hyper-parameter tuning is not required. Further, unlike KD, there is no need to select an appropriate kernel type, so QS is applicable to pdfs of arbitrary shape, including those with discontinuous slope and/or magnitude. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate, and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential, and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is low even for samples of as few as 100 data points; in contrast, the small-sample bias can be as large as −10% for KD and as large as −50% for BC. We speculate that estimating quantile locations, rather than bin probabilities, results in more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.
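Because the estimator reduces to a sum over equal-probability-mass intervals, it can be sketched in a few lines. The following Python sketch is illustrative only; it is not the authors' reference implementation, and the handling of the outermost intervals and the default fraction of 0.3 are assumptions based on the abstract.

```python
import numpy as np

def qs_entropy(sample, frac=0.3):
    """Quantile Spacing (QS) style estimate of differential entropy (nats).

    Minimal sketch: K ~ frac * N interior quantiles split the sample's
    support into K + 1 intervals of (approximately) equal probability
    mass p = 1/(K + 1); entropy is approximated as sum_i p * log(width_i / p).
    frac ~ 0.25-0.35 follows the empirical result quoted in the abstract.
    Assumes continuous-valued data (no tied interval edges).
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    k = max(int(frac * n), 1)                    # number of interior quantiles
    levels = np.arange(1, k + 1) / (k + 1)       # interior quantile levels
    edges = np.concatenate(([x[0]], np.quantile(x, levels), [x[-1]]))
    widths = np.diff(edges)                      # interval widths
    p = 1.0 / (k + 1)                            # equal mass per interval
    return float(np.sum(p * np.log(widths / p)))

# Example: a standard normal has true entropy 0.5*log(2*pi*e) ~ 1.4189 nats.
rng = np.random.default_rng(0)
print(qs_entropy(rng.normal(size=1000)))
```

Bootstrapping this estimate, i.e. recomputing it on resamples of the data drawn with replacement, then yields the sampling variability distribution discussed above.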
Award ID(s): 1740858
NSF-PAR ID: 10294762
Author(s) / Creator(s):
Date Published:
Journal Name: Entropy
Volume: 23
Issue: 6
ISSN: 1099-4300
Page Range / eLocation ID: 740
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Home range estimation is routine practice in ecological research. While advances in animal tracking technology have increased our capacity to collect data to support home range analysis, these same advances have also resulted in increasingly autocorrelated data. Consequently, the question of which home range estimator to use on modern, highly autocorrelated tracking data remains open. This question is particularly relevant given that most estimators assume independently sampled data. Here, we provide a comprehensive evaluation of the effects of autocorrelation on home range estimation. We base our study on an extensive data set of GPS locations from 369 individuals representing 27 species distributed across five continents. We first assemble a broad array of home range estimators, including Kernel Density Estimation (KDE) with four bandwidth optimizers (Gaussian reference function, autocorrelated-Gaussian reference function [AKDE], Silverman's rule of thumb, and least squares cross-validation), Minimum Convex Polygon, and Local Convex Hull methods. Notably, all of these estimators except AKDE assume independent and identically distributed (IID) data. We then employ half-sample cross-validation to objectively quantify estimator performance, and the recently introduced effective sample size for home range area estimation to quantify the information content of each data set. We found that AKDE 95% area estimates were larger than conventional IID-based estimates by a mean factor of 2. The median proportion of cross-validated locations included in the hold-out sets by AKDE 95% (or 50%) estimates was 95.3% (or 50.1%), confirming that the larger AKDE ranges were appropriately selective at the specified quantile. Conversely, conventional estimates exhibited negative bias that increased with decreasing effective sample size. To contextualize our empirical results, we performed a detailed simulation study to tease apart how sampling frequency, sampling duration, and the focal animal's movement conspire to affect range estimates. Paralleling our empirical results, the simulation study demonstrated that AKDE was generally more accurate than conventional methods, particularly for small effective sample sizes. While 72% of the 369 empirical data sets had >1,000 total observations, only 4% had an effective sample size >1,000, whereas 30% had an effective sample size <30. In this frequently encountered scenario of small effective sample size, AKDE was the only estimator capable of producing an accurate home range estimate on autocorrelated data.
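    For reference, the conventional (IID) estimator that AKDE is compared against can be sketched compactly: fit a kernel density to the relocations and take the area of the region enclosing 95% of the estimated probability mass. The Python sketch below uses Silverman's rule of thumb, one of the bandwidth optimizers listed above; the grid resolution and padding are illustrative assumptions, and this is not the AKDE estimator, which additionally conditions on a fitted continuous-time movement model to account for autocorrelation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_home_range_area(xy, quantile=0.95, grid_size=200, pad=0.2):
    """Area of a conventional (IID) KDE home range at a given quantile.

    Minimal sketch: fit a Gaussian KDE with Silverman's rule-of-thumb
    bandwidth, evaluate it on a regular grid, and return the area of the
    highest-density grid cells that together enclose `quantile` of the
    probability mass. `xy` is an (n, 2) array of relocations in
    projected coordinates (e.g. metres); the grid should be padded
    enough to cover essentially all of the probability mass.
    """
    xy = np.asarray(xy, dtype=float)
    kde = gaussian_kde(xy.T, bw_method="silverman")

    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = hi - lo
    xs = np.linspace(lo[0] - pad * span[0], hi[0] + pad * span[0], grid_size)
    ys = np.linspace(lo[1] - pad * span[1], hi[1] + pad * span[1], grid_size)
    gx, gy = np.meshgrid(xs, ys)
    dens = kde(np.vstack([gx.ravel(), gy.ravel()]))

    cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
    order = np.argsort(dens)[::-1]                  # densest cells first
    cum_mass = np.cumsum(dens[order]) * cell_area   # approximate enclosed mass
    n_cells = np.searchsorted(cum_mass, quantile) + 1
    return n_cells * cell_area
```

    Half-sample cross-validation, as described above, would then score such an estimate by the fraction of held-out relocations falling inside the estimated range.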

     
  2. Predicting the process of porosity-based ductile damage in polycrystalline metallic materials is an essential practical topic. Ductile damage and its precursors are represented by extreme values in stress and material state quantities, the spatial probability density functions (PDFs) of which are highly non-Gaussian with strong fat tails. Traditional deterministic forecasts utilizing sophisticated continuum-based physical models generally fail to represent the statistics of structural evolution during material deformation. Computational tools which do represent complex structural evolution are typically expensive. The inevitable model error and the lack of uncertainty quantification may also induce significant forecast biases, especially in predicting the extreme events associated with ductile damage. In this paper, a data-driven statistical reduced-order modeling framework is developed to provide a probabilistic forecast of the deformation process of a polycrystal aggregate leading to porosity-based ductile damage with uncertainty quantification. The framework starts with computing the time evolution of the leading few moments of specific state variables from the spatiotemporal solution of full-field polycrystal simulations. Then a sparse model identification algorithm based on causation entropy, including essential physical constraints, is utilized to discover the governing equations of these moments. An approximate solution of the time evolution of the PDF is obtained from the predicted moments exploiting the maximum entropy principle. Numerical experiments based on polycrystal realizations of a representative body-centered cubic (BCC) tantalum illustrate a skillful reduced-order model in characterizing the time evolution of the non-Gaussian PDF of the von Mises stress and quantifying the probability of extreme events. The learning process also reveals that the mean stress is not simply an additive forcing to drive the higher-order moments and extreme events. Instead, it interacts with the latter in a strongly nonlinear and multiplicative fashion. In addition, the calibrated moment equations provide a reasonably accurate forecast when applied to realizations outside the training data set, indicating the robustness of the model and the skill for extrapolation. Finally, an information-based measure is employed to quantitatively justify that the leading four moments are sufficient to characterize the crucial highly non-Gaussian features throughout the entire deformation history considered.
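    The moments-to-PDF step uses a standard construction: the maximum-entropy density consistent with a set of moment constraints has exponential-family form, and its Lagrange multipliers solve a convex dual problem. A minimal Python sketch on a bounded grid is given below; it is not the authors' implementation, and the choice of raw moments, grid, and optimizer are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def maxent_pdf(moments, grid):
    """Maximum-entropy pdf on a bounded grid matching given raw moments.

    The max-ent density has the form p(x) = exp(sum_k lam_k * x**k) / Z;
    the multipliers are found by minimizing the convex dual
    log Z(lam) - sum_k lam_k * m_k, whose gradient vanishes when the
    moments of p match the targets (m_1, ..., m_K).
    """
    m = np.asarray(moments, dtype=float)
    powers = np.vstack([grid**j for j in range(1, m.size + 1)])
    dx = np.gradient(grid)                         # grid cell widths

    def dual(lam):
        s = lam @ powers
        c = s.max()                                # numerical stabilization
        return c + np.log(np.sum(np.exp(s - c) * dx)) - lam @ m

    lam = minimize(dual, np.zeros(m.size), method="BFGS").x
    s = lam @ powers
    p = np.exp(s - s.max())
    return p / np.sum(p * dx)                      # normalized pdf on the grid

# Example: the first four raw moments of a standard normal (0, 1, 0, 3)
# should recover an approximately Gaussian pdf on [-6, 6].
grid = np.linspace(-6.0, 6.0, 1201)
pdf = maxent_pdf([0.0, 1.0, 0.0, 3.0], grid)
```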
  3. Abstract

    Although prior studies have evaluated the role of sampling errors associated with local and regional methods to estimate peak flow quantiles, the investigation of epistemic errors is more difficult because the underlying properties of the random variable have been prescribed using ad-hoc characterizations of the regional distributions of peak flows. This study addresses this challenge using representations of regional peak flow distributions derived from a combined framework of stochastic storm transposition, radar rainfall observations, and distributed hydrologic modeling. The authors evaluated four commonly used peak flow quantile estimation methods using synthetic peak flows at 5,000 sites in the Turkey River watershed in Iowa, USA. They first performed at-site flood frequency analysis using the Pearson Type III distribution fitted with L-moments. The authors then pooled regional information using (1) the index flood method, (2) the quantile regression technique, and (3) parameter regression. This approach allowed quantification of error components stemming from epistemic assumptions, parameter estimation method, sample size, and, in the regional approaches, the number of pooled sites. The results demonstrate that the inability to capture the spatial variability of the skewness of the peak flows dominates epistemic error for regional methods. We concluded that, in the study basin, this variability could be partially explained by river network structure and the predominant orientation of the watershed. The general approach used in this study is promising in that it brings new tools and sources of data to the study of the old hydrologic problem of flood frequency analysis.
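    The at-site analysis rests on sample L-moments, which are linear combinations of probability-weighted moments of the ordered data; the Pearson Type III parameters then follow from these via standard L-moment relations (Hosking and Wallis), which are omitted here. A minimal Python sketch of the L-moment computation (not the authors' code) is shown below; in the index-flood approach, peak flows at each site would typically be scaled by a site-specific index such as l1 before pooling.

```python
import numpy as np

def sample_lmoments(x):
    """First four sample L-moment statistics (l1, l2, t3, t4) of a data vector.

    Uses the unbiased probability-weighted moments b0..b3 of the ordered
    sample (requires n >= 4); l1 is the mean, l2 a scale measure,
    t3 = l3/l2 the L-skewness, and t4 = l4/l2 the L-kurtosis. These are
    the inputs to an L-moment fit of a distribution such as Pearson Type III.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)                        # ranks 1..n
    b0 = x.mean()
    b1 = np.sum((i - 1) * x) / (n * (n - 1))
    b2 = np.sum((i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3 / l2, l4 / l2
```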

     
  4. Kidney exchanges allow patients with end-stage renal disease to find a lifesaving living donor by way of an organized market. However, not all patients are equally easy to match, nor are all donor organs of equal quality---some patients are matched within weeks, while others may wait for years with no match offers at all. We propose the first decision-support tool for kidney exchange that takes as input the biological features of a patient-donor pair, and returns (i) the probability of being matched prior to expiry, and, conditioned on a match outcome, (ii) the waiting time for, and (iii) the organ quality of, the matched transplant. This information may be used to inform medical and insurance decisions. We predict all quantities (i, ii, iii) exclusively from match records that are readily available in any kidney exchange using a quantile random forest approach. To evaluate our approach, we developed two state-of-the-art realistic simulators based on data from the United Network for Organ Sharing that sample from the training and test distributions for these learning tasks---in our application these distributions are distinct. We analyze distributional shift through a theoretical lens, and show that the two distributions converge as the kidney exchange nears steady-state. We then show that our approach produces clinically promising estimates using simulated data. Finally, we show how our approach, in conjunction with tools from the model explainability literature, can be used to calibrate and detect bias in matching policies.
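    The prediction step amounts to fitting conditional-quantile models of waiting time (and, analogously, organ quality) to historical match records, alongside a classifier for the probability of matching before expiry. The sketch below uses scikit-learn's gradient boosting with quantile loss as a readily available stand-in for the quantile random forest the authors employ; the feature matrix and targets are synthetic placeholders, not real match records.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_waiting_time_quantiles(X, wait_times, quantiles=(0.1, 0.5, 0.9)):
    """Fit one conditional-quantile regressor per requested quantile.

    X          : (n, d) array of patient-donor pair features
    wait_times : (n,) observed waiting times for matched pairs
    Returns a dict mapping quantile -> fitted regressor.
    """
    models = {}
    for q in quantiles:
        gbr = GradientBoostingRegressor(loss="quantile", alpha=q,
                                        n_estimators=300, max_depth=3)
        models[q] = gbr.fit(X, wait_times)
    return models

# Illustrative usage on synthetic placeholder data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
wait = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=500))
models = fit_waiting_time_quantiles(X, wait)
pred_median_wait = models[0.5].predict(X[:5])
```

    A separate probabilistic classifier (for example, gradient-boosted trees with a logistic loss) would supply the match-before-expiry probability in the same framework.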

     
  5. ABSTRACT

    We evaluate the consistency between lensing and clustering based on measurements from the Baryon Oscillation Spectroscopic Survey combined with galaxy–galaxy lensing from the Dark Energy Survey (DES) Year 3, the Hyper Suprime-Cam Subaru Strategic Program (HSC) Year 1, and the Kilo-Degree Survey (KiDS)-1000. We find good agreement between these lensing data sets. We model the observations using the Dark Emulator and fit the data at two fixed cosmologies: Planck (S8 = 0.83) and a Lensing cosmology (S8 = 0.76). For a joint analysis limited to large scales, we find that both cosmologies provide an acceptable fit to the data. Full utilization of the higher signal-to-noise small-scale measurements is hindered by uncertainty in the impact of baryon feedback and assembly bias, which we account for with a reasoned theoretical error budget. We incorporate a systematic inconsistency parameter for each redshift bin, A, that decouples the lensing and clustering. With a wide range of scales, we find different results for the consistency between the two cosmologies. Limiting the analysis to the bins for which the impact of the lens sample selection is expected to be minimal, for the Lensing cosmology the measurements are consistent with A = 1; A = 0.91 ± 0.04 (A = 0.97 ± 0.06) using DES+KiDS (HSC). For the Planck case, we find a discrepancy: A = 0.79 ± 0.03 (A = 0.84 ± 0.05) using DES+KiDS (HSC). We demonstrate that a kinematic Sunyaev–Zeldovich-based estimate for baryonic effects alleviates some of the discrepancy in the Planck cosmology. This analysis demonstrates the statistical power of small-scale measurements; however, caution is still warranted given modelling uncertainties and foreground sample selection effects.

     