Title: Computing Accurate Probabilistic Estimates of One-D Entropy from Equiprobable Random Samples
We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) and Kernel Density (KD) methods. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. Whereas BC and KD each require optimal tuning of a hyper-parameter whose value varies with sample size and shape of the pdf, QS requires only specification of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantiles is a fixed fraction of the sample size (empirically determined to be ~0.25–0.35), and that this value is relatively insensitive to distributional form or sample size. This provides a clear advantage over BC and KD, since hyper-parameter tuning is not required. Further, unlike KD, there is no need to select an appropriate kernel type, so QS is applicable to pdfs of arbitrary shape, including those with discontinuous slope and/or magnitude. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate, and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential, and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is low even for samples as small as 100 data points; in contrast, the small-sample bias can be as large as −10% for KD and as large as −50% for BC. We speculate that estimating quantile locations, rather than bin probabilities, makes more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.
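For concreteness, the following is a minimal Python sketch of a quantile-spacing style entropy estimator with a bootstrap wrapper. The piecewise-uniform density construction, the default quantile fraction alpha=0.3 (motivated by the ~0.25–0.35 range reported above), and the function names are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def qs_entropy(sample, alpha=0.3):
    """Quantile-spacing style entropy estimate (illustrative sketch).

    alpha: number of quantiles as a fraction of the sample size
    (the abstract reports ~0.25-0.35 as empirically near-optimal).
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    k = max(1, int(round(alpha * n)))            # interior quantiles
    probs = np.arange(1, k + 1) / (k + 1)        # equal-mass split points
    edges = np.concatenate(([x[0]], np.quantile(x, probs), [x[-1]]))
    spacings = np.diff(edges)
    mass = 1.0 / (k + 1)                         # probability per interval
    # Piecewise-uniform density p_i = mass / spacing_i on interval i, so
    # H = -sum_i mass * log(p_i) = sum_i mass * log(spacing_i / mass).
    return float(np.sum(mass * np.log(spacings / mass)))

def qs_entropy_bootstrap(sample, n_boot=500, alpha=0.3, seed=0):
    """Bootstrap approximation of the estimator's sampling distribution."""
    rng = np.random.default_rng(seed)
    x = np.asarray(sample, dtype=float)
    return np.array([qs_entropy(rng.choice(x, size=x.size), alpha)
                     for _ in range(n_boot)])

# Example: a standard normal has H = 0.5 * log(2*pi*e) ~ 1.4189
draws = qs_entropy_bootstrap(np.random.default_rng(1).normal(size=1000))
print(draws.mean(), draws.std())
```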
Award ID(s):
1740858
PAR ID:
10294762
Author(s) / Creator(s):
Date Published:
Journal Name:
Entropy
Volume:
23
Issue:
6
ISSN:
1099-4300
Page Range / eLocation ID:
740
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A set of probabilities along with corresponding quantiles is often used to define a predictive distribution or probabilistic forecast. These quantile predictions offer easily interpreted uncertainty about an event, and quantiles are generally straightforward to estimate using standard statistical and machine learning methods. However, compared to a distribution defined by a probability density or cumulative distribution function, a set of quantiles carries less distributional information. Given estimated quantiles, it may therefore be desirable to estimate a fully defined continuous distribution function; many researchers do so to simplify evaluation or ensemble modeling. Most existing methods for fitting a distribution to quantiles either fail to accurately represent the inherent uncertainty from quantile estimation or are limited in their applications. In this manuscript, we present a Gaussian process model, the quantile Gaussian process (based on established asymptotic results for quantile functions and sample quantiles), to construct a probability distribution from estimated quantiles. In some applications, the form of the unknown distribution function from which sample quantiles are drawn must itself be estimated, in which case we propose a latent truncated Dirichlet process mixture model. A Bayesian application of the quantile Gaussian process is evaluated for parameter inference and distribution approximation in simulation studies and in a real data analysis of quantile forecasts from the 2023-24 US Centers for Disease Control collaborative flu forecasting initiative. The simulation studies and data analysis show that, compared to other existing methods, the quantile Gaussian process leads to accurate inference on model parameters, estimation of a continuous distribution, and uncertainty quantification of sample quantiles.
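The core move of fitting a smooth function through (probability, quantile) pairs can be illustrated with generic Gaussian process regression, as in the minimal sketch below. The kernel choice, length scale, and the fixed noise term (which stands in for the asymptotic sample-quantile variance the paper derives) are all assumptions; the paper's quantile Gaussian process model is more elaborate than this.

```python
import numpy as np

def _rbf(a, b, length=0.2, var=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_quantile_function(probs, quants, noise_var=1e-4):
    """GP posterior mean through (probability, quantile) pairs,
    yielding a smooth approximate quantile function on (0, 1).
    noise_var is a stand-in for quantile-estimation uncertainty."""
    p = np.asarray(probs, dtype=float)
    K = _rbf(p, p) + noise_var * np.eye(p.size)
    weights = np.linalg.solve(K, np.asarray(quants, dtype=float))
    return lambda pn: _rbf(np.atleast_1d(np.asarray(pn, float)), p) @ weights

# Example: smooth quantile function from a few sample quantiles
probs = np.linspace(0.05, 0.95, 19)
quants = np.quantile(np.random.default_rng(0).normal(size=500), probs)
qf = gp_quantile_function(probs, quants)
print(qf([0.25, 0.5, 0.75]))
```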
  2. Abstract: This paper extends the application of quantile-based Bayesian inference to probability distributions defined in terms of quantiles of observable quantities. Quantile-parameterized distributions are characterized by high shape flexibility and parameter interpretability, making them useful for eliciting information about observables. To encode uncertainty in the quantiles elicited from experts, we propose a Bayesian model based on the metalog distribution and a variant of the Dirichlet prior. We discuss the resulting hybrid expert elicitation protocol, which aims to characterize uncertainty in parameters by asking questions about observable quantities. We also compare and contrast this approach with parametric and predictive elicitation methods.
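As background on the distribution family involved, the sketch below fits a basic unbounded four-term metalog quantile function to (probability, quantile) pairs by least squares. This is only the deterministic fitting step; the Bayesian layer with the Dirichlet-type prior described in the abstract is omitted, and the function names are illustrative.

```python
import numpy as np

def metalog_basis(y, k=4):
    """First k (k <= 4 here) metalog basis functions at probabilities y."""
    y = np.asarray(y, dtype=float)
    logit = np.log(y / (1.0 - y))
    g = [np.ones_like(y), logit, (y - 0.5) * logit, y - 0.5]
    return np.column_stack(g[:k])

def metalog_fit(probs, quants, k=4):
    """Least-squares metalog coefficients from (probability, quantile) pairs."""
    a, *_ = np.linalg.lstsq(metalog_basis(probs, k),
                            np.asarray(quants, dtype=float), rcond=None)
    return a

def metalog_quantile(a, y):
    """Evaluate the fitted metalog quantile function at probabilities y."""
    return metalog_basis(y, a.size) @ a

# Example: fit to empirical quantiles of a skewed sample
probs = np.linspace(0.05, 0.95, 19)
quants = np.quantile(np.random.default_rng(0).lognormal(size=500), probs)
a = metalog_fit(probs, quants)
print(metalog_quantile(a, [0.1, 0.5, 0.9]))
```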
  3. Radial distribution functions (RDFs) are widely used in molecular simulation and beyond. Most approaches to computing RDFs require assembling a histogram over inter-particle separation distances. In turn, these histograms require a specific (and generally arbitrary) choice of discretization for bins. We demonstrate that this arbitrary choice for binning can lead to significant and spurious phenomena in several commonplace molecular-simulation analyses that make use of RDFs, such as identifying phase boundaries and generating excess entropy scaling relationships. We show that a straightforward approach (which we term Kernel-Averaging Method to Eliminate Length-Of-Bin Effects) mitigates these issues. This approach is based on systematic and mass-conserving mollification of RDFs using a Gaussian kernel. This technique has several advantages compared to existing methods, including being useful for cases where the original particle kinematic data have not been retained, and the only available data are the RDFs themselves. We also discuss the optimal implementation of this approach in the context of several application areas. 
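A minimal sketch of the underlying idea, Gaussian mollification of a binned RDF with a normalized (hence mass-conserving, edge effects aside) kernel, is given below. This is a generic smoothing pass, not the paper's exact Kernel-Averaging Method construction; the 4-sigma kernel truncation and function name are assumptions.

```python
import numpy as np

def mollify_rdf(r, g, sigma):
    """Gaussian mollification of a binned RDF g(r) on a uniform grid r.

    The discrete kernel is normalized to sum to 1, so the convolution
    conserves the binned mass sum(g) * dr away from the grid edges.
    """
    dr = r[1] - r[0]
    half = int(np.ceil(4.0 * sigma / dr))        # truncate at 4 sigma
    t = np.arange(-half, half + 1) * dr
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()                       # mass-conserving weights
    return np.convolve(g, kernel, mode="same")

# Example: smooth a noisy binned RDF
r = np.linspace(0.0, 5.0, 500)
g = 1.0 + np.exp(-((r - 1.0) ** 2) / 0.02) + 0.05 * np.random.default_rng(0).normal(size=r.size)
g_smooth = mollify_rdf(r, g, sigma=0.05)
```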
  4. An ε-approximate quantile sketch over a stream of n inputs approximates the rank of any query point q (that is, the number of input points less than q) up to an additive error of εn, generally with probability at least 1 − 1/poly(n), while consuming o(n) space. While the celebrated KLL sketch of Karnin, Lang, and Liberty is a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem, aiming to achieve better approximations than KLL on real-world data sets while maintaining similar worst-case guarantees.
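The interpolation idea can be seen in the toy sketch below: store a compact set of (value, rank) pairs and answer rank queries by linear interpolation between them. This is an offline illustration only; a real streaming sketch such as KLL or t-digest builds its summary in o(n) space via compaction, which is not shown here, and the function names are assumptions.

```python
import numpy as np

def build_summary(data, k=64):
    """Offline toy summary: k roughly evenly spaced order statistics
    with their ranks. (A streaming sketch would build this in o(n) space.)"""
    x = np.sort(np.asarray(data, dtype=float))
    idx = np.linspace(0, x.size - 1, k).astype(int)
    return x[idx], idx.astype(float)             # (values, ranks)

def rank_query(summary, q):
    """Approximate rank(q) = #{x_i < q} by linear interpolation
    between the stored (value, rank) pairs."""
    vals, ranks = summary
    return float(np.interp(q, vals, ranks))

# Example: rank of the median of a symmetric sample is ~ n/2
x = np.random.default_rng(0).normal(size=100_000)
s = build_summary(x, k=64)
print(rank_query(s, 0.0))                        # roughly 50_000
```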
  5. Predicting the process of porosity-based ductile damage in polycrystalline metallic materials is a problem of substantial practical importance. Ductile damage and its precursors are represented by extreme values in stress and material state quantities, whose spatial probability density functions (PDFs) are highly non-Gaussian with strong fat tails. Traditional deterministic forecasts using sophisticated continuum-based physical models generally fail to represent the statistics of structural evolution during material deformation, while computational tools that do represent complex structural evolution are typically expensive. The inevitable model error and the lack of uncertainty quantification may also induce significant forecast biases, especially in predicting the extreme events associated with ductile damage. In this paper, a data-driven statistical reduced-order modeling framework is developed to provide a probabilistic forecast, with uncertainty quantification, of the deformation process of a polycrystal aggregate leading to porosity-based ductile damage. The framework starts by computing the time evolution of the leading few moments of specific state variables from the spatiotemporal solution of full-field polycrystal simulations. A sparse model identification algorithm based on causation entropy, incorporating essential physical constraints, is then used to discover the governing equations of these moments. An approximate solution of the time evolution of the PDF is obtained from the predicted moments by exploiting the maximum entropy principle. Numerical experiments based on polycrystal realizations of a representative body-centered cubic (BCC) tantalum illustrate a skillful reduced-order model that characterizes the time evolution of the non-Gaussian PDF of the von Mises stress and quantifies the probability of extreme events. The learning process also reveals that the mean stress is not simply an additive forcing that drives the higher-order moments and extreme events; instead, it interacts with them in a strongly nonlinear and multiplicative fashion. In addition, the calibrated moment equations provide a reasonably accurate forecast when applied to realizations outside the training data set, indicating the robustness of the model and its skill for extrapolation. Finally, an information-based measure is employed to show quantitatively that the leading four moments are sufficient to characterize the crucial highly non-Gaussian features throughout the entire deformation history considered.
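The moments-to-PDF step can be illustrated in isolation: given the first four raw moments, solve the convex dual of the maximum entropy problem for the exponential-family density on a bounded grid. This sketch covers only that step of the framework (not the causation-entropy model identification), and the grid, optimizer choice, and function name are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def maxent_pdf(moments, x):
    """Maximum-entropy density on the uniform grid x matching the first
    four raw moments E[x^k], k = 1..4 (illustrative moments-to-PDF step)."""
    dx = x[1] - x[0]                              # uniform grid assumed
    powers = np.vstack([x ** k for k in range(1, 5)])
    m = np.asarray(moments, dtype=float)

    def dual(lam):
        # Convex dual log Z(lam) + lam . m for p(x) prop. exp(-lam . powers);
        # its minimizer makes E_p[x^k] match m_k for k = 1..4.
        return np.log(np.sum(np.exp(-lam @ powers)) * dx) + lam @ m

    lam = minimize(dual, np.zeros(4), method="Nelder-Mead",
                   options={"maxiter": 5000}).x
    p = np.exp(-lam @ powers)
    return p / (p.sum() * dx)

# Example: raw moments (0, 1, 0, 3) should recover ~ N(0, 1)
x = np.linspace(-6.0, 6.0, 601)
p = maxent_pdf([0.0, 1.0, 0.0, 3.0], x)
```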