skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Testing Tail Weight of a Distribution Via Hazard Rate
Understanding the shape of a distribution of data is of interest to people in a great variety of fields, as it may affect the types of algorithms used for that data. We study one such problem in the framework of {\em distribution property testing}, characterizing the number of samples required to to distinguish whether a distribution has a certain property or is far from having that property. In particular, given samples from a distribution, we seek to characterize the tail of the distribution, that is, understand how many elements appear infrequently. We develop an algorithm based on a careful bucketing scheme that distinguishes light-tailed distributions from non-light-tailed ones with respect to a definition based on the hazard rate, under natural smoothness and ordering assumptions. We bound the number of samples required for this test to succeed with high probability in terms of the parameters of the problem, showing that it is polynomial in these parameters. Further, we prove a hardness result that implies that this problem cannot be solved without any assumptions.  more » « less
Award ID(s):
2022448
PAR ID:
10431030
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International Conference on Algorithmic Learning Theory (ALT 2023)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose and analyze a new estimator of the covariance matrix that admits strong theoretical guarantees under weak assumptions on the underlying distribution, such as existence of moments of only low order. While estimation of covariance matrices corresponding to sub-Gaussian distributions is well-understood, much less in known in the case of heavy-tailed data. As K. Balasubramanian and M. Yuan write, "data from real-world experiments oftentimes tend to be corrupted with outliers and/or exhibit heavy tails. In such cases, it is not clear that those covariance matrix estimators .. remain optimal" and "what are the other possible strategies to deal with heavy tailed distributions warrant further studies." We make a step towards answering this question and prove tight deviation inequalities for the proposed estimator that depend only on the parameters controlling the intrinsic dimension'' associated to the covariance matrix (as opposed to the dimension of the ambient space); in particular, our results are applicable in the case of high-dimensional observations. 
    more » « less
  2. We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution p, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a predicted data distribution, possibly derived from historical data or predictive machine learning models, is available. We demonstrate that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor’s quality, measured by its total variation distance from p. A key advantage of our algorithms is their adaptability to the precision of the prediction. Specifically, our algorithms can self-adjust their sample complexity based on the accuracy of the available prediction, operating without any prior knowledge of the estimation’s accuracy (i.e. they are consistent). Additionally, we never use more samples than the standard approaches require, even if the predictions provide no meaningful information (i.e. they are also robust). We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach. 
    more » « less
  3. We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0< epsilon, delta < 1, we wish to distinguish, with probability at least 1-delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV} (p, U_n) >= epsilon, we simply threshold d_{TV}({p^}, U_n), where {p^} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions - including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well. 
    more » « less
  4. Firoozi, R.; Mehr, N.; Yel, E.; Antonova, R; Bohg, J.; Schwager, M.; Kochenderfer, M. (Ed.)
    This work considers the problem of learning the Markov parameters of a linear system from ob- served data. Recent non-asymptotic system identification results have characterized the sample complexity of this problem in the single and multi-rollout setting. In both instances, the number of samples required in order to obtain acceptable estimates can produce optimization problems with an intractably large number of decision variables for a second-order algorithm. We show that a randomized and distributed Newton algorithm based on Hessian-sketching can produce ε-optimal solutions and converges geometrically. Moreover, the algorithm is trivially parallelizable. Our re- sults hold for a variety of sketching matrices and we illustrate the theory with numerical examples. 
    more » « less
  5. We consider the problem of scheduling to minimize asymptotic tail latency in an M/G/1 queue with unknown job sizes. When the job size distribution is heavy-tailed, numerous policies that do not require job size information (e.g. Processor Sharing, Least Attained Service) are known to be strongly tail optimal, meaning that their response time tail has the fastest possible asymptotic decay. In contrast, for light-tailed size distributions, only in the last few years have policies been developed that outperform simple First-Come First-Served (FCFS). The most recent of these is γ-Boost, which achieves strong tail optimality in the light-tailed setting. But thus far, all policies that outperform FCFS in the light-tailed setting, including γ-Boost, require known job sizes. In this paper, we design a new scheduling policy that achieves strong tail optimality in the light-tailed M/G/1 with unknown job sizes. Surprisingly, the optimal policy turns out to be a variant of the Gittins policy, but with a novel and unusual feature: it uses a negative discount rate. Our work also applies to systems with partial information about job sizes, covering γ-Boost as an extreme case when job sizes are in fact fully known. 
    more » « less