Title: Tight Bounds on the Hardness of Learning Simple Nonparametric Mixtures
We study the problem of learning nonparametric distributions in a finite mixture, and establish tight bounds on the sample complexity for learning the component distributions in such models. Namely, we are given i.i.d. samples from a pdf $f$ where $f = w_1 f_1 + w_2 f_2$, $w_1 + w_2 = 1$, $w_1, w_2 > 0$, and we are interested in learning each component $f_i$. Without any assumptions on $f_i$, this problem is ill-posed. In order to identify the components $f_i$, we assume that each $f_i$ can be written as a convolution of a Gaussian and a compactly supported density $\nu_i$ with $\operatorname{supp}(\nu_1) \cap \operatorname{supp}(\nu_2) = \emptyset$. Our main result shows that $(1/\varepsilon)^{\Omega(\log\log(1/\varepsilon))}$ samples are required for estimating each $f_i$. The proof relies on a quantitative Tauberian theorem that yields a fast rate of approximation with Gaussians, which may be of independent interest. To show this is tight, we also propose an algorithm that uses $(1/\varepsilon)^{O(\log\log(1/\varepsilon))}$ samples to estimate each $f_i$. Unlike existing approaches to learning latent variable models based on moment-matching and tensor methods, our proof instead involves a delicate analysis of an ill-conditioned linear system via orthogonal functions. Combining these bounds, we conclude that the optimal sample complexity of this problem properly lies in between polynomial and exponential, which is not common in learning theory.
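As an illustration only, the model in the abstract can be sketched in a few lines of Python. The sketch below is not the paper's algorithm: it only draws samples from a two-component mixture of the stated form and evaluates the growth rate $(1/\varepsilon)^{\log\log(1/\varepsilon)}$. The uniform choice for each $\nu_i$, the specific intervals, the unit Gaussian variance, the constant 1 in the exponent, and the function names `sample_mixture` and `sample_complexity_scale` are all assumptions made for the example.

```python
import math
import random


def sample_mixture(n, w1=0.5, support1=(-2.0, -1.0), support2=(1.0, 2.0),
                   sigma=1.0, seed=0):
    """Draw n samples from f = w1*f1 + w2*f2, where f_i = nu_i * N(0, sigma^2).

    Assumption for illustration: nu_i is uniform on support_i, and the two
    supports are disjoint intervals.
    """
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n):
        k = 0 if rng.random() < w1 else 1            # pick a component
        lo, hi = (support1, support2)[k]
        center = rng.uniform(lo, hi)                 # draw from nu_k
        samples.append(center + rng.gauss(0.0, sigma))  # Gaussian noise = convolution
        labels.append(k)
    return samples, labels


def sample_complexity_scale(eps):
    """Evaluate (1/eps)^(log log(1/eps)), taking the hidden constant as 1.

    Requires eps < 1/e so that log log(1/eps) is positive.
    """
    inv = 1.0 / eps
    return inv ** math.log(math.log(inv))


if __name__ == "__main__":
    xs, zs = sample_mixture(1000)
    print(len(xs), sum(zs))
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, sample_complexity_scale(eps))
```

Because the exponent $\log\log(1/\varepsilon)$ grows without bound but very slowly, this quantity eventually exceeds any fixed polynomial in $1/\varepsilon$ while remaining far below $e^{1/\varepsilon}$, which is the sense in which the abstract places the rate between polynomial and exponential.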
Award ID(s):
1956330
PAR ID:
10542254
Author(s) / Creator(s):
;
Publisher / Repository:
Proceedings of Thirty Sixth Conference on Learning Theory
Date Published:
Volume:
195
Page Range / eLocation ID:
2849-2849
Subject(s) / Keyword(s):
learning theory nonparametric statistics sample complexity lower bounds deconvolution
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given a mixture between two populations of coins, “positive” coins that each have (unknown and potentially different) bias ≥ 1/2 + ∆ and “negative” coins with bias ≤ 1/2 − ∆, we consider the task of estimating the fraction ρ of positive coins to within additive error ε. We achieve an upper and lower bound of Θ( ρ log 1 ) samples for a 1 − δ probability of success, where crucially, our lower bound applies to all fully-adaptive algorithms. Thus, our sample complexity bounds have tight dependence for every relevant problem parameter. A crucial component of our lower bound proof is a decomposition lemma (Lemma 5.2) showing how to assemble partially-adaptive bounds into a fully-adaptive bound, which may be of independent interest: though we invoke it for the special case of Bernoulli random variables (coins), it applies to general distributions. We present simulation results to demonstrate the practical efficacy of our approach for realistic problem parameters for crowdsourcing applications, focusing on the “rare events” regime where ρ is small. The fine-grained adaptive flavor of both our algorithm and lower bound contrasts with much previous work in distributional testing and learning.
  2. We consider the problem of estimating sparse discrete distributions under local differential privacy (LDP) and communication constraints. We characterize the sample complexity for sparse estimation under LDP constraints up to a constant factor, and the sample complexity under communication constraints up to a logarithmic factor. Our upper bounds under LDP are based on the Hadamard Response, a private-coin scheme that requires only one bit of communication per user. Under communication constraints we propose public-coin schemes based on random hashing functions. Our tight lower bounds are based on the recently proposed method of chi-squared contractions.
  3. Accurate metallicities of RR Lyrae are extremely important in constraining period–luminosity–metallicity (PLZ) relationships, particularly in the near-infrared. We analyse 69 high-resolution spectra of Galactic RR Lyrae stars from the Southern African Large Telescope. We measure metallicities of 58 of these RR Lyrae stars with typical uncertainties of 0.15 dex. All but one of the RR Lyrae in this sample have an accurate ($\sigma_{\varpi} \lesssim 10$ per cent) parallax from Gaia. Combining these new high-resolution spectroscopic abundances with similar determinations from the literature for 93 stars, we present new PLZ relationships in WISE W1 and W2 magnitudes, and the Wesenheit magnitudes W(W1, V − W1) and W(W2, V − W2).
  4. A number of applications require two-sample testing on ranked preference data. For instance, in crowdsourcing, there is a long-standing question of whether pairwise comparison data provided by people is distributed similarly to ratings-converted-to-comparisons. Other examples include sports data analysis and peer grading. In this paper, we design two-sample tests for pairwise comparison data and ranking data. For our two-sample test for pairwise comparison data, we establish an upper bound on the sample complexity required to correctly distinguish between the distributions of the two sets of samples. Our test requires essentially no assumptions on the distributions. We then prove complementary lower bounds showing that our results are tight (in the minimax sense) up to constant factors. We investigate the role of modeling assumptions by proving lower bounds for a range of pairwise comparison models (WST, MST, SST, parameter-based such as BTL and Thurstone). We also provide testing algorithms and associated sample complexity bounds for the problem of two-sample testing with partial (or total) ranking data. Furthermore, we empirically evaluate our results via extensive simulations as well as two real-world datasets consisting of pairwise comparisons. By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently. On the other hand, our test recognizes no significant difference in the relative performance of European football teams across two seasons. Finally, we apply our two-sample test on a real-world partial and total ranking dataset and find a statistically significant difference in Sushi preferences across demographic divisions based on gender, age and region of residence.
  5. We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution D over [n] and a property P, the goal is to distinguish between D ∈ P and ℓ1(D, P) > ε. We develop a general algorithm for this question, which applies to a large range of “shape-constrained” properties, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we also describe a generic method to prove lower bounds for this problem, and use it to show our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly tight upper and lower bounds for the corresponding questions.