We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a black-box fashion, into one that works under an arbitrary and unknown distribution D. The efficiency of our transformation scales with the inherent complexity of D, running in poly(n, (md)^d) time for distributions over {±1}^n whose pmfs are computed by depth-d decision trees, where m is the sample complexity of the original algorithm. For monotone distributions our transformation uses only samples from D, and for general ones it uses subcube conditioning samples.
A key technical ingredient is an algorithm which, given the aforementioned access to D, produces an optimal decision tree decomposition of D: an approximation of D as a mixture of uniform distributions over disjoint subcubes. With this decomposition in hand, we run the uniform-distribution learner on each subcube and combine the hypotheses using the decision tree. This algorithmic decomposition lemma also yields new algorithms for learning decision tree distributions, with runtimes that exponentially improve on the prior state of the art; these results are of independent interest in distribution learning.
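The decomposition-then-learn idea can be illustrated with a toy sketch (a hypothetical distribution, not one from the paper): here D's pmf is computed by a depth-1 decision tree splitting on bit 0, so D is exactly a mixture of uniform distributions over the two subcubes x0=0 and x0=1, and the uniform-distribution learner would be run on the samples falling in each subcube.

```python
import random

# Hypothetical depth-1 decomposition: each leaf is (fixed bits, mixture weight).
LEAVES = [({0: 0}, 0.75), ({0: 1}, 0.25)]

def sample_D(n, rng):
    # Draw a leaf according to the mixture weights, then a uniform point
    # from that leaf's subcube.
    fixed, _ = rng.choices(LEAVES, weights=[w for _, w in LEAVES])[0]
    x = [rng.randint(0, 1) for _ in range(n)]
    for i, b in fixed.items():
        x[i] = b
    return tuple(x)

def samples_in_subcube(samples, fixed):
    # The uniform-distribution learner would be run on exactly these samples,
    # since D restricted to a leaf's subcube is uniform there.
    return [x for x in samples if all(x[i] == b for i, b in fixed.items())]

rng = random.Random(0)
samples = [sample_D(4, rng) for _ in range(2000)]
counts = [len(samples_in_subcube(samples, fixed)) for fixed, _ in LEAVES]
# empirical leaf fractions track the mixture weights 0.75 and 0.25
```

In the general algorithm the tree is not known in advance; the decomposition lemma is what recovers a tree like `LEAVES` from sample or subcube-conditioning access to D.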
Hyperparameter Optimization: A Spectral Approach
We give a simple, fast algorithm for hyperparameter optimization inspired by techniques from the analysis of Boolean functions. We focus on the high-dimensional regime where the canonical example is training a neural network with a large number of hyperparameters. The algorithm, an iterative application of compressed-sensing techniques for orthogonal polynomials, requires only uniform sampling of the hyperparameters and is thus easily parallelizable.
Experiments for training deep neural networks on CIFAR-10 show that compared to state-of-the-art tools (e.g., Hyperband and Spearmint), our algorithm finds significantly improved solutions, in some cases better than what is attainable by hand-tuning. In terms of overall running time (i.e., time required to sample various settings of hyperparameters plus additional computation time), we are at least an order of magnitude faster than Hyperband and Bayesian Optimization. We also outperform Random Search by 8x.
Additionally, our method comes with provable guarantees and yields the first improvements on the sample complexity of learning decision trees in over two decades. In particular, we obtain the first quasi-polynomial time algorithm for learning noisy decision trees with polynomial sample complexity.
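The spectral pipeline described above can be sketched in a minimal form (the objective, sample sizes, and use of plain least squares in place of the iterative compressed-sensing recovery are all illustrative assumptions, not the paper's implementation): sample hyperparameter settings uniformly from {-1,+1}^n, fit a sparse low-degree polynomial in the parity (Fourier) basis, and minimize the learned surrogate.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 8                        # number of binary hyperparameters

def loss(x):                 # hypothetical black-box objective (assumed)
    return 1.0 - 0.5 * x[0] * x[1] + 0.3 * x[2]

# 1) uniform sampling of hyperparameter settings from {-1,+1}^n
X = rng.choice([-1.0, 1.0], size=(200, n))
y = np.array([loss(x) for x in X])

# 2) regress onto the degree-<=2 parity (Fourier) basis; a stand-in for the
#    compressed-sensing recovery of a sparse low-degree polynomial
monomials = [()] + [(i,) for i in range(n)] + list(itertools.combinations(range(n), 2))
Phi = np.array([[np.prod(x[list(S)]) if S else 1.0 for S in monomials] for x in X])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# 3) minimize the learned surrogate over all 2^n settings
def surrogate(x):
    xa = np.array(x)
    return sum(c * np.prod(xa[list(S)]) if S else c for c, S in zip(coef, monomials))

best = min(itertools.product([-1.0, 1.0], repeat=n), key=surrogate)
```

Because the toy objective lies exactly in the degree-2 span, the surrogate recovers it and `best` attains the true minimum; the real algorithm handles noisy evaluations and much larger n by exploiting sparsity.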
 Award ID(s):
 1717896
 NSF-PAR ID:
 10080190
 Date Published:
 Journal Name:
 ICLR
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We give the first reconstruction algorithm for decision trees: given queries to a function f that is opt-close to a size-s decision tree, our algorithm provides query access to a decision tree T where:  T has size S := s^O((log s)²/ε³);  dist(f,T) ≤ O(opt)+ε;  every query to T is answered with poly((log s)/ε)⋅log n queries to f and in poly((log s)/ε)⋅n log n time. This yields a tolerant tester that distinguishes functions that are close to size-s decision trees from those that are far from size-S decision trees. The polylogarithmic dependence on s in the efficiency of our tester is exponentially smaller than that of existing testers. Since decision tree complexity is well known to be related to numerous other Boolean function properties, our results also provide a new algorithm for reconstructing and testing these properties.
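The quantity dist(f,T) in the guarantee above is simply the fraction of inputs on which the two functions disagree, and a tolerant tester estimates it by uniform sampling. A minimal sketch with toy functions (illustrative only, not the paper's reconstruction procedure):

```python
import random

def tree(x):                 # a small decision tree: split on x[1]
    return x[0] if x[1] == 0 else x[2]

def f(x):                    # opt-close to `tree`: disagrees on one input
    return 1 - tree(x) if x == (1, 1, 1) else tree(x)

def estimate_dist(f, T, n_bits, n_samples, rng):
    # empirical fraction of uniformly random inputs where f and T disagree
    disagree = sum(
        f(x) != T(x)
        for x in (tuple(rng.randint(0, 1) for _ in range(n_bits))
                  for _ in range(n_samples))
    )
    return disagree / n_samples

rng = random.Random(1)
est = estimate_dist(f, tree, 3, 4000, rng)   # true dist(f, tree) = 1/8
```

The novelty in the result above is not this estimate but that query access to a good tree T is provided implicitly, with each query to T costing only polylogarithmically many queries to f.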

We show a simple reduction which demonstrates the cryptographic hardness of learning a single periodic neuron over isotropic Gaussian distributions in the presence of noise. More precisely, our reduction shows that any polynomial-time algorithm (not necessarily gradient-based) for learning such functions under small noise implies a polynomial-time quantum algorithm for solving worst-case lattice problems, whose hardness forms the foundation of lattice-based cryptography. Our core hard family of functions, which are well-approximated by one-layer neural networks, take the general form of a univariate periodic function applied to an affine projection of the data. These functions have appeared in previous seminal works which demonstrate their hardness against gradient-based (Shamir'18) and Statistical Query (SQ) algorithms (Song et al.'17). We show that if (polynomially) small noise is added to the labels, the intractability of learning these functions applies to all polynomial-time algorithms, beyond gradient-based and SQ algorithms, under the aforementioned cryptographic assumptions. Moreover, we demonstrate the necessity of noise in the hardness result by designing a polynomial-time algorithm for learning certain families of such functions under exponentially small adversarial noise. Our proposed algorithm is not a gradient-based or an SQ algorithm, but is rather based on the celebrated Lenstra-Lenstra-Lovász (LLL) lattice basis reduction algorithm. Furthermore, in the absence of noise, this algorithm can be directly applied to solve CLWE detection (Bruna et al.'21) and phase retrieval with an optimal sample complexity of d + 1 samples. In the former case, this improves upon the quadratic-in-d sample complexity required in (Bruna et al.'21).
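The data model behind the hard family can be written down in a few lines (the periodic function, frequency, and noise level here are illustrative choices, not the paper's parameters): labels are a univariate periodic function of a one-dimensional projection of isotropic Gaussian inputs, plus small label noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, gamma, sigma = 10, 500, 2.0, 1e-3

w = rng.standard_normal(d)
w /= np.linalg.norm(w)                 # secret unit direction of the projection
X = rng.standard_normal((m, d))        # isotropic Gaussian inputs
# univariate periodic function (here cos) of the projection, with small noise
y = np.cos(2 * np.pi * gamma * (X @ w)) + sigma * rng.standard_normal(m)
```

Recovering w from (X, y) is exactly the learning problem discussed above: easy for the LLL-based algorithm when the noise is exponentially small, but cryptographically hard once polynomially small noise is added.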


Binary classification is a fundamental machine learning task defined as correctly assigning new objects to one of two groups based on a set of training objects. Driven by the practical importance of binary classification, numerous machine learning techniques have been developed and refined over the last three decades. Among the most popular techniques are artificial neural networks, decision trees, ensemble methods, logistic regression, and support vector machines. We present here machine learning and pattern recognition algorithms that, unlike the commonly used techniques, are based on combinatorial optimization and make use of information on pairwise relations between the objects of the data set, whether training objects or not. These algorithms solve the respective problems optimally and efficiently, in contrast to the primarily heuristic approaches currently used for intractable problem models in pattern recognition and machine learning. The algorithms described efficiently solve the classification problem as a network flow problem on a graph. The technical tools used in the algorithm are the parametric cut procedure and a process called sparse computation that computes only the pairwise similarities that are "relevant." Sparse computation enables the scalability of any algorithm that uses pairwise similarities. We present evidence on the effectiveness of the approaches, measured in terms of accuracy and running time, in pattern recognition, image segmentation, and general data mining.
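The cut-based formulation can be sketched on a tiny instance (a simplified variant for illustration; the node layout, similarity weights, and the use of plain Edmonds-Karp in place of the parametric cut procedure are all assumptions): labeled seeds are tied to the source and sink with infinite-capacity edges, pairwise similarities become symmetric edge capacities, and the minimum s-t cut assigns every object a class.

```python
from collections import deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns the source side of a minimum s-t cut."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    while True:
        # BFS for an augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            break
        # push the bottleneck capacity along the path found
        bottleneck, v = float("inf"), t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
    # nodes still reachable from s in the residual graph form the source side
    side, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in range(n):
            if v not in side and cap[u][v] - flow[u][v] > 0:
                side.add(v)
                q.append(v)
    return side

# nodes: 0 = source, 1..4 = objects (1 positive seed, 4 negative seed), 5 = sink
INF = 10**9
cap = [[0] * 6 for _ in range(6)]
cap[0][1] = INF                                       # positive seed -> source
cap[4][5] = INF                                       # negative seed -> sink
for a, b, sim in [(1, 2, 5), (2, 3, 1), (3, 4, 5)]:   # pairwise similarities
    cap[a][b] = cap[b][a] = sim
positive = max_flow_min_cut(cap, 0, 5)                # source side = positives
```

The cut severs the weakest similarity (between objects 2 and 3), so objects 1 and 2 are classified with the positive seed and objects 3 and 4 with the negative one; the parametric cut procedure solves a whole family of such problems over a range of label-bias values in one pass.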