The matrix completion problem seeks to recover a $d\times d$ ground truth matrix of low rank $r\ll d$ from observations of its individual elements. Real-world matrix completion is often a huge-scale optimization problem, with $d$ so large that even the simplest full-dimension vector operations with $O(d)$ time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slowdown when the underlying ground truth is ill-conditioned; it requires at least $O(\kappa\log(1/\epsilon))$ iterations to get $\epsilon$-close to a ground truth matrix with condition number $\kappa$. In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for huge-scale online optimization while also making it agnostic to $\kappa$. For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with $\kappa=1$. In our numerical experiments, we observe a similar acceleration for ill-conditioned matrix completion under the 1-bit cross-entropy loss, as well as pairwise losses such as the Bayesian Personalized Ranking (BPR) loss.
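As a concrete illustration, here is a minimal sketch of one preconditioned stochastic update for the symmetric factorization $M \approx UU^T$, in which the per-entry SGD gradient is right-multiplied by the $r\times r$ preconditioner $(U^TU)^{-1}$. The function name, per-entry squared loss, step size, and small regularizer are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def scaled_sgd_step(U, i, j, m_ij, lr=0.1):
    """One preconditioned SGD step for symmetric matrix completion
    M ~ U @ U.T, given a single observed entry M[i, j] = m_ij.

    The r x r preconditioner (U.T @ U)^{-1} rescales the per-entry
    gradient; this rescaling is the mechanism that removes the
    dependence on the condition number kappa. All concrete choices
    here (loss, step size, regularizer) are illustrative.
    """
    r = U.shape[1]
    res = U[i] @ U[j] - m_ij          # residual on the observed entry
    g_i = res * U[j]                  # plain SGD gradient w.r.t. U[i]
    g_j = res * U[i]                  # plain SGD gradient w.r.t. U[j]
    P = np.linalg.inv(U.T @ U + 1e-8 * np.eye(r))  # preconditioner
    U[i] = U[i] - lr * (g_i @ P)
    U[j] = U[j] - lr * (g_j @ P)
    return U
```

A small step on one observed entry shrinks that entry's residual regardless of how skewed the spectrum of the ground truth is, which is the intuition behind the $\kappa$-agnostic rate.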
Stochastic Variance-Reduced Hamilton Monte Carlo Methods
We propose a fast stochastic Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by recent advances in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
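The variance reduction technique referenced above can be illustrated with an SVRG-style estimator: evaluate one randomly chosen component gradient at both the current point and a stored snapshot, then correct with the snapshot's full gradient. The sketch below is a generic version of this construction with illustrative names; the estimator used inside the paper's HMC dynamics may differ in details.

```python
import numpy as np

def svrg_gradient(x, snapshot, full_grad_at_snapshot, component_grads, rng):
    """SVRG-style stochastic gradient estimate.

    Picks one component i at random and returns
        g_i(x) - g_i(snapshot) + full_grad(snapshot),
    an unbiased estimate of the full gradient at x whose variance
    vanishes as x approaches the snapshot point.
    """
    i = rng.integers(len(component_grads))
    g = component_grads[i]
    return g(x) - g(snapshot) + full_grad_at_snapshot
```

Because only one component gradient is evaluated per call (plus a periodically refreshed snapshot), the per-iteration cost stays $O(1)$ in $n$, which is the source of the gradient-complexity savings.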
NSF-PAR ID: 10063549
Journal Name: International Conference on Machine Learning
Sponsoring Org: National Science Foundation
More Like this


The noise sensitivity of a Boolean function f: {0,1}^n -> {0,1} is one of its fundamental properties. For noise parameter delta, the noise sensitivity is denoted as NS_{delta}[f]. This quantity is defined as follows: First, pick x = (x_1,...,x_n) uniformly at random from {0,1}^n, then pick z by flipping each x_i independently with probability delta. NS_{delta}[f] is defined to equal Pr[f(x) != f(z)]. Much of the existing literature on noise sensitivity explores the following two directions: (1) Showing that functions with low noise sensitivity are structured in certain ways. (2) Mathematically showing that certain classes of functions have low noise sensitivity. Combined, these two research directions show that certain classes of functions have low noise sensitivity and therefore have useful structure. The fundamental importance of noise sensitivity, together with this wealth of structural results, motivates the algorithmic question of approximating NS_{delta}[f] given oracle access to the function f. We show that the standard sampling approach is essentially optimal for general Boolean functions. Therefore, we focus on estimating the noise sensitivity of monotone functions, which form an important subclass of Boolean functions, since many functions of interest are either monotone or can be simply transformed into a monotone function (for example, the class of unate functions consists of all the functions that can be made monotone by reorienting some of their coordinates [O'Donnell, 2014]). Specifically, we study the algorithmic problem of approximating NS_{delta}[f] for monotone f, given the promise that NS_{delta}[f] >= 1/n^{C} for constant C, and for delta in the range 1/n <= delta <= 1/2. For such f and delta, we give a randomized algorithm performing O((min(1, sqrt{n} delta log^{1.5} n))/(NS_{delta}[f]) poly(1/epsilon)) queries and approximating NS_{delta}[f] to within a multiplicative factor of (1 +/- epsilon).
Given the same constraints on f and delta, we also prove a lower bound of Omega((min(1, sqrt{n} delta))/(NS_{delta}[f] * n^{xi})) on the query complexity of any algorithm that approximates NS_{delta}[f] to within any constant factor, where xi can be any positive constant. Thus, our algorithm's query complexity is close to optimal in terms of its dependence on n. We introduce a novel descending-ascending view of noise sensitivity, and use it as a central tool for the analysis of our algorithm. To prove lower bounds on query complexity, we develop a technique that reduces computational questions about query complexity to combinatorial questions about the existence of "thin" functions with certain properties. The existence of such "thin" functions is proved using the probabilistic method. These techniques also yield new lower bounds on the query complexity of approximating other fundamental properties of Boolean functions: the total influence and the bias.
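For reference, the "standard sampling approach" the abstract compares against follows directly from the definition of NS_{delta}[f]: draw x uniformly, flip each bit with probability delta to get z, and average the disagreement indicator. The function and parameter names below are illustrative.

```python
import random

def estimate_noise_sensitivity(f, n, delta, samples=20000, seed=0):
    """Standard sampling estimator of NS_delta[f] = Pr[f(x) != f(z)].

    Draws x uniformly from {0,1}^n, builds z by flipping each bit of x
    independently with probability delta, and averages the disagreement
    indicator. Costs two oracle queries to f per sample.
    """
    rng = random.Random(seed)
    disagree = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        z = [b ^ (rng.random() < delta) for b in x]
        disagree += f(x) != f(z)
    return disagree / samples
```

Since each sample is a Bernoulli trial with mean NS_{delta}[f], estimating the quantity to a (1 +/- epsilon) factor needs on the order of 1/(epsilon^2 NS_{delta}[f]) samples, which is the baseline the monotone-case algorithm improves on.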

Belkin, M.; Kpotufe, S. (Ed.) Langevin algorithms are gradient descent methods with additive noise. They have been used for decades in Markov Chain Monte Carlo (MCMC) sampling, optimization, and learning. Their convergence properties for unconstrained non-convex optimization and learning problems have been studied widely in the last few years. Other work has examined projected Langevin algorithms for sampling from log-concave distributions restricted to convex compact sets. For learning and optimization, log-concave distributions correspond to convex losses. In this paper, we analyze the case of non-convex losses with compact convex constraint sets and IID external data variables. We term the resulting method the projected stochastic gradient Langevin algorithm (PSGLA). We show the algorithm achieves a deviation of $O(T^{-1/4}(\log T)^{1/2})$ from its target distribution in 1-Wasserstein distance. For optimization and learning, we show that the algorithm achieves $\epsilon$-suboptimal solutions, on average, provided that it is run for a time that is polynomial in $1/\epsilon$ and slightly super-exponential in the problem dimension.
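A single PSGLA iteration, as described, is a stochastic gradient step plus Gaussian noise, followed by a Euclidean projection onto the constraint set. A minimal sketch, with the step-size and inverse-temperature conventions as illustrative assumptions:

```python
import numpy as np

def psgla_step(x, stoch_grad, step, beta, project, rng):
    """One projected stochastic gradient Langevin (PSGLA-style) iteration:
    a stochastic gradient step plus Gaussian noise at inverse temperature
    beta, followed by a Euclidean projection onto the constraint set.
    The exact scaling conventions here are illustrative.
    """
    noise = np.sqrt(2.0 * step / beta) * rng.standard_normal(x.shape)
    return project(x - step * stoch_grad(x) + noise)
```

The projection is what keeps every iterate feasible, so the chain samples from (an approximation of) the Gibbs distribution restricted to the constraint set.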

This work proposes a new algorithm, the Single-timescale Double-momentum Stochastic Approximation (SUSTAIN), for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is strongly convex and the upper level objective function is smooth. Unlike prior works which rely on two-timescale or double-loop techniques, we design a stochastic momentum-assisted gradient estimator for both the upper and lower level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solutions to both subproblems. If the upper objective function is smooth but possibly non-convex, we show that SUSTAIN requires $O(\epsilon^{-3/2})$ iterations (each using $O(1)$ samples) to find an $\epsilon$-stationary solution. An $\epsilon$-stationary solution is defined as a point whose squared gradient norm of the outer function is less than or equal to $\epsilon$. The total number of stochastic gradient samples required for the upper and lower level objective functions matches the best-known complexity for single-level stochastic gradient algorithms. We also analyze the case when the upper level objective function is strongly convex.
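The momentum-assisted gradient estimator mentioned above is, in spirit, a STORM-style recursion: the same stochastic sample is evaluated at both the new and the old iterate, so the correction term cancels most of the sampling noise while the estimate tracks the moving gradient. A minimal single-level sketch (the paper applies such estimators at both bilevel levels; all names are illustrative):

```python
def momentum_estimator(grad_fn, x_new, x_old, d_old, a, sample):
    """Momentum-assisted (STORM-style) recursive gradient estimate:

        d_new = grad(x_new; sample) + (1 - a) * (d_old - grad(x_old; sample))

    The same stochastic sample is evaluated at both iterates, so the
    bracketed correction cancels most of the sampling noise while the
    estimate follows the moving gradient. With a = 1 it reduces to
    plain SGD at the new iterate.
    """
    g_new = grad_fn(x_new, sample)
    g_old = grad_fn(x_old, sample)
    return g_new + (1.0 - a) * (d_old - g_old)
```

Controlling the estimator error this way is what lets a single timescale replace the two-timescale or double-loop constructions of prior work.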

We study local symmetry breaking problems in the Congest model, focusing on ruling set problems, which generalize the fundamental Maximal Independent Set (MIS) problem. The time (round) complexity of MIS (and ruling sets) has attracted much attention in the Local model. Indeed, recent results (Barenboim et al., FOCS 2012; Ghaffari, SODA 2016) for the MIS problem have tried to break the long-standing O(log n)-round "barrier" achieved by Luby's algorithm, but these yield o(log n)-round complexity only when the maximum degree Delta is somewhat small relative to n. More importantly, these results apply only in the Local model. In fact, the best known time bound in the Congest model is still O(log n) (via Luby's algorithm), even for moderately small Delta (i.e., for Delta = Omega(log n) and Delta = o(n)). Furthermore, message complexity has been largely ignored in the context of local symmetry breaking. Luby's algorithm takes O(m) messages on m-edge graphs, and this is the best known bound with respect to messages. Our work is motivated by the following central question: can we break the Theta(log n) time complexity barrier and the Theta(m) message complexity barrier in the Congest model for MIS or closely related symmetry breaking problems? This paper presents progress towards this question for the distributed ruling set problem in the Congest model. A beta-ruling set is an independent set such that every node in the graph is at most beta hops from a node in the independent set. We present the following results:

- Time Complexity: We show that we can break the O(log n) "barrier" for 2- and 3-ruling sets. We compute 3-ruling sets in O(log n / log log n) rounds with high probability (whp). More generally, we show that 2-ruling sets can be computed in O(log Delta * (log n)^(1/2 + epsilon) + log n / log log n) rounds for any epsilon > 0, which is o(log n) for a wide range of Delta values (e.g., Delta = 2^((log n)^(1/2 - epsilon))). These are the first 2- and 3-ruling set algorithms to improve over the O(log n)-round complexity of Luby's algorithm in the Congest model.

- Message Complexity: We show an Omega(n^2) lower bound on the message complexity of computing an MIS (i.e., a 1-ruling set), which holds also for randomized algorithms, and present a contrast to this by showing a randomized algorithm for 2-ruling sets that, whp, uses only O(n log^2 n) messages and runs in O(Delta log n) rounds. This is the first message-efficient algorithm known for ruling sets, with message complexity nearly linear in n (optimal up to a polylogarithmic factor).
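For context, the O(log n)-round baseline discussed above is Luby's randomized algorithm. A sequential simulation of its rounds (each live node draws a random priority; local minima join the independent set and are removed along with their neighbors) can be sketched as follows; the dictionary-of-neighbor-sets graph representation is an illustrative choice.

```python
import random

def luby_mis(adj, seed=0):
    """Sequential simulation of Luby's randomized MIS algorithm.

    `adj` maps each node to the set of its neighbors. In each round,
    every live node draws a random priority; nodes that beat all of
    their live neighbors join the independent set and are removed
    together with their neighbors. With high probability the loop
    terminates after O(log n) rounds.
    """
    rng = random.Random(seed)
    live = set(adj)
    mis = set()
    while live:
        prio = {v: rng.random() for v in live}
        winners = {v for v in live
                   if all(prio[v] < prio[u] for u in adj[v] if u in live)}
        mis |= winners
        removed = set(winners)
        for v in winners:          # winners leave with their neighborhoods
            removed |= adj[v] & live
        live -= removed
    return mis
```

Each round is simulated globally here; in the distributed Congest model the same logic costs one message per edge per round, which is exactly the Theta(m)-message behavior the abstract sets out to beat.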