The matrix completion problem seeks to recover a $d\times d$ ground
truth matrix of low rank $r\ll d$ from observations of its individual
elements. Real-world matrix completion is often a huge-scale optimization
problem, with $d$ so large that even the simplest full-dimensional
vector operations with $O(d)$ time complexity become prohibitively
expensive. Stochastic gradient descent (SGD) is one of the few algorithms
capable of solving matrix completion on a huge scale, and can also
naturally handle streaming data over an evolving ground truth. Unfortunately,
SGD experiences a dramatic slowdown when the underlying ground truth
is ill-conditioned; it requires at least $O(\kappa\log(1/\epsilon))$
iterations to get $\epsilon$-close to a ground truth matrix with condition
number $\kappa$. In this paper, we propose a preconditioned version
of SGD that preserves all the favorable practical qualities of SGD
for huge-scale online optimization while also making it agnostic to
$\kappa$. For a symmetric ground truth and the Root Mean Square Error
(RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$-accuracy
in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence
rate as if the ground truth were perfectly conditioned with $\kappa=1$.
In our numerical experiments, we observe a similar acceleration for
ill-conditioned matrix completion under the 1-bit cross-entropy loss,
as well as pairwise losses such as the Bayesian Personalized Ranking (BPR) loss.
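A minimal sketch of such a preconditioned update for the symmetric case $M \approx XX^\top$, assuming the preconditioner is the inverse Gram matrix $(X^\top X)^{-1}$ applied on the right of the gradient; the function name, step size, and the small ridge term are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def scaled_sgd_step(X, i, j, m_ij, lr=0.02):
    """One preconditioned SGD step on an observed entry M[i, j] = m_ij
    of a symmetric ground truth M ~ X X^T (illustrative sketch)."""
    r = X[i] @ X[j] - m_ij  # residual at the observed entry
    # Right-precondition by (X^T X)^{-1}; this rescaling is what makes
    # the update insensitive to the condition number of the ground truth.
    # A tiny ridge term guards against a singular Gram matrix.
    P = np.linalg.inv(X.T @ X + 1e-8 * np.eye(X.shape[1]))
    gi, gj = r * X[j], r * X[i]  # plain SGD gradients for rows i and j
    X[i] = X[i] - lr * gi @ P
    X[j] = X[j] - lr * gj @ P
    return X
```

Each step touches only two rows of $X$ plus an $r \times r$ inverse, so the per-iteration cost stays far below $O(d)$ for $r \ll d$.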
Logarithmic heavy traffic error bounds in generalized switch and load balancing systems
Abstract Motivated by applications to wireless networks, cloud computing, data centers, etc., stochastic processing networks have been studied in the literature under various asymptotic regimes. In the heavy traffic regime, the steady-state mean queue length is proved to be $\Theta(1/\epsilon)$, where $\epsilon$ is the heavy traffic parameter (which goes to zero in the limit). The focus of this paper is on obtaining queue length bounds on pre-limit systems, thus establishing the rate of convergence to heavy traffic. For the generalized switch, operating under the MaxWeight algorithm, we show that the mean queue length is within $O(\log(1/\epsilon))$ of its heavy traffic limit. This result holds regardless of the complete resource pooling (CRP) condition being satisfied. Furthermore, when the CRP condition is satisfied, we show that the mean queue length under the MaxWeight algorithm is within $O(\log(1/\epsilon))$ of the optimal scheduling policy. Finally, we obtain similar results for the rate of convergence to heavy traffic of the total queue length in load balancing systems operating under the ‘join the shortest queue’ routeing algorithm.
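The generalized switch model above is broader, but MaxWeight can be sketched for the special case of an $N \times N$ input-queued (crossbar) switch: among all input-output matchings, serve the one with the largest total queue-length weight. The brute-force permutation search below is illustrative only; practical implementations use maximum-weight-matching algorithms instead.

```python
import itertools
import numpy as np

def maxweight_schedule(Q):
    """MaxWeight for an N x N input-queued switch (illustrative sketch):
    return the matching perm maximizing sum_i Q[i, perm[i]], where
    Q[i, j] is the length of the queue at input i destined for output j."""
    n = Q.shape[0]
    best_perm, best_weight = None, -np.inf
    for perm in itertools.permutations(range(n)):
        weight = sum(Q[i, p] for i, p in enumerate(perm))
        if weight > best_weight:
            best_perm, best_weight = perm, weight
    return best_perm
```

For example, with queues concentrated on the diagonal, MaxWeight serves the identity matching; heavier off-diagonal queues flip it.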
NSF-PAR ID: 10406004
Date Published:
Journal Name: Journal of Applied Probability
Volume: 59
Issue: 3
ISSN: 0021-9002
Page Range / eLocation ID: 652 to 669
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Abstract We present a new elementary algorithm that takes time $O_\epsilon\left(x^{3/5} (\log x)^{8/5+\epsilon}\right)$ and space $O\left(x^{3/10} (\log x)^{13/10}\right)$ (measured bitwise) for computing $M(x) = \sum_{n \le x} \mu(n)$, where $\mu(n)$ is the Möbius function. This is the first improvement in the exponent of $x$ for an elementary algorithm since 1985. We also show that it is possible to reduce space consumption to $O(x^{1/5} (\log x)^{5/3})$ by the use of (Helfgott in: Math Comput 89:333–350, 2020), at the cost of letting time rise to the order of $x^{3/5} (\log x)^2 \log \log x$.
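For concreteness, the Mertens function $M(x)$ can be computed by a simple sieve for $\mu(n)$; this is the classical linear-time baseline, not the sub-linear elementary algorithm of the abstract.

```python
def mertens(x):
    """M(x) = sum_{n <= x} mu(n), via a simple sieve for the Moebius
    function (a linear-time baseline, not the sub-linear algorithm)."""
    mu = [1] * (x + 1)
    is_prime = [True] * (x + 1)
    for p in range(2, x + 1):
        if not is_prime[p]:
            continue
        for k in range(p, x + 1, p):
            if k > p:
                is_prime[k] = False
            mu[k] = -mu[k]      # flip sign for each distinct prime factor
        for k in range(p * p, x + 1, p * p):
            mu[k] = 0           # numbers divisible by a square have mu = 0
    return sum(mu[1:])
```

For instance, `mertens(10)` gives $-1$, since $\mu$ on $1,\dots,10$ is $1,-1,-1,0,-1,1,-1,0,0,1$.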

In this paper, we show that under overparametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. One of the fundamental aspects of overparametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an overparametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $\tilde{O}(1/\epsilon^2)$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an $\epsilon$-local-minimizer is $\tilde{O}(1/\epsilon^{2.5})$. While this obtained complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{O}(1/\epsilon^{1.5})$ corresponding to the deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
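A minimal sketch of a PSGD-style step: take a stochastic gradient step, and when the gradient is small (a candidate saddle point) add an isotropic perturbation to help escape it. The perturbation radius and gradient tolerance below are placeholders, not the analysis's constants, which depend on smoothness and Hessian-Lipschitz parameters not shown here.

```python
import numpy as np

def psgd_step(w, grad, lr, rng, radius=1e-2, tol=1e-3):
    """One Perturbed SGD step (illustrative sketch). When the stochastic
    gradient is nearly zero, jitter the iterate isotropically so that
    strict saddle points are escaped with high probability."""
    if np.linalg.norm(grad) < tol:
        w = w + radius * rng.standard_normal(w.shape)
    return w - lr * grad
```

Away from saddles the step is just ordinary SGD, so the interpolation-like variance assumptions act exactly where they do in the deterministic analysis.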

We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2), \ldots$, with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = \langle a_i, x \rangle + \eta_i$, for a fixed $x \in \mathbb{R}^d$ with $\|x\| = 1$ and with independent noise $\eta_i$ drawn uniformly from the interval $[-2^{-d/5}, 2^{-d/5}]$. We show that any algorithm with at most $d^2/4$ bits of memory requires at least $\Omega(d \log\log\frac{1}{\epsilon})$ samples to approximate $x$ to $\ell_2$ error $\epsilon$ with probability of success at least $2/3$, for $\epsilon$ sufficiently small as a function of $d$. In contrast, for such $\epsilon$, $x$ can be recovered to error $\epsilon$ with probability $1-o(1)$ with memory $O\left(d^2 \log(1/\epsilon)\right)$ using $d$ examples. This represents the first nontrivial lower bounds for regression with superlinear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization.
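The memory-rich side of the tradeoff can be illustrated with ordinary least squares over stored examples. The dimensions, seed, and use of $2d$ examples (so the test system is comfortably well-conditioned) are illustrative choices, not the paper's construction:

```python
import numpy as np

def recover_x(A, b):
    """Memory-rich recovery (illustrative sketch): store the examples in
    full -- O(d^2) memory, matching the regime of the upper bound quoted
    above -- and solve ordinary least squares. The tiny uniform noise of
    size 2^{-d/5} perturbs the estimate only slightly."""
    return np.linalg.lstsq(A, b, rcond=None)[0]
```

With only subquadratic memory, the abstract's lower bound says no algorithm can match this sample efficiency.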
We characterize heavy-traffic process and steady-state limits for systems staffed according to the square-root safety rule, when the service requirements of the customers are perfectly correlated with their individual patience for waiting in queue. Under the usual many-server diffusion scaling, we show that the system is asymptotically equivalent to a system with no abandonment. In particular, the limit is the Halfin-Whitt diffusion for the [Formula: see text] queue when the traffic intensity approaches its critical value 1 from below, and is otherwise a transient diffusion, despite the fact that the pre-limit is positive recurrent. To obtain a refined measure of the congestion due to the correlation, we characterize a lower-order fluid (LOF) limit for the case in which the diffusion limit is transient, demonstrating that the queue in this case scales like [Formula: see text]. Under both the diffusion and LOF scalings, we show that the stationary distributions converge weakly to the time-limiting behavior of the corresponding process limit. Funding: This work was supported by the National Natural Science Foundation of China [Grant 72188101] and the Division of Civil, Mechanical and Manufacturing Innovation [Grants 1763100 and 2006350].