Title: One Sample Stochastic Frank-Wolfe
One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its widespread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing efficiency. We propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyperparameters. In particular, 1-SFW achieves the optimal convergence rate of O(1/ε^2) for reaching an ε-suboptimal solution in the stochastic convex setting, and a (1 − 1/e) − ε approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an ε-first-order stationary point after at most O(1/ε^3) iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.
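To make the projection-free idea concrete, here is a minimal sketch of a stochastic Frank-Wolfe loop that draws one noisy gradient per iteration and smooths it with a momentum average. It is not the 1-SFW algorithm from the abstract: the plain momentum average used here is biased in general, whereas the paper's estimator is described as unbiased. The toy objective, the noise model, the function names, and the schedules for `rho` and `gamma` are all assumptions chosen for illustration.

```python
import random


def objective(x, b):
    """Toy objective f(x) = 0.5 * ||x - b||^2 (an assumption for this sketch)."""
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def noisy_grad(x, b, rng, sigma=0.1):
    """One-sample stochastic gradient: the true gradient x - b plus Gaussian noise."""
    return [xi - bi + rng.gauss(0.0, sigma) for xi, bi in zip(x, b)]


def lmo_simplex(d):
    """Linear minimization oracle over the probability simplex.

    Returns the vertex e_i minimizing <d, v>, which replaces the projection step.
    """
    i = min(range(len(d)), key=lambda k: d[k])
    return [1.0 if k == i else 0.0 for k in range(len(d))]


def momentum_sfw(b, x0, iters=2000, seed=0):
    """Stochastic Frank-Wolfe with a (biased) momentum gradient average."""
    rng = random.Random(seed)
    x = list(x0)
    d = [0.0] * len(x)
    for t in range(1, iters + 1):
        rho = 1.0 / t ** (2.0 / 3.0)   # illustrative averaging weight
        gamma = 1.0 / (t + 1)          # illustrative step size
        g = noisy_grad(x, b, rng)
        # Momentum average of the single-sample gradients.
        d = [(1.0 - rho) * di + rho * gi for di, gi in zip(d, g)]
        # Move toward the simplex vertex chosen by the linear oracle.
        v = lmo_simplex(d)
        x = [(1.0 - gamma) * xi + gamma * vi for xi, vi in zip(x, v)]
    return x


b = [0.1, 0.7, 0.2]
x0 = [1.0, 0.0, 0.0]
x = momentum_sfw(b, x0)
```

Because every update is a convex combination of the current iterate and a simplex vertex, the iterates stay feasible without any projection.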
Award ID(s):
1943064 1934876 1837253
PAR ID:
10671260
Publisher / Repository:
AISTATS
Sponsoring Org:
National Science Foundation
More Like this
  1. Decentralized optimization has recently attracted much attention in machine learning because it is more communication-efficient than centralized approaches. Quantization is a promising way to reduce communication cost: compressing gradients cuts the budget of each individual communication. To further improve communication efficiency, some quantized decentralized algorithms have recently been studied. However, quantized decentralized algorithms for nonconvex constrained machine learning problems remain limited. The Frank-Wolfe (a.k.a. conditional gradient or projection-free) method is very efficient for many constrained optimization tasks, such as training low-rank or sparsity-constrained models. In this paper, to fill the gap in decentralized quantized constrained optimization, we propose a novel communication-efficient Decentralized Quantized Stochastic Frank-Wolfe (DQSFW) algorithm for non-convex constrained learning models. We first design a new counterexample to show that the vanilla decentralized quantized stochastic Frank-Wolfe algorithm usually diverges. We therefore equip DQSFW with the gradient tracking technique to guarantee that the method converges to a stationary point of the non-convex problem. In our theoretical analysis, we prove that, to reach a stationary point, DQSFW achieves the same gradient complexity as the standard stochastic Frank-Wolfe and centralized Frank-Wolfe algorithms, but with much lower communication cost. Experiments on matrix completion and model compression applications demonstrate the efficiency of our new algorithm.
  2. This paper considers the fundamental convergence time for opportunistic scheduling over time-varying channels. The channel state probabilities are unknown and algorithms must perform some type of estimation and learning while they make decisions to optimize network utility. Existing schemes can achieve a utility within ε of optimality, for any desired ε > 0, with convergence and adaptation times of O(1/ε^2). This paper shows that if the utility function is concave and smooth, then O(log(1/ε)/ε) convergence time is possible via an existing stochastic variation on the Frank-Wolfe algorithm, called the RUN algorithm. Next, a converse result is proven to show it is impossible for any algorithm to have convergence time better than O(1/ε), provided the algorithm has no a priori knowledge of channel state probabilities. Hence, RUN is within a logarithmic factor of convergence time optimality. However, RUN has a vanishing step size and hence has an infinite adaptation time. Using stochastic Frank-Wolfe with a fixed step size yields improved O(1/ε^2) adaptation time, but convergence time increases to O(1/ε^2), similar to existing drift-plus-penalty based algorithms. This raises important open questions regarding optimal adaptation.
  4. In this paper, we present the first explicit and non-asymptotic global convergence rates of the BFGS method when implemented with an inexact line search scheme satisfying the Armijo-Wolfe conditions. We show that BFGS achieves a global linear convergence rate of (1 − 1/κ)^t for μ-strongly convex functions with L-Lipschitz gradients, where κ = L/μ represents the condition number. Additionally, if the objective function's Hessian is Lipschitz, BFGS with the Armijo-Wolfe line search achieves a linear convergence rate that depends solely on the line search parameters, independent of the condition number. We also establish a global superlinear convergence rate of O((1/t)^t). These global bounds are all valid for any starting point x_0 and any symmetric positive definite initial Hessian approximation matrix B_0, though the choice of B_0 impacts the number of iterations needed to achieve these rates. By synthesizing these results, we outline the first global complexity characterization of BFGS with the Armijo-Wolfe line search. Additionally, we clearly define a mechanism for selecting the step size to satisfy the Armijo-Wolfe conditions and characterize its overall complexity.
  5. Gørtz, Inge Li; Farach-Colton, Martin; Puglisi, Simon J.; Herman, Grzegorz (Ed.)
    Boob et al. [Boob et al., 2020] described an iterative peeling algorithm called Greedy++ for the Densest Subgraph Problem (DSG) and conjectured that it converges to an optimum solution. Chekuri, Quanrud and Torres [Chandra Chekuri et al., 2022] extended the algorithm to supermodular density problems (of which DSG is a special case) and proved that the resulting algorithm Super-Greedy++ (and hence also Greedy++) converges. In this paper we revisit the convergence proof and provide a different perspective. This is done via a connection to Fujishige’s quadratic program for finding a lexicographically optimal base in a (contra) polymatroid [Satoru Fujishige, 1980], and a noisy version of the Frank-Wolfe method from convex optimization [Frank and Wolfe, 1956; Jaggi, 2013]. This yields a simpler convergence proof, and also shows a stronger property that Super-Greedy++ converges to the optimal dense decomposition vector, answering a question raised in Harb et al. [Harb et al., 2022]. A second contribution of the paper is to understand Thorup’s work on ideal tree packing and greedy tree packing [Thorup, 2007; Thorup, 2008] via the Frank-Wolfe algorithm applied to find a lexicographically optimum base in the graphic matroid. This yields a simpler and more transparent proof. The two results appear disparate but are unified via Fujishige’s result and convex optimization.
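The iterative peeling idea behind Greedy++ in the last item can be sketched as follows. This is an illustrative reading of the commonly described procedure (repeated greedy peeling in which each vertex carries a load accumulated over earlier rounds), not code from any of the cited papers. The function name `greedy_pp`, the edge-list input format, and the tie-breaking by vertex index are assumptions, and the input is assumed to be a simple undirected graph on vertices 0..n-1.

```python
def greedy_pp(n, edges, rounds=5):
    """Sketch of load-based iterative peeling for the densest subgraph problem.

    Density of a vertex set S is (number of edges inside S) / |S|.
    Returns the best density seen and the vertex set achieving it.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    load = [0] * n                      # degrees accumulated at removal time
    best_density, best_set = 0.0, set(range(n))

    for _ in range(rounds):
        alive = set(range(n))
        deg = [len(adj[v]) for v in range(n)]
        m = sum(deg) // 2               # edges in the current subgraph
        while alive:
            density = m / len(alive)
            if density > best_density:
                best_density, best_set = density, set(alive)
            # Peel the vertex with the smallest load plus current degree.
            u = min(alive, key=lambda v: (load[v] + deg[v], v))
            load[u] += deg[u]
            alive.remove(u)
            m -= deg[u]
            for w in adj[u]:
                if w in alive:
                    deg[w] -= 1
    return best_density, best_set


# A 4-clique on vertices 0..3 with one pendant vertex 4 attached to vertex 0.
k4_plus_pendant = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
density, subgraph = greedy_pp(5, k4_plus_pendant)
```

With `rounds=1` the loads are all zero, so the procedure reduces to a single pass of ordinary greedy peeling; later rounds use the accumulated loads to change the peeling order.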