 Award ID(s):
 1901252
 NSFPAR ID:
 10349120
 Date Published:
 Journal Name:
 Conference on Computational Learning Theory
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Loh, P.; Raginsky, M. (Eds.) We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, non-asymptotic sense and an asymptotic sense. On the non-asymptotic side, we prove risk bounds on the last iterate of ROOT-SGD with leading-order terms that match the optimal statistical risk with a unity prefactor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) ROOT-SGD converges asymptotically to a Gaussian limit with the Cramér–Rao optimal asymptotic covariance, for a broad range of step-size choices.
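The core idea of the abstract, a recursive averaging of past stochastic gradients, can be conveyed with a toy sketch. The update below (evaluating the same noise sample at the current and previous iterates, and down-weighting the old estimate by a one-over-t factor) is one plausible reading of that recursive-averaging idea, not the authors' exact algorithm; the quadratic objective, step size, and Gaussian noise model are illustrative choices.

```python
import random

def root_sgd(theta_star=3.0, eta=0.5, noise_sd=1.0, n=20000, seed=0):
    """Recursive one-over-t averaged SGD on f(x) = 0.5 * (x - theta_star)**2."""
    rng = random.Random(seed)

    def grad(x, xi):
        # stochastic gradient of the quadratic, with additive noise xi
        return (x - theta_star) + xi

    x_prev = 0.0
    v = grad(x_prev, rng.gauss(0.0, noise_sd))  # v_1: plain stochastic gradient
    x = x_prev - eta * v
    for t in range(2, n + 1):
        xi = rng.gauss(0.0, noise_sd)
        # recursive averaging: the same sample xi is evaluated at both the
        # current and previous iterates, so their noise largely cancels
        v = grad(x, xi) + (1.0 - 1.0 / t) * (v - grad(x_prev, xi))
        x_prev, x = x, x - eta * v
    return x
```

On this toy problem the variance-reduced gradient estimate lets the last iterate itself (no tail averaging) land close to the optimum.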

We study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed-point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher-order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise (covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$) and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).
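The averaged stochastic-approximation scheme the abstract analyzes can be sketched on a small linear fixed-point equation. For simplicity this toy uses i.i.d. additive noise rather than the Markov noise the paper handles, and the matrix, vector, noise level, and step-size schedule are all illustrative.

```python
import random

def averaged_linear_sa(n=50000, seed=1):
    """Polyak-Ruppert averaged SA for the 2-d equation A_bar theta = b_bar."""
    rng = random.Random(seed)
    # A_bar is positive definite, so theta* = A_bar^{-1} b_bar = (2/7, 6/7)
    A = [[2.0, 0.5], [0.5, 1.0]]
    b = [1.0, 1.0]
    theta = [0.0, 0.0]
    avg = [0.0, 0.0]
    for t in range(1, n + 1):
        step = 1.0 / t ** 0.6  # slowly decaying step size
        # noisy observation of the residual b_bar - A_bar theta
        resid = [b[i] + rng.gauss(0.0, 0.5)
                 - sum(A[i][j] * theta[j] for j in range(2)) for i in range(2)]
        theta = [theta[i] + step * resid[i] for i in range(2)]
        # running average of the iterates (the instance-optimal estimator)
        avg = [avg[i] + (theta[i] - avg[i]) / t for i in range(2)]
    return avg
```

The raw iterate `theta` keeps fluctuating at the scale of the step size, while the averaged iterate settles near the fixed point.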

We study an inventory management mechanism that uses two stochastic programs (SPs), the customary one‐period assemble‐to‐order (ATO) model and its relaxation, to conceive control policies for dynamic ATO systems. We introduce a class of ATO systems, those that possess what we call a “chained BOM.” We prove that having a chained BOM is a sufficient condition for both SPs to be convex in the first‐stage decision variables. We show by examples the necessity of the condition. For ATO systems with a chained BOM, our result implies that the optimal integer solutions of the SPs can be found efficiently, thus expediting the calculation of control parameters. The M system is a representative chained BOM system with two components and three products. We show that in this special case, the SPs can be solved as a one‐stage optimization problem. The allocation policy can also be reduced to simple, intuitive instructions, of which there are four distinct sets, one for each of four different parameter regions. We highlight the need for component reservation in one of these four regions. Our numerical studies demonstrate that asymptotic optimality is a significant advantage of the SP‐based approach over alternative approaches. Our numerical comparisons also show that, outside of the asymptotic regime, the SP‐based approach maintains a commanding lead over the alternative policies. Our findings indicate that the SP‐based approach is a promising inventory management strategy that warrants further development for more general systems and practical implementations.
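As a toy illustration of the SP idea on the M system (two components, three products), the snippet below evaluates an integer base-stock policy by sample averaging and then brute-forces the small integer search. The revenues, holding cost, demand distribution, and greedy allocation rule are all hypothetical stand-ins, not the paper's model or its four-region allocation policy.

```python
import random

def evaluate_base_stock(s1, s2, n_samples=2000, seed=2):
    """Sample-average one-period profit of base-stock levels (s1, s2)."""
    # M system: component 1 serves products A and B; component 2 serves
    # products B and C; product B consumes one unit of each component
    rng = random.Random(seed)
    rev = {"A": 1.0, "B": 3.0, "C": 1.0}  # hypothetical unit revenues
    hold = 0.1                             # hypothetical leftover holding cost
    total = 0.0
    for _ in range(n_samples):
        dA, dB, dC = (rng.randint(0, 4) for _ in range(3))
        b_alloc = min(dB, s1, s2)          # greedy: serve product B first
        a_alloc = min(dA, s1 - b_alloc)
        c_alloc = min(dC, s2 - b_alloc)
        profit = rev["A"] * a_alloc + rev["B"] * b_alloc + rev["C"] * c_alloc
        profit -= hold * ((s1 - b_alloc - a_alloc) + (s2 - b_alloc - c_alloc))
        total += profit
    return total / n_samples

# the integer search stays cheap because the toy system is so small
best = max(((s1, s2) for s1 in range(9) for s2 in range(9)),
           key=lambda s: evaluate_base_stock(*s))
```

In a setting with a chained BOM, the convexity result in the abstract is what justifies searching for integer first-stage solutions efficiently rather than by such brute force.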
This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, it presents a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient both to reduce the variance of the stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method of averaging the final few iterates of SGD in order to decrease the variance in SGD's final iterate. This work presents sharp finite-sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, it establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between learning rate and batch size when considering the excess risk of the final iterate of an SGD procedure. Next, the mini-batching characterization is utilized to provide a highly parallelizable SGD method that achieves the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound is provided for model averaging, a communication-efficient parallelization scheme. Finally, this work sheds light on fundamental differences in SGD's behavior when dealing with misspecified models in the non-realizable least squares problem: maximal step sizes ensuring minimax risk for the misspecified case must depend on the noise properties. The analysis tools used by this paper generalize the operator view of averaged SGD (Défossez and Bach, 2015) and develop a novel analysis bounding these operators to characterize the generalization error. These techniques are of broader interest in analyzing various computational aspects of stochastic approximation.
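The two averaging schemes this abstract analyzes can be illustrated together on one-dimensional least squares. The model, batch size, learning rate, and the choice to average the final half of the iterates are illustrative, not the paper's settings.

```python
import random

def tail_averaged_sgd(w_star=2.0, batch=8, n_iters=4000, lr=0.1, seed=3):
    """Mini-batched SGD with tail-averaging on y = w_star * x + noise."""
    rng = random.Random(seed)
    w = 0.0
    tail_sum, tail_count = 0.0, 0
    for t in range(n_iters):
        g = 0.0
        for _ in range(batch):  # mini-batching: average `batch` sample gradients
            x = rng.gauss(0.0, 1.0)
            y = w_star * x + rng.gauss(0.0, 0.5)
            g += (w * x - y) * x
        w -= lr * g / batch
        if t >= n_iters // 2:   # tail-averaging: average the final iterates
            tail_sum += w
            tail_count += 1
    return tail_sum / tail_count
```

Mini-batching shrinks the per-step gradient variance, and tail-averaging further washes out the residual fluctuations of the final iterates, so the returned estimate sits much closer to `w_star` than any single late iterate.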

We propose a novel framework for analyzing convergence rates of stochastic optimization algorithms with adaptive step sizes. This framework is based on analyzing properties of an underlying generic stochastic process; in particular, we derive a bound on the expected stopping time of this process. We utilize this framework to analyze the expected global convergence rates of a stochastic variant of a traditional trust-region method. Although traditional trust-region methods rely on exact computations of the gradient, Hessian, and values of the objective function, this method assumes that these values are available only up to some dynamically adjusted accuracy. Moreover, this accuracy is assumed to hold only with some sufficiently large, but fixed, probability, without any additional restrictions on the variance of the errors. This setting applies, for example, to standard stochastic optimization and machine learning formulations. Improving upon prior analysis, we show that the stochastic process defined by the trust-region method satisfies the assumptions of our proposed general framework. The stopping time in this setting is defined by an iterate satisfying a first-order accuracy condition. We demonstrate the first global complexity bound for a stochastic trust-region method under the assumption of sufficiently accurate stochastic gradients. Finally, we apply the same framework to derive second-order complexity bounds under additional assumptions.
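A bare-bones trust-region loop driven by inexact (sampled) gradient estimates can convey the setting. The objective, noise model, acceptance thresholds, and radius-update constants below are illustrative, and the step is a simple Cauchy-style step along the negative estimated gradient rather than the method analyzed in the paper.

```python
import math
import random

def stochastic_trust_region(n_iters=200, seed=4):
    """Trust-region descent on a 2-d quadratic with noisy gradient estimates."""
    rng = random.Random(seed)
    f = lambda x: x[0] ** 2 + x[1] ** 2   # illustrative smooth objective

    def noisy_grad(x, m=16):
        # average m noisy samples: the estimate is accurate only up to
        # sampling error, mimicking dynamically adjusted accuracy
        g = [0.0, 0.0]
        for _ in range(m):
            for i in range(2):
                g[i] += 2.0 * x[i] + rng.gauss(0.0, 0.5)
        return [gi / m for gi in g]

    x = [3.0, -2.0]
    radius = 1.0
    for _ in range(n_iters):
        g = noisy_grad(x)
        gnorm = math.hypot(g[0], g[1])
        if gnorm < 1e-12:
            continue
        # Cauchy-style step: steepest descent clipped to the trust region
        # (gnorm / 2 minimizes the quadratic model along -g for this f)
        step = min(radius, gnorm / 2.0)
        trial = [x[i] - step * g[i] / gnorm for i in range(2)]
        predicted = step * gnorm - step ** 2  # model decrease along the step
        actual = f(x) - f(trial)
        rho = actual / predicted if predicted > 0 else -1.0
        if rho > 0.1:                      # sufficient decrease: accept
            x = trial
            if rho > 0.75:                 # very good model: expand region
                radius = min(2.0 * radius, 10.0)
        else:                              # poor model: shrink the region
            radius *= 0.5
    return x
```

The actual-versus-predicted ratio test is what keeps the iteration stable despite the noisy gradients: steps taken in a bad (noise-dominated) direction are rejected and the radius contracts until the model is trustworthy again.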