skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The DOI auto-population feature in the Public Access Repository (PAR) will be unavailable from 4:00 PM ET on Tuesday, July 8 until 4:00 PM ET on Wednesday, July 9 due to scheduled maintenance. We apologize for the inconvenience caused.


Search for: All records

Creators/Authors contains: "Mokhtari, Aryan"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Stochastic second-order methods are known to achieve fast local convergence in strongly convex optimization by relying on noisy Hessian estimates to precondition the gradient. Yet, most of these methods achieve superlinear convergence only when the stochastic Hessian noise diminishes, requiring an increase in the per-iteration cost as time progresses. Recent work in \cite{na2022hessian} addressed this issue via a Hessian averaging scheme that achieves a superlinear convergence rate without increasing the per-iteration cost. However, the considered method exhibits a slow global convergence rate, requiring up to ~O(κ^2) iterations to reach the superlinear rate of ~O((1/t)^{t/2}), where κ is the problem's condition number. In this paper, we propose a novel stochastic Newton proximal extragradient method that significantly improves these bounds, achieving a faster global linear rate and reaching the same fast superlinear rate in ~O(κ) iterations. We achieve this by developing a novel extension of the Hybrid Proximal Extragradient (HPE) framework, which simultaneously achieves fast global and local convergence rates for strongly convex functions with access to a noisy Hessian oracle. 
    more » « less
    Free, publicly-accessible full text available December 10, 2025
  2. An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a single task or (ii) they are linear, very little is known about the closer-to-practice case of nonlinear NNs trained on multiple tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an r-dimensional subspace within the d k r-dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of d. In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all r ground-truth features. 
    more » « less
    Free, publicly-accessible full text available July 31, 2025
  3. The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $$\mathcal{O}(d^3)$$, where $$d$$ the is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per iteration cost of $$\mathcal{O}(d^2)$$. When the SNR is low, for GLMs with a polynomial link function of degree $$p$$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $$p$$. Also, for the empirical loss with $$n$$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $$p$$, the iterates of BFGS reach a final statistical radius of $$\mathcal{O}((d/n)^{\frac{1}{2p+2}})$$ after at most $$\log(n/d)$$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $(n/d)$. 
    more » « less
  4. Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second order updates within a lower-dimensional subspace, giving rise to \textit{subspace second-order} methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $$d$$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $$\bigO\left(\frac{1}{mk}+\frac{1}{k^2}\right)$$ for solving convex optimization problems. Here, $$m$$ represents the subspace dimension, which can be significantly smaller than $$d$$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the \emph{Krylov subspace} associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems. 
    more » « less
  5. In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method can achieve a convergence rate of $${\bigO}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$$, where $$d$$ is the problem dimension and $$k$$ is the number of iterations. In particular, in the regime where $$k = \bigO(d)$$, our method matches the \emph{optimal rate} of $$\mathcal{O}(\frac{1}{k^2})$$ by Nesterov's accelerated gradient (NAG). Moreover, in the the regime where $$k = \Omega(d \log d)$$, it outperforms NAG and converges at a \emph{faster rate} of $$\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices. 
    more » « less
  6. Oh, A; Naumann, T; Globerson, A; Saenko, K; Hardt, M; Levine, S (Ed.)
  7. Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case that the initial point and the initial Hessian approximation are chosen properly. In particular, no current analysis for quasi-Newton methods guarantees global convergence with an explicit superlinear convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and we relate the bounded regret of the online problem to the superlinear convergence of our method. 
    more » « less
  8. Ruiz, Francisco; Dy, Jennifer; van de Meent, Jan-Willem (Ed.)
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires $${O}(\max\{1/\epsilon_f,1/\epsilon_g\})$$ iterations to find a solution that is $$\epsilon_f$$-optimal for the upper-level objective and $$\epsilon_g$$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires $${O}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$$ iterations to find an $$(\epsilon_f,\epsilon_g)$$-optimal solution. We also prove stronger convergence guarantees under the Holderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems. 
    more » « less