NSF PAR Search | NSF Public Access Repository

Better-than-KL PAC-Bayes Bounds

Kuzborskij, Ilja; Jun, Kwang-Sung; Wu, Yulian; Jang, Kyoungseok; Orabona, Francesco (June 2024, Proceedings of the Conference on Learning Theory (COLT))

Full Text Available
Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

Meterez, Alexandru; Joudaki, Amir; Orabona, Francesco; Immer, Alexander; Rätsch, Gunnar; Daneshmand, Hadi (May 2024, International Conference on Learning Representations)

Full Text Available
Generalized Implicit Follow-The-Regularized-Leader

Chen, Keyi; Orabona, Francesco (July 2023, International Conference on Machine Learning)

We propose a new class of online learning algorithms, generalized implicit Follow-The-Regularized-Leader (FTRL), that expands the scope of the FTRL framework. Generalized implicit FTRL can recover known algorithms, such as FTRL with linearized losses and implicit FTRL, and it allows the design of new update rules, as extensions of aProx and Mirror-Prox to FTRL. Our theory is constructive in the sense that it provides a simple unifying framework to design updates that directly improve the worst-case upper bound on the regret. The key idea is substituting the linearization of the losses with a Fenchel-Young inequality. We show the flexibility of the framework by proving that some known algorithms, like the Mirror-Prox updates, are instantiations of the generalized implicit FTRL. Finally, the new framework allows us to recover the temporal variation bound of implicit OMD, with the same computational complexity.
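For reference, the Fenchel-Young inequality named in the abstract has the following standard form. The symbols ψ, x, and θ are notation chosen here for illustration; the abstract does not say which function the inequality is applied to, so this is only the general statement and not the paper's specific construction.

```latex
\psi(x) + \psi^*(\theta) \ge \langle \theta, x \rangle
\quad \text{for all } x, \theta,
\qquad \text{where} \quad
\psi^*(\theta) = \sup_{x} \bigl( \langle \theta, x \rangle - \psi(x) \bigr).
```

Equality holds exactly when θ is a subgradient of ψ at x, which is what makes the inequality a candidate replacement for a linear lower bound on a loss.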
Full Text Available
Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion

Cutkosky, Ashok; Mehta, Harsh; Orabona, Francesco (July 2023, International Conference on Machine Learning)

We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a (δ,ϵ)-stationary point from O(ϵ^(-4) δ^(-1)) stochastic gradient queries to O(ϵ^(-3) δ^(-1)), which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of O(ϵ^(-1.5) δ^(-0.5)). Our techniques also recover all optimal or best-known results for finding ϵ-stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.
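The abstract does not define a (δ,ϵ)-stationary point. A common formulation for Lipschitz objectives, due to Goldstein, is sketched below as an assumption; the paper's exact definition may differ in its details. Here ∂f denotes a generalized (Clarke) subdifferential and conv the convex hull.

```latex
\partial_\delta f(x) = \operatorname{conv}\Bigl( \bigcup_{y :\, \|y - x\| \le \delta} \partial f(y) \Bigr),
\qquad
x \text{ is } (\delta,\epsilon)\text{-stationary if }
\min_{g \in \partial_\delta f(x)} \|g\| \le \epsilon .
```

Under this reading, δ controls how far from x gradients may be collected, and ϵ bounds the norm of the best convex combination of them.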
Full Text Available
Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion

Cutkosky, Ashok; Mehta, Harsh; Orabona, Francesco (July 2023, Proceedings of Machine Learning Research)
Tighter PAC-Bayes Bounds Through Coin-Betting

Jang, Kyoungseok; Jun, Kwang-Sung; Kuzborskij, Ilja; Orabona, Francesco (July 2023, Conference on Learning Theory)

We consider the problem of estimating the mean of a sequence of random elements f(θ, X_1), ..., f(θ, X_n), where f is a fixed scalar function, S = (X_1, ..., X_n) are independent random variables, and θ is a possibly S-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on n examples where f is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions f, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the PAC-Bayes framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve even tighter guarantees. Our approach is based on the coin-betting framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness by showing that relaxing it yields a number of previous results in closed form, including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates non-vacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.
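The coin-betting idea in the abstract can be illustrated in a much simpler setting than PAC-Bayes: a confidence interval for the mean of observations in [0, 1]. The sketch below is not the paper's algorithm. It assumes a uniform mixture over a grid of constant bets in place of a regret-optimal gambling strategy, and the function name, grid sizes, and `shrink` parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def betting_confidence_interval(xs, delta=0.05, n_means=199, n_bets=41, shrink=0.9):
    """Illustrative betting-based confidence interval for data in [0, 1].

    A candidate mean m is rejected once a gambler betting against it
    multiplies an initial wealth of 1 by at least 1/delta.
    """
    xs = np.asarray(xs, dtype=float)
    if xs.size == 0 or np.any((xs < 0.0) | (xs > 1.0)):
        raise ValueError("need at least one observation, all in [0, 1]")
    threshold = np.log(1.0 / delta)
    # Candidate means strictly inside (0, 1) so both bet limits are finite.
    candidates = np.linspace(0.0, 1.0, n_means + 2)[1:-1]
    kept = []
    for m in candidates:
        # Bets in [-1/(1-m), 1/m] keep the wealth nonnegative; shrinking the
        # range keeps every factor 1 + bet * (x - m) at least 1 - shrink.
        bets = np.linspace(-shrink / (1.0 - m), shrink / m, n_bets)
        # Log-wealth of each constant bet: sum_i log(1 + bet * (x_i - m)).
        log_wealth = np.log1p(np.outer(bets, xs - m)).sum(axis=1)
        # Uniform mixture over the bets, computed stably in log space.
        peak = log_wealth.max()
        log_mixture = peak + np.log(np.mean(np.exp(log_wealth - peak)))
        if log_mixture < threshold:
            kept.append(m)
    if not kept:
        # Every grid point was rejected; no interval can be reported.
        return float("nan"), float("nan")
    # Report the hull of the surviving candidates.
    return float(kept[0]), float(kept[-1])

# Example: 100 observations alternating between 0.2 and 0.8 (mean 0.5).
low, high = betting_confidence_interval(np.tile([0.2, 0.8], 50))
print(low, high)
```

If the true mean is m, the mixture wealth is a nonnegative martingale starting at 1, so by Ville's inequality it reaches 1/delta with probability at most delta. This is why the same threshold can be checked after every new observation, up to the resolution of the candidate grid.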
Full Text Available
Tight Concentrations and Confidence Sequences From the Regret of Universal Portfolio

https://doi.org/10.1109/TIT.2023.3330187

Orabona, Francesco; Jun, Kwang-Sung (January 2024, IEEE Transactions on Information Theory)
Robustness to Unbounded Smoothness of Generalized SignSGD

Crawshaw, Michael; Liu, Mingrui; Orabona, Francesco; Zhang, Wei; Zhuang, Zhenxun (November 2022, Advances in neural information processing systems)
Oh, Alice H.; Agarwal, Alekh; Belgrave, Danielle; Cho, Kyunghyun (Ed.)
Traditional analyses in non-convex optimization typically rely on the smoothness assumption, namely requiring the gradients to be Lipschitz. However, recent evidence shows that this smoothness condition does not capture the properties of some deep learning objective functions, including the ones involving Recurrent Neural Networks and LSTMs. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this relaxed assumption, it has been theoretically and empirically shown that the gradient-clipped SGD has an advantage over the vanilla one. In this paper, we show that clipping is not indispensable for Adam-type algorithms in tackling such scenarios: we theoretically prove that a generalized SignSGD algorithm can obtain convergence rates similar to those of SGD with clipping but does not need explicit clipping at all. This family of algorithms on one end recovers SignSGD and on the other end closely resembles the popular Adam algorithm. Our analysis underlines the critical role that momentum plays in analyzing SignSGD-type and Adam-type algorithms: it not only reduces the effects of noise, thus removing the need for large mini-batches in previous analyses of SignSGD-type algorithms, but it also substantially reduces the effects of unbounded smoothness and gradient norms. To the best of our knowledge, this work is the first one showing the benefit of Adam-type algorithms compared with non-adaptive gradient algorithms such as gradient descent in the unbounded smoothness setting. We also compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating others.
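One way to picture a family that sits between SignSGD and Adam is a momentum update normalized by a running second moment of the momentum. The sketch below is an assumption made for illustration and is not confirmed by the abstract as the paper's algorithm; the function name, parameter names, and default values are hypothetical.

```python
import numpy as np

def sign_momentum_step(x, grad, m, v, lr=0.01, beta1=0.9, beta2=0.0, eps=1e-12):
    """One illustrative step of a SignSGD/Adam-style interpolation.

    With beta2 = 0 the normalizer equals |m|, so the step is lr * sign(m)
    (SignSGD with momentum). With beta2 close to 1 the normalizer is a slowly
    moving average, which resembles Adam without bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad      # momentum of the gradients
    v = beta2 * v + (1.0 - beta2) * m * m     # running second moment of m
    x = x - lr * m / (np.sqrt(v) + eps)       # per-coordinate normalized step
    return x, m, v

# Example: one step from the origin with beta2 = 0 moves each coordinate
# by lr in the direction opposite to the gradient's sign.
x0 = np.zeros(3)
g0 = np.array([2.0, -0.5, 4.0])
x1, m1, v1 = sign_momentum_step(x0, g0, np.zeros(3), np.zeros(3), lr=0.1)
print(x1)
```

Because each coordinate of the step is bounded by roughly `lr` in magnitude when `beta2 = 0`, no separate clipping operation appears in the update, which matches the intuition the abstract describes.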
Full Text Available
Understanding AdamW through Proximal Methods and Scale-Freeness

Zhuang, Zhenxun; Liu, Mingrui; Cutkosky, Ashok; Orabona, Francesco (August 2022, Transactions on machine learning research)

Full Text Available
