Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms -- DORM, DORM+, and AdaHedgeD -- arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.
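To make the delay-as-optimism reduction concrete, the following is a minimal sketch (not the paper's DORM, DORM+, or AdaHedgeD implementations) of optimistic Hedge on the probability simplex when loss vectors arrive a fixed number of rounds late: a hint vector stands in for the losses that have not yet been revealed, and a zero hint recovers plain delayed Hedge. The function names, the fixed-delay model, and the last-loss hint are illustrative assumptions.

```python
import numpy as np

def optimistic_hedge_delayed(losses, delay, eta=0.1, hinter=None):
    """Sketch of optimistic Hedge with feedback delayed by `delay` rounds.

    `losses` is a (T, n_experts) array of per-round loss vectors.  At round
    t, only the losses of rounds < t - delay have been revealed; a hint (a
    guess for the delay + 1 outstanding loss vectors, including the current
    one) is added to the observed cumulative loss before forming weights.
    """
    T, n = losses.shape
    cum_observed = np.zeros(n)                 # sum of revealed loss vectors
    plays = np.zeros((T, n))
    for t in range(T):
        revealed = losses[:max(t - delay, 0)]  # feedback available at round t
        hint = hinter(revealed, n, delay) if hinter is not None else np.zeros(n)
        w = np.exp(-eta * (cum_observed + hint))
        plays[t] = w / w.sum()
        if t - delay >= 0:                     # round t - delay arrives now
            cum_observed += losses[t - delay]
    return plays

def recent_loss_hint(revealed, n, delay):
    """Illustrative hint: pretend every outstanding round will look like the
    most recently revealed loss (zero hint before any feedback arrives)."""
    if len(revealed) == 0:
        return np.zeros(n)
    return (delay + 1) * revealed[-1]

# Usage: plays = optimistic_hedge_delayed(np.random.rand(100, 5), delay=3,
#                                         hinter=recent_loss_hint)
```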
Wang, Junxiong; Basu, Debabrota; Trummer, Immanuel. Proceedings of the AAAI Conference on Artificial Intelligence.
In black-box optimization problems, we aim to maximize an unknown objective function that is accessible only through the feedback of an evaluation or simulation oracle. In real life, the feedback from such oracles is often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if exact evaluations are expensive but coarse approximations are available at a lower cost, the feedback can be multi-fidelity. To address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay- and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify the regret of PCTS under delayed, noisy, and multi-fidelity feedback. Specifically, we derive regret bounds for PCTS equipped with delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV). Given a horizon T, PCTS retains the regret bound of non-delayed HOO for expected delays of O(log T), and worsens by a factor of T^((1-α)/(d+2)) for expected delays of O(T^(1-α)) with α ∈ (0,1]. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms state-of-the-art black-box optimization methods under feedback with different noise levels, delays, and fidelities.
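As a rough illustration of the delayed bandit feedback model (not the authors' PCTS code), the sketch below runs a UCB1-style index policy that computes its confidence bounds only from rewards that have already arrived; the delay model, interface, and names are assumptions.

```python
import math
import random
from collections import defaultdict

def delayed_ucb1(arms, horizon, delay_fn):
    """Sketch of UCB1 when each reward arrives after a (possibly random) delay.

    `arms` is a list of reward-sampling callables; `delay_fn()` returns the
    delay (in rounds) of a freshly generated reward.  Indices are computed
    only from rewards that have arrived by the current round.
    """
    pulls = defaultdict(int)          # pulls whose reward has arrived
    sums = defaultdict(float)         # sum of arrived rewards per arm
    pending = []                      # (arrival_round, arm, reward)
    history = []
    for t in range(1, horizon + 1):
        # Deliver feedback whose delay has elapsed.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, a, r in arrived:
            pulls[a] += 1
            sums[a] += r
        # UCB1 index over arrived feedback; unexplored arms are pulled first.
        def index(a):
            if pulls[a] == 0:
                return float("inf")
            mean = sums[a] / pulls[a]
            return mean + math.sqrt(2.0 * math.log(t) / pulls[a])
        a = max(range(len(arms)), key=index)
        reward = arms[a]()
        pending.append((t + delay_fn(), a, reward))
        history.append(a)
    return history

# Usage: two Bernoulli arms with random delays of up to 5 rounds.
# arms = [lambda: float(random.random() < 0.4), lambda: float(random.random() < 0.6)]
# plays = delayed_ucb1(arms, horizon=1000, delay_fn=lambda: random.randint(0, 5))
```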
Cutkosky, Ashok; Mehta, Harsh; Orabona, Francesco. International Conference on Machine Learning.
We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a (δ,ϵ)-stationary point from O(ϵ^(-4) δ^(-1)) stochastic gradient queries to O(ϵ^(-3) δ^(-1)), which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order-smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of O(ϵ^(-1.5) δ^(-0.5)). Our techniques also recover all optimal or best-known results for finding ϵ-stationary points of smooth or second-order-smooth objectives in both stochastic and deterministic settings.
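The core of such a reduction can be sketched as follows. This is a simplified, assumed rendering of the online-to-non-convex pattern described above, with projected online gradient descent standing in for the inner online learner; the step bound, learning rate, and output rule are chosen for illustration rather than taken from the paper.

```python
import numpy as np

def online_to_nonconvex(grad_oracle, x0, steps, step_bound=0.01, lr=0.005):
    """Sketch of reducing non-smooth non-convex optimization to online
    learning: an online learner proposes bounded increments, and the
    stochastic gradient observed along each increment is fed back to it
    as a linear loss.  The online learner here is projected online
    gradient descent on the ball of radius `step_bound`.
    `grad_oracle(x)` is an assumed stochastic-gradient interface.
    """
    rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)                 # online learner's current play
    iterates = []
    for _ in range(steps):
        x_new = x + delta
        # Stochastic gradient at a random point on the segment [x, x_new];
        # this ties the learner's linear regret to function decrease.
        s = rng.uniform()
        g = grad_oracle(x + s * delta)
        # Online gradient descent step on the linear loss <g, .>, projected
        # back onto the norm ball of radius step_bound.
        delta = delta - lr * g
        norm = np.linalg.norm(delta)
        if norm > step_bound:
            delta *= step_bound / norm
        x = x_new
        iterates.append(x.copy())
    return iterates
```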
Characterizing the performance of no-regret dynamics in multi-player games is a foundational problem at the interface of online learning and game theory. Recent results have revealed that when all players adopt specific learning algorithms, it is possible to improve exponentially over what is predicted by the overly pessimistic no-regret framework in the traditional adversarial regime, thereby leading to faster convergence to the set of coarse correlated equilibria (CCE), a standard game-theoretic equilibrium concept. Yet, despite considerable recent progress, the fundamental complexity barriers for learning in normal- and extensive-form games are poorly understood. In this paper, we take a step towards closing this gap by first showing that, barring major complexity breakthroughs, any polynomial-time learning algorithm in extensive-form games needs at least 2^(log^(1/2−o(1)) |T|) iterations for the average regret to reach below even an absolute constant, where |T| is the number of nodes in the game. This establishes a superpolynomial separation between no-regret learning in normal- and extensive-form games, as in the former class a logarithmic number of iterations suffices to achieve constant average regret. Furthermore, our results imply that algorithms such as multiplicative weights update, as well as its optimistic counterpart, require at least 2^((log log m)^(1/2−o(1))) iterations to attain an O(1)-CCE in m-action normal-form games under any parameterization. These are the first non-trivial, and dimension-dependent, lower bounds in that setting for the most well-studied algorithms in the literature. From a technical standpoint, we follow a beautiful connection recently made by Foster, Golowich, and Kakade (ICML '23) between sparse CCE and Nash equilibria in the context of Markov games. Consequently, our lower bounds rule out polynomial-time algorithms well beyond the traditional online learning framework, capturing techniques commonly used for accelerating centralized equilibrium computation.
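For reference, the dynamics targeted by these lower bounds can be sketched as follows: optimistic multiplicative weights update in self-play on a two-player normal-form game, where each player reweights by its cumulative loss plus one extra copy of the most recent loss vector, and the empirical joint play is the candidate approximate CCE. The loss-matrix convention and step size below are illustrative assumptions.

```python
import numpy as np

def optimistic_mwu_selfplay(A, B, rounds, eta=0.05):
    """Sketch of optimistic multiplicative weights update in self-play on a
    two-player normal-form game with loss matrices A (row player) and B
    (column player).  Each player runs MWU on its cumulative loss plus one
    extra copy of the most recent loss vector (the "optimistic" prediction
    that the next loss will look like the last one).  The empirical average
    of the joint play approximates a coarse correlated equilibrium when
    both players have low regret.
    """
    m, n = A.shape
    x, y = np.ones(m) / m, np.ones(n) / n
    Lx, Ly = np.zeros(m), np.zeros(n)          # cumulative loss vectors
    joint = np.zeros((m, n))
    for _ in range(rounds):
        joint += np.outer(x, y)                # record this round's joint play
        lx, ly = A @ y, B.T @ x                # expected losses this round
        Lx += lx
        Ly += ly
        x = np.exp(-eta * (Lx + lx)); x /= x.sum()   # optimistic: + last loss
        y = np.exp(-eta * (Ly + ly)); y /= y.sum()
    return joint / rounds                      # approximate CCE distribution
```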
Xiong, Zhihan; Shen, Ruoqi; Cui, Qiwen; Fazel, Maryam; Du, Simon S. Advances in Neural Information Processing Systems.
We study algorithms that use randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case Õ(H√(SAT)) regret bound for episodic time-inhomogeneous Markov decision processes, where S is the size of the state space, A is the size of the action space, H is the planning horizon, and T is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions and, for the first time, matches the Ω(H√(SAT)) lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
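A minimal tabular sketch of exploration with randomized value functions is given below. It is not the authors' algorithm: the noise scale is a simple Hoeffding-style c/sqrt(n) rather than the Bernstein-type magnitude, the clipping operation is omitted, and the environment interface is an assumption.

```python
import numpy as np

def randomized_value_iteration(env, S, A, H, episodes, c=1.0, seed=0):
    """Sketch of exploration via randomized value functions in a tabular,
    episodic, time-inhomogeneous MDP.  Each episode draws a single random
    seed, adds count-scaled Gaussian noise to the estimated rewards, runs
    backward induction on the empirical model, and acts greedily.
    `env.reset()` and `env.step(s, a, h)` are assumed interfaces returning a
    start state and a (next_state, reward) pair, respectively.
    """
    rng = np.random.default_rng(seed)
    counts = np.ones((H, S, A))                  # visit counts (init 1)
    r_sum = np.zeros((H, S, A))                  # summed observed rewards
    p_counts = np.ones((H, S, A, S)) / S         # transition pseudo-counts
    for _ in range(episodes):
        episode_seed = rng.standard_normal((H, S, A))   # one seed per episode
        # Backward induction on the perturbed empirical model.
        Q = np.zeros((H + 1, S, A))
        for h in reversed(range(H)):
            r_hat = r_sum[h] / counts[h]
            p_hat = p_counts[h] / p_counts[h].sum(axis=-1, keepdims=True)
            noise = c * episode_seed[h] / np.sqrt(counts[h])
            V_next = Q[h + 1].max(axis=-1)
            Q[h] = r_hat + noise + p_hat @ V_next
        # Roll out one episode greedily w.r.t. the randomized Q.
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r = env.step(s, a, h)
            counts[h, s, a] += 1
            r_sum[h, s, a] += r
            p_counts[h, s, a, s_next] += 1
            s = s_next
    return Q  # randomized Q-estimates from the final episode
```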
Flaspohler, Genevieve E, Orabona, Francesco, Cohen, Judah, Mouatadid, Soukayna, Oprescu, Miruna, Orenstein, Paulo, and Mackey, Lester. Online Learning with Optimism and Delay. Retrieved from https://par.nsf.gov/biblio/10310892. Proceedings of Machine Learning Research 139.
Flaspohler, Genevieve E, Orabona, Francesco, Cohen, Judah, Mouatadid, Soukayna, Oprescu, Miruna, Orenstein, Paulo, and Mackey, Lester.
"Online Learning with Optimism and Delay". Proceedings of Machine Learning Research 139 (). Country unknown/Code not available. https://par.nsf.gov/biblio/10310892.
@article{osti_10310892,
title = {Online Learning with Optimism and Delay},
url = {https://par.nsf.gov/biblio/10310892},
abstractNote = {Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms—DORM, DORM+, and AdaHedgeD—arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.},
journal = {Proceedings of Machine Learning Research},
volume = {139},
author = {Flaspohler, Genevieve E and Orabona, Francesco and Cohen, Judah and Mouatadid, Soukayna and Oprescu, Miruna and Orenstein, Paulo and Mackey, Lester},
editor = {Meila, Marina and Zhang, Tong}
}