We consider the online linear optimization problem, where at every step the algorithm plays a point x_t in the unit ball, and suffers loss for some cost vector c_t that is then revealed to the algorithm. Recent work showed that if an algorithm receives a "hint" h_t that has non-trivial correlation with c_t before it plays x_t, then it can achieve a logarithmic regret guarantee, improving on the classical sqrt(T) bound. In this work, we study the question of whether an algorithm really requires a hint at every time step. Somewhat surprisingly, we show that an algorithm can obtain logarithmic regret with just O(sqrt(T)) hints under a natural query model. We give two applications of our result, to the well-studied setting of optimistic regret bounds and to the problem of online learning with abstention. 
                        more » 
                        « less   
                    
                            
                            Finite Time Logarithmic Regret Bounds for Self-Tuning Regulation
                        
                    
    
            We establish the first finite-time logarithmic regret bounds for the self-tuning regulation problem. We introduce a modified version of the certainty equivalence algorithm, which we call PIECE, that clips inputs in addition to utilizing probing inputs for exploration. We show that it has a ClogT upper bound on the regret after T time-steps for bounded noise, and Clog3T in the case of sub-Gaussian noise, unlike the LQ problem where logarithmic regret is shown to be not possible. The PIECE algorithm is also designed to address the critical challenge of poor initial transient performance of reinforcement learning algorithms for linear systems. Comparative simulation results illustrate the improved performance of PIECE. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2038625
- PAR ID:
- 10563426
- Publisher / Repository:
- Proceedings of the 41st International Conference on Machine Learning
- Date Published:
- Page Range / eLocation ID:
- 235:45571-45636
- Format(s):
- Medium: X
- Location:
- Vienna, Austria
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Ruiz, Francisco and (Ed.)We consider the problem of universal dynamic regret minimization under exp-concave and smooth losses. We show that appropriately designed Strongly Adaptive algorithms achieve a dynamic regret of $$\tilde O(d^2 n^{1/5} [\mathcal{TV}_1(w_{1:n})]^{2/5} \vee d^2)$$, where $$n$$ is the time horizon and $$\mathcal{TV}_1(w_{1:n})$$ a path variational based on second order differences of the comparator sequence. Such a path variational naturally encodes comparator sequences that are piece-wise linear – a powerful family that tracks a variety of non-stationarity patterns in practice (Kim et al., 2009). The aforementioned dynamic regret is shown to be optimal modulo dimension dependencies and poly-logarithmic factors of $$n$$. To the best of our knowledge, this path variational has not been studied in the non-stochastic online learning literature before. Our proof techniques rely on analysing the KKT conditions of the offline oracle and requires several non-trivial generalizations of the ideas in Baby and Wang (2021) where the latter work only implies an $$\tilde{O}(n^{1/3})$$ regret for the current problem.more » « less
- 
            Abstract We consider the problem of unconstrained online convex optimization (OCO) with sub- exponential noise, a strictly more general problem than the standard OCO. In this setting, the learner receives a subgradient of the loss functions corrupted by sub-exponential noise and strives to achieve optimal regret guarantee, without knowledge of the competitor norm, i.e., in a parameter-free way. Recently, Cutkosky and Boahen (COLT 2017) proved that, given unbounded subgradients, it is impossible to guarantee a sublinear regret due to an exponential penalty. This paper shows that it is possible to go around the lower bound by allowing the observed subgradients to be unbounded via stochastic noise. However, the presence of unbounded noise in unconstrained OCO is challenging; existing algorithms do not provide near-optimal regret bounds or fail to have a guarantee. So, we design a novel parameter-free OCO algorithm for Banach space, which we call BANCO, via a reduction to betting on noisy coins. We show that BANCO achieves the optimal regret rate in our problem. Finally, we show the application of our results to obtain a parameter-free locally private stochastic subgradient descent algorithm, and the connection to the law of iterated logarithms.more » « less
- 
            We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss {\it and} logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.more » « less
- 
            How can non-communicating agents learn to share congested resources efficiently? This is a challeng- ing task when the agents can access the same resource simultaneously (in contrast to multi-agent multi-armed bandit problems) and the resource valuations differ among agents. We present a fully distributed algorithm for learning to share in congested environments and prove that the agents’ regret with respect to the optimal allocation is poly-logarithmic in the time horizon. Performance in the non-asymptotic regime is illustrated in numerical simulations. The distributed algorithm has applications in cloud computing and spectrum sharing.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    