skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Stochastic Contextual Bandits with Long Horizon Rewards
The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T= h) regimes and derive respective regret upper bounds O(d square-root(sT) +min(q, T) and O( square-root(sdT) ), with sparsity s, feature dimension d, total time horizon T, and q that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon h. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.  more » « less
Award ID(s):
2023166 2007036
PAR ID:
10443285
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
37
Issue:
8
ISSN:
2159-5399
Page Range / eLocation ID:
9525 to 9533
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We prove that the border rank of the Kronecker square of the little Coppersmith–Winograd tensor Tcw,q is the square of its border rank for q > 2 and that the border rank of its Kronecker cube is the cube of its border rank for q > 4. This answers questions raised implicitly by Coppersmith & Winograd (1990, §11) and explicitly by Bl¨aser (2013, Problem 9.8) and rules out the possibility of proving new upper bounds on the exponent of matrix multiplication using the square or cube of a little Coppersmith–Winograd tensor in this range. In the positive direction, we enlarge the list of explicit tensors potentially useful for Strassen’s laser method, introducing a skew-symmetric version of the Coppersmith– Winograd tensor, Tskewcw,q. For q = 2, the Kronecker square of this tensor coincides with the 3 × 3 determinant polynomial, det3 ∈ C9 ⊗ C9 ⊗ C9, regarded as a tensor. We show that this tensor could potentially be used to show that the exponent of matrix multiplication is two. We determine new upper bounds for the (Waring) rank and the (Waring) border rank of det3, exhibiting a strict submultiplicative behaviour for Tskewcw,2 which is promising for the laser method. We establish general results regarding border ranks of Kronecker powers of tensors, and make a detailed study of Kronecker squares of tensors in C3 ⊗ C3 ⊗ C3. 
    more » « less
  2. We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high probability regret upper bound of [Formula: see text] for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy in time horizon T. This result closely matches the known lower bound of [Formula: see text]. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest. 
    more » « less
  3. Byrka, Jaroslaw; Meka, Raghu (Ed.)
    In this work, we prove new relations between the bias of multilinear forms, the correlation between multilinear forms and lower degree polynomials, and the rank of tensors over F₂. We show the following results for multilinear forms and tensors. Correlation bounds. We show that a random d-linear form has exponentially low correlation with low-degree polynomials. More precisely, for d = 2^{o(k)}, we show that a random d-linear form f(X₁,X₂, … , X_d) : (F₂^{k}) ^d → F₂ has correlation 2^{-k(1-o(1))} with any polynomial of degree at most d/2 with high probability. This result is proved by giving near-optimal bounds on the bias of a random d-linear form, which is in turn proved by giving near-optimal bounds on the probability that a sum of t random d-dimensional rank-1 tensors is identically zero. Tensor rank vs Bias. We show that if a 3-dimensional tensor has small rank then its bias, when viewed as a 3-linear form, is large. More precisely, given any 3-dimensional tensor T: [k]³ → F₂ of rank at most t, the bias of the 3-linear form f_T(X₁, X₂, X₃) : = ∑_{(i₁, i₂, i₃) ∈ [k]³} T(i₁, i₂, i₃)⋅ X_{1,i₁}⋅ X_{2,i₂}⋅ X_{3,i₃} is at least (3/4)^t. This bias vs tensor-rank connection suggests a natural approach to proving nontrivial tensor-rank lower bounds. In particular, we use this approach to give a new proof that the finite field multiplication tensor has tensor rank at least 3.52 k, which is the best known rank lower bound for any explicit tensor in three dimensions over F₂. Moreover, this relation between bias and tensor rank holds for d-dimensional tensors for any fixed d. 
    more » « less
  4. The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an ε-optimal policy is Ω(|S||A|H/ ε2) over worst case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in |S| and |A| due to the low rank structure, we show that without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon H to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of Õ((|S|+|A|)poly (d,H)/ε2) for a rank d setting, which is minimax optimal with respect to the scaling of |S|, |A|, and ε. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function. 
    more » « less
  5. Dasgupta, Sanjoy; Mandt, Stephan; Li, Yingzhen (Ed.)
    Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e. long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $$Q$$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $$Q$$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $$\epsilon$$-close to optimal, (ii) estimate optimal average reward with $$\epsilon$$-accuracy, and (iii) estimate the bias function (similar to $$Q$$-function in discounted case) with $$\epsilon$$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$$ samples, (ii) with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$$ samples, and (iii) with $$\tilde{O}(\frac{SA B}{\epsilon^9})$$ samples, where $$t_\mathrm{mix}$$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $$S, A, t_{\mathrm{mix}}, \epsilon$$ for a feasible variant of $$Q$$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community’s best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input. 
    more » « less