Title: Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality
Goal-oriented reinforcement learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value-iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds, which in turn imply worst-case bounds that are near-minimax optimal. We hope our study can help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration.
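As a rough illustration of the value-iteration idea mentioned in the abstract (not the paper's exact procedure), the sketch below builds an empirical SSP model from a logged dataset of (state, action, cost, next state) tuples and runs tabular value iteration with the goal state pinned at zero value. The function name, the value cap, and the handling of unseen state-action pairs are assumptions for this toy example.

```python
import numpy as np

def offline_ssp_value_iteration(data, n_states, n_actions, goal, n_iters=1000, v_cap=1e6):
    """Tabular value iteration on the empirical SSP model built from offline data.

    `data` is an iterable of (s, a, cost, s_next) tuples logged by a behavior
    policy.  Illustrative sketch only: the algorithms analyzed in the paper add
    further ingredients to obtain the stated instance-dependent guarantees.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    cost_sum = np.zeros((n_states, n_actions))
    for s, a, c, s_next in data:
        counts[s, a, s_next] += 1
        cost_sum[s, a] += c

    visits = counts.sum(axis=2)                              # N(s, a)
    P_hat = counts / np.maximum(visits[..., None], 1)        # empirical transitions
    c_hat = cost_sum / np.maximum(visits, 1)                 # empirical mean costs

    V = np.zeros(n_states)                                   # V(goal) is pinned at 0
    for _ in range(n_iters):
        Q = c_hat + P_hat @ V                                # Bellman backup
        Q[visits == 0] = v_cap                               # unseen pairs are not trusted
        V_new = np.minimum(Q.min(axis=1), v_cap)
        V_new[goal] = 0.0
        if np.max(np.abs(V_new - V)) < 1e-9:
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)                               # value estimate and greedy policy
```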
Award ID(s):
2007117 2003257
PAR ID:
10346173
Author(s) / Creator(s):
Date Published:
Journal Name:
Uncertainty in Artificial Intelligence
ISSN:
1525-3384
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The configuration balancing problem with stochastic requests generalizes well-studied resource allocation problems such as load balancing and virtual circuit routing. We are given m resources and n requests; each request has multiple possible configurations, each of which increases the load of each resource by some amount. The goal is to select one configuration for each request to minimize the makespan: the load of the most-loaded resource. In the stochastic setting, the amount by which a configuration increases the resource load is uncertain until the configuration is chosen, but we are given a probability distribution. We develop both offline and online algorithms for configuration balancing with stochastic requests. When the requests are known offline, we give a non-adaptive policy for configuration balancing with stochastic requests that $$O(\frac{\log m}{\log \log m})$$-approximates the optimal adaptive policy, which matches a known lower bound for the special case of load balancing on identical machines. When requests arrive online in a list, we give a non-adaptive policy that is $$O(\log m)$$-competitive. Again, this result is asymptotically tight due to information-theoretic lower bounds for special cases (e.g., for load balancing on unrelated machines). Finally, we show how to leverage adaptivity in the special case of load balancing on related machines to obtain a constant-factor approximation offline and an $$O(\log \log m)$$-approximation online. A crucial technical ingredient in all of our results is a new structural characterization of the optimal adaptive policy that allows us to limit the correlations between its decisions.
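As a hedged illustration of the problem setup in the abstract above (not the policies analyzed in the paper), the sketch below implements a naive non-adaptive greedy baseline that assigns each request the configuration minimizing the resulting expected makespan; the function name and the use of expected loads are assumptions for the toy example.

```python
import numpy as np

def greedy_configuration_balancing(expected_loads):
    """Naive non-adaptive baseline for configuration balancing.

    `expected_loads[i][k]` is a length-m vector of the expected load that
    configuration k of request i adds to each of the m resources.  For each
    request we pick the configuration that keeps the expected makespan
    smallest.  This is only a toy baseline for the problem setup, not the
    policies whose approximation guarantees are proved in the paper.
    """
    m = len(expected_loads[0][0])
    load = np.zeros(m)                        # running expected load per resource
    choices = []
    for configs in expected_loads:            # requests processed one by one
        best_k = min(range(len(configs)),
                     key=lambda k: float(np.max(load + np.asarray(configs[k]))))
        load += np.asarray(configs[best_k])
        choices.append(best_k)
    return choices, float(load.max())         # chosen configurations, expected-load makespan
```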
  2. While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with the optimal achievable cost. In particular, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As a warm-up, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the l1 distributional eluder dimension, which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.
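To illustrate only the small-loss intuition (this is not the DistCB algorithm from the paper), the toy sketch below keeps the empirical distribution of observed costs per arm and uses a Bernstein-style, variance-aware lower confidence bound: since a [0, 1]-valued cost has variance at most its mean, low-cost arms are certified quickly. All names and constants are illustrative.

```python
import numpy as np

def variance_aware_lcb(costs_per_arm, t, delta=0.05):
    """Toy variance-aware bandit rule sketching why distributional estimates
    can give 'small-loss' behavior for costs in [0, 1].

    `costs_per_arm[a]` is the list of costs observed so far for arm a, and
    `t` is the current round.  Not the paper's algorithm; a sketch only.
    """
    scores = []
    for costs in costs_per_arm:
        n = len(costs)
        if n == 0:
            scores.append(-np.inf)           # unplayed arm: explore it first
            continue
        mean, var = np.mean(costs), np.var(costs)
        log_term = np.log(3 * len(costs_per_arm) * t / delta)
        bonus = np.sqrt(2 * var * log_term / n) + 3 * log_term / n
        scores.append(mean - bonus)          # optimistic (lower) cost estimate
    return int(np.argmin(scores))            # play the arm with the smallest LCB
```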
  3. Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts — which do not require explicit model estimation — have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption which does not require the full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction. 
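A minimal sketch of the pessimism principle this abstract describes, assuming a tabular finite-horizon MDP and a logged dataset grouped by time step; the exact bonus, constants, and the variance-reduction refinement from the paper are omitted, and the names below are illustrative.

```python
import numpy as np

def pessimistic_q_learning(data, S, A, H, c_b=1.0):
    """LCB-style pessimistic Q-learning sketch for a finite-horizon MDP.

    `data[h]` holds (s, a, r, s_next) transitions collected at step h by the
    behavior policy, with rewards assumed in [0, 1].  A count-based bonus is
    subtracted from each target, which is the pessimism idea; this is a
    simplified sketch, not the paper's exact algorithm.
    """
    Q = np.zeros((H + 1, S, A))              # Q[H] is identically zero
    V = np.zeros((H + 1, S))
    N = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        for s, a, r, s_next in data[h]:
            N[h, s, a] += 1
            n = N[h, s, a]
            lr = (H + 1) / (H + n)           # standard Q-learning step size
            bonus = c_b * np.sqrt(H**3 * np.log(S * A * H) / n)
            target = r + V[h + 1, s_next] - bonus
            Q[h, s, a] = (1 - lr) * Q[h, s, a] + lr * target
        V[h] = np.clip(Q[h].max(axis=1), 0.0, H - h)
    policy = Q[:H].argmax(axis=2)            # greedy policy w.r.t. pessimistic Q
    return policy, V
```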
  4. A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse-reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy toward the sub-optimal policy while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm and a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete-observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored states. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
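A hedged sketch of the improvement-plus-guidance idea described above, assuming policies are PyTorch modules that map states to torch.distributions objects: one update combines a policy-gradient term on online rollouts with a KL guidance term toward a behavior policy cloned from the offline demonstrations. The single weighted loss and all names here are illustrative simplifications, not the paper's exact two-step procedure.

```python
import torch

def logo_style_update(policy, behavior_policy, optimizer,
                      online_batch, demo_states, guidance_weight=0.1):
    """One combined update: policy improvement on online data plus policy
    guidance toward a behavior policy cloned from offline demonstrations.
    `guidance_weight` and the weighted-sum form are assumptions for this sketch.
    """
    states, actions, advantages = online_batch

    # Policy improvement: vanilla policy-gradient surrogate on online rollouts.
    log_probs = policy(states).log_prob(actions)
    improvement_loss = -(log_probs * advantages).mean()

    # Policy guidance: KL(pi_b || pi) on demonstration states, nudging (not
    # forcing) the learned policy toward the demonstrated behavior.
    with torch.no_grad():
        demo_dist = behavior_policy(demo_states)
    guidance_loss = torch.distributions.kl_divergence(
        demo_dist, policy(demo_states)).mean()

    loss = improvement_loss + guidance_weight * guidance_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```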
  5. The online list update problem is defined as follows: we are given a list of items, and the cost to access any particular item is its position from the start of the list. A sequence of item accesses arrives online, and our goal is to dynamically reorder the list so that the aggregate access cost is small. We study the stochastic version of the problem where the items are accessed i.i.d. from an unknown distribution p. The study of the stochastic version goes back at least 60 years to McCabe. In this paper, we first consider the simple online algorithm which swaps an accessed item with the item right before it, unless it is at the very front. This algorithm is known as the Transposition rule. We theoretically analyze the stationary behavior of Transposition and prove that its performance is within a 1+o(1) factor of the optimal offline algorithm for access sequences sampled from heavy-tailed distributions, proving a conjecture of Rivest from 1976. While the stationary behavior of the Transposition rule is theoretically optimal in the aforementioned i.i.d. setting, it can catastrophically fail under adversarial access sequences where only the last and second-to-last items are repeatedly accessed. A desirable outcome would be a policy that performs well under both circumstances. To achieve this, we use reinforcement learning to design an adaptive policy that performs well for both the i.i.d. setting and the above-mentioned adversarial access. Unsurprisingly, the learned policy appears to be an interpolation between Move-to-Front and Transposition, with its behavior closer to Move-to-Front for adversarial access sequences and closer to Transposition for sequences sampled from heavy-tailed distributions, suggesting that the policy is adaptive and capable of responding to patterns in the access sequence.
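For concreteness, here is a minimal sketch of the two rules contrasted above; serving one access returns its cost (the item's 1-indexed position) and updates the list in place.

```python
def access_with_transposition(lst, item):
    """Serve one access under the Transposition rule: pay the item's
    (1-indexed) position, then swap it with its predecessor unless it is
    already at the front.  `lst` is modified in place."""
    i = lst.index(item)
    if i > 0:
        lst[i - 1], lst[i] = lst[i], lst[i - 1]
    return i + 1


def access_with_move_to_front(lst, item):
    """Serve one access under Move-to-Front: pay the position, then move the
    accessed item to the head of the list."""
    i = lst.index(item)
    lst.insert(0, lst.pop(i))
    return i + 1
```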