Award ID(s): 1703574
Publication Date:
NSFPAR ID: 10276109
Journal Name: Advances in Neural Information Processing Systems
Issue: 33
ISSN: 1049-5258
Sponsoring Org: National Science Foundation
More Like this

The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy-gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective balancing expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses that can also scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
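For intuition, here is a minimal sketch of one way a soft-robust objective over a finite set of reward hypotheses could be scored; the function name, the blending weight `lam`, and the CVaR level `alpha` are illustrative choices, not the paper's notation or algorithm.

```python
import numpy as np

def soft_robust_score(policy_returns, lam=0.5, alpha=0.2):
    """Blend expected return with a risk-averse tail measure (CVaR).

    policy_returns: the policy's expected return under each sampled reward
                    hypothesis (e.g., draws from a learned reward posterior).
    lam:   weight on expected performance vs. risk (illustrative parameter).
    alpha: CVaR level -- average over the worst alpha-fraction of hypotheses.
    """
    expected = policy_returns.mean()
    k = max(1, int(np.ceil(alpha * len(policy_returns))))
    worst_k = np.sort(policy_returns)[:k]   # lowest returns across hypotheses
    cvar = worst_k.mean()                   # conditional value at risk
    return lam * expected + (1.0 - lam) * cvar

# Example: returns of one policy under 5 reward hypotheses
returns = np.array([1.0, 0.9, 0.8, 0.2, 1.1])
print(soft_robust_score(returns))  # hedges against the 0.2 worst-case hypothesis
```

Setting `lam = 1` recovers the risk-neutral objective, while `lam = 0` optimizes only the tail, which mirrors the risk-neutral-to-risk-averse family of behaviors described above.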

We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. Compared with the classical notion of static regret, dynamic regret is a stronger notion as it explicitly accounts for the non-stationarity of environments. The dynamic regret attained by the proposed algorithms interpolates between different regimes of non-stationarity, and moreover satisfies a notion of adaptive (near-)optimality, in the sense that it matches the (near-)optimal static regret under slow-changing environments. The dynamic regret bound features two components, one arising from exploration, which deals with the uncertainty of transition kernels, and the other arising from adaptation, which deals with non-stationary environments. Specifically, we show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
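As a hedged sketch of the distinction drawn here, in generic episodic-MDP notation (the symbols below are assumptions, not the paper's): static regret over K episodes benchmarks against the single best fixed policy in hindsight, while dynamic regret lets the comparator change with each episode's adversarially chosen reward,

$$\text{Static-Regret}(K) = \max_{\pi}\sum_{k=1}^{K}\Big(V^{\pi}_{k}(s_{1}^{k}) - V^{\pi_{k}}_{k}(s_{1}^{k})\Big), \qquad \text{D-Regret}(K) = \sum_{k=1}^{K}\Big(\max_{\pi} V^{\pi}_{k}(s_{1}^{k}) - V^{\pi_{k}}_{k}(s_{1}^{k})\Big),$$

so dynamic regret always upper-bounds static regret and collapses to it when the environment does not change across episodes.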

The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $|{\mathcal S}|$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take
$$\frac{1}{\eta}\,|{\mathcal S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} \ \text{iterations}$$
to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
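The update rule analyzed above is standard; the sketch below shows exact softmax policy gradient ascent on a tiny tabular MDP, with the gradient computed from the known model via the policy-gradient theorem. All names and the specific MDP are illustrative, not the three-action construction from the paper.

```python
import numpy as np

# Tiny tabular MDP: known transitions P[s, a, s'] and rewards r[s, a].
S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))
rho = np.ones(S) / S                         # initial-state distribution

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a]

def exact_gradient(theta):
    pi = softmax_policy(theta)
    P_pi = np.einsum('saj,sa->sj', P, pi)    # state-to-state transitions under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)                 # V^pi
    Q = r + gamma * P @ V                                               # Q^pi[s, a]
    adv = Q - V[:, None]
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)  # discounted visitation
    # Policy-gradient theorem under softmax parameterization (exact gradients).
    return (1.0 / (1 - gamma)) * d[:, None] * pi * adv

theta = np.zeros((S, A))
eta = 0.4                                    # constant stepsize, as in the abstract
for _ in range(2000):
    theta += eta * exact_gradient(theta)     # gradient ascent on V^{pi_theta}(rho)
print(softmax_policy(theta))
```

The lower bound above says that even this idealized, exact-gradient loop can need a number of iterations exponential in the effective horizon on adversarially chosen MDPs.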
This paper studies a dynamic pricing problem under model misspecification. To characterize model misspecification, we adopt the ε-contamination model, the most fundamental model in robust statistics and machine learning. In particular, for a selling horizon of length T, the online ε-contamination model assumes that demands are realized according to a typical unknown demand function only for $(1-\varepsilon)T$ periods. For the remaining $\varepsilon T$ periods, an outlier purchase can happen with arbitrary demand functions. The challenges brought by the presence of outlier customers are mainly due to the fact that arrivals of outliers and their exhibited demand behaviors are completely arbitrary, therefore calling for robust estimation and exploration strategies that can handle any outlier arrival and demand patterns. We first consider unconstrained dynamic pricing without any inventory constraint. In this case, we adopt the Follow-the-Regularized-Leader algorithm to hedge against outlier purchase behavior. Then, we introduce inventory constraints. When the inventory is insufficient, we study a robust bisection-search algorithm to identify the clearance price, that is, the price at which the initial inventory is expected to clear at the end of T periods. Finally, we study the general dynamic pricing case, where a retailer has no clue whether the inventory …
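To make the clearance-price idea concrete, here is a minimal sketch of a bisection search, assuming access to an estimate of expected total demand at a posted price; the demand model and robustness handling here are placeholders, not the paper's estimator.

```python
def clearance_price_bisection(expected_total_demand, inventory, lo=0.0, hi=100.0, iters=40):
    """Bisection search for the price at which expected demand over the horizon
    just clears the initial inventory.

    expected_total_demand(p): estimated total units sold over T periods at price p,
                              assumed monotonically decreasing in p.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_total_demand(mid) > inventory:
            lo = mid   # price too low: demand exceeds stock, raise the price
        else:
            hi = mid   # price clears the inventory (or more): try lowering it
    return 0.5 * (lo + hi)

# Toy example: linear expected demand over T = 100 periods, 300 units in stock
T = 100
demand = lambda p: T * max(0.0, 10.0 - 0.5 * p)           # illustrative demand model
print(clearance_price_bisection(demand, inventory=300))   # converges near 14.0
```

The robust version described above must additionally guard each demand estimate against the arbitrary behavior of the ε-fraction of outlier periods.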

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data coming from a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by covering coefficients which lack an explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an information-theoretic lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic …
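For orientation, here is a minimal sketch of the pessimism template that value-iteration-style offline RL methods such as APVI build on: estimate the model from the offline dataset and subtract a penalty that grows where the behavior policy provided little coverage. The count-based bonus and clipping below are illustrative simplifications, not the paper's adaptive, variance-aware construction.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, H, c=1.0):
    """Pessimism template for a tabular finite-horizon MDP estimated from offline data.

    counts[h][s, a, s']: visit counts from the behavior policy's dataset at step h.
    rewards[h][s, a]:    empirical mean rewards at step h.
    Subtracts a count-based penalty so poorly covered (s, a) pairs look pessimistic.
    """
    S, A, _ = counts[0].shape
    V = np.zeros(S)
    policies = []
    for h in reversed(range(H)):
        n_sa = counts[h].sum(axis=2)                        # visits of (s, a) at step h
        P_hat = counts[h] / np.maximum(n_sa[..., None], 1)  # empirical transitions
        bonus = c * H / np.sqrt(np.maximum(n_sa, 1))        # pessimism penalty (illustrative)
        Q = np.clip(rewards[h] + P_hat @ V - bonus, 0.0, H)
        policies.append(Q.argmax(axis=1))                   # greedy w.r.t. pessimistic Q
        V = Q.max(axis=1)
    return list(reversed(policies)), V
```

In the analysis summarized above, the penalty adapts to the per-state-action variance term appearing in the bound, which is what makes the resulting guarantee instance-dependent rather than worst-case.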