The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Abstract discounted infinitehorizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a nearoptimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space$$\gamma $$ $\gamma $ and the effective horizon$${\mathcal {S}}$$ $S$ , both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize$$\frac{1}{1\gamma }$$ $\frac{1}{1\gamma}$ can take$$\eta $$ $\eta $ to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefullyconstructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization inmore »$$\begin{aligned} \frac{1}{\eta } {\mathcal {S}}^{2^{\Omega \big (\frac{1}{1\gamma }\big )}} ~\text {iterations} \end{aligned}$$ $\begin{array}{c}\frac{1}{\eta}{\leftS\right}^{{2}^{\Omega (\frac{1}{1\gamma})}}\phantom{\rule{0ex}{0ex}}\text{iterations}\end{array}$ 
Free, publiclyaccessible full text available November 16, 2023

Free, publiclyaccessible full text available August 30, 2023

Free, publiclyaccessible full text available April 1, 2023

We study a noisy tensor completion problem of broad practical interest, namely, the reconstruction of a lowrank tensor from highly incomplete and randomly corrupted observations of its entries. Whereas a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for largescale applications or come with suboptimal statistical guarantees. Focusing on “incoherent” and wellconditioned tensors of a constant canonical polyadic rank, we propose a twostage nonconvex algorithm—(vanilla) gradient descent following a rough initialization—that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all individual tensor factors within nearly linear time, while at the same time enjoying nearoptimal statistical guarantees (i.e., minimal sample complexity and optimal estimation accuracy). The estimation errors are evenly spread out across all entries, thus achieving optimal [Formula: see text] statistical accuracy. We also discuss how to extend our approach to accommodate asymmetric tensors. The insight conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.Free, publiclyaccessible full text available March 1, 2023

We study the problem of active learning with the added twist that the learner is assisted by a helpful teacher. We consider the following natural interaction protocol: At each round, the learner proposes a query asking for the label of an instance xq, the teacher provides the requested label {xq,yq} along with explanatory information to guide the learning process. In this paper, we view this information in the form of an additional contrastive example ({xc,yc}) where xc is picked from a set constrained by xq (e.g., dissimilar instances with the same label). Our focus is to design a teaching algorithm that can provide an informative sequence of contrastive examples to the learner to speed up the learning process. We show that this leads to a challenging sequence optimization problem where the algorithm's choices at a given round depend on the history of interactions. We investigate an efficient teaching algorithm that adaptively picks these contrastive examples. We derive strong performance guarantees for our algorithm based on two problemdependent parameters and further show that for specific types of active learners (e.g., a generalized binary search learner), the proposed teaching algorithm exhibits strong approximation guarantees. Finally, we illustrate our bounds and demonstrate themore »

Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms in contemporary reinforcement learning. This class of methods is often applied in conjunction with entropy regularization—an algorithmic scheme that encourages exploration—and is closely related to soft policy iteration and trust region policy optimization. Despite the empirical success, the theoretical underpinnings for NPG methods remain limited even for the tabular setting. This paper develops nonasymptotic convergence guarantees for entropyregularized NPG methods under softmax parameterization, focusing on discounted Markov decision processes (MDPs). Assuming access to exact policy evaluation, we demonstrate that the algorithm converges linearly—even quadratically, once it enters a local region around the optimal policy—when computing optimal value functions of the regularized MDP. Moreover, the algorithm is provably stable visàvis inexactness of policy evaluation. Our convergence results accommodate a wide range of learning rates and shed light upon the role of entropy regularization in enabling fast convergence.

Offline or batch reinforcement learning seeks to learn a nearoptimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of modelbased algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their modelfree counterparts — which do not require explicit model estimation — have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Qlearning in the context of finitehorizon Markov decision processes, and characterize its sample complexity under the singlepolicy concentrability assumption which does not require the full coverage of the stateaction space. In addition, a variancereduced pessimistic Qlearning algorithm is proposed to achieve nearoptimal sample complexity. Altogether, this work highlights the efficiency of modelfree algorithms in offline RL when used in conjunction with pessimism and variance reduction.