Real-world robot task planning is intractable in part due to partial observability. A common approach to reducing complexity is to introduce additional structure into the decision process, such as mixed observability, factored states, or temporally-extended actions. We propose the locally observable Markov decision process (LOMDP), a novel formulation that models task-level planning in which uncertainty pertains to object-level attributes and the robot has subroutines for seeking and accurately observing objects. This models sensors that are range-limited and line-of-sight: objects that are occluded or outside sensor range are unobserved, but the attributes of objects within sensor view can be resolved via repeated observation. Our model yields a three-stage planning process: first, the robot plans using only observed objects; if that fails, it generates a target object that, if observed, could make a feasible plan possible; finally, it attempts to locate and observe the target, replanning after each newly observed object. By combining LOMDPs with off-the-shelf Markov planners, we outperform state-of-the-art solvers for both object-oriented POMDP and MDP analogues with the same task specification. We then apply the formulation to successfully solve a task on a mobile robot.
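The three-stage loop lends itself to a compact sketch. The following toy Python program is a minimal illustration under invented assumptions (a 1-D world, a single "fetch" task, a sweeping seek routine); none of the names correspond to the paper's actual implementation:

```python
"""Toy sketch of the three-stage planning loop described above. The world model
and subroutine names are illustrative stand-ins, not the paper's API."""

from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    name: str
    kind: str     # object-level attribute, e.g. "cup"
    pos: float    # 1-D position, for simplicity

SENSOR_RANGE = 3.0

def visible(robot_pos, obj):
    # Range-limited sensing: attributes resolve only within sensor range.
    return abs(robot_pos - obj.pos) <= SENSOR_RANGE

def plan_with_observed(goal_kind, observed):
    # Stage 1: plan over observed objects only (here, "fetch" any match).
    matches = [o for o in observed if o.kind == goal_kind]
    return ("fetch", matches[0].name) if matches else None

def propose_target(goal_kind, observed):
    # Stage 2: name a hypothetical object that, if observed, makes the task feasible.
    return goal_kind

def seek(robot_pos):
    # Stage 3 (toy version): sweep forward looking for the target.
    return robot_pos + 2.0

objects = [Obj("o1", "ball", 1.0), Obj("o2", "cup", 7.5)]
robot_pos = 0.0
observed = {o for o in objects if visible(robot_pos, o)}
plan = None
for _ in range(20):                     # replan after each new observation
    plan = plan_with_observed("cup", observed)
    if plan is not None:
        break
    _target = propose_target("cup", observed)
    robot_pos = seek(robot_pos)
    observed |= {o for o in objects if visible(robot_pos, o)}
print(plan)                             # ('fetch', 'o2')
```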

An agent learning an option in hierarchical reinforcement learning must solve three problems: identify the option's subgoal (termination condition), learn a policy, and learn where that policy will succeed (initiation set). The termination condition is typically identified first, but the option policy and initiation set must be learned simultaneously, which is challenging because the initiation set depends on the option policy, which changes as the agent learns. Consequently, data obtained from option execution becomes invalid over time, leading to an inaccurate initiation set that subsequently harms downstream task performance. We highlight three issues specific to learning initiation sets (data nonstationarity, temporal credit assignment, and pessimism) and propose to address them using tools from off-policy value estimation and classification. We show that our method learns higher-quality initiation sets faster than existing methods (in MINIGRID and MONTEZUMA'S REVENGE), can automatically discover promising grasps for robot manipulation (in ROBOSUITE), and improves the performance of a state-of-the-art option discovery method in a challenging maze-navigation task in MuJoCo.
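As a rough illustration of one of these tools, the sketch below implements a recency-weighted initiation classifier: old option-execution outcomes are exponentially down-weighted (addressing nonstationarity) and an optimistic bias counters early pessimism. This is an assumption-laden stand-in, not the paper's estimator:

```python
"""Sketch: recency-weighted logistic classifier for an option's initiation set.
Decay and optimism are one simple response to the nonstationarity and
pessimism issues named above; the paper's method is more sophisticated."""

import numpy as np

class InitiationClassifier:
    def __init__(self, dim, decay=0.95, lr=0.5, optimistic_bias=1.0):
        self.w = np.zeros(dim)
        self.b = optimistic_bias      # optimistic prior counters early pessimism
        self.decay = decay            # per-round down-weighting of stale samples
        self.lr = lr
        self.data = []                # (state, success, age_weight)

    def add(self, state, success):
        # Age existing samples: outcomes under an old option policy matter less.
        self.data = [(s, y, w * self.decay) for s, y, w in self.data]
        self.data.append((np.asarray(state, float), float(success), 1.0))

    def fit(self, epochs=50):
        for _ in range(epochs):
            for s, y, w in self.data:
                p = 1.0 / (1.0 + np.exp(-(self.w @ s + self.b)))
                g = w * (y - p)       # weighted logistic-regression gradient
                self.w += self.lr * g * s
                self.b += self.lr * g

    def can_initiate(self, state, threshold=0.5):
        p = 1.0 / (1.0 + np.exp(-(self.w @ np.asarray(state, float) + self.b)))
        return p >= threshold

clf = InitiationClassifier(dim=2)
for s, y in [([0.1, 0.9], 1), ([0.9, 0.1], 0), ([0.2, 0.8], 1)]:
    clf.add(s, y)
clf.fit()
print(clf.can_initiate([0.15, 0.85]))
```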

It is imperative that robots can understand natural language commands issued by humans. Such commands typically contain verbs that signify what action should be performed on a given object and that are applicable to many objects. We propose a method for generalizing manipulation skills to novel objects using verbs. Our method learns a probabilistic classifier that determines whether a given object trajectory can be described by a specific verb. We show that this classifier accurately generalizes to novel object categories with an average accuracy of 76.69% across 13 object categories and 14 verbs. We then perform policy search over the object kinematics to find an object trajectory that maximizes classifier prediction for a given verb. Our method allows a robot to generate a trajectory for a novel object based on a verb, which can then be used as input to a motion planner. We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.
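The policy-search step can be pictured as black-box optimization of the classifier score over trajectory parameters. The sketch below uses the cross-entropy method with a hand-coded stand-in for the learned verb classifier; the two-parameter trajectory encoding and the "lift" scoring rule are invented for illustration:

```python
"""Sketch: searching trajectory parameters to maximize a verb classifier's
score, via the cross-entropy method (CEM). verb_score is a stand-in for the
learned P(verb | object trajectory)."""

import numpy as np

rng = np.random.default_rng(0)

def verb_score(traj_params):
    # Invented scorer: "lift" prefers a final height rise near 0.3 m
    # with little lateral drift.
    final_dz, lateral = traj_params
    return np.exp(-10 * (final_dz - 0.3) ** 2 - 5 * lateral ** 2)

def cem(score, dim=2, iters=30, pop=64, n_elite=8):
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([score(s) for s in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]   # top-scoring trajectories
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

best = cem(verb_score)
print("trajectory parameters for 'lift':", best)   # approx. [0.3, 0.0]
```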

We propose a novel parameterized-skill learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. We leverage off-policy meta-RL combined with a trajectory-centric smoothness term to learn a set of parameterized skills. Our agent can use these learned skills to construct a three-level hierarchical framework that models a Temporally-extended Parameterized Action Markov Decision Process. We empirically demonstrate that the proposed algorithms enable an agent to solve a set of difficult long-horizon (obstacle-course and robot manipulation) tasks.
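The resulting action space pairs a discrete skill choice with continuous arguments. The toy sketch below shows that interface in a 1-D environment; the environment, skill policy, and names are invented, and the paper's three-level hierarchy and meta-RL training sit on top of this interface:

```python
"""Sketch of a parameterized action: (discrete skill, continuous parameters),
executed as one temporally-extended step from the decision process's view."""

from dataclasses import dataclass
import numpy as np

@dataclass
class ParameterizedAction:
    skill_id: int          # which learned skill to invoke
    params: np.ndarray     # continuous arguments, e.g. a target position

class LineWorld:
    # Toy 1-D environment: reward for reaching x = 5.
    def __init__(self):
        self.x = 0.0
    def step(self, dx):
        self.x += dx
        done = abs(self.x - 5.0) < 0.1
        return self.x, (1.0 if done else -0.01), done

def skill_policy(skill_id, x, params):
    # Skill 0: move toward the target position given as params[0].
    return np.clip(params[0] - x, -1.0, 1.0)

def execute(env, action, max_steps=50):
    # One parameterized action runs the skill's closed-loop policy to completion.
    total = 0.0
    for _ in range(max_steps):
        x, r, done = env.step(skill_policy(action.skill_id, env.x, action.params))
        total += r
        if done:
            break
    return total

print(execute(LineWorld(), ParameterizedAction(0, np.array([5.0]))))
```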

We introduce RLang, a domain-specific language (DSL) for communicating domain knowledge to an RL agent. Unlike existing RL DSLs that ground to single elements of a decision-making formalism (e.g., the reward function or policy), RLang can specify information about every element of a Markov decision process. We define precise syntax and grounding semantics for RLang, and provide a parser that grounds RLang programs to an algorithm-agnostic partial world model and policy that can be exploited by an RL agent. We provide a series of example RLang programs demonstrating how different RL methods can exploit the resulting knowledge, encompassing model-free and model-based tabular algorithms, policy-gradient and value-based methods, hierarchical approaches, and deep methods.
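To picture the grounding target without reproducing RLang syntax, the plain-Python sketch below shows the kind of algorithm-agnostic partial model and partial policy a parsed program might yield: each returns `None` wherever the human's knowledge is silent, so the agent falls back on learning. All names are hypothetical:

```python
"""Illustration (plain Python, not RLang syntax) of a grounded program:
a partial world model and a partial policy that any RL algorithm can consult
where human knowledge applies and ignore elsewhere."""

WALLS_TO_THE_RIGHT = {(0, 3), (1, 3)}   # hypothetical grid-world knowledge
TRAP_ROOM = {(2, 2)}

def partial_transition(state, action):
    # Known dynamics fragment: stepping into a wall leaves the state unchanged.
    if action == "right" and state in WALLS_TO_THE_RIGHT:
        return state
    return None            # unknown: fall back to learned estimates

def partial_policy(state):
    # Known advice fragment: always exit the trap room heading north.
    if state in TRAP_ROOM:
        return "north"
    return None            # no advice: defer to the agent's own policy

# A model-based learner might seed its model with partial_transition;
# a model-free learner might use partial_policy for exploration or shaping.
print(partial_policy((2, 2)), partial_transition((0, 3), "right"))
```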

We propose a new method for count-based exploration in high-dimensional state spaces. Unlike previous work, which relies on density models, we show that counts can be derived by averaging samples from the Rademacher distribution (or coin flips). This insight is used to set up a simple supervised learning objective which, when optimized, yields a state's visitation count. We show that our method is significantly more effective at deducing ground-truth visitation counts than previous work; when used as an exploration bonus for a model-free reinforcement learning algorithm, it outperforms existing approaches on most of nine challenging exploration tasks, including the Atari game MONTEZUMA'S REVENGE.
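The core identity is easy to verify numerically: for a state visited n times, the average of n Rademacher draws has expected square 1/n, so a regressor trained toward those draws recovers the count. A minimal numpy check, with the dimensions chosen arbitrarily:

```python
"""Numerical check of coin-flip counting: E[(mean of n Rademacher draws)^2] = 1/n,
so a least-squares fit to the coin flips (whose minimizer is that mean) yields
a count estimate of 1/avg^2. Extra flip dimensions reduce estimator variance."""

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 256                                # visitation count; flip dimensions

flips = rng.choice([-1.0, 1.0], size=(n, d))   # one d-dim Rademacher draw per visit
avg = flips.mean(axis=0)                       # per-dimension regression target
count_estimate = 1.0 / np.mean(avg ** 2)       # invert E[avg^2] = 1/n

print(f"true count: {n}, estimate: {count_estimate:.1f}")
# In the deep setting, a network regresses states onto their coin flips, so its
# prediction approximates this average and the bonus scales like 1/sqrt(n).
```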

In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters that specify the dynamics and reward function of each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement-learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories, model transfer and policy transfer, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model- and policy-transfer algorithms with respect to hidden-parameter estimation error. We first show that the value function of a HiP-MDP is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by varying the hyperparameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our theoretical results.
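The Lipschitz property in question can be paraphrased as follows (the constant L, the metric d over hidden parameters, and the exact conditions are spelled out in the paper):

```latex
% Lipschitz continuity of the HiP-MDP optimal value function in the hidden
% parameter theta: under the paper's conditions, for all states s,
\[
\left| V^{*}_{\theta}(s) - V^{*}_{\theta'}(s) \right| \;\le\; L \, d(\theta, \theta'),
\]
% so an estimation error of epsilon in theta perturbs the value attainable by
% model or policy transfer by at most L * epsilon.
```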

Principled decision-making in continuous state-action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning. We provide a theoretical analysis of our new smoothness definition and discuss its implications for control and exploration in continuous domains.
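For reference, the standard assumption that the paper shows often fails is Lipschitz continuity of Q over the joint state-action metric; the coarse-grained definition relaxes it (its exact form is given in the paper):

```latex
% Standard Lipschitz continuity of the Q-function, the assumption at issue:
\[
\left| Q(s, a) - Q(s', a') \right| \;\le\; L \, d\big((s, a), (s', a')\big)
\quad \text{for all } (s, a), (s', a').
\]
% The coarse-grained definition requires smoothness only at a chosen
% resolution rather than at every scale, which is why it holds more widely.
```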

We present Q-functionals, an alternative architecture for continuous-control deep reinforcement learning. Instead of returning a single value for a state-action pair, our network transforms a state into a function that can be rapidly evaluated in parallel for many actions, allowing us to efficiently choose high-value actions through sampling. This contrasts with the typical architecture of off-policy continuous control, where a policy network is trained for the sole purpose of selecting actions from the Q-function. We represent our action-dependent Q-function as a weighted sum of basis functions (Fourier, polynomial, etc.) over the action space, where the weights are state-dependent and output by the Q-functional network. Fast sampling makes practical a variety of techniques that require Monte Carlo integration over Q-functions, and enables action-selection strategies beyond simple value maximization. We characterize our framework, describe various implementations of Q-functionals, and demonstrate strong performance on a suite of continuous control tasks.
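The architecture reduces to a basis expansion over actions with state-dependent weights. The numpy sketch below is a minimal, assumption-heavy illustration: a linear map stands in for the deep network, and the Fourier construction is one standard choice of basis:

```python
"""Sketch of a Q-functional head: a state maps to coefficients over a Fourier
basis of the action space, so Q(s, a) = w(s) . phi(a) evaluates for many
sampled actions in one matrix product."""

import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, order = 4, 2, 2

# Fourier basis over actions: cos(pi * c . a) for integer coefficient vectors c.
coeffs = np.array(np.meshgrid(*[range(order + 1)] * action_dim))
coeffs = coeffs.reshape(action_dim, -1).T            # (num_basis, action_dim)

def phi(actions):                                    # actions in [0, 1]^action_dim
    return np.cos(np.pi * actions @ coeffs.T)        # (m, num_basis)

W = rng.normal(size=(coeffs.shape[0], state_dim))    # stand-in for the network

def q_values(state, actions):
    weights = W @ state                              # state-dependent weights w(s)
    return phi(actions) @ weights                    # Q(s, a_i) for all actions at once

state = rng.normal(size=state_dim)
actions = rng.uniform(size=(1024, action_dim))       # sample many candidate actions
q = q_values(state, actions)
print("greedy action:", actions[np.argmax(q)])       # value maximization by sampling
```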

We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior, combined with a sample-based Bayesian exploration procedure, increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite-MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods, and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve better forward- and backward-transfer performance than state-of-the-art lifelong RL methods.
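The exploration scheme can be illustrated with a conjugate toy model. The sketch below is a loose stand-in, with a normal-normal hierarchy replacing the paper's variational, model-based posterior:

```python
"""Toy sketch of sample-based Bayesian exploration across a task family: a
conjugate normal-normal model distills shared structure (a posterior over the
family's mean), and each new task begins with a Thompson-style draw from it."""

import numpy as np

rng = np.random.default_rng(0)
task_params = rng.normal(2.0, 0.5, size=10)   # hidden structure shared across tasks

mu, tau = 0.0, 10.0      # posterior mean and variance over the family's mean
sigma = 0.5              # (known) within-family spread

for x in task_params:    # lifelong stream of tasks
    draw = rng.normal(mu, np.sqrt(tau))       # sample a model for the new task;
                                              # the agent would plan under this draw
    # Conjugate update of the shared posterior after observing the task.
    tau_new = 1.0 / (1.0 / tau + 1.0 / sigma**2)
    mu = tau_new * (mu / tau + x / sigma**2)
    tau = tau_new

print(f"learned prior for future tasks: N({mu:.2f}, {tau:.3f})")
```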