Utilizing task-invariant prior knowledge extracted from related tasks, meta-learning is a principled framework that empowers learning a new task especially when data records are limited. A fundamental challenge in meta-learning is how to quickly "adapt" the extracted prior in order to train a task-specific model within a few optimization steps. Existing approaches deal with this challenge using a preconditioner that enhances convergence of the per-task training process. Though effective in representing locally a quadratic training loss, these simple linear preconditioners can hardly capture complex loss geometries. The present contribution addresses this limitation by learning a nonlinear mirror map, which induces a versatile distance metric to enable capturing and optimizing a wide range of loss geometries, hence facilitating the per-task training. Numerical tests on few-shot learning datasets demonstrate the superior expressiveness and convergence of the advocated approach.
more »
« less
Generating reward structures on a parameterized distribution of dynamics tasks
In order to make lifelike, versatile learning adaptive in the artificial domain, one needs a very diverse set of behaviors to learn. We propose a parameterized distribution of classic control-style tasks with minimal information shared between tasks. We discuss what makes a task trivial and offer a basic metric, time in convergence, that measures triviality. We then investigate analytic and empirical approaches to generating reward structures for tasks based on their dynamics in order to minimize triviality. Contrary to our expectations, populations evolved on reward structures that incentivized the most stable locations in state space spend the least time in convergence as we have defined it, because of the outsized importance our metric assigns to behavior fine-tuning in these contexts. This work paves the way towards an understanding of which task distributions enable the development of learning.
more »
« less
- Award ID(s):
- 1845322
- PAR ID:
- 10286536
- Editor(s):
- Cejkova, Jitka; Holler, Silvia; Soros, Lisa; Witkowski, Olaf
- Date Published:
- Journal Name:
- Artificial Life Conference Proceedings
- Volume:
- Alife 2021
- Issue:
- 2021
- Page Range / eLocation ID:
- 118-127
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.more » « less
-
Applying reinforcement learning (RL) to sparse reward domains is notoriously challenging due to insufficient guiding signals. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task-phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations we define a simple initial task. Our task phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retuning the RL agent in each phasing iteration. Two approaches for phasing are considered: (1) gradually increasing the proportion of time steps an RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on 3 sparse reward domains demonstrate that our task-phasing approaches outperform state-of-the-art approaches with respect to asymptotic performance.more » « less
-
We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions’ truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions’ truth value. Finally, the algorithm uses a Q-learning procedure for the states of the hypothesis reward machine to determine an optimal policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.more » « less
-
Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine some interaction types. Some methods do so by assuming that the robot has prior information about the features of the task and the reward structure. By contrast, in this article, we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human’s input to nearby alternatives, i.e., trajectories close to the human’s feedback. We first derive a loss function that trains an ensemble of reward models to match the human’s demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: We enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study, we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at https://youtu.be/FSUJsTYvEKUmore » « less
An official website of the United States government

