 Award ID(s):
 1908918
 Publication Date:
 NSFPAR ID:
 10170563
 Journal Name:
 Proceedings of the American Control Conference
 Page Range or eLocationID:
 17791784
 ISSN:
 07431619
 Sponsoring Org:
 National Science Foundation
More Like this

A new optimal control based representation for stationary action trajectories is constructed by exploiting connections between semiconvexity, semiconcavity, and stationarity. This new representation is used to verify a known twopoint boundary value problem characterization of stationary action.

We consider the problem of offline reinforcement learning (RL)  a wellmotivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose OffPolicy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an ϵoptimal policy with O˜(H2/dmϵ2) episodes of offline data in the finitehorizon stationary transition setting, where H is the horizon length and dm is the minimal marginal stateaction distribution induced by the behavior policy. This improves over the best known upper bound by a factor of H. Moreover, we establish an informationtheoretic lower bound of Ω(H2/dmϵ2) which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rateoptimal sample complexity under alternative settings such as the finitehorizon MDPs with nonstationary transitions and the infinite horizon MDPs with discounted rewards.

We consider the imitation learning problem of learning a policy in a Markov Decision Process (MDP) setting where the reward function is not given, but demonstrations from experts are available. Although the goal of imitation learning is to learn a policy that produces behaviors nearly as good as the experts’ for a desired task, assumptions of consistent optimality for demonstrated behaviors are often violated in practice. Finding a policy that is distributionally robust against noisy demonstrations based on an adversarial construction potentially solves this problem by avoiding optimistic generalizations of the demonstrated data. This paper studies Distributionally Robust Imitation Learning (DRoIL) and establishes a close connection between DRoIL and Maximum Entropy Inverse Reinforcement Learning. We show that DRoIL can be seen as a framework that maximizes a generalized concept of entropy. We develop a novel approach to transform the objective function into a convex optimization problem over a polynomial number of variables for a class of loss functions that are additive over state and action spaces. Our approach lets us optimize both stationary and nonstationary policies and, unlike prevalent previous methods, it does not require repeatedly solving an inner reinforcement learning problem. We experimentally show the significant benefits of DRoIL’smore »

The paper introduces a new algorithm for planning in partially observable Markov decision processes (POMDP) based on the idea of aggregate simulation. The algorithm uses product distributions to approximate the belief state and shows how to build a representation graph of an approximate actionvalue function over belief space. The graph captures the result of simulating the model in aggregate under independence assumptions, giving a symbolic representation of the value function. The algorithm supports large observation spaces using sampling networks, a representation of the process of sampling values of observations, which is integrated into the graph representation. Following previous work in MDPs this approach enables action selection in POMDPs through gradient optimization over the graph representation. This approach complements recent algorithms for POMDPs which are based on particle representations of belief states and an explicit search for action selection. Our approach enables scaling to large factored action spaces in addition to large state spaces and observation spaces. An experimental evaluation demonstrates that the algorithm provides excellent performance relative to state of the art in large POMDP problems.

Interaction conditions can change the balance of cooperation and conflict in multicellular groups. After aggregating together, cells of the social amoeba
Dictyostelium discoideum may migrate as a group (known as a slug) to a new location. We consider this migration stage as an arena for social competition and conflict because the cells in the slug may not be from a genetically homogeneous population. In this study, we examined the interplay of two seemingly diametric actions, the solitary action of kin recognition and the collective action of slug migration inD. discoideum , to more fully understand the effects of social competition on fitness over the entire lifecycle. We compare slugs composed of either genetically homogenous or heterogeneous cells that have migrated or remained stationary in the social stage of the social amoebaDictyostelium discoideum . After migration of chimeric slugs, we found that facultative cheating is reduced, where facultative cheating is defined as greater contribution to spore relative to stalk than found for that clone in the clonal state. In addition our results support previous findings that competitive interactions in chimeras diminish slug migration distance. Furthermore, fruiting bodies have shorter stalks after migration, even accounting for cell numbers at that time. Taken together, these results showmore »