This paper investigates online stochastic planning for problems with large factored state and action spaces. One promising approach in recent work estimates the quality of applicable actions in the current state through aggregate simulation from the states they reach. This leads to significant speedup compared to search over concrete states and actions, and suffices to guide decision making in cases where the performance of a random policy is informative of the quality of a state. The paper makes two significant improvements to this approach. The first, taking inspiration from lifted belief propagation, exploits the structure of the problem to derive a more compact computation graph for aggregate simulation. The second replaces the random policy embedded in the computation graph with symbolic variables that are optimized simultaneously with the search for high-quality actions. This expands the scope of the approach to problems that require deep search and where information is lost quickly with random steps. An empirical evaluation shows that these ideas significantly improve performance, yielding state-of-the-art results on hard planning problems.
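To illustrate the flavor of aggregate simulation with a symbolic policy, here is a rough sketch under assumptions of my own (a single binary state variable, made-up transition probabilities, and a numerical rather than symbolic gradient): state marginals are propagated under an independence approximation, and the per-step action marginals are the decision variables optimized by gradient ascent.

```python
# Illustrative sketch only: made-up dynamics, not the paper's algorithm.
import numpy as np

HORIZON = 5

def rollout_value(q):
    """Expected cumulative reward under the independence approximation.
    p is the marginal probability that the single state bit is true;
    q[t] is the marginal probability of taking the action at step t."""
    p, value = 0.0, 0.0
    for t in range(HORIZON):
        # taking the action turns the bit on w.p. 0.9; otherwise it
        # persists w.p. 0.8 (made-up numbers)
        p = q[t] * 0.9 + (1.0 - q[t]) * p * 0.8
        value += p                     # reward 1 whenever the bit is true
    return value

def numeric_grad(q, eps=1e-5):
    # central finite differences stand in for the symbolic gradient
    g = np.zeros_like(q)
    for i in range(len(q)):
        d = np.zeros_like(q)
        d[i] = eps
        g[i] = (rollout_value(q + d) - rollout_value(q - d)) / (2 * eps)
    return g

q = np.full(HORIZON, 0.5)              # start from a uniform random policy
for _ in range(100):                   # projected gradient ascent on [0, 1]
    q = np.clip(q + 0.1 * numeric_grad(q), 0.0, 1.0)

print("action marginals:", np.round(q, 2), "value:", round(rollout_value(q), 3))
```

In the actual approach the rollout is built as a symbolic computation graph, so the gradient with respect to the action variables is obtained from the graph rather than by finite differences.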
DiSProD: Differentiable Symbolic Propagation of Distributions for Planning
The paper introduces DiSProD, an online planner developed for environments with probabilistic transitions in continuous state and action spaces. DiSProD builds a symbolic graph that captures the distribution of future trajectories, conditioned on a given policy, using independence assumptions and approximate propagation of distributions. The symbolic graph provides a differentiable representation of the policy's value, enabling efficient gradient-based optimization for long-horizon search. The propagation of approximate distributions can be seen as an aggregation of many trajectories, making it well-suited for dealing with sparse rewards and stochastic environments. An extensive experimental evaluation compares DiSProD to state-of-the-art planners in discrete-time planning and real-time control of robotic systems. The proposed method improves over existing planners in handling stochastic environments, sensitivity to search depth, sparsity of rewards, and large action spaces. Additional real-world experiments demonstrate that DiSProD can control ground vehicles and surface vessels to successfully navigate around obstacles.
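As a concrete, much-simplified picture of what DiSProD-style propagation looks like (assumptions mine: a one-dimensional linear-Gaussian system with made-up constants, and hand-derived gradients instead of automatic differentiation), the sketch below pushes the mean and variance of the state forward through the horizon and runs gradient ascent on the action sequence against the resulting differentiable objective.

```python
# Minimal sketch, not the DiSProD implementation; constants are illustrative.
import numpy as np

goal, noise_var, horizon = 5.0, 0.1, 10    # made-up problem constants

def objective(actions, mu0=0.0, var0=0.0):
    """Approximate value -E[(x_T - goal)^2] for x_{t+1} = x_t + a_t + noise,
    obtained by propagating the mean and variance of the state distribution."""
    mu, var = mu0, var0
    for a in actions:
        mu, var = mu + a, var + noise_var  # moment propagation
    return -((mu - goal) ** 2 + var)

def grad(actions, mu0=0.0):
    # analytic gradient: d/da_i of -(mu_T - goal)^2 is -2 * (mu_T - goal)
    mu_T = mu0 + actions.sum()
    return np.full_like(actions, -2.0 * (mu_T - goal))

actions = np.zeros(horizon)
for _ in range(200):                       # plain gradient ascent over actions
    actions += 0.05 * grad(actions)

print("optimized actions:", np.round(actions, 3))
print("approximate value:", objective(actions))
```

The point of the symbolic graph in the paper is that this kind of objective, and its gradient, remains available for long horizons and richer dynamics without sampling individual trajectories.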
- Award ID(s):
- 2246261
- PAR ID:
- 10499803
- Publisher / Repository:
- International Joint Conferences on Artificial Intelligence
- Date Published:
- ISBN:
- 978-1-956792-03-4
- Page Range / eLocation ID:
- 5324 to 5332
- Format(s):
- Medium: X
- Location:
- Macau, SAR China
- Sponsoring Org:
- National Science Foundation
More Like this
-
The paper introduces a new algorithm for planning in partially observable Markov decision processes (POMDP) based on the idea of aggregate simulation. The algorithm uses product distributions to approximate the belief state and shows how to build a representation graph of an approximate action-value function over belief space. The graph captures the result of simulating the model in aggregate under independence assumptions, giving a symbolic representation of the value function. The algorithm supports large observation spaces using sampling networks, a representation of the process of sampling values of observations, which is integrated into the graph representation. Following previous work in MDPs, this approach enables action selection in POMDPs through gradient optimization over the graph representation. This approach complements recent algorithms for POMDPs which are based on particle representations of belief states and an explicit search for action selection. Our approach enables scaling to large factored action spaces in addition to large state spaces and observation spaces. An experimental evaluation demonstrates that the algorithm provides excellent performance relative to the state of the art in large POMDP problems.
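A toy sketch of the product-distribution idea (my own simplification, not the paper's algorithm or its sampling networks): the belief over two binary state bits is stored as independent marginals, and after an observation the exact posterior is projected back to product form by keeping only its marginals.

```python
# Illustrative only: two binary state bits and a made-up sensor model.
import itertools

belief = {"x1": 0.7, "x2": 0.4}            # P(bit is true), assumed independent

def obs_likelihood(obs, x1, x2):
    # made-up sensor: obs is a noisy reading of (x1 AND x2)
    return 0.9 if obs == (x1 and x2) else 0.1

def update(belief, obs):
    post = {"x1": 0.0, "x2": 0.0}
    norm = 0.0
    for x1, x2 in itertools.product([0, 1], repeat=2):
        prior = (belief["x1"] if x1 else 1 - belief["x1"]) * \
                (belief["x2"] if x2 else 1 - belief["x2"])
        w = prior * obs_likelihood(obs, x1, x2)
        norm += w
        post["x1"] += w * x1               # accumulate posterior marginals
        post["x2"] += w * x2
    return {k: v / norm for k, v in post.items()}

print(update(belief, obs=1))               # product-form belief after obs = 1
```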
-
It is well known that the problems of stochastic planning and probabilistic inference are closely related. This paper makes two contributions in this context. The first is to provide an analysis of the recently developed SOGBOFA heuristic planning algorithm that was shown to be effective for problems with large factored state and action spaces. It is shown that SOGBOFA can be seen as a specialized inference algorithm that computes its solutions through a combination of a symbolic variant of belief propagation and gradient ascent. The second contribution is a new solver for Marginal MAP (MMAP) inference. We introduce a new reduction from MMAP to maximum expected utility problems, which are suitable for the symbolic computation in SOGBOFA. This yields a novel algebraic gradient-based solver (AGS) for MMAP. An experimental evaluation illustrates the potential of AGS in solving difficult MMAP problems.
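To convey the gist of a gradient-based MMAP solver (a toy of my own construction, not the AGS algorithm itself), the snippet below relaxes a binary MAP variable to a probability q, writes the marginal over the remaining variable as a differentiable expression in q, and follows the gradient until q reaches a vertex of [0, 1].

```python
# Toy example only; the joint distribution is made up.
import numpy as np

# joint P(m, x) over two binary variables (rows index m, columns index x)
P = np.array([[0.10, 0.25],
              [0.40, 0.25]])

def objective(q):
    # E_{m ~ Bernoulli(q)} [ sum_x P(m, x) ], linear in the relaxed variable q
    return (1 - q) * P[0].sum() + q * P[1].sum()

q, lr = 0.5, 0.5
for _ in range(50):
    grad = P[1].sum() - P[0].sum()         # d objective / d q
    q = float(np.clip(q + lr * grad, 0.0, 1.0))

m = int(round(q))
print("relaxed q:", q, "-> MAP assignment m =", m, "with marginal", objective(m))
```

Because the relaxed objective is multilinear in the MAP variables, its maximum is attained at a vertex, which is what makes rounding the converged q back to a discrete assignment sensible.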
-
The metric dimension of a graph is the smallest number of nodes required to identify all other nodes uniquely based on shortest path distances. Applications of metric dimension include discovering the source of a spread in a network, canonically labeling graphs, and embedding symbolic data in low-dimensional Euclidean spaces. This survey gives a self-contained introduction to metric dimension and an overview of the quintessential results and applications. We discuss methods for approximating the metric dimension of general graphs, and specific bounds and asymptotic behavior for deterministic and random families of graphs. We conclude with related concepts and directions for future work.
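A tiny brute-force illustration (not from the survey) of the definition: enumerate landmark sets in increasing size and keep the first whose shortest-path distance vectors distinguish every node. The example graph is a 4-cycle, whose metric dimension is 2.

```python
# Brute-force metric dimension of a small example graph (a 4-cycle).
from itertools import combinations
from collections import deque

graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

def bfs_dist(src):
    """Shortest-path distances from src to every node."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

dists = {v: bfs_dist(v) for v in graph}

def resolves(landmarks):
    # every node must get a distinct vector of distances to the landmarks
    signatures = {tuple(dists[l][v] for l in landmarks) for v in graph}
    return len(signatures) == len(graph)

for k in range(1, len(graph) + 1):
    found = [s for s in combinations(graph, k) if resolves(s)]
    if found:
        print("metric dimension:", k, "resolving set:", found[0])
        break
```

Brute force is exponential in the number of nodes; the survey's interest is precisely in approximation methods and in bounds for structured and random graph families.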
-
Learning optimal policies in real-world domains with delayed rewards is a major challenge in Reinforcement Learning. We address the credit assignment problem by proposing a Gaussian Process (GP)-based immediate reward approximation algorithm and evaluate its effectiveness in 4 contexts where rewards can be delayed for long trajectories. In one GridWorld game and 8 Atari games, where immediate rewards are available, our results showed that on 7 out of 9 games, the proposed GP-inferred reward policy performed at least as well as the immediate reward policy and significantly outperformed the corresponding delayed reward policy. In e-learning and healthcare applications, we combined GP-inferred immediate rewards with offline Deep Q-Network (DQN) policy induction and showed that the GP-inferred reward policies outperformed the policies induced using delayed rewards in both real-world contexts.
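The sketch below is a much-simplified stand-in for the idea (assumptions mine: synthetic one-dimensional states, a trajectory's return split evenly across its steps as the regression target, and scikit-learn's GaussianProcessRegressor rather than the authors' GP model). It only shows the mechanic of fitting a GP that can then be queried as a surrogate immediate-reward function.

```python
# Illustrative sketch only; data, features, and targets are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(20):                               # 20 synthetic trajectories
    center = rng.uniform(0.1, 0.9)
    states = center + rng.normal(0.0, 0.05, size=5)
    delayed_return = states.sum()                 # hidden truth: r(s) = s
    for s in states:
        X.append([s])                             # each visited state is paired
        y.append(delayed_return / len(states))    # with its per-step return share

gp = GaussianProcessRegressor(alpha=1e-2, normalize_y=True)
gp.fit(np.array(X), np.array(y))

# query the fitted GP as a surrogate immediate-reward function
print("inferred rewards:", gp.predict(np.array([[0.2], [0.5], [0.8]])).round(2))
```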