As more and more people are expected to work with complex AI-systems, it becomes more important than ever that such systems provide intuitive explanations for their decisions. A prerequisite for holding such explanatory dialogue is the ability of the systems to present their proposed decisions to the user in an easy-to-understand form. Unfortunately, such dialogues could become hard to facilitate in real-world problems where the system may be planning for multiple eventualities in stochastic environments. This means for the system to be effective, it needs to be able to present the policy at a high-level of abstraction and delve into details as required. Towards this end, we investigate the utility of temporal abstractions derived through analytically computed landmarks and their relative ordering to build a summarization of policies for Stochastic Shortest Path Problems. We formalize the concept of policy landmarks and show how it can be used to provide a high level overview of a given policy. Additionally, we establish the connections between the type of hierarchy we generate and previous works in temporal abstractions, specifically MaxQ hierarchies. Our approach is evaluated through user studies as well as empirical metrics that establish that people tend to choose landmarks facts as subgoals to summarize policies and demonstrates the performance of our approach on standard benchmarks.
more »
« less
TLdR: Policy Summarization for Factored SSP Problems Using Temporal Abstractions
As more and more people are expected to work with complex AI-systems, it becomes more important than ever that such systems provide intuitive explanations for their decisions. A prerequisite for holding such explanatory dialogue is the ability of the systems to present their proposed decisions to the user in an easy-to-understand form. Unfortunately, such dialogues could become hard to facilitate in real-world problems where the system may be planning for multiple eventualities in stochastic environments. This means for the system to be effective, it needs to be able to present the policy at a high-level of abstraction and delve into details as required. Towards this end, we investigate the utility of temporal abstractions derived through analytically computed landmarks and their relative ordering to build a summarization of policies for Stochastic Shortest Path Problems. We formalize the concept of policy landmarks and show how it can be used to provide a high level overview of a given policy. Additionally, we establish the connections between the type of hierarchy we generate and previous works in temporal abstractions, specifically MaxQ hierarchies. Our approach is evaluated through user studies as well as empirical metrics that establish that people tend to choose landmarks facts as subgoals to summarize policies and demonstrates the performance of our approach on standard benchmarks.
more »
« less
- Award ID(s):
- 1936997
- PAR ID:
- 10285705
- Editor(s):
- J. Christopher Beck, Olivier Buffet
- Date Published:
- Journal Name:
- Proceedings of the International Conference on Automated Planning and Scheduling
- ISSN:
- 2334-0843
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We study adaptive video streaming for multiple users in wireless access edge networks with unreliable channels. The key challenge is to jointly optimize the video bitrate adaptation and resource allocation such that the users' cumulative quality of experience is maximized. This problem is a finite-horizon restless multi-armed multi-action bandit problem and is provably hard to solve. To overcome this challenge, we propose a computationally appealing index policy entitled Quality Index Policy, which is well-defined without the Whittle indexability condition and is provably asymptotically optimal without the global attractor condition. These two conditions are widely needed in the design of most existing index policies, which are difficult to establish in general. Since the wireless access edge network environment is highly dynamic with system parameters unknown and time-varying, we further develop an index-aware reinforcement learning (RL) algorithm dubbed QA-UCB. We show that QA-UCB achieves a sub-linear regret with a low-complexity since it fully exploits the structure of the Quality Index Policy for making decisions. Extensive simulations using real-world traces demonstrate significant gains of proposed policies over conventional approaches. We note that the proposed framework for designing index policy and index-aware RL algorithm is of independent interest and could be useful for other large-scale multi-user problems.more » « less
-
Reverse logistics has been gaining recognition in practice (and theory) for helping companies better match supply with demand, and thus reduce costs in their supply chains. In this paper, we study reverse logistics from the perspective of a supply chain in which each location can initiate multiple flows of product. Our first objective is to jointly optimize ordering decisions pertaining to regular, reverse and expedited flows of product in a stochastic, multi-stage inventory model of a logistics supply chain, in which the physical transformation of the product is completed at the most upstream location in the system. Due to those multiple flows of product, the feasible region for the problem acquires multi-dimensional boundaries that lead to the curse of dimensionality. To address this challenge, we develop a different solution method that allows us to reduce the dimensionality of the feasible region and, subsequently, identify the structure of the optimal policy. We refer to this policy as a nested echelon base-stock policy, as decisions for different product flows are sequentially nested within each other. We show that this policy renders the model analytically and numerically tractable. Our results provide actionable policies for firms to jointly manage three different product flows in their supply chains, and allow us arrive at insights regarding the main drivers of the value of reverse logistics. One of our key findings is that, when it comes to the value generated by reverse logistics, demand variability (i.e., demand uncertainty across periods) matters more than demand volatility (i.e., demand uncertainty within each period). This is because, in the absence of demand variability, it is effectively never optimal to return product upstream, regardless of the level of inherent demand volatility. Our second objective is to extend our analysis to product transforming-supply chains, in which product transformation is allowed to occur at each location. In such a system, it becomes necessary to keep track of both the location and stage of completion of each unit of inventory, so that the number of state and decisions variables increases with the square of the number of locations in the system. To analyze such a supply chain, we first identify a policy that provides a lower bound on the total cost. Then, we establish a special decomposition of the objective cost function that allows us to propose a novel heuristic policy. We find that the performance gap of our heuristic policy relative to the lower-bounding policy averages less than 5% across a range of parameters and supply chain lengths.more » « less
-
Poupart, Pascal (Ed.)Incentive design, also known as model design or environment design for Markov decision processes(MDPs), refers to a class of problems in which a leader can incentivize his follower by modifying the follower's reward function, in anticipation that the follower's optimal policy in the resulting MDP can be desirable for the leader's objective. In this work, we propose gradient-ascent algorithms to compute the leader's optimal incentive design, despite the lack of knowledge about the follower's reward function. First, we formulate the incentive design problem as a bi-level optimization problem and demonstrate that, by the softmax temporal consistency between the follower's policy and value function, the bi-level optimization problem can be reduced to single-level optimization, for which a gradient-based algorithm can be developed to optimize the leader's objective. We establish several key properties of incentive design in MDPs and prove the convergence of the proposed gradient-based method. Next, we show that the gradient terms can be estimated from observations of the follower's best response policy, enabling the use of a stochastic gradient-ascent algorithm to compute a locally optimal incentive design without knowing or learning the follower's reward function. Finally, we analyze the conditions under which an incentive design remains optimal for two different rewards which are policy invariant. The effectiveness of the proposed algorithm is demonstrated using a small probabilistic transition system and a stochastic gridworld.more » « less
-
As technology is advancing, accessibility is also taken care of seriously. Many users with visual disabilities take advantage of, for example, Microsoft's Seeing AI application (app) that is equipped with artificial intelligence. The app helps people with visual disabilities to recognize objects, people, texts, and many more via a smartphone's built-in camera. As users may use the app in recognizing personally identifiable information, user privacy should carefully be treated and considered as a top priority. Yet, little is known about the user privacy issues among users with visual disabilities, such that this study aims to address the knowledge gap by conducting a questionnaire with the Seeing AI users with visual disabilities. This study found that those with visual disabilities had a lack of knowledge about user privacy policies. It is recommended to offer an adequate educational training; thus, those with visual disabilities can be well informed of user privacy policies, ultimately leading to promoting safe online behavior to protect themselves from digital privacy and security problems.more » « less
An official website of the United States government

