<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ADESSE: Advice Explanations in Complex Repeated Decision-Making Environments</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI)</publisher>
				<date>08/15/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10595829</idno>
					<idno type="doi"></idno>
					
					<author>Sören Schleibaum</author><author>Lu Feng</author><author>Sarit Kraus</author><author>Jörg P Müller</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In the evolving landscape of human-centered AI, fostering a synergistic relationship between humans and AI agents in decision-making processes stands as a paramount challenge. This work considers a problem setup where an intelligent agent comprising a neural network-based prediction component and a deep reinforcement learning component provides advice to a human decision-maker in complex repeated decision-making environments. Whether the human decision-maker would follow the agent's advice depends on their beliefs and trust in the agent and on their understanding of the advice itself. To this end, we developed an approach named ADESSE to generate explanations about the adviser agent to improve human trust and decision-making. Computational experiments on a range of environments with varying model sizes demonstrate the applicability and scalability of ADESSE. Furthermore, an interactive game-based user study shows that participants were significantly more satisfied, achieved a higher reward in the game, and took less time to select an action when presented with explanations generated by ADESSE. These findings illuminate the critical role of tailored, human-centered explanations in AI-assisted decision-making.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Making complex decisions repeatedly in a dynamic environment is very challenging for humans. An intelligent agent can support human decision-making by providing advice. We consider an adviser agent consisting of two components as shown in Figure <ref type="figure">1</ref>. At each step, the agent first makes some predictions about the future, and then computes advice based on the prediction and the current state using deep reinforcement learning (DRL). Such adviser agents can be, and already are, used in many real-world applications: for example, providing advice to police officers scheduled through place-based predictive policing <ref type="bibr">[Meijer and Wessels, 2019]</ref>, providing advice to taxi drivers based on the prediction of future pick-up requests from passengers <ref type="bibr">[Farazi et al., 2021]</ref>, or providing advice to firefighters based on the prediction of wildfire risk <ref type="bibr">[Julian and Kochenderfer, 2019]</ref>.</p><p>Studies have found that the degree to which humans follow an intelligent agent's advice depends on their beliefs about the agent's performance on a given task <ref type="bibr">[Vodrahalli et al., 2022]</ref>, and that providing explanations improves humans' acceptance and trust in the agent's advice <ref type="bibr">[Zhang et al., 2020b;</ref><ref type="bibr">Shin, 2021]</ref>. Hence, this work aims at generating explanations about the adviser agent to improve humans' trust and decision-making.</p><p>Existing methods for explaining AI-based systems mostly treat the entire system as a black-box model; the generated explanations could be in different output formats (e.g., numerical, textual, visual), but each method usually focuses on only one type of explanation <ref type="bibr">[Adadi and Berrada, 2018;</ref><ref type="bibr">Guidotti et al., 2018;</ref><ref type="bibr">Speith, 2022]</ref>. 
For example, there are several methods (e.g., <ref type="bibr">[Ribeiro et al., 2016;</ref><ref type="bibr">Lundberg and Lee, 2017]</ref>) that explain the feature importance of predictions, and there is a growing body of research on explainable reinforcement learning <ref type="bibr">[Vouros, 2022;</ref><ref type="bibr">Wells and Bednarz, 2021;</ref><ref type="bibr">Heuillet et al., 2021;</ref><ref type="bibr">Puiutta and Veith, 2020]</ref>. Nevertheless, to the best of our knowledge, none of the prior works generates explanations for both prediction and DRL.</p><p>In this work, we present a novel approach named ADESSE (ADvice ExplanationS in complex repeated deciSion-making Environments). ADESSE peeks inside the black-box model of an adviser agent, leveraging the agent's two-component structure to generate explanations with both textual and visual information. Specifically, an explanation generated by ADESSE includes three key elements: (1) a short list of top-ranked input features that contribute the most to the agent's prediction; (2) a heatmap visualizing domain-specific indices summarizing the DRL input features; and (3) arrows in various shades of gray overlaying the heatmap to illustrate a trained DRL policy with state importance.</p><p>A key innovation of ADESSE is to generate informative explanations that capture multiple aspects of the adviser agent, from the prediction input to the DRL input to the trained DRL policy. Furthermore, ADESSE reduces the explanation size by selecting top-ranked input features of the prediction and using domain-specific indices to succinctly explain DRL input features.</p><p>We adopt LIME <ref type="bibr">[Ribeiro et al., 2016]</ref>, a popular method for explaining black-box models, as a baseline for comparison. LIME generates explanations represented as (multiple) saliency maps visualizing each input feature's influence on the agent's advice, which can be overwhelming when there is a large number of input features. 
We hypothesize that explanations generated by ADESSE can be more effective in assisting human decision-making than the baseline.</p><p>Computational experiments demonstrate that ADESSE can be successfully applied to a range of environments and scales over varying model sizes. In all cases, ADESSE generates smaller explanations using less time, compared with LIME.</p><p>Additionally, we conduct an interactive game-based user study to evaluate the effectiveness of generated explanations. Study results show that participants were significantly more satisfied, achieved a higher reward in the game, and took less time to select an action when presented with explanations generated by ADESSE rather than the baseline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Position within the XAI Literature</head><p>The research field of explainable artificial intelligence (XAI) has been growing rapidly in recent years, attracting increasing attention <ref type="bibr">[Adadi and Berrada, 2018;</ref><ref type="bibr">Guidotti et al., 2018;</ref><ref type="bibr">Speith, 2022;</ref><ref type="bibr">Saeed and Omlin, 2023;</ref><ref type="bibr">Anjomshoae et al., 2019]</ref>. Here, we position this work based on a taxonomy of XAI methods described in <ref type="bibr">[Speith, 2022]</ref>.</p><p>First, depending on the stage when explanations are generated, there are ante-hoc and post-hoc methods. This work belongs to the latter since ADESSE generates explanations after the agent has been trained.</p><p>Second, there are model-specific and model-agnostic methods. ADESSE is agnostic to the underlying machine learning techniques for prediction and advice computation.</p><p>Third, the scope of explanations can be global or local. An explanation generated by ADESSE consists of three key elements, in which the first element (i.e., a list of top-ranked features for the prediction at a grid cell) is local and the other two elements (i.e., domain-specific indices and arrows for visualizing a DRL policy) are global.</p><p>Moreover, XAI methods generate explanations in diverse output formats, including numerical, textual, visual, rules, models, etc. ADESSE generates explanations displayed visually as a heatmap together with textual information about a short list of top-ranked features.</p><p>Last but not least, the lack of user studies is a major limitation across many existing XAI works, as pointed out in several survey papers <ref type="bibr">[Wells and Bednarz, 2021;</ref><ref type="bibr">Kraus et al., 2020;</ref><ref type="bibr">Chakraborti et al., 2020]</ref>. 
This work overcomes this limitation by adopting an interactive game-based user study for evaluation.</p><p>At first glance, the motivating examples described in the next section seem similar to the task of goal recognition. However, in contrast to goal recognition (see <ref type="bibr">[Shvo and McIlraith, 2020]</ref>), we have time-dependent targets and do not learn a probability distribution over goals. Consequently, we cannot base our work on those explaining goal recognition, e.g. <ref type="bibr">[Alshehri et al., 2023]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Feature Importance</head><p>Many XAI methods explain black-box models via computing feature importance (e.g., how much a feature contributes to a prediction). Local Interpretable Model-agnostic Explanations (LIME) <ref type="bibr">[Ribeiro et al., 2016]</ref> and SHapley Additive exPlanations (SHAP) <ref type="bibr">[Lundberg and Lee, 2017]</ref> are two of the most popular methods in this category. LIME focuses on training local surrogate models to explain individual predictions. This method works by first generating a new dataset comprising perturbed samples and the corresponding predictions of the black box model, and then using this new dataset to train an interpretable surrogate model that is weighted by the proximity of the sampled instances to the instance of interest. The learned surrogate model can provide a good approximation of local predictions, but does not necessarily guarantee global accuracy.</p><p>On the other hand, SHAP computes Shapley values of features (i.e., the average marginal contribution of a feature value across all possible coalitions) by considering all possible predictions for an instance using all possible combinations of inputs. Because of this exhaustive analysis, SHAP can take much longer computation time than LIME. The authors of <ref type="bibr">[Lundberg and Lee, 2017]</ref> show that SHAP can guarantee properties such as accuracy and consistency, while LIME is a subset of SHAP but lacks these properties.</p></div>
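<div xmlns="http://www.tei-c.org/ns/1.0"><p>The LIME procedure described above (perturb, query the black box, weight by proximity, fit an interpretable surrogate) can be illustrated with a minimal sketch. It is not the LIME library and not the implementation used in this work: the Gaussian perturbation, the exponential kernel, and the names local_surrogate and black_box are assumptions chosen for illustration.</p><p>
```python
import numpy as np

def local_surrogate(predict, x, num_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around the instance x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Step 1: perturb the instance of interest (Gaussian noise is an assumption).
    samples = x + scale * rng.standard_normal((num_samples, x.size))
    # Step 2: query the black-box model on every perturbed sample.
    targets = np.array([predict(row) for row in samples])
    # Step 3: weight each sample by its proximity to x (exponential kernel).
    distances = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Step 4: weighted least squares; scaling rows by sqrt(weight) reduces
    # the problem to ordinary least squares on [intercept, features].
    design = np.hstack([np.ones((num_samples, 1)), samples])
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], targets * root_w, rcond=None)
    return coef[1:]  # local influence per feature (intercept dropped)

def black_box(v):
    # Hypothetical model: depends on the first two features, ignores the third.
    return 3.0 * v[0] - 2.0 * v[1] + 0.0 * v[2]

influence = local_surrogate(black_box, [1.0, 2.0, 3.0])
```
</p><p>Because the hypothetical black box is itself linear, the surrogate recovers its coefficients; for a nonlinear model the coefficients would only approximate the behavior near x, which is the locality caveat noted above.</p></div>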
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Explainable Reinforcement Learning</head><p>Explainable reinforcement learning (XRL) has emerged as a sub-field of XAI with a growing body of research <ref type="bibr">[Vouros, 2022;</ref><ref type="bibr">Wells and Bednarz, 2021;</ref><ref type="bibr">Heuillet et al., 2021;</ref><ref type="bibr">Puiutta and Veith, 2020]</ref>. Existing XRL methods can be distinguished by the scope of explanations. Some methods provide explanations about policy-level behaviors, while others explain specific, local decisions (e.g., "Why does the agent select this but not that action in a state?"). Although this work seeks to explain the agent's advice for the current state, we do not restrict to local explanations. The proposed ADESSE approach provides a policy-level explanation that shows what the agent's advice would be in different states with varying features, which can help the human decision-maker better understand the agent's behavior, rather than providing a local explanation about the advised action only. Thus, ADESSE intrinsically aims at increasing the humans' trust in the adviser agent (cf. <ref type="bibr">[Shin, 2021]</ref>).</p><p>Various types of policy-level explanations have been developed in prior works. For example, a video highlighting the agent's trajectories with important states is proposed in <ref type="bibr">[Amir and Amir, 2018]</ref>; such trajectory summaries are augmented with saliency maps in <ref type="bibr">[Huber et al., 2021]</ref>. Abstracted policy graphs (i.e., Markov chains of abstract states) are introduced in [Topin and Veloso, 2019] for summarizing RL policies. A chart illustrating the agent coordination and task ordering is used for policy summarization of multi-agent RL in <ref type="bibr">[Boggess et al., 2022]</ref>. 
Additionally, policy-level contrastive explanations (e.g., "Why does the agent follow this but not that policy?") have been considered in <ref type="bibr">[Sreedharan et al., 2022;</ref><ref type="bibr">Finkelstein et al., 2022;</ref><ref type="bibr">Boggess et al., 2023]</ref>.</p><p>To the best of our knowledge, however, none of the existing XRL methods uses a heatmap of domain-specific indices to summarize DRL input features as in ADESSE. Furthermore, we overlay the heatmap with arrows visualizing (advised) optimal actions based on a trained RL policy and annotate these arrows with different shades of gray to indicate the importance degrees of states. We follow the notion of state importance originally proposed in <ref type="bibr">[Torrey and Taylor, 2013]</ref>, which was adopted in [Amir and Amir, 2018] for summarizing the RL agent's behavior in a selected set of important states. By contrast, our explanation shows the agent's action in every state but highlights important states with darker arrows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Explainable Recommendations</head><p>There is a related line of work on explainable recommendations <ref type="bibr">[Zhang et al., 2020a;</ref><ref type="bibr">Vultureanu-Albişi and Bădică, 2022;</ref><ref type="bibr">Naiseh et al., 2020]</ref>, which refers to recommendation algorithms that not only provide recommendation results, but also explanations to clarify why such items are recommended. For example, image and text-based explanations are generated in <ref type="bibr">[Yan et al., 2023]</ref> by first selecting a personalized image set that is the most relevant to a user's interest toward a recommended item and then producing natural language explanations. User needs for explanations of recommendations are investigated in <ref type="bibr">[Tran et al., 2023]</ref>, where studies find that users in high-involvement domains (e.g., selecting a car to buy) focus more on explanations compared to lower-involvement domains (e.g., selecting a movie to watch).</p><p>This work seeks to explain the agent's advice, which can be considered as a type of recommendation; and ADESSE also provides explanations with both visual and textual information. However, our problem setup is different from those recommendation algorithms, which usually do not consider repeated decision-making in complex environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Setup</head><p>We consider a problem setup where an intelligent agent comprising a neural network-based prediction component and a deep reinforcement learning (DRL) component provides advice to a human decision-maker in complex repeated decision-making environments.</p><p>As illustrated in Figure <ref type="figure">1</ref>, at each step, the agent makes some prediction &#375; about the future based on the current state s and historical data h, and generates advice a based on a trained DRL policy with the input s and &#375;, and reward r; the human decision-maker takes an action a′, where a′ = a if the human follows the agent's advice. Sometimes, however, an alternative action (a′ ≠ a) may be chosen if the human does not trust the agent or does not understand why the agent proposes certain advice.</p><p>This work aims to tackle this problem by generating explanations about the adviser agent to improve the human's trust and decision-making. We make two important assumptions as follows.</p><p>&#8226; A1: The agent is rational (i.e., seeking to maximize the expected discounted return) and not adversarial to the human decision-maker (i.e., no deception). &#8226; A2: The environment is based on a grid representation with discrete states and actions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Motivating Examples</head><p>The aforementioned problem setup is commonly shared by many complex repeated decision-making environments. Here we describe two motivating examples used in this work.</p><p>Taxi environment. Consider a taxi moving around in a grid world. In our example scenario, we assume the grid size to be 20×20. The taxi can stay put or move horizontally, vertically, or diagonally by up to two grid cells at each step; we assume that one step corresponds to ten minutes of real time. The taxi receives a reward of 10 for dropping off a passenger and a penalty of 1 per step for driving without any passenger. In each episode, the taxi starts at a random grid cell and time, and the episode terminates at the end of a nine-hour shift (i.e., 54 steps). At each step of an episode, the adviser agent predicts the number of pick-up requests in each grid cell for the next step, based on a rich set of features, including the number of pick-up requests in the last 40 minutes, points of interest in each cell, as well as location-independent features such as date, time, holiday, and weather. Then, the agent advises an action for the taxi based on a DRL policy trained using the number of predicted pick-up requests and available taxis in each grid cell and the received reward.</p><p>The taxi driver decides whether to follow the agent's advice or take an alternative action, which would impact the environment's feedback of state and reward. The above process repeats until the end of an episode.</p><p>Wildfire environment. Consider an aerial vehicle (AV) flying over a forest (modeled as a grid world) aiming to extinguish a wildfire. At each step (corresponding to 2.5 minutes), the AV can choose one of three types of actions: (1) extinguish the fire in the current grid cell, (2) stay put, or (3) relocate by one cell in any of the four cardinal directions. 
When the AV chooses the extinguish action in a grid cell with a high neighborhood fire ratio, the AV receives a large positive reward that is calculated based on the neighborhood fire ratio. The AV receives a penalty of 2.5 for taking the extinguish action in a grid cell with a low neighborhood fire ratio, and a cost of 1 per step for moving around. In each episode, the AV starts at a random grid cell and terminates after 100 steps.</p><p>At each step, the adviser agent predicts the fire risk (i.e., the probability of fire occurrences) in each grid cell, based on features including each grid cell's forest fuel level and burning status. Then, the agent advises an action for the AV based on a DRL policy trained using the fire risk prediction, the current state and the received reward.</p><p>The AV operator decides whether to follow the agent's advice, which would also affect the state of the environment. The above process repeats until an episode terminates.</p></div>
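<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the taxi environment more concrete, the following minimal sketch encodes its step length, episode length, and one possible action set. The reading of "up to two grid cells" as one or two cells along each of the eight compass directions, the clamping at the grid boundary, and the names ACTIONS, apply_action, and step_reward are assumptions made for illustration only.</p><p>
```python
from itertools import product

STEP_MINUTES = 10   # one step corresponds to ten minutes (from the text)
SHIFT_HOURS = 9     # nine-hour shift (from the text)
GRID_SIZE = 20      # 20 x 20 grid world (from the text)

steps_per_episode = SHIFT_HOURS * 60 // STEP_MINUTES

# Assumed reading: a move is a displacement of one or two cells along one of
# the eight compass directions; (0, 0) is the "stay put" action.
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=2) if d != (0, 0)]
ACTIONS = [(0, 0)] + [(k * dx, k * dy) for dx, dy in DIRECTIONS for k in (1, 2)]

def apply_action(cell, action, size=GRID_SIZE):
    """Move the taxi; clamping to the grid boundary is an assumption."""
    x, y = cell
    dx, dy = action
    return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))

def step_reward(dropped_off_passenger, has_passenger):
    """Reward of 10 for a drop-off; penalty of 1 for a step without a passenger."""
    if dropped_off_passenger:
        return 10
    return 0 if has_passenger else -1
```
</p><p>With these constants, a nine-hour shift of ten-minute steps yields the 54 steps per episode stated above.</p></div>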
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Baseline Explanations</head><p>The baseline explainer treats the adviser agent's two components as a single black-box model and explains the influence of the input features on the output advice. We apply LIME <ref type="bibr">[Ribeiro et al., 2016]</ref> to check what happens to the agent's advice when the input features are perturbed and compute an influence value for each feature. We select LIME rather than SHAP to generate baseline explanations (cf. Section 2.2), because SHAP is too slow for computational experiments: SHAP times out (i.e., takes more than two minutes) for most models used in our experiments, while LIME and the proposed ADESSE approach can generate explanations within a few seconds. The generated baseline explanations are represented as saliency maps showing how much each feature contributes to the agent's advice.</p><p>For an example saliency map illustrating the influence of each grid cell's current pick-up request counts on the agent's advice, see the Appendix of <ref type="bibr">[Schleibaum et al., 2024]</ref>. The baseline explanation generated at each step may include multiple saliency maps corresponding to different input features. For example, there are five saliency maps generated for the taxi environment per step. We hypothesize that such a baseline explanation is overwhelming and cannot effectively assist humans with decision-making.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Approach</head><p>To address the limitations of baseline explanations, we propose an approach named ADESSE that leverages the problem structure and generates explanations consisting of three key elements as shown in Figure <ref type="figure">2</ref>. First, a list of top-ranked features is selected based on their contributions to the prediction (cf. Section 4.1). Second, a domain-specific index function is used to summarize the DRL input features (cf. Section 4.2). Third, the trained DRL policy is visualized as arrows in a grid world with importance degrees (cf. Section 4.3). And finally, we describe how ADESSE generates an explanation integrating these elements (cf. Section 4.4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Top-Ranked Features for the Prediction</head><p>We rank input features of the prediction component based on Shapley values computed via SHAP <ref type="bibr">[Lundberg and Lee, 2017]</ref>, which tell us the contribution of each feature to the prediction. We favor SHAP over LIME here, because identifying the top-ranked features for only a few selected predictions shrinks the search space enough to keep the computation fast, and because we want the properties guaranteed by SHAP.</p><p>To reduce the explanation size, we focus on selecting a short list of top-ranked features for one individual prediction output at a time. For example, for the taxi environment, the human decision-maker may be interested in knowing the top six features contributing to the pick-up request prediction at the taxi's current location or the advised next location. Such explanations could improve the human's trust in the agent's prediction component. During a game-based user study (cf. Section 6), the human decision-maker can choose from a list of locations (e.g., grid cells labeled with A-F in Figure <ref type="figure">3</ref>) for displaying the top-ranked features that contribute the most to the prediction in each location.</p></div>
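<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ranking step can be illustrated by computing exact Shapley values for a small feature set and keeping the k features with the largest absolute contribution. This is a sketch under stated assumptions rather than the SHAP library or the implementation used in this work: the payoff function, the feature names, and the helpers shapley_values and top_ranked are hypothetical.</p><p>
```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley values; value(frozenset) returns the payoff of a coalition."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of f when it joins the coalition.
                total += weight * (value(coalition | {f}) - value(coalition))
        result[f] = total
    return result

def top_ranked(values, k):
    """Return the k features with the largest absolute Shapley value."""
    return sorted(values, key=lambda f: abs(values[f]), reverse=True)[:k]

# Hypothetical additive payoff standing in for a pick-up request predictor.
CONTRIBUTION = {"recent_requests": 4.0, "weather": -1.5, "holiday": 0.5, "time": 2.0}

def payoff(coalition):
    return sum(CONTRIBUTION[f] for f in coalition)

phi = shapley_values(payoff, list(CONTRIBUTION))
top = top_ranked(phi, k=2)
```
</p><p>The enumeration visits every coalition, so its cost grows exponentially with the number of features; restricting the explanation to a few selected predictions is what keeps such a ranking affordable.</p></div>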
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Domain-Specific Indices</head><p>To explain the input of the agent's DRL component, we summarize the DRL input features using a domain-specific index function, rather than showing multiple saliency maps (i.e., one for each DRL input feature) as in baseline explanations.</p><p>Taxi environment. Recall from Section 3.1 that the DRL input features for the taxi environment include the number of predicted pick-up requests and available taxis in each grid cell. Inspired by the demand-supply ratio, a metric commonly used in the market for taxi services <ref type="bibr">[Kamga et al., 2015]</ref>, we define an index function for the taxi environment as follows.</p><p>where g ∈ G is a grid cell in the taxi grid world, |G| is the total number of grid cells, and ρ(g) and τ(g) are the number of predicted requests and available taxis in a grid cell g, respectively. We set η = 0.75 to balance the trade-off between the demand-supply ratio of taxi services and the ratio of predicted requests in a grid cell g compared with the average requests over the entire grid world G. Figure <ref type="figure">3</ref> shows an example heatmap of the obtained taxi indices where the darkest red indicates a taxi index of 0.</p><p>Wildfire environment. For each grid cell g in the forest grid world G, the DRL input features for the wildfire environment include the predicted fire risk μ(g) ∈ [0, 1], the normalized forest fuel level θ(g) ∈ [0, 1], and the burning status of g, which is either true or false. 
We define an index function for the wildfire environment based on domain knowledge [Haksar and Schwager, 2018; Julian and Kochenderfer, 2019] as follows.</p><p>fire</p><p>The intuition is that, when a grid cell has caught fire, a higher forest fuel level and higher predicted fire risk would lead to more severe fire, and hence a more negative value of the wildfire index; conversely, when a cell is not on fire, it is safer (i.e., more positive value of the wildfire index) when there is a lower forest fuel level and lower predicted fire risk. An example heatmap of the obtained wildfire indices is shown in the Appendix of <ref type="bibr">[Schleibaum et al., 2024]</ref>.</p></div>
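<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of collapsing several DRL input features into one heatmap value per cell can be sketched as follows. The function below is only a hypothetical stand-in for the taxi index, not the formula defined in this section: it assumes a weighted sum of a demand-supply ratio and a cell's demand relative to the grid-wide average, and it adds one to the supply to avoid division by zero. The name taxi_index_sketch and both of these choices are assumptions.</p><p>
```python
import numpy as np

def taxi_index_sketch(requests, taxis, eta=0.75):
    """Hypothetical index: eta * demand-supply ratio + (1 - eta) * relative demand."""
    requests = np.asarray(requests, dtype=float)
    taxis = np.asarray(taxis, dtype=float)
    # Assumption: add one to the supply so cells without taxis stay finite.
    demand_supply = requests / (taxis + 1.0)
    # Demand of each cell relative to the average over all |G| cells.
    mean_requests = requests.sum() / requests.size
    if mean_requests > 0:
        relative = requests / mean_requests
    else:
        relative = np.zeros_like(requests)
    return eta * demand_supply + (1.0 - eta) * relative

# Toy 2 x 2 grid with made-up counts, used only to exercise the function.
index = taxi_index_sketch([[0, 2], [4, 2]], [[1, 1], [0, 1]])
```
</p><p>In this sketch a cell without predicted requests receives an index of 0, and the resulting array can be passed directly to a heatmap plot.</p></div>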
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Arrows with Importance Degrees</head><p>To improve the human decision-maker's trust in the adviser agent, we visualize the trained DRL policy for the entire grid world rather than only displaying the agent's advice for the current grid cell. Let g_t ∈ G denote the current grid cell at time t. The DRL state at time t is given by the environment state s_t (which includes g_t as a feature) and the predicted state &#375;_t. For any grid cell g ∈ G, consider the DRL state obtained by replacing g_t with g while preserving all other features of the current DRL state. For example, in the taxi environment, this is a state where g is an assumed location of the taxi, and the remaining DRL input features (i.e., the number of predicted pick-up requests and available taxis in each grid cell) stay the same as in the current DRL state. Given a trained DRL policy π_t at time t, the optimal action a(g) in a grid cell g maximizes, over all possible actions α in the corresponding DRL state, the Q-value that estimates the reward ultimately achievable by taking an action in a state.</p><p>Algorithm 1: Generating an explanation at a time step t</p><p>Input: grid world G, current grid cell g_t, current state s_t, prediction input x_t and output &#375;_t, DRL input (a subset of s_t ∪ &#375;_t), and policy π_t</p><p>Parameter: optional list of parameters</p><p>Output: explanation e_t</p><p>1: for all g in a finite path starting from g_t following π_t do</p><p>2: append the top-ranked features f(g) ⊂ x_t to F</p><p>3: end for</p><p>4: for all g ∈ G do</p><p>5: compute the domain-specific index of g</p><p>6: compute the optimal action a(g)</p><p>7: compute the normalized importance degree of g</p><p>8: end for</p><p>9: return e_t = ⟨F, the indices of all g ∈ G, the optimal actions a(g) and normalized importance degrees of all g ∈ G⟩</p><p>Figure <ref type="figure">3</ref> plots the optimal action in each grid cell as an arrow overlaying the index heatmap obtained from Section 4.2. Moreover, we annotate these arrows with various shades of gray to represent the normalized importance degrees. 
We define the importance degree I(g) of each grid cell g ∈ G following the notion of state importance proposed in [Torrey and Taylor, 2013], under which the importance is the gap between the largest and the smallest Q-value of the actions available in the state associated with g.</p><p>Intuitively, if all actions in a state share the same Q-value, then the state is the least important for advising because it does not matter which action is chosen. We normalize the importance degrees I(g) over the entire grid world G by min-max scaling:</p><p><formula notation="tex">\frac{I(g) - \min_{g \in G} I(g)}{\max_{g \in G} I(g) - \min_{g \in G} I(g)}</formula> such that the normalized importance degree of every grid cell lies in [0, 1].</p></div>
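<div xmlns="http://www.tei-c.org/ns/1.0"><p>The arrows and their shades can be derived from a table of Q-values, as the following minimal sketch shows. It assumes that the Q-values of all actions are already available for every assumed cell, and that the importance degree is the difference between the largest and smallest Q-value in the spirit of [Torrey and Taylor, 2013]; the array layout and the name policy_arrows are hypothetical.</p><p>
```python
import numpy as np

def policy_arrows(q_values):
    """q_values has shape (rows, cols, num_actions): one Q-vector per assumed cell."""
    q_values = np.asarray(q_values, dtype=float)
    # Arrow direction: index of the action with the largest Q-value in each cell.
    best_action = q_values.argmax(axis=-1)
    # Importance: gap between the best and the worst action (assumed definition).
    importance = q_values.max(axis=-1) - q_values.min(axis=-1)
    # Min-max normalization over the whole grid, guarding against a flat grid.
    spread = importance.max() - importance.min()
    if spread == 0:
        normalized = np.zeros_like(importance)
    else:
        normalized = (importance - importance.min()) / spread
    return best_action, normalized

# Toy 1 x 3 grid with two actions per cell (made-up Q-values).
toy_q = [[[1.0, 1.0], [0.0, 2.0], [5.0, 1.0]]]
actions, shades = policy_arrows(toy_q)
```
</p><p>A cell whose actions all share the same Q-value receives the lightest shade, matching the intuition that the choice of action does not matter there.</p></div>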
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Explanation Generation Algorithm</head><p>Algorithm 1 illustrates the procedure by which ADESSE generates an explanation at a time step t by integrating the aforementioned elements. First, a set of locations along a finite path starting from the current grid cell g_t and following the trained DRL policy π_t is identified (e.g., A-F in Figure <ref type="figure">3</ref>), and a list of top-ranked input features for the prediction in each location is selected as described in Section 4.1. Next, for each grid cell g ∈ G in the grid world, a domain-specific index (e.g., the taxi and wildfire indices introduced in Section 4.2) is computed to summarize the DRL input features and plotted in a heatmap. Lastly, the optimal action a(g) and the normalized importance degree for each grid cell g ∈ G are computed following Section 4.3 and plotted as arrows with various shades of gray overlaying the heatmap of indices. The generated explanation is returned as a heatmap as shown in Figure <ref type="figure">3</ref>, together with separate lists of top-ranked features for the prediction.</p></div>
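<div xmlns="http://www.tei-c.org/ns/1.0"><p>The first step of Algorithm 1, identifying the locations along a finite path that follows the policy, can be sketched as below. The path length of six (matching the labels A-F), the early stop when the path revisits a cell, and the names path_locations and toy_policy are assumptions for illustration.</p><p>
```python
from string import ascii_uppercase

def path_locations(start, advised_move, num_locations=6):
    """Follow the advised moves from `start`; stop early if the path stalls or loops."""
    path = [start]
    for _ in range(num_locations - 1):
        dx, dy = advised_move(path[-1])
        next_cell = (path[-1][0] + dx, path[-1][1] + dy)
        if next_cell in path:
            break  # staying put or revisiting a cell ends the path
        path.append(next_cell)
    # Label the visited cells A, B, C, ... in visiting order.
    return dict(zip(ascii_uppercase, path))

def toy_policy(cell):
    # Hypothetical policy: move right until column 3, then move down.
    return (1, 0) if cell[0] != 3 else (0, 1)

labels = path_locations((1, 1), toy_policy)
```
</p><p>Each labeled cell would then be paired with its own list of top-ranked prediction features, while the index and arrow computations run over the whole grid.</p></div>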
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Computational Experiments</head><p>We build a prototype implementation<ref type="foot">foot_1</ref> of ADESSE and compare its performance with the baseline explainer using LIME (cf. Section 3.2) via computational experiments on the taxi and wildfire environments with varying model sizes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Implementation</head><p>Taxi environment. We implemented the prediction component as a feed-forward neural network consisting of five fully connected layers with 20, 128, 64, 32, and 16 neurons; and utilized the dueling double deep Q-learning <ref type="bibr">[Wang et al., 2016]</ref> for the DRL component (three convolutional and three fully connected layers). The New York City Yellow Taxi dataset 3 was used for training and validation (186 million trips taken between January 2015 and June 2016), where the GPS start and end locations of trips were mapped to grid cells in the environment.</p><p>Wildfire environment. For this environment, we implemented the prediction component as a feed-forward neural network with three layers of 6, 512, and 512 neurons; as in the taxi environment, dueling double deep Q-learning <ref type="bibr">[Wang et al., 2016]</ref> was used for the DRL component (three convolutional and three fully connected layers). The environment dynamics (forest fire model) was adapted from <ref type="bibr">[Haksar and Schwager, 2018;</ref><ref type="bibr">Julian and Kochenderfer, 2019]</ref>.</p><p>Setup. All experiments were run on a MacBook laptop with an Apple M1 Pro chip, 32 GB of memory, and Ventura 13.5.2 operating system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results</head><p>Table <ref type="table">1</ref> shows the experimental results. For each model, we report the grid world size |G|, and compare the baseline and ADESSE in terms of the explanation size and the average time of generating an explanation per step over 10 independent runs. We draw the following key insights from the results:</p><p>&#8226; Both the baseline explainer and ADESSE can successfully generate explanations for different environments with varying model sizes.</p><p>&#8226; The size of ADESSE explanation is significantly smaller than that of the baseline explanation across all models, and the size difference increases as the grid world grows larger.</p><p>&#8226; ADESSE is generally faster than the baseline and can generate an explanation within a few seconds for all models used in the experiments.</p><p>3 <ref type="url">https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Game-Based User Study</head><p>We evaluate the effectiveness of explanations generated by ADESSE via an interactive game-based user study<ref type="foot">foot_2</ref> . We describe the study design in Section 6.1, report the results and discuss the insights in Section 6.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Study Design</head><p>Game. We designed an interactive game based on the taxi environment described in Section 3.1. Each study participant was asked to act as a taxi driver who was incentivized to choose the optimal action in the environment at each step, in order to receive a high reward. The participants were presented with baseline explanations and with explanations generated by ADESSE; our goal was to find to what extent and how this influences their decisions in terms of whether or not to follow the advised actions. An example screenshot of the game user interface is shown in the Appendix of <ref type="bibr">[Schleibaum et al., 2024]</ref>.</p><p>Participants. We recruited 28 participants; all of them were over the age of 18, fluent in English (since the game instructions were written in English), and did not have color blindness (which would have affected their ability to recognize the presented explanations). The average age of the participants was 28.96 years with a standard deviation of 8.27 years<ref type="foot">foot_3</ref> . 39% of the participants were female and 61% male. To ensure data quality, each participant responded to three attention-check questions during the study.</p><p>Independent variables. We adopted a within-subject study design where participants were asked to engage in two study trials, each of which involved playing the game for twelve steps with explanations generated by either ADESSE or by the baseline explainer using LIME. To counterbalance the ordering confound effect, one half of the participants were randomly selected to start the study trial with baseline explanations, followed by a trial with explanations generated by ADESSE; the other half of the participants took the two study trials in reversed order.</p><p>Dependent variables. 
We recorded the average time spent to choose an action, the total reward achieved in a study trial, and the percentage of steps in a trial where the agent's advice was followed. At the end of each study trial, we also collected participant ratings on a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree) for the following statements, adapted from the explanation satisfaction scale of <ref type="bibr">[Hoffman et al., 2018]</ref>:</p><p>&#8226; The explanations help me understand how the agent's advice is computed.</p><p>&#8226; The explanations are satisfying.</p><p>&#8226; The explanations are sufficiently detailed.</p><p>&#8226; The explanations are sufficiently complete, that is, they provide me with all the needed information to make decisions.</p><p>&#8226; The explanations are actionable, that is, they help me know how to make decisions.</p><p>&#8226; The explanations let me know how reliable the agent is for decision support.</p><p>&#8226; The explanations let me know how trustworthy the agent is for decision support.</p><p>Procedure. During the study, each participant was first briefed about the study purpose and the game instructions. Then, the participant was asked to play a study trial with one type of explanation (i.e., baseline or ADESSE) and give ratings on the explanation satisfaction scale. Next, the participant was asked to play a second trial with the other explanation type, followed by explanation satisfaction ratings. The study concluded with demographic questions (e.g., age, gender). Additionally, to gain better insight into participant behavior, we asked a randomly selected set of participants to describe their decision-making strategy and to rate how confident they were that they could choose a better action than the agent's advice.</p><p>Hypotheses. 
We investigated the three hypotheses stated below.</p><p>&#8226; H1: Explanations generated by ADESSE lead to higher ratings on the explanation satisfaction scale than the baseline.</p><p>&#8226; H2: Explanations generated by ADESSE enable the participants to take less time to choose actions than the baseline.</p><p>&#8226; H3: Explanations generated by ADESSE enable the participants to achieve a higher total reward than the baseline.</p></div>
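<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counterbalanced trial ordering described under "Independent variables" can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used in the study: the function name counterbalance, the participant identifiers 1 to 28, and the fixed seed are all assumptions introduced here for the example.</p>

```python
import random


def counterbalance(participant_ids, seed=None):
    """Randomly split participants into two equally sized trial-order groups.

    Hypothetical helper: one half plays the baseline trial first,
    the other half plays the ADESSE trial first.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)  # random membership of each half
    half = len(shuffled) // 2
    orders = {}
    for pid in shuffled[:half]:
        orders[pid] = ("baseline", "ADESSE")
    for pid in shuffled[half:]:
        orders[pid] = ("ADESSE", "baseline")
    return orders


# Assumed identifiers 1..28, matching the number of recruited participants.
assignment = counterbalance(range(1, 29), seed=0)
baseline_first = [p for p, order in assignment.items() if order[0] == "baseline"]
print(len(baseline_first), "participants start with the baseline trial")
```

<p>Shuffling before splitting keeps the two order groups equal in size while leaving group membership random.</p></div>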
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Study Results and Discussion</head><p>We used a Wilcoxon signed-rank test to evaluate H1 and a paired t-test to evaluate H2 and H3. For all tests, we set the significance level to 0.05.</p><p>Explanation satisfaction scale ratings. As shown in Figure <ref type="figure">4</ref>, participants' ratings of explanations generated by ADESSE are higher than their ratings of the baseline on all explanation satisfaction scale metrics, with statistically significant differences. Thus, the data supports H1.</p><p>Time for choosing actions. On average, participants took less time to choose actions when presented with explanations generated by ADESSE (M = 38.78, SD = 15.90) than with baseline explanations (M = 52.82, SD = 27.72). The difference is statistically significant (t = 2.9182, p = 0.0070). Thus, the data supports H2.</p><p>Total reward. The participants achieved a higher average reward when presented with explanations generated by ADESSE (M = 98.18, SD = 13.18) than with baseline explanations (M = 90.18, SD = 18.13). However, the paired t-test yields t = 1.8216 and p = 0.0796, so the difference is not statistically significant at the 0.05 level. Thus, the data only partially supports H3.</p><p>Discussion. One reason that participants were more satisfied with explanations generated by ADESSE, as indicated by the higher ratings on the explanation satisfaction scale, could be that these explanations are more succinct and informative than baseline explanations (note that each baseline explanation generated by LIME contains five saliency maps). This may also explain why participants took less time to choose actions with explanations generated by ADESSE, since reading and understanding baseline explanations requires more time.</p></div>
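<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the paired t-test used for H2 and H3 concrete, the sketch below computes the test statistic from per-participant differences using only the Python standard library. The function name paired_t_statistic and the five pairs of decision times are made up for illustration; they are not the study data, which is not reproduced here. Obtaining a p-value additionally requires the t distribution; library routines such as scipy.stats.ttest_rel (paired t-test) and scipy.stats.wilcoxon (Wilcoxon signed-rank test) provide this, although which tools the authors used is not stated.</p>

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples first[i], second[i]."""
    if len(first) != len(second):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    if 2 > n:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical decision times for five participants (NOT the study data).
baseline_times = [50.0, 42.0, 61.0, 39.0, 55.0]
adesse_times = [41.0, 40.0, 48.0, 35.0, 47.0]

t, dof = paired_t_statistic(baseline_times, adesse_times)
print(f"t = {t:.4f} with {dof} degrees of freedom")
```

<p>With the 28 participants of the study, the statistic has 27 degrees of freedom, for which the two-sided critical value at the 0.05 level is approximately 2.05. The reported t = 2.9182 for decision time exceeds this threshold, whereas t = 1.8216 for total reward does not, in line with the reported p-values.</p></div>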
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>We presented ADESSE, a novel approach for generating visual and text-based explanations about an intelligent agent that provides advice to a human decision-maker in complex repeated decision-making environments. The agent consists of two deep learning-based components: one for making predictions about the future, and the other for computing advised actions with deep reinforcement learning based on the predicted future and the current state. ADESSE leverages the agent's two-component structure and generates explanations with visual and textual information to improve the human's trust in the agent and thus better assist human decision-making. Results of computational experiments demonstrate the applicability and scalability of ADESSE, while an interactive game-based user study shows the effectiveness of explanations generated by ADESSE. There are several directions to explore in future work. First, we will extend ADESSE to handle environments with continuous state and action spaces, beyond the grid world environments considered in this work. For example, there has been increasing interest in using deep learning to predict future blood glucose levels of diabetes patients and then compute an advised insulin dosage based on the prediction via deep reinforcement learning <ref type="bibr">[Emerson et al., 2023]</ref>. Moreover, we will explore an extension to the multi-agent setting where advice is computed via multi-agent DRL.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>ADESSE means "to aid" in Latin.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://github.com/sorensc/ADESSE</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>The study was approved by an institutional review board.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>Note that drivers of private transportation services such as Uber represent the demographic group from which we recruited the subjects for the study.</p></note>
		</body>
		</text>
</TEI>
