<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Contrastive Explanations That Anticipate Human Misconceptions Can Improve Human Decision-Making Skills</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>04/25/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10651155</idno>
					<idno type="doi">10.1145/3706598.3713229</idno>
					
					<author>Zana Buçinca</author><author>Siddharth Swaroop</author><author>Amanda E Paluch</author><author>Finale Doshi-Velez</author><author>Krzysztof Z Gajos</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Imagine if AI decision-support tools not only improved the quality of our decisions but also enhanced our decision-making skills in the process. Competence, mastery, and skill growth are fundamental drivers of motivation in the workplace and beyond <ref type="bibr">[27,</ref><ref type="bibr">28]</ref>. Individuals are inherently driven to re!ne their abilities in the tasks with which they engage, whether it's making more informed treatment decisions for patients, sharpening writing skills, or mastering a new programming language. The ongoing process of self-improvement not only leads to better outcomes -decisions, papers, or codeit also provides intrinsic satisfaction by ful!lling people's fundamental need for competence <ref type="bibr">[27]</ref>. As AI systems become more integrated into our decision-making tasks, a critical question arises: How will this assistance a"ect our skill growth and competence in decision-making? Speci!cally, as AI systems increasingly o"er ready-made "solutions", do individuals develop and improve the underlying skills needed to evaluate and generate high-quality decisions independently, or do they risk becoming overly reliant on AI recommendations?</p><p>Fueling broader concerns about deskilling <ref type="bibr">[60,</ref><ref type="bibr">100]</ref>, emerging empirical evidence <ref type="bibr">[7,</ref><ref type="bibr">33,</ref><ref type="bibr">80]</ref> suggests that automation and the design of many current AI decision-support systems might not only fail to nurture our skills but could actively degrade them. For instance, Rinta-Kahila et al. <ref type="bibr">[80]</ref> found that people's accounting skill degradation became apparent once a system for !xed assets management was discontinued after years of use. And Gajos and Mamykina <ref type="bibr">[33]</ref> demonstrated that providing AI-generated recommendations and explanations did not improve people's decision-making abilities, even when those explanations contained facts from which individuals could learn and improve their decision-making abilities.</p><p>Some have argued that when people are provided with AI recommendations they may overrely on the recommendation and only super!cially process the explanation <ref type="bibr">[10,</ref><ref type="bibr">11]</ref>, thus failing to improve their learning and skills on the task. <ref type="bibr">[33]</ref>. Building on this, we posit that even when people attend to the explanation, the reason AI systems fail to improve people's skills, can be partially attributed to the nature of the explanations provided, which often fail to address the speci!c knowledge gaps that users seek to !ll. Typically, these systems o"er, what we call, unilateral explanations -justi!cations that focus on why a particular AI recommendation was made, often by detailing the relevant features <ref type="bibr">[79]</ref>, highlighting regions <ref type="bibr">[87]</ref>, or presenting the general reasoning in support of the decision. 
For example, an AI system might recommend treatment P as the best option for a patient because it addresses symptoms X, Y, and Z, without contrasting it with alternative treatments that may address some of those symptoms as well. Yet, research in social science and cognitive psychology has put forth that people naturally seek explanations that are contrastive rather than unilateral <ref type="bibr">[43,</ref><ref type="bibr">62,</ref><ref type="bibr">70]</ref>. When people ask why a certain event occurred or a certain choice was made ("Why P?"), they are often implicitly asking "Why P (referred to as the fact) rather than Q (referred to as the foil)?", where the foil is a plausible alternative that was considered but not chosen. Jacobs et al. <ref type="bibr">[46]</ref> found that clinicians would also prefer contrastive explanations from AI decision support systems, which, for example, compare and contrast the AI suggestions with existing standards of care. Such explanations are intuitive and engaging because they focus only on the knowledge gap, addressing the specific points of divergence that are of interest or confusion to the explainee.</p><p>Figure <ref type="figure">1</ref>: A simplified illustration of (a) unilateral explanations, which list all the features contributing to the AI's decision, and (b) contrastive explanations, which highlight the differences between the AI's choice and a likely human response for an exercise recommendation task.</p><p>While numerous computational approaches have been introduced for generating contrastive explanations <ref type="bibr">[1,</ref><ref type="bibr">3,</ref><ref type="bibr">48,</ref><ref type="bibr">95]</ref>, their focus typically lies on computing the contrast between the fact and the foil rather than on determining a high-quality foil. Some approaches consider the foil to be the closest class to the fact in the model <ref type="bibr">[95]</ref>, which we argue results in a model-centric foil that does not necessarily align with human reasoning about the task. Others assume the foil is provided or explicitly inputted by the user <ref type="bibr">[48]</ref>. Such an additional step of making initial decisions has been shown to affect both the acceptance of systems and the subjective experience in AI-assisted decision-making <ref type="bibr">[11,</ref><ref type="bibr">32]</ref>. Thus, predicting a human-centered foil without asking for explicit user input remains a challenge for generating useful contrastive explanations.</p><p>Building on the existing insights into AI-assisted decision-making, in this paper, we propose a novel approach to enhance AI-powered decision-support systems by generating human-centric contrastive explanations.
Our method leverages a "mental model" of humans to provide an explanation in the form of "Why P rather than Q?"where P represents the AI's recommendation and Q is a plausible alternative response from a human perspective. Unlike unilateral explanations, which justify a recommendation by listing all dimensions that contributed to a decision, our contrastive explanations focus on the distinctions between the AI's suggestion and a likely human response, while highlighting only the dimensions in which the two choices di"er. We hypothesize that such contrastive explanations, which highlight knowledge gaps between AI and predicted human responses, will foster greater cognitive engagement and enhance task learning compared to unilateral explanations, while maintaining similar decision accuracy. Additionally, we explore how the quality of the foil (predicted vs. random) and timing (before the person makes a decision vs. after an initial decision) of the contrastive explanation a"ect their e"ectiveness, hypothesizing that high-quality foils will maximize learning outcomes and pre-decision timing will maximize acceptance.</p><p>To generate contrastive explanations with which to test the hypotheses, we introduce a human-centric framework, which we instantiated for an exercise recommendation decision-making task. Our framework consists of four modules: <ref type="bibr">(1)</ref> an AI task model that predicts a response to a decision task (fact), (2) a human model that predicts an average human's response for the same task (foil), (3) a contrast module that identi!es the relevant dimensions where the fact and foil di"er, and (4) an LLM-powered presentation module that formats these di"erences into an explanation and adds common sense knowledge (within the constraints provided by the other modules) that bridges the knowledge gap between the AI's recommendation and the human response. To test our hypotheses, we conducted an online between-subjects experiment (N=628) comparing !ve conditions: no AI, unilateral explanations, contrastive explanations with a predicted foil, contrastive explanations with a random foil, and contrastive explanations provided after an initial decision was made (inputted foil). Our results demonstrated that contrastive explanations with a predicted foil enhanced human skill on the task (i.e., human learning) signi!cantly more than unilateral explanations, without sacri!cing decision accuracy. Within contrastive conditions, we found that timing of contrastive explanations a"ected subjective experience but not objective outcomes. Participants in the contrastive explanations with predicted vs. inputted foil did not di"er signi!cantly in terms of decision accuracy or human learning but contrastive explanations with predicted foils resulted in signi!cantly higher subjective perceptions of competence, autonomy, and relatedness to the AI than contrastive explanations with inputted foils. Further, we found that the quality of the foil matters: although we used a single model to predict human responses, participants interacting with contrastive explanations featuring a predicted foil improved their learning more than those with a random foil, though the di"erence was only marginally significant. This result suggests that personalized foil models, !ne-tuned for each individual, may o"er additional bene!ts.</p><p>Our study explored only the short-term e"ects of explanations on learning. 
Further research is needed to understand whether these positive effects persist over time.</p><p>In summary, this paper makes the following main contributions:</p><p>&#8226; We introduced a contrastive explanations framework for generating human-centered contrastive explanations, which compare the AI's decision choice to a predicted human response for the same task. &#8226; Our results demonstrated that such human-centered contrastive explanations significantly enhance decision-making skills without sacrificing decision accuracy compared to unilateral explanations, a default explanation design in AI-powered decision support. &#8226; We further presented evidence about which design aspects of contrastive explanations affect objective outcomes and people's intrinsic motivation to engage with the decision task. &#8226; Our work is the first to demonstrate that the content of explanations significantly impacts the improvement of human skills, opening up new opportunities for developing more effective explanation designs. &#8226; Our research suggests that decision support tools that consider the decision-makers' knowledge and mental model of the task can enhance people's understanding and proficiency in the task more effectively than current designs of decision support, which provide AI-centric unilateral explanations.</p><p>With the growing adoption of AI-powered support across tasks and settings, we believe that our findings may offer a path forward toward AI systems that upskill, augment, and improve human capabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK 2.1 Contrastive Explanations</head><p>The !eld of Explainable AI (XAI) has developed a wide range of methods aimed at making AI systems more understandable and useful to people <ref type="bibr">[38]</ref>. Seminal approaches include featurebased explanations like LIME <ref type="bibr">[79]</ref> and SHAP <ref type="bibr">[65]</ref>, which demonstrate how individual features in#uence an AI decision, as well as saliency maps <ref type="bibr">[87]</ref>, which highlight image regions that contributed to the outcome. These methods, which we refer to as unilateral approaches, focus on explaining why the AI made a speci!c decision but do so in isolation, without explicitly comparing it to other plausible alternatives. Meanwhile, Miller <ref type="bibr">[70]</ref>'s extensive review of social science research has underscored the signi!cance of contrastive explanations, sparking a new line of inquiry in ML and HCI research. Miller's review highlights that, according to social science literature, explanations people seek and provide are predominantly contrastive <ref type="bibr">[59,</ref><ref type="bibr">62,</ref><ref type="bibr">70]</ref>. Rather than simply asking "Why P?" to receive a list of features or a sequence of causal events, people often want to know "Why P instead of Q?" -seeking an explanation that clari!es the di"erence between the actual outcome and an (often implicit) alternative they expected. Lipton <ref type="bibr">[62]</ref> refers to "P", the actual event, as the fact, and the alternative "Q" as the foil. Social science experts emphasize the value of contrastive explanations for two main reasons <ref type="bibr">[71]</ref>. Firstly, they arise from a person's surprise over an unexpected event, revealing their preconceived expectationsessentially o"ering insight into the individual's mental model and highlighting their knowledge gaps <ref type="bibr">[59,</ref><ref type="bibr">69]</ref>. Secondly, providing and asking for contrastive explanations is less complex and cognitively demanding, making the process more e$cient for both the inquirer and the respondent <ref type="bibr">[59,</ref><ref type="bibr">62,</ref><ref type="bibr">103]</ref>. In AI-assisted decision-making, we further hypothesize that because contrastive explanations highlight (1) the knowledge gap of the inquirer and (2) are shorter, and thus easier to parse, they will result in improved knowledge acquisition from the decision-maker compared to explanations that highlight all the decision factors.</p><p>In recent years, machine learning scholars have introduced various computational approaches for generating contrastive explanations, such as pairwise class comparisons <ref type="bibr">[1,</ref><ref type="bibr">3]</ref>, tree-based methods <ref type="bibr">[89,</ref><ref type="bibr">95]</ref>, or identifying pertinent positives and negatives <ref type="bibr">[29]</ref>. Unlike in our work in which the foil seeks to convey explainee's thinking and is generated by a separate model, in the existing techniques, the foil is commonly determined as the closest alternative outcome that would alter the model's decision. For example, in treebased approaches, foils are selected as the closest non-matching class leaf, while in counterfactual reasoning Hendricks et al. <ref type="bibr">[41]</ref>, foils are chosen based on their proximity to the input data but belonging to a di"erent class. 
It is not clear, however, whether these methods correctly anticipate how people and the models disagree. We argue that contrastive explanations in which the foil presents the decision-maker's reasoning more accurately reflect the social science understanding of contrastive reasoning, which seeks to clarify the gaps in the explainee's reasoning.</p><p>One example of contrastive explanations in the HCI literature is Zhang et al.'s <ref type="bibr">[104]</ref> framework for generating contrastive explanations in vocal emotion recognition. Like other machine learning techniques, this framework highlights the differences between two similar instances with different class labels, using high-level, human-interpretable concepts rather than granular features. Other related systems produce counterfactual explanations which compare decision instances or hypothetical input space changes <ref type="bibr">[51,</ref><ref type="bibr">99]</ref>, rather than outcome differences. It is important to note that contrastive explanations are often conflated with counterfactual explanations, which explore how minimal input changes could lead to different outcomes, while contrastive explanations clarify differences between two outcomes (e.g., "Why treatment P rather than Q?"). In multi-class settings, these explanations (contrastive and counterfactual) are distinct, but for binary classification tasks, the distinction blurs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">AI-Assisted Decision-Making</head><p>2.2.1 Why optimizing human decision-making skills in AI-assisted decision-making ma!ers? A growing concern, especially with the recent developments in generative AI <ref type="bibr">[60,</ref><ref type="bibr">100]</ref>, deskilling refers to the process by which workers lose skills or their pro!ciency in tasks due to a reduced need to actively engage in those tasks <ref type="bibr">[9]</ref>. This often occurs when technology, such as AI and automation <ref type="bibr">[91]</ref>, takes over some or all responsibilities that were previously performed by humans. As individuals become more reliant on these systems to handle complex or repetitive tasks, they may stop developing or maintaining the expertise required to perform those tasks independently <ref type="bibr">[4]</ref>. For example, in AI-assisted decision-making, workers might depend on AI to make recommendations or decisions, which can diminish their critical thinking, problem-solving abilities, and overall competence in that domain over time. Indeed, recent empirical evidence shows that the current designs of decision-support tools that provide AI recommendations and explanations do not seem to support people's growth of decision-making skills <ref type="bibr">[33]</ref> and evidence from expert-based systems shows that long-term dependence on such systems does lead to deskilling in those very tasks <ref type="bibr">[80]</ref>.</p><p>While powerful, AI systems can be wrong. They make errors due to biases in the data, limitations in the model, or unforeseen circumstances and they even hallucinate. In the short term, when humans have strong decision-making skills, they are better equipped to recognize and override AI mistakes, can critically assess the AI's recommendations, apply domain expertise, and contribute meaningfully to the decision-making process, resulting in more accurate and nuanced outcomes. In the long term, nurturing independent and strong decision-making skills is essential for humans to retain autonomy in decision-making, transfer their expertise to new situations, and adapt to evolving technologies. Such independent decision-making protects both accountability and human agency as AI becomes more integrated into work#ows.</p><p>Our work adds to the nascent body of research in AI-assisted decision-making, which is concerned with improving human decisionmaking skills in addition to accuracy of the decisions <ref type="bibr">[12,</ref><ref type="bibr">33]</ref>. While related, AI in education is a distinct !eld that focuses primarily on utilizing AI to enhance students' learning outcomes and skill development. A notable recent advancement in scalable tutoring involves leveraging large language models (LLMs) as virtual tutors. These models infer student errors and address knowledge gaps by emulating strategies employed by expert tutors <ref type="bibr">[96]</ref>. While some of these approaches may be applicable to decision-making contexts, we believe educational interventions operate under a di"erent set of constraints. For instance, introducing substantial friction in interactions (e.g., as explored in <ref type="bibr">[53]</ref>) is often acceptable in education but may not be suitable for real world decision-making scenarios. For a broader overview of related work of AI in education, we direct readers to a recent review <ref type="bibr">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2">Eliciting cognitive engagement to calibrate reliance on AI.</head><p>Early optimism that AI decision-support tools will inevitably enhance human decision quality <ref type="bibr">[67]</ref> has dwindled in light of accruing empirical evidence that paints a more complex picture <ref type="bibr">[5,</ref><ref type="bibr">11,</ref><ref type="bibr">35,</ref><ref type="bibr">82,</ref><ref type="bibr">98]</ref>. Intuitive designs that rely on simple XAI approaches, such as providing AI recommendations alongside (unilateral) explanations, have been shown to lead to overreliance -where users follow incorrect AI advice -across diverse tasks, settings, and explanation styles <ref type="bibr">[5,</ref><ref type="bibr">11,</ref><ref type="bibr">34,</ref><ref type="bibr">47,</ref><ref type="bibr">82]</ref>. This empirical evidence has prompted extensive research in designing interventions beyond explanation content that encourage appropriate human reliance on AI. Some endeavours to addressing this challenge focus on training or onboarding sessions aimed at helping individuals develop a mental model of the AI <ref type="bibr">[19,</ref><ref type="bibr">20,</ref><ref type="bibr">52,</ref><ref type="bibr">73,</ref><ref type="bibr">74,</ref><ref type="bibr">78]</ref>, providing meta-information about the AI's uncertainty and limitations <ref type="bibr">[16,</ref><ref type="bibr">55,</ref><ref type="bibr">73]</ref>, helping individuals calibrate their own self-con!dence about the task <ref type="bibr">[40,</ref><ref type="bibr">66]</ref>, or enhance user agency by giving them control over input feature selection and algorithmic assistance <ref type="bibr">[22,</ref><ref type="bibr">56]</ref>.</p><p>By prompting users to re#ect on two choices, contrastive explanations fall within one such growing category of interventions designed to compel deeper cognitive engagement with AI support. Scholars have suggested that overreliance on AI often stems from people's super!cial engagement with AI recommendations and explanations <ref type="bibr">[11,</ref><ref type="bibr">33]</ref>. In response, various interventions have been developed to enhance cognitive engagement, including cognitive forcing <ref type="bibr">[11]</ref>, evaluative AI <ref type="bibr">[72]</ref>, explanations provided without decision recommendations <ref type="bibr">[33]</ref>, explanations framed as questions <ref type="bibr">[26]</ref>, or o"ering more than one decision suggestion <ref type="bibr">[23,</ref><ref type="bibr">64]</ref>.</p><p>While many of these interventions have shown promise in human-AI decision-making quality, they often introduce trade-o"s, such as reducing subjective experience <ref type="bibr">[11]</ref> or requiring more time <ref type="bibr">[21,</ref><ref type="bibr">92,</ref><ref type="bibr">93]</ref>, compared to simply providing AI recommendations with explanations. For instance, cognitive forcing functions <ref type="bibr">[11]</ref> compel deeper cognitive engagement by requiring people to make decisions before receiving AI support. While these interventions signi!cantly reduce overreliance compared to presenting AI recommendations and explanations upfront, they also lead to signi!cantly lower subjective experience. Empirical evidence suggests that people generally tend to dislike receiving AI support after having made a decision <ref type="bibr">[11,</ref><ref type="bibr">32]</ref>. 
Evaluative AI also introduces a paradigm where decision-makers form provisional decisions and then receive AI-generated critiques <ref type="bibr">[72]</ref>. However, a recent study implemented this concept as an interactive interface that allows users to iteratively refine their hypotheses and observe how the evidence aligns or conflicts with their reasoning <ref type="bibr">[57]</ref>, which appears to preserve the subjective experience compared to receiving static critiques post hoc. Building on this evidence, we hypothesize that providing static contrastive explanations after prompting a person to make an initial decision may similarly have a negative effect on subjective experience, thus hindering the uptake of systems that provide such support in real-life scenarios. However, there is a trade-off here, because providing a contrastive explanation after a person has revealed their initial decision has the obvious advantage of revealing the actual foil to the system. This, in turn, can make the explanations more useful for human decision-making and learning compared to settings where the foils are imperfectly predicted.</p><p>Offering more than one decision suggestion or source of advice has also been explored as a mechanism to enhance engagement and calibrate reliance on AI support <ref type="bibr">[5,</ref><ref type="bibr">23,</ref><ref type="bibr">64]</ref>. For example, Bansal et al. <ref type="bibr">[5]</ref> show that providing the top two AI predictions, and Lu et al. <ref type="bibr">[64]</ref> show that offering a "second opinion" in addition to the main AI support, either from another AI or peers, can reduce overreliance on AI recommendations in certain situations. Contrastive explanations also make two options salient to the decision-maker (the fact and the foil), along with reasoning that supports one over the other. It is unclear whether the presence of the explicit comparison in contrastive explanations might dilute the "second opinion" effect that previous studies have shown reduces overreliance.</p><p>Finally, the studies mentioned above treat cognitive engagement as a mechanism for fostering appropriate reliance on AI, mostly focusing on optimizing human-AI decision accuracy by encouraging deeper thought about AI recommendations. Building on the work of Gajos and Mamykina <ref type="bibr">[33]</ref>, our study instead examines cognitive engagement as a means of enhancing human learning about the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3">Assisting decision-making with LLM-generated explanations.</head><p>The emergence of Large Language Models (LLMs) has sparked interest in their potential to generate explanations that enhance decision-making. In the domain of programming assistants, Yan et al. <ref type="bibr">[102]</ref> introduced an LLM-powered system that generates natural language explanations to clarify the functionality of each code suggestion. For data annotation tasks, Wang et al. <ref type="bibr">[97]</ref> leveraged an LLM to predict annotation labels and provide explanations for its choices. In recommendation systems, Silva et al. <ref type="bibr">[86]</ref> used LLMs both as the recommendation engine and as a generator of personalized explanations to improve user experience. In the context of LLM-powered fact-checking, providing contrastive explanations that present evidence on both why a claim might be true and why it might be false has been shown to e"ectively reduce overreliance on incorrect LLM predictions <ref type="bibr">[85]</ref>.</p><p>In contrast to such approaches, which use LLMs for both taskcentric predictions (e.g., code suggestions, recommendations) and explanation, our work separates these functions. Like Slack et al. <ref type="bibr">[88]</ref>, who introduce a chat-based interface to query a predictive machine learning model, we exploit LLMs for their language generation capabilities, and in addition for their common sense reasoning. We rely on a trusted predictive model for generating task-related predictions and explanation dimensions, while the LLM is solely responsible for turning a sca"old produced by the predictive model into natural language, and !lling in small common-sense knowledge gaps needed to interpret the model's predictions. This separation preserves the accuracy of predictions and explanations while bene!ting from the interpretability and coherence of LLM-generated rationalizations, thereby minimizing the risk of the notorious hallucinations for which LLMs are known <ref type="bibr">[50]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Human Intrinsic Motivation and AI Assistance</head><p>With AI systems rede!ning work#ows and the way tasks are carried out, questions surrounding their e"ect on people's motivation about the tasks for which they receive assistance are becoming more pressing <ref type="bibr">[12]</ref>. According to the seminal Self-Determination Theory (SDT), individuals feel intrinsically motivated when three psychological needs-competence, autonomy, and relatedness-are met during an activity <ref type="bibr">[27]</ref>. Competence refers to the need to feel skilled and e"ective in the activity, autonomy re#ects the need to have control over how the activity is carried out, and relatedness involves the need to feel connected to others and to experience a sense of belonging while engaging in the activity. These three needs are fundamental for fostering intrinsic motivation, which leads to greater engagement, performance, and overall satisfaction with the task <ref type="bibr">[27]</ref>. The introduction of AI assistance into decisionmaking processes can a"ect these psychological needs in multiple ways. For example, while AI might enhance short-term feelings of competence by providing support in the moment of decisionmaking <ref type="bibr">[31]</ref>, it may simultaneously undermine long-term mastery, as current designs do not always facilitate skill development <ref type="bibr">[33]</ref>.</p><p>Similarly, AI can diminish a user's sense of autonomy if they feel overly dependent on the system, reducing their ownership of task outcomes.</p><p>We hypothesize that both the outcomes of the interaction and the design of the AI system in#uence perceptions of competence and autonomy. On the outcome side, we expect that AI systems that actively support skill development will enhance feelings of competence. In terms of design, approaches where the AI critiques each decision after it is made (e.g., contrastive after) may undermine users' feelings of competence and autonomy, as they could perceive the AI more as a micromanager than a supportive tool, constantly pointing out #aws and dictating its preferred way of doing things. Additionally, even when AI assistance is provided before a decision, designs that emphasize only one option (e.g., unilateral condition) may reduce the sense of autonomy by shifting the decision-maker into a more passive role. In contrast, designs that present multiple options promote active engagement, allowing the decision-maker to weigh di"erent possibilities before making a choice (e.g., contrastive before conditions).</p><p>In SDT, relatedness refers to the connection an individual feels toward colleagues or collaborators, typically measured through questions about trust, similarity in reasoning, and willingness to engage in future interactions. We adapt these constructs to assess relatedness to AI, hypothesizing that designs fostering competence and autonomy will similarly enhance relatedness to AI systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">THE CONTRASTIVE EXPLANATION FRAMEWORK &amp; HYPOTHESES</head><p>Imagine a clinician reviewing an AI-powered decision-support system's recommendation for a patient's treatment plan. The AI suggests Medication A, but the clinician had Medication B in mind based on their experience with this condition. Existing AI systems would simply explain why Medication A is recommended. However, this leaves the clinician wondering why Medication B, which they deemed suitable, is not the better choice. A contrastive explanation may elucidate this knowledge gap as follows: "While Medication B is a common and a viable choice for most patients because of its short treatment duration, Medication A is recommended due to its lower risk of drug interactions with this patient's current medications. "</p><p>We propose the Contrastive Explanation Framework to address the limitations of current AI-powered decision-support systems by providing contrastive explanations that acknowledge human's alternative considerations when suggesting a decision. This framework is composed of four main components: (1) an AI task model, (2) a model of how humans are likely to reason about this task, (3) a contrastive module, and (4) a presentation module. The AI task model is the standard AI system that predicts the AI's response for a given decision task (fact), while the human model predicts an average user's response -a plausible alternative (foil) -for the same task based on a model trained on previous human decisions. Based on AI model (e.g., weights), the contrastive module then analyzes the di"erences between the AI's and the likely human's responses, generating task concepts in which the fact is superior to The AI task model predicts the AI's response for a given decision task (fact), while the human model predicts the user's response for the same task (foil). The contrastive module then analyzes the di"erences between the AI's and the human's responses, generating task dimensions where the fact is superior to the foil (&#119871; fact ) and, if any, where the foil is superior to the fact (&#119871; foil ). Finally, the presentation module, powered by a large language model (LLM), formats the information into an interpretable explanation, !lling in small common-sense knowledge gaps within the constraints of the predictive models. The example generation outlined in the !gure is relates to the character vignette in Figure <ref type="figure">3</ref>.</p><p>. the foil (e.g., lower risk of drug interaction) and task concepts, if any, in which the foil is superior to the fact (e.g., shorter treatment duration). Finally, the presentation module, powered by a large language model, formats these dimensions into prose and !lls in the common sense knowledge that focuses on the knowledge gap that may lead someone to pick foil as opposed to fact.</p><p>To evaluate the e"ectiveness of contrastive explanations in improving human learning and accuracy in AI-assisted decision-making, we instantiated this framework with an exercise recommendation task, and conducted an experiment in which people were asked to complete a sequence of decisions and were randomized in one of the 5 di"erent conditions:</p><p>&#8226; No AI (Baseline). Participants in the No AI condition completed the study without any AI support. &#8226; Unilateral In this condition, participants interacted with the typical AI recommendation and explanation paradigm. 
The AI suggested a choice and provided reasoning to justify why that choice was the best one. The explanation was unilateral, emphasizing all the concepts and evidence supporting the AI's suggestion. &#8226; Contrastive predicted (with predicted foil). The contrastive predicted condition was designed to provide participants with a contrastive explanation that compares the AI recommendation (fact) with the alternative (foil) that a human would likely consider, as predicted by the human model.</p><p>In the interface, we presented the foil as a choice that "many people" would likely make in a similar situation. The explanation highlighted only the concept(s) in which the two choices differ, emphasizing why the AI's recommendation is superior to the foil. &#8226; Contrastive random (with random foil). Presentation-wise, the contrastive random condition was identical to the contrastive predicted condition. However, in this case, the foil was selected randomly from the six possible choices rather than being predicted by the human model.</p><p>&#8226; Contrastive after (with inputted foil). In the contrastive after condition, participants first made their own decision before receiving the AI's recommendation and the contrastive explanation, in which the participant's decision was used as the foil. In situations where the inputted foil was the same as the AI suggestion, participants were presented with a unilateral explanation supporting their choice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Hypotheses and Research Questions</head><p>In our hypotheses, we sometimes refer jointly to contrastive predicted and contrastive after conditions, in which the foil is not random, as contrastive with a sensible foil.</p><p>Our main hypotheses are that contrastive explanations with a sensible foil will improve participants' decision-making skills<ref type="foot">foot_0</ref> more e"ectively and result in accuracy that is equal to or better than unilateral explanations. Furthermore, within the contrastive conditions, we hypothesize that contrastive explanations with a predicted foil will result in greater human learning than those with a random foil, and o"er a superior subjective experience compared to contrastive explanations with an inputted foil.</p><p>We categorize these main and other hypotheses and research questions by interaction outcomes -human learning, accuracy, and subjective experience -and elaborate them below. To enhance readability, we abbreviate learning-focused hypotheses and research questions as H-L and RQ-L and accuracy-focused ones as H-A and RQ-A, respectively. For hypotheses related to subjective measures, we use the -S su$x (e.g., H-S1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">Human Learning.</head><p>H-L1: Contrastive explanations with sensible foil -predicted (H-L1a) or inputted (H-L1b) -will lead to more learning than providing people with no AI support.</p><p>H-L2: Contrastive explanations with sensible foil -predicted (H-L2a) or inputted (H-L2b) -will lead to more learning than unilateral explanations. H-L3: Contrastive explanations with predicted foil will lead to more learning than contrastive explanations with a random foil. RQ-L1: Will contrastive explanations with predicted foil (provided at the decision-making time) lead to di"erent learning than contrastive explanations after the decision is made (contrastive after)?</p><p>H-A1: Contrastive explanations with sensible foil -predicted (H-A1a) or inputted (H-A1b) -will lead to equal or better decision accuracy compared to unilateral explanations. RQ-A1: Will contrastive explanations -predicted or randomwhich present two choices, reduce overreliance on AI, compared to unilateral explanations?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3">Subjective Experience.</head><p>H-S1: Contrastive explanations with predicted foil will lead to higher perceived competence, autonomy, and relatedness to AI than unilateral explanations. H-S2: Contrastive explanations with predicted foil will lead to higher perceived competence, autonomy, and relatedness to AI than contrastive explanations with inputted foil.</p><p>In the following sections, we describe an exercise recommendation task and the instantiation and implementation of the contrastive explanations framework for the exercise recommendation task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXERCISE RECOMMENDATION TASK DESIGN</head><p>To create a decision-making task accessible to laypeople on crowdsourcing platforms while presenting cognitive challenges similar to high-stakes decisions (e.g., treatment selection), we collaborated with a kinesiology expert, a co-author of this paper. We designed scenarios for an exercise recommendation task, as shown in Figure <ref type="figure">3</ref>. Participants are tasked to choose the best exercise from a list of options based on a !ctional character's description, goals, and preferences. This task is designed to be easy to understand yet complex enough to mimic clinical treatment decisions. Clinicians consider many (sometimes competing) factors when selecting treatments, such as patient condition, treatment preferences, side-e"ect tolerance, and constraints. Similarly, selecting the right exercise involves weighing the individual's goals, preferences, and capabilities, requiring analogous cognitive steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Generating the !ctitious characters</head><p>We generated vignettes of !ctitious people by randomly sampling their demographics from probabilities obtained from the US Census<ref type="foot">foot_1</ref> , Centers for Disease Control and Prevention<ref type="foot">foot_2</ref> , and the US Bureau of Labor statistics<ref type="foot">foot_3</ref> (name, age, gender, BMI, physical activity level, occupation). According to the sampled !ctitious character, we manipulated or randomly sampled the following factors which were deemed important for exercise prescription by the expert: (1) their !tness level and maximal intensity (based on demographics and health status), (2) their exercise goal (e.g., building muscles, weight loss, #exibility), and (3) their exercise preference (e.g., indoor/outdoor, group/individual). We implemented these steps as !ctitious character generation process that allowed us to generate di"erent characters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Curating the exercises</head><p>To build an exercise repository for recommending activities to !ctional individuals, we curated a list of 59 leisure activities from a comprehensive compendium, which included various physical activities, from sports to everyday tasks like housework and occupational activities <ref type="bibr">[2]</ref>. In the compendium, each activity was labeled with its MET (metabolic equivalent), which denotes the energy requirement for basal homeostasis (1 MET is roughly the energy required to sleep or watch TV). Moderate activities require between 3 and less than 6 METs, while vigorous activities require 6 or more METs. We also labeled the exercises based on (i) their goals (cardio, muscle building, #exibility), (ii) whether they are typically performed indoors or outdoors, and (iii) whether they are typically performed individually or in a group. From this list, we selected seven representative exercises for the dropdown menu: aerobics, bicycling, boxing, jog/walk combination, pilates, resistance training, and swimming. See Appendix A.1.2 for a detailed description of the selection process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Representing characters and exercises</head><p>To prescribe exercises to characters, we !rst represented exercises and characters in a joint representation space. Guided by the domain expert, we constructed a relatively simple representation space consisting of three broad concepts: (1) intensity, (2) goal, and (3) preference. Each exercise and generated character was encoded onto these three broad concepts as described below.</p><p>Intensity. For exercises, intensity captures the level of exertion or e"ort the exercise requires, measured in METs. One MET is de!ned as the oxygen consumption of 3.5 milliliters of oxygen per kilogram of body weight per minute (3.5 ml/kg/min), which is roughly the rate of oxygen consumption at rest.</p><p>For characters, intensity captures the level of exertion or e"ort a character can sustain during physical activity (i.e., their cardiorespiratory !tness). It is quanti!ed by the reserve oxygen uptake (&#119872;&#119873; 2 &#119871; ), which represents the additional oxygen consumption capacity a person has beyond their resting state. This reserve is determined by subtracting the resting oxygen uptake (3.5 ml/kg/min or 1 MET) from the maximal oxygen uptake (&#119872;&#119873; 2 &#119872;&#119873;&#119874; ), which is the highest rate at which the body can use oxygen during intense physical activity. Maximal oxygen uptake is assessed in clinical settings using a treadmill test, but various equations have been proposed as useful proxies <ref type="bibr">[77]</ref>. Following Jang et al. <ref type="bibr">[49]</ref>, we calculated the cardiorespiratory !tness of a character as a function of age, sex, BMI, and current physical activity level <ref type="foot">5</ref> (rating of physical activity on a 7-point scale <ref type="bibr">[45]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Find an optimal exercise for Natalie</head><p>By taking into account the information below determine the exercise that is better suited for Natalie.</p><p>Natalie wants to build muscles and increase flexibility. She is 62 years old and is retired. Her current engagement in physical activity is low. She is not overweight. She enjoys group and indoor activities. (Assume Natalie has the necessary resources to perform any of the exercises.)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I would suggest</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Many people's choice</head><p>Many people would choose resistance training for Natalie.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI suggestion</head><p>The AI suggests pilates as the optimal exercise for Natalie.</p><p>While resistance training may provide better muscle building results, pilates proves to be more beneficial in terms of elevating overall flexibility.</p><p>&#8226; Goal flexibility: Pilates is an exercise that focuses heavily on improving flexibility by elongating, toning muscles and increasing muscle elasticity, which contrasts with resistance training that focuses primarily on muscle building. This core characteristic of pilates is ideally suited for Natalie who aims to both build muscles and enhance her flexibility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI explanation</head><p>(a) Sample task with contrastive explanation</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI suggestion</head><p>The AI suggests pilates as the optimal exercise for Natalie.</p><p>&#8226; Intensity: Pilates exercises are low to moderate in intensity, making them suitable for Natalie's current fitness level. &#8226; Goal: Through practices of controlled movements and poses, Pilates significantly helps in building muscle strength and increasing flexibility, aligning with Natalie's fitness goals. &#8226; Preference: Pilates is usually practiced in a group setting, which satisfies Natalie's preference for group activities. It can also be performed indoors, fitting her enjoyment of indoor activities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI explanation</head><p>(b) Unilateral explanation</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Your choice</head><p>You chose resistance training for Natalie.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI suggestion</head><p>The AI suggests pilates as the optimal exercise for Natalie.</p><p>While resistance training may provide better muscle building results, pilates proves to be more beneficial in terms of elevating overall flexibility.</p><p>&#8226; Goal flexibility: Pilates is an exercise that focuses heavily on improving flexibility by elongating, toning muscles and increasing muscle elasticity, which contrasts with resistance training that focuses primarily on muscle building. This core characteristic of pilates is ideally suited for Natalie who aims to both build muscles and enhance her flexibility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AI explanation</head><p>(c) Contrastive explanation after Figure <ref type="figure">3</ref>: Illustration of the exercise recommendation decision-making task featuring di"erent explanation designs. 3a shows a sample of the task with contrastive explanation, whereas 3b and 3c depict only the explanations for the respective conditions.</p><p>In the contrastive random condition, the presentation was identical to the contrastive condition, but with the alternative (foil) selected randomly. In the no-AI condition (not illustrated), participants made decisions without any AI assistance.</p><p>Goal. For exercises, goal captures the type of bene!t the exercise has on the body, consisting of three dimensions: cardiovascular improvement, muscle building, or #exibility. For characters, goal re#ects what the character aims to achieve through exercise, in terms of the same three dimensions: improving cardiovascular health, building muscle, or enhancing #exibility. Note that additional domain knowledge (e.g., cardio is bene!cial for weight loss) is necessary to convert some of the higher level character's exercise goals (e.g., losing weight) to the representation space.</p><p>Preference. For exercises, preference indicates whether the exercise is typically performed indoors or outdoors, and whether it is usually done individually or in a group. For characters, preference captures the character's preferred exercise environment (indoor/outdoor) and social setting (individual/group).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Designing the objective function</head><p>Having constructed joint representations for characters and exercises, we now formalize our setting and explain the objective function we designed for recommending exercises to characters.</p><p>Let a !ctitious character representation be x &#8595; R &#119887; and an exercise representation be y &#8595; R &#119887; , where &#119874; = 6 and both representations are structured with dimensions representing intensity, goals, and preferences:</p><p>x = [&#119875; MET , &#119875; cardio , &#119875; muscle , &#119875; #exibility , &#119875; environment , &#119875; social setting ] &#119888; , with y following a similar structure. Our goal was to create a function that scores the "goodness" of an exercise for the given character. We designed a linear objective function:</p><p>where g(x, y) is a piece-wise vector-valued function (devised with the expert) that returns a joint representation (vector) of the person and the exercise for each dimension. g(x, y) &#8595; R &#119887;+1 concerning the following aspects: intensity, goal, and preference.</p><p>Intensity: Penalize exercises exceeding character's capabilities.</p><p>min(0, &#119877; 1 &#8593; &#119875; 1 )</p><p>Intensity: Penalize exercises underutilizing character's capabilities.</p><p>[</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>,4}</head><p>Goal: Match each stated subgoal (cardio, muscle building, !exibility).</p><p>[&#119877; &#119889; == &#119875; &#119889; ] &#119889; &#8595; {5,6}</p><p>Preference: Match each preference (environment, social setting).</p><p>In the equation above, the subscripts refer to dimensions of x and y, and denotes the indicator function, which takes the value 1 if the condition inside the brackets is true and 0 otherwise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Learning the expert weights</head><p>The parameterized objective function (equation 1) enables learning weights w from di"erent sources of labels. We aim to learn &#119876; expert with weights w e based on expert labels, and &#119876; human with weights w h based on crowdworker labels. The expert model &#119876; expert , takes a description of a !ctitious character and exercise, and outputs a realvalued score indicating how well the exercise matches the goals, abilities and preferences of the !ctitious character. We trained and validated this model on expert labels of optimal exercises for a series of characters. Similarly, the human model, which captures how humans reason on average (described in 5.2), was trained on crowdworkers' (i.e., laypeople's) labels.</p><p>Generating a series of diverse !ctitious characters, we asked the kinesiology expert to select top exercises for them from a list of top 15 exercises (out of 59) selected with a "dummy" scoring function which equally weighted each dimension. For every character, the expert provided a best set of exercises S 1 typically consisting of two or three similar exercises (e.g., pilates, yoga), and a second best set of exercises S 2 that would still be a reasonable choice but not as good as the !rst set. With 15 exercises in the list, these labels provided multiple pairwise comparisons between the individual exercises. For every exercise y &#119890; &#8595; S 1 , then y &#119890; is a better choice than y &#119891; for every other exercise y &#119891; &#969; S 1 . Similarly, for every exercise y &#119890; &#8595; S 2 , then y &#119890; is a better choice than y &#119891; for every</p><p>With a dataset of rankings, our goal was to learn the expert weights w &#119875; from equation 1. Ranking problems, particularly with linear ranking functions, can be transformed into classi!cation problems by considering pairwise di"erences between elements <ref type="bibr">[42]</ref>. This approach involves transforming the ranking task of a set of items (e.g., exercises) into several binary classi!cation tasks. For each item pair, a di"erence vector of their features (u &#119890; &#8593; u &#119891; ) is generated and the label corresponds to their relative order (e.g., the label &#119878; = 1 if item &#119879; is a better choice than item &#119880; and &#8593;1 otherwise). In our setting, the items correspond to exercises. A binary classi!er is then trained on these labeled pairs to predict which of the given two items should be ranked higher. When using a linear binary classi!er &#119878; = &#119881;&#119879;&#119882;&#119883;(w &#119888; (u &#119890; &#8593; u &#119891; ) + &#119884;), the coe$cients of the model represent the weights of the feature di"erences, thereby indicating the importance of each feature dimension in determining the ranking.</p><p>Let exercise y &#8596; be a better choice than exercise y &#119890; for a character x. In our setting this looks as follows:</p><p>where the !rst element in the square brackets is the input to our classi!er model, and the second element is the label.</p><p>To avoid biasing the classi!er, we randomly assign a pair to either have a positive (1) or a negative (-1) label (i.e., "y &#8596; is better than y &#119890; " or "y &#119890; is worse than y &#8596; "). 
We fit an SVM classifier with a linear kernel to these tuples of data with expert labels, thereby recovering the coefficients as the expert weights w_e for the scoring function: f(g(x, y), w_e) = w_e^T g(x, y). (For implementation details and the evaluation of the expert model, see Appendix A.1.1.) We followed a similar approach to learn the human model weights from crowd-sourced data, as described in Section 5.2.</p></div>
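<p>As an illustration of the pairwise-difference reduction described above, the following sketch fits a linear-kernel SVM on synthetic preference pairs and reads the learned coefficients back as scoring weights. It reuses the hypothetical g, score, character, and exercise encodings from the previous sketch; the preference pairs are made up, not the expert labels.</p><p><code>
import numpy as np
from sklearn.svm import SVC

def make_pairwise_dataset(x, ranked_pairs, rng=np.random.default_rng(0)):
    """ranked_pairs: list of (better_exercise, worse_exercise) encodings."""
    X, z = [], []
    for y_better, y_worse in ranked_pairs:
        diff = g(x, y_better) - g(x, y_worse)
        if rng.random() > 0.5:       # randomly flip sign to avoid biasing the classifier
            X.append(diff)
            z.append(1)
        else:
            X.append(-diff)
            z.append(-1)
    return np.array(X), np.array(z)

# Synthetic labels: pilates preferred over boxing for this character, repeated.
X, z = make_pairwise_dataset(character, [(pilates, boxing)] * 20)

clf = SVC(kernel="linear").fit(X, z)
w_learned = clf.coef_.ravel()        # weights for f(g(x, y), w) = w^T g(x, y)
print(w_learned)
</code></p>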
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">APPLYING THE CONTRASTIVE EXPLANATION FRAMEWORK TO THE EXERCISE TASK</head><p>Our goal is to generate contrastive explanations (using the framework in Section 3 for the exercise recommendation task explained in Section 4). In this section, we describe this process, and we end up with contrastive explanations like the ones shown in Figure <ref type="figure">3</ref>. To do so, we use a simulated AI model (we control the accuracy of this model), generate foils using the human model weights, generate contrast concepts using our representation g(x, y), and generate the explanations using an LLM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Simulated AI model: Generating the fact</head><p>The AI model in our framework represents the common way in which models are trained for speci!c tasks (e.g., disease diagnosis) by exposing them to vast amounts of data, which allows them to identify patterns and make decisions based on learned statistical relationships. However, because these models operate solely within the con!nes of the data they have encountered, they achieve high performance in familiar decision instances, but they also make mistakes when encountering novel or poorly-represented scenarios.</p><p>To emulate real-world situations, we designed a simulated AI model such that it performs better than the average human but that it also occasionally makes mistakes. We chose to simulate the AI model because we wanted to have control over the types of mistakes the AI makes. Our formative studies indicated that unassisted people achieve on average 30% accuracy on selecting the top exercise out of 7 choices, and we designed the AI model to have an accuracy 71.4%. We used the expert model weights to decide the top exercise recommendation when the AI made a correct decision. For a given decision instance (i.e., character), the top AI suggestion is the exercise with the highest score under the expert model weights y fact = argmax y &#119875; (&#119876; (g(x, y &#119890; ), w &#119875; ). In the contrastive explanation framework, we refer to the AI generated exercise as the fact (even when it is a wrong suggestion).</p><p>To make the AI err, we chose to select a reasonable alternative from among the exercises rather than a random one, as the latter would make AI errors too obvious to participants. Therefore, the AI suggestion in such instances was the foil -the top exercise selected by the human model (as described in Section 5.2: this is always di"erent to the expert model's top exercise). When the AI errs, the new foil becomes the second-best choice from the human model. As a result, in those instances, neither the fact nor the foil corresponds to the correct answer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Human model: Generating the foil</head><p>As motivated in previous sections, we believe that contrastive explanations are most effective when the foil represents a likely human answer. For instance, in contexts with established guidelines, such as medical decision-making, the foil could be the guideline-recommended action <ref type="bibr">[46]</ref>. In situations without established guidelines, the foil can be inferred from prior human decisions. In our implementation of the contrastive explanation framework for the exercise recommendation task, we chose to implement the foil as the likely human response to a given question. Specifically, we built a human model that predicts the exercise laypeople would select for previously unseen fictitious characters by training on unassisted human responses. We implemented a generic model to represent human decision-making, which was sufficient for our simple task. However, depending on the context, personalized models that adapt and update as they learn more about individual users could be more appropriate.</p><p>We generated a series of fictitious characters and ran an online study on Prolific to collect responses from crowd workers, who served as non-experts in the domain. See Appendix A.2.1 for details of the data collection study and the evaluation of the human model. To learn the human model weights, we followed the same procedure as for the expert model weights, described in Section 4.5. Given a character and two exercises, our learned linear SVM classifier predicted which exercise is more likely to be selected by the human non-expert. The coefficients of the classifier yielded the human model weights for each concept (i.e., goal, intensity, preference). Therefore, we constructed a scoring function based on the human model weights as well: f(g(x, y), w_h) = w_h^T g(x, y). In our implementation, we selected the foil as the exercise with the highest score under the human weights that was not the same as the expert choice: y_foil = argmax_{y_i} f(g(x, y_i), w_h), where y_i ≠ y_fact. This approach selects the most likely incorrect human answer. When the simulated AI was to provide a wrong suggestion (i.e., the fact was suboptimal), the output of this human model was presented as the fact, and the new foil was the second most likely incorrect human model answer: this is still an incorrect choice, but one less likely to be selected by people than the first.</p></div>
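Continuing the same illustrative sketch (again an assumption-laden reconstruction, not the study's code), foil selection under the rules above could look as follows; when the AI errs, its suggestion is already the human model's top incorrect pick, so the foil falls back to the second most likely incorrect answer:

```python
import numpy as np

def choose_foil(g_all, w_expert, w_human, ai_is_correct):
    """Return the index of the 'foil', the likely (incorrect) human answer."""
    expert_top = int(np.argmax(g_all @ w_expert))
    human_order = np.argsort(-(g_all @ w_human))
    # Likely human answers that are not the expert model's (correct) choice.
    candidates = [int(i) for i in human_order if i != expert_top]
    return candidates[0] if ai_is_correct else candidates[1]
```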
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Contrast Module: Generating the contrast concepts</head><p>The goal of the contrast module is to generate the dimensions, or features, in which the fact and the foil differ: specifically, in what aspects the fact is superior to the foil, and in what aspects (if any) the foil is superior to the fact. In our setting, these dimensions correspond to the three main concepts of the task: intensity, goal, and preference. To generate these dimensions we employed the following approach. Let y_fact and y_foil be the two exercises generated by the AI and the human model for character x, respectively. Our goal is to identify the dimensions in which these two exercises differ based on the expert model's weights. For each exercise, we computed the element-wise product of the expert model weights with the joint character-exercise representation g(x, y), resulting in the weighted vectors w_e ⊙ g(x, y_fact) and w_e ⊙ g(x, y_foil) for the AI-generated exercise and the human-model-generated exercise, respectively.</p><p>Next, we calculated the difference between these two weighted vectors to determine the dimensions along which the exercises differ according to the expert model's weighting scheme. This difference vector, Δg, is given by: Δg = w_e ⊙ g(x, y_fact) − w_e ⊙ g(x, y_foil).</p><p>Non-zero dimensions of Δg indicate where the two exercises differ. A positive value indicates that the fact is superior to the foil in that dimension, while a negative value indicates that the foil is superior to the fact. Therefore, the contrast module generates two sets of dimensions: the dimensions for which the fact is superior to the foil, S_fact = {k | Δg[k] &gt; 0}, and those for which the foil is superior to the fact, S_foil = {k | Δg[k] &lt; 0}, where k denotes the dimension. Because the foil may not be superior to the fact in any dimension, S_foil can be an empty set. However, by definition, S_fact ≠ ∅.</p></div>
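A compact sketch of the contrast module under these definitions; dim_names is a hypothetical list of concept labels (e.g., 'goal_cardio', 'intensity_low') standing in for the task's representation dimensions:

```python
import numpy as np

def contrast_concepts(g_fact, g_foil, w_expert, dim_names):
    """Dimensions in which the fact and the foil differ under the expert weighting."""
    delta_g = w_expert * g_fact - w_expert * g_foil   # element-wise weighted difference
    s_fact = [d for d, v in zip(dim_names, delta_g) if v > 0]   # fact superior here
    s_foil = [d for d, v in zip(dim_names, delta_g) if v < 0]   # foil superior here (may be empty)
    return s_fact, s_foil
```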
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Presentation Module: Generating interpretable explanations</head><p>Once the fact, the foil, and the dimensions where they differ are generated, the presentation module's purpose is to convert this information into a format that is easily understood by humans. We chose to implement an LLM-powered presentation module that is guided by our trusted predictive model, allowing little room for hallucinations. Given y_fact, y_foil, and the sets of dimensions in which each is superior (S_fact, S_foil), the LLM-powered presentation module adds commonsense knowledge and turns the explanation into prose. Specifically, the LLM adds the knowledge needed to map from the representation space (i.e., concepts) in which the predictive model operates to the input (i.e., vignette) and output (i.e., exercises) spaces. For example, let x be a fictitious character whose goal is to lose weight. Let y_fact correspond to the representation of the activity running and y_foil correspond to the representation of the activity pilates. Further, let S_fact include {goal_cardio}. In other words, running is superior to pilates because it supports cardio goals. The remaining domain knowledge required to fully understand the explanation is the following: 'cardio benefits weight loss', 'running is a cardio exercise', and 'pilates is not a cardio exercise'. Therefore, highlighting cardio as a differing dimension may not be enough without explaining those domain facts. The LLM is prompted to fill in these knowledge gaps, given the information 'running is superior to pilates in supporting cardio goals'. Note that there is little room for the LLM to hallucinate facts, because we constrain the generation process with the fact, foil, and concepts (S_fact, S_foil) generated by the predictive models.</p><p>The LLM was always shown the character's vignette and told the representation-space dimensions that we identified as important (from Section 4.4).</p><p>As shown in Figure <ref type="figure">2</ref>, for contrastive explanations, the LLM was given y_fact, y_foil, S_fact, and S_foil. For unilateral explanations, only y_fact was provided to the LLM. The templates that guided the LLM to generate the explanations are provided in Appendix A.3. We used the OpenAI API <ref type="bibr">[76]</ref> and chose GPT-4 to generate the explanations <ref type="foot">6</ref>.</p></div>
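As a sketch only: the call below uses the OpenAI Python client with a GPT-4 chat model, which matches the tooling the paper reports, but the prompt wording and the function name are invented for illustration; the study's actual templates are in Appendix A.3.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def contrastive_explanation(vignette, fact, foil, s_fact, s_foil):
    """Turn the structured contrast into prose; hypothetical prompt, not the study's template."""
    prompt = (
        f"Character: {vignette}\n"
        f"Recommended exercise (fact): {fact}\n"
        f"Likely human choice (foil): {foil}\n"
        f"Dimensions where the fact is better: {', '.join(s_fact)}\n"
        f"Dimensions where the foil is better: {', '.join(s_foil) or 'none'}\n"
        "Explain briefly why the recommended exercise is the better choice for this "
        "character, referring only to the dimensions listed above."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Constraining the prompt to the fact, foil, and contrast concepts is what leaves the LLM little room to hallucinate; it only verbalizes, and lightly augments, what the predictive models already decided.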
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">EXPERIMENT</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Task description</head><p>Participants were shown vignettes of fictitious characters and were asked to select the optimal exercise for the character in question based on their goals, capabilities, and preferences. They had to select the top exercise among 7 exercises, which were fixed choices across vignettes and alphabetically ordered in the drop-down list: aerobics, bicycling, boxing, jog/walk combination, pilates, resistance training, and swimming.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Conditions</head><p>Participants were randomized into one of the five conditions: no AI, unilateral, contrastive predicted, contrastive after, and contrastive random, as described in Section 3. Figure <ref type="figure">3</ref> provides a sample decision task with illustrations of the key conditions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Procedure</head><p>Participants accessed the study online through Prolific, where they first provided informed consent. They then completed pre-task questionnaires, including a brief demographic survey, a six-item Need for Cognition (NFC) Scale <ref type="bibr">[61]</ref>, and a seven-item Actively Open-minded Thinking (AOT) Scale <ref type="bibr">[39]</ref>. The study consisted of three blocks: pre-test and post-test blocks, each with 5 exercise prescription tasks without AI support, which served to measure human learning, and an intervention block with 14 tasks in which participants interacted with one of the AI interaction designs (or no AI, depending on their randomization). After completing the tasks, participants filled out a shortened version of the Intrinsic Motivation Inventory (IMI) <ref type="bibr">[68,</ref><ref type="bibr">81]</ref>, a self-report instrument intended to measure participants' subjective experience with the task, which assessed their perceived autonomy, competence, relatedness to AI, and interest/enjoyment, using 4 questions for each construct (except for relatedness, for which 3 questions were used). An additional question was included to assess mental demand.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Participants</head><p>We conducted a power analysis using G*Power <ref type="bibr">[30]</ref> to determine the required sample size for detecting a small effect size in our study with 5 conditions. With a small effect size, an α error probability of 0.05, and a desired power of 0.80, the analysis indicated that a total of 548 participants would be needed to achieve sufficient power to detect the effect. To account for the filtering of spammers, a total of 800 participants were recruited to complete the task via Prolific. Participation was limited to US adults fluent in English. Recruited in batches, participants received an average compensation of $2.70 (USD) for the task. To ensure a compensation rate of $12 per hour, we adjusted the payment from $2.40 in the initial small batches to $2.75 in later batches, reflecting the median time participants spent on the study. The average age of participants was M = 35.76 (SD = 11.71), and their education distribution was 0.5% pre-high school, 19.4% high school, 75.8% college, 5.7% post-graduate degree, and 4.6% did not disclose their education.</p></div>
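For readers without G*Power, a comparable calculation can be approximated in Python with statsmodels; the Cohen's f value below is an assumed conventional "small" effect size, so the result only roughly tracks the 548 participants reported above:

```python
from statsmodels.stats.power import FTestAnovaPower

# Assumed input: f = 0.15 is a guess at what "small effect size" meant here.
n_total = FTestAnovaPower().solve_power(
    effect_size=0.15, k_groups=5, alpha=0.05, power=0.80
)
print(round(n_total))  # approximate total sample size across the 5 conditions
```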
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.1">Exclusion criteria.</head><p>We retained 628 participants for analyses. To ensure meaningful engagement, participants with a median response time under 4 seconds were excluded, as this suggested insufficient consideration of the tasks, which required reading vignettes and selecting exercises. Those with any response time exceeding 2.5 minutes (the 90th percentile) were also removed to avoid data distortion from distractions. Additionally, participants in AI-assisted conditions who performed near random (below 20% accuracy) or selected the same exercise for more than half of the study were excluded for potential misunderstanding of the task. For the subjective experience analyses, 6 participants were removed due to technical issues they encountered during the post-study questionnaire.</p></div>
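A hypothetical pandas sketch of these exclusion rules; all column names are invented for illustration, and the actual filtering scripts are not part of the paper:

```python
import pandas as pd

def apply_exclusions(trials: pd.DataFrame) -> pd.DataFrame:
    """Drop participants per the stated rules; 'trials' has one row per task response."""
    per_p = trials.groupby("participant_id")
    median_rt = per_p["response_time_s"].median()
    max_rt = per_p["response_time_s"].max()
    accuracy = per_p["correct"].mean()
    # Share of trials on which a participant picked their single most frequent exercise.
    modal_share = per_p["choice"].agg(lambda s: s.value_counts(normalize=True).iloc[0])
    assisted = per_p["ai_assisted"].first()

    keep = (median_rt >= 4) & (max_rt <= 150)                      # 150 s = 2.5 minutes
    keep &= ~(assisted & ((accuracy < 0.20) | (modal_share > 0.5)))
    return trials[trials["participant_id"].isin(keep[keep].index)]
```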
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Approval</head><p>This study received approval from our institution's IRB under protocol number IRB21-0805.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6">Design &amp; Analysis</head><p>This study followed a between-subjects design, with condition as the factor. Each participant interacted with one of the five conditions.</p><p>We collected the following indicators of performance and learning:</p><p>• Accuracy: Percentage of correct answers provided by participants in the intervention block, where a correct answer is one that matches the ground truth. • Overreliance: Percentage of answers that matched the AI's suggestions in questions for which participants received AI support and the AI's suggestion was incorrect. • Learning: Percentage of correct answers on post-intervention questions (controlling for participants' performance on pre-intervention questions).</p><p>For accuracy, learning, and overreliance, in text and in figures we report the marginal means produced by the regression models that included performance on pre-intervention questions as a covariate.</p><p>To assess subjective experience, we collected the following measures, assessed on a 5-point Likert scale unless stated differently (see Appendix A.4 for the questionnaire):</p><p>• Perceived Competence: Four questions adapted from the Intrinsic Motivation Inventory (IMI) to measure participants' feelings of effectiveness and competence in the task. • Perceived Autonomy (Choice): Four questions adapted from the IMI capturing the degree of autonomy and freedom participants felt in their decision-making. • Relatedness to AI: Three questions adapted from the IMI measuring participants' sense of connection and trust in the AI. • Interest/Enjoyment: Four questions adapted from the IMI to assess participants' interest and enjoyment during the study. • Mental Demand: A single question measuring the cognitive effort required of participants.</p><p>[Figure 4: (a) Learning and (b) Accuracy (marginal means) by condition.]</p><p>To assess the effects of the experimental conditions on learning, accuracy, and subjective measures, we employed analysis of covariance (ANCOVA). For human learning, ANCOVA was applied to the average post-intervention correctness per participant, with pre-test performance as a covariate and condition as a fixed factor. A Shapiro-Wilk test was conducted on the residuals to check the normality assumption, which was not violated (W = .993, p = .137). Holm-Bonferroni corrections <ref type="bibr">[44,</ref><ref type="bibr">83]</ref> were used to adjust for multiple comparisons across our eight hypotheses and planned analyses related to learning. Adjusted p-values are reported wherever a correction was applied. For accuracy, we again used ANCOVA, this time on the average correctness during the intervention. Pre-test performance was included as a covariate due to its significant correlation with performance on the intervention questions, while condition was treated as a fixed factor. Subjective measures were analyzed using ANOVA, with condition as the fixed factor. Post-hoc pairwise comparisons between conditions were corrected using the Holm-Bonferroni method to account for multiple hypotheses. Throughout the results, we report effect sizes using Cohen's d along with 95% confidence intervals. 
Effect sizes and accompanying confidence intervals provide valuable information, particularly when interpreting results where we hypothesize no significant differences. When the confidence interval for an effect size includes 0, it suggests that the true effect could be negligible or even nonexistent <ref type="bibr">[24,</ref><ref type="bibr">58,</ref><ref type="bibr">94]</ref>. For correlations, Pearson's r is provided.</p></div>
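To make the analysis pipeline concrete, here is a rough sketch using statsmodels and SciPy; the data frame, its column names, and the p-values fed to the Holm-Bonferroni step are toy placeholders rather than the study's data or results:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests
from scipy import stats

# Toy stand-in data: one row per participant with pre/post accuracy and condition.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "condition": rng.choice(["no AI", "unilateral", "contrastive predicted",
                             "contrastive after", "contrastive random"], n),
    "pre": rng.uniform(0, 1, n),
})
df["post"] = 0.3 + 0.4 * df["pre"] + rng.normal(0, 0.1, n)

model = smf.ols("post ~ pre + C(condition)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))        # ANCOVA table (pre-test as covariate)
print(stats.shapiro(model.resid))             # normality check on the residuals

# Holm-Bonferroni adjustment over a family of planned comparison p-values.
raw_p = [0.0001, 0.004, 0.015, 0.03]          # placeholder values
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
```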
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Main results</head><p>7.1.1 Human Learning. Main results for learning are depicted in Figure <ref type="figure">4a</ref>. We report adjusted p-values, corrected with Holm-Bonferroni to account for multiple comparisons. As hypothesized (H-L1a &amp; H-L1b), participants experienced statistically significantly greater learning in the contrastive predicted (M = 0.47, F(1, 209) = 38.62, p = 0.00004, d = 0.65 [0.37, 0.94]) and contrastive after (M = 0.43, F(1, 216) = 26.68, p = 0.006, d = 0.47 [0.19, 0.74]) conditions compared to participants in the no AI condition (M = 0.32).</p><p>Participants in the contrastive random condition also showed significantly higher gains than those in the no AI condition (M = 0.41, F(1, 230) = 23.42, p = 0.02, d = 0.40 [0.13, 0.67]). Conversely, participants in the unilateral condition did not significantly improve their learning compared to no AI (M = 0.39, F(1, 222) = 29.66, p = n.s., d = 0.30 [0.03, 0.57]).</p><p>Comparing the contrastive conditions with a sensible foil to unilateral explanations, as hypothesized (H-L2a), participants in the contrastive predicted condition (M = 0.47) learned statistically significantly more than participants in the unilateral condition (M = 0.39, F(1, 260) = 40.99, p = 0.02, d = 0.35 [0.11, 0.60]). However, the difference between contrastive after (M = 0.43) and unilateral explanations was not significant (F(1, 267) = 33.05, p = n.s., d = 0.16 [-0.08, 0.40]), not lending support to H-L2b.</p><p>Within the contrastive conditions, participants in the contrastive predicted condition demonstrated greater learning (M = 0.47) compared to those in the contrastive random condition (M = 0.41). However, this difference was only marginally significant (F(1, 268) = 31.82, p = 0.09, d = 0.26 [0.02, 0.50]), offering partial support for hypothesis H-L3. Addressing research question RQ-L1, the contrastive predicted condition did not result in significantly different learning compared to the contrastive after condition (M = 0.43, F(1, 254) = 41.44, p = n.s., d = 0.18 [-0.07, 0.43]).</p><p>[Figure: Overreliance (marginal means) for the AI-assisted conditions.]</p><p>7.1.2 Accuracy and Overreliance. Figure 4b summarizes the results of human accuracy on the decision task across the different conditions. 
As hypothesized (H-A1a &amp; H-A1b), the accuracy of participants in the contrastive predicted (M = 0.56, F(1, 260) = 20.04, p = n.s., d = -0.15 [-0.39, 0.10]) and contrastive after (M = 0.57, F(1, 267) = 9.13, p = n.s., d = -0.08 [-0.32, 0.16]) conditions was not significantly different from that of participants in the unilateral condition (M = 0.58).</p><p>While participants improved their performance on the task significantly on average when they received AI support (M = 0.56) compared to receiving no AI support (M = 0.31, F(1, 627) = 195.32, p &lt; 0.0001), their performance also significantly degraded when AI suggestions were suboptimal (M = 0.14) compared to receiving no support (M = 0.29, F(1, 627) = 36.34, p &lt; 0.0001). Note that the different means for no AI support (M = 0.31, M = 0.29) in this analysis occur because we split the performance of participants in the no AI condition based on whether the AI, had it been provided, would have been correct or incorrect, to allow a fairer comparison with the other conditions, which received incorrect suggestions for only a subset of questions.</p><p>In situations when the AI provided a suboptimal recommendation, participants in the contrastive predicted (M = 0.58, F(1, 260) = 6.53, p = n.s., d = 0.03 [-0.22, 0.27]) and contrastive random (M = 0.54, F(1, 281) = 3.83, p = n.s., d = -0.12 [-0.35, 0.12]) conditions exhibited similar overreliance to participants in the unilateral condition (M = 0.58), addressing RQ-A1.</p><p>Similarly, presenting contrastive explanations immediately (contrastive predicted) resulted in similar overreliance (M = 0.58) compared to presenting contrastive explanations after a decision was made (contrastive after) (M = 0.59, F(1, 254) = 10.61, p = n.s., d = -0.01 [-0.25, 0.24]).</p><p>[Figure 6: Subjective measures (perceived competence, autonomy, relatedness to AI, task enjoyment, mental demand) by condition.]</p><p>7.1.3 Subjective Experience. Subjective results are summarized in Figure <ref type="figure">6</ref>.</p><p>Condition was a significant predictor of perceived competence (F(4, 618) = 6.40, p &lt; 0.00005). A Holm-Bonferroni corrected post-hoc test revealed that participants in the contrastive predicted, contrastive random, and unilateral conditions reported significantly higher competence compared to those in the contrastive after and no AI conditions.</p><p>Perceived autonomy (i.e., choice) was also significantly predicted by condition (F(4, 618) = 8.85, p &lt; 0.00001). 
A Holm-Bonferroni corrected post-hoc test revealed that participants in the contrastive predicted, contrastive random, and no AI conditions perceived significantly higher autonomy compared to those in the unilateral and contrastive after conditions.</p><p>Condition was also a significant predictor of relatedness to AI (computed only for conditions involving AI) (F(3, 533) = 9.02, p &lt; 0.00001), with a Holm-Bonferroni post-hoc test revealing that participants in the contrastive after condition felt significantly less related to the AI compared to those in the other AI conditions.</p><p>Task enjoyment/interest was not significantly predicted by condition (F(4, 618) = 1.66, p = n.s.), and neither was mental demand (F(4, 618) = 0.47, p = n.s.).</p><p>Across measures, the subjective results support H-S2, showing that contrastive explanations with a predicted foil led to significantly higher perceptions of competence, autonomy, and relatedness to the AI compared to the contrastive after condition. Additionally, our findings partially support H-S1: while contrastive explanations with a predicted foil significantly increased perceived autonomy compared to unilateral explanations, no significant differences were observed for competence and relatedness to the AI.</p><p>7.1.4 Subjective vs. objective measures. Figure <ref type="figure">7</ref> shows the relationship between subjective experience and objective outcomes across conditions with AI support (i.e., the no AI condition was not included in the analysis). Our analysis revealed that there was no correlation between actual learning and competence, autonomy, or mental demand, and that actual learning was very weakly inversely correlated with relatedness to AI (r = -0.08, p = 0.06) and task enjoyment (r = -0.09, p = 0.04). Accuracy was significantly positively correlated with relatedness to AI (r = 0.26, p &lt; 0.0001), a construct that also included questions about trust in AI, and it was significantly negatively correlated with perceived autonomy (r = -0.15, p = 0.0003). Similarly, overreliance was significantly positively correlated with relatedness to AI (r = 0.22, p &lt; 0.0001), and significantly negatively correlated with perceived autonomy (r = -0.15, p = 0.0006). In addition, overreliance was significantly positively correlated with task enjoyment (r = 0.1, p = 0.02).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Audit for intervention-generated inequalities</head><p>Intervention-generated inequalities occur when an intervention, while beneficial on average, disproportionately benefits some groups over others <ref type="bibr">[63]</ref>. Disaggregating results by relevant demographics or variables can help uncover these disparities. Informed by prior research in AI-assisted decision-making <ref type="bibr">[11,</ref><ref type="bibr">33]</ref>, we conduct a self-audit and examine whether contrastive explanations, introduced as interventions to enhance human decision-making skills, benefit different groups equally. Previous studies have shown that individual differences in information processing can significantly impact the effectiveness of AI support and interventions, particularly for cognitively demanding outcomes like learning. One individual difference that may affect the effectiveness of our interventions is Need for Cognition (NFC), a stable trait that reflects an individual's motivation to engage in deep thinking and information processing <ref type="bibr">[17]</ref>. NFC has been consistently identified as a predictor of performance in cognitive tasks such as problem-solving and decision-making <ref type="bibr">[18]</ref>. In the context of AI-assisted decision-making, NFC has been found to influence whether cognitive forcing reduces overreliance on AI <ref type="bibr">[14]</ref> and how effectively individuals learn from AI assistance <ref type="bibr">[15,</ref><ref type="bibr">33]</ref>.</p><p>[Figure 8: Auditing for intervention-generated inequalities: learning (marginal means) disaggregated by (a) Need for Cognition (NFC) and (b) Actively Open-minded Thinking (AOT). Error bars represent one standard error. Significance levels after Holm-Bonferroni correction are indicated by: * p &lt; 0.05.]</p><p>Another important individual difference that we reasoned would be particularly relevant for interventions that require consideration of multiple viewpoints is Actively Open-Minded Thinking (AOT). People high in AOT are more likely to critically evaluate new evidence, weigh it against their existing beliefs, take sufficient time to solve problems, and carefully consider others' opinions when forming their own <ref type="bibr">[6,</ref><ref type="bibr">39]</ref>. We investigate whether individuals with varying levels of AOT benefit differently from contrastive explanations, which provide alternative "viewpoints" for consideration.</p><p>Figures <ref type="figure">8a</ref> and <ref type="figure">8b</ref> depict results disaggregated by NFC and AOT. We did not find any significant differences in the effectiveness of (any of the) contrastive explanations for people with different levels of NFC (for detailed analyses, see Appendix, Table <ref type="table">2</ref>). However, our findings reveal a notable contrast between the AOT groups: participants with high AOT benefited significantly more from the contrastive predicted condition (M = 0.52) compared to those with low AOT (M = 0.42; F(1, 122) = 6.67, p = 0.01, d = 0.47 [0.11, 0.84]). 
Similarly, the contrastive random condition was more effective for individuals with high AOT (M = 0.46) than for those with low AOT (M = 0.42; F(1, 142) = 3.77, p = 0.05, d = 0.34 [-0.01, 0.68]). See Table <ref type="table">1</ref> for the non-significant conditions. These findings uncover AOT as a relevant individual difference to consider in AI-assisted decision-making and reveal that contrastive explanations may unevenly impact individuals, offering greater advantages to those with higher AOT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">DISCUSSION</head><p>We investigated whether AI decision support systems that account for the decision-maker's mental model of the task and explain their misconceptions can simultaneously enhance decision accuracy and promote the development of independent decision-making skills.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">On the effectiveness of contrastive explanations in improving human-AI decision-making outcomes</head><p>As expected, our results showed that participants learned significantly more with contrastive explanations with a predicted foil compared to unilateral explanations (H-L2a) or no AI support (H-L1a) (Figure <ref type="figure">4a</ref>). Moreover, also as hypothesized (H-A1a), this improvement in learning was achieved without sacrificing accuracy: participants completing the task with contrastive explanations with a predicted foil were as accurate as their counterparts who received unilateral explanations (Figure <ref type="figure">4b</ref>). Additionally, participants in the contrastive explanations with predicted foil condition reported significantly greater perceived autonomy (but not competence or relatedness to AI) during the task compared to those in the unilateral condition, providing partial support for H-S1.</p><p>The contrastive after condition, where participants received contrastive explanations after making an initial decision (inputted foil), led to significant learning gains compared to receiving no AI support (lending support to H-L1b) but not significantly different learning compared to unilateral explanations (not supporting H-L2b). As expected (H-A1b), participants' accuracy in the contrastive after condition was not significantly different from that of those in the unilateral condition.</p><p>Overall, our research provides compelling evidence that contrastive explanations with predicted foils significantly enhance decision-making skills without sacrificing decision accuracy compared to unilateral explanations, which remain the default explanation design in AI-powered decision support. Our study is the first to demonstrate that even when AI offers decision recommendations (rather than explanations alone <ref type="bibr">[15,</ref><ref type="bibr">33]</ref>), users can still cognitively engage with its content and improve their learning about the task when that content is engaging. This finding opens new possibilities for optimizing AI decision-support systems by intervening not only at the interaction level, as previous work suggests <ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[105]</ref>, but also at the content level of the explanations themselves to improve human-AI decision-making outcomes.</p><p>Lastly, while contrastive explanations with a predicted foil improved human decision-making skills, we do not think they are a panacea for human-AI decision-making. For example, our results showed that contrastive explanations (as well as unilateral explanations) still resulted in significant overreliance on AI. Also, they were significantly more effective for people high in AOT (who are inherently driven to consider multiple viewpoints) compared to those low in AOT. Instead, we believe that contrastive explanations are useful when shown in the right situations, such as when the AI is confident about its decision, and to people who benefit from them (e.g., those high in AOT). 
As such, these explanations expand the suite of human-AI interaction techniques that can be adaptively selected in appropriate situations to optimize human-AI decision-making outcomes, like decision accuracy and human learning <ref type="bibr">[8,</ref><ref type="bibr">15,</ref><ref type="bibr">93]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">What have we learned about the design of contrastive explanations?</head><p>Our results provide evidence about which aspects of contrastive explanations matter for objective and subjective outcomes in human-AI decision-making. First, as hypothesized (H-S2), our findings show that interaction design matters for subjective experience: contrastive explanations are as effective on objective measures when the foil is predicted as when the foil is inputted (the contrastive after condition), even though the inputted foil is the "perfect" comparison. However, consistent with prior research <ref type="bibr">[11,</ref><ref type="bibr">32]</ref>, our results show that providing contrastive explanations after people make their own decisions (input their foil) results in a significantly worse subjective experience, even if that advice engages with their own input, as in the contrastive after condition. We found no differences in subjective experience between contrastive explanations with a predicted foil and those with a random foil, suggesting that the contrastive design, applied before a decision is made, is perceived favorably regardless of the foil's quality.</p><p>Second, our results provide suggestive evidence that the quality of the foil matters for improving learning as the objective outcome of the interaction. When contrastive explanations are presented at decision-making time, a high-quality foil, such as in contrastive predicted, resulted in greater learning on average compared to a randomly selected foil, although the difference was only marginally significant, partially supporting H-L3. We believe that one of the reasons why the difference between these two conditions is not more pronounced in our study is that even a "random" foil in our setting is relatively reasonable. A randomly selected exercise from the list still addresses at least part of the needs or preferences of the fictitious character, rendering it a choice worth considering. We believe that in different situations, such as medical treatment decisions, where the choices may consist of a wide variety of treatments for a wide array of diseases, a randomly selected choice would likely be obviously ineffective or harmful, and thus a waste of cognitive resources for the clinician to consider. In addition, we believe there is room to further improve the quality of the predicted foil. In our implementation, the foil was generated using a single model that predicted the average human response across all decision-makers. We believe that employing personalized models, which capture each individual's unique mental model of the task, could result in even more accurate foils and, consequently, lead to greater learning gains. Our analysis of the participants' responses used to train the human model revealed high variability in exercise choices across participants (Appendix A.2.2), further supporting the need for personalized models. Future research should investigate how to best fine-tune models to individuals and assess the added value of personalized models compared to average human models for enhancing downstream human-AI decision outcomes.</p><p>In this study, we sought to deepen our understanding of the timing of contrastive explanations and the impact of foil quality. We experimented with a simple, intuitive design in which the foil represented the choice of many people, while the fact reflected the AI's suggestion in the user interface. 
However, contrastive reasoning can be conveyed in various other forms. For example, two conversational agents, one advocating for the human model's choice and the other for the AI's, could engage in a dialogue, allowing the decision-maker to assess which agent's reasoning is more compelling. Alternatively, designs could focus solely on the contrastive dimensions, rather than the fact and foil, by highlighting aspects of the decision that the human decision-maker may be overlooking. This approach could provide insight as intermediate support without offering a direct recommendation (e.g., cardio supports weight loss goals). Having demonstrated the effectiveness of one human-AI contrastive design in promoting learning, we believe future research should explore a wider range of design possibilities for representing human-AI misalignment in even more impactful ways.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">What have we learned about the effects of contrastive explanations on overreliance?</head><p>Evidence from prior work suggested that presenting more than one AI suggestion (i.e., a "second opinion") to people may reduce their overreliance, as it makes them more likely to consider alternatives <ref type="bibr">[5,</ref><ref type="bibr">64]</ref>. Our results showed that participants in the contrastive explanations conditions (with predicted or random foils) exhibited similar rates of overreliance on AI suggestions as those in the unilateral condition (RQ-A1). We believe that we may not be observing the beneficial effect of a second opinion in our study because, in situations when the simulated AI provided incorrect recommendations (i.e., when the "fact" was a suboptimal choice), the foil was an even worse choice. Therefore, participants were primed to contrast two suboptimal choices and resorted to the better choice of the two. Future work should explore whether contrastive explanations would still result in similar overreliance when the foil is a better alternative than the fact. Interestingly, we also found that participants' overreliance rate on AI in the contrastive after condition was similar to that in the unilateral condition. This contrasts with prior research showing that providing unilateral explanations after an initial decision reduces overreliance <ref type="bibr">[14,</ref><ref type="bibr">36]</ref>, as people are less likely to follow incorrect AI advice once they have made a decision. In our study, because contrastive explanations directly addressed participants' decisions and provided evidence as to why their choice was inferior to the AI's, they seemed more persuasive, potentially diminishing the positive effect of cognitive forcing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.4">What have we learned about intrinsic motivation in AI-assisted decision-making?</head><p>We measured participants' perceived competence, autonomy, and relatedness to AI as psychological needs underpinning individuals' intrinsic motivation for a task. Our results demonstrate that both interaction and explanation design significantly impact these constructs. First, as hypothesized (H-S2), we found that the contrastive after condition, in which contrastive explanations "critiqued" individuals' inputted answers and presented evidence that the AI's choice was superior, led to significantly lower perceived competence, autonomy, and relatedness to AI compared to situations in which contrastive explanations were presented before a decision was made. Second, our results demonstrated that contrastive explanations provided before a decision (whether using a predicted or a random foil), which presented two decision choices, led to significantly higher perceived autonomy in task completion (comparable to participants who received no AI support) than unilateral explanations that offered only a single option. Our analysis of intrinsic motivation constructs and objective outcomes revealed that actual learning was not correlated with perceived competence. Increased perceived autonomy was correlated with reduced overreliance but also with lowered accuracy, while stronger perceptions of relatedness to the AI were correlated with greater overreliance on AI and higher accuracy.</p><p>These findings suggest that the design of AI support can significantly influence people's intrinsic motivation toward a task, as well as objective outcomes such as accuracy and overreliance. We believe that when developing new AI-assisted decision-making systems, researchers should carefully consider and measure how these designs affect people's intrinsic motivation for the task in addition to the objective outcomes of the interaction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.5">Generalizability &amp; Limitations</head><p>We conducted a single controlled experiment with a single task and with crowdworkers. While prior research on AI-assisted decision-making suggests that experts often exhibit similar behavior to non-experts when relying on AI systems <ref type="bibr">[34]</ref>, we do not know whether this holds for learning from the AI about a task within their expertise. Jacobs et al. <ref type="bibr">[46]</ref> show that clinicians would prefer a system that explains why the AI's choice differed from established clinical guidelines, which suggests they may be open to learning from the AI. Our task choice had inherent learning opportunities (e.g., facts about exercises). Learning may not be as pronounced in some tasks, such as hiring, where opportunities for learning exist but are sparser. Moreover, we assessed only the short-term effects of the explanation designs on learning. The long-term benefits, and whether the observed learning gains will persist over time, remain unstudied. Further research is needed to understand how generalizable our findings are to other tasks, domains, and settings.</p><p>We believe that our contrastive explanation framework can be effectively applied to a wide range of tasks and settings. Its modular design allows for flexible adaptation based on specific contexts. What we called an expert model can, in a new domain, be replaced by the predictive model that drives the decision recommendations. More flexibility exists in how the human model is constructed to generate sensible foils. If data exist on how human decision-makers made decisions in a domain, those data can be used to train the human model. In domains with established guidelines, where AI is introduced to enable more nuanced decisions than were previously possible, the foil could be derived from those guidelines, enabling a comparison between established practices and AI-based predictions. Alternatively, the foil could reflect a personalized human model, such as one trained on an individual clinician's past choices. The other modules can also be customized according to the domain. For instance, the contrast module could compute pixel gradients or concept activations <ref type="bibr">[54]</ref> that highlight differences between the fact and the foil. Similarly, the presentation module can be customized to suit the task, such as by employing tailored visualization techniques. However, we believe that contrastive explanations, and by extension our framework, are most valuable in multiclass classification or ranking scenarios, where the foil is less obvious than in binary decision contexts.</p><p>An important consideration about our work is that we chose to implement the presentation module with a large language model (LLM). We used the LLM to turn the scaffold produced by the rest of the modules into a natural language explanation, while providing small gaps in the template for it to fill with domain facts. We believe the approach of constraining fact generation within the bounds of more trusted predictive models may be useful in certain settings, such as ours, but may not generalize to expert-level domains where the LLM may not have the nuance to fill in the gaps. Moreover, we iteratively arrived at prompts (included in the Appendix) that produced explanations with almost no hallucinations. 
The first author of the paper reviewed the generations for all the characters included in the experiment, finding that the LLM only occasionally mixed up characters' indoor/outdoor preferences, with no other major hallucinations. However, such manual review cannot be scaled. We believe that our framework can be extended to include a verification step for the generated explanations. For example, multi-agent frameworks <ref type="bibr">[101]</ref> can be used, with additional agents reviewing the generated explanations.</p><p>Another limitation of our study is that, in order to control the AI's mistakes, we chose to simulate the AI. We introduced errors in four randomly selected questions during the intervention phase, where the AI's suggestion was generated by the human model rather than the expert model. This approach may have contributed to overreliance on the AI, as the wrong AI suggestion was a likely human choice.</p><p>Finally, in this study, we investigated the impact of explanation design on the learning of factual information. While our study was intentionally structured so that the AI consistently supported its reasoning with truthful facts, there remains a potential, yet unstudied (to the best of our knowledge), risk of inadvertent outcomes or misuse of designs that support learning. This risk could manifest as the propagation of misinformation if the AI were to provide incorrect explanations. To address this concern, we strongly advocate for the use of contrastive explanations, or any form of explanation, in conjunction with an intelligent interaction layer (e.g., as in <ref type="bibr">[15,</ref><ref type="bibr">75,</ref><ref type="bibr">92]</ref>). Such a layer would ensure that explanations are delivered exclusively when the predictive model demonstrates high confidence in its recommendations and reasoning, thereby reducing the likelihood of overreliance and minimizing the risk of misinformation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">CONCLUSION</head><p>In this work, we investigated whether explanation designs that account for human reasoning can improve humans' decision-making skills on the task in AI-assisted decision-making. We introduced a framework for generating human-centric contrastive explanations by showing the difference between the AI's reasoning and a likely human response for the same task. Our results demonstrated that contrastive explanations significantly enhanced human decision-making skills compared to unilateral explanations, the default method of AI support, without compromising accuracy. Offering hope amid growing deskilling concerns, our work suggests that AI support that accounts for human mental models of the task can be a promising approach toward systems that augment and upskill decision-makers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.6 Results: Additional Analysis for Learning</head><p>To gain deeper insights into people's learning throughout the study, we conducted an additional analysis of the no AI and contrastive after conditions, focusing on participants' unassisted initial answers (Figure <ref type="figure">12</ref>). Our results indicate that the initial choices participants provided in the contrastive after condition significantly improved over the course of the study (r = -0.15, 95% CI [-0.18, -0.11], p &lt; .0001; CI based on 1000 bootstrap samples). Even when participants provided incorrect initial guesses, those guesses progressively ranked higher according to the expert weights over time (r = -0.15, 95% CI [-0.19, -0.11], p &lt; .0001). Participants in the no AI condition did not exhibit significant overall improvement over time (r = -0.01, 95% CI [-0.05, 0.03], p = n.s.). However, a very weak correlation indicating improvement was observed for instances where they provided incorrect choices (r = -0.05, 95% CI [-0.11, 0.002], p = 0.04).</p><p>Table 2: ANCOVA results by condition for NFC groups, showing marginal means (SE), significance (F-statistic, p-value), and effect size (Cohen's d with 95% confidence intervals).</p><p>Received 12 September 2024; revised 10 December 2024; accepted 16 January 2025</p></div>
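A small sketch of the bootstrap procedure behind such correlation confidence intervals; the inputs are hypothetical (e.g., x could be the trial index and y the expert-weight rank of a participant's initial choice):

```python
import numpy as np
from scipy import stats

def bootstrap_r_ci(x, y, n_boot=1000, seed=0):
    """95% bootstrap CI for Pearson's r between paired observations x and y."""
    x, y = np.asarray(x), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample pairs with replacement
        rs.append(stats.pearsonr(x[idx], y[idx])[0])
    return np.percentile(rs, [2.5, 97.5])
```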
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.7 Results</head></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>In this paper, we use the terms "improving human learning" and "improving decision-making skills" interchangeably.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>data.census.gov</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>data.cdc.gov</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>bls.gov</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>VO2max = 48.392 − 0.088(age) + 12.335(sex; men = 1, women = 0) − 0.386(BMI) + 0.693(PA)</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>The first author manually reviewed the generated explanations to verify whether the LLM introduced any hallucinations; we elaborate on this process in the limitations section.</p></note>
		</body>
		</text>
</TEI>
