<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Mix and Match: Characterizing Heterogeneous Human Behavior in AI-assisted Decision Making</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the Twelfth AAAI Conference on Human Computation and Crowdsourcing</publisher>
				<date>10/15/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10588068</idno>
					<idno type="doi">10.1609/hcomp.v12i1.31604</idno>
					<title level='j'>Proceedings of the AAAI Conference on Human Computation and Crowdsourcing</title>
<idno>2769-1330</idno>
<biblScope unit="volume">12</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhuoran Lu</author><author>Syed Hasan Amin_Mahmoo</author><author>Zhuoyan Li</author><author>Ming Yin</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>AI-assisted decision-making systems hold immense potential to enhance human judgment, but their effectiveness is often hindered by a lack of understanding of the diverse ways in which humans take AI recommendations. Current research frequently relies on simplified, "one-size-fits-all" models to characterize an average human decision-maker, thus failing to capture the heterogeneity of people's decision-making behavior when incorporating AI assistance. To address this, we propose Mix and Match (M&M), a novel computational framework that explicitly models the diversity of human decision-makers and their unique patterns of relying on AI assistance. M&M represents the population of decision-makers as a mixture of distinct decision-making processes, with each process corresponding to a specific type of decision-maker. This approach enables us to infer latent behavioral patterns from limited data of human decisions under AI assistance, offering valuable insights into the cognitive processes underlying human-AI collaboration. Using real-world behavioral data, our empirical evaluation demonstrates that M&M consistently outperforms baseline methods in predicting human decision behavior. Furthermore, through a detailed analysis of the decision-maker types identified in our framework, we provide quantitative insights into nuanced patterns of how different individuals adopt AI recommendations. These findings offer implications for designing personalized and effective AI systems based on the diverse landscape of human behavior patterns in AI-assisted decision-making across various domains.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The increasing integration of artificial intelligence (AI) into people's decision-making processes across diverse domains, from entertainment to healthcare and finance (De Mantaras and Arcos 2002; Shaheen 2021; Cao 2022), has initiated a new era of human-AI collaboration. Combining AI's competence and humans' agency, the paradigm of AI-assisted decision-making, where AI models provide recommendations and humans make the final decisions, holds immense potential to enhance human judgment and improve decision outcomes <ref type="bibr">(Lysaght et al. 2019;</ref><ref type="bibr">Lai et al. 2021)</ref>. However, realizing this potential hinges on a deep understanding of how humans interact with and adopt AI-generated recommendations <ref type="bibr">(Steyvers and Kumar 2023)</ref>.</p><p>Although a growing body of research has focused on quantitatively describing how human decision-makers respond to AI recommendations, these studies suffer from a few limitations. Some approaches, particularly those rooted in deep learning, treat the problem as a mere prediction task without considering the cognitive underpinnings and interpretability of the decision-making process <ref type="bibr">(Hartford, Wright, and Leyton-Brown 2016)</ref>. Despite their high predictive performance, these models provide little insight into the underlying reasons behind people's decision behavior. While some recent work has attempted to incorporate cognitive processes for characterizing behavioral patterns in AI-assisted decision-making, these works often rely on an "average" human decision-maker representation <ref type="bibr">(Wang, Lu, and Yin 2022;</ref><ref type="bibr">Tejeda et al. 2022)</ref>. 
This simplification overlooks the inherent diversity in people's decision-making patterns under AI assistance, potentially resulting from individual preferences, risk tolerances, and cognitive styles <ref type="bibr">(Franken and Muris 2005;</ref><ref type="bibr">Appelt et al. 2011)</ref>. Neglecting this diversity impedes the development of personalized AI assistance and restricts our ability to fully harness AI's potential in augmenting human decision-making.</p><p>To address these gaps, we propose Mix and Match (M&amp;M), a computational framework designed to model the diverse ways in which humans interact with and adopt AI recommendations. M&amp;M explicitly acknowledges the heterogeneity of human decision-makers. The framework operates in two main stages: "Mix" and "Match". In the "Mix" stage, M&amp;M considers K distinct decision-making processes, each representing a different type of decision-maker. Specifically, each decision process captures the cognitive process that the corresponding type of decision-maker goes through to generate their AI-assisted decisions: the decision-maker first forms their independent judgments without AI assistance, and then aggregates their independent judgments with the AI model's recommendations to arrive at a final decision after computing the utilities of different possible actions. We assume that each decision made by an individual is influenced by a probability distribution over these K types, with a latent variable indicating the specific types of decision-makers responsible for that decision. Thus, during this stage, given a set of AI-assisted decisions made by a population of decision-makers, we jointly learn the latent variables and parameters for each decision-maker type. Next, in the "Match" stage, given a new individual decision-maker, we estimate the likelihood that each decision-maker type is responsible for this individual's final decision on a particular decision task, and predict their final decision accordingly. 
Note that M&amp;M leverages the varied decision-making behavior across the population to uncover underlying patterns. Additionally, M&amp;M acknowledges that the same individual may exhibit different decision-making behaviors depending on the context or task, providing a more nuanced understanding of the dynamic nature of AI-assisted decision-making.</p><p>Using real-world behavioral data collected from diverse decision-making scenarios, our empirical evaluation demonstrates that M&amp;M consistently outperforms baseline methods in predicting human decisions under AI assistance. By accurately characterizing the different types of decision-makers and their unique adoption patterns, our framework offers valuable insights into the cognitive processes underlying human-AI collaboration. For example, our analysis reveals that there exists a range of decision-makers with different perceptions of penalties for incorrect decisions and sensitivities to utility differences in accepting or rejecting AI recommendations. In addition, the majority of decision-makers perceive high penalties for incorrect decisions and exhibit high sensitivity to utility differences. Furthermore, perceptions of penalties for incorrect decisions and sensitivity to utility differences tend to be positively correlated in AI-assisted decision-making. These insights can inform the design of more effective and personalized AI assistance, ultimately leading to improved AI-assisted decision-making outcomes in various domains.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work Empirical Studies in AI-assisted Decision Making</head><p>The increasing use of AI-powered decision aids has spurred a wave of experimental studies aimed at understanding how humans interact with and rely on AI models in decision-making scenarios. Researchers have identified a multitude of factors that can influence people's reliance on AI in decision-making at the population level, including the model's accuracy <ref type="bibr">(Yin, Wortman Vaughan, and Wallach 2019;</ref><ref type="bibr">Lai and Tan 2019)</ref>, confidence <ref type="bibr">(Zhang, Liao, and Bellamy 2020;</ref><ref type="bibr">Rechkemmer and Yin 2022)</ref>, the type and presentation of AI explanations <ref type="bibr">(Yang et al. 2020;</ref><ref type="bibr">Bansal et al. 2021b)</ref>, individuals' mental models of AI <ref type="bibr">(Bansal et al. 2019a,b)</ref>, the degree of agreement between human judgment and AI recommendations <ref type="bibr">(Lu and Yin 2021)</ref>, and more.</p><p>Beyond factors existing at the population level, recent studies have also found that individual differences significantly affect how humans take AI recommendations. For instance, an individual's personality has been found to affect their trust in and advice-taking from AI <ref type="bibr">(Sharan and Romano 2020)</ref>. As another example, <ref type="bibr">Matthews et al. (2019)</ref> found that people could activate different mental models when collaborating with AI, thus leading to diverse attitudes towards AI. These studies have revealed a wide array of behavioral patterns exhibited by decision-makers in AI-assisted contexts, highlighting the importance of characterizing the diversity in human behavior.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Modeling Human Decision Behavior</head><p>Research on modeling human decision behavior is well established in economics and psychology, as decision-making is an abstraction of a wide range of human behaviors <ref type="bibr">(Wang and Ruhe 2007;</ref><ref type="bibr">Montgomery 1983)</ref>. Building on this, a large number of theories and models have been developed to capture how people make decisions. For instance, the expected utility theory links behavior with the utilities behind decisions <ref type="bibr">(Schoemaker 1982)</ref>. Research further reveals that factors including task context and individual differences can lead people to calculate the utilities of their actions in different ways <ref type="bibr">(Schoemaker 2013)</ref>.</p><p>With AI-based decision aids becoming more prevalent, the community started to investigate computational modeling of human behavior in AI-assisted decision-making, with a focus on characterizing and predicting when decision-makers will solicit or rely on AI recommendations <ref type="bibr">(Pynadath, Wang, and Kamireddy 2019;</ref><ref type="bibr">Bansal et al. 2021a;</ref><ref type="bibr">Li, Lu, and Yin 2023;</ref><ref type="bibr">Kumar et al. 2021;</ref><ref type="bibr">Wang, Lu, and Yin 2022;</ref><ref type="bibr">Guo et al. 2024;</ref><ref type="bibr">Strickland et al. 2024)</ref>. Drawing inspiration from economic theories (e.g., Cumulative Prospect Theory <ref type="bibr">(Tversky and Kahneman 1992;</ref><ref type="bibr">Allais 1953</ref>)) or cognitive modeling exemplified by sociocognitive constructs <ref type="bibr">(Askarisichani et al. 2022)</ref>, previous work has sought to construct computational models capable of explaining human decision-making with modern AI systems under uncertainty. 
These models have also been used to improve AI-assisted decision-making by enabling AI systems to adapt their recommendations based on human behaviors or by designing interfaces that adjust how AI recommendations are presented depending on how people behave <ref type="bibr">(Ma et al. 2023;</ref><ref type="bibr">Amin, Lu, and Yin 2024)</ref>. However, most of these studies model decision behaviors using an "average" human decision-maker to represent the entire population, overlooking the diversity in decision-making patterns that can result from individual differences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Problem Setup</head><p>We focus on the AI-assisted decision-making setting, where a human decision-maker (DM) completes a sequence of $T$ tasks, receiving a decision recommendation from an AI model on each task but making the final decision by themselves. This setting is particularly prevalent in high-stakes domains such as medical diagnosis, where the human retains the ultimate authority to make the final decision. Each decision-making task $t \in \{1, \ldots, T\}$ is characterized by features $x^t \in \mathbb{R}^n$ and an associated correct decision $y^t \in \mathcal{Y}$. For illustrative purposes and without loss of generality, our study centers on binary classification tasks (i.e., $\mathcal{Y} = \{0, 1\}$).</p><p>Under this setup, an AI model first provides a decision recommendation $m(x^t; \theta_m)$ to a human DM, who has their own independent judgment $h(x^t; \theta_h)$ on the same case. The human DM then aggregates the AI's suggestion with their own assessment to arrive at a final team decision $\hat{y}^t$ through a team decision-making model $f(\cdot)$.</p><p>[Figure 1 nodes: Task ($x^t$), AI Model ($\theta_m$), Independent Judgement Model ($\theta_h$), Aggregation Model ($\theta_a$), Type of Decision Maker ($z$), AI Recommendation ($y^t_m, c^t_m$), Independent Judgement ($y^t_h, c^t_h$), Decision ($y^t$).]</p><p>Figure <ref type="figure">1</ref>: The probabilistic model of the generation process of the decision-maker's final decision in AI-assisted decision-making. The shaded node is observed.</p><p>We consider the scenario where the AI decision recommendation comprises two components: a binary decision and the confidence in that decision. 
Formally, $m(x^t; \theta_m) = \{1: P(y^t = 1 \mid x^t),\ 0: P(y^t = 0 \mid x^t)\}$, which can be further used to generate the binary recommendation $\hat{y}^t_m = \arg\max m(x^t; \theta_m)$ and the confidence in that recommendation $c^t_m = \max m(x^t; \theta_m)$.</p><p>The final team decision $\hat{y}^t$ is influenced by the task features $x^t$, the AI model's decision recommendation $\hat{y}^t_m$, the AI model's confidence $c^t_m$, the human DM's initial judgment $\hat{y}^t_h$, and the human DM's confidence $c^t_h$. Since final team performance is often of paramount interest, it is critical to understand the form of the team decision-making model $f(\cdot)$.</p><p>While AI model parameters ($\theta_m$) can be accessible, human judgment and aggregation parameters ($\theta_h$, $\theta_a$) are often challenging to characterize. Prior works often assume "average" human behavior models (i.e., each DM shares the same $\theta_h$, $\theta_a$). Our work aims to address this limitation by formally capturing the diversity among decision-making types, recognizing that effective modeling of AI-assisted decision-making must account for individual differences.</p></div>
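<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small illustration of the setup above, the following sketch splits an AI model's predicted distribution over the two classes into the binary recommendation (the arg max) and the confidence in it (the max). The function name <hi rend="italic">split_recommendation</hi> and its single-probability input are illustrative assumptions, not part of the original study.</p>
<p><![CDATA[
```python
def split_recommendation(p_positive):
    """Turn P(y = 1 | x) into (binary recommendation, confidence).

    Hypothetical helper: the AI model is assumed to expose only the
    probability of the positive class for a binary task.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must lie in [0, 1]")
    # m(x; theta_m) as a distribution over the two classes
    dist = {1: p_positive, 0: 1.0 - p_positive}
    label = max(dist, key=dist.get)  # arg max over classes
    confidence = dist[label]         # max over classes
    return label, confidence


if __name__ == "__main__":
    print(split_recommendation(0.75))
    print(split_recommendation(0.25))
```
]]></p>
<p>With this convention the confidence never falls below 0.5 in the binary case, since it is the larger of the two class probabilities.</p></div>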
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>We propose a novel computational framework, Mix and Match (M&amp;M), to effectively characterize human behavior in AI-assisted decision-making. M&amp;M is a Bayesian approach that models the diverse ways in which humans interact with AI recommendations as a generative process involving a mixture of different types of decision-makers. Figure <ref type="figure">1</ref> illustrates the structure of the proposed model. The framework consists of two main stages: 1. Mix: Modeling decisions as mixture models. In this stage, instead of a single average model, we use a total of $K$ distinct decision-making processes $\{f_1, \ldots, f_K\}$. Each decision is influenced by a probability distribution over these $K$ types, with a latent variable indicating how much each specific type of decision-making process is responsible for that decision trial. 2. Match: Inferring DM types. In this stage, we match a decision trial with a distribution of DM types by inferring how likely each type is to have generated a particular decision given the observed data.</p><p>As a particular realization of the M&amp;M framework, we now introduce how we model the decision-generation process, as well as how model learning and inference can be done in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decision Generation</head><p>Modeling a Single Type of DM. We start by elaborating on how a single type of DM (i.e., DM type $k$) generates the final decision on task $x^t$ given $m(x^t; \theta_m)$. Previous findings in economics and psychology have shown that people's decision-making process consists of multiple steps (Lunenburg 2010). Similarly, it has been suggested that AI-assisted decision-making may also involve multiple steps <ref type="bibr">(Cao, Liu, and Huang 2024)</ref>. Thus, consistent with previous studies, we divide the decision process of each type of DM into three steps: the DM's initial judgment, the AI recommendation, and the DM's aggregated decision.</p><p>Step 1: DM's initial judgment. In the first step, the human DM forms an independent judgment without AI assistance. This judgment is quantified by an independent decision model $h_k(x^t; \theta_{hk})$, which we assume follows the form of a logistic model:</p><p>This choice is consistent with the well-established decision-making literature in economics, where Logit models are widely used to model humans' independent decision-making, especially decisions under uncertainty <ref type="bibr">(Chapman 1984;</ref><ref type="bibr">Lovreglio, Fonzone, and Dell'Olio 2016)</ref>. Logit models and their variations are employed in modeling human behavior in advice-taking and AI-assisted decision-making as well <ref type="bibr">(Tejeda et al. 2022;</ref><ref type="bibr">Li, Lu, and Yin 2024)</ref>.</p><p>The DM's independent judgment on the task is then given by $\hat{y}^t_{hk} = \arg\max h_k(x^t; \theta_{hk})$, with the confidence in this judgment being $c^t_{hk} = \max h_k(x^t; \theta_{hk})$.</p><p>Step 2: AI model's recommendation. 
Given the AI model parameterized by $\theta_m$, we can compute its recommendation consisting of two parts: the prediction $\hat{y}^t_m = \arg\max m(x^t; \theta_m)$, with its confidence in this prediction being $c^t_m = \max m(x^t; \theta_m)$.</p><p>Step 3: Aggregated decision. In the final step, the DM aggregates their own initial judgment and the AI recommendation to generate the final decision. Previous work in AI-assisted decision-making suggests that the cognitive process for DMs to aggregate their initial judgments and AI recommendations could further involve multiple stages <ref type="bibr">(Tejeda et al. 2022;</ref><ref type="bibr">Cao, Liu, and Huang 2024)</ref>. Specifically, to model the aggregation process $g(\cdot)$ while recognizing the decision-maker's goal of maximizing overall utility, we characterize $g(\cdot)$ through the following three stages: confidence estimation, utility calculation, and action selection, as inspired by previous studies <ref type="bibr">(Wang, Lu, and Yin 2022)</ref>.</p><p>First, the DM estimates the likelihood of the AI recommendation being correct by aggregating the confidence of the DM's independent judgment $\hat{y}^t_h$ and the AI's recommendation $\hat{y}^t_m$ together. This is quantified as:</p><p>Intuitively, $c^t_{h+m,k}$ is an average of the DM's confidence in the AI recommendation and the AI's confidence in its recommendation. A higher $c^t_{h+m,k}$ indicates that the DM estimates the AI recommendation to be more likely correct after comparing it with the DM's independent judgment.</p><p>Next, in line with the expected utility theory <ref type="bibr">(Schoemaker 1982)</ref>, we assume the DM estimates the expected utility (EU) of accepting or rejecting the AI recommendation, incorporating a parameter $\beta_k$ that represents their perceived penalty for making a wrong decision:</p><p>After computing the utilities, the human DM needs to select an action to take. 
We assume that the human DM uses a Logit model to compare actions, such that options with higher expected utility are more likely to be chosen. Specifically, the probability for the human DM to accept the AI recommendation is given by a softmax function:</p><p>where the parameter $\delta_k$ indicates the DM's sensitivity to utility differences. Such a Logit model is widely used in economics to characterize people's discrete choices <ref type="bibr">(Adeogun et al. 2008;</ref><ref type="bibr">Train 2009)</ref>.</p><p>With the probability $r^t_k$ of the DM accepting the AI recommendation, we then model the DM's action $a^t_k$ of accepting or rejecting the AI recommendation with a Bernoulli distribution: $a^t_k \sim \mathrm{Bernoulli}(r^t_k)$.</p><p>In summary, the $k$-th decision process involves two sets of parameters: $\theta_{hk}$ captures how the DM forms their independent judgment, and $\theta_{ak} = \{\beta_k, \delta_k\}$ captures how the DM makes the aggregated decision. Together, the $k$-th decision process is quantified by the set of parameters $\Theta_k = \{\theta_{hk}, \theta_{ak}\}$.</p><p>Modeling the Mixture of K Types of DM. With each type of DM parameterized by $\Theta_k$ and a total of $K$ types of DM, we define a parameter set $\Theta = \{\theta_{hk}, \theta_{ak}\}_{k=1}^{K}$ that characterizes a wide range of different DMs. Since the final decision is considered to be a mixture of the $K$ types of DMs, the conditional probability of the final decision is:</p><p>where $Z$ is a latent mixing coefficient matrix with element $z^t_k$ indicating the responsibility of the $k$-th type of DM in decision trial $t$.</p></div>
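<div xmlns="http://www.tei-c.org/ns/1.0"><p>The decision-generation steps described in this section can be sketched in code as follows. This is a minimal sketch under stated assumptions rather than the exact model: the functional forms are illustrative guesses consistent with the prose. Specifically, we assume a logistic independent-judgment model with hypothetical parameters <hi rend="italic">weights</hi> and <hi rend="italic">bias</hi>, a simple average for the aggregated confidence, a reward of 1 for a correct decision and a penalty of $\beta_k$ for a wrong one, and a two-action softmax with sensitivity $\delta_k$.</p>
<p><![CDATA[
```python
import math


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def accept_probability(x, y_m, c_m, weights, bias, beta, delta):
    """Probability that one DM type accepts the AI recommendation.

    All functional forms below are assumptions made for illustration.
    """
    # Step 1: independent judgment, P(y = 1 | x) under a logistic model.
    p1 = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # DM's own confidence that the AI-recommended label y_m is correct.
    c_h = p1 if y_m == 1 else 1.0 - p1
    # Confidence estimation: average of the DM's and the AI's confidence.
    c = 0.5 * (c_h + c_m)
    # Utility calculation: assumed reward 1 if correct, penalty beta if wrong.
    eu_accept = c - beta * (1.0 - c)
    eu_reject = (1.0 - c) - beta * c
    # Action selection: softmax over two actions reduces to a sigmoid.
    return sigmoid(delta * (eu_accept - eu_reject))


def mixture_accept_probability(x, y_m, c_m, dm_types, z):
    """Mixture over DM types; z holds the mixing coefficients for this trial."""
    return sum(
        z_k * accept_probability(x, y_m, c_m, **theta)
        for z_k, theta in zip(z, dm_types)
    )


if __name__ == "__main__":
    types = [
        {"weights": [0.0], "bias": 0.0, "beta": 0.5, "delta": 1.0},
        {"weights": [0.0], "bias": 0.0, "beta": 0.5, "delta": 5.0},
    ]
    print(mixture_accept_probability([1.0], 1, 0.9, types, [0.3, 0.7]))
```
]]></p>
<p>Under these assumptions the utility gap simplifies to $(2c - 1)(1 + \beta_k)$, so a DM with aggregated confidence of exactly 0.5 is indifferent between accepting and rejecting, and larger $\delta_k$ makes the choice more deterministic.</p></div>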
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Learning</head><p>Our objective is to learn the M&amp;M model given a training dataset of decision trials and a set of DMs' final decisions $D = \{d^t, \hat{y}^t\}_{t=1}^{T}$ on these trials. Specifically, each decision trial $d^t$ consists of the decision task $x^t$ and the AI recommendation on this task $m(x^t; \theta_m)$. In total, the parameter space of the model has two parts: the parameters $\Theta$ of the $K$ types of DMs, and a mixing coefficient matrix $Z$.</p><p>With a known $Z$, learning the parameters $\Theta$ of the model involves computing the posterior $P(\Theta \mid D)$. As direct computation is intractable, we leverage variational inference to approximate it using the parameterized distribution $q_\phi(\Theta)$. We aim to minimize the KL divergence between $q_\phi(\Theta)$ and $P(\Theta \mid D)$:</p><p>where $P(\Theta)$ is the prior distribution of $\Theta$ and $P(D)$ is a constant. Specifically, $q_\phi(\Theta_k)$ of the $k$-th type of DM consists of three variational distribution families:</p><p>1. For $\theta_{hk}$ (DM's independent judgment), we use a multivariate normal distribution: $\mathcal{N}(\theta_{hk}; \mu_\phi, \Sigma_\phi)$. 2. For $\theta_{ak} = \{\beta_k, \delta_k\}$ (DM's aggregation model), we use a Beta distribution for $\beta$ (reflecting the bounded nature of the penalty parameter) and a normal distribution with a positive constraint for $\delta$ (reflecting the sensitivity to utility differences):</p><p>We use $\lambda$ to denote all variational parameters in $q_\phi(\Theta)$.</p><p>However, the coefficient matrix $Z$ is an unknown latent variable. Therefore, similar to the approximation of the posterior of $\Theta$, we again leverage variational inference to approximate the distribution of $\{z^t_k\}_{k=1}^{K}$ using a parameterized distribution. Specifically, we use a Dirichlet distribution of order $K$, $\mathrm{Dir}(\alpha^t)$, to model the responsibility of the $K$ types of DMs in each decision trial $t$. 
Without further knowledge, we use $\alpha^t_k = 1/K$ as the prior. Due to the presence of the latent variables, we use the expectation-maximization algorithm to optimize over the variational parameter space $(\lambda, \alpha)$. First, in the E-step, we calculate the posterior of each $z^t_k$ based on the current estimate of the parameters:</p><p>Then, in the M-step, we search for optimal parameter values to maximize the auxiliary function $Q$, i.e., the expectation of the complete data log-likelihood:</p><p>In each M-step, we use gradient descent to update hidden parameters to the values that locally optimize $Q$.</p><p>Determining the number of DM types. In practice, the number of DM types ($K$) is not always accessible. Domain knowledge can sometimes provide a prior for $K$ (e.g., known categories of doctors). However, for more general cases, we leverage the Bayesian Information Criterion (BIC) <ref type="bibr">(Kuha 2004</ref>) to determine the optimal $K$:</p><p>where L is the maximized likelihood of the model, K is the number of parameters (including those for each DM type), and T is the number of trials in the training dataset. We train models with varying numbers of DM types and select the model with the lowest BIC to balance fit and complexity.</p></div>
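<div xmlns="http://www.tei-c.org/ns/1.0"><p>The BIC-based selection of the number of DM types can be illustrated with the standard definition of the criterion, $\mathrm{BIC} = p \ln T - 2 \ln L$, where $p$ is the total number of free parameters, $T$ the number of training trials, and $L$ the maximized likelihood. The sketch below writes the parameter count as <hi rend="italic">num_params</hi> to keep it distinct from the number of DM types; the log-likelihoods and parameter counts in the usage example are hypothetical values chosen only to show the mechanics.</p>
<p><![CDATA[
```python
import math


def bic(log_likelihood, num_params, num_trials):
    """Standard Bayesian Information Criterion (lower is better)."""
    return num_params * math.log(num_trials) - 2.0 * log_likelihood


def select_num_types(fits, num_trials):
    """Pick the number of DM types with the lowest BIC.

    fits maps a candidate number of types to a pair
    (maximized log-likelihood, total number of free parameters).
    """
    scores = {k: bic(ll, p, num_trials) for k, (ll, p) in fits.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    # Hypothetical fits for K = 2, 3, 4 on 1000 training trials.
    fits = {2: (-560.0, 20), 3: (-500.0, 30), 4: (-498.0, 40)}
    best_k, scores = select_num_types(fits, num_trials=1000)
    print(best_k, scores)
```
]]></p>
<p>In this made-up example, moving from three to four types improves the log-likelihood only slightly, so the parameter penalty dominates and three types are selected.</p></div>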
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Inference</head><p>Given a new decision-making trial $d^i$ with decision task $x^i$ and AI recommendation $m(x^i; \theta_m)$, we calculate the probability of the human DM accepting the AI recommendation and predict the DM's final decision on this trial as follows.</p><p>First, with the learned $K$ types of DM, we aim to calculate the latent mixing coefficients of the $K$ types of DMs corresponding to the decision trial. To do so, we first need the parameters $\alpha^i$ of the Dirichlet distribution that generates the latent coefficients. As direct computation is intractable, we use a heuristic method to estimate $\alpha^i$. Intuitively, when two decision trials are similar, DMs are more likely to apply similar decision processes to them. That is, the influence of a training trial on the decision-making process of the current trial increases with its similarity to the current trial. Therefore, we approximate $\alpha^i$ using kernel-weighted parameters:</p><p>where</p><p>$s(\cdot)$ is the Euclidean distance, and $P(\alpha^i)$ is the prior of $\alpha^i$. The mixing coefficient $z^i$ is then obtained by averaging $M$ samples from the distribution $\mathrm{Dir}(\alpha^i)$, with</p><p>Finally, the DM's final decision in trial $i$ is a weighted mixture calculated by Eq. 5.</p></div>
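<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the inference heuristic concrete, the sketch below estimates the Dirichlet parameters of a new trial from similar training trials and then averages Dirichlet samples to obtain mixing coefficients. Several choices here are assumptions rather than the exact estimator: we use a Gaussian kernel over the Euclidean distance with a hypothetical <hi rend="italic">bandwidth</hi> parameter, normalize the kernel weights, and add the prior to the weighted average. Dirichlet samples are drawn by normalizing independent Gamma variates.</p>
<p><![CDATA[
```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kernel_weighted_alpha(x_new, train_x, train_alpha, prior, bandwidth=1.0):
    """Assumed estimator: prior plus kernel-weighted mean of training alphas."""
    weights = [
        math.exp(-euclidean(x_new, x) ** 2 / (2.0 * bandwidth ** 2))
        for x in train_x
    ]
    total = sum(weights)
    if total == 0.0:  # no training trial is close enough: fall back to prior
        return list(prior)
    return [
        prior[j] + sum(w * a[j] for w, a in zip(weights, train_alpha)) / total
        for j in range(len(prior))
    ]


def mean_mixing_coefficients(alpha, num_samples=1000, rng=None):
    """Average num_samples draws from Dir(alpha)."""
    rng = rng or random.Random(0)
    acc = [0.0] * len(alpha)
    for _ in range(num_samples):
        # A Dirichlet draw is a vector of Gamma draws normalized to sum to 1.
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(g)
        for j, gj in enumerate(g):
            acc[j] += gj / s
    return [v / num_samples for v in acc]


if __name__ == "__main__":
    train_x = [[0.0, 0.0], [10.0, 10.0]]
    train_alpha = [[5.0, 1.0], [1.0, 5.0]]
    alpha_new = kernel_weighted_alpha([0.1, 0.0], train_x, train_alpha,
                                      prior=[0.5, 0.5])
    print(alpha_new, mean_mixing_coefficients(alpha_new))
```
]]></p>
<p>In the usage example the new trial lies next to the first training trial, so its estimated parameters and the resulting mixing coefficients favor the first DM type.</p></div>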
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation</head><p>In this section, we evaluate the effectiveness and generalizability of our proposed M&amp;M framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decision Tasks</head><p>In our evaluation, we consider three distinct real-world datasets encompassing diverse decision-making scenarios collected from previous empirical studies of AI-assisted decision-making <ref type="bibr">(Wang, Lu, and Yin 2022;</ref> Li, Lu, and Yin 2024; Vodrahalli et al. 2022): 1. Loan Risk Assessment: This dataset focuses on the task of assessing loan default risk. Participants were presented with loan applicant profiles containing seven features: loan amount, interest rate, repayment period, monthly installment, annual income, credit score, and homeownership status. The AI model provided binary recommendations (default or not) along with confidence scores. 2. Diabetes Prediction: This dataset involves predicting diabetes in patients based on demographic and medical history data. Patient profiles included six features: gender, age, history of heart disease, Body Mass Index (BMI), HbA1c level, and blood glucose level. The AI model offered binary recommendations (diabetes or not) with confidence scores. 3. Income Prediction: The decision task in this dataset is to determine a person's annual income level. Given a profile of a person with seven features (gender, age, education level, marital status, occupation, work type, and working hours per week), participants were asked to decide whether this person's annual income is higher or lower than 50k. The AI model provided binary recommendations (higher or lower than 50k) along with confidence scores. 
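<table><row role="label"><cell>Method</cell><cell cols="3">Loan Risk Assessment</cell><cell cols="3">Diabetes Prediction</cell><cell cols="3">Income Prediction</cell></row>
<row role="label"><cell></cell><cell>NLL &#8595;</cell><cell>Accuracy &#8593;</cell><cell>F1 &#8593;</cell><cell>NLL &#8595;</cell><cell>Accuracy &#8593;</cell><cell>F1 &#8593;</cell><cell>NLL &#8595;</cell><cell>Accuracy &#8593;</cell><cell>F1 &#8593;</cell></row>
<row><cell>Logistic Regression</cell><cell>0.515</cell><cell>0.601</cell><cell>0.740</cell><cell>0.446</cell><cell>0.713</cell><cell>0.744</cell><cell>0.889</cell><cell>0.815</cell><cell>0.896</cell></row>
<row><cell>Random Forest</cell><cell>0.721</cell><cell>0.602</cell><cell>0.715</cell><cell>0.472</cell><cell>0.692</cell><cell>0.738</cell><cell>0.999</cell><cell><hi rend="bold">0.826</hi></cell><cell>0.903</cell></row>
<row><cell>MLP</cell><cell>0.665</cell><cell>0.599</cell><cell>0.724</cell><cell>0.554</cell><cell>0.734</cell><cell>0.751</cell><cell>0.704</cell><cell>0.757</cell><cell>0.854</cell></row>
<row><cell>SVM</cell><cell>0.558</cell><cell><hi rend="bold">0.646</hi></cell><cell>0.651</cell><cell>0.461</cell><cell>0.754</cell><cell>0.758</cell><cell>0.958</cell><cell>0.652</cell><cell>0.753</cell></row>
<row><cell>CPT Utility</cell><cell>0.542</cell><cell>0.611</cell><cell>0.644</cell><cell>0.546</cell><cell>0.633</cell><cell>0.725</cell><cell>0.784</cell><cell>0.758</cell><cell>0.863</cell></row>
<row><cell>Confidence Threshold</cell><cell>-</cell><cell>0.600</cell><cell>0.656</cell><cell>-</cell><cell>0.629</cell><cell>0.723</cell><cell>-</cell><cell>0.736</cell><cell>0.845</cell></row>
<row><cell>M&amp;M (Ours)</cell><cell><hi rend="bold">0.491</hi></cell><cell>0.632</cell><cell><hi rend="bold">0.774</hi></cell><cell><hi rend="bold">0.413</hi></cell><cell><hi rend="bold">0.770</hi></cell><cell><hi rend="bold">0.762</hi></cell><cell><hi rend="bold">0.656</hi></cell><cell>0.805</cell><cell><hi rend="bold">0.913</hi></cell></row></table></p><p>Table <ref type="table">2</ref>: Comparing the performance of the proposed method with baseline methods on three decision-making tasks in terms of NLL, Accuracy, and F1-score. "&#8595;" denotes the lower the better, "&#8593;" denotes the higher the better. The best result in each column is highlighted in bold. All results are averaged over 5 runs. "-" means the method cannot be applied in this scenario.</p><p>These datasets were preprocessed to ensure consistency and suitability for our analysis. For all datasets, we converted the human decisions and AI recommendations into binary format (0 or 1) and normalized the AI confidence scores to the range of [0, 1] to facilitate comparison. Table <ref type="table">1</ref> provides the summary of the datasets.</p></div>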
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Examining the Predictive Performance of M&amp;M</head><p>We first examine how well the M&amp;M framework can predict human DMs' final decisions in AI-assisted decision-making.</p><p>Evaluation Setup. For each dataset, we randomly split the data into training (80%) and test (20%) sets. To quantify model performance, we employed three key evaluation metrics: negative log-likelihood (NLL), accuracy, and F1-score. NLL measures the model's ability to predict the probability of observed human decisions, with lower values indicating better performance. Accuracy assesses the proportion of correct predictions, while the F1-score provides a balanced measure of precision and recall, capturing the model's ability to correctly identify both acceptance and rejection of AI recommendations. To ensure the robustness of evaluations, all experiments were repeated 5 times, and the average performance across these repetitions was reported.</p><p>Our M&amp;M model is trained using a Bayesian approach with variational inference, as outlined in the previous section. We experiment with different numbers of DM types ($K \in \{2, 3, \ldots, 6\}$) and select the $K$ that achieves the minimum BIC score for each task. To provide a robust benchmark for evaluating the performance of our proposed M&amp;M framework, we consider three distinct classes of baseline models, each capturing different aspects of human decision-making behavior in AI-assisted scenarios: 1. Standard Supervised Learning Models: We employ four widely-used supervised learning models: Logistic Regression, Random Forest, Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM). These models directly predict the human DM's final decision $\hat{y}^t$ in a decision task based on task features $x^t$, AI recommendations $\hat{y}^t_m$, and AI confidence scores $c^t_m$. 
These models serve as a baseline for predictive accuracy, allowing us to assess whether incorporating explicit modeling of human-AI interaction patterns can improve upon standard machine learning approaches. 2. Utility-Based Model: We adapt the model proposed by <ref type="bibr">(Wang, Lu, and Yin 2022)</ref>, which is grounded in Cumulative Prospect Theory (CPT). This model assumes that DMs assess the utility of accepting or rejecting AI recommendations based on a distorted perception of probabilities, as captured by CPT's probability weighting function w(p) =</p><p>Based on this distorted estimate, the DM computes the utility of accepting or rejecting the AI recommendation as $U = w(p) \cdot \text{gain} + w(1 - p) \cdot \text{loss}$. With the calculated utility, the DM then uses a Logit model to select the action of accepting or rejecting the AI recommendation. 3. Confidence Threshold Model: We include the confidence-based model used in <ref type="bibr">(Amin, Lu, and Yin 2024)</ref>, which posits that human DMs have an internal confidence threshold $\tau$ drawn from a distribution $f(\tau)$.</p><p>If their confidence in their own judgment exceeds this threshold, they reject the AI recommendation; otherwise, they accept it. In practice, we use a Beta distribution $q(\tau) = \mathrm{Beta}(A_\tau, B_\tau)$ to approximate the distribution $f(\tau)$, with the constraint $\tau \in (0, 1)$. This model serves as a simple yet effective baseline that captures the role of self-confidence in AI-assisted decision-making.</p><p>Evaluation Results. Table <ref type="table">2</ref> presents the performance comparison of multiple models in predicting DMs' decisions across the three datasets. Overall, our proposed M&amp;M framework consistently emerges as the best-performing model with respect to NLL and F1-score. In terms of accuracy, our method performs the best in diabetes prediction and is comparable to the top-performing models in the other two decision tasks.</p></div>
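<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the three evaluation metrics can be computed as in the following minimal sketch. It assumes that decisions are encoded as 0/1, that each model outputs the predicted probability of the positive class, and that NLL is averaged over trials; whether the reported NLL is a mean or a sum, and which class counts as positive for F1, are not stated above, so both choices are assumptions.</p>
<p><![CDATA[
```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of 0/1 labels under P(label = 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def accuracy(preds, labels):
    """Fraction of predictions that match the observed decisions."""
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


def f1_score(preds, labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for a, b in zip(preds, labels) if a == positive and b == positive)
    fp = sum(1 for a, b in zip(preds, labels) if a == positive and b != positive)
    fn = sum(1 for a, b in zip(preds, labels) if a != positive and b == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    probs = [0.9, 0.2, 0.6, 0.4]
    labels = [1, 0, 1, 1]
    preds = [1 if p >= 0.5 else 0 for p in probs]
    print(nll(probs, labels), accuracy(preds, labels), f1_score(preds, labels))
```
]]></p>
<p>Unlike accuracy and F1, NLL depends on the predicted probabilities rather than thresholded decisions, which is why it cannot be computed for a model that outputs only hard accept or reject decisions.</p></div>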
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Quantifying Heterogeneity in Human DMs</head><p>Beyond prediction, the M&amp;M framework offers a nuanced understanding of the heterogeneous nature of human decision-making in AI-assisted contexts. By explicitly modeling diverse DM types and their associated parameters, the proposed framework provides insights into the underlying factors.</p><table><head>Table 3: Comparisons between model parameters learned for the identified types of DMs across three datasets.</head><row role="label"><cell rows="2">Task</cell><cell rows="2">k-th Type of DM</cell><cell cols="3">Parameters</cell></row><row role="label"><cell>Perceived Penalty (&#946;)</cell><cell>Sensitivity (&#948;)</cell><cell>Percentage in Population (&#945;)</cell></row><row><cell rows="2">Diabetes Prediction</cell><cell>Type I</cell><cell>0.81</cell><cell>2.41</cell><cell>0.26</cell></row><row><cell>Type II</cell><cell>0.92</cell><cell>4.72</cell><cell>0.74</cell></row><row><cell rows="3">Loan Risk Assessment</cell><cell>Type I</cell><cell>0.44</cell><cell>3.24</cell><cell>0.29</cell></row><row><cell>Type II</cell><cell>0.52</cell><cell>4.73</cell><cell>0.38</cell></row><row><cell>Type III</cell><cell>0.66</cell><cell>7.25</cell><cell>0.33</cell></row><row><cell rows="2">AI-assisted Income Prediction</cell><cell>Type I</cell><cell>0.61</cell><cell>1.83</cell><cell>0.13</cell></row><row><cell>Type II</cell><cell>0.93</cell><cell>4.40</cell><cell>0.87</cell></row></table><p>Table <ref type="table">3</ref> presents the learned parameters for the different types of DMs identified in varied AI-assisted decision-making scenarios. Analysis of these parameters yields several important insights.</p><p>Decision context significantly influences DM behavior.</p><p>We find that the decision context significantly influences the distribution of DM types. The proportion of each DM type, as indicated by the &#945; values, varies considerably across tasks. For example, in the diabetes prediction and income prediction tasks, Type II DMs, characterized by a higher perceived penalty and higher sensitivity to utility change, constitute the majority of the DMs (&#945; = 0.74 and &#945; = 0.87 for the two tasks, respectively). However, differences in proportions across DM types are less prominent in the loan risk assessment task, suggesting that risk attitudes are more evenly distributed among DMs in this decision task. 
This suggests that the specific nature of decision context can influence the decision-making style adopted by individuals, highlighting the importance of considering context when designing AI systems that aim to assist human DMs.</p><p>Task complexity and risk perception shape diversity in DM types. The number of identified DM types and the absolute values of the perceived penalty (&#946;) parameter vary across tasks, reflecting differences in task complexity and the granularity of risk perception. The loan risk assessment task, with its three distinct DM types and relatively lower &#946; values (ranging from 0.44 to 0.66), suggests a more nuanced understanding of risk among DMs due to the availability of detailed information. In contrast, diabetes and income prediction tasks, with only two DM types each and higher &#946; values, may reflect simpler risk assessments or less available information to mitigate potential losses. This observation underscores the need for flexible and adaptable AI systems that can cater to varying levels of task complexity and individual risk perceptions.</p><p>Risk preference could act as a moderator of decision sensitivity. A consistent positive correlation is observed between perceived penalty and sensitivity across DM types within each task. For example, in the loan risk assessment task, Type III DMs, with the highest perceived penalty (&#946; = 0.66), also demonstrate the highest sensitivity (&#948; = 7.25). This finding implies that people who are more averse to incorrect outcomes (higher &#946;) are more likely to engage in analytical decision-making processes, carefully considering the potential consequences of their choices (higher &#948;). 
This aligns with economic theories that highlight the role of perceived risk in decision strategies under uncertainty <ref type="bibr">(Kim, Menzefricke, and Feinberg 2007;</ref><ref type="bibr">Train 2009)</ref>.</p><p>Figure <ref type="figure">2</ref> further illustrates these behavioral differences by plotting the probability of accepting AI recommendations against the aggregated confidence level for each DM type. For instance, the curves for Type II DMs consistently rise more steeply than those for Type I DMs in the interval close to 0.5, suggesting that changes in aggregated confidence around 0.5 are more likely to change the decisions of Type II DMs. Furthermore, the acceptance probability of Type I DMs is consistently higher than that of Type II DMs when confidence is lower than 0.5, indicating that Type I DMs are generally more trusting of AI recommendations and require lower confidence levels to accept them. This observation aligns with their lower perceived penalty and sensitivity values, suggesting a less risk-averse decision-making style. In contrast, Type II DMs exhibit a more cautious approach, requiring higher levels of confidence before accepting AI recommendations. In the loan risk assessment task, the presence of a third DM type with even higher perceived penalty and sensitivity further emphasizes the diversity of human behavior in AI-assisted decision-making scenarios.</p><p>Inferring Human DMs' Independent Judgment. An additional benefit of our proposed framework is its ability to infer independent human judgment without relying on explicitly labeled data. This can address a crucial limitation in prior research that employs a separate model trained exclusively to characterize the human DM's independent judgment, a process that can be resource-intensive and may not be feasible in all scenarios. 
To validate the efficacy of M&amp;M in inferring human DMs' independent judgment, we leveraged the pilot study data from the loan risk assessment and diabetes prediction datasets. The studies that produced these datasets ran pilot studies in which human DMs reviewed tasks and made judgments without AI assistance, providing a ground truth for initial DM judgment.</p><p>Evaluation Setup. This analysis involved the pilot study data collected in the loan risk assessment and diabetes prediction tasks. We split the pilot data into training and test sets, gradually increasing the training size from 10% to 90% of the entire pilot data. For each split, we trained an independent human decision model, as done in previous studies, and evaluated its accuracy on the test set (p_idp). Specifically, we use a random forest classifier for predictions in the loan risk assessment task and logistic regression for the diabetes data, following the approach taken by the work that collected the data.</p><p>We then used the independent human decision models {&#952;_hk}_{k=1}^K inferred by M&amp;M on AI-assisted data to predict DMs' initial judgments on the test set of pilot data. Similar to Equation 5, we use a weighted mixture of independent human decision models to predict the initial DM judgment on a specific data instance x_i, and we evaluated its accuracy (p_inf). Finally, we compared p_inf with p_idp across different splits, to evaluate M&amp;M's ability to accurately capture independent human judgment in scenarios with varying amounts of training data.</p><p>Evaluation Results. Figure <ref type="figure">3</ref> compares the accuracy of the independent human decision model trained on actual data of initial DM judgment without AI assistance (p_idp) with the accuracy of the initial human judgments inferred by M&amp;M (p_inf). 
We generally find the accuracy difference (p_idp - p_inf) between the model trained on independent judgment data and our proposed model inferring independent judgments to be small, indicating that M&amp;M is comparably useful without the need for additional data collection. Interestingly, when the training set size is smaller, indicating a scarcity of training data for the independent model, the accuracy difference is actually negative, i.e., the independent human judgment model inferred from data of AI-assisted prediction ends up outperforming the model trained on the independent human judgment dataset. These findings underscore M&amp;M's potential to unlock insights into human decision making in situations where independent judgment data is limited or unavailable.</p><p>Figure 3 panels: (a) Loan Risk Assessment; (b) Diabetes Prediction.</p></div>
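<div xmlns="http://www.tei-c.org/ns/1.0"><p>The weighted-mixture prediction of initial judgments described in the evaluation setup above can be sketched as follows. This is a sketch under stated assumptions, not the exact formula of the framework: it assumes that the mixture weights are the population proportions &#945;_k, that each &#952;_hk is a logistic regression weight vector with the intercept stored first, and that predictions use a 0.5 threshold. The function names and the parameter values for the two toy DM types are hypothetical.</p>
```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def mixture_independent_prob(x, thetas, alphas):
    """P(initial judgment = 1 | x) as an alpha-weighted mixture of logistic models.

    thetas[k] = [intercept, w_1, ..., w_d] for DM type k (assumed layout);
    alphas[k] = proportion of type k in the population (assumed mixture weight).
    """
    prob = 0.0
    for theta, alpha in zip(thetas, alphas):
        z = theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))
        prob += alpha * sigmoid(z)
    return prob


def inferred_accuracy(X, y, thetas, alphas):
    """Accuracy of the mixture prediction on pilot test data (plays the role of p_inf)."""
    preds = [1 if mixture_independent_prob(x, thetas, alphas) >= 0.5 else 0 for x in X]
    return sum(1 for p, t in zip(preds, y) if p == t) / len(y)


# Hypothetical parameters for two DM types on a single feature; not learned values.
thetas = [[0.0, 1.0], [0.0, -1.0]]
alphas = [0.26, 0.74]  # proportions of the two types; must sum to 1

# Toy pilot test set (hypothetical): features and observed initial judgments.
X_test = [[2.0], [-2.0]]
y_test = [0, 1]

p_inf = inferred_accuracy(X_test, y_test, thetas, alphas)
```
<p>Comparing p_inf against the accuracy of a model trained directly on the pilot training split (p_idp) for each training size reproduces the structure of the comparison reported in Figure 3.</p></div>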
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>In this work, we present Mix and Match (M&amp;M), a novel computational framework that models the heterogeneous nature of human decision-making under AI assistance as a mixture of distinct decision-making processes. M&amp;M acknowledges variations across different individuals and recognizes that the same individual may adopt different decision-making processes for different tasks. Our empirical evaluation on real-world data across three distinct scenarios demonstrates that the M&amp;M framework consistently outperforms baseline methods in predicting human decisions under AI assistance. Notably, the framework infers independent human judgment without the need for additional training data. Moreover, by analyzing the learned parameters of different DM types, we uncover nuanced behavioral patterns that align with established psychological theories and reveal context-dependent variations in decision-making styles. By unfolding the interplay between human intuition and AI recommendations, the M&amp;M framework paves the way for the development of more personalized and trustworthy AI systems that can better empower human DMs.</p><p>This study nonetheless has limitations. The human behavior data used for evaluation were collected from laypeople on predictive tasks based on tabular data with relatively few features. Whether the proposed model generalizes to tasks with higher-dimensional feature spaces or greater complexity remains to be investigated. Furthermore, the AI-assisted scenarios examined explicitly provided confidence values to human DMs. Future research should explore the applicability of the M&amp;M framework to scenarios where AI models communicate confidence implicitly, such as through verbal descriptions in large language models. 
Finally, we assumed a logistic regression model for independent human judgment, and exploring alternative models could further enhance the framework's flexibility.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proceedings of the Twelfth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2024)</p></note>
		</body>
		</text>
</TEI>
