<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>The Wisdom of Model Crowds</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10394341</idno>
					<idno type="doi">10.1287/mnsc.2021.4090</idno>
					<title level='j'>Management Science</title>
<idno>0025-1909</idno>
<biblScope unit="volume">68</biblScope>
<biblScope unit="issue">5</biblScope>					

					<author>Lisheng He</author><author>Pantelis P. Analytis</author><author>Sudeep Bhatia</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A wide body of empirical research has revealed the descriptive shortcomings of expected value and expected utility models of risky decision making. In response, numerous models have been advanced to predict and explain people’s choices between gambles. Although some of these models have had a great impact in the behavioral, social, and management sciences, there is little consensus about which model offers the best account of choice behavior. In this paper, we conduct a large-scale comparison of 58 prominent models of risky choice, using 19 existing behavioral data sets involving more than 800 participants. This allows us to comprehensively evaluate models in terms of individual-level predictive performance across a range of different choice settings. We also identify the psychological mechanisms that lead to superior predictive performance and the properties of choice stimuli that favor certain types of models over others. Moreover, drawing on research on the wisdom of crowds, we argue that each of the existing models can be seen as an expert that provides unique forecasts in choice predictions. Consistent with this claim, we find that crowds of risky choice models perform better than individual models and thus provide a performance bound for assessing the historical accumulation of knowledge in our field. Our results suggest that each model captures unique aspects of the decision process and that existing risky choice models offer complementary rather than competing accounts of behavior. We discuss the implications of our results on theories of risky decision making and the quantitative modeling of choice behavior.            This paper was accepted by Yuval Rottenstreich, behavioral economics and decision analysis.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Risk plays a key role in everyday choice, with managerial, financial, consumer, and health decision making often involving the evaluation of probabilistic outcomes and the optimization of value in the face of uncertainty. Unsurprisingly, understanding how people make risky choices is one of the most important research topics in the behavioral sciences, and it is a central focus of fields such as managerial decision making, behavioral decision research, judgment and decision making, and behavioral economics. Dating back to the correspondence between Blaise Pascal and Pierre de Fermat, some have argued that people should always maximize expected value in risky choice. However, early thought experiments, such as the St. Petersburg paradox proposed by Nicolaus and Daniel Bernoulli, have challenged this view by speculating what people would do, therefore putting alternative descriptive theories of decision under risk into perspective. Following these challenges, expected utility theory (EUT), pioneered by Daniel <ref type="bibr">Bernoulli (1738)</ref> and axiomatized by von <ref type="bibr">Neumann and Morgenstern (1944)</ref>, has provided an influential approach to thinking about both normative and descriptive aspects of risky choice.</p><p>Of course, research on risky choice behavior did not end with EUT. Rather, the question of what people do when confronted with options that offer potentially probabilistic outcomes has fueled a transgenerational, interdisciplinary research program, with tremendous impact both within academia and in applied settings. The first behavioral experiments designed to answer this question focused on specific deviations from EUT <ref type="bibr">(Allais 1953</ref><ref type="bibr">, Edwards 1954)</ref>. 
Soon several discrepancies had been uncovered, and the accumulated empirical evidence gave rise to a wave of fully fledged behavioral models (each associated with different psychological mechanisms) that could be directly contrasted with EUT in terms of descriptive adequacy (e.g., <ref type="bibr">Kahneman and Tversky 1979</ref><ref type="bibr">, Tversky and Kahneman 1992</ref><ref type="bibr">, Busemeyer and Townsend 1993</ref><ref type="bibr">, and Birnbaum 2008)</ref>. The rate at which new models have been advanced has only accelerated over the years; at the time of writing this article, several dozen behavioral models of risky choice had been proposed <ref type="bibr">(Starmer 2000</ref><ref type="bibr">, He et al. 2020)</ref>.</p><p>Judging by the volume of models available to explain existing data, the study of people's risk-taking behavior should be one of the most mature fields in the social and behavioral sciences. What is the current state of the art in terms of describing people's behavior and predicting their choices? How much progress have we collectively achieved across disciplines, and what are the psychological mechanisms that are necessary to get good predictions? Surprisingly, it is hard to find answers to these questions. More often than not, different models are seen as competitors, where the success of a model directly discredits rival theoretical accounts. Moreover, it remains hard to assess the relative importance of different psychological mechanisms and the overall output of the collective scientific endeavor, as the study of risky choice is rather fragmented even within disciplines, and even more so across disciplines.</p><p>There are three main roadblocks hindering progress and synthesis across disciplines. 
First, new theoretical papers typically compare the advanced model against a handful of main competitors; as a result, it is hard to judge how a model fares against the overall state of the art in predicting and describing people's choices. Although this is a reasonable approach given the large number of potential competitors, it can lead to a splintered view of the literature and important ideas being forgotten. Second, different studies use very different data sets to evaluate the performance of models. Model performance largely depends on the selection of stimuli included in different experiments (see <ref type="bibr">Erev et al. (2017)</ref> for a similar critique), and consequently, the predictive ability of models varies across studies, making comparisons between different theoretical accounts particularly complicated. Finally, among the existing empirical studies, only a modest subset has generated enough data to allow for the estimation of model parameters of individuals, despite the fact that model parameters correspond to psychological factors, such as subjective perception, attention, and emotion, which could be highly idiosyncratic across people <ref type="bibr">(Edwards 1955</ref><ref type="bibr">, Bordalo et al. 2012</ref><ref type="bibr">, Loewenstein et al. 2015)</ref>. In the absence of such individual-level tests, our understanding of the descriptive power of many existing models is incomplete.</p><p>What is needed is a transdisciplinary analysis that comprehensively integrates the rich set of theoretical insights identified by prior researchers and uses these insights to identify the state of the art in modeling individual-level risky choice, quantify the progress made over the past several decades, understand how key psychological properties of these models relate to model performance for different data sets, and develop novel ideas for improving the predictive and explanatory scope of risky decision-making research. 
In this paper, we present such an analysis. First, we build a collection of 58 risky choice models from numerous papers published in disciplines such as management, economics, and psychology. It is important to note that we instantiate these models in code, thereby formalizing their functional forms and rigorously specifying their implementation details. To the best of our knowledge, this is the most extensive set of risky choice models compiled and implemented so far. Second, we build a collection of risky choice data sets, again drawn from different papers. Our collection includes both data sets with mixed gambles and with only positive gambles (i.e., gains) and data sets with numerous different types of choice problems (including randomly generated and experimenter-curated choice problems, as well as choice problems with one and two nonzero branches), allowing for a much more comprehensive evaluation of different models. Additionally, each of our data sets has a large number of responses on the individual level, facilitating individual-level model fits and tests. Overall, these data sets involve 825 individuals making 76,910 risky choices in total. Again, to the best of our knowledge, this is one of the largest risky choice data sets compiled so far. The large panel of models and the vast test bed of data sets allow for an unprecedentedly complete evaluation of different individual models. To our knowledge, we are the first to quantitatively fit many of the models, and our tests exceed the size of the data sets and model sets used in prior work by an order of magnitude.</p><p>The rich collection of models and choice stimuli that we have collected also allows us to better understand the properties of the models and choice problems that drive our results. 
We attempt this analysis by partitioning our set of models based on the assumed psychological mechanisms (e.g., probability weighting, regret, attention) and by partitioning our stimuli based on the correlations between the underlying probabilities and payoffs as well as the expected value (EV) difference between options. We are subsequently able to test which mechanisms lead to superior model performance and how this varies based on the underlying stimuli structure offered to participants.</p><p>Seeing models as competing against each other is a limitation in itself, and it may not do full justice to the historical accumulation of knowledge in risky choice research. A more productive approach may be to consider models as complementary and thereafter to exploit the collective wisdom accumulated in different research papers across different disciplines. Thus, in this paper, we integrate research on risky choice and research on the wisdom of crowds <ref type="bibr">(Galton 1907</ref><ref type="bibr">, Surowiecki 2004</ref>) to develop and test model crowds, where each model is seen as an expert whose judgments can be aggregated with those of other models to better describe choice behavior. Hitherto, the principle of crowd wisdom has been used to aggregate the opinions of different people in estimation and categorization tasks. Averaged opinions typically lead to more reliable estimates and in many cases outperform the predictions of the best-performing individual. In a similar vein, model crowds could leverage insights of various risky choice models and predict people's behavior better than any individual model. Moving from individuals to models is a natural step. 
In fact, there are often towering intellectual figures standing behind the models, and models can, in many ways, be seen as the (mathematically specified) decision rules that would be used by these experts to predict individual choice.</p><p>Model crowds hold great promise for improving our ability to predict people's behavior. In the field of machine learning, model aggregation has proven valuable for improving prediction in regression and classification tasks by efficiently leveraging small amounts of data and reducing sensitivity to specific samples (and thus reducing variance; <ref type="bibr">Breiman 1998</ref><ref type="bibr">, Polikar 2006</ref>). Thus, it comes as no surprise that ensemble models, which aggregate the predictions of several distinct models, are often proclaimed the winners of machine learning competitions <ref type="bibr">(Bell and Koren 2007</ref><ref type="bibr">, Niculescu-Mizil et al. 2009</ref>). Closer to home, ensemble models have shown great promise in a series of prediction competitions featuring models that were developed and tuned by research teams using training data from large behavioral experiments with the goal of predicting the proportion of people choosing one risky option over another in a holdout data set <ref type="bibr">(Erev et al. 2010</ref><ref type="bibr">, 2017)</ref> and have been leveraged by cognitive modelers to uncover people's cognitive processes <ref type="bibr">(Singmann et al. 2018</ref>). Moreover, in risky choice and other choice processes more broadly, it is reasonable to assume that individuals' decision strategies may be governed by a number of factors that are not present in any single decision model. Thus, crowds of individual models relying on different theoretical assumptions may capture these factors and thus predict individual-level behavior better than any single model does <ref type="bibr">(Payne et al. 1988</ref><ref type="bibr">, Scheibehenne et al. 
2013)</ref>.</p><p>Overall, model crowds combine the insights of numerous existing models to predict and describe choice behavior, and thus they provide a measure of the progress we have collectively achieved across disciplines.</p><p>We can also quantify the contribution of each individual model in a model crowd, which can be used to identify the idiosyncratic predictive value of the model (when taking into account the predictions of other models in the crowd). When crowd models are evaluated over entire historical periods, they can be used to quantify the growth of knowledge over time, becoming a powerful tool to study the history of risky choice research. Last but not least, by calculating the average weights of models that rely on a specific psychological mechanism or by removing all models using a specific mechanism, model crowds can provide a measure of the relative importance of different psychological mechanisms in improving our predictive ability. We next test our 58 risky choice models, along with various model crowds generated from these models, as well as their assumed psychological mechanisms, to obtain a comprehensive understanding of the descriptive power of behavioral theories of risky choice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models</head><p>We collected the long list of risky choice models using a multistage process. We first searched Google Scholar using various keywords (e.g., "risky choice model" and "risky decision model") and looked for models in regular review articles published in the Annual Review of Psychology and the Journal of Economic Literature <ref type="bibr">(Edwards 1954</ref><ref type="bibr">, 1961;</ref> <ref type="bibr">Becker and McClintock 1967;</ref> <ref type="bibr">Rapoport and Wallsten 1972;</ref> <ref type="bibr">Slovic et al. 1977;</ref> <ref type="bibr">Einhorn and Hogarth 1981;</ref> <ref type="bibr">Pitz and Sachs 1984;</ref> <ref type="bibr">Payne et al. 1992;</ref> <ref type="bibr">Starmer 2000;</ref> <ref type="bibr">Hastie 2001;</ref> <ref type="bibr">Simonson et al. 2001;</ref> <ref type="bibr">Weber and Johnson 2009;</ref> <ref type="bibr">Oppenheimer and Kelso 2015)</ref>. Then, using citation chaining, we found additional models presented in papers citing our list of models. We then circulated the list to our colleagues via the Society for Judgment and Decision Making email listserv; they helped us identify additional models not present in our list. Finally, we manually searched through prominent journals in management, psychology, and economics, such as Management Science, Psychological Review, and American Economic Review, for recently published models that may not have been on our list.</p><p>Overall, our focus was on mathematically or algorithmically specified models of description-based risky choice with precise functional forms that could be fit to choice data. 
Thus, we excluded models of decision making under ambiguity (e.g., <ref type="bibr">Camerer and Weber 1992)</ref>, models of experience-based risky decision making (e.g., <ref type="bibr">Gilboa and Schmeidler 1995</ref> and <ref type="bibr">Hertwig and Erev 2009)</ref>, models of reference dependence (e.g., <ref type="bibr">K&#337;szegi and Rabin 2006)</ref>, qualitative models (e.g., <ref type="bibr">Loewenstein et al. 2001)</ref>, purely axiomatic models without restrictions on functional forms (e.g., <ref type="bibr">Machina 1982)</ref>, models of risk perception (e.g., <ref type="bibr">Pollatsek and Tversky 1970)</ref>, and models that did not have analytically specified likelihood functions and needed to be simulated to make predictions (e.g., <ref type="bibr">Erev et al. 2017)</ref>.</p><p>Despite these restrictions, we were able to collect 58 distinct models. Each of these models makes implicit or explicit assumptions about the psychological mechanisms at play in risky choice, and our large collection of models gives us an unprecedented opportunity to analyze the role of these mechanisms in model performance. After consulting the original papers of the models and identifying the mechanisms that their authors invoked when presenting the models, we categorized models as involving one or more of nine mechanisms: (1) payoff transformation, (2) probability transformation, (3) attention, (4) sampling, (5) regret, (6) disappointment, (7) ranking, (8) threshold, and (9) dispersion (see Figure <ref type="figure">1</ref>). Models with the first and second mechanisms transform payoffs into subjective values (e.g., Bernoulli 1738) or probabilities into subjective probabilities (e.g., Edwards 1954) using nonlinear functions, and such models use these transformed payoffs or probabilities to evaluate the gambles. 
Models with attention (e.g., <ref type="bibr">Busemeyer and Townsend 1993</ref> and <ref type="bibr">Birnbaum 2008</ref>) assume that decision makers selectively focus on some payoffs, probabilities, or states of the world, whereas models with sampling (e.g., <ref type="bibr">Lieder et al. 2018</ref>) assume that decision makers simulate or retrieve from memory the outcomes that are used to evaluate the gambles. Models that allow for regret (e.g., <ref type="bibr">Bell 1982</ref> and <ref type="bibr">Loomes and Sugden 1982)</ref> typically compare the payoffs of a gamble against the payoffs of other gambles, whereas models that allow for disappointment (e.g., <ref type="bibr">Bell 1985</ref> and <ref type="bibr">Loomes and Sugden 1986)</ref> typically compare the payoffs of a gamble against the other payoffs of the same gamble. Models that use ranking (e.g., <ref type="bibr">Thorngate 1980</ref> and <ref type="bibr">Birnbaum 1997</ref>) order the payoffs or probabilities involved and make decisions based on the ranks of these payoffs or probabilities. Models that use thresholds (e.g., Fishburn 1977 and Diecidue and van de Ven 2008) typically use discrete cutoffs for payoffs or probabilities to evaluate gambles. Models with the dispersion mechanism (e.g., <ref type="bibr">Markowitz 1952</ref> and <ref type="bibr">Weber et al. 2004</ref>) compute some measure of variability for gambles, and they typically penalize gambles with high-variance payoffs. Of course, a given model can allow for multiple mechanisms at the same time, such as transformations of both payoffs and probabilities (e.g., prospect theory; see <ref type="bibr">Kahneman and Tversky (1979)</ref>), decision making under the influence of both regret and disappointment (e.g., <ref type="bibr">Mellers et al. 1999)</ref>, or heuristic choice with a sequence of transformation and threshold operations (e.g., <ref type="bibr">Leland 1994)</ref>. 
The mechanisms are thus not mutually exclusive.</p><p>To fit stochastic choice data, we applied the logit choice rule to models that generate utilities or choice propensities on a cardinal scale. For models that generate choice propensities on an ordinal scale (e.g., heuristics), for which the logit rule was not applicable, a trembling-hand (i.e., constant-error) choice rule was applied to accommodate choice stochasticity. The two stochastic specifications are not as different as they may appear. Indeed, if we allow the logit choice rule to take an ordinal preference order as the input, it reduces to the trembling-hand choice rule (by yielding a probability of choosing the preferred option and a complementary probability of making an error fixed across items). Additional details regarding the models and their implementation are provided in the supplementary appendix.</p><p>Notes to Figure <ref type="figure">1</ref>. Shaded cells mean that the model involves the psychological mechanism. Details and full names of the models can be found in the supplementary appendix.</p></div>
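<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two stochastic choice rules described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is documented in the supplementary appendix); the function names, the sensitivity parameter theta, and the error rate epsilon are hypothetical names introduced here for illustration only.</p>
<p><![CDATA[
```python
import math


def logit_choice_prob(u_first, u_second, theta=1.0):
    """P(choose first option) under a logit rule on cardinal utilities.

    theta is an assumed sensitivity parameter (hypothetical name).
    The two branches avoid overflow for large utility differences.
    """
    z = theta * (u_first - u_second)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def trembling_hand_prob(prefers_first, epsilon=0.1):
    """P(choose first option) under a constant-error rule.

    prefers_first: True, False, or None (no strict preference; treating
    this case as a coin flip is an assumption of this sketch).
    epsilon is an assumed error probability, fixed across items.
    """
    if prefers_first is None:
        return 0.5
    return 1.0 - epsilon if prefers_first else epsilon
```
]]></p>
<p>Feeding the logit rule an ordinal input (say, 1 for the preferred option and 0 for the other) yields the same choice probability on every item, 1/(1 + exp(-theta)), which is what the trembling-hand rule produces with epsilon = 1/(1 + exp(theta)). This is the sense in which the logit rule reduces to the constant-error rule.</p></div>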
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Sets</head><p>We evaluated the models with a wide range of data sets from experimental studies. First, we downloaded data sets that have been made available online (either on personal websites or public repositories) from recently published papers with risky choice experiments. We also sent an email to the Society for Judgment and Decision Making listserv, requesting relevant data sets. All data sets were further screened to meet the following criteria:</p><p>1. Contain individual-level choice data with at least 50 choice problems for each participant (as described in the request email)</p><p>2. Offer a binary choice between monetary gambles (with explicit descriptions for both probabilities and payoffs)</p><p>3. Allow at most two possible monetary outcomes for each gamble</p><p>With the above-mentioned measures taken, we obtained a total of 19 data sets (see Table <ref type="table">1</ref>). Twelve of these data sets involved gambles purely in the gain domain, including one originally presented in <ref type="bibr">Rieskamp (2008)</ref>, two in <ref type="bibr">Fiedler and Gl&#246;ckner (2012)</ref>, eight in <ref type="bibr">Stewart et al. (2015)</ref>, and one in <ref type="bibr">Stewart et al. (2016)</ref>. <ref type="bibr">Rieskamp's (2008)</ref> data set involved 30 participants making 60 binary risky choices each. The <ref type="bibr">Stewart et al. (2015)</ref> data sets involved a total of 208 participants, each of whom made either 120 or 150 binary risky choices. <ref type="bibr">Stewart et al. (2016)</ref> involved 48 participants, and each participant made 71 choices. The other seven data sets involved gambles with both gains and losses (i.e., mixed gambles), including one data set collected by <ref type="bibr">Erev et al. (2017)</ref>, one by <ref type="bibr">Pachur et al. (2017)</ref>, and five by <ref type="bibr">Pachur et al. (2018)</ref>. The <ref type="bibr">Erev et al. 
(2017)</ref> data set involved 60 participants making 57 binary choices each. The <ref type="bibr">Pachur et al. (2017)</ref> data set involved 122 participants making 105 binary choices each. The <ref type="bibr">Pachur et al. (2018)</ref> data sets involved 300 participants making either 91 or 51 binary choices each. Overall, the full array of data sets involved 343 participants making 38,180 choices in the gain domain and 482 participants making 38,730 choices in the mixed domain. Note that four models (the relative risk-value models; Dyer and Jia 1997) were designed exclusively for risky choice in the gain domain and thus were excluded from the analysis of mixed gambles (which involved losses). Thus, there were 58 models for gains and 54 models for mixed gambles. As such, the results from the two types of data sets are presented separately in what follows. As reported in the original papers, participants in these experiments were incentivized based on their choices in the tasks.</p><p>The data sets compiled in this paper involve a wide range of gamble designs. Some data sets have generated gambles by systematically crossing payoffs with probabilities and exhausting all possible combinations of payoffs and probabilities (e.g., <ref type="bibr">Stewart et al. 2015</ref><ref type="bibr">, 2016)</ref>, whereas others have randomly selected gambles from a reasonable stimulus space <ref type="bibr">(Erev et al. 2017)</ref>. Some designs have featured items that people commonly encounter in real-world settings <ref type="bibr">(Rieskamp 2008</ref>). Yet others have followed a hybrid approach that combines manually crafted gambles with randomly generated gambles (e.g., <ref type="bibr">Pachur et al. 2018</ref>). These designs can also be understood in terms of two key quantitative properties: (1) the normalized EV difference between options and (2) the correlation between payoffs and their associated probabilities. 
The normalized EV difference is a choice-level property. For each binary choice between X and Y, the normalized EV difference is defined as</p><p>The correlation between payoffs and probabilities is an experimental data set-level property. It is defined as the Pearson's correlation between all involved payoffs and their associated probabilities in the experiment. Note that some researchers have intentionally controlled the normalized EV differences in the stimuli sample to allow for data-efficient model selection (e.g., <ref type="bibr">Rieskamp 2008)</ref>.</p><p>In Figure <ref type="figure">2</ref> we plot each data set in our analysis in terms of the median normalized EV difference of its component choice problems, as well as the correlation between payoffs and probabilities across all its choice problems. We see here that most of our data sets have a negative correlation between payoffs and probabilities, corresponding to a more ecologically valid design <ref type="bibr">(Pleskac and Hertwig 2014)</ref>. There is a large amount of variance in the median normalized EV difference across data sets, and only a few data sets keep this difference fixed at zero or very close to zero. There do not appear to be systematic differences between gains and mixed gambles on these two dimensions.</p></div>
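<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the data set-level property described in this section, the following Python sketch computes the Pearson correlation between payoffs and their associated probabilities. The data layout (each gamble as a list of payoff-probability branches) and the function names are assumptions made for this example, as is the choice to pool every listed branch, including zero-payoff branches; the original analysis may have handled these details differently.</p>
<p><![CDATA[
```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def payoff_probability_correlation(gambles):
    """Correlation between payoffs and probabilities in one experiment.

    gambles: iterable of gambles, each a list of (payoff, probability)
    branches. This layout is hypothetical; all branches are pooled.
    """
    payoffs, probs = [], []
    for gamble in gambles:
        for payoff, prob in gamble:
            payoffs.append(payoff)
            probs.append(prob)
    return pearson_r(payoffs, probs)
```
]]></p>
<p>A stimulus set in which large payoffs tend to come with small probabilities, such as [[(100, 0.1), (0, 0.9)], [(10, 0.8), (0, 0.2)]], gives a negative value under this sketch.</p></div>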
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cross-Validation</head><p>The 58 models in our collection have varying numbers of parameters and different assumptions, leading to different degrees of flexibility. To control for flexibility, we used 10-fold cross-validation and evaluated the models' out-of-sample predictive performance. All analyses were conducted at the individual level. Each individual participant's choice data were divided into 10 subsets. In each iteration, 9 subsets (i.e., 90% of the choice data) served as the training set to train the models and estimate their free parameters, and the remaining subset served as the test set. The training-testing procedure was repeated 10 times for each participant, with each of the 10 subsets serving as the test set once. Parameters were estimated by means of maximum likelihood <ref type="bibr">(Pitt et al. 2003)</ref>. To ensure that the global maximum was reached, we ran the simplex algorithm 500 times using the MATLAB fminsearch function and selected the estimate with the highest likelihood. For a given model m, the estimated parameters in the training set were used to make predictions in the test set. For each choice problem i, the out-of-sample prediction using these parameters is denoted as <formula notation="TeX">\hat{y}_{m,i}</formula>, which is the predicted probability that the first of the two options on the choice problem is chosen. Because each trial served in the test set exactly once in the 10-fold cross-validation, we obtained an out-of-sample prediction for every choice problem in each participant's choice data.</p><p>We evaluated models' out-of-sample predictive performance with three different loss functions. The first one is the binary prediction error. 
For a given model m, the individual-level prediction error is defined as <formula notation="TeX">PE_m = \frac{1}{N}\sum_{i=1}^{N} I\left(\mathrm{round}(\hat{y}_{m,i}) \neq y_i\right)</formula>, where <formula notation="TeX">y_i</formula> is the observed choice (1 if the first option is chosen on problem i and 0 otherwise), N is the number of choice problems, and I(&#8226;) is the indicator function that returns 1 if the argument is true and 0 otherwise. Note that in the rare cases where the prediction <formula notation="TeX">\hat{y}_{m,i}</formula> was exactly 0.5, the indicator function I(&#8226;) was replaced with a prediction error of 0.5. The other two were probabilistic loss functions: log-loss and Brier score. The log-loss is defined as <formula notation="TeX">LL_m = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_{m,i} + (1 - y_i)\log(1 - \hat{y}_{m,i})\right]</formula>, and the Brier score is defined as <formula notation="TeX">BS_m = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{m,i} - y_i\right)^2</formula>.</p><p>The smaller the errors according to these loss functions, the better the model performance. The probabilistic loss functions take into account the strength of preferences and are thus more sensitive to the models' quantitative predictions than the prediction error, which, by contrast, only encodes the direction of preference.</p><p>Note to Figure <ref type="figure">2</ref>. The labels of the data sets can be found in the "Abbreviation" column of Table <ref type="table">1</ref>.</p></div>
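<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cross-validation procedure and the three loss functions can be sketched in Python as follows. This is an illustrative sketch, not the authors' MATLAB code: the random assignment of trials to folds, the clipping constant in the log-loss, and the fit and predict callables are all assumptions introduced here, and the function names are hypothetical.</p>
<p><![CDATA[
```python
import math
import random


def prediction_error(y_true, y_prob):
    """Mean binary error; a prediction of exactly 0.5 counts as 0.5."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        if p == 0.5:
            total += 0.5
        else:
            total += float((p > 0.5) != (y == 1))
    return total / len(y_true)


def log_loss(y_true, y_prob, eps=1e-12):
    """Mean negative log likelihood of the observed choices."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clipping is an assumption
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between prediction and observed choice."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def k_fold_indices(n_trials, n_folds=10, seed=0):
    """Split trial indices into folds (random split is an assumption)."""
    order = list(range(n_trials))
    random.Random(seed).shuffle(order)
    return [order[k::n_folds] for k in range(n_folds)]


def cross_validate(n_trials, fit, predict, n_folds=10, seed=0):
    """Return one out-of-sample prediction per trial.

    fit(train_idx) returns fitted parameters; predict(params, i) returns
    the predicted probability of choosing the first option on trial i.
    Both callables are hypothetical placeholders for a choice model.
    """
    preds = [None] * n_trials
    for test_idx in k_fold_indices(n_trials, n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_trials) if i not in held_out]
        params = fit(train_idx)
        for i in test_idx:
            preds[i] = predict(params, i)
    return preds
```
]]></p>
<p>Because every trial lands in exactly one fold, the list returned by cross_validate holds an out-of-sample prediction for each choice problem, and it can be passed directly to any of the three loss functions together with the observed choices.</p></div>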
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Crowds</head><p>In addition to individual models, we built and tested five model crowds, inspired by research on the wisdom of crowds. Our model crowds took the individual models' (trained) out-of-sample predictions on the test set as given and then made novel predictions by combining the individual model predictions using some model weighting scheme. It is important to note that in designing such model crowds, we assigned weights to the models independently of the test set, which remained fully out of sample. As with the individual models, model crowds were evaluated based on their out-of-sample predictions with the same set of loss functions previously described.</p><p>The first model crowd used in our analysis was a na&#239;ve crowd that unconditionally averages the predictions of all models for each choice problem in the test set. Specifically, for each choice problem i in the test set, the na&#239;ve crowd's predicted choice probability is the unweighted average of all individual models' choice probabilities: <formula notation="TeX">\hat{y}_{nc,i} = \frac{1}{M}\sum_{m=1}^{M}\hat{y}_{m,i}</formula>, where M is the total number of individual models. Intuitively, the na&#239;ve crowd sees each individual model as being an equally valid predictor and thus aggregates the individual models without weights (as with, e.g., the equal weights heuristic decision rule; <ref type="bibr">Dawes et al. 1989)</ref>. Despite its simplicity, this model has been shown to perform quite well in forecasting and opinion aggregation contexts, largely because of the robustness (low variability) of its predictions <ref type="bibr">(Hogarth 1978</ref><ref type="bibr">, Clemen 1989</ref><ref type="bibr">, Armstrong 2001</ref><ref type="bibr">, Analytis et al. 2018)</ref>.</p><p>A second model crowd was the weighted crowd. 
This model used differences in model performances at the training stage to inform model weights, so that better-performing models at the training stage were given higher weights in the crowd. We used Akaike weights for this purpose <ref type="bibr">(Akaike 1973</ref><ref type="bibr">, Wagenmakers and Farrell 2004)</ref>. The Akaike weight for a model is proportional to the model's maximum likelihood in the data it is fit on (in our case, the training data), but it also includes a penalty for model complexity in terms of the number of free parameters. Accordingly, for each choice problem i in the test set, the weighted crowd's predicted choice probability is <formula notation="TeX">\hat{y}_{wc,i} = \frac{\sum_{m=1}^{M}\exp\left(-\mathrm{AIC}_m/2\right)\hat{y}_{m,i}}{\sum_{m=1}^{M}\exp\left(-\mathrm{AIC}_m/2\right)}</formula>, where <formula notation="TeX">\mathrm{AIC}_m = -2\log L_m + 2V_m</formula> is the Akaike information criterion for the training set (<formula notation="TeX">\log L_m</formula> is the maximum log likelihood of the training data, and <formula notation="TeX">V_m</formula> is the number of free parameters in m). The weighted crowd model can be seen as aggregating individual model predictions in a way that places more emphasis on the predictions of models that perform well on the training data, and thus models whose predictions are more likely to be correct in the test data (as with, e.g., the weighted additive decision rule; <ref type="bibr">Keeney and Raiffa 1993)</ref>. Similar models in the wisdom-of-crowds literature weigh the predictions of individuals based on their accuracy in prior forecasts, their self-reported confidence, or some other measure of individual-level performance, and for this reason they often outperform the na&#239;ve crowd <ref type="bibr">(Einhorn et al. 1977</ref><ref type="bibr">, Armstrong 2001</ref><ref type="bibr">, Bahrami et al. 2010</ref>). The weighted crowd can also be seen as an alternative implementation of Bayesian model averaging that penalizes model complexity by means of the number of free parameters <ref type="bibr">(Hoeting et al. 1999)</ref>.</p><p>Our third and fourth model crowds were select crowds <ref type="bibr">(Goldstein et al. 
2014</ref><ref type="bibr">, Mannes et al. 2014)</ref>. As with the weighted crowd, the select crowds utilize differential model performance in the training set to determine model weights for predictions for the test set. They identify a particular number of best-performing models in the training set and assign an equal weight to all selected models. Consistent with several recent applications of select crowds, we varied the crowd size (e.g., <ref type="bibr">Luan et al. 2012</ref><ref type="bibr">, Goldstein et al. 2014</ref><ref type="bibr">, Mannes et al. 2014</ref><ref type="bibr">, Analytis et al. 2018</ref><ref type="bibr">, and Galesic et al. 2018)</ref>. Specifically, we selected either the top 5 or top 10 best-performing models in the training set for each training-testing iteration and obtained the select crowd's predictions by unconditionally averaging the predictions of the selected models. These models are referred to as "select-5" and "select-10" crowds, respectively.</p><p>Aggregating the opinions of about five models or experts has been shown to lead to good results across settings <ref type="bibr">(Makridakis and</ref> <ref type="bibr">Winkler 1983, Ashton and</ref> <ref type="bibr">Ashton 1985)</ref>. The prediction improvement tends to diminish as additional models or experts are added to the select crowd, depending on the quality of information about past judge performance (e.g., see <ref type="bibr">Mannes et al. (2014)</ref>) and the exact nature of the problem (also see <ref type="bibr">Hogarth (1978)</ref>). The select-10 crowd allows us to investigate how sensitive the performance of the select crowd approach is to the number of included models. For the select-5 crowd, the predicted choice probability for choice problem i is <formula notation="tex">\hat{y}_{s5,i} = \frac{1}{5}\sum_{m \in S_5}\hat{y}_{m,i}</formula>, where <formula notation="tex">S_5</formula> denotes the set of the five best-performing models in the training set.</p></div>
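<div xmlns="http://www.tei-c.org/ns/1.0"><p>The na&#239;ve, weighted, and select crowds described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy inputs are hypothetical, and ranking models by training-set AIC for the select crowd is an assumption (the text only states that the best-performing models in the training set are selected).</p>
<p><![CDATA[
```python
import math


def naive_crowd(preds):
    """Unweighted mean of model predictions; preds[m][i] is model m on problem i."""
    n_models = len(preds)
    return [sum(col) / n_models for col in zip(*preds)]


def aic_scores(log_liks, n_params):
    """AIC_m = -2 log L_m + 2 V_m, computed on the training data."""
    return [-2.0 * ll + 2.0 * v for ll, v in zip(log_liks, n_params)]


def akaike_weights(log_liks, n_params):
    """Akaike weights: exp(-AIC_m / 2), normalized to sum to 1."""
    aic = aic_scores(log_liks, n_params)
    best = min(aic)
    # Subtracting the minimum AIC leaves the weights unchanged and avoids underflow.
    raw = [math.exp(-0.5 * (a - best)) for a in aic]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_crowd(preds, weights):
    """Weighted mean of model predictions (weights are assumed to sum to 1)."""
    return [sum(w * p for w, p in zip(weights, col)) for col in zip(*preds)]


def select_crowd(preds, log_liks, n_params, k):
    """Equal-weight mean of the k models with the lowest training AIC (assumed criterion)."""
    aic = aic_scores(log_liks, n_params)
    top = sorted(range(len(preds)), key=lambda m: aic[m])[:k]
    return naive_crowd([preds[m] for m in top])


# Hypothetical toy input: three models, two test problems.
preds = [[0.9, 0.2], [0.7, 0.4], [0.5, 0.5]]
log_liks = [-50.0, -52.0, -69.3]   # training log likelihoods (made up)
n_params = [2, 3, 0]               # free parameters per model (made up)

weights = akaike_weights(log_liks, n_params)
print(naive_crowd(preds))
print(weighted_crowd(preds, weights))
print(select_crowd(preds, log_liks, n_params, k=2))
```
]]></p>
<p>In this toy input the first model has the lowest AIC, so it dominates the weighted crowd, whereas the select crowd with k = 2 simply averages the two lowest-AIC models.</p></div>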
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>Individual Models Predictive Performance. The wide range of data sets we collected offers an ideal test bed for evaluating the different models. As previously discussed, we evaluated models' out-of-sample predictive performance with three loss functions: prediction error, log-loss, and Brier score.</p><p>Figure <ref type="figure">3</ref> shows the models' prediction errors for gains and mixed gambles (for the results based on log-loss and Brier score, see Supplementary Figures <ref type="figure">S1</ref> and <ref type="figure">S2</ref>). For gains, the dual-systems model <ref type="bibr">(Loewenstein et al. 2015)</ref>  For mixed gambles, the leading models were two variants of cumulative prospect theory that treated gains and losses differently <ref type="bibr">(Lattimore et al. 1992</ref><ref type="bibr">, Prelec 1998</ref>). This result is consistent with earlier findings that the best variant of cumulative prospect theory has a power value function and Prelec's probability weighting function <ref type="bibr">(Stott 2006)</ref>. Other close competitors included the transfer of attention exchange model (TAX; Birnbaum 2008), odds-based subjective weighted utility theory (Karmarkar 1978), subjective expected utility theory <ref type="bibr">(Edwards 1954)</ref>, and the dual-systems model <ref type="bibr">(Loewenstein et al. 2015)</ref>.</p><p>Although the exact ranking of models in predictive performance varied with loss functions, the set of top-performing models was highly robust across loss functions. Figure <ref type="figure">4</ref> summarizes the similarities of model rankings across loss functions. The Spearman rank correlations between different loss functions were all above 0.94 for gains and all above 0.75 for mixed gambles, suggesting a high consistency across loss functions. 
The models' relative predictive performance in the two types of data sets was also highly consistent, with high Spearman rank correlations according to any of the three loss functions (see also Figure <ref type="figure">4</ref>). That said, we found that the overall log-losses, Brier scores, and prediction errors were higher for mixed gambles than for gains. For example, the lowest mean prediction error across participants achieved by a model was about 0.16 for gains, whereas its counterpart for mixed gambles was almost twice as high, at about 0.29. Nonetheless, it is premature to conclude that the models were less accurate in predicting choices in the mixed domain than in the gain domain, as other factors in the stimulus sets, such as the EV differences, might also lead to differential predictive performance.</p><p>Going beyond the mean performance of models across participants, we also examined the proportion of people for whom a certain model made the best out-of-sample predictions according to the three loss functions. This analysis revealed a high degree of heterogeneity in the best-performing models at the individual level (see Supplementary Figures <ref type="figure">S3-S5</ref> for the models' proportion of the best predictive performance across participants for the three loss functions). A large number of models accumulated at least 2% of the best predictive performance across participants and measures for both gains and mixed gamble data sets. This indicates that although some models achieved high predictive performance, on average, there was no unequivocal best-performing model on the individual level.</p><p>Psychological Mechanisms. To assess the relative value of the nine different psychological mechanisms characterizing the considered individual models, we compared the average prediction error of all the models that make use of a mechanism with that of models that do not. 
These results are summarized in Figure <ref type="figure">5</ref>. Our analysis suggests that payoff transformation is the most crucial psychological mechanism for improving our ability to predict risky choice; models using the payoff transformation mechanism had a prediction error of 0.17 with gains and a prediction error of 0.32 with mixed gambles, compared with  Another top-performing mechanism was probability transformation. Models using the probability transformation mechanism had prediction errors of 0.21 for gains and 0.33 for mixed gambles, compared with prediction errors of 0.30 (gains) and 0.37 (mixed gambles) for models without probability transformations (corresponding to a paired-sample Cohen's d of 2.05 for gains and 1.27 for mixed gambles). These differences can be seen in the light red and dark blue bars and points in Figures <ref type="figure">3</ref> and <ref type="figure">4</ref>. The attention, sampling, disappointment, and regret mechanisms often led to favorable prediction outcomes. The attention mechanism, for example, sometimes led to prediction gains comparable with those from payoff or probability transformation. The ranking, threshold, and dispersion mechanisms, by contrast, did not improve prediction outcomes. Models using these mechanisms performed, on average, worse than models without these mechanisms. Of course, there was substantial interindividual variability. Even the modestly or poorly performing mechanisms describe a considerable number of individuals well.</p><p>Model Crowds Predictive Performance. How well did model crowds do in comparison with the individual models? To answer this question, we first compared the model crowds to the individual models that provided the best average performance across participants using the loss function scores in the test data. 
As shown in Figure <ref type="figure">6</ref>, the four performance-based model crowds (i.e., the select-5, select-10, weighted, and contribution crowds) outperformed all individual models and achieved the highest overall predictive performance.</p><p>Here, the individual model performance metric labeled as "Aggregate best" is the best-performing individual model in aggregate as in Figure <ref type="figure">3</ref> and Supplementary Figures <ref type="figure">S1</ref> and <ref type="figure">S2</ref>, depending on the loss function implemented (later on, we examine individual model performance with an alternative metric, labeled "Training-contingent"). This pattern was true for both gains and mixed gambles. Although the na&#239;ve crowd did not outperform all the individual models, it still surpassed a large majority of them, lagging behind only a few individual models (implying that it  <ref type="formula">2014</ref>)). Overall, the model crowd approach can improve out-of-sample predictive performance. This was especially true when the aggregation strategies assigned larger weights to models that performed better at the training stage (as in the select and weighted crowds) or when aggregation strategies leveraged each individual model's unique strength in predicting choice behavior (as in the contribution crowd). It is noteworthy that the advantage of model crowds over individual models was robust across all loss functions and was even more pronounced with the probabilistic loss functions (i.e., log-loss and Brier score), which are inherently more sensitive to continuous model predictions.</p><p>Not only did model crowds have better average performance across participants, they also made better predictions for a majority of participants when compared with the best-performing individual models. 
The rows labeled "Aggregate best" in Table <ref type="table">2</ref> show the proportion of participants for whom a model crowd made better predictions than the best individual models using the various loss functions on the test data. As can be seen here, with log-loss as the loss function, the contribution crowd made better predictions than the best individual model (which was CPT with Prelec's probability weighting function) for 72% of the participants in the gains data sets. In the mixed gambles data sets, the number went up to 88%. These patterns were robust to different loss functions, as well as different model crowds. The only exception was the na&#239;ve crowd for the gains data sets. The na&#239;ve crowd did not make better predictions for a majority of participants when compared with the best individual models in gains. Its average predictive performance was also inferior to the best-performing individual models (see <ref type="bibr">Figure 5)</ref>, suggesting that in risky choice, considering different models as equally valid in the model crowd may not be the best way to leverage the collective wisdom of individual models.</p><p>Note that the model crowds weigh the different models based on their performance in the training data. This allows them to flexibly identify best-performing models (in the training set) for each individual and use these models to make predictions. Thus, the specific weighting scheme used by a particular model crowd varies across individuals. It could be this flexibility, rather than crowd wisdom, that results in the better performance of model crowds in Figure <ref type="figure">5</ref> and Table <ref type="table">2</ref>. To ensure that this is not the case, we contrasted our model crowd predictions with those of individual models with the same type of flexibility. 
This was done with an approach that flexibly paired each individual participant, at every split in the 10-fold cross-validation, with the best-performing model in the individual's training set, evaluated using the Akaike information criterion. This training-contingent algorithm can be seen as a select crowd with only the most promising model included (corresponding to a select-1 crowd). We then used this training-contingent model to make predictions in the test sets for each participant at every split, and we evaluated its out-of-sample predictive performance with the above-mentioned loss functions. The results of this analysis are shown in the "Training-contingent" bars in Figure <ref type="figure">6</ref> and rows labeled "Training-contingent" in Table <ref type="table">2</ref>. Here, we can see that the training-contingent approach actually reduces predictive performance relative to the fixed individual models that provide the best predictions across participants (likely because of the high variance of this algorithm). Thus, the advantage of model crowds over individual models cannot be attributed to their flexibility in identifying the best individual model in a particular split of the training data. Rather, it is likely because of crowd wisdom, which exploits the complementarities of the different models that make up the crowd. Finally, the four performance-based model crowds had an obvious advantage over the na&#239;ve crowd that treated all individual models as equally valid; performance-based crowds can leverage the differential predictive power of individual models. The four performance-based model crowds achieved roughly the same predictive ability with prediction error and Brier score as loss functions. However, with log-loss as the loss function, the contribution crowd tended to provide the best overall predictive performance for both gains and mixed gambles (see Figure <ref type="figure">6</ref>). 
Our analysis of model weights in the next section will unpack potential causes of the contribution crowd's superior predictive performance. The historical analysis that follows also illustrates an additional strength of the contribution crowd: it can successfully aggregate model predictions regardless of the number and performance variance of the models present in the model pool. This will become apparent when looking at historical time windows where low-performing models are overrepresented (see Figure <ref type="figure">9</ref> for more details).</p><p>Weights in Model Crowds. The distribution of model weights differed across model crowds. To examine this, we calculated for each participant a measure of weight dispersion in each model crowd using the Gini coefficient, a canonical measure of dispersion and inequality in distributions. If all weights concentrate on one single model, the Gini coefficient will be 1, meaning that there is a minimal amount of dispersion. By contrast, in the na&#239;ve crowd where each model receives an equal weight, we have a Gini coefficient of 0, meaning a maximal amount of dispersion. The dispersion of model weights in the four performance-based crowds lies in between. As shown in Figure <ref type="figure">7</ref>, the select and weighted crowds mostly concentrate on a few models, with mean Gini coefficients between 0.8 and 0.9, whereas the contribution crowd strikes a more balanced dispersion of model weights, with Gini coefficients of about 0.5.</p><p>The success of model crowds can also be understood in terms of the distribution of model weights. Model crowds trade off between assigning larger weights to the best-performing individual models and hedging their bets by dispersing the weights more across different models <ref type="bibr">(Davis-Stober et al. 2014</ref><ref type="bibr">, M&#252;ller-Trede et al. 2017)</ref>. 
The training-contingent model (which would correspond to the select-1 crowd) and the na&#239;ve crowd represent two boundary solutions to this trade-off. The former goes all in and adopts the prediction of the best-performing model in the training set (leading to a Gini coefficient of 1), whereas the latter is maximally diverse and unconditionally averages the predictions of all the models, regardless of model performance in the training set (leading to a Gini coefficient of 0). Yet, as shown in Figure <ref type="figure">6</ref>, neither of the two boundary solutions performed as well as the four performance-based model crowds, which struck a more balanced distribution of model weights. The contribution crowd, in particular, was the best-performing model in many of our tests using probabilistic loss functions in both gains and mixed gambles. This result indicates that good crowd solutions may, in fact, leverage a quite diverse crowd of models (e.g., a Gini coefficient of about 0.5 for the contribution crowd; also see <ref type="bibr">Hong and Page (2004)</ref> and <ref type="bibr">Lamberson and Page (2012)</ref>).</p><p>The weights assigned to individual models in the model crowds also allowed us to measure the degree to which different models contributed to the crowd predictions. Figure <ref type="figure">8</ref> displays the weight each individual model received in the contribution crowd, the performance-based crowd that makes the best use of model diversity (see Supplementary Figures <ref type="figure">S6-S8</ref> for select and weighted crowds). As expected, the top contributors were often the models that did very well in the individual model comparison. Moreover, the fact that all models (except the random model) made nonzero contributions in both the gains and the mixed data sets indicates that each existing model captures some unique features of choice behavior. </p></div>
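<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Gini coefficient used above to summarize weight dispersion can be computed as in the following minimal Python sketch. The function name and the optional small-sample normalization are assumptions for illustration; the text does not state which variant of the coefficient was used. Without normalization, a crowd that places all weight on one of n models yields (n - 1)/n rather than exactly 1.</p>
<p><![CDATA[
```python
def gini(weights, normalized=False):
    """Gini coefficient of a non-negative weight vector.

    Equal weights give 0. With normalized=True, a single nonzero weight gives
    exactly 1; otherwise it gives (n - 1) / n.
    """
    n = len(weights)
    total = sum(weights)
    if n < 2 or total <= 0:
        raise ValueError("need at least two weights with a positive sum")
    xs = sorted(weights)
    # Standard formula on sorted values: sum_i (2i - n - 1) * x_i, with i = 1..n.
    numerator = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs))
    denominator = (n - 1 if normalized else n) * total
    return numerator / denominator


# Hypothetical weight vectors for a pool of 58 models.
naive = [1.0 / 58] * 58                 # equal weights
select5 = [0.2] * 5 + [0.0] * 53        # equal weights on five models
select1 = [1.0] + [0.0] * 57            # all weight on one model

print(gini(naive), gini(select5), gini(select1, normalized=True))
```
]]></p>
<p>For equal weights on k of n models, the unnormalized coefficient equals 1 - k/n, which is a convenient check on any implementation.</p></div>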
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Historical Trends</head><p>We also evaluated how predictive accuracy of risky decision models evolved historically. Our historical analysis started from the year 1950, at which point there were only three models (the baseline random model, expected value theory, and expected utility theory), and extended until the year of 2018, at which point, all the models involved in the current analysis had been published. We first evaluated the performance of the best-performing individual model at each point in time. This has been relatively stable over most of the historical timeline. For gains, expected utility theory was the best-performing model before 1950, until the advent of SEU <ref type="bibr">(Edwards 1954)</ref>. Afterward, there were minor improvements in predictive accuracy with the odds-based subjective weighted utility theory formulated by <ref type="bibr">Karmarkar (1978)</ref>, prospective reference theory <ref type="bibr">(Viscusi 1989)</ref>, two variants of cumulative prospect theory <ref type="bibr">(Lattimore et al. 1992</ref><ref type="bibr">, Prelec 1998)</ref>, and the dual-systems model <ref type="bibr">(Loewenstein et al. 2015)</ref>, depending on the loss function implemented (see Figure <ref type="figure">9</ref>).</p><p>For mixed gambles, expected utility theory was also the best model at the beginning. However, soon it was supplanted by portfolio theory <ref type="bibr">(Markowitz 1952</ref>) and then again by subjective expected utility theory, which led to a big leap in predictive performance. This model remained the best-performing model for more than two decades until odds-based subjective weighted utility theory was introduced. A significant historical leap came with the introduction of models that treated gains and losses differently, such as cumulative prospect theory and the TAX <ref type="bibr">(Birnbaum 2008)</ref>. 
Again, there were minor differences based on the specific loss function used.</p><p>We also evaluated the performance of our model crowds at each historical time point. As can be seen in Figure <ref type="figure">8</ref>, the contribution crowd outperformed the best individual model available for nearly all time points (regardless of the number and the composition of models involved in the crowd). This was true for both gains and mixed gambles, and the advantage of the contribution crowd was even more pronounced for mixed gambles. These results again show that aggregation algorithms that successfully exploit each model's strength in predicting idiosyncratic individual-level data, while hedging their bets across different models, can reliably predict risky choice behavior better than individual models.</p><p>Notes. For the model crowds, all models formulated up to the year reported on the x axis were used to calculate the crowd predictions for that time point. CPT, cumulative prospect theory; EU, expected utility; Portfolio (VAR), portfolio theory with variance; PRT, prospective reference theory; SEU, subjective expected utility; SWU, subjective weighted utility; TAX, transfer of attention exchange. For the full list of model abbreviations, see Table <ref type="table">A</ref>.1 of the supplementary appendix.</p><p>Model crowds other than the contribution crowd showed slightly different patterns (the historical trends of all model crowds can be found in Supplementary Figure <ref type="figure">S9</ref>). These crowds underperformed in early time periods, when only a limited number of models were available. This was especially the case for the na&#239;ve and select crowds, which assigned equal weights to all models or to subsets of models used in the crowd. The weighted crowd was more robust to the effects of small crowds. 
However, with the introduction of newer behavioral models, the performance of model crowds greatly improved, and all model crowds outperformed the best individual model from the 1970s onward. Overall, although able to leverage crowd wisdom, the select and weighted crowds appeared to be more sensitive to the composition of the model pool than did the contribution crowd.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Psychological Mechanisms in Model Crowds</head><p>Not only can model crowds improve performance; they can also be used to assess the relative importance of different psychological mechanisms for the study of risky choice. The first way to achieve this is to evaluate the average weights of models that have a specific mechanism and compare them with the average weights of models that do not have this mechanism. This analysis reveals that models with payoff transformation have much larger weight in the contribution crowd than models without it. The difference in average weights was pronounced for models with the attention, probability transformation, and disappointment mechanisms and moderate for the sampling and regret mechanisms (Figure <ref type="figure">10</ref>). By contrast, the average weights of models with the threshold, ranking, and dispersion mechanisms are lower than those of models that do not have these mechanisms.</p><p>A second approach to assess the relative impact of each of the nine psychological mechanisms on prediction is to remove all the models that involve the psychological mechanism from the contribution crowd (removing mechanisms one at a time). This is a process similar to the historical analysis of the contribution crowd (see the dotted blue line in Figure <ref type="figure">9</ref>), but this time, models are filtered at the mechanism level. As shown in Figure <ref type="figure">11</ref>, removing models with payoff transformation substantially increased the prediction error in the contribution crowd, compared with the model crowds using all the models as in Figure <ref type="figure">5</ref>. The same happened, but to a lesser extent, when models using the probability transformation mechanism were removed. By contrast, removing any other psychological mechanism did not appear to significantly influence the crowd's predictive performance. 
The results from both these analyses are largely consistent with the individual-level analysis of the different mechanisms, which identified payoff and probability transformations as the two most important mechanisms for improving predictive performance, followed by attention, sampling, disappointment, and regret.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impact of Experimental Designs</head><p>Finally, we ran a sensitivity analysis by examining the extent to which the results varied across data sets. The relative rank of models' prediction errors was highly consistent across data sets, with high Spearman rank correlation &#961; (median 0.88, mean 0.90). This suggests that there is converging evidence across data sets with regard to the models' predictive performance. The same holds true when considering the contribution of different mechanisms to predictive performance. To further bolster this point, we calculated a paired-sample Cohen's d for each mechanism by comparing the average predictive performance of models that used the mechanism to that of models that did not use the mechanism, for each data set. This is shown in Figure <ref type="figure">12</ref>, which displays the distribution of Cohen's d across data sets using different loss functions. Mechanisms such as payoff transformation, probability transformation, attention, sampling, disappointment, and regret reliably boost predictive performance across data sets for gains, whereas threshold, ranking, and dispersion show ambiguous patterns. The patterns for mixed gambles were similar, except that the disappointment and regret mechanisms became less productive for mixed gambles than for gains. The predictive advantage of model crowds over individual models also persisted in 13 out of the 19 data sets (68.4%), even if we allowed each data set to be paired with its own best-performing individual model. Overall, the results discussed in the previous sections were corroborated across most data sets. Yet there were still small differences that can be attributed to specific experimental designs. This can be in part seen in the distribution of Cohen's d in Figure <ref type="figure">12</ref>. 
To further illustrate this, we map out the data sets on a two-dimensional plane using multidimensional scaling (MDS). Specifically, we calculated a Spearman rank correlation &#961; between each pair of data sets in terms of the models' predictive performance and then used 1 -&#961; as the distance measure for MDS. As shown in Figure <ref type="figure">13</ref>, there appears to be a gap between gains and mixed-gamble data sets. This gap is largely driven by manually created data sets with one nonzero branch in the choice (i.e., <ref type="bibr">Stewart et al. 2015 [SRH15]</ref> and <ref type="bibr">Stewart et al. 2016 [SHM16]</ref>) and does not appear with data sets that involved randomly generated gambles involving two nonzero branches. The gain data sets that have two nonzero branches (i.e., <ref type="bibr">Fiedler and Gl&#246;ckner 2012 [FG12]</ref> and <ref type="bibr">Rieskamp 2008 [Rieskamp08]</ref>) are closer to the mixed-gamble data sets (which involve randomly generated gambles and have two nonzero branches in the gamble) than to the gain data sets with manually created one-branch gambles.</p><p>Among the mixed-gamble data sets, there was a notable difference between data sets from experiment 2 of <ref type="bibr">Pachur et al. (2018)</ref> and others. Specifically, in the experiment 2 data sets of <ref type="bibr">Pachur et al. (2018)</ref>, heuristic models such as the better-than-average heuristic, the minimax regret heuristic, and the equiprobable heuristic <ref type="bibr">(Thorngate 1980</ref>) performed reasonably well, whereas in other data sets the same heuristic models predicted poorly. This is likely because the design of the stimuli was favorable to these heuristic models. For example, some choice items in this experiment were simply rejecting/accepting a gamble with equal odds of winning and losing. For such items, heuristic models such as the equiprobable heuristic can mimic many utility-maximizing models while being more parsimonious. 
This analysis reveals that although the results are remarkably stable across data sets, specific design choices may still have an impact on the relative performance of different decision models.</p></div>
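<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distance measure used for the multidimensional scaling analysis above (1 minus the Spearman rank correlation between two data sets' model performance profiles) can be sketched in plain Python as follows. The function names and the toy prediction errors are hypothetical, and the sketch stops at the distance matrix, which could then be passed to any MDS routine that accepts precomputed dissimilarities.</p>
<p><![CDATA[
```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_distances(errors_by_dataset):
    """Pairwise 1 - rho distances between data sets' model-error profiles."""
    names = sorted(errors_by_dataset)
    matrix = [[1.0 - spearman(errors_by_dataset[a], errors_by_dataset[b])
               for b in names] for a in names]
    return names, matrix


# Hypothetical prediction errors of four models in three data sets.
errors = {
    "A": [0.20, 0.30, 0.40, 0.50],
    "B": [0.25, 0.35, 0.45, 0.60],   # same model ranking as A
    "C": [0.50, 0.40, 0.30, 0.20],   # reversed model ranking
}
names, dist = spearman_distances(errors)
print(names)
print(dist)
```
]]></p>
<p>Data sets that rank the models identically end up at distance 0, and data sets with fully reversed rankings end up at the maximum distance of 2.</p></div>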
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>Notes. The labels of the data sets can be found in the "Abbreviation" column of Table <ref type="table">1</ref>. D1 and D2 represent the two dimensions of the multidimensional scaling solution, respectively.</p><p>For several decades, researchers have been searching for a model to describe and explain risky choice. This effort has resulted in dozens of mathematically distinct models that have their origins in several scientific disciplines. Yet there has been little consensus with regard to the state of the art in terms of predictive or descriptive performance; different papers often compare model performance on different data sets and assess the performance of only small subsets of "rival" models. Different models are commonly seen as competitors, with the success of one model undermining other theoretical accounts. As things stand, it is hard to assess the accumulated wisdom on risky choice from a decades-long multidisciplinary research endeavor.</p><p>Our article hopes to address some of these issues using a very large-scale model comparison. For this comparison we compiled a panel of 58 existing risky choice models and compared their performance using a comprehensive test bed of 19 existing risky choice data sets that involved over 800 participants. Furthermore, drawing on insights from the wisdom-of-crowds literature, we tested the predictions of model crowds that aggregate the predictions of individual models.</p><p>This analysis uncovered a number of novel results regarding the predictive potential of risky choice models and model crowds. First, the best-performing models fell into the category of nonexpected utility theories and were often variants of prospect theory with both nonlinear transformations of payoffs and probabilities <ref type="bibr">(Edwards 1955</ref><ref type="bibr">, Karmarkar 1978</ref><ref type="bibr">, Lattimore et al. 
1992</ref><ref type="bibr">, Prelec 1998)</ref>. Other models such as the dual-systems model <ref type="bibr">(Loewenstein et al. 2015)</ref> and TAX <ref type="bibr">(Birnbaum 2008</ref>) also made good predictions. It is worth noting that there was substantial individual-level variability, and most models did well for at least a few participants. Second, model crowds, and especially crowds that wisely leveraged the diversity of the model pool, substantially improved predictions across different measures and in most data sets. Model crowds also provided a novel quantitative methodology for tracking historical accumulation of knowledge and for identifying key psychological mechanisms in risky choice modeling. Finally, the large number of data sets allowed us to examine the important yet elusive impact of experimental designs on model selection. Although model performance strongly correlated across different data sets, our exploratory analysis revealed that design choices such as gains versus mixed gambles, randomly generated versus manually curated items, and one-branch versus two-branch gambles moderated, to some degree, the relative performance of the competing models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Predictive Power of Model Crowds</head><p>Human behavior is highly idiosyncratic such that a model that works well for one individual may do poorly for another. Furthermore, the decision rules that guide choice are inherently noisy, reflecting fluctuations in various cognitive, affective, and contextual variables <ref type="bibr">(Busemeyer and</ref> <ref type="bibr">Townsend 1993, Bhatia and</ref> <ref type="bibr">Loomes 2017)</ref>. People may also switch between decision rules depending on the nature of the decision-making problem at hand, a behavior commonly referred to as strategy selection <ref type="bibr">(Payne et al. 1988, Lieder and</ref> <ref type="bibr">Griffiths 2017)</ref>. For example, they may rely on utility maximization in some problems but switch to heuristics in others. Alternatively, different strategies may even interact in a single choice problem, a phenomenon commonly referred to as strategy blending (see <ref type="bibr">Erickson and Kruschke (1998)</ref>, <ref type="bibr">Plonsky et al. (2017), and</ref> <ref type="bibr">Herzog and</ref> <ref type="bibr">von Helversen (2018)</ref>). Thus, trying to identify the one individual model that people use might not be the most productive approach when we want to predict people's behavior.</p><p>The model crowd approach outlined in this article seeks to accommodate a multitude of models that take diverse theoretical perspectives (for a similar take on the social sciences, see <ref type="bibr">Smaldino (2017)</ref>). Of course, we do not assume that decision makers deliberate exactly like our model crowds. Rather, model crowds allow researchers to approximate the diversity of human mental processes, which results in improved performance. 
Indeed, the best-performing model in much of our analysis was the contribution crowd, a crowd model that relies on considerable model diversity, as assessed by the dispersion of the model weight vector.</p><p>The model crowds' superior predictive performance can also be understood in terms of the bias-variance trade-off (see <ref type="bibr">Geman et al. (1992)</ref> and <ref type="bibr">Gigerenzer and Brighton (2009)</ref>). The total error of predictive models in machine learning, statistics, and cognitive science can be decomposed into three error components: bias, variance, and irreducible noise. Model crowds (or ensembles in machine learning) drastically reduce the variance component of prediction error <ref type="bibr">(Breiman 1996)</ref>, thereby improving overall prediction. This is especially the case in the presence of interindividual variability, as in our data sets. Going with the single best-performing model would lead to good performance if we were able to identify the right model for each individual ahead of time. However, matching an individual to the best-performing model is a challenging problem <ref type="bibr">(Davis-Stober et al. 2014)</ref>. This is clearly illustrated in the case of the training-contingent model: simply selecting the best-performing model in the training set often does not lead to the best predictions in the test set. Although this approach scores low on bias, it suffers from high variance, as it is sensitive to the specific sample that was used to find the best-performing model. The model crowds strike a much better balance on the bias-variance trade-off by substantially reducing the risk of going all in on a single model without putting much (or any) weight on others (see the supplement of <ref type="bibr">Analytis et al. (2018)</ref> for further discussion). 
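As a rough sketch of this argument, under assumptions introduced here only for illustration (squared-error loss, an outcome <formula notation="tex">y = f(x) + \varepsilon</formula> with noise variance <formula notation="tex">\sigma_\varepsilon^2</formula>, and an equally weighted crowd of <formula notation="tex">M</formula> models whose predictions share a common variance <formula notation="tex">\sigma^2</formula> and an average pairwise correlation <formula notation="tex">\rho</formula>), the expected error of a predictor <formula notation="tex">\hat{f}</formula> decomposes as <formula notation="tex">\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \operatorname{Var}[\hat{f}(x)] + \sigma_\varepsilon^2</formula>, and the variance of the crowd average is <formula notation="tex">\operatorname{Var}\Big[\tfrac{1}{M}\sum_{m=1}^{M}\hat{f}_m(x)\Big] = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2</formula>. These symbols and assumptions are hypothetical and do not correspond to the performance measures or weighting schemes used in our analysis; they only indicate why the variance term shrinks toward <formula notation="tex">\rho\,\sigma^2</formula> as more weakly correlated models are added, whereas the bias of an equally weighted crowd is simply the average of the individual biases. 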
In sum, model crowds provide a statistically reliable approach that captures interindividual variability in risky choice.</p><p>There are similar successful applications that harness collective model wisdom in decision analysis. <ref type="bibr">Scheibehenne et al. (2013)</ref>, for example, formulate the metaphor of the heuristic toolbox in a hierarchical Bayesian framework and show that by incorporating multiple heuristics, the toolbox explains behavioral data better than a single heuristic. Another example is recent risky choice prediction competitions, in which researchers were provided with ample training data and were challenged to develop new modeling approaches or to apply existing behavioral or machine learning models that could predict the proportion of people making a risky choice in a held-out test set. The most successful models in these competitions were ensembles or hybrid models encompassing insights from several decision strategies <ref type="bibr">(Erev et al. 2010</ref><ref type="bibr">, 2017;</ref> <ref type="bibr">Plonsky et al. 2017)</ref>. Our paper extends this line of analysis by including all previously proposed risky choice models that can be translated to computer code, generating predictions at the individual level, and using them as elements to construct model crowds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Historical Perspective</head><p>Our framework also provides a historical window onto the evolution of the field of risky choice modeling, from the axiomatization of expected utility theory by von Neumann and Morgenstern to the sophisticated behavioral models of the present day. We find an increase in the rate at which new models have been introduced into the pool of available models but diminishing returns in overall predictive accuracy. In fact, for some measures and data sets, it has been more than a decade since a new model has outperformed the best-performing model up to that point. A different picture emerges when we look at crowd models: instead of a stagnating field, we see a field with rapid improvement and continual progress. The introduction of new models adds to our collective ability to predict risky choices across measures for both gains and mixed gambles.</p><p>We can also use our framework to look at the importance of specific models over time. Prospect theory, arguably the most prominent behavioral decision model, performed only modestly for both gains and mixed gambles in our model comparison. In fact, subjective expected utility, formulated by Edwards in 1954, and odds-based subjective weighted utility, published by <ref type="bibr">Karmarkar in 1978</ref>, almost contemporaneously with prospect theory, outperformed prospect theory in terms of predictive performance both for gains and for mixed gambles. These models share their core assumptions with prospect theory (such as nonlinear transformations of both payoffs and probabilities). Nonetheless, models that were later derived from prospect theory outperformed these earlier models and were often among the top contestants. This is especially true for mixed gambles, in which models derived from prospect theory excel because of the assumptions of loss aversion and differential probability weights for gains and losses. 
Thus, although the original prospect theory was never historically the best-performing model, the new concepts that were introduced in the field with prospect theory had a long-lasting impact and eventually led to improvements in our ability to predict risky choices. This is a common motif in science: ideas need to be further refined and elaborated on to reach their full potential.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Promising Psychological Mechanisms</head><p>Our large-scale model comparisons identified the psychological mechanisms that yield good predictive performance in risky choice. Payoff transformation was by far the most important mechanism, followed by probability transformation. Thus, it comes as no surprise that the best-performing models, including the variants of prospect theory previously mentioned, fall into the category of nonexpected utility theories, which use some nonlinear function to transform crude payoffs to subjective values and, in addition, transform objective probabilities to subjective probabilities. This pattern emerges in both gains and mixed gambles. Payoff transformation, in particular, always improved model performance, regardless of the other model characteristics involved. The payoff and probability transformation mechanisms were followed by the attention, sampling, disappointment, and regret mechanisms. These mechanisms have been used often in recent years and show some promise in their potential to improve our ability to predict risky choice, especially when combined with payoff and probability transforms.</p><p>Results on the relative importance of different psychological mechanisms for prediction were replicated in model crowds. Specifically, we tested the predictive value of different psychological mechanisms (i) using the weights in model crowds and (ii) by removing all the models using a mechanism from the contribution crowd. Once again, subjective payoff and subjective probability transformation mechanisms stood out as key mechanisms for improving predictive performance in risky choice. Predictive performance dropped substantially when models using these mechanisms were removed from the contribution crowd. 
The value of other mechanisms is more modest, but our analysis of average weights in model crowds suggests that it is always nonnegligible, especially for models using the attention, sampling, and disappointment mechanisms. That said, it is important to note that the exclusion of each of these individual mechanisms from the crowd did not substantially impact performance. Unlike payoff and probability transformation, the predictions of the individual attention, sampling, and disappointment mechanisms can be mimicked by a combination of other psychological mechanisms.</p><p>Using the models in their original forms, our psychological mechanism analysis reflects each mechanism's contribution to the research enterprise of risky decision making. However, because the co-occurrence of psychological mechanisms was not systematically varied, we were unable to disentangle their contributions independent of the historical context. Although beyond the scope of our paper, such an analysis may be possible using more sophisticated techniques. For example, in the domain of multialternative multiattribute choice, <ref type="bibr">Turner et al. (2018)</ref> use the switchboard technique, where each mechanism can be turned to different states (e.g., ON or OFF) to create compositional models and thus systematically investigate the core psychological mechanisms underlying multiattribute choice. Such techniques can potentially provide even better estimates of the contributions of different mechanisms in the domain of risky choice, and we hope that our work will inspire further research on this topic in the near future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Future Work</head><p>We have attempted a large-scale test of different risky decision models (and their corresponding psychological mechanisms) on a diverse set of experimental data sets. This approach is increasingly necessary given the growth of risky decision-making research over the past few decades. Our model crowd approach also provides a promising way to model and analyze choice behavior by synthesizing the insights generated by dozens of models and allows us to quantitatively track the historical evolution of the field. The statistician George Box wittily proclaimed that "all models are wrong but some are useful" <ref type="bibr">(Box 1979, p. 202)</ref>. With the contribution crowd, we can evaluate models, either new or old, with regard to their unique contribution to the crowd, thus identifying which models are "useful" and to what degree.</p><p>Although the results of this analysis are largely robust across different data sets, and for different stimulus samples, in some cases the strength of certain models did depend on the design of gambles. An example is experiment 2 of <ref type="bibr">Pachur et al. (2018)</ref>, whose design choice favored heuristic models such as the better-than-average, minimax-regret, and equiprobable heuristics. These heuristic strategies do not involve the essential mechanisms of payoff and probability transformation and are thus unlikely to generalize to other settings (such as those involving randomly generated gambles). Future work can use our paradigm to better understand the effect of design choice on model behavior and model discrimination (see <ref type="bibr">Navarro et al. (2004)</ref>; <ref type="bibr">Wagenmakers et al. (2004)</ref>; <ref type="bibr">Myung and Pitt (2009)</ref>; <ref type="bibr">Cavagnaro et al. (2013</ref><ref type="bibr">, 2016)</ref>; and <ref type="bibr">He et al. (2020)</ref> for additional discussions). 
Additionally, although we have used an extremely large set of existing data sets to analyze model performance, all decision problems used in our analysis involve binary two-branch risky choices with full information. In the future, our approach could be extended to additional data sets or types of problems. For example, researchers could examine how models perform in settings where more than two risky options are available <ref type="bibr">(Venkatraman et al. 2014)</ref> or when the decisions are made under ambiguity (e.g., <ref type="bibr">Ellsberg 1961)</ref>.</p><p>We have addressed the theoretical and methodological challenges involved in modeling risky choice. These challenges are also common in other domains of decision research, such as intertemporal choice, decisions from experience, multiattribute choice, social decision making, and strategic decision making (e.g., <ref type="bibr">Frederick et al. 2002</ref><ref type="bibr">, Hertwig et al. 2004</ref><ref type="bibr">, Herzog and von Helversen 2018</ref><ref type="bibr">, and Golman et al. 2020)</ref>. In each of these domains there are a large number of competing behavioral models and preexisting experimental data sets with considerable individual-level data. The large-scale model evaluation and model crowd approaches showcased in this paper can also be used to assess the state of the art in these areas and to further improve researchers' ability to predict people's choices. We look forward to future work that builds on the numerous existing theories and extensive empirical data in decision science, in order to provide a cumulative, transdisciplinary perspective on human choice behavior.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Management Science, Articles in Advance, pp. 1-26, &#169; 2021 INFORMS</p></note>
		</body>
		</text>
</TEI>
