<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Value‐Based Model Selection Approach for Environmental Random Variables</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10095048</idno>
					<idno type="doi">10.1029/2018WR023000</idno>
					<title level='j'>Water Resources Research</title>
<idno>0043-1397</idno>
<biblScope unit="volume">55</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>M. F. Müller</author><author>S. E. Thompson</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Goodness-of-fit metrics do not reliably indicate the added value of model predictions to environmental decisions. Sampling uncertainty on validation data affects the reliability of model evaluation results. The proposed approach assigns value to prediction uncertainties by estimating the costs of ensuing suboptimal decisions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Many environmental variables are best described as stochastic processes varying randomly through time. The probability distributions of these random variables, and our ability to predict them, have important societal implications. For instance, long-term investments in mitigation efforts for natural hazards are typically made based on the predicted probability of such hazards occurring <ref type="bibr">[Rose et al., 2007]</ref>; power infrastructure design is conditioned on the probability distributions of wind or streamflow <ref type="bibr">[Carta et al., 2009;</ref><ref type="bibr">Basso and Botter, 2012]</ref>; and environmental regulation to protect ecosystem services is often determined by streamflow probability distributions [e.g., <ref type="bibr">Arthington et al., 2006]</ref>. Over shorter time horizons, probabilistic forecasts of weather or streamflow inform day-to-day operational decisions related to irrigation, reservoir releases, water navigation and so on <ref type="bibr">[Krzysztofowicz, 2001]</ref>. (Note that we use the terms "prediction," which is commonly applied to long-term, usually irreversible, decisions, and "forecast," which is often applied to day-to-day operational decisions, interchangeably in this paper because the theoretical arguments presented are valid for both.)</p><p>Empirical knowledge about the probability density or cumulative distribution functions (PDFs or CDFs) of many environmental variables is imperfect, because rare events cannot be systematically observed, motivating the use of models to predict stochastic variables and inform decisions. Although the complexity and accuracy of such prediction models is increasing [e.g., <ref type="bibr">Maxwell et al., 2014]</ref>, the availability of in situ observations needed to parameterize and validate complex models may constrain their use. 
Alternatively, a growing body of work attempts to predict streamflow CDFs with limited sampling data [e.g., <ref type="bibr">Botter et al., 2008;</ref><ref type="bibr">M&#252;ller et al., 2014;</ref><ref type="bibr">Muneepeerakul et al., 2010;</ref><ref type="bibr">Castellarin et al., 2004a;</ref><ref type="bibr">Doulatyari et al., 2015;</ref><ref type="bibr">Dralle et al., 2016;</ref><ref type="bibr">Mej&#237;a et al., 2014;</ref><ref type="bibr">Dralle et al., 2017]</ref>. The dichotomy between highly resolved but data-hungry models and simpler models needing less input information frames an urgent model evaluation problem in hydrology and beyond <ref type="bibr">[Jakeman et al., 2006]</ref>.</p><p>Model evaluation traditionally relies on summed squared error (SSE) metrics, including the Nash-Sutcliffe Efficiency (NSE) <ref type="bibr">[Nash and Sutcliffe, 1970]</ref> and many others [see e.g., <ref type="bibr">M&#252;ller and Thompson, 2016;</ref><ref type="bibr">Castellarin et al., 2004b</ref>, for applications to streamflow CDFs]. SSE metrics are easy to implement and quantify the extent to which a model can be reconciled with observations in terms of its statistical efficiency (i.e., its relative distance to the Cram&#233;r-Rao lower bound). This knowledge is critical to improve process understanding, generate and test hypotheses, and calibrate, validate and diagnose models <ref type="bibr">[Gupta et al., 2008]</ref>, and a number of promising new approaches have recently been proposed, notably drawing on Bayesian inference [e.g., <ref type="bibr">Marshall et al., 2005;</ref><ref type="bibr">Thyer et al., 2009]</ref> and information theory [e.g., <ref type="bibr">Weijs et al., 2010;</ref><ref type="bibr">Gong et al., 2013;</ref><ref type="bibr">Nearing and Gupta, 2015]</ref> to evaluate probabilistic predictions where SSE metrics are problematic. 
These approaches do not, however, necessarily provide an evaluation metric that identifies which model will support the best decision-making by an end user. Specifically, the above approaches incur one or more of the following three issues when it comes to evaluating how well models support decisions. The first challenge relates to the interpretability of SSE-based metrics in particular, which convey little information about the risks or consequences of decision-making given the experienced model uncertainty <ref type="bibr">[Weijs et al., 2010]</ref>. Since decision-making is improved when decision-makers have information about the uncertainty of model predictions <ref type="bibr">[Ramos et al., 2013]</ref>, there would be value in adopting tangible measures of this uncertainty. The second issue arises from improper use of error metrics in model evaluation practice, where the effect of finite sample sizes on the reliability of the evaluation tends to be overlooked. For instance, flow frequency distribution models are often evaluated against the moments and/or particular quantiles of empirical distributions constructed using a finite sample of observations. Under these conditions, comparing model performances across sites without accounting for the sampling uncertainty on the validation dataset (e.g., comparing model performance in basins where 10 vs. 50 years of data are available to construct the validation CDF) is problematic and yet fairly common in practice [e.g., <ref type="bibr">M&#252;ller and Thompson, 2016;</ref><ref type="bibr">Castellarin et al., 2004b;</ref><ref type="bibr">Mohamoud, 2008;</ref><ref type="bibr">Li et al., 2010;</ref><ref type="bibr">Booker and Snelder, 2012]</ref>. Failure to account for validation sample sizes is particularly problematic for applications where sampling uncertainty is large, such as extreme events <ref type="bibr">[Pushpalatha et al., 2012]</ref>. 
The third issue is a fundamental limitation of model evaluation approaches that do not incorporate the intended use of the model. Good decisions often depend on correctly predicting a discrete value, such as the probability associated with a specific flow threshold. This critical threshold may be endogenous to the decision and a priori unknown, which precludes the use of SSE-related metrics such as skill scores or the Nash-Sutcliffe Efficiency. For example, in the context of run-of-river hydropower design (Sections 4.2 and 5.2), the critical flow threshold for model evaluation is itself determined by a design optimization that takes the modeled flow distribution as input.</p><p>Value-based metrics measure the marginal benefit that models provide to end-users by enabling better informed decisions. For instance, the Value of Perfect Information has been proposed by several studies as a metric to evaluate environmental model predictions <ref type="bibr">[Murphy, 1969;</ref><ref type="bibr">Epstein, 1969;</ref><ref type="bibr">Laio and Tamea, 2007]</ref>. The approach compares the outcome of a decision made using model predictions to an idealized situation where the decision is taken with perfect knowledge. Specifically, a cost function is assigned to prediction uncertainties, and the expected (monetary) outcomes of decisions made based on modeled vs. perfect information are compared: the lower the difference (defined as "expected costs" in <ref type="bibr">Laio and Tamea [2007]</ref>), the more valuable the forecast. Value of Perfect Information approaches address the endogeneity and interpretability problems discussed above. Because they explicitly account for cost (or utility), they are sensitive to the component of a model output used to make the decision at hand. 
Despite practical and epistemological limitations [see <ref type="bibr">Weijs et al., 2010</ref>, for an extensive discussion], they also have the merit of providing a transferable evaluation metric (monetary value) with a clear interpretation, making them arguably the most direct measure of the societal value of environmental models <ref type="bibr">[Laxminarayan and Macauley, 2012]</ref>. However, most currently used formulations for value-based metrics address the outcome of a (hypothetical) decision made with perfect information. As such they do not explicitly account for sampling uncertainty in validation data.</p><p>This study proposes the conditional value of sample information <ref type="bibr">(CVSI, [Schlaifer and Raiffa, 1961, p.89]</ref>) as an alternative value-based approach to evaluate models that predict continuous random variables. Unlike Value of Perfect Information approaches, CVSI uses a Bayesian framework to estimate the value of the information obtained from real observations with finite sample sizes. CVSI approaches have been extensively used in the water sciences to assist decisions, particularly related to the design of sampling networks and observation campaigns <ref type="bibr">[Feyen and Gorelick, 2005;</ref><ref type="bibr">James and Gorelick, 1994;</ref><ref type="bibr">Khader et al., 2013;</ref><ref type="bibr">Borisova et al., 2005;</ref><ref type="bibr">Bouma et al., 2009;</ref><ref type="bibr">Alfonso and Price, 2012]</ref> and investment in flood mitigation infrastructure <ref type="bibr">[Davis et al., 1972;</ref><ref type="bibr">Verkade and Werner, 2011;</ref><ref type="bibr">Roberts et al., 2009;</ref><ref type="bibr">Alfonso et al., 2016;</ref><ref type="bibr">Thiboult et al., 2017</ref>]. 
Yet to our knowledge, CVSI has not been used for value-based evaluation of model predictions.</p><p>Section 2 describes the model evaluation framework and discusses a critical distinction between the proposed metric, based on the original CVSI formulation by Schlaifer and <ref type="bibr">Raiffa [1961, p.89]</ref>, and the formulation used in most previous water studies. The approach is applied to a case study of flood mitigation in Section 3, where the effects of economic decision parameters and sample sizes are discussed. Section 4 uses CVSI for model calibration in the more complex case of the design of a streamflow diversion structure (e.g., supporting run-of-river hydropower), where the decision-relevant model outcome is endogenous. In this situation, where model predictions are used to optimize the capacity (a flow rate) of the structure, the model skill needs to be evaluated across a large and a priori unknown range of flow values that contains the optimal capacity to be determined. Section 5 discusses an important limitation of CVSI, namely that in its analytical form it can only be used to evaluate models that have identical parametric forms (for example, it would be useful in model calibration problems, but not for model selection across multiple possible model structures). We propose an alternative, modified metric that addresses this limitation and demonstrate its use for model selection in a cross-validation exercise. The examples presented in this study focus on modeling flow duration curves (i.e. the exceedance probability function of daily streamflow, obtained from its CDF), but the evaluation framework is applicable to any continuous random variable.</p><p>In a model evaluation context, model uncertainty results in sub-optimal decisions and a decrease in the associated utility <ref type="bibr">[e.g., Thiboult et al., 2017]</ref>. Utility is here used to designate the outcome the decision-maker wishes to maximize with their decision. 
The utility is often, though not necessarily, equivalent to net monetary profits. The magnitude of these utility losses is measured by the Conditional Value of Perfect Information (CVPI) <ref type="bibr">[Murphy, 1977;</ref><ref type="bibr">Laio and Tamea, 2007]</ref>, which diminishes as models improve. However, since perfect information is not available in real applications, the magnitude of losses due to model uncertainty must be estimated from finite data, subject to sampling uncertainty.</p><p>One avenue to achieving such estimates, illustrated in Figure <ref type="figure">1</ref>, is to (i) calibrate the model (the 'prior') to be validated and obtain flow estimates; (ii) use this prior information to optimize the decision; (iii) update the flow estimates with observed flow conditions, which correspond to actual flow conditions during the 'service time' of the decision; (iv) re-optimize the decision based on the updated flow conditions; and (v) estimate the change in utility associated with the additional observations included in step (iii). In this approach, the fitted model to be evaluated is the prior [not the posterior as in, e.g., <ref type="bibr">Gelman and Shalizi, 2013]</ref>. A prior model is fitted using a first set of observations, here referred to as a 'calibration set' in step (i). Bayesian updating is then used to confront the fitted prior model with observed evidence in step (iii). We refer to this second, independent, set of observations as the 'validation set' within the context of this paper. The value of the information associated with the validation data set in step (v) can be measured using two alternative metrics.</p><p>The first metric (CVSI* in Figure <ref type="figure">1</ref>) measures the change in optimization outcomes under prior and posterior conditions. In other words, it compares the outcome that decision makers anticipate achieving when optimizing, with and without the additional observations. 
This metric, and its expected value over all possible sets of collected samples, are widely used to determine the expected benefit of collecting additional observations <ref type="bibr">[Davis et al., 1972;</ref><ref type="bibr">Ades et al., 2004;</ref><ref type="bibr">Strong et al., 2015;</ref><ref type="bibr">Dakins et al., 1996;</ref><ref type="bibr">Feyen and Gorelick, 2005;</ref><ref type="bibr">James and Gorelick, 1994]</ref>. Using CVSI* to evaluate models, however, is problematic because it compares a decision optimized based on modelled information with one based on observed information. As such, CVSI* is affected by changes in the maximum value of the expected utility function, and not solely by model performance. For example, regardless of model performance, CVSI* will take a negative value if observed flows suggest that an optimal decision should be less ambitious and thus generate a lower utility than did the original model.</p><p>The second metric is the CVSI originally derived in Schlaifer and <ref type="bibr">Raiffa [1961, p.89]</ref>. It compares two different decisions through their terminal utility, that is, their outcome under (identical) posterior conditions. One decision is taken based on prior information alone, i.e. without accounting for the validation sample, which here represents the conditions encountered during the service period of the decision. Its outcome will be at most of equal value to that of an alternative (optimal) decision taken based on posterior information, i.e. with full access to all available information. The CVSI represents the change in terminal (i.e. posterior) utility between the two decisions. It is non-negative by construction and provides the metric for model evaluation. Subject to a particular decision and the cost structure associated with the decision, the smaller the CVSI, the better the model. 
This approach follows directly from a standard Bayesian framework, such that a larger and more representative validation sample leads to a more substantial update of the model predictions, and a more sizable differentiation between the prior and posterior decisions. We formalize CVSI in mathematical terms using the notation from Schlaifer and Raiffa <ref type="bibr">[1961]</ref>. Consider a decision a: this could be, for example, a binary decision (a &#8712; {0, 1}), such as whether or not to undertake a management action, or a continuous design choice (a &#8712; &#8477;&#8314;), such as the flow capacity of a diversion structure. The consequence, or outcome, of the decision is a utility that is typically (but not necessarily) defined as net monetary profits.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Observations</head><p>The utility is affected by both the decision taken and an environmental random variable &#952;, and is represented by a utility function u(a, &#952;), where &#952; denotes a realization of the random variable. The probability density function (PDF) describing &#952; is given by a prior model M and termed p M (&#952;). The optimal decision a*, given this prior, maximizes the expected utility taken over p M (&#952;):</p><p><formula notation="TeX">a^* = \arg\max_a \, \mathrm{E}_{\theta|M}\left[u(a, \theta)\right]</formula></p><p>where E &#952;|M u(a*, &#952;) designates the expected value (over the support of &#952;) of the utility that the decision-maker anticipates generating from decision a* under prior conditions p M (&#952;) given by model M. The predictive performance of the model M can be assessed by collecting a sample of validation data, which we here refer to as the validation set z. The validation set can be used to update the modeled PDF of &#952;, using Bayes' theorem:</p><p><formula notation="TeX">p_M(\theta \mid z) \propto p(z \mid \theta)\, p_M(\theta)</formula></p><p>Here p M (&#952; | z) designates the posterior distribution of &#952; given the validation set z and the prior model M, and a*_z designates the decision that maximizes the expected utility under this posterior.</p><p>The value of the information contained in the sample z, given the model M as prior belief, can be computed with two alternative expressions. The difference between the utilities anticipated by the decision maker under posterior and prior conditions,</p><p><formula notation="TeX">CVSI^*_M = \mathrm{E}_{\theta|M,z}\left[u(a^*_z, \theta)\right] - \mathrm{E}_{\theta|M}\left[u(a^*, \theta)\right]</formula></p><p>is the formulation used as the value of sample information in most prior studies [e.g., <ref type="bibr">Davis et al., 1972;</ref><ref type="bibr">Ades et al., 2004;</ref><ref type="bibr">Strong et al., 2015;</ref><ref type="bibr">Dakins et al., 1996;</ref><ref type="bibr">Feyen and Gorelick, 2005;</ref><ref type="bibr">James and Gorelick, 1994]</ref>. Note that CVSI* M can be negative if z causes the value of the utility function to decrease, despite the reduced uncertainty. 
To avoid this confusion and isolate the effects of model performance, we use the original formulation in Schlaifer and Raiffa <ref type="bibr">[1961]</ref>, where the expected utility associated with each decision is computed in light of p &#952;|z , the posterior distribution of &#952;:</p><p><formula notation="TeX">CVSI_M = \mathrm{E}_{\theta|M,z}\left[u(a^*_z, \theta)\right] - \mathrm{E}_{\theta|M,z}\left[u(a^*, \theta)\right]</formula></p><p>Here, E &#952;|M,z u(a*, &#952;) is the average utility under posterior conditions associated with the decision a*, which was optimized under prior conditions. Note that a perfect model has a CVSI value of zero because the prior decision a* associated with the model does not need to be revised. By extension, the incremental value of an 'improved' model M 2 (compared to a benchmark model M 1 of the same parametric form),</p><p><formula notation="TeX">\Delta CVSI = CVSI_{M_1} - CVSI_{M_2}</formula></p><p>is positive only if M 2 allows a better decision a* to be taken.</p><p>In the examples below, &#952; represents the probability P that daily streamflow Q exceeds a threshold Q c . M 1 and M 2 are models of streamflow CDFs that associate a single exceedance probability P M with each flow value Q c after being calibrated on a sample of N M (prior) observations. To represent the associated uncertainty, we model the event Q &gt; Q c as a Bernoulli process with P M &#8226; N M successes over N M daily observations. Accordingly, &#952; is a beta-distributed random variable (i.e. with the same parametric form) for all the compared models.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Analytical illustration of CVSI</head></div>
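The distinction between the two metrics can be checked numerically. The sketch below is our own illustration (not the authors' code): it Monte Carlo estimates both metrics for a binary mitigation decision with assumed beta beliefs and assumed error costs, and shows that CVSI is non-negative by construction while CVSI* can turn negative when the posterior lowers the achievable utility.

```python
import numpy as np

rng = np.random.default_rng(1)

def decide(theta_samples, u, actions=(0, 1)):
    """Pick the action maximizing Monte Carlo expected utility."""
    return max(actions, key=lambda a: np.mean(u(a, theta_samples)))

def value_metrics(prior, posterior, u):
    """Return (CVSI*, CVSI) for a binary decision.

    CVSI* compares anticipated utilities under posterior vs prior beliefs;
    CVSI compares both decisions under the same posterior belief."""
    a_star = decide(prior, u)          # decision optimized on the prior model
    a_z = decide(posterior, u)         # decision optimized after updating
    cvsi_star = np.mean(u(a_z, posterior)) - np.mean(u(a_star, prior))
    cvsi = np.mean(u(a_z, posterior)) - np.mean(u(a_star, posterior))
    return cvsi_star, cvsi

# Assumed linear utility: type I cost 1, type II cost 3 (negative consequences)
u = lambda a, t: np.where(a == 1, -1.0 * (1 - t), -3.0 * t)
prior = rng.beta(2, 18, 5000)       # model says exceedance is unlikely (mean 0.1)
posterior = rng.beta(12, 28, 5000)  # validation data raise the estimate (mean 0.3)
cvsi_star, cvsi = value_metrics(prior, posterior, u)
```

Here the posterior makes mitigation optimal and lowers the anticipated utility, so CVSI* is negative even though the revised decision is strictly better under posterior conditions, which is exactly the confound the original Schlaifer and Raiffa formulation avoids.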
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Flood Mitigation</head><p>To illustrate the properties of the CVSI metric, an analytical case study is presented.</p><p>This case study relates to a binary decision, such as whether to employ mitigation measures against flood or low flow conditions. These decisions are based on the probability P that daily averaged streamflow Q exceeds (or fails to exceed) an exogenously given critical threshold Q c , where Q c could be based on flood stage, navigability, ecological connectivity or other criteria. Because a finite sample of flow observations is available, there is uncertainty in the estimation of P. It can therefore be represented as a random variable (corresponding to &#952; in Section 2), the PDF of which affects the decision. In this case, the conjugate properties of the beta distribution (which describes the binary behavior of the decision) allow CVSI to be analytically determined.</p><p>Firstly, the probability P that streamflow exceeds the critical threshold Q c is defined as:</p><p><formula notation="TeX">P = \Pr\{Q &gt; Q_c\} = 1 - F_Q(Q_c)</formula></p><p>where F Q (&#8226;) designates the streamflow CDF. To associate utility with the decision of whether or not to implement a management action, a simple linear utility function is applied <ref type="bibr">[Alfonso et al., 2016]</ref>. The utility function adopts different values depending on (i) whether or not mitigation is taken and (ii) whether or not the threshold is exceeded (exceedance is designated f = 1, while f = 0 indicates that the threshold is not exceeded):</p><p><formula notation="TeX">u(a, f) = \mathbf{a}^{T} \times C \times \mathbf{f}</formula></p><p>where the matrix elements C a f &lt; 0 denote consequences arising from all possible combinations of a and f and are negative when expressed as costs. The operator &#8226; &#215; &#8226; indicates matrix multiplication. 
Decision vector a encodes whether or not mitigation is implemented.</p><p>The diagonal terms in C are consequences associated with making the 'right' decision. They are typically minimal and we neglect them without loss of generality <ref type="bibr">[Murphy, 1977;</ref><ref type="bibr">Laio and Tamea, 2007]</ref>. Off-diagonal terms are (negative) consequences associated with errors of type I (mitigation is implemented but is not needed) and type II (mitigation is not taken but is needed). A prior PDF on the exceedance probability p M (P) is obtained from a model M informed by a calibration dataset of size N M . Here we consider a simple baseline model (M 1 ), which is an empirical CDF that draws on N M daily streamflow observations and allows the computation of empirical exceedance probabilities, P M :</p><p><formula notation="TeX">p_M(P) = p_\beta\left(P,\; P_M N_M,\; (1 - P_M) N_M\right)</formula></p><p>where p &#946; (P, &#8226;, &#8226;) designates the two-parameter PDF of a beta distribution. The prior distribution is updated with additional information from a "validation" dataset z consisting of N z observations, and a probability P z of Q &gt; Q c within the dataset.</p><p>Using the conjugate prior properties of the beta distribution [e.g., <ref type="bibr">Bernardo and Smith, 2001, Section 5.2</ref>], an updated (posterior) distribution for the probability Pr {Q &gt; Q c } can be analytically expressed as:</p><p><formula notation="TeX">p_M(P \mid z) = p_\beta\left(P,\; P_M N_M + P_z N_z,\; (1 - P_M) N_M + (1 - P_z) N_z\right)</formula></p><p>Using the validation dataset to update the prior distribution of P decreases its variance and shifts its expectation towards P z , as seen in Figure <ref type="figure">2 (c)</ref>.</p><p>Because the utility in Equation 8 is a linear function of the random variable P, the expected utilities can be expressed as <ref type="bibr">[Ades et al., 2004]</ref>:</p><p><formula notation="TeX">\mathrm{E}\left[u(a, P)\right] = u\left(a, \mathrm{E}[P]\right)</formula></p><p>where</p><p><formula notation="TeX">\mathrm{E}_M[P] = P_M</formula></p><p>and</p><p><formula notation="TeX">\mathrm{E}_{M|z}[P] = \frac{P_M N_M + P_z N_z}{N_M + N_z}</formula></p><p>for the prior and posterior distributions. Thus, CVSI can be analytically written as:</p><p><formula notation="TeX">CVSI = u\left(a^*_z, \mathrm{E}_{M|z}[P]\right) - u\left(a^*, \mathrm{E}_{M|z}[P]\right)</formula></p><p>where a* and a*_z are the user decisions that maximize utility under prior and posterior conditions, respectively:</p><p><formula notation="TeX">a^* = \arg\max_a \, u\left(a, \mathrm{E}_M[P]\right)</formula></p><p>and</p><p><formula notation="TeX">a^*_z = \arg\max_a \, u\left(a, \mathrm{E}_{M|z}[P]\right)</formula></p></div>
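Because utility is linear in P, the analytical CVSI reduces to comparing beta means against a break-even probability. The following is a minimal sketch of our own (function names and the worked numbers are assumptions, not the paper's code); C10 and C01 enter as positive cost magnitudes of type I and type II errors.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def cvsi_binary(P_M, N_M, P_z, N_z, C10, C01):
    """Analytical CVSI for the binary mitigation decision.

    Prior: Beta(P_M*N_M, (1-P_M)*N_M); the validation set of N_z observations
    with exceedance frequency P_z updates it conjugately. Since the utility
    is linear in P, expected utilities depend only on the beta means."""
    a1, b1 = P_M * N_M, (1 - P_M) * N_M             # prior parameters
    a2, b2 = a1 + P_z * N_z, b1 + (1 - P_z) * N_z   # posterior parameters
    P_star = C10 / (C10 + C01)                      # break-even probability

    def exp_utility(action, p):
        # mitigation risks a type I cost; inaction risks a type II cost
        return -C10 * (1 - p) if action else -C01 * p

    a_prior = int(beta_mean(a1, b1) > P_star)       # decision from model alone
    a_post = int(beta_mean(a2, b2) > P_star)        # decision with validation data
    p_post = beta_mean(a2, b2)
    return exp_utility(a_post, p_post) - exp_utility(a_prior, p_post)
```

For example, a model giving P_M = 0.1 over N_M = 100 days, confronted with a wetter validation record (P_z = 0.5, N_z = 100) under costs C10 = 1, C01 = 3, yields CVSI = 0.2: the prior decision (no mitigation) is revised once the data arrive, and the metric prices that revision.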
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Practical Example</head><p>To explore the properties of the CVSI metric for this analytic case, we applied it to the Khimti Khola catchment, a 310 km 2 mountainous catchment in Eastern Nepal. The complete streamflow dataset is available in Supplementary Information. The observed streamflow presents no evidence of non-stationarity, as determined by an Augmented Dickey-Fuller test <ref type="bibr">[Hamilton, 1994, Chap. 17]</ref>. Economic parameters are challenging to prescribe for this (and many other) catchments, so we follow <ref type="bibr">[Murphy, 1977;</ref><ref type="bibr">Laio and Tamea, 2007]</ref> in expressing the expected utility associated with the model as a function of the costs of Type I (C 10 ) and Type II (C 01 ) errors. Specifically, we consider how well the models predict the exceedance probability at which inaction (risking Type II errors) becomes as costly as action (risking Type I errors).</p><p>This threshold is computed as</p><p><formula notation="TeX">P^* = \frac{C_{10}}{C_{10} + C_{01}}</formula></p><p>It is conceptually analogous to the cost-loss ratio used in previous studies [e.g., <ref type="bibr">Roulin, 2007;</ref><ref type="bibr">Laio and Tamea, 2007]</ref> and can be interpreted as the maximum acceptable probability, from a cost-benefit standpoint, that streamflow exceeds a given critical value. A low value of P * is associated with a risk-averse decision-maker who will invest in flood mitigation measures even if the probability that flow exceeds the critical value is low. Although P * is an inherent characteristic of the decision and determined by cost considerations, it affects the societal value of the information provided by the model, as measured by CVSI. The effect of costs on CVSI is illustrated in Figure 3 (a), where the CVSI and CVSI* of M 1 are displayed for increasing values of P * . 
For sufficiently extreme values of P * , corresponding to either risk-averse (low P * ) or risk-tolerant (high P * ) decision-makers, flood mitigation measures are respectively prescribed or discouraged in both the prior (model M 0,a ) and posterior (M 0,a and validation data z) situations. In other words, the decision taken solely based on model predictions (prior) is correct and would not be different if the validation data had been available when the decision was taken (posterior). In that situation, the CVSI associated with M 0,a has a perfect value of 0 (green in Figure <ref type="figure">3</ref>).</p></div>
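The behavior just described, with CVSI collapsing to zero at sufficiently extreme P*, can be reproduced with a small sweep. In this sketch (our own illustration; the prior and validation statistics are hypothetical), costs are normalized so that C10 + C01 = 1 and hence C10 = P*:

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def cvsi_at(P_star, P_M=0.2, N_M=365, P_z=0.35, N_z=180):
    """CVSI of the binary mitigation decision as a function of the
    break-even probability P* (assumed prior/validation statistics)."""
    m_prior = beta_mean(P_M * N_M, (1 - P_M) * N_M)
    m_post = beta_mean(P_M * N_M + P_z * N_z,
                       (1 - P_M) * N_M + (1 - P_z) * N_z)
    # normalized costs: C10 = P*, C01 = 1 - P*
    u = lambda a, m: -P_star * (1 - m) if a else -(1 - P_star) * m
    a_prior = int(m_prior > P_star)   # mitigate iff expected P exceeds P*
    a_post = int(m_post > P_star)
    return u(a_post, m_post) - u(a_prior, m_post)
```

With these numbers the prior mean is 0.2 and the posterior mean is about 0.25, so CVSI is strictly positive only for P* between the two means, where prior and posterior decisions disagree; for extreme P* both situations prescribe the same action and CVSI is exactly zero.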
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Endogenous thresholds: Design of a Flow Diversion Structure</head><p>Having illustrated the features of the CVSI metric in the simple analytical example, we turn to its use for model selection in the more general case where the critical flow Q c is endogenous to the problem. We initially restrict our analysis to model calibration, that is, the selection of a specification among models of identical parametric forms. An extension of the method to a more general model selection problem is outlined in Section 5.1.</p><p>Consider a river diversion structure, where costs are affected by the size of the structure (proportional to the decision variable Q c , the maximum critical flow that the structure can exploit), and revenues by the volume of the diverted water when the infrastructure is functioning at full capacity (proportional to both Q c and its associated exceedance probability) [e.g. <ref type="bibr">Basso and Botter, 2012]</ref>. The optimal decision Q c in this case does not depend on how well a model predicts a single flow quantile, but a whole region of the CDF with unknown extent. Specifically, assuming that the dependence of construction costs and revenues on the flow capacity is linear [e.g., <ref type="bibr">M&#252;ller et al., 2016]</ref>, profits are expressed as the revenue associated with the diverted water, minus the total costs of the infrastructure. Accordingly, the utility function associated with the design decision on Q c is expressed as:</p><p><formula notation="TeX">u(Q_c) = \gamma_0 \int_{\alpha Q_c}^{Q_c} P(Q)\, \mathrm{d}Q - \gamma Q_c</formula></p><p>The integral</p><p><formula notation="TeX">\int_{\alpha Q_c}^{Q_c} P(Q)\, \mathrm{d}Q</formula></p><p>represents the volume of diverted water, where</p><p><formula notation="TeX">P(Q) = 1 - F_Q(Q)</formula></p><p>is the flow duration curve (FDC) describing the probability that streamflow exceeds Q, Q c is the maximum flow that can be processed by the infrastructure, and &#945;Q c is the flow below which no water is diverted. 
Parameter &#947; 0 is the unit revenue associated with the diverted water and &#947; the cost of the infrastructure by unit capacity.</p><p>Models for FDCs typically associate a single exceedance probability P M (Q) with each flow quantile Q [e.g. <ref type="bibr">Castellarin et al., 2004b;</ref><ref type="bibr">Botter et al., 2007;</ref><ref type="bibr">M&#252;ller et al., 2014]</ref>. CVSI, conversely, acknowledges that the association between P and Q is uncertain (e.g., due to finite samples of observations), with a conditional PDF p(P | Q) associated with the exceedance probability P of each flow value Q. To develop this probabilistic association, we sample p(P | Q) numerically. Consider an ordered list of K flow values Q (1) &lt; ... &lt; Q (r) &lt; ... &lt; Q (K) . As in the previous section, we model the event Q &gt; Q (r) as a Bernoulli process with P (r) M &#8226; N M successes over N M trials, which represents the number of observations used to calibrate the FDC model. Here P (r) M is a conditional exceedance probability computed from the FDC model M. Accordingly, the probability P{Q &gt; Q (r) } can be drawn from a (different) beta distribution for the flow value associated with each rank r. This sampling procedure is used to construct numerical instances of flow CDFs for prior or posterior conditions, as detailed in Section S1 of Supplementary Information. Prior and posterior decisions (Q c and Q c,z ), and corresponding utility values, can then be computed by maximizing the expected utility across the relevant (prior or posterior) ensemble of numerically generated FDCs:</p><p><formula notation="TeX">\mathrm{E}\left[u(Q_c)\right] \approx \frac{1}{N_s} \sum_{i=1}^{N_s} u_i(Q_c)</formula></p><p>where N s is the number of sampling runs and u i (Q c ) are the utilities (Equation <ref type="formula">18</ref>) associated with the decision Q c for a numerically generated instance i of the flow duration curve under prior and posterior conditions, respectively. 
The CVSI associated with model M is finally obtained as</p><p><formula notation="TeX">CVSI_M = \mathrm{E}_{P|M,z}\left[u(Q_{c,z}, P)\right] - \mathrm{E}_{P|M,z}\left[u(Q_c, P)\right]</formula></p><p>where E P|M,z u(Q c , P) is the expected utility that emerges from applying the prior decision to the posterior ensemble of FDCs.</p></div>
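The sampling-and-optimization loop of this section can be sketched as follows. This is a simplified stand-in for the procedure detailed in Section S1 of Supplementary Information: the flow grid, the model exceedance probabilities, and the economic parameters are all assumed, and the unit revenue is normalized to 1.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_fdcs(flows, p_model, n_obs, n_samples):
    """Draw FDC instances: for each rank r, P{Q > Q_(r)} is beta-distributed
    with p_model[r] * n_obs successes over n_obs trials; each draw is then
    sorted so that exceedance probability decreases with flow."""
    draws = rng.beta(p_model * n_obs, (1.0 - p_model) * n_obs,
                     size=(n_samples, flows.size))
    return np.sort(draws, axis=1)[:, ::-1]

def utility(qc, flows, fdc, alpha=0.4, cost_ratio=0.2):
    """Diversion utility: diverted volume (Riemann sum of the FDC between
    alpha*qc and qc) minus capacity cost, with gamma_0 = 1."""
    dq = flows[1] - flows[0]
    mask = (flows >= alpha * qc) & (flows <= qc)
    return fdc[mask].sum() * dq - cost_ratio * qc

def optimal_capacity(flows, fdc_ensemble, candidates):
    """Capacity maximizing the ensemble-mean (expected) utility."""
    exp_u = [np.mean([utility(qc, flows, f) for f in fdc_ensemble])
             for qc in candidates]
    return candidates[int(np.argmax(exp_u))]

flows = np.linspace(0.1, 20.0, 200)      # flow grid (assumed units)
p_prior = np.exp(-flows / 4.0)           # FDC from a hypothetical prior model
prior_fdcs = sample_fdcs(flows, p_prior, n_obs=365, n_samples=200)
qc_prior = optimal_capacity(flows, prior_fdcs, candidates=flows[::10])
```

Repeating the same optimization on a posterior ensemble (updated with validation data) yields the revised capacity, and evaluating both capacities on the posterior ensemble gives the CVSI of the preceding equation.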
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Practical Example</head><p>To illustrate the use of CVSI to select alternative model specifications (i.e. model calibration), consider the design of a run-of-river hydropower plant in the Khimti Khola catchment described in Section 3.2. The decision-maker must choose between two alternative datasets of streamflow observations (M 0,a and M 0,b ) to construct empirical CDFs, based on which to design the infrastructure. Note that both datasets are extensive (15+ years of observations), but patchy and yield substantially different empirical CDFs (Figure <ref type="figure">2 (b)</ref>). The two alternative model specifications (M 0,a and M 0,b ) will be evaluated by computing the CVSI associated with the design decision, using a smaller (but complete) dataset of recent observations as a validation dataset z. CVSI results in Table <ref type="table">1</ref> show that consideration of the validation dataset results in a substantial (+7%) revision to the infrastructure capacity obtained from the prior associated with M 0,a , with a related CVSI of about $90,000. In contrast, no revision is necessary for the prior design based on M 0,b (i.e. CVSI = 0). This suggests that specification M 0,b (smallest CVSI) is preferable and would result in a comparative increase of $90,000 over the seven years considered as a validation period (N z ). In contrast, goodness-of-fit assessments tend to prefer model specification M 0,a over M 0,b , as evidenced by the Nash-Sutcliffe Efficiency, the Akaike Information Criterion and the Kolmogorov-Smirnov distance, also displayed in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Note that CVSI outcomes depend crucially on the technical (cutoff ratio &#945;) and economic (cost-benefit ratio &#947;/&#947; 0 ) parameters that drive hydropower profits; these were realistically (but arbitrarily) set at &#945; = 0.4 and &#947;/&#947; 0 = 0.2 for the results in Table <ref type="table">1</ref>. To investigate this effect, we carried out a sensitivity analysis in which the CVSIs associated with M 0,a and M 0,b were determined for numerous combinations of parameter values. The estimation was based on numerically generated (N MC = 1000) ensembles of prior and posterior FDCs obtained through the procedure described in Supplementary Information (Section S2). </p></div>
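The sensitivity analysis described above reduces to tabulating CVSI over a parameter grid. In this sketch, `cvsi_fn` is a hypothetical callable standing in for one full Monte Carlo CVSI estimation at a given (&#945;, &#947;/&#947; 0 ) combination:

```python
import numpy as np

def cvsi_sensitivity(alphas, ratios, cvsi_fn):
    """Tabulate CVSI over a grid of cutoff ratios (alpha) and cost-benefit
    ratios (gamma/gamma0). cvsi_fn(alpha, ratio) is assumed to run the
    Monte Carlo CVSI estimation for one parameter combination."""
    grid = np.empty((len(alphas), len(ratios)))
    for i, a in enumerate(alphas):
        for j, r in enumerate(ratios):
            grid[i, j] = cvsi_fn(a, r)
    return grid
```

The resulting grid can then be compared between model specifications M 0,a and M 0,b to see over which parameter region each specification is preferred.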
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Model Selection with CVSI</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Modified Metric</head><p>A basic premise in using CVSI to evaluate models is that a model that 'learns' less from a given set of additional observations z already 'knows' more, and thus would have allowed a better decision to be taken. Yet a model's capacity to learn from new data is determined both by what it already knew (our evaluation metric) and by the model's inherent ability to assimilate new information, which is related to its level of complexity. To illustrate, consider a model representing streamflow as a uniform distribution between observed extremes. This model will not learn from additional observations unless values beyond historical ranges are observed. Its associated CVSI will therefore likely take a (perfect) value of zero, despite an objectively poor prediction performance. Models with different structural forms, calibrated and updated with similar data, would not be subject to the same restrictions on 'learning'. Comparing such models through the CVSI approach is not directly possible. This represents an inherent limitation of CVSI and restricts its current use to comparing probabilistic models of identical parametric forms, as in Section 4.2.</p><p>We address this limitation by defining a modified evaluation metric, CVSI, which associates a single exceedance probability, rather than a full PDF, with any given flow value. This means that, in contrast to other value-of-information metrics (e.g., the CVSI of Section 4 or the Value of Perfect Information in <ref type="bibr">[Laio and Tamea, 2007]</ref>), it does not explicitly incorporate prediction uncertainties. The modified metric is given by:</p><p>CVSI(M) = u(Q c,z , P z ) &#8722; u(Q c , P z ),</p><p>where Q c and Q c,z are the utility-maximizing values of Q c in the respective situations where the FDCs are given by the model M (P M (Q)) and constructed empirically from the validation dataset (P z (Q)). 
The modified metric leverages the fact that the ordinal outcome of model evaluation using CVSI is unaffected by the sizes of the validation (N z ) and calibration (N M ) datasets (see the proof and numerical analysis in Supplementary Information Section S2). Intuitively, small validation datasets do not allow decisions taken under any model prior to be updated (CVSI = 0). Increasing sample sizes first allow the decision taken under the worse model prior to be updated (CVSI &gt; 0), while the best prior decision remains unchanged (CVSI = 0). At large enough sample sizes, both prior decisions will be updated (if applicable), but the update related to the worse prior model will yield a larger increase in value, hence a larger CVSI. In the limit, CVSI applied to infinitely large calibration and validation datasets preserves the ranks of the models being compared.</p><p>The modified metric is solely based on the model's ability to fit a particular (and a priori unknown) subset of the validation data (z) that is relevant to the decision at hand. For example, when considering hydropower design, models are evaluated based on their ability to fit the exceedance probabilities associated with flow quantiles that fall between &#945;Q c and Q c . Because Q c is the decision variable, this interval is not known a priori, which precludes the use of (SSE-type) goodness-of-fit metrics. Despite this important distinction, there is a conceptual similarity between CVSI and goodness-of-fit metrics. This similarity means that CVSI can be used for model selection as well as calibration, as illustrated in the cross-validation analysis in the following section. The key distinction allowing CVSI to be used across different models arises from dispensing with the probabilistic components of the metric: it primarily acts as a goodness-of-fit metric, but one weighted appropriately to represent the importance of the model for altering decisions.</p></div>
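The modified metric can be sketched directly from its definition: decisions are taken against single FDCs rather than ensembles, and both decisions are valued against the empirical validation FDC. Here `utility(q, fdc)` is a hypothetical stand-in for the utility of Equation 18:

```python
import numpy as np

def modified_cvsi(candidates, utility, p_model, p_valid):
    """Modified metric: Q_c maximizes utility under the model FDC P_M;
    Q_c,z maximizes utility under the empirical validation FDC P_z; both
    decisions are then valued against P_z. `utility(q, fdc)` is a stand-in
    for Equation 18; `candidates` is a grid of candidate decisions Q_c."""
    u_model = np.array([utility(q, p_model) for q in candidates])
    u_valid = np.array([utility(q, p_valid) for q in candidates])
    q_c = candidates[int(np.argmax(u_model))]   # prior decision under P_M
    q_cz = candidates[int(np.argmax(u_valid))]  # decision under P_z
    return utility(q_cz, p_valid) - utility(q_c, p_valid)
```

Because Q c,z maximizes utility under P z , the metric is nonnegative, and it is zero exactly when the model FDC leads to the same decision as the validation FDC, consistent with the rank-preservation argument above.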
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Practical Example</head><p>In Section 4.2, CVSI was used to compare different specifications of the same model: in all compared cases, the flow distribution was modeled using an empirical CDF constructed from different observation samples of the same size. Here, we use CVSI to compare different types of FDC models in their ability to inform run-of-river hydropower design in the same catchment (Figure <ref type="figure">5</ref>(b)), where the FDC predicted by M 2 (purple) fits the validation FDC (red) better than M 1 (orange) at low flow quantiles, despite a worse fit across all quantiles. Note that, although M 2 corresponds to a (slightly) lower average CVSI across validation runs than M 0 , the performance of the two models cannot be distinguished because the distributions of CVSI overlap substantially. This illustrates that, although sampling uncertainty is not incorporated in the modified metric itself (as it is in the CVSI of Section 4), its effect on model evaluation can be accounted for by implementing the modified metric in a cross-validation analysis. Because M 1 fits the validation FDC better than M 2 (purple) at higher flow quantiles, it obtains a higher NSE value; yet M 2 fits better than M 1 at the lower flow quantiles (inset graph) that are relevant for hydropower design under the considered economic and technical constraints.</p></div>
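The cross-validation analysis mentioned above can be sketched as repeated random splits of the flow record; comparing the resulting distributions of the metric across models, rather than single values, is what exposes sampling uncertainty. In this sketch, `metric_fn(calib, valid)` is a hypothetical callable that fits a model on the calibration half and returns the modified CVSI against the validation half:

```python
import numpy as np

def cross_validate_metric(flows, metric_fn, n_splits=50, seed=0):
    """Repeatedly split a flow record into calibration and validation halves
    and collect the metric value for each split. metric_fn(calib, valid) is
    assumed to fit the model on `calib` and score it against `valid`."""
    rng = np.random.default_rng(seed)
    flows = np.asarray(flows, dtype=float)
    half = len(flows) // 2
    scores = np.empty(n_splits)
    for k in range(n_splits):
        idx = rng.permutation(len(flows))
        scores[k] = metric_fn(flows[idx[:half]], flows[idx[half:]])
    return scores
```

Overlap between the score distributions of two models (as for M 0 and M 2 above) then signals that their performance cannot be distinguished at the available sample size.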
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and conclusion</head><p>The evaluation approach proposed in this paper is based on the premise that, for practical applications at least, the selection of a model should be determined by its end use. Our results, which suggest that the model selected by an SSE-type metric may not provide the most value to the end user, support this premise. The exploration presented here reveals four common model evaluation challenges and shows that the proposed metrics can address them, as follows:</p><p>First, practical applications often rely on prediction performance at specific (and sometimes extreme) flow quantiles, which are not necessarily known a priori. This is illustrated in the case study in Section 5.2, where model M 2 better fits the validation sample in the range of quantiles that are relevant for hydropower production (Figure <ref type="figure">5</ref>).</p><p>Second, the socio-economic value of model predictions depends on economic and technical parameters, in addition to the quality of the model's flow predictions <ref type="bibr">[Laio and Tamea, 2007;</ref><ref type="bibr">Thiboult et al., 2017]</ref>. Our results illustrate how value-based model performance should not be understood as a simple scalar, but rather as a function of governing parameters. </p></div>
		</text>
</TEI>
