<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Explainable Nonparametric Importance Sampling in Stochastic Simulation</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the IISE Annual Conference &amp; Expo</publisher>
				<date>05/30/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10557922</idno>
					<idno type="doi"></idno>
					
					<author>Chenfei Li</author><author>Jaeshin Park</author><author>Eunshin Byon</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We propose a new explainable multivariate importance sampling method in stochastic simulation. Importance sampling has the potential to significantly reduce the estimation variance when its instrumental density is properly designed. Prior studies have demonstrated the advantages of using a nonparametric method when the relationship between input variables and the output variable is unknown. However, importance sampling easily faces the curse of dimensionality, as the input dimension grows. This study devises a new method that identifies crucial input variables to effectively formulate a nonparametric instrumental density and explains the importance of each variable. The wind turbine reliability case study demonstrates the method's efficiency.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The recent advances in computing power have greatly increased the prevalence of computer simulation modeling. These simulations are frequently utilized to assess system performance metrics, including reliability, across a wide range of application areas. This study is specifically motivated by the evaluation of wind turbine reliability using stochastic simulators. Wind power has emerged as one of the most promising renewable energy sources. Ensuring wind turbine reliability is a fundamental consideration to prevent potential structural failure <ref type="bibr">[5]</ref>. Accordingly, the National Renewable Energy Laboratory (NREL) under the U.S. Department of Energy (DOE) has developed aeroelastic simulation models, such as TurbSim <ref type="bibr">[7]</ref> and FAST <ref type="bibr">[8]</ref>. These tools aid in comprehensive analysis, contributing to the enhancement of wind turbine reliability.</p><p>Studies in stochastic simulation predominantly use two approaches to estimate system performance: the Monte Carlo sampling technique and the emulator-based approach <ref type="bibr">[10]</ref>. The emulator-based technique constructs a surrogate model designed to approximate the response surface, and system performance estimates are derived from the resulting surrogate model. Although this approach proves useful in modeling or characterizing the average response on the response surface, it may not be suitable for calculating the probability of rare events <ref type="bibr">[1]</ref>. Hence, in this study, we employ the Monte Carlo sampling approach to estimate the failure probability.</p><p>Among several Monte Carlo sampling techniques, importance sampling (IS) is a powerful tool that is well known for its ability to reduce estimation variance. Variance reduction in stochastic simulation translates into enhanced computational efficiency, as a lower-variance estimator attains a target precision with fewer simulation runs. 
The fundamental idea of the IS method is to employ an instrumental density for simulating data samples smartly, thereby facilitating precise estimation. Nevertheless, IS is prone to the curse of dimensionality as the input dimension grows. Recently, Li et al. (2021) <ref type="bibr">[10]</ref> introduced a nonparametric IS that effectively navigates the issues of multivariate input, particularly when multiple input variables interact with each other. Their approach characterizes the significance of each factor, treating them distinctively when formulating an instrumental density. However, the problem is that their approach includes all input variables in building the instrumental density. It is often observed that the performance of engineering systems is primarily influenced by a small number of critical variables, as underscored by the parsimony principle. When all non-critical variables are included collectively, the efficiency of the IS method can deteriorate.</p><p>In this paper, we introduce a new approach aimed at identifying crucial input variables and designing an instrumental density formulated with these selected variables. The resulting instrumental density provides a comprehensive overview of the impact of the selected variables on the final results, thereby lending explainability to the methodology. The chosen variables and their parameters demonstrate their significance in estimating quantities of interest, such as a reliability measure. We apply our method to a case study to evaluate wind turbine reliability. Our findings demonstrate the superiority of this method, as it successfully identifies key input variables and provides improved variance reduction capability compared to the existing strategy that considers all variables <ref type="bibr">[10]</ref>.</p></div>
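To make the variance-reduction mechanism of IS concrete, the following is a minimal sketch on a hypothetical one-dimensional stochastic simulator. All distributions, the simulator, and the resistance level here are illustrative assumptions, not the wind turbine setup studied in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

l = 3.0        # illustrative resistance level
n_t = 20_000   # total simulation budget N_T

def norm_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def simulator(x, rng):
    # Hypothetical stochastic black box: Y = X + noise
    return x + rng.normal(0.0, 0.5, size=x.shape)

# Crude Monte Carlo: sample X_i from the original density f = N(0, 1)
x_f = rng.normal(0.0, 1.0, n_t)
p_cmc = np.mean(simulator(x_f, rng) > l)

# Importance sampling: sample X_i from an instrumental density
# q = N(2.5, 1) shifted toward the failure region, and weight each
# exceedance indicator by the likelihood ratio f(X_i) / q(X_i)
x_q = rng.normal(2.5, 1.0, n_t)
w = norm_pdf(x_q, 0.0, 1.0) / norm_pdf(x_q, 2.5, 1.0)
p_is = np.mean((simulator(x_q, rng) > l) * w)
```

Both estimators target the same failure probability (here roughly 0.0037), but the IS terms are far less variable because most samples land near the failure region, so at the same budget the IS estimate typically has a much smaller standard error.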
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Problem Description</head><p>We consider a black-box computer model that generates the output Y ∈ ℝ at input X ∈ D ⊆ ℝ^d following a distribution f(X). Computer models can be broadly categorized as deterministic or stochastic, depending on whether the output Y is fixed or noisy at a given input X. This study is concerned with stochastic computer models, due to their growing popularity <ref type="bibr">[3]</ref>. In particular, we are interested in estimating the failure probability P(Y &gt; l), where l denotes a resistance level. This failure probability is also referred to as the probability of exceedance (POE).</p><p>Noting that P(Y &gt; l) = ∫_{χ_f} P(Y &gt; l | X = x) f(x) dx, with χ_f denoting the domain of X, the crude Monte Carlo (CMC) method samples the input X_i from f and runs the computer model at X_i to get Y_i. Repeating the procedure, CMC estimates the POE by</p><formula>P̂_CMC = (1/N_T) ∑_{i=1}^{N_T} I(Y_i &gt; l). (1)</formula><p>Here, N_T denotes the total number of samples.</p><p>On the other hand, IS samples inputs from an instrumental density q instead of f and estimates the POE by</p><formula>P̂_IS = (1/N_T) ∑_{i=1}^{N_T} I(Y_i &gt; l) f(X_i) / q(X_i). (2)</formula><p>With the stochastic computer model, P̂_IS is an unbiased estimator when the support of q, denoted supp{q(x)}, includes supp{P(Y &gt; l | X = x) f(x)} <ref type="bibr">[12]</ref>. In <ref type="bibr">[3]</ref>, it has been shown that the optimal IS density minimizing the variance of the failure probability estimator P̂_IS is given by</p><formula>q*_SIS(x) = s(x) f(x) / C_q, (3)</formula><p>where s(x) = P(Y &gt; l | X = x) is the conditional failure probability at x and C_q = ∫_{χ_f} s(x) f(x) dx is the normalizing constant. The subscript SIS stands for importance sampling for the stochastic computer model, or shortly, stochastic importance sampling <ref type="bibr">[9]</ref>. With q*_SIS, P̂_IS in (2) is an unbiased estimator <ref type="bibr">[3]</ref>. 
While the density in (3) represents the theoretically optimal instrumental density, it cannot be straightforwardly implemented. This challenge arises because s(x), a crucial component, is unknown, primarily due to the black-box nature of the computer model. Consequently, our study focuses on identifying an effective estimation method for s(x). Several approaches have been proposed for approximating s(x) (or q*_SIS more broadly), including parametric, cross-entropy, and nonparametric approaches <ref type="bibr">[2,</ref><ref type="bibr">4,</ref><ref type="bibr">10]</ref>. Li et al. (2021) <ref type="bibr">[10]</ref> discussed that parametric and cross-entropy approaches encounter difficulties when the input dimension is three or higher and proposed a new nonparametric technique as a remedy. However, the nonparametric method in <ref type="bibr">[10]</ref> still has limitations in managing multivariate inputs as the input dimension increases. In the following section, we introduce a new method to discern key input variables vital to obtaining the instrumental density.</p></div>
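Before turning to the multivariate construction, the core nonparametric idea — estimating s(x) by kernel-smoothing the exceedance indicators — can be sketched in one dimension. The simulator, bandwidth, and threshold below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pilot data from a hypothetical 1-D stochastic simulator
n = 500
X = rng.uniform(-3.0, 3.0, n)
Y = X + rng.normal(0.0, 0.5, n)
l = 1.5
Z = (Y > l).astype(float)   # exceedance indicators Z_i = I(Y_i > l)

def s_hat(x, X, Z, h=0.4):
    """Kernel-regression (Nadaraya-Watson) estimate of s(x) = P(Y > l | X = x)."""
    K = np.exp(-0.5 * ((x - X) / h) ** 2)   # Gaussian univariate kernel
    return np.sum(K * Z) / np.sum(K)

# At x = 2.0 the true conditional POE is P(noise > -0.5), about 0.84;
# the kernel estimate smooths the nearby indicators toward that value
est = s_hat(2.0, X, Z)
```

The estimate is simply a locally weighted average of the 0/1 exceedance outcomes, which is why no parametric form for s(x) is required; the difficulty the paper addresses is how to build such a kernel estimate when x is multivariate.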
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>To estimate s(x) in (<ref type="formula">3</ref>), we utilize kernel regression <ref type="bibr">[10]</ref>, as follows:</p><formula>ŝ(x) = ∑_{i=1}^{n} Z_i K_d((x − X_i)/h) / ∑_{i=1}^{n} K_d((x − X_i)/h), (4)</formula><p>with Z_i = I(Y_i &gt; l), where K_d represents a d-dimensional kernel function. There are several potential kernel functions that could be used for K_d, each with its own advantages and disadvantages. The additive kernel, which aggregates all univariate kernels, offers a dimension-scalable methodology. However, its limitation lies in failing to recognize interaction effects among input variables. In contrast, the full multiplicative kernel, which multiplies all univariate kernels, accounts for all interaction effects <ref type="bibr">[6,</ref><ref type="bibr">13]</ref>. However, this approach necessitates a sufficiently large data set for kernel construction and risks data sparsity issues when the input dimension is three or higher.</p><p>Seeking a balance between the two extremes of fully additive and fully multiplicative kernels, Li et al. <ref type="bibr">[10]</ref> put forth a weighted additive multivariate kernel for the stochastic importance sampling method, referred to as WAMK-SIS. This method employs all d(d − 1)/2 bivariate kernels for estimating s(x) in (<ref type="formula">3</ref>) with the d-dimensional input. The use of bivariate kernels enables the characterization of interaction effects present between any pair of variables. The fundamental premise of WAMK-SIS is the assumption that higher-order interactions are negligible, with the belief that two-way interactions are fully capable of depicting the output behavior. Specifically, WAMK-SIS employs the following estimator for s(x):</p><formula>ŝ_full(x) = ∑_{1≤p&lt;q≤d} w_pq B(x_p, x_q), (5)</formula><p>with ∑_{1≤p&lt;q≤d} w_pq = 1, where the bivariate kernel B(x_p, x_q) is defined by</p><formula>B(x_p, x_q) = ∑_{i=1}^{n} Z_i K_2((x_p − X_pi)/h_p, (x_q − X_qi)/h_q) / ∑_{i=1}^{n} K_2((x_p − X_pi)/h_p, (x_q − X_qi)/h_q). (6)</formula><p>Here X_pi and X_qi denote the ith random samples of the two input variables x_p and x_q, respectively, K_2 denotes a 2-dimensional multiplicative kernel, and h_p and h_q denote the bandwidths of x_p and x_q, respectively. In (<ref type="formula">5</ref>), w_pq denotes the weight of B(x_p, x_q), quantifying its importance, defined by</p><formula>w_pq = e(p, q)^{−1} / ∑_{1≤p′&lt;q′≤d} e(p′, q′)^{−1}, (7)</formula><p>with the cross-entropy error function</p><formula>e(p, q) = −(1/n) ∑_{i=1}^{n} [Z_i log B(X_pi, X_qi) + (1 − Z_i) log(1 − B(X_pi, X_qi))]. (8)</formula><p>Note that e(p, q) diminishes as B(x_p, x_q) approaches Z_i more closely. In this sense, a smaller e(p, q) indicates a good estimate from the bivariate kernel B(x_p, x_q), which should correspondingly be assigned a greater weight, as shown in (<ref type="formula">7</ref>).</p><p>While WAMK-SIS demonstrates superiority over CMC and other parametric methodologies <ref type="bibr">[10]</ref>, it still has limitations. Firstly, as the dimension d grows, the number of bivariate kernels increases in the order of d². This increase results in unwarranted complexity in the form of both the estimated conditional failure probability and the instrumental density q. Secondly, the parsimony principle often holds in engineering systems, suggesting that a system's behavior is typically governed by a few crucial factors. 
According to this principle, a simpler model could be sought. Thirdly, the comprehensive inclusion of bivariate kernels could lead to redundant kernels. For instance, consider two bivariate kernels, B(x_1, x_2) and B(x_1, x_3). Suppose x_2 and x_3 are of minimal significance and their interactions with x_1 are negligible; under these circumstances, the two kernels might closely resemble each other, that is, B(x_1, x_2) ≈ B(x_1, x_3). This indicates that ŝ_full(x) in (<ref type="formula">5</ref>) could contain redundant terms.</p><p>To improve WAMK-SIS, our approach identifies the important bivariate kernels. The idea is similar to feature selection (or subset selection) in machine learning. Note that, with d(d − 1)/2 bivariate kernels, there are 2^{d(d−1)/2} − 1 possible estimators for approximating s(x) in (<ref type="formula">3</ref>). To choose the best one among them, our primary analytical tool is a modified version of the cross-validation (CV) technique, customized for the IS context. Suppose we have a pilot dataset or a dataset obtained in previous iterations of the iterative procedure. We partition this dataset into two exclusive sets, a training set and a validation set. We systematically navigate through every subset of the bivariate kernels. For each subset, we construct a POE estimator anchored on the training set and then gauge its performance using the validation set. This CV process is performed multiple times, each time with randomly partitioned training and validation sets. Finally, we identify the estimator with the minimum variance of the POE estimates.</p><p>Note that applying the CV procedure to all 2^{d(d−1)/2} − 1 possible candidates could be computationally burdensome. Thus, before applying the CV procedure, we first pre-screen the candidates so that the CV is applied to the selected candidates only. 
Specifically, let Ω_k denote the kth nonempty subset of the indices of bivariate kernels, for which the conditional POE estimator is</p><formula>ŝ_k(x) = ∑_{(p,q)∈Ω_k} w_pq B(x_p, x_q), (9)</formula><p>with the weights renormalized so that ∑_{(p,q)∈Ω_k} w_pq = 1. For instance, for the kth conditional POE estimator ŝ_k(x) with B(x_1, x_2) and B(x_1, x_3), we have ŝ_k(x) = w_12 B(x_1, x_2) + w_13 B(x_1, x_3), where B(x_p, x_q) can be obtained using (<ref type="formula">6</ref>), with w_pq defined in (<ref type="formula">7</ref>)–(<ref type="formula">8</ref>). To obtain ŝ_k(x), we use the total data available to us, either from pilot sampling or from previous iterations.</p><p>After obtaining all possible estimators, we choose the top candidates using the cross-entropy error function. The cross-entropy error function in (<ref type="formula">8</ref>) is used to quantify the importance of each bivariate kernel. Now, we replace the bivariate kernel function value with each candidate estimator ŝ_k(x) to get its cross-entropy function as follows:</p><formula>CE_k = −(1/n) ∑_{i=1}^{n} [Z_i log ŝ_k(X_i) + (1 − Z_i) log(1 − ŝ_k(X_i))], (10)</formula><p>where n is the number of data points used to build ŝ_k(x). Note that a small CE_k implies that ŝ_k(X_i) is close to 1 when there is a failure (or exceedance event) at X = X_i (i.e., Z_i = I(Y_i &gt; l) = 1), whereas it is close to 0 when there is no failure. Thus, we choose estimators with small CE_k values. In our case study, we choose ten such estimators.</p><p>Let ŝ_(k)(x), k = 1, …, 10, denote the kth pre-screened candidate. We then employ the CV procedure to choose the best one among them. As a remark, one may suggest using (<ref type="formula">10</ref>), instead of the CV procedure, to choose the best estimator. However, doing so could lead to overfitting. Moreover, while the cross-entropy error function quantifies the prediction accuracy of each estimator for the conditional POE s(x), our ultimate goal is to minimize the variance of the POE estimator. For the CV implementation, we divide the entire data set into training and validation sets. 
Next, for each pre-screened candidate, we construct ŝ_(k)(x) using the training set and evaluate its performance using the validation set. Let q_(k) denote the instrumental density with the trained ŝ_(k)(x), i.e.,</p><formula>q_(k)(x) = ŝ_(k)(x) f(x) / C_q_(k), where C_q_(k) = ∫_{χ_f} ŝ_(k)(x) f(x) dx. (11)</formula><p>With q_(k), the POE estimated in the validation set becomes</p><formula>P̂_(k) = (1/n_vld) ∑_{i=1}^{n_vld} I(Y_i &gt; l) f(X_i) / q_(k)(X_i), (12)</formula><p>where n_vld is the number of data points sampled from q_(k) to test its performance in the validation set. Note that, while we use the notation q_(k)(X_i) for the trained instrumental density, it is formulated with the selected bivariate kernels only.</p><p>Let Ω_(k) denote the kth set of bivariate kernels associated with ŝ_(k)(x). Ideally, we should choose the best subset Ω_(k) that leads to the smallest variance of P̂_(k) in (<ref type="formula">12</ref>). The challenge is that P̂_(k) is not readily available. Note that, in order to evaluate the performance of q_(k), X_i should be sampled from q_(k) and Y_i should be obtained by running the computer model at the sampled X_i. However, running the computer model n_vld times for every candidate q_(k) is computationally demanding. On the other hand, without running the computer model, Y_i in (<ref type="formula">12</ref>) is not available.</p><p>We address this challenge by approximating I(Y_i &gt; l) in (<ref type="formula">12</ref>) with the estimated conditional POE. Here, if we use ŝ_(k)(x) in (<ref type="formula">9</ref>), which is learned with the training set, to approximate I(Y_i &gt; l), it will incur an overfitting issue. Moreover, it violates the underlying principle of CV: train a model on the training set and test the resulting model on a separate validation set. Our remedy is to re-learn the conditional POE using the validation set data. 
Then the new conditional POE serves as a proxy that replaces I(Y_i &gt; l) in (<ref type="formula">12</ref>).</p><p>Let ŝ_(k)^vld(x) denote the conditional POE estimated with the bivariate kernels whose input pairs are in Ω_(k), using the validation set. We then approximate P̂_(k) in (<ref type="formula">12</ref>) as follows:</p><formula>P̂_(k) ≈ (1/n_vld) ∑_{i=1}^{n_vld} ŝ_(k)^vld(X_i) f(X_i) / q_(k)(X_i). (13)</formula><p>Note that q_(k)(X_i) is the instrumental density constructed with the training set; we evaluate it on the validation set via ŝ_(k)^vld(x). It is known that the normalizing constant C_q_(k) in (<ref type="formula">11</ref>) can be unstable for a multivariate instrumental density when the input dimension is three or higher <ref type="bibr">[10]</ref>. Thus, in our implementation we instead use the self-normalizing constant, replacing n_vld with ∑_{i=1}^{n_vld} f(X_i)/q_(k)(X_i). We repeat this CV procedure multiple times with randomly partitioned training and validation sets and compute the sample variance of the POE estimator P̂_(k) in (<ref type="formula">13</ref>). Finally, we select the best subset Ω_(k) that produces the smallest sample variance. Our approach identifies key input variables and produces a more explainable instrumental density. Accordingly, we refer to our method as explainable IS, or X-SIS for short.</p></div>
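The cross-entropy pre-screening step can be illustrated on toy data. The data-generating process, bandwidth, and candidate set below are illustrative assumptions, and the subsequent CV variance comparison is omitted; the sketch only shows how CE-style scores separate an informative pair of inputs from pairs involving a noise variable:

```python
import numpy as np

rng = np.random.default_rng(2)

def cross_entropy(s_vals, Z, eps=1e-12):
    # Cross-entropy score in the spirit of (10): small when the
    # estimated conditional POE tracks the 0/1 indicators
    s = np.clip(s_vals, eps, 1 - eps)
    return -np.mean(Z * np.log(s) + (1 - Z) * np.log(1 - s))

# Toy setup: the exceedance probability depends on x1, x2 only; x3 is noise
n = 800
X = rng.uniform(0.0, 1.0, (n, 3))
p_true = 1.0 / (1.0 + np.exp(-8.0 * (X[:, 0] + X[:, 1] - 1.2)))
Z = (rng.uniform(0.0, 1.0, n) < p_true).astype(float)

def bivariate_nw(Xq, Xdata, Z, pair, h=0.15):
    # Kernel-regression estimate built from a 2-D multiplicative Gaussian kernel
    d2 = ((Xq[:, None, pair[0]] - Xdata[None, :, pair[0]]) / h) ** 2 \
       + ((Xq[:, None, pair[1]] - Xdata[None, :, pair[1]]) / h) ** 2
    K = np.exp(-0.5 * d2)
    return (K @ Z) / K.sum(axis=1)

pairs = [(0, 1), (0, 2), (1, 2)]
scores = {p: cross_entropy(bivariate_nw(X, X, Z, p), Z) for p in pairs}
best = min(scores, key=scores.get)   # the informative pair (0, 1) should score lowest
```

In the paper's procedure this scoring only pre-screens candidates; the final choice among the survivors is made by the CV variance comparison in (13), precisely because the in-sample score above can overfit.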
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Wind Turbine Case Study</head><p>We use NREL's aeroelastic wind turbine simulators, TurbSim (version 1.50) <ref type="bibr">[7]</ref> and FAST (version 7.01.00a-bjj) <ref type="bibr">[8]</ref>, to estimate the load exceedance probability P(Y &gt; l). Specifically, we consider the 10-minute maximum tip deflection as the response variable Y, which is one of the essential load responses related to wind turbine stability <ref type="bibr">[11]</ref>. We set the resistance level at l = 2.06. As in <ref type="bibr">[10]</ref>, we consider the following five input variables: wind speed V, turbulence intensity TI, wind shear S, wind vertical angle VA, and surface roughness length SR. Therefore, we have X = (V, TI, S, VA, SR). For their original input distribution f, we use the same setting as outlined in <ref type="bibr">[10]</ref>. We conduct 25 experiments. Each experiment draws 600 pilot samples from a uniform distribution and includes 10 iterations, each of which generates 200 samples from the NREL simulators.</p><p>The subset-selection procedure proceeds as follows. For each CV repetition and each pre-screened candidate k, we obtain ŝ_(k)(x) using (<ref type="formula">9</ref>) on the training set to construct the instrumental density q_(k) in (<ref type="formula">11</ref>); then, with the data in the validation set, we obtain ŝ_(k)^vld(x) using (<ref type="formula">9</ref>) and calculate P̂_(k) using (<ref type="formula">13</ref>). After all repetitions, we find the standard error of P̂_(k) for k = 1, 2, …, 10, choose the best q*_(k) with the smallest standard error, and output the subset Ω_(k) associated with q*_(k).</p><p>We compare our approach X-SIS with WAMK-SIS. 
In addition to assessing the standard error (SE) of the POE estimates from the 25 experiments, we further evaluate estimator performance using the relative ratio (RR), defined as RR = N_T / N_CMC. Here, N_CMC denotes the number of CMC replications necessary to attain a similar SE, given by N_CMC = P_T(1 − P_T)/SE², with P_T = P(Y &gt; l) <ref type="bibr">[3]</ref>. Here P_T is the true failure probability, which is unknown; thus, we use the estimated POE in computing RR. Table <ref type="table">1</ref> summarizes the results. The means of the 25 POE estimates from the two methods are similar. However, the results suggest that the X-SIS estimator achieves a smaller SE than WAMK-SIS. It uses only 36% of the computational resources required by CMC, implying that CMC needs about three times as many replications to reach the SE level of 0.0029 achieved by X-SIS. While WAMK-SIS reduces the variance over CMC, it underperforms X-SIS. Table <ref type="table">2</ref> reports the total number and percentage of bivariate kernels chosen by X-SIS over the 10 iterations in the 25 experiments. In all cases, X-SIS chooses two or three bivariate kernels, very few compared to the WAMK-SIS estimator, which uses all 10 bivariate kernels. Table <ref type="table">3</ref> further summarizes the selected bivariate kernels. In most cases, the bivariate kernel B(V, TI) is employed in constructing the instrumental density. This implies that these two variables are the most important among the five input variables and that their interaction is significant. Among the candidates, the three kernels B(V, TI), B(V, S), and B(TI, VA) are employed most frequently to form the instrumental density.</p><p>Notably, the estimators that employ three bivariate kernels primarily encompass B(V, TI) and B(V, S) in most cases, suggesting that the three variables V, TI, and S are vital inputs impacting wind turbine reliability. 
These estimators, however, often include an additional bivariate kernel, as shown in Table <ref type="table">3</ref>, which reports the frequency of the selected bivariate kernels.</p><p>We believe that this inclusion is an attempt to prevent overfitting through the CV. For instance, consider the estimators in the first two rows of Table <ref type="table">3</ref>. While B(V, TI) and B(V, S) can characterize the exceedance pattern well, their sole inclusion might result in the instrumental density concentrating overly on a specific region, which could potentially induce estimation bias and escalated variance. The CV process aids in circumventing such unfavorable scenarios. A similar phenomenon can be observed with estimators that incorporate B(V, TI), the most significant bivariate kernel (refer to the third- and fifth-row estimators). In addition to B(V, TI), these estimators add an extra kernel such as B(S, VA) or B(VA, SR). It is noteworthy that these supplementary kernels can be considered less critical, even less so than the third kernel in the estimators consisting of three bivariate kernels, e.g., B(TI, VA) in the first-row estimator. By coupling these kernels with the most important kernel B(V, TI), the resulting instrumental density can effectively circumvent overfitting issues.</p><p>Additionally, the proposed method effectively steers clear of repetitive terms in estimating s(x). It is worth noting that the selected estimator does not incorporate multiple kernels that pair the same significant variable with other, less important variables. For instance, consider the first- and fourth-row estimators, which include B(TI, VA) and B(TI, SR), respectively. B(TI, VA) and B(TI, SR) each mirror a univariate kernel with only TI, because VA and SR are insignificant and thus their bandwidths are much wider than that of TI. Accordingly, no estimator includes both B(TI, VA) and B(TI, SR) simultaneously. 
We observe a similar trend across the other estimators as well.</p><p>In summary, the X-SIS estimator ensures that the chosen kernels encompass key input variables. The estimator frequently employs B(V, TI) and B(V, S), underscoring the significance of V, TI, and S along with their interactions. In addition to choosing vital kernels, it effectively circumvents overfitting, thus averting the potential problem of the instrumental density focusing excessively on overly narrow regions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Summary</head><p>This study devises a new nonparametric importance sampling method designed for assessing reliability with multivariate inputs. Our primary focus is on explaining the significance of each input, facilitating the construction of a more effective instrumental density. To address the challenge of overfitting, we propose a modified CV procedure within the IS framework. Notably, our new CV procedure avoids the additional computational burden of generating new data samples while effectively mitigating the risk of overfitting. It achieves this by guiding the selection of suitable bivariate kernels, ultimately reducing the variance of the resulting estimator. In the future, we plan to extend the approach to estimate extreme loads in wind turbine applications <ref type="bibr">[4]</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>https://doi.org/10.21872/2024IISE_6545</p></note>
		</body>
		</text>
</TEI>
