<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Data Pooling in Stochastic Optimization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10251657</idno>
					<idno type="doi">10.1287/mnsc.2020.3933</idno>
					<title level='j'>Management Science</title>
<idno>0025-1909</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Vishal Gupta</author><author>Nathan Kallus</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Managing large-scale systems often involves simultaneously solving thousands of unrelated stochastic optimization problems, each with limited data. Intuition suggests that one can decouple these unrelated problems and solve them separately without loss of generality. We propose a novel data-pooling algorithm called Shrunken-SAA that disproves this intuition. In particular, we prove that combining data across problems can outperform decoupling, even when there is no a priori structure linking the problems and data are drawn independently. Our approach does not require strong distributional assumptions and applies to constrained, possibly nonconvex, nonsmooth optimization problems such as vehicle-routing, economic lot-sizing, or facility location. We compare and contrast our results to a similar phenomenon in statistics (Stein’s phenomenon), highlighting unique features that arise in the optimization setting that are not present in estimation. We further prove that, as the number of problems grows large, Shrunken-SAA learns if pooling can improve upon decoupling and the optimal amount to pool, even if the average amount of data per problem is fixed and bounded. Importantly, we highlight a simple intuition based on stability that highlights when and why data pooling offers a benefit, elucidating this perhaps surprising phenomenon. This intuition further suggests that data pooling offers the most benefits when there are many problems, each of which has a small amount of relevant data. Finally, we demonstrate the practical benefits of data pooling using real data from a chain of retail drug stores in the context of inventory management.            This paper was accepted by Chung Piaw Teo, Special Issue on Data-Driven Prescriptive Analytics.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The stochastic optimization problem</p><p>is a fundamental model, with applications ranging from inventory management to personalized medicine. In typical data-driven settings, the measure P governing the random variable &#958; is unknown. Instead, we have access to a data set S { &#958;1 , . . . , &#958;N } independent and identically distributed (i.i.d.) from P and seek a decision x &#8712; X depending on these data. This model and its data-driven variant have been extensively studied in the literature (see <ref type="bibr">Shapiro et al. 2009</ref> for an overview).</p><p>Managing real-world, large-scale systems, however, frequently involves solving thousands of potentially unrelated stochastic optimization problems like Problem (1.1) simultaneously. For example, inventory management often requires optimizing stocking levels for many distinct products across categories, not just a single product. Firms typically determine staffing and capacity for many warehouses and fulfillment centers across the supply chain, not just at a single location. Logistics companies often divide large territories into many small regions and solve separate vehicle-routing problems, one for each region, rather than solve a single monolithic problem. In such applications, a more natural model than Problem (1.1) might be</p><p>] ,</p><p>(1.2)</p><p>where we solve a separate subproblem of the form (1.1) for each k, for example, setting a stocking level for each product. Here, &#955; k &gt; 0 represents the frequency with which the decision maker incurs costs from problems of type k, and &#955; avg 1 K &#8721; K k 1 &#955; k . Thus, in Problem (1.2), our total costs are driven by the frequencyweighted average of the costs of many distinct optimization problems.</p><p>Of course, intuition strongly suggests that since there are no coupling constraints across the feasible regions X k in Problem (1.2), one can and should decouple the problem into K unrelated subproblems and solve them separately. Indeed, when the measures P k are known, this procedure is optimal. When the P k 's are unknown and unrelated, but one has access to a data set S k { &#958;k,1 , . . . , &#958;k, Nk } drawn i.i.d. from P k independently across k, intuition still suggests that decoupling is without loss of generality and that datadriven procedures can be applied separately by subproblem.</p><p>A key message of this paper is that this intuition is false.</p><p>In the data-driven setting, when solving many stochastic optimization problems, we show that there exist algorithms that pool data across subproblems that outperform decoupling, even when the underlying problems are unrelated and the data are independent. This phenomenon holds, despite the fact that the kth data set S k tells us nothing about P l for l k and there is no a priori relationship between the P k . We term this phenomenon the data-pooling phenomenon in stochastic optimization.</p><p>Figure <ref type="figure">1</ref> illustrates the data-pooling phenomenon with a simulated example for emphasis. Here, K 10, 000, and the kth subproblem is a newsvendor problem with critical quantile 90%, that is, c k (x;&#958;) max 9(&#958;-x),(x-&#958;) { } . The measures P k are fixed, and in each run we simulate Nk 20 data points per subproblem. For the decoupled benchmark, we use a standard method, Sample Average Approximation (SAA; Definition 2.1), which is particularly well-suited to the data-driven newsvendor problem <ref type="bibr">(Levi et al. 2015)</ref>. For comparison, we use our novel Shrunken-SAA algorithm, which exploits the data-pooling phenomenon. We motivate and formally define Shrunken-SAA in Section 3, but, loosely speaking, Shrunken-SAA proceeds by replacing the kth data set S k with a "pooled" data set that is a weighted average of the original kth data set and all of the remaining l k data sets. It then applies SAA to each of these new pooled data sets. Perhaps surprisingly, by pooling data across the unrelated subproblems, Shrunken-SAA reduces the loss to full-information optimum by over 80% compared with SAA in this example.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Our Contributions</head><p>We describe and study the data-pooling phenomenon in stochastic optimization in the context of Problem (1.2). Our analysis applies to constrained, potentially nonconvex, nonsmooth optimization problems under fairly mild assumptions on the data-generating process. Specifically, we assume that each P k has finite support (potentially differing across k); in some cases, we can even relax this assumption. We contrast the data-pooling phenomenon to a similar phenomenon in statistics <ref type="bibr">(Stein's phenomenon)</ref>, highlighting unique features that arise in the optimization setting (see Theorem 2.2 and Example 2.3). Namely, unlike traditional statistical settings, the potential benefits of data pooling depend strongly on the structure of the underlying optimization problems, and, in some cases, data pooling may offer no benefit over decoupling.</p><p>This observation raises important questions: Given a particular data-driven instance of Problem (1.2), should we data-pool, and, if so, how? More generally, does data pooling typically offer a significant benefit over decoupling, or are instances like Figure <ref type="figure">1</ref> somehow the exception to the rule?</p><p>To help resolve these questions, we propose a simple, novel algorithm that we call Shrunken Sample Average Approximation (Shrunken-SAA). Shrunken-SAA generalizes the classical SAA algorithm and, consequently, inherits many of its excellent large-sample asymptotic properties (see Remark 4.1). Moreover, Shrunken-SAA is incredibly versatile and can be tractably applied to a wide variety of optimization problems with computational requirements similar to traditional SAA (see Remark 3.1). Unlike traditional SAA, however, Shrunken-SAA exploits the data-pooling phenomenon to improve performance over SAA, as seen in Figure <ref type="figure">1</ref>. Moreover, Shrunken-SAA exploits the structure of the optimization problems and strictly improves upon an estimate-then-optimize approach using traditional statistical shrinkage estimators (see Example 2.3 and Section 6). Notes. Consider K 10, 000 data-driven newsvendor problems, each with critical fractile 90% and 20 data points drawn independently across problems. SAA decouples the problems and orders the 90thsample quantile in each. Shrunken-SAA (see Algorithm 1 in Section 3), leverages data pooling. Indicated percentages are losses to the fullinformation optimum. Additional details in Section E.1 in Online Appendix E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gupta and Kallus: Data Pooling in Stochastic Optimization</head><p>Shrunken-SAA data-pools by combining data across subproblems in a particular fashion that is motivated by an empirical Bayesian argument. We prove that (under frequentist assumptions) for many classes of optimization problems, as the number of subproblems K grows large, Shrunken-SAA determines if pooling in this way can improve upon decoupling and, if so, also determines the optimal amount to pool (see <ref type="bibr">Theorems 4.2,</ref><ref type="bibr">4.3,</ref><ref type="bibr">4.5,</ref><ref type="bibr">and 4.6)</ref>. These theoretical results study Problem (1.2) when P k has finite, discrete support and the amount of data available for the kth subproblem is, itself, random (see Assumption 3.1). Some of our results do extend to continuous distributions (see Section 4.6 and Theorems F.1-F.3 in Online Appendix F), and numerical experiments suggest that our results are generally robust to the assumption of a random amount of data.</p><p>More interestingly, our theoretical performance guarantees for Shrunken-SAA, hold even when the expected amount of data per subproblem is small and fixed and the number of problems K is large, as in Figure <ref type="figure">1</ref>; that is, they hold in the so-called smalldata, large-scale regime <ref type="bibr">(Gupta and Rusmevichientong 2021)</ref>. Indeed, since many traditional data-driven methods (including SAA) converge to the full-information optimum in the large-sample regime, the small-data, largescale regime is arguably the more interesting regime in which to study the benefits of data pooling.</p><p>In light of the aforementioned results, Shrunken-SAA provides an algorithmic approach to deciding if, and by how much, to pool. To develop an intuitive understanding of when and why data pooling might improve upon decoupling, we also introduce the Sub-Optimality-Instability Trade-Off, a decomposition of the benefits of data pooling. We show that the performance of a data-driven solution to Problem (1.2) (usually called its out-of-sample performance in machinelearning settings) can be decomposed into a sum of two terms: a term that roughly depends on its insample suboptimality and a term that depends on its instability; that is, how much does in-sample performance change when training with one fewer data points? As we increase the amount of data pooling, we increase the in-sample suboptimality because we "pollute" the kth subproblem with data from other, unrelated subproblems. At the same time, however, we decrease the instability of the kth subproblem, because the solution no longer relies on its own data so strongly. Shrunken-SAA works by navigating this trade-off, seeking a "sweet spot" to improve performance. (See Section 5 for discussion.)</p><p>In many ways, the Sub-Optimality-Instability Trade-Off resembles the classical bias-variance trade-off from statistics. However, they differ in that the Sub-Optimality-Instability Trade-Off applies to general optimization problems, whereas the bias-variance trade-off applies specifically to the case of mean-squared error. Moreover, even in the special case when Problem (1.2) models mean-squared error, we prove that these two tradeoffs are distinct (see Section D.2 in Online Appendix D). In this sense, the Sub-Optimality-Instability Trade-Off may be of independent interest outside data pooling.</p><p>Stepping back, this simple intuition suggests that Shrunken-SAA, and data pooling more generally, offer significant benefits whenever the decoupled solutions to the subproblems are sufficiently unstable, which typically happens when there is only a small amount of relevant data per subproblem. It is in this sense that the behavior in Figure <ref type="figure">1</ref> is typical and not pathological. Moreover, this intuition also naturally extends beyond Shrunken-SAA, paving the way to developing and analyzing new algorithms that also exploit the hitherto-underutilized data-pooling phenomenon.</p><p>Finally, we present numerical evidence in an inventory management context using real data from a chain of European drug stores showing that Shrunken-SAA can offer significant benefits over decoupling when the amount of data per subproblem is small to moderate. These experiments also suggest that Shrunken-SAA's ability to identify an optimal amount of pooling and improve upon decoupling are relatively robust to violations of our assumptions on the data-generating process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Connections to Prior Work</head><p>Our proposal, Shrunken-SAA, generalizes SAA. In many ways, SAA is the most fundamental data-driven approach to Problem (1.1). SAA proxies P in (1.1) by the empirical distribution P on the data and optimizes against P. It enjoys strong theoretical and practical performance in the large-sample limit, that is, when N is large <ref type="bibr">(Kleywegt et al. 2002</ref><ref type="bibr">, Shapiro et al. 2009)</ref>. For data-driven newsvendor problems, specifically-an example of which we use throughout our work-SAA is the maximum likelihood estimate of the optimal solution and also is the distributionally robust optimal solution for a Wasserstein ambiguity set <ref type="bibr">(Esfahani and Kuhn 2018, p. 151)</ref>. SAA is incredibly versatile and applicable to a wide variety of classes of optimization problems. This combination of strong performance and versatility has fueled SAA's use in practice. Applied to Problem (1.2), SAA decouples the optimization into its K subproblems. Thus, because of its strong theoretical and practical performance, we use SAA as the natural, "apples-to-apples" decoupled benchmark to which we compare our data-pooling procedures.</p><p>The data-pooling phenomenon for stochastic optimization is also closely related to Stein's phenomenon in statistics <ref type="bibr">(Stein 1956</ref>; see also <ref type="bibr">Efron and Hastie 2016</ref> for a modern overview). <ref type="bibr">Stein (1956)</ref> considered estimating the mean of K normal distributions, each with known variance &#963; 2 , from K data sets. The kth data set is drawn i.i.d. from the kth normal distribution, and draws are independent across k. The natural decoupled solution to the problem (and the maximum likelihood estimate) is to use the kth sample mean as an estimate for the kth distribution. Surprisingly, whereas this estimate is optimal for each problem separately in a very strong sense (uniformly minimum variance unbiased and admissible), <ref type="bibr">Stein (1956)</ref> describes a pooled procedure that always outperforms this decoupled procedure with respect to total meansquared error whenever K &#8805; 3.</p><p>The proof of Stein's result is remarkably short, but arguably opaque. Many textbooks refer to it as "Stein's paradox," perhaps because it is not clear what drives the result. Why does it always improve upon decoupling? What is special about K 3? Is the key the normality assumption? The common variance assumption? The structure of mean-squared error? All of the above?</p><p>Many authors have tried to develop simple intuition for <ref type="bibr">Stein's result (e.g., Brown 1971</ref><ref type="bibr">, Efron and Morris 1977</ref><ref type="bibr">, Stigler 1990</ref><ref type="bibr">, Beran 1996</ref><ref type="bibr">, Brown and Zhao 2012)</ref> with mixed success. As a consequence, although Stein's phenomenon has had tremendous impact in statistics, it has, in our humble opinion, had a fairly limited impact on data-driven optimization. It is simply not clear how to generalize Stein's original algorithm to optimization problems different from minimizing mean-squared error. Indeed, the few data-driven optimization methods that attempt to leverage shrinkage apply either to quadratic optimization (e.g., <ref type="bibr">Jorion 1986</ref><ref type="bibr">, DeMiguel et al. 2013</ref><ref type="bibr">, Davarnia and Cornu&#233;jols 2017)</ref> or else under Gaussian or near-Gaussian assumptions <ref type="bibr">(Mukherjee et al. 2015, Gupta and</ref><ref type="bibr">Rusmevichientong 2021)</ref>, both of which are very close to Stein's original setting.</p><p>By contrast, our analysis of the data-pooling phenomenon requires very mild distributional assumptions and applies to constrained, potentially nonconvex, nonsmooth optimization problems. Numerical experiments in Section 6 further suggest that even our few assumptions are not crucial to the data-pooling phenomenon. Moreover, our proposed algorithm, Shrunken-SAA, is extremely versatile and can be applied in any setting in which SAA can be applied.</p><p>Finally, we note that (in)stability has been well studied in the machine-learning community (see, e.g., <ref type="bibr">Bousquet and Elisseeff 2002</ref><ref type="bibr">, Shalev-Shwartz et al. 2010</ref><ref type="bibr">, Yu 2013, and references therein)</ref>. <ref type="bibr">Shalev-Shwartz et al. (2010)</ref>, in particular, argue that stability is the fundamental feature of data-driven algorithms that enables learning. Our Sub-Optimality-Instability Trade-Off connects the data-pooling phenomenon in stochastic optimization to this larger statistical concept. To the best of our knowledge, however, existing theoretical analyses of stability focus on the large-sample regime. Ours is the first work to leverage stability concepts in the small-data, large-scale regime. From a technical perspective, this analysis requires somewhat different tools. We define Nmax max k Nk and Navg &#8801; 1 K &#8721; K k 1 Nk . Finally, let pk &#8801; mk / Nk be the empirical distribution for the kth subproblem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Model</head><p>Notice that we have used &#8226; notation when denoting Nk and conditioned on its value in specifying the distribution of mk . This is because, in our subsequent analysis, we will sometimes view the amount of data available for each problem as random (see Section 3.2). When the data are fixed and nonrandom, we condition on Nk explicitly to emphasize this fact.</p><p>With this notation, we can rewrite our target optimization problem:</p><p>Our goal is to identify a data-driven policy, that is,</p><p>) is small. We stress that the performance of a data-driven policy is random because it depends on the data.</p><p>As mentioned, with full information of p k , Problem (2.2) decouples across k and, after decoupling, no longer depends on the frequency weights &#955; k K&#955; avg . Our proposed algorithms will also not require knowledge A canonical policy to which we will compare is the Sample Average Approximation (SAA) policy, which proxies the solution of these decoupled problems by replacing p k with pk :</p><p>As we will see, SAA is closely related to our proposed algorithm Shrunken-SAA and hence provides a natural (decoupled) benchmark when assessing the value of data pooling.</p><p>Finally, we use the newsvendor problem as a running example in what follows. We say that the kth subproblem is a newsvendor problem with critical fractile 0</p><p>. Its full-information solution is the sth quantile of the kth distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">A Bayesian Perspective of Data Pooling</head><p>To motivate data pooling, we first consider a Bayesian approximation to our problem. Specifically, suppose that each p k were independently drawn from a common Dirichlet prior, that is,</p><p>with &#945; 0 &gt; 0 and p 0 &#8712; &#916; d , the d-dimensional simplex.</p><p>The Bayes-optimal decision minimizes the posterior risk, which is</p><p>, by linearity. Furthermore, by independence and conjugacy, respectively,</p><p>(2.4)</p><p>For any non-data-driven &#945; and p 0 , x k (&#945;, p 0 , mk ) depends on mk but not on ml for l k. This policy has an appealing, intuitive structure. Notice that pk (&#945;) overloads notation slightly and is a convex combination between pk pk (0), a data-based estimate of p k , and p 0 , an a priori estimate of p k . In traditional statistical parlance, we say that pk (&#945;) shrinks the empirical distribution pk toward the anchor p 0 . The Bayes-optimal solution is the plug-in solution when using this shrunken empirical measure; that is, it optimizes x k as though that were the known true measure. This differs from the SAA solution, which is the plug-in solution when using the "unshrunken" pk .</p><p>The parameter &#945; controls the degree of shrinkage. As &#945; &#8594; 0, x k (&#945;, p 0 , m) converges to an SAA solution, and as &#945; &#8594; &#8734;, x k (&#945;, p 0 , m) converges to the (nonrandom) solution to the fully shrunken kth subproblem. In this sense, the Bayes-optimal solution "interpolates" between the SAA solution and the fully shrunken solution. The amount of data Nk attenuates the amount of shrinkage; that is, subproblems with more data are shrunk less aggressively for the same &#945;.</p><p>Alternatively, we can give a data-pooling interpretation of x k (&#945;, p 0 , mk ) via the Bayesian notion of pseudocounts. Observe that x k (&#945;, p 0 , mk ) &#8712; arg min</p><p>Nk +&#945; is a distribution on {a k1 , . . . , a kd }. In other words, we can interpret x k (&#945;, p 0 , mk ) as the solution obtained when we augment each of our original K data sets with &#945; additional "synthetic" data points with counts &#945;p 0 . As we increase &#945;, we add more synthetic data.</p><p>For &#945; &gt; 0, x k (&#945;, p 0 , 0) is the solution to the fully shrunken kth subproblem. For emphasis, let</p><p>so that x k (&#945;, p 0 , 0) x k (&#8734;, p 0 ) for all &#945; &gt; 0. For completeness, we also define x k (0, p 0 , 0) x k (&#8734;, p 0 ), so that</p><p>In summary, x k (&#945;, p 0 , mk ) has an intuitive structure that is well defined regardless of the precise structure of the cost functions c k (&#8226;) or feasible region X . Importantly, this analysis shows that, when the p k 's follow a Dirichlet prior, data pooling by &#945; is never worse than decoupling, and will be strictly better whenever x SAA k ( mk ) is not an optimal solution to the problem defining x k (&#945;, p 0 , mk ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Pooling in a Frequentist Setting</head><p>It is perhaps unsurprising that data pooling improves upon the decoupled SAA solution in the Bayesian setting, because problems l k contain information about &#945; and p 0 , which, in turn, contain information about the p k 's. However, even in frequentist settings, that is, when the p k 's are fixed constants that may have no relationship to one another and there is no "groundtruth" values for &#945; or p 0 , x(&#945;, p 0 , m) can still improve upon SAA through a careful choice of &#945; and p 0 that depend on all the data. This is the heart of Stein's result for Gaussian random variables and meansquared error. </p><p>Fix any p 0 &#8712; &#916; d and &#945; &#8805; 0 (not depending on the data). A direct computation shows that</p><p>where &#956;k 1 N &#8721; N i 1 &#958;ki is the usual sample mean and &#956; k0 p 0 a k . Notice, in particular, that the decoupled SAA solution is x SAA ( &#956;1 , . . . , &#956;K ), corresponding to &#945; 0.</p><p>For any p 0 and &#945;, the objective value of x(&#945;, p 0 , m) is</p><p>by the usual bias-variance decomposition of meansquared error (MSE). This objective is the average of K independent random variables. Hence, we might intuit that, under appropriate regularity conditions (see Theorem 2.1) and conditional on N, as K &#8594; &#8734;,</p><p>)</p><p>, again using the bias-variance decomposition of MSE. We can minimize the right-hand side over &#945; explicitly, yielding the value</p><p>where AP stands for a priori, meaning &#945; AP p 0 is the onaverage-best a priori choice of shrinkage before observing any data. In particular, substituting &#945; 0 and &#945; &#945; AP p 0 into the second term of Example 2.5 shows that, up to a term that is vanishing as K &#8594; &#8734;, shrinking by &#945; AP p 0 decreases the MSE by</p><p>This benefit is strictly positive for any values of p k and p 0 , and increasing in &#945; AP p 0 . Unfortunately, we cannot implement x(&#945; AP p 0 , p 0 , m) in practice, because &#945; AP p 0 is not computable from the data; it depends on the unknown &#956; k and &#963; 2 k . The next theorem shows that we can, however, estimate &#945; AP p 0 from the data in a way that achieves the same benefit as K &#8594; &#8734;, even if N is fixed and small. See Online Appendix A for proof.</p><p>Theorem 2.1 (Data Pooling for MSE). Consider a sequence of subproblems, indexed by k 1, 2, . . .. Suppose for each k that the k th subproblem minimizes mean-squared error; that is, p k is supported on {a k1 , . . . , a kd } &#8838; R, X k R, and c ki (x) (xa ki ) 2 . Suppose further that there exists &#955; avg , N &#8805; 2, and a max &lt; &#8734; such that &#955; k &#955; avg , Nk N, and a k &#8734; &#8804; a max for all k. Fix any p 0 &#8712; &#916; d , and let</p><p>Then, conditional on N, as K &#8594; &#8734;,</p><p>. In this form, we can see that the resulting estimator with pooling &#945; JS p 0 strongly resembles the classical James-Stein mean estimator (see <ref type="bibr">Efron and Hastie 2016, equation (7.51</ref>)), with the exception that we have replaced the variance &#963; 2 k , which is assumed to be 1 in Stein's setting, with the usual, unbiased estimator of that variance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gupta and Kallus: Data Pooling in Stochastic Optimization</head><p>This resemblance motivates our "JS" notation. Theorem 2.1 is neither stronger nor weaker than the James-Stein theorem. Our result applies to non-Gaussian random variables and holds in probability, but is asymptotic; the James-Stein theorem requires Gaussian distributions and holds in expectation, but applies to any fixed K &#8805; 3. Theorem 2.1 shows that data pooling for meansquared error always offers a benefit over decoupling for sufficiently large K, no matter what the p k may be. Data pooling for general optimization problems, however, exhibits more subtle behavior. In particular, as shown in the following example and theorem, there exist instances where data pooling offers no benefit over decoupling, as well as instances where data pooling may be worse than decoupling.</p><p>Example 2.2 (Data Pooling for Simple Newsvendor).</p><p>Consider a special case of Problem (2.2) such that, for all k,</p><p>In words, the kth subproblem estimates the median of a Bernoulli random variable by minimizing mean absolute deviation or, equivalently, is a newsvendor problem with critical fractile 0.5 for Bernoulli demand. We order the support so that p k1 P(&#958; k 1), as is typical for a Bernoulli random variable. Suppose further that, for each k, p k1 &gt; 1 2 , and fix any p 01 &lt; 1 2 . Note that x k (&#945;, p 0 , mk ) I[ pk1 &#8805; 1 2 + &#945;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Nk</head><p>( 1 2p 01 )]. 2 Further, for any &#945; (possibly depending on m),</p><p>where the last equality follows since</p><p>Notice that p k1 &gt; 1 2 &#8658; (2p k1 -1) &gt; 0, so this last expression is nonnegative. It follows that, path by path, shrinkage by any &#945; &gt; 0 cannot improve upon the decoupled solution (&#945; 0). Moreover, if x k (&#945;, p 0 , mk ) x k (0, p 0 , mk ), then the performance is strictly worse. If we had chosen p 01 &#8805; 1 2 and p k1 &lt; 1 2 , then a similar result holds.</p><p>We summarize this example in the following theorem.</p><p>Theorem 2.2 (Data Pooling Does Not Always Offer Benefit). Given any p 0 , there exist instances of Problem (2.2) such that shrinkage does not outperform the (decoupled) SAA solution. Moreover, if x(&#945;, p 0 , m) performs the same as SAA, then x(&#945;, p 0 , m) is, itself, an SAA solution.</p><p>On the other hand, there exist examples where the James-Stein estimator and traditional statistical reasoning might suggest the benefits of pooling are marginal, but, by data pooling in a way that exploits the optimization structure, we can achieve significant benefits. Specifically, our Bayesian motivation in Section 2.1 suggests pooling offers little benefit when the p k 's are very dispersed; that is, the Dirichlet prior has high variance and &#945; 0 is small. Similarly, Theorem 2.1 and <ref type="bibr">Efron and Morris (1977)</ref> both suggest that the benefits of pooling over decoupling for MSE are marginal if the subproblem means are quite dispersed (see Equation (2.6)). Nonetheless, for general optimization problems, we observe that pooling might still offer substantive benefits in these situations.</p><p>Example 2.3. (Pooling Can Offer Benefit Even When p k 's Are Dispersed). Let d &gt; 3, and fix some 0 &lt; s &lt; 1. Suppose that the k th subproblem is a newsvendor problem with critical fractile f k &gt; s and demand distribution supported on the integers 1, . . . , d. For each k, let p k1 0, p kd 1s, and p kj k s for some 1 &lt; j k &lt; d. Consider the fixed anchor p 01 s, p 0d 1s, and p 0j 0 for 1 &lt; j &lt; d. Notice that typical p k 's are very far from</p><p>, which is the maximal distance between two points on the simplex. In other words, the p k 's are not very similar. Moreover, the means are also dispersed for s close to 1 since 1</p><p>if the j k 's are chosen uniformly. Consequently, the James-Stein estimator does not shrink very much in this example. A straightforward computation shows that, for K sufficiently large,</p><p>with high probability, which is close to 0 for s close to 1. However, the full-information solution for the k th problem is x * k d, which also equals the fully pooled (&#945; &#8734;) solution, x k (&#8734;, p 0 ). Hence, pooling in an optimization-aware way can achieve fullinformation performance, whereas both decoupling and an "estimate-then-optimize" approach using James-Stein shrinkage necessarily perform worse. In other words, pooling offers significant benefits despite the p k 's being as dispersed as possible, because of the optimization structure, and leveraging this structure is necessary to obtain the best shrinkage. &#9633; Theorems 2.1 and 2.2 and Examples 2.2 and 2.3 highlight the fact that data pooling for general optimization is more complex than Stein's phenomenon. In particular, in Stein's classical result for meansquared error and Gaussian data, data pooling always offers a benefit for K &#8805; 3. For other optimization problems and data distributions, data pooling may not offer a benefit, or it may offer a benefit but requires a new way of choosing the pooling amount. This raises two important questions: First, how do we identify if an instance of Problem (2.2) would benefit from data pooling? Second, if it does, how do we compute the "optimal" amount of pooling? In the next sections, we show how our Shrunken-SAA algorithm can be used to address both questions in the relevant regime, where K is large but the average amount of data per subproblem remains small. Indeed, we show that Shrunken-SAA achieves the bestpossible shrinkage in an optimization-aware fashion for many types of problems and choices of anchor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Shrunken SAA Algorithm</head><p>Algorithm 1 (The Shrunken-SAA Algorithm)</p><p>Input: Data S k { &#958;k1 , . . . , &#958;k Nk }, k 1, . . . , K, and an anchor distribution h(S) Fix a finite grid A &#8838; [0, &#8734;) for &#945; &#8712; A, k 1, . . . , K, j 1, . . . , Nk define:</p><p>Algorithm 1 formally defines Shrunken-SAA. The crucial step is the "Modified LOO Cross-Validation," which we discuss in detail in Sections 3.2 and 3.3. To highlight similarities to SAA, we have stated the algorithm in terms of the data sets S k and S (S 1 , . . . , S K ). Here, h(S) represents an arbitrary, possibly data-driven anchor distribution (we present some examples, shortly). Recall that we can equivalently express S k in terms of the counts mk . In that notation, we recognize that if the j th data point of S k is a ki , then x k,-j (&#945;, h( m)) x k (&#945;, h( m), mke i ) and x S-SAA k x k (&#945; S-SAA , h( m), mk ). In other words, Shrunken-SAA retains the particular pooling structure of Equation (2.4) suggested by our Bayesian argument, but it allows for a data-dependent anchor h(S) (equivalently, h( m)) and chooses the amount of pooling via a particular cross-validation scheme. We present Algorithm 1 using a finite grid A, but our theory also pertains to A [0, &#8734;).</p><p>Remark 3.1 (Computational Complexity). Computationally, Algorithm 1 does not depend on d, the size of the support of &#958; k . Its bottleneck is computing x k,-j , which is similar to solving the k th subproblem by SAA with an augmented data set described by h(S). More specifically, Algorithm 1 depends on the data only through h(S) and averages of functions over subsets of S, neither of which explicitly depends upon d. Consequently, although our setup and analysis assume that &#958; k has finite discrete support, from an implementation perspective, we can apply Shrunken-SAA when &#958; k has continuous support without discretization, so long as we can efficiently solve these augmented SAA problems (see our empirical study in Section E.6 of Online Appendix E). From a theoretical perspective, some of our analysis extends to this continuous setting (see Section 4.6). In the remainder, we follow Section 2 and treat the data as discrete, referring to the data by mk and m.</p><p>We consider Shrunken-SAA to be roughly as tractable as SAA. We say "roughly" because, in the worst case, one must solve at most A | | &#8721; K k 1 min(d, Nk ) problems in the LOO cross-validation step, which, if we sample from h( m), have a similar structure to SAA. Fortunately, we can parallelize these problems in distributed computing environments and use previous iterations to "warm-start" solvers. Moreover, in Section E.8 of Online Appendix E, we observe empirically that less computationally expensive &#954;-fold cross-validation procedures can be used in place of LOO with similar performance. &#9633; For clarity, the &#945; S-SAA h parameter (with A [0, &#8734;)) computed by Algorithm 1 is</p><p>) . (3.1)</p><p>The Anchor Distribution h( m) As stated, the anchor in Algorithm 1, h( m), is an input. We think of h( m) as a function that selects an anchor distribution from a candidate set of distributions P. In what follows, we will focus on two types of anchors and corresponding candidate sets P:</p><p>&#8226; Fixed anchors: Here, h( m) p 0 , P {p 0 } for some fixed p 0 , for example, the uniform distribution p 0 e/d. Fixed anchors might be used for computational/statistical simplicity or when there is strong a priori knowledge of a good anchor. In this case, we abuse notation slightly, replacing the map h : m &#8594; p 0 with the constant p 0 when clear from context; for exampe, we write &#945; S-SAA p 0 for &#945; S-SAA h .</p><p>&#8226; Data-driven anchors: Here, h( m) is any procedure that uses the data m to select a distribution, and P is the image of h(&#8226;). One example might be to use all the data to fit a parametric distribution, for example, a lognormal distribution, via maximum likelihood and use this fitted distribution as the anchor. Then, P would be the set of lognormal distributions.</p><p>We also pay particular focus to two special cases of data-driven anchors in what follows:</p><p>&#8226; LOO-optimized anchor: For a given P &#8838; &#916; d , let</p><p>We will see later that h P satisfies stronger optimality properties than general data-driven anchors and, hence, we treat it separately. From an implementation point of view, when applying Algorithm 1, we only ever require the value of h P ( m), not the full-function h P (&#8226;). Thus, Algorithm 1 with h P (&#8226;) amounts to replacing the "Modified LOO Cross-Validation" step by a joint optimization over anchor and pooling amount:</p><p>) .</p><p>(3.3)</p><p>We note that the multivariate optimization problem in Section 3.3 may be challenging depending on the structure of P, motivating our second special case:</p><p>&#8226; GM-anchor: We also consider a computationally simpler "grand-mean" anchor h( m) pGM , where</p><p>Nmax &gt; 0 and e/d otherwise. (For this data-driven anchor, P &#916; d .) This choice is motivated by our Bayesian perspective on data pooling from Section 2.1. In the Bayesian setting, pGM is an unbiased estimator of the prior mean. We observe empirically in Section 6 that pGM is a strong and computationally efficient heuristic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Oracle Benchmarks</head><p>From Theorem 2.2, data pooling need not improve upon decoupling for a given h(&#8226;). To establish appropriate benchmarks, we first define the oracle pooling for given h(&#8226;); that is,</p><p>) , where</p><p>Notice that &#945; OR h is random, depending on the entire data-sequence. By construction, Z K (&#945; OR h , h( m)) lower bounds the performance of any other data-driven pooling policy with anchor h( m) path by path. Hence, it serves as a strong performance benchmark. However, &#945; OR h also depends on the unknown p k and &#955; k , and, hence, is not implementable in practice. In this sense, it is an oracle.</p><p>Given any &#945; (possibly depending on the data), we measure the suboptimality of pooling by &#945; relative to the oracle pooling for h(&#8226;) on a particular data realization by</p><p>Good pooling procedures have small suboptimality with high probability with respect to the data. We allow for &#945; OR h 0. Thus, procedures with small suboptimality still have good performance in instances when data pooling is not beneficial. Studying when &#945; OR h &gt; 0 further gives intuition into when and why data pooling is helpful, a task we take up in Section 5.</p><p>The aforementioned oracle is defined with respect to a given anchor. One might also seek to benchmark performance relative to the best-possible anchor. Given any P &#8838; &#916; d , we define the oracle choice of anchor and pooling amount for anchors in P and for a particular data realization by</p><p>Then, given any anchor q &#8712; P and pooling amount &#945; (both possibly depending the data), we measure the suboptimality of shrinking by &#945; toward q by</p><p>For clarity, we observe that, by construction, &#945; OR P &#945; OR q OR P .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Motivating &#945; S-SAA Through Unbiased Estimation</head><p>We first consider a fixed anchor h( m) p 0 . In this case, we abuse notation slightly and write</p><p>One approach to choosing &#945; p 0 might be to construct a suitable proxy for Z K (&#945;, p 0 ) in Equation (3.6) based only on the data and then choose the &#945; p 0 that optimizes this proxy.</p><p>If we knew the values of &#955; k , then a natural proxy might be to replace the unknown p k with pk ; that is, optimize 1</p><p>, since both pk and x k (&#945;, p 0 , mk ) depend on the data mk . Worse, this bias wrongly suggests that &#945; 0, that is, decoupling, is always a good policy, because x k (0, p 0 , mk ) always optimizes this proxy, by construction. By contrast, Theorem 2.1 shows that data pooling can offer significant benefits. This type of bias and its consequences are well known in other contexts and are often termed the "optimizer's curse"-in-sample costs are optimistically biased and may not generalize well.</p><p>These features motivate us to seek an unbiased estimate of Z K (&#945;, p 0 ). At first glance, however, Z K (&#945;, p 0 ), which depends on both the unknown p k and unknown &#955; k , seems particularly intractable unless x k (&#945;, p 0 , mk ) admits a closed-form solution as in Example 2.1. A key observation is that, in fact, Z K (&#945;, p 0 ) does more generally admit an unbiased estimator, if we also introduce an additional assumption on our data-generating mechanism, that is, that the amount of data is random.</p><p>Assumption 3.1 (Randomizing Amount of Data). There exists an N such that Nk &#8764; Poisson(N&#955; k ) for each k 1, . . . , K.</p><p>Under Assumption 3.1, (unconditional) expectations and probabilities should be interpreted as over both the random draw of Nk and the counts mk .</p><p>Analytically, the benefit of Assumption 3.1 is that it breaks the dependence across i in mk . Namely, by the Poisson-splitting property, under Assumption 3.1,</p><p>and the mki 's are independent across i and k. If Nk were nonrandom, then these mki would be dependent.</p><p>Beyond its analytical convenience, we consider Assumption 3.1 to be reasonable in many applications. Consider, for instance, a retailer optimizing the price of k distinct products; that is, x k represents the price of product k, &#958; k represents the (random) valuation of a typical customer, and c k (x k , &#958; k ) is the (negative) profit earned. In such settings, one frequently ties data collection to time; that is, one might collect N 6 months worth of data. To the extent that customers arrive seeking product k in a random fashion, the number of arrivals Nk that one might observe in N months is, itself, random and is reasonably modeled as Poisson with rate proportional to N. Similar statements apply whenever data for problem k are generated by an event that occurs randomly, for example, when observing the response time of emergency responders (disasters occur intermittently), effectiveness of a new medical treatment (patients with the relevant disease arrive sequentially), or any aspect of a customer service interaction (customers arrive randomly to service).</p><p>In some ways, this perspective tacitly underlies the formulation of Problem (2.2) itself. Indeed, one way to interpret the subproblem weights</p><p>the total long-run costs. However, if problems of type k occur at rate &#955; k , it should be that observations of type k, that is, realizations of &#958; k , also occur at rate &#955; k , supporting Assumption 3.1.</p><p>In settings where data collection is not tied to randomly occurring events, modeling Nk as Poisson may still be a reasonable approximation if d is large relative to Nk and each of the individual p ki 's are small. Indeed, under such assumptions, a Multinomial( Nk , p k ) is well approximated by independent Poisson random variables with rates Nk p ki , i 1, . . . d (see <ref type="bibr">McDonald 1980, Deheuvels and</ref><ref type="bibr">Pfeifer 1988</ref> for a formal statement). In this sense, we can view the consequence of Assumption 3.1 as a useful approximation to the setting where the Nk 's are fixed, even if it is not strictly true.</p><p>In any case, under Assumption 3.1, we develop an unbiased estimate for Z K (&#945;, p 0 , m). We use the following identity <ref type="bibr">(Chen 1975)</ref>. For any f : Z + &#8594; R, for which the expectations exist,</p><p>The proof of the identity is immediate from the Poisson probability mass function. 3  Now, for any &#945; &#8805; 0 and q &#8712; &#916; d , define</p><p>and</p><p>(3.8) Lemma 3.1. (An Unbiased Estimator for Z K (&#945;, p 0 )). Under Assumption 3.1, we have for any &#945; &#8805; 0 and</p><p>Taking expectations of both sides, summing over i 1, . . . , d, and scaling by N&#955; avg proves</p><p>Finally, averaging this last equality over k completes the lemma.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gupta and Kallus: Data Pooling in Stochastic Optimization</head><p>We therefore propose selecting &#945; by minimizing the estimate Z LOO K (&#945;, p 0 ). As written, Z LOO K (&#945;, p 0 ) still depends on the unknown N and &#955; avg ; however, these values occur multiplicatively and are positive, and so do not affect the optimizer. Hence, the optimizer is exactly &#945; S-SAA h , as in Equation (3.1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Motivating &#945; S-SAA via Modified</head><p>Leave-One-Out Cross-Validation Although we motivated Equation (3.1) via an unbiased estimator, we can alternatively motivate it through leave-one-out cross-validation. This latter perspective informs our "LOO" notation. Indeed, consider again our decision maker, and assume in line with Assumption 3.1 that subproblems of type k arrive randomly according to a Poisson process with rate &#955; k , independently across k. When a problem of type k arrives, she incurs a cost c k (x k , &#958;). Again, the objective of Problem (2.2) thus represents her expected longrun costs.</p><p>We can alternatively represent her costs via the modified cost function C(x 1 , . . . , x K , &#954;, &#958;) c &#954; (x &#954; , &#958;), where &#954; is a random variable indicating which of the k subproblems she is currently facing. In particular, letting P(&#954; k) The grand data set can be seen as i.i.d. draws of (&#954;, &#958;).</p><p>For a fixed &#945; and p 0 , the leave-one-out estimate of</p><p>is given by removing one data point from the grand data set, training x 1 (&#945;, p 0 , &#8226;), . . . , x K (&#945;, p 0 , &#8226;) on the remaining data, and evaluating C(&#8226;) on the left-out point using these policies. We repeat for each point in the grand data set and average. We can rewrite this leave-oneout estimate as</p><p>) ,</p><p>which agrees with the objective of Equation (3.1) up to a positive multiplicative constant. Although this multiplicative constant does not affect the choice of &#945; S-SAA , it does cause the traditional leave-oneout estimator to be biased. This bias agrees with folklore results in machine learning that assert that leave-one-out does generally exhibit a small bias <ref type="bibr">(Hastie et al. 2001)</ref>.</p><p>For data-driven anchors, we stress that, unlike traditional leave-one-out validation, we do not use one fewer points when computing the anchor in Algorithm 1; we use h( m) for all iterations. Hence, Shrunken-SAA is not strictly a leave-one-out procedure, motivating our qualifier "Modified."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Performance Guarantees</head><p>for Shrunken-SAA</p><p>In this section, we show that, in the limit where the number of subproblems K grows, shrinking by &#945; S-SAA h is essentially best possible. More precisely, for any K &#8805; 2 and any 0 &lt; &#948; &lt; 1/2, with probability at least 1 -&#948;, we prove that</p><p>where the &#213;(&#8226;) notation suppresses logarithmic factors in K, and 1 &lt; &#946; &lt; 2 is a constant that depends on the particular class of optimization problems under consideration. Imporantly, by the Borel-Cantelli lemma, Equation (4.1) implies SubOpt h,K (&#945; S-SAA h ) &#8594; 0, almost surely as K &#8594; &#8734;, even if the expected amount of data per subproblem remains fixed.</p><p>Equation (4.1) asserts that for a given anchor h(&#8226;), Shrunken-SAA achieves the best possible shrinkage amount as K &#8594; &#8734;. We will also prove similar bounds on SubOpt P,K (&#945; S-SAA h , h P ( m)). Such bounds assert that, for a given class P, Shrunken-SAA with h P (&#8226;) achieves the best possible anchor and shrinkage amount simultaneously.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Overview of Proof Technique</head><p>To prove performance guarantees like Equation (4.1), we first bound the suboptimality of Shrunken-SAA in terms of the maximal stochastic deviations of Z K (&#945;, h) and Z LOO K (&#945;, h) from their means.</p><p>Lemma 4.1 (Bounding Sub-Optimality). Suppose that Assumption 3.1 holds.</p><p>For a non-data-driven anchor h( m) p 0 , Similarly, for a general data-driven anchor with h( m) &#8712; P,</p><p>Finally, for h h P , SubOpt P,K (&#945; S-SAA h P , h P ( m)) is also bounded by the right-hand side of Equation (4.2).</p><p>Proof. By definition of &#945; S-SAA</p><p>By Lemma 3.1, the last term is zero, which establishes the first statement. The proof of the second statement is similar, but, in the second inequality, we take an additional supremum over q &#8712; P to replace h( m).</p><p>The proof of the third statement is similar, using</p><p>(&#945; OR P , q OR P ), and taking a supremum over &#945; &#8805; 0, q &#8712; P in the second inequality.</p><p>Proving a performance guarantee for &#945; S-SAA h thus reduces to bounding the maximal deviations in the lemma. Recall that Z K (&#945;, q)</p><p>Both processes have a special form: they are the empirical average of K independent stochastic processes (indexed by k). Fortunately, there exist standard tools to bound the maximal deviations of such empirical processes that rely on bounding their metric entropy.</p><p>To keep our paper self-contained, we summarize one such approach presented in <ref type="bibr">Pollard (1990)</ref>, specifically in equation (7.5) of that work. Recall that, for any set S &#8838; R d , the -packing number of S, denoted by D( , S), is the largest number of elements of S that can be chosen so that the Euclidean distance between any two is at least . Intuitively, packing numbers describe the size of S at scale . Theorem 4.1 (A Maximal Inequality <ref type="bibr">;</ref><ref type="bibr">Pollard 1990)</ref>. Let W(t) (W 1 (t), . . . , W K (t)) &#8712; R K be a stochastic process indexed by t &#8712; T , and let W K (t)</p><p>+ be a random variable such that W k (t) &#8402; &#8402; &#8402; &#8402; &#8804; F k for all t &#8712; T , k 1, . . . , K. Finally, define the random variable</p><p>Then, for any p &#8805; 1 and any 0 &lt; &#948; &lt; 1, with probability at least 1 -&#948;,</p><p>If T is finite, then one can bound the maximal deviation with a union bound. Theorem 4.1 extends beyond this simple case to cases where T | | &#8734;. The random variable F in the theorem is called an envelope for the process W(t). The random variable J is often called the Dudley integral. Whereas packing numbers describe the size of a set at scale , the Dudley integral roughly describes the size of the set at varying scales. We again refer the reader to <ref type="bibr">Pollard (1990)</ref> for discussion.</p><p>Our overall proof strategy is to use Theorem 4.1 to bound the two suprema in Lemma 4.1 and thus obtain a bound on the suboptimality. Specifically, define the following stochastic processes:</p><p>) .</p><p>Our proof strategy will be to (1) compute envelopes for both processes; (2) compute the packing numbers and Dudley integrals for the relevant aforementioned sets;</p><p>(3) apply Theorem 4.1 to bound the relevant maximal deviations; and (4) use these bounds in Lemma 4.1 to bound the suboptimality. We execute this strategy for several special cases in the remainder of the section. As a first step, we identify envelopes for each process. We restrict attention to the case where the optimal value of each subproblem is bounded for any choice of anchor and shrinkage.</p><p>Assumption 4.1 (Bounded Optimal Values). There exists C such that for all i 1, . . . , d, and k 1 . . . , K, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Data-Driven Anchors and Strongly</head><p>Convex Problems We next consider the case of a data-driven anchor h( m) &#8712; P. Our performance guarantees will depend on the complexity of P as measured by the size of its 1 -packing numbers. Namely, we let D 1 ( , P) be the largest number of elements of P that can be chosen so that the 1 -distance between any two is at least . 5 Then, we have the following.</p><p>Theorem 4.3 (Shrunken-SAA with Data-Driven Anchors for Strongly Convex Problems). Suppose that Assumptions 3.1, 4.1, and 4.2 hold, K &#8805; 2. Let d 0 &#8805; 1 be such that for any 0 &lt; &lt; 1/2, log D 1 ( , P) &#8804; d 0 log(1/ ). Then, there exists a universal constant A such that for any 0 &lt; &#948; &lt; 1/2, with probability at least 1 -&#948;, we have that</p><p>In the special case of h P (&#8226;), we can prove an even stronger result, namely, that Shrunken-SAA with h P performs comparably to pooling in an optimal way to the best anchor within the class P.</p><p>Theorem 4.4. (Shrunken-SAA with h P for Strongly Convex Problems). Under the assumptions of Theorem 4.3, there exists a universal constant A such that for any 0 &lt; &#948; &lt; 1/2, with probability at least 1 -&#948;, we have that</p><p>In both theorems, the constant d 0 measures the complexity of P. Without loss of generality, d 0 &#8804; 3d since P &#8838; &#916; d and log D 1 ( , &#916; d ) &#8804; 3d log(1/ ) <ref type="bibr">(Pollard 1990, lemma 4.1)</ref>. In practice, we might choose flexible, parametric families for P with small d 0 that do not scale with d. An example might be when P consists of all (truncated) Poisson distributions with mean at most &#923;, in which case, one can take d 0 2 max (1, log(&#923;)), independently of d (and the truncation).</p><p>Another example is given in Section 6 using beta distributions. In general, we expect that our performance bounds must depend on the complexity of P in some way, because we impose no assumptions on the function h( m) that selects the anchor and, hence, must control behavior across all of P.</p><p>Both proofs follow the strategy of Section 4.1 (see Section C.2 of Online Appendix C). The key idea to bounding the packing numbers is again to leverage continuity and cover the set {(&#945;, q) : &#945; &#8805; 0, q &#8712; P}. Since both proofs leverage Lemma 4.1, the right-hand sides of the bounds are the same.</p><p>By contrast, the left-hand sides of Theorems 4.3 and 4.4 are different: the first measures suboptimality relative to an oracle with a prespecified anchor, whereas the second is relative to an oracle that can optimize the choice of anchor. This distinction mirrors the difference between "estimate-then-optimize" procedures and those which choose parameters in an optimizationaware fashion. Continuing our example where P is a set of Poisson distributions, Theorem 4.3 bounds the suboptimality of Shrunken-SAA when using (all) the data to fit a Poisson distribution without regard to the downstream optimization, for example, by maximum likelihood, and then choosing &#945; and x k (&#8226;) to optimize. By contrast, Theorem 4.4 bounds the performance of Shrunken-SAA when choosing the anchor, &#945; and x k (&#8226;) simultaneously to optimize the downstream optimization. Finally, the policies JS-GM, S-SAA-GM, and Oracle-GM each shrink toward the grand-mean anchor, pGM . JS-GM pools according to Theorem 2.1, S-SAA-GM is our Shrunken-SAA Algorithm, and Oracle-GM is the oracle pooling.</p><p>Contrasting the JS policies and the decoupled policies illustrates the value of data pooling in a "generic" fashion. Contrasting the Shrunken-SAA policies and JS policies quantifies the additional benefit of data pooling in an optimization-aware fashion. Contrasting the "Beta" and "Fixed" anchor versions quantifies the value of a good anchor.</p><p>Before presenting details, we summarize our main findings. When N is moderate to large, all methods perform comparably to the full-information solution.</p><p>When N is small to moderate, however, our Shrunken-SAA policies provide a significant benefit over SAA and a substantial benefit over JS variants that do not leverage the optimization structure. This is true even for moderate K (K &#8804; 100) and even when the Nk 's are fixed (violating Assumption 3.1). The value of d has little effect on the performance of Shrunken-SAA; it strongly outperforms decoupling, even as d &#8594; &#8734;. Finally, our GM heuristic has very strong performance, comparable to the Beta variants that optimize the choice of anchor, at a much smaller computational cost.</p><p>We present all results as "% Benefit over SAA"; that is, bigger values are better.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">An Idealized Synthetic Data Set</head><p>We first consider an ideal setting. Specifically, after discretizing demand for each store into d 20 buckets, we set p k to be the (store-level) empirical distribution of demand. We then simulate synthetic data according to Section 2.1 under Assumption 3.1.</p><p>We train each policy and then evaluate its true performance using the p k . We repeat this process 200 times. The left panel of Figure <ref type="figure">5</ref> shows the average results for a subset of the policies (see Table <ref type="table">EC</ref>.1 in Online Appendix E for all policies).</p><p>Shrunken-SAA significantly outperforms decoupling even for K as small as 10. For large K, the benefit is as large as 10%-15%. Both Shrunken-SAA policies converge quickly to their oracle benchmarks. JS policies also outperform the decoupled solutions but by a smaller amount (5%-10%). Shrinking to the grand mean outperforms shrinking to the uniform distribution, since, as observed earlier, the true distributions are far from uniform and have quantiles far from the uniform quantile. The grand-mean policies perform comparably to our Beta policies (see Table EC.1).</p><p>Section E.4 in Online Appendix E presents additional results such as the standard deviation of performance and the amount of pooling for each variant. Overall, Shrunken-SAA methods are less variable and pool more than competitors. In particular, JS variants pool very little because of the demand heterogeneity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Relaxing Assumption 3.1</head><p>We next repeat the experiment of the previous section but now simulate data with Nk 10 for all k and all runs. Results are shown in the second panel of Figure <ref type="figure">5</ref> and in Section E.4 in Online Appendix E. We see the same qualitative features. Specifically, our Shrunken-SAA methods converge to oracle performance, and, even for moderate K, they significantly outperform decoupling. The JS methods offer a much smaller improvement over SAA. Many of the other features with respect to convergence in &#945; and standard deviation of the performance are also qualitatively similar.  These results support our claim that Assumption 3.1 is not crucial to performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Historical Backtest</head><p>We next consider a more realistic setting. We employ repeated random subsampling validation with our data to assess each method: for each store, we select 10 days randomly from the data set for training and an additional 10 days for testing. Since store k may be missing data on these days, we train (respectively, test) with at most 10 points. We use repeated, random subsampling validation instead of five-fold cross-validation in order to limit the number of data points Nk used in each subproblem.</p><p>Figure <ref type="figure">6</ref> shows results with d 20 for a subset of policies. See Table <ref type="table">EC</ref>.3 in Online Appendix E for all policies. Importantly, we see the same features as in our synthetic data experiment: our Shrunken-SAA methods converge to oracle optimality and offer a substantive improvement over SAA for large enough K. They also outperform JS variants that do not leverage the optimization structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Other Experiments with Synthetic and Real Data</head><p>Sections E.6-E.8 in Online Appendix E study the robustness of Shrunken-SAA to the number of support points d, its performance as N &#8594; &#8734;, and compares computationally cheaper variants of the algorithm that substitute two-fold or five-fold cross-validation for the LOO validation step. We omit details for space. Generally, we find that (i) Shrunken-SAA is quite robust to d; (ii) as N increases, Shrunken-SAA retains many of SAA's strong large-sample properties; (iii) Other forms of cross-validation perform quite well and are viable alternatives in computationally limited settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>We introduce the data-pooling phenomenon for stochastic optimization-when simultaneously solving many data-driven stochastic optimization subproblems, pooling data across subproblems may improve performance, even when (1) subproblems are unrelated and (2) data for each subproblem are independent. We propose Shrunken-SAA, a simple algorithm that exploits this phenomenon, and prove that, as the number of subproblems grows large, Shrunken-SAA identifies whether pooling can improve upon decoupling and, if so, the ideal amount to pool, even if the amount of data per subproblem is fixed and small. Empirical evidence further suggests that Shrunken-SAA offers significant benefits for a wide variety of problems. We hope that our work inspires researchers to think of data pooling as an "additional knob" to leverage in data-driven decision-making under uncertainty.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Management Science, Articles inAdvance, pp. 1-21, &#169; 2021 INFORMS   </p></note>
		</body>
		</text>
</TEI>
