<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI</publisher>
				<date when="2024-03-25">2024-03-25</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10505982</idno>
					<idno type="doi">10.1609/aaai.v38i8.28662</idno>
					<title level='j'>Proceedings of the AAAI Conference on Artificial Intelligence</title>
<idno type="ISSN">2159-5399</idno>
<biblScope unit="volume">38</biblScope>
<biblScope unit="issue">8</biblScope>					

					<author>Qi Zhang</author><author>Yi Zhou</author><author>Ashley Prater-Bennette</author><author>Lixin Shen</author><author>Shaofeng Zou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Distributionally robust optimization (DRO) is a powerful framework for training robust models against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss functions, and exclude the practical and challenging case of non-convex loss functions, e.g., neural networks. This paper develops a stochastic algorithm and its performance analysis for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and thus is suitable for large-scale applications. We focus on the uncertainty set defined by the general Cressie-Read family of divergences, which includes the chi^2-divergence as a special case. We prove that our algorithm finds an epsilon-stationary point with lower computational complexity than existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Machine learning algorithms typically employ the approach of Empirical Risk Minimization (ERM), which minimizes the expected loss under the empirical distribution P_0 of the training dataset and assumes that test samples are generated from the same distribution. However, in practice, there usually exists a mismatch between the training and testing distributions for various reasons: for example, in domain adaptation tasks, domains differ from training to testing <ref type="bibr">(Blitzer, McDonald, and Pereira 2006;</ref><ref type="bibr">Daume III and Marcu 2006)</ref>; test samples may be drawn from minority groups that are underrepresented in the training dataset <ref type="bibr">(Grother et al. 2011;</ref><ref type="bibr">Hovy and S&#248;gaard 2015)</ref>; and there might exist potential adversarial attacks <ref type="bibr">(Goodfellow, Shlens, and Szegedy 2014;</ref><ref type="bibr">Madry et al. 2017)</ref>. Such a mismatch may lead to significant performance degradation.</p><p>This challenge has spurred noteworthy efforts on developing the framework of Distributionally Robust Optimization (DRO), e.g., <ref type="bibr">(Ben-Tal et al. 2013;</ref><ref type="bibr">Shapiro 2017;</ref><ref type="bibr">Rahimian and Mehrotra 2019)</ref>. Rather than minimizing the expected loss under one fixed distribution, in DRO one seeks to optimize the expected loss under the worst-case distribution in an uncertainty set of distributions.</p><p>Specifically, DRO aims to solve the following problem:</p><p>min_x sup_{Q ∈ U(P_0)} E_{s∼Q}[ℓ(x; s)], (1)</p><p>where U(P_0) is an uncertainty set of distributions centered at P_0, P_0 is the empirical distribution of the training dataset, ℓ is the loss function, and x is the optimization variable. For example, the uncertainty set can be defined as</p><p>U(P_0) := {Q : D(Q‖P_0) ≤ ρ},</p><p>where D is some distance-like metric, e.g., the Kullback-Leibler (KL) divergence or the χ² divergence, and ρ is the uncertainty level. 
In practice, for ease of implementation and analysis, a relaxed formulation of eq. ( <ref type="formula">1</ref>), which is referred to as the penalized DRO, is usually solved <ref type="bibr">(Levy et al. 2020;</ref><ref type="bibr">Jin et al. 2021;</ref><ref type="bibr">Qi et al. 2021;</ref><ref type="bibr">Sinha et al. 2017)</ref>:</p><p>min_x sup_Q { E_{s∼Q}[ℓ(x; s)] − λ D(Q‖P_0) },</p><p>where λ &gt; 0 is a fixed hyperparameter that needs to be chosen manually. In contrast to the constrained DRO in eq. ( <ref type="formula">1</ref>), a regularization term is added to the objective function to keep the distribution Q close to the distribution P_0, and the hyperparameter λ is chosen manually beforehand to control the tradeoff with minimizing the loss. Compared with the penalized DRO setting, the constrained DRO problem in eq. ( <ref type="formula">1</ref>) requires the distribution Q to be strictly in the uncertainty set, and searches for the optimal solution under the worst-case distribution in the uncertainty set. Therefore, the solution obtained from the constrained DRO is minimax optimal for distributions in the uncertainty set, whereas it is hard to obtain such a guarantee for the penalized DRO relaxation. In this paper, we focus on the challenging constrained DRO problem in eq. ( <ref type="formula">1</ref>).</p><p>Existing studies on constrained DRO are limited to convex loss functions or require additional assumptions <ref type="bibr">(Soma and Yoshida 2020;</ref><ref type="bibr">Hashimoto et al. 2018;</ref><ref type="bibr">Levy et al. 2020;</ref><ref type="bibr">Duchi and Namkoong 2018;</ref><ref type="bibr">Duchi, Glynn, and Namkoong 2021;</ref><ref type="bibr">Qi et al. 2022;</ref><ref type="bibr">Wang, Gao, and Xie 2021)</ref>. Little is understood about the practical case of non-convex loss functions, e.g., neural networks. In this paper, we focus on the constrained DRO problem with non-convex loss.</p><p>DRO problems under different uncertainty sets are fundamentally different. 
As will be discussed later in related work, there is a rich literature on DRO with various uncertainty sets. In this paper, we focus on the uncertainty set defined by the general Cressie-Read family of divergences <ref type="bibr">(Duchi and Namkoong 2018;</ref><ref type="bibr">Jin et al. 2021)</ref>, which includes, e.g., the χ² divergence as a special case (see Section 2 for more details). We also investigate the smoothed conditional value at risk (CVaR) DRO problem.</p><p>More importantly, we focus on the practical yet challenging large-scale scenario, where P_0 is the empirical distribution of N samples and N is very large. In classic stochastic optimization problems, e.g., ERM, it is easy to get an unbiased estimate of the gradient using only a few samples, and therefore the computational complexity at each iteration is independent of the training dataset size. However, in DRO problems, taking the worst-case distribution in the objective makes this challenging. Many existing DRO algorithms incur a complexity that increases linearly (or even worse) with the training dataset size <ref type="bibr">(Duchi and Namkoong 2018;</ref><ref type="bibr">Namkoong and Duchi 2016;</ref><ref type="bibr">Ghosh, Squillante, and Wollega 2018)</ref>, which is not feasible for large-scale applications. In this paper, we design a stochastic algorithm whose computational complexity at each iteration is independent of the training dataset size <ref type="bibr">(Qi et al. 2022;</ref><ref type="bibr">Wang, Gao, and Xie 2021;</ref><ref type="bibr">Levy et al. 2020)</ref>.</p></div>
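The worst-case reweighting in eq. (1) can be made concrete with a small numerical illustration (our own sketch, not code from the paper; all names are ours). For the CVaR uncertainty set U = {q in the simplex : q_i ≤ 1/(μN)} over a finite sample with uniform P_0, the supremum simply puts the maximum allowed mass on the largest losses, i.e., it averages the worst μ-fraction:

```python
import numpy as np

def worst_case_expectation_cvar(losses, mu):
    """Worst-case expected loss over the CVaR uncertainty set
    U = {q in simplex : q_i <= 1/(mu*N)} for uniform P0 on N points.
    The supremum fills the per-sample mass cap from the largest loss down."""
    losses = np.asarray(losses, dtype=float)
    n = len(losses)
    cap = 1.0 / (mu * n)                 # per-sample mass cap
    q = np.zeros(n)
    remaining = 1.0
    for i in np.argsort(losses)[::-1]:   # largest losses first
        q[i] = min(cap, remaining)
        remaining -= q[i]
        if remaining <= 0:
            break
    return float(q @ losses), q

losses = [1.0, 2.0, 3.0, 4.0]
robust, q = worst_case_expectation_cvar(losses, mu=0.5)
# ERM averages to 2.5; the worst-case distribution concentrates on {3, 4}
```

This is the sense in which DRO hedges against distribution shift: high-loss samples (e.g., underrepresented groups) are upweighted up to the cap allowed by the uncertainty set.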
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Challenges and Contributions</head><p>The key challenges and main contributions of this paper are summarized as follows.</p><p>&#8226; For large-scale applications, the number of training samples is large, and therefore directly computing the full gradient is not practical. Nevertheless, as discussed above, it is challenging to obtain an unbiased estimate of the gradient for DRO problems using only a few samples. For the φ-divergence DRO problem, the distributions in the uncertainty set are absolutely continuous w.r.t. the training distribution. Thus the distributions in the uncertainty set can be parameterized by an N-dimensional vector <ref type="bibr">(Namkoong and Duchi 2016)</ref>. Then the DRO problem becomes a min-max problem, and primal-dual algorithms <ref type="bibr">(Rafique et al. 2022;</ref><ref type="bibr">Lin, Jin, and Jordan 2020;</ref><ref type="bibr">Xu et al. 2023</ref>) can be used directly. Subsampling methods in DRO were also studied in <ref type="bibr">(Namkoong and Duchi 2016;</ref><ref type="bibr">Ghosh, Squillante, and Wollega 2018)</ref>. However, all the above studies require a computational complexity linear or even worse in the training dataset size at each iteration and thus are prohibitive in large-scale applications. In <ref type="bibr">(Levy et al. 2020</ref>), an efficient subsampling method was proposed, where the batch size is independent of the training dataset size. However, they only bounded the sampling bias for the χ² and CVaR DRO problems. In this paper, we generalize the bias analysis of <ref type="bibr">(Levy et al. 2020)</ref> to the general Cressie-Read family. We further develop a Frank-Wolfe update on the dual variables in order to bound the gap between the objective and its optimal value given the optimization variable x and the biased estimate. &#8226; The second challenge is due to the non-convex loss function. 
Existing studies for the Cressie-Read divergence family <ref type="bibr">(Duchi and Namkoong 2018;</ref><ref type="bibr">Levy et al. 2020</ref>) are limited to the case of convex loss functions, and their approach does not generalize to the non-convex case. The key difficulty is that the subgradient of the objective function cannot be obtained via the subdifferential for non-convex loss functions. Instead of explicitly calculating the worst-case distribution as in <ref type="bibr">(Duchi and Namkoong 2018;</ref><ref type="bibr">Levy et al. 2020)</ref>, we propose to design an algorithm for the dual problem, which optimizes the objective under a known distribution. Thus the gradient of the objective can be obtained efficiently. &#8226; The third challenge is that the dual form of constrained DRO is neither smooth nor Lipschitz, making the convergence analysis difficult. Existing studies, e.g., <ref type="bibr">(Wang, Gao, and Xie 2021)</ref>, assume that the optimal dual variable is bounded away from zero, i.e., λ* ≥ λ_0 &gt; 0 for some λ_0 &gt; 0, so that it is sufficient to consider λ ≥ λ_0. However, this assumption may not necessarily hold, as shown in <ref type="bibr">(Wang, Gao, and Xie 2021;</ref><ref type="bibr">Hu and Hong 2013)</ref>. In this paper, we generalize the idea in <ref type="bibr">(Qi et al. 2022;</ref><ref type="bibr">Levy et al. 2020)</ref> to the general Cressie-Read divergence family. We design an approximation of the original problem and show that it is smooth and Lipschitz. The approximation error can be made arbitrarily small, so that the solution to the approximation is still a good solution to the original problem. We prove strong duality for the approximated problem. We add a regularizer to the objective while keeping the hard constraint. In this way, we can guarantee that the dual variable has a positive lower bound. 
Moreover, our strong duality holds for any φ-divergence DRO problem.</p><p>&#8226; We design a novel algorithm to solve the approximated problem and prove that it converges to a stationary point of the constrained DRO problem. The general Proximal Gradient Descent algorithm <ref type="bibr">(Ghadimi, Lan, and Zhang 2016)</ref> can be used to solve this approximated problem directly. However, it treats the objective as non-convex in all parameters and does not provide a tight bound on the bias due to subsampling. We take advantage of the fact that the objective function is convex in (λ, η), and thus the bias due to subsampling can be bounded more tightly. Our proposed algorithm converges to a stationary point faster than existing methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Related Work</head><p>Various Uncertainty Sets. φ-divergence DRO problems <ref type="bibr">(Ali and Silvey 1966;</ref><ref type="bibr">Csisz&#225;r 1967)</ref> have been widely studied, for example, CVaR in <ref type="bibr">(Rockafellar, Uryasev et al. 2000;</ref><ref type="bibr">Soma and Yoshida 2020;</ref><ref type="bibr">Curi et al. 2020;</ref><ref type="bibr">Tamar, Glassner, and Mannor 2015)</ref>, the χ²-divergence in <ref type="bibr">(Hashimoto et al. 2018;</ref><ref type="bibr">Ghosh, Squillante, and Wollega 2018;</ref><ref type="bibr">Levy et al. 2020)</ref>, the KL-divergence in <ref type="bibr">(Qi et al. 2021, 2022;</ref><ref type="bibr">Hu and Hong 2013)</ref>, and the Sinkhorn distance <ref type="bibr">(Wang, Gao, and Xie 2021)</ref>, a variant of the Wasserstein distance based on entropic regularization. However, the above studies are for specific divergence functions and cannot be extended directly to the general Cressie-Read divergence family.</p><p>Penalized DRO. The general φ-divergence DRO problem was studied in <ref type="bibr">(Jin et al. 2021)</ref>, where the proposed algorithm works for any divergence function with a smooth conjugate.</p><p>The authors also designed a smoothed version of the CVaR problem and showed that their method converges to a stationary point. However, their method is for the penalized formulation and does not generalize to constrained DRO. In this paper, we focus on the challenging constrained DRO, the solution of which is minimax optimal over the uncertainty set. Our proposed algorithm can also be applied to solve the smoothed CVaR problem in the constrained setting.</p><p>Constrained DRO With Convex Loss. The general φ-divergence constrained DRO problem was studied in <ref type="bibr">(Namkoong and Duchi 2016)</ref>. 
Instead of optimizing the dual form, the authors treat the worst-case distribution as an N-dimensional vector and implement a stochastic primal-dual method to solve the min-max problem. However, the computational complexity at each iteration is linear in the number of training samples, so the method cannot be used in large-scale applications. The same problem was further studied in <ref type="bibr">(Duchi, Glynn, and Namkoong 2021)</ref>. The authors pointed out that minimizing constrained DRO with a φ-divergence is equivalent to adding variance regularization to the Empirical Risk Minimization (ERM) objective. The general Cressie-Read divergence family DRO problem was studied in <ref type="bibr">(Duchi and Namkoong 2018)</ref>, where the basic idea is to first calculate the worst-case distribution for the constrained DRO and then use the subdifferential to get the subgradient. Furthermore, the χ² and CVaR DRO problems were studied in <ref type="bibr">(Levy et al. 2020)</ref>. Compared with the method in <ref type="bibr">(Duchi and Namkoong 2018)</ref>, they calculate the worst-case distribution for the penalized DRO and then optimize both the Lagrange multiplier and the loss function. This approach converges to the optimal solution with reduced complexity. Their method can be extended to the general Cressie-Read divergence family. However, all the above papers are limited to the case of convex loss functions. To the best of our knowledge, our work is the first on large-scale non-convex constrained DRO with the general Cressie-Read divergence family. We note that KL DRO was studied in <ref type="bibr">(Qi et al. 2022)</ref>, which, however, incurs exponential computational complexity. We achieve polynomial computational complexity for the Cressie-Read divergence family.</p><p>2 Preliminaries and Problem Model</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Notations</head><p>Let s be a sample in S and P_0 be the distribution on the points {s_i}, i = 1, ..., N, where N is the size of the support. Denote by Δ_n := {p ∈ R^n : Σ_{i=1}^n p_i = 1, p_i ≥ 0} the n-dimensional probability simplex. Denote by x ∈ R^d the optimization variable. We denote by 1_X(x) the indicator function, where 1_X(x) = 0 if x ∈ X, and otherwise 1_X(x) = +∞. Let ℓ: R^d × S → R be a non-convex loss function. Let ‖·‖ be the Euclidean norm and (t)_+ := max{t, 0} be the positive part of t ∈ R. Denote by ∇_x the gradient with respect to x. For a function f :</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Assumptions</head><p>In this paper, we make the following standard assumptions that are commonly used in the DRO literature <ref type="bibr">(Duchi and Namkoong 2018;</ref><ref type="bibr">Levy et al. 2020;</ref><ref type="bibr">Qi et al. 2021, 2022;</ref><ref type="bibr">Wang, Gao, and Xie 2021;</ref><ref type="bibr">Soma and Yoshida 2020)</ref>:</p><p>&#8226; The non-convex loss function is bounded: 0 ≤ ℓ(x; s) ≤ B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">DRO Objective and Its Dual Form</head><p>In empirical risk minimization (ERM), the goal is to solve</p><p>min_x E_{s∼P_0}[ℓ(x; s)],</p><p>where the objective function is the expectation of the loss function with respect to the training distribution P_0. To address the distributional mismatch between training data and testing data, the formulation of Distributionally Robust Optimization (DRO) <ref type="bibr">(Goodfellow, Shlens, and Szegedy 2014;</ref><ref type="bibr">Madry et al. 2017</ref>; Rahimian and Mehrotra 2019) was developed, where the goal is to minimize the expected loss with respect to the worst-case distribution in an uncertainty set U(P_0). The φ-divergence between Q and P_0 is D_φ(Q‖P_0) := ∫ φ(dQ/dP_0) dP_0, where φ is a non-negative convex function such that φ(1) = 0 and φ(t) = +∞ for any t &lt; 0. Then let the uncertainty set be U(P_0) := {Q : D_φ(Q‖P_0) ≤ ρ}, where ρ &gt; 0 is the radius of the uncertainty set.</p><p>In this paper, we study the general Cressie-Read family of φ-divergences <ref type="bibr">(Cressie and Read 1984;</ref><ref type="bibr">Van Erven and Harremos 2014)</ref>, where</p><p>φ_k(t) = (t^k − kt + k − 1) / (k(k − 1)).</p><p>This family includes as special cases the χ²-divergence (k = 2) and the KL divergence (k → 1). When k &gt; 2, the conjugate function of φ_k (which will be introduced later) is not smooth, thus the problem becomes hard to solve even in the penalized formulation <ref type="bibr">(Jin et al. 2021)</ref>. In this paper, we focus on k ∈ (1, 2].</p><p>Solving ( <ref type="formula">6</ref>) directly is challenging due to the sup over Q. In (Namkoong and Duchi 2016), a finite-dimensional vector q was used to parameterize the distributions in the uncertainty set, since Q ≪ P_0 for φ-divergences. Then the DRO problem becomes a convex-concave min-max problem. This method can be extended to the case of non-convex loss functions by applying algorithms for non-convex-concave min-max problems <ref type="bibr">(Rafique et al. 
2022;</ref><ref type="bibr">Lin, Jin, and Jordan 2020;</ref><ref type="bibr">Xu et al. 2023</ref>). However, the dimension of the distribution in the uncertainty set is equal to the number of training samples. Thus, the computational complexity at each iteration is linear in the sample size and is prohibitive in large-scale applications.</p><p>To obtain a complexity independent of the sample size, one alternative is to use the dual. By duality, we can show that the DRO objective (6) can be equivalently written as <ref type="bibr">(Levy et al. 2020;</ref><ref type="bibr">Shapiro 2017</ref>)</p><p>min_x min_{λ≥0, η} { λ E_{s∼P_0}[φ*_k((ℓ(x; s) − η)/λ)] + λρ + η },</p><p>where φ*_k(t) := sup_{t′≥0} {t t′ − φ_k(t′)} is the conjugate function of φ_k. In this way, the optimization problem under an unknown distribution is rewritten as one under a known distribution. The subsampling method can then be used, which leads to a complexity independent of the sample size (as will be introduced later). For the Cressie-Read family in (5), the corresponding conjugate function family is</p><p>φ*_k(t) = (1/k) [ ((k − 1)t + 1)_+^{k_*} − 1 ], where k_* = k/(k − 1).</p><p>After a rescaling of η, the corresponding objective is F(x; λ, η), and the goal is to solve</p><p>min_x min_{λ≥0, η} F(x; λ, η),</p><p>where F(x; λ, η) is the dual objective above. Therefore, we reformulate the DRO problem as one that minimizes an objective function under a known distribution, where the subsampling method can be used to reduce the complexity.</p></div>
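As a numerical sanity check of this duality, the following sketch (our own illustration, not code from the paper; the grid search stands in for an exact minimization and all names are ours) evaluates the dual objective for the χ² case (k = 2) and confirms that the resulting robust loss lies between the empirical mean and the maximum loss, as weak duality and the definition of the sup require:

```python
import numpy as np

def cr_conjugate(t, k):
    """Cressie-Read conjugate phi*_k(t) = (1/k)[((k-1)t + 1)_+^{k_*} - 1]."""
    ks = k / (k - 1.0)
    return (np.maximum((k - 1.0) * t + 1.0, 0.0) ** ks - 1.0) / k

def dual_objective(losses, lam, eta, rho, k):
    """F(lam, eta) = lam * E[phi*_k((l - eta)/lam)] + lam*rho + eta."""
    losses = np.asarray(losses, dtype=float)
    return lam * np.mean(cr_conjugate((losses - eta) / lam, k)) + lam * rho + eta

def robust_loss_via_dual(losses, rho, k=2.0):
    """Approximate sup_{Q: D_k(Q||P0)<=rho} E_Q[l] by grid search over (lam, eta)."""
    lams = np.geomspace(1e-3, 50.0, 200)
    etas = np.linspace(min(losses) - 1.0, max(losses), 200)
    return min(dual_objective(losses, l, e, rho, k) for l in lams for e in etas)

losses = [1.0, 2.0, 3.0, 4.0]
val = robust_loss_via_dual(losses, rho=0.5)
# val sandwiched between the empirical mean (2.5) and the max loss (4.0)
```

The grid minimum upper-bounds the true infimum, which equals the worst-case expectation; hence it cannot fall below the empirical mean, and taking λ small with η near the maximum loss shows it cannot exceed the maximum loss by much.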
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Analysis of Constrained DRO</head><p>In this section, we analyze constrained DRO problems under Cressie-Read family divergence uncertainty sets with general smooth non-convex loss functions. We first discuss the challenges that appear in constrained formulations, and then present how to construct the corresponding approximated problem in order to overcome these challenges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Smooth and Lipschitz Approximation</head><p>For λ ∈ [0, +∞), η ∈ R, the objective function F(x; λ, η) is neither smooth nor Lipschitz. Thus it is difficult to implement gradient-based algorithms. In the following, we construct an approximation of the original problem so that the objective function F(x; λ, η) becomes smooth and Lipschitz, by constraining both λ and η to bounded intervals.</p><p>Denote ω = (k(k − 1)ρ + 1)^{1/k}. Since the loss function is bounded such that 0 ≤ ℓ ≤ B, we can show that there exists an upper bound λ̄, which only depends on k, ρ and B, such that the optimal value λ* ≤ λ̄. In this paper, we do not assume that λ* ≥ λ_0 &gt; 0 as in <ref type="bibr">(Wang, Gao, and Xie 2021)</ref>. Instead, we consider an approximation with λ ∈ [λ_0, λ̄], and show that the difference between the original problem and the approximation can be bounded. We can also bound the corresponding optimal η*. The proof can be found in Appendix A. The challenge lies in that the value of η can be negative. Thus, given this η, the optimal value of λ can be quite large, and it is then hard to upper bound λ. In our proof, we change the objective to a function that depends only on η and find a lower bound on η. Based on this lower bound, we resolve this challenge and further obtain the bound on λ.</p><p>We show that the difference between the original problem and the approximation can be bounded in the following lemma.</p><p>The proof can be found in Appendix B. Note that in the proof of this lemma, we establish strong duality for the formulation in which both the hard constraint and the regularizer are kept. This is different from the approach in Section 3.2 of <ref type="bibr">(Shapiro 2017)</ref>. 
Note that this strong duality holds for any φ-divergence DRO problem.</p><p>Lemma 1 demonstrates that the non-smooth objective function can be approximated by a smooth objective function. A smaller λ_0 makes the gap smaller but the function "less smooth".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Convexity and Smoothness on Parameters</head><p>The advantage of our approximated problem is that the function is smooth in all of x, λ, and η. Moreover, we find that the objective function is convex in λ and η even though the loss function is non-convex in x. Lemma 2. Define z = (λ, η) ∈ M, where M = {(λ, η) : </p><p>The proof can be found in Appendix C. Note that the first-order gradient of the objective function is non-differentiable at some points when k_* = 2. Therefore, we discuss two cases: k_* &gt; 2 and k_* = 2. In the first case, we can obtain the Hessian matrix of the objective. In the second case, we show smoothness and convexity directly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Mini-Batch Algorithm</head><p>Existing constrained stochastic algorithms for general non-convex functions <ref type="bibr">(Ghadimi, Lan, and Zhang 2016)</ref> can be used to solve the approximated problem directly. However, their method optimizes y = (x, λ, η) as a whole. The objective function is non-convex in y, and the computational complexity to reach an ε-stationary point is O(ε^{−3k_*−5}).</p><p>In the previous section, we showed that the objective is L_x-smooth in x and L_z-smooth in z. If we optimize all the parameters together, we need to implement non-convex algorithms to optimize a smooth function with a large smoothness constant, which is not computationally efficient. However, if we optimize x and z separately, then although L_z &gt; L_x, which requires more resources to optimize z, the convexity in z makes it faster to converge to the optimal value of z.</p><p>This motivates us to consider a stronger convergence criterion. Instead of finding an ε-stationary point of F(y), we find (x, λ, η) such that</p><p>We then provide our Stochastic gradient and Frank-Wolfe DRO algorithm (SFK-DRO), which optimizes x and z separately (see Algorithm 1). Define D = max_{z_1, z_2 ∈ M} ‖z_1 − z_2‖.</p><p>The convergence rate is provided in the following theorem.</p><p>Theorem 1. With a mini-batch size independent of N, O(ε^{−2}) iterations are needed to guarantee a stationary point (x_{t_0+1}; z_{t_0}) in expectation:</p><p>The detailed proof can be found in Appendix D and a proof sketch will be provided later. Before that, we introduce a lemma for our subsampling method. Via this lemma, we can show that the complexity is independent of the sample size and thus is suitable for our large-scale setting. When we optimize z, an estimator</p><p>Though the estimator is unbiased, in our Frank-Wolfe update process <ref type="bibr">(Jaggi 2013;</ref><ref type="bibr">Frank, Wolfe et al. 
1956</ref>; Lacoste-Julien 2016) we need to estimate min F(x; z) via E[min f_z(x; z)]. Obviously, the expectation of the minimum is not equal to the minimum of the expectation, so this is a biased estimator. In the following lemma, we show that this gap can be bounded by a decreasing function of the batch size n_z.</p><p>Lemma 3. For any bounded loss function</p><p>The detailed proof can be found in Appendix E. Note that (Levy et al. 2020) only shows this lemma for k_* = 2, and we extend the result to k_* &gt; 2. This lemma shows that the gap is of order O(n_z^{−1/k_*}) and is independent of the total number of samples.</p></div>
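To make the alternating scheme concrete, here is a deliberately simplified, deterministic full-batch sketch of the two updates on a toy scalar problem with the χ² conjugate (k = 2). This is our own illustration, not the paper's Algorithm 1, which uses mini-batch stochastic gradients; all variable names, the toy loss, and the box bounds are our assumptions:

```python
import numpy as np

S = np.array([0.0, 1.0, 2.0])                  # toy dataset
RHO, LAM_BOX, ETA_BOX = 1.0, (0.5, 5.0), (0.0, 10.0)

def loss(x, s):  return (x - s) ** 2           # toy loss, bounded on the region visited
def dloss(x, s): return 2.0 * (x - s)

def F(x, lam, eta):
    # chi^2 (k = 2) conjugate: phi*(t) = ((t+1)_+^2 - 1)/2
    t = (loss(x, S) - eta) / lam
    return lam * np.mean((np.maximum(t + 1, 0) ** 2 - 1) / 2) + lam * RHO + eta

def grads(x, lam, eta):
    l = loss(x, S); t = (l - eta) / lam
    a = np.maximum(t + 1, 0)                   # phi*'(t) = (t+1)_+
    gx   = np.mean(a * dloss(x, S))
    glam = np.mean((a ** 2 - 1) / 2) - np.mean(a * t) + RHO
    geta = 1.0 - np.mean(a)
    return gx, glam, geta

x, lam, eta, alpha = 5.0, 1.0, 0.0, 1e-3
f0 = F(x, lam, eta)
for t in range(200):
    gx, glam, geta = grads(x, lam, eta)
    x -= alpha * gx                            # gradient step on x
    # Frank-Wolfe step on z = (lam, eta) over the box M:
    # the linear minimizer over a box is the vertex chosen coordinate-wise
    e = (LAM_BOX[0] if glam > 0 else LAM_BOX[1],
         ETA_BOX[0] if geta > 0 else ETA_BOX[1])
    gamma = 2.0 / (t + 2.0)                    # standard Frank-Wolfe step size
    lam += gamma * (e[0] - lam)
    eta += gamma * (e[1] - eta)
f1 = F(x, lam, eta)
```

Because each Frank-Wolfe step moves z to a convex combination of feasible points, (λ, η) never leaves M, so no projection is needed; this is why the method pairs naturally with the box-constrained dual variables.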
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Proof Sketch of Theorem 1</head><p>We use a stochastic gradient descent method <ref type="bibr">(Moulines and Bach 2011;</ref><ref type="bibr">Gower et al. 2019;</ref><ref type="bibr">Robbins and Monro 1951)</ref> to update x. Since the objective function is L_x-smooth in x, if the step size α ≤ 1/(2L_x) we have that:</p><p>where</p><p>and we can show that</p><p>Since z ∈ M, instead of the stochastic gradient descent method, we employ the Frank-Wolfe method <ref type="bibr">(Frank, Wolfe et al. 1956</ref>) to update z. Define e_t = argmin_{e∈M} ⟨e, ∇_z f_z(x_{t+1}; z_t)⟩ and g_t = ⟨e_t − z_t, ∇_z f_z(x_{t+1}; z_t)⟩ <ref type="bibr">(Jaggi 2013)</ref>. We can show that g_t/C ≲ O(λ_0), thus for small λ_0 we have g_t/C ≤ 1. Then, due to the fact that the objective is L_z-smooth in z (( <ref type="formula">9</ref>) of (Lacoste-Julien 2016)), we have that</p><p>By recursively adding ( <ref type="formula">9</ref>) and ( <ref type="formula">11</ref>), we have that</p><p>Since L_x and L_z scale at most polynomially in 1/λ_0, for small λ_0 and T ≥ Δ/(Cε²), where Δ = F(x_1; z_1) − min_{x,z∈M} F(x; z), for some t ∈ [1, T] we have</p><p>We choose (x_{t+1}, z_t) as our output and we need to bound</p><p>By Lemma 3, we pick</p><p>By Lemma 1, when λ_0 = ε/(8ρ), we have</p><p>Thus we have</p><p>which completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Smoothed CVaR</head><p>Our algorithm can also solve other DRO problems efficiently, for example, the smoothed CVaR proposed in <ref type="bibr">(Jin et al. 2021)</ref>. The CVaR DRO is an important φ-divergence DRO problem, where φ(t) = 0 if 0 ≤ t ≤ 1/μ and φ(t) = +∞ otherwise, and 0 &lt; μ &lt; 1 is some constant. The dual expression of CVaR can be written as</p><p>min_η { η + (1/μ) E_{s∼P_0}[(ℓ(x; s) − η)_+] }.</p><p>The dual of CVaR is non-differentiable, which is undesirable from an optimization viewpoint. To address this, <ref type="bibr">(Jin et al. 2021)</ref> proposed a new divergence function, which can be seen as a smoothed version of CVaR. Their experimental results show that optimizing the smoothed CVaR is much easier. However, the method of <ref type="bibr">(Jin et al. 2021</ref>) only works for the penalized formulation of DRO. We show that our method can solve the constrained smoothed CVaR.</p><p>Here, the divergence function is</p><p>The corresponding conjugate function is</p><p>The objective function is then written as</p><p>We can show that there exist upper bounds for the optimal values λ* and η*. There exists λ̄ &gt; 0 depending only on μ, B and ρ such that λ* ∈ [0, λ̄] and η* ∈ [0, B]. The proof can be found in Appendix F. This objective function is non-smooth when λ → 0. Therefore, we take an approach similar to the one in Section 3.1 to approximate the original problem with λ ∈ [λ_0, λ̄]. We bound the difference in the following lemma.</p><p>The proof is similar to that of Lemma 1 and thus is omitted here. In addition, we can show that F_s(x; z) is L′_z-smooth and convex in z. Similar to eq. ( <ref type="formula">42</ref>) and Remark 1 in (Levy et al. 2020), we can prove that min_{z∈M} F_s(x_{t+1}; z) − E[min_{z∈M} f_s(x_{t+1}; z)] ≲ O(n_s^{−0.5}). We then use Algorithm 1 directly, and the complexity to reach an ε-stationary point is O(ε^{−7}). 
The detailed proof can be found in Appendix F. </p></div>
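The Rockafellar-Uryasev dual of the (unsmoothed) CVaR above can be checked numerically: its minimum over η coincides with the average of the worst μ-fraction of losses, and the minimizer can be taken at one of the sample losses since the objective is piecewise linear in η. A small self-contained check (our own illustration; function names are ours):

```python
import numpy as np

def cvar_direct(losses, mu):
    """CVaR at level mu: mean of the worst mu-fraction of losses
    (assumes mu * len(losses) is an integer for this toy check)."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    m = int(round(mu * len(losses)))
    return float(np.mean(losses[:m]))

def cvar_dual(losses, mu):
    """Dual form min_eta { eta + E[(l - eta)_+] / mu }; the objective is
    piecewise linear in eta, so it suffices to scan the sample losses."""
    losses = np.asarray(losses, dtype=float)
    best = np.inf
    for eta in losses:
        val = eta + np.mean(np.maximum(losses - eta, 0.0)) / mu
        best = min(best, val)
    return float(best)

losses = [1.0, 2.0, 3.0, 4.0]
```

The kink of (ℓ − η)_+ at η = ℓ is exactly the non-differentiability that the smoothed CVaR of (Jin et al. 2021) removes.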
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Numerical Results</head><p>In this section, we verify our theoretical results by solving an imbalanced classification problem. In the experiment, we consider a non-convex loss function, and k is set to 2 for the Cressie-Read family. We show that 1) to optimize the same dual objective function, our proposed algorithm converges faster than general Proximal Gradient Descent (PGD) <ref type="bibr">(Ghadimi, Lan, and Zhang 2016)</ref>;</p><p>2) the performance of the proposed algorithm for the constrained DRO problem outperforms or is close to that of the penalized DRO with respect to the worst classes. Both of them outperform the baseline. Tasks. We conduct experiments on the imbalanced CIFAR-10 dataset, following the experimental setting in <ref type="bibr">(Jin et al. 2021;</ref><ref type="bibr">Chou et al. 2020)</ref>. The original CIFAR-10 training dataset consists of 10 classes, each of which has 5000 images. We randomly select training samples from the original set for each class with the following sampling ratios: {0.804, 0.543, 0.997, 0.593, 0.390, 0.285, 0.959, 0.806, 0.967, 0.660}. We keep the test dataset unchanged. Models. We train the standard AlexNet model of <ref type="bibr">(Krizhevsky, Sutskever, and Hinton 2012)</ref> with the standard cross-entropy (CE) loss. For the comparison of convergence rate, we optimize the same dual objective with the PGD algorithm in <ref type="bibr">(Ghadimi, Lan, and Zhang 2016)</ref>. To compare robustness, we optimize the ERM via vanilla SGD. In addition, we propose an algorithm PAN-DRO, which fixes λ and only optimizes η and the neural network; it thus obtains the solution to the penalized DRO problem. Training Details. We set λ_1 = 1, η_1 = 0, λ_0 = 0.1, η̄ = 10, and the upper bounds λ̄ = 10, B = 10. To achieve a faster optimization rate, we set the learning rate α = 0.01 before the first 40 epochs and α = 0.001 after. 
The mini-batch size is chosen to be 128. All results are moving-averaged with a window of size 5. The simulations are repeated 4 times. Results. In Figure <ref type="figure">1</ref>, we plot the value of the CE loss under different algorithms through the training process. It can be seen that, to optimize the same dual objective function with the same learning rate, the PGD algorithm converges more slowly than our proposed DRO algorithms, which matches our theoretical results. Moreover, compared with ERM, the DRO algorithms have higher training losses but lower test losses, which demonstrates that they are robust.</p><p>We also provide the test accuracy of the trained models in Table 1. For classes 3, 4, and 5, the accuracies are the lowest due to the limited samples. For these classes, the performance of our SFK-DRO algorithm for the constrained DRO is better than or close to that of PAN-DRO for the penalized DRO. Both DRO algorithms outperform the baseline.</p></div>
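The imbalanced-dataset construction described above can be sketched as follows (our own code, run here on synthetic labels rather than CIFAR-10; `subsample_imbalanced` and the 100-per-class stand-in are our assumptions):

```python
import numpy as np

def subsample_imbalanced(labels, ratios, seed=0):
    """Subsample per-class indices with the given keep ratios, mimicking
    the imbalanced-CIFAR-10 construction: for each class c, keep a random
    round(ratios[c] * n_c) of its n_c samples, without replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for c, r in enumerate(ratios):
        idx = np.flatnonzero(labels == c)
        keep.append(rng.choice(idx, size=int(round(r * len(idx))), replace=False))
    return np.concatenate(keep)

# synthetic stand-in: 10 classes x 100 samples (CIFAR-10 has 5000 per class)
labels = np.repeat(np.arange(10), 100)
ratios = [0.804, 0.543, 0.997, 0.593, 0.390, 0.285, 0.959, 0.806, 0.967, 0.660]
idx = subsample_imbalanced(labels, ratios)
```

Classes with small keep ratios (here classes 4 and 5) become the underrepresented groups on which the DRO-trained models are evaluated.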
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this paper, we developed the first stochastic algorithm in the literature for large-scale non-convex stochastic constrained DRO problems, with theoretical convergence and complexity guarantees. We developed a smooth and Lipschitz approximation of the original problem with bounded approximation error. Compared with existing algorithms, the proposed algorithm has an improved convergence rate. The computational complexity at each iteration is independent of the size of the training dataset, and thus our algorithm is applicable to large-scale applications. Our results hold for the general family of Cressie-Read divergences.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)</p></note>
		</body>
		</text>
</TEI>
