<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Density-Convoluted Support Vector Machines for High-Dimensional Classification</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10438240</idno>
					<idno type="doi">10.1109/TIT.2022.3222767</idno>
					<title level='j'>IEEE Transactions on Information Theory</title>
<idno>0018-9448</idno>
<biblScope unit="volume">69</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Boxiang Wang</author><author>Le Zhou</author><author>Yuwen Gu</author><author>Hui Zou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The support vector machine (SVM) is a popular classification method which enjoys good performance in many real applications. The SVM can be viewed as a penalized minimization problem in which the objective function is the expectation of hinge loss function with respect to the standard non-smooth empirical measure corresponding to the true underlying measure. We further extend this viewpoint and propose a smoothed SVM by substituting a kernel density estimator for the measure in the expectation calculation. The resulting method is called density convoluted support vector machine (DCSVM). We argue that the DCSVM is particularly more interesting than the standard SVM in the context of high-dimensional classification. We systematically study the rate of convergence of the elastic-net penalized DCSVM and prove it has order Op( s log p n ) under general random design setting. We further develop novel efficient algorithm for computing elastic-net penalized DCSVM. Simulation studies and 8 benchmark datasets are used to demonstrate the superior classification performance of elastic-net DCSVM over other competitors, and it is demonstrated in these numerical studies that the computation of DCSVM can be more than 100 times faster than that of the SVM.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Due to the advanced technology for data collection over the past decades, there has been a surge of data complexity in many research fields such as genomics, genetics, and finance, among others. Consequently, it is very common for the number of predictors in the dataset to be far larger than the number of observations <ref type="bibr">(Donoho et al., 2000)</ref>.</p><p>For example, in genomics it is crucial to build a classifier for the purpose of disease diagnosis, with thousands of candidate genes at hand but only tens of instances available for study. Such high dimensionality in data makes Le Zhou is with the School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA (email: zhou0819@umn.edu).</p><p>Boxiang Wang is with the Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA (email: boxiangwang@uiowa.edu).</p><p>Yuwen Gu is with the Department of Statistics, University of Connecticut, Storrs, CT 06269, USA (email: yuwen.gu@uconn.edu).</p><p>Hui Zou is with the School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA (email: zouxx019@umn.edu). Zou is supported in part by <ref type="bibr">NSF grants DMS 1915842 and 2015120.</ref> traditional classification methods infeasible and poses new challenges from both theoretical and computational perspectives.</p><p>One method for performing high dimensional classification is the penalized large margin classifier. The standard support vector machine (SVM), initially proposed and investigated in <ref type="bibr">Boser et al. (1992)</ref> and <ref type="bibr">Vapnik (1995)</ref>, has an objective equal to hinge loss plus an 2 penalty. It is also referred to as 2 -norm SVM. When the dimension greatly exceeds the sample size and there are many noisy features in the predictor set, it has been shown that it is more beneficial to use a sparse penalty such as the 1 norm penalty (a.k.a. the lasso) to replace the 2 norm penalty in order to perform classification and variable selection simultaneously in high dimensional setting <ref type="bibr">(Zhu et al., 2003)</ref>.</p><p>Consider the 1 norm SVM for example. It can be written as</p><p>where L(u) = (1 -u) + is the hinge loss. Just like in lasso regression, the 1 penalty induces sparsity in the solution and is thus capable of removing irrelevant features. More recently, <ref type="bibr">Peng et al. (2016)</ref> investigated the rate of convergence of the 1 -norm SVM and an error bound of O( s log p n ) was established in their paper. The sparse penalized SVM can be computationally intensive especially when the number of predictors is huge in the dataset, owing to the non-differentiable loss function part. It is known that penalized problem in high dimensions with a smooth loss function can be efficiently computed by cyclical coordinate descent algorithm <ref type="bibr">(Friedman et al., 2010)</ref>. Nevertheless, the SVM is based on the non-differentiable hinge loss, which means that there is no convergence guarantee if one uses cyclical coordinate descent to solve the SVM. In principle, coordinate descent may not give the right solution due to the non-differentiability of the objective function <ref type="bibr">(Luo and Tseng, 1992;</ref><ref type="bibr">Tseng, 2001)</ref>. A similar problem under regression context is the quantile regression, in which the check loss is not differentiable <ref type="bibr">(Fan et al., 2020)</ref>. The typical method of solving quantile regression is the interior point algorithm. Since 1 -norm SVM can be transformed into linear programming, one may also consider interior point algorithm for solving it.</p><p>However, interior point algorithm may not scale well with high dimensional input and thus is not suitable for solving SVM in high dimensions.</p><p>Recently, <ref type="bibr">Fernandes et al. (2021)</ref> studied an interesting smoothing technique for solving quantile regression with statistical guarantees. Later, <ref type="bibr">Tan et al. (2021)</ref> further studied the smoothing quantile regression under high dimensional settings and showed that the statistical property of quantile regression is maintained after smoothing.</p><p>Motivated by their work, we develop a smooth version of SVM from statistical perspective, as opposed to trying to solve it exactly. Consider the first term in (I.1)</p><p>which is non-smooth. If we could replace it by some smooth loss such that the resulting estimator has nice theoretical properties, then we should focus on solving the smooth problem instead of the original problem. In fact, one may view (I.2) as the expectation of the hinge loss function with respect to the empirical measure assigning 1 n probability mass to each y i (x T i &#946; + &#946; 0 ), i = 1, . . . , n. The empirical measure is viewed as an estimator for the true distribution of the random variable y(x T &#946; + &#946; 0 ). Clearly, if we estimate the true distribution by using a smoothed kernel density estimator, then we can take the expectation of the hinge loss function with respect to the distribution determined by the smoothed kernel density estimator. This leads us to a new objective function</p><p>which we use to replace (I.2). Here h is the bandwidth of kernel density estimator and is used to index the new classifier. The resulting estimator is named as density convoluted support vector machine (DCSVM), since the kernel density estimator has a convolution interpretation. We study the following general form of penalized DCSVM in high dimensions,</p><p>The resulting estimator is called elastic-net DCSVM, which involves both 1 -DCSVM and 2 -DCSVM as special cases. By its convexity and smoothness, elastic-net DCSVM can be efficiently solved by using the generalized coordinate descent algorithm <ref type="bibr">(Yang and Zou, 2013)</ref>.</p><p>In this paper, we first study the theoretical properties of the elastic-net DCSVM. We show that the convergence rate of the elastic-net DCSVM is O p ( s log p n ) under the general random design setting. Furthermore, we develop novel efficient algorithm for computing elastic-net DCSVM. We use simulation studies and 8 benchmark datasets to demonstrate that elastic-net DCSVM delivers superior classification performance over its competitors, and the computational speed of DCSVM can be two orders of magnitude faster than that of SVM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. DENSITY-CONVOLUTED SVM</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Notation and definitions</head><p>We first introduce some notation that is used throughout the paper. For an arbitrary index set A &#8834; {1, . . . , p}, any vector c = (c 1 , . . . , c p ) and any n &#215; p matrix U, let c A = (c i , i &#8712; A), and let U A be the submatrix with columns of U whose indices are in A. The complement of an index set A is denoted as A c = {1, . . . , p} \ A.</p><p>For any finite set B, let |B| be the number of elements in B. For a vector c &#8712; R p and q &#8712; [1, &#8734;), let c q = ( p j=1 |c j | q ) 1 q be its q norm, let c &#8734; (or c max ) = max j |c j | be its &#8734; norm, and let c min = min j |c j | be its minimum absolute value. For a matrix M, let &#955; min (M) and &#955; max (M) be its eigenvalue with smallest absolute value and largest absolute value, respectively. This is the common notation for eigenvalues of a matrix, and &#955; min , &#955; max should not be confused with the penalization parameter used in a penalty function. For any matrix G, let G = &#955; max (G T G) be its spectral norm. In particular, for a vector c, c = c 2 . For a, b &#8712; R, let a &#8743; b = min{a, b} and a &#8744; b = max{a, b}. For a sequence {a n } and another nonnegative sequence {b n }, we write </p><p>For two sequences of random variables Z n and Z n , we write</p><p>, where x i = (x i1 , . . . , x ip ) T &#8712; R p are predictors and y i &#8712; {-1, 1} is the class label for the ith subject. We use X = (X 1 , . . . , X p ) to denote the design matrix, where X j = (x 1j , . . . , x nj ) T contains observations for the jth variable, and use y = (y 1 , . . . , y n ) T to represent the response vector. We focus on the general case where the observed data {(y i , x i )} n i=1 are i.i.d. samples from the distribution of a random vector (y, x). Let the jth component of the random vector x be denoted as x j . Meanwhile,</p><p>To perform the classification task, the support vector machine (SVM, <ref type="bibr">Vapnik, 1995)</ref> seeks a separating hyperplane {x :</p><p>It is well known that the above problem can be equivalently formulated as a penalized empirical risk minimization problem:</p><p>where L(u) = (1 -u) + = max{1 -u, 0} is known as the SVM hinge loss and &#955; 0 &gt; 0 is a tuning parameter that is one-to-one correspondent to the constant c in problem (II.1). </p><p>)&#8804;t} is the empirical cdf based on i.i.d. realizations of U . The usage of the discontinuous empirical cdf here makes the objective in (II.2) to have the same degree of smoothness as the hinge loss L(&#8226;), i,e. continuous but nondifferentiable. This has motivated us to consider an alternative estimator for the cdf. If we use an estimator F (&#8226; ; &#946;, &#946; 0 ) that is smooth enough, the &#8734; -&#8734; L(t)d F (t; &#946;, &#946; 0 ) shall lead us towards a new objective which is differentiable to certain degrees.</p><p>In particular, we consider the cdf from the kernel density estimator</p><p>where</p><p>and h &gt; 0 is the bandwidth parameter to be tuned. Replacing F by F gives the new objective function,</p><p>where</p><p>) and " * " stands for convolution. As such, with the penalty term &#955; 0 &#946; 2 2 , we obtain</p><p>We treat the classifier arisen from the above problem as a new classifier and coin it the density-convoluted SVM (DCSVM).</p><p>As discussed above, DCSVM originates from a statistical view of the SVM, while it shows merit from the computational perspective as it overcomes the non-differentiability of the original SVM problem. Smoothing a non-differentiable problem through convolution can be traced back to the idea of mollification <ref type="bibr">(Friedrichs, 1944)</ref> and has also been studied in the optimization community, for example, <ref type="bibr">Bertsekas (1973)</ref> and <ref type="bibr">Rubinstein (1983)</ref>.</p><p>The method was recently adopted to smooth the quantile regression by <ref type="bibr">He et al. (2021</ref><ref type="bibr">), Fernandes et al. (2021)</ref> and <ref type="bibr">Tan et al. (2021)</ref>.</p><p>In this work, we focus on two most popular kernel functions, Gaussian kernel and Epanechnikov kernel in DCSVM, and we denote the corresponding convoluted loss function by L G h (v) and L E h (v), respectively. For the Gaussian kernel</p><p>where &#934;(&#8226;) is the cumulative distribution function of the standard normal distribution.</p><p>For the Epanechnikov kernel K(u) = 3 4 (1 -u 2 )I(-1 &#8804; u &#8804; 1), where I(&#8226;) is the indicator function,</p><p>, the density-convoluted SVM loss functions with Gaussian kernel (left) and Epanechnikov kernels (right). Bottom row: plots of the first-order derivatives, L G h (v) and L E h (v).</p><p>The top row of Figure <ref type="figure">1</ref> depicts the DCSVM losses with Gaussian kernel and Epanechnikov kernel.</p><p>Intuitively, h should be small such that the DCSVM is very close to the SVM. According to density estimator, the optimal rate for h is</p><p>). So, we adopt h = Cn -1/5 in our implementation, where C is some numerical constant within the range (0.25, 3). The actual value of C in practice can be determined by cross-validation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Sparse density-convoluted SVM</head><p>In high dimensions, we consider designing the estimator under a sparsity assumption that &#946; * has many zero components. Let A = {j : &#946; * j = 0, 1 &#8804; j &#8804; p} be the support set of &#946; * , i.e., the set of indices of the important covariates. Let s = |A|. Throughout this paper, we allow p = pn and s = sn to diverge with n, and we assume sn &#8805; 1 and pn goes to infinity as n goes to infinity. For convenience, we still use p and s to represent these quantities since no confusion is caused. In ultra-high dimensions, the dimension p is allowed to increase exponentially with the sample size n. We also assume that s is relatively of smaller order compared to n, which is necessary for the existence of a consistent estimator.</p><p>To perform the classification for high-dimensional data, we present sparse DCSVM with an additional 1-penalty term ( &#946;0, &#946;) := argmin</p><p>The 1-penalty is used to induce sparsity in the estimator. We also consider the following version of sparse DCSVM with only an 1-penalty term:</p><p>( &#946;0, &#946;) := argmin</p><p>Borrowing the commonly used terminologies for different penalties in high dimensional literature, we refer to the estimator in (II.4) as elastic-net DCSVM, and refer to the estimator in (II.5) as lasso DCSVM. Note that the lasso DCSVM is a special case of elastic-net DCSVM with &#955;0 = 0. In the statistics literature it is well-known than the elastic-net often yields much improved prediction than the lasso owing to the additional 2 regularization that can well handle the correlated covariates which occurs often in high-dimensional data. For example, in the case of the SVM <ref type="bibr">Wang et al. (2006)</ref> showed that the elastic-net regularized SVM is more accurate than the 1-norm SVM. Therefore, we focus on the elastic-net penalized DCSVM in theory and in applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. THEORETICAL STUDIES</head><p>We now state the assumptions needed to establish our theoretical results. We first impose the following conditions on the random design.</p><p>Assumption 1. {(yi, xi)} n i=1 , (y, x) are independent and identically distributed on R &#215; R p . x is a zero-mean sub-exponential random vector, i.e. E[x] = 0, and there exists a constant m0 &gt; 0 such that sup</p><p>By definition of Orlicz norm and Markov's inequality, this further implies sup</p><p>For any index set A &#8834; {1, . . . , p}, consider the cone</p><p>be Hessian matrix of the population loss, or information matrix. We impose the following condition on the information.</p><p>Assumption 2. There exists a constant &#961; &gt; 0 such that</p><p>Assumption 1 is a general setting in the random design, which relaxes the classical condition that the components of x are bounded random variables <ref type="bibr">(Peng et al., 2016)</ref>. Assumption 2, which is a restricted eigenvalue type of condition, is needed to establish 2-type error bound for 1-penalized type of estimator. Similar conditions have been widely adopted in the literature <ref type="bibr">(B&#252;hlmann and Van De Geer, 2011;</ref><ref type="bibr">Fan et al., 2020)</ref>.</p><p>Theorem 1. Assume assumptions 1-2 hold, and s log p = o(n). Choose the tuning parameters such that 8&#955;0 &#946; * max &lt; &#955;. Then there exists a large enough constant c0 &gt; 0 such that with the choice &#955; = c0 log p n , the elastic-net DCSVM estimator ( &#946;0, &#946;) satisfies</p><p>Theorem 1 shows that the sparse DCSVM estimator can achieve the same rate of convergence as the 1-SVM <ref type="bibr">(Peng et al., 2016)</ref>. Meanwhile, the sparse DCSVM has better computational efficiency than penalized SVM due to the smoothness of its loss function, as shown in the next section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. COMPUTATION</head><p>In this section, we develop an efficient algorithm for computing the solution path of DCSVM.</p><p>At the outset, we present the first-order derivative of the density-convoluted SVM loss and show they are Lipschitz continuous in Lemma 1:</p><p>and L E h (v) be the DCSVM loss using Gaussian kernel and Epanechnikov kernel, respectively. For</p><p>where the Lipschitz constants are given as</p><p>and c E h = 3 4h .</p><p>The bottom row of Figure <ref type="figure">1</ref> depicts L G h (v) and L E h (v). Lemma 1 gives rise to the following quadratic majorization condition for the DCSVM:</p><p>where L h is exemplified by L G h and L E h and c h is the corresponding Lipschitz constant. Based on the Lipschitz condition, we develop a generalized coordinate descent (GCD) algorithm <ref type="bibr">(Yang and Zou, 2013)</ref> to solve those sparse penalized DCSVMs. We first consider the adaptive lasso penalty. The algorithm can be easily adjusted to handle lasso and elastic net.</p><p>Without loss of generality, we assume each X j has zero mean and unit length. In a coordinate-wise manner, suppose the coordinate &#946; 1 , &#946; 2 , . . . , &#946; j-1 have been updated and we now update &#946; j . Denote by &#946;0 and &#946; by the current solution and let v i = y i ( &#946;0 + x T i &#946;). To update &#946; j , instead of solving the coordinate-wise update function,</p><p>we solve its majorization function</p><p>Likewise, &#946; 0 is updated to be &#946;0 -1</p><p>In the appendix we provide theoretical analysis of the convergence of the generalized coordinate descent algorithm which is not in <ref type="bibr">Yang and Zou (2013)</ref>. In our implementation, we further apply the strong rule <ref type="bibr">(Tibshirani et al., 2010)</ref>, warm start, and active set strategy <ref type="bibr">(Friedman et al., 2010)</ref> to further accelerate the algorithm. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. NUMERICAL STUDIES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Simulation</head><p>In this section, we use several simulation examples to demonstrate the performance of DCSVM.</p><p>The response variables of all the simulated data are binary and the two classes are balanced, i.e., P (Y = 1) = P (Y = -1) = 0.5. In each example, define the p-dimensional mean vectors &#181; + = (0.7, 0.7, 0.7, 0.7, 0.7, 0, 0, . . . , 0) and &#181; -= -&#181; + , where p = 500 or 5000 in our experiments. Each observation from the positive class is drawn from N(&#181; + , &#931;) and each observation from the negative class is drawn from N(&#181; -, &#931;). We consider three different choices of &#931;. In example 1, &#931; = I p&#215;p so the variables are independent. In both examples 2 and 3,</p><p>where &#931; 5&#215;5 have all diagonal elements of 1 and off-diagonal elements of &#961; in example 2, and (&#931; 5&#215;5 ) i,j = &#961; |i-j| in example 3. We use &#961; = 0.2, 0.7, and 0.9.</p><p>We first compared elastic-net DCSVM with Gaussian kernel and Epanechnikov kernel with elastic-net SVM <ref type="bibr">(Wang et al., 2006)</ref> and elasticnet logistic regression that is fitted using the R package gcdnet <ref type="bibr">(Yang and Zou, 2013)</ref>. For each example, the training size is 200 and we use five-fold cross-validation to select the best tuple of (h, &#955; 0 , &#955;) where h is chosen from 0.1, 0.25, 0.5, and 1, &#955; 0 is selected from 0.5 * (10 -4 , 10 -3 , 10 -2 , 10 -1 , 1, 5), and &#955; is searched along the solution path; for the SVM and logistic regression, we select &#955; 0 and &#955; in the same manner.</p><p>We record the prediction error and run time in Table <ref type="table">I</ref>. The run time include all the time spent on tuning and training the models. We observe the DCSVM with Epanechnikov kernel has slightly better performance than DCSVM with Gaussian kernel, and both of them have better prediction accuracy than the other two methods. DCSVM with Epanechnikov kernel is the fastest while the elastic-net SVM is the slowest.</p><p>All the methods exhibited in Table I use elastic-net penalty. We now study the performance when using other sparse penalities. Due to the overall best performance, we stay with DCSVM with Epanechnikov kernel and we compare the prediction accuracy and variable selection when using lasso and elastic-net penalties. We present the results in Table <ref type="table">II</ref>. In general, we find the elastic-net has the best performance in both prediction error and variable selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Benchmark data applications</head><p>In this section, we demonstrate the performance of DCSVM using several benchmark data, which are available from UCI machine learning repository. We randomly split each data set into a training set and a test set with a 1:1 ratio. On the training set, we fit elastic-net DCSVM, elastic-net logistic regression, and elastic-net SVM, and tune each method using five-fold cross-validation. The prediction accuracy is computed based on the test set.</p><p>We present the result in Table <ref type="table">III</ref>. We observe the elastic-net DCSVM has the best performance in general.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX A PROOF OF THEOREM 1</head><p>We first give some general formula regarding the loss function L h and its derivatives. </p><p>Proof of Theorem 1. By definition of the 1 -penalized CRR estimator and triangle inequality, we have</p><p>where we denote u := &#946;&#946; * . On the other hand, by convexity of L h (&#8226;), we have</p><p>So by Hoeffding's inequality,</p><p>Meanwhile, we have E L h y(x T &#946; * + &#946; * 0 ) yx = 0 by the definition of &#946; * and optimality condition. By the choice of tuning parameters we have</p><p>Notice that by assumption 1 and</p><p>. . , n}, &#8704;j &#8712; {1, . . . , p}. By Theorem 1.4 in <ref type="bibr">G&#246;tze et al. (2021)</ref>, there exists an absolute constant &#951; 0 &gt; 0 such that</p><p>)n .</p><p>So following (A.0.5) we have</p><p>)n . Now, under E 1 &#8745; E 2 , combining (A.0.2) and (A.0.3) we have</p><p>We give an upper bound for E[H(r)]. Let &#963; 1 , . . . , &#963;n be i.i.d. Rademacher random variables (i.e. P(&#963; i = 1) = P(&#963; i = -1) = 1 2 ), which is independent from all the other random elements. By the symmetrization inequality (see for instance, Lemma 2.3.1 in <ref type="bibr">Van Der Vaart and Wellner (1996)</ref>) and contraction inequality (see for instance, Theorem 4.12 in <ref type="bibr">Ledoux and Talagrand (1991)</ref>), |L h (&#8226;)| &#8804; 1 and Cauchy-Schwarz inequality, we have</p><p>By assumption 1 and definition of Orlicz norm, we know &#963; i y i x ij &#968; 1 = x ij &#968; 1 &#8804; m 0 , &#8704;i &#8712; {1, . . . , n}, &#8704;j &#8712; {1, . . . , p}. Also, it is straightforward to see &#963; i y i &#968; 1 = 1 log 2 . By Proposition 2.7.1 in <ref type="bibr">Vershynin (2018)</ref>, there exists a constant c 1 &gt; 0 such that E[e t&#963; i y i</p><p>, &#8704;i &#8712; {1, . . . , n}, &#8704;j &#8712; {1, . . . , p}. By Jensen's inequality, we have for any</p><p>Consequently, for any</p><p>By the condition of Theorem 1, we know <ref type="formula">1</ref>), so for large enough n,</p><p>for large enough n. Thus, combining (A.0.6) and (A.0.8) we get</p><p>This implies that H(r) = Op( rs log p n</p><p>). Define event G T := {H(r) &#8804; T rs log p n } for any T &gt; 0, then we have lim T &#8594;&#8734; lim sup n&#8594;&#8734; P (G c T ) = 0.</p><p>Next, for any (&#946; 0 , &#946;) &#8712; (&#946; * 0 , &#946; * )+C(r), we derive a lower bound for E[G(&#946; 0 , &#946;)]. For large enough n, for any (&#946; 0 , &#946;) &#8712; (&#946; * 0 , &#946; * )+C(r), by Taylor's theorem and assumption 2, there exists a &#8712; [0, 1] such that</p><p>(A.0.9)</p><p>On the other hand, by our choice for tuning parameters, for any (&#946; 0 , &#946;) &#8712; (&#946; * 0 , &#946; * ) + C(r) we have </p><p>Thus, combining (A.0.9), (A.0.10) and (A.0.11), under G T , we have for any</p><p>. By convexity of F (&#8226;) and norm functions and by (A.0.12), we further have</p><p>, which is a contradiction with the definition of ( &#946;0 , &#946;). So the claim is proved. By union bound, previous results and choice of tuning parameters, we have</p><p>Since log p n = o(1), as long as c 0 is large enough (for instance c 0 &gt; 4</p><p>Combining this result and the previous claim, the proof of Theorem 1 is finished.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX B PROOF OF LEMMA</head><p>It is seen that</p><p>Thus inequality (IV.1) is obtained due to the mean value theorem.</p><p>We then prove inequality (IV.2). The inequality is trivial when</p><p>since L E h is twice differentiable between 1 -h and 1 + h, we see</p><p>where the second to the last inequality is due to</p><p>where the second to the last inequality is due to</p><p>denote the subvector of v with its kth component removed by</p><p>We also let &#8706;h be the sub-differential of a nonsmooth convex function h (see e.g., <ref type="bibr">Bertsekas, 1999)</ref>.</p><p>b) Iteration Complexity Analysis: Without loss of generality, we focus solely on the GCD algorithm for solving the weighted lasso penalized DCSVM</p><p>where w k &#8805; 0 are the weights of the penalty. Indeed, this formulation covers all the sparsity patterns in Section II-C. Also, the intercept term &#946; 0 can be absorbed into the formulation by setting x i1 = 1 for i = 1, . . . , n and w 1 = 0. For ease of exposition, let us rewrite (C.0.1) as the following unconstrained optimization problem</p><p>which implies that the gradient of g(&#8226;) is uniformly Lipschitz continuous with Lipschitz constant L = c h &#961;max. When restricted to each coordinate, we have</p><p>which implies that the gradient of g(&#8226;) is coordinate-wise uniformly Lipschitz continuous with Lipschitz constants L k = c h X k 2 , k = 1, . . . , p.</p><p>In the GCD (cyclic coordinate descent) algorithm, let &#946; r be the update of &#946; after the rth cycle, r &#8805; 0. For ease of notation, denote </p><p>is equivalent to</p><p>where the proximity operator prox does the soft-thresholding <ref type="bibr">(Parikh and Boyd, 2013)</ref> and</p><p>Our analysis will be divided into three parts: the sufficient descent step, the cost-to-go estimate step, and the local error bound step. Similar techniques can be found in <ref type="bibr">Luo and Tseng (1992)</ref>, <ref type="bibr">Luo and</ref><ref type="bibr">Tseng (1993), Zhang et al. (2013)</ref> and <ref type="bibr">Hong et al. (2013)</ref>.</p><p>c) Sufficient Descent: Consider the proximal gradient method applied to solving the following problem</p><p>we have by (C.0.3)</p><p>d) Cost-to-go Estimate: Let X * := {&#946; * |f (&#946; * ) = min &#946; f (&#946;)} be the optimal solution set of problem (C.0.2). Let &#946;r &#8712; X * be the point in X * such that d X * (&#946; r ) := min &#946;&#8712;X * &#946;&#946; r = &#946;r&#946; r . By optimality of</p><p>By the mean value theorem, there exists &#955; &#8712; [0, 1] and &#958; r = &#955;&#946; r+1 + (1 -&#955;) &#946;r such that g(&#946; r+1 ) -g( &#946;r ) = &#8711;g(&#958; r ), &#946; r+1 -&#946;r .</p><p>It follows that</p><p>By the fact that &#8711;g(&#8226;) is Lipschitz continuous, it is implied that</p><p>It follows that</p><p>Here we handle the Gaussian and Epanechnikov kernels separately.</p><p>For the Gaussian kernel, that is, when L h (&#8226;) = L G h (&#8226;), according to (C.0.4) and (C.0.5), the GCD algorithm is descending along its iterations. We can thus restrict the domain of &#946; to the sublevel set</p><p>We can see that g(&#946;) = p(X&#946;). It follows from <ref type="bibr">Zhang et al. (2013)</ref> that for any &#958; &#8805; min &#946; f (&#946;), there exist &#954;, &#949; &gt; 0 such that</p><p>for all &#946; such that &#946;prox h (&#946; -&#8711;g(&#946;)) &#8804; &#949; and f (&#946;) &#8804; &#958;.</p><p>For the Epanechnikov kernel, that is, when L h (&#8226;) = L E h (&#8226;), one needs to add an additional ridge penalty &#181; &#946; 2 for some small &#181; &gt; 0 in order to achieve strong optimality. Thus, when the Epanechnikov kernel is used, we instead consider the following problem</p><p>and solve it using the GCD algorithm.</p><p>As a summary, we show in the following theorem that the GCD algorithm converges at least linearly.</p><p>Theorem 2. The GCD algorithm converges at least linearly to a solution in X * .</p><p>Proof. We first show that there exists some &#963; &gt; 0 such that</p><p>For any r &#8805; 1 and any 1 &#8804; k &#8804; p, by the optimality of</p><p>we have</p><p>It follows that</p><p>where L = max(1, L) and L = max(1, L -1 ). Therefore, when we take &#963; we obtain</p><p>which implies that &#8710; r+1 &#8804; c 3 1 + c 3 &#8710; r , (C.0.9)</p><p>where c 3 = (c 2 &#954; 2 &#963; 2 + c 2 )c -1 1 . We can see from (C.0.9) that f (&#946; r ) approaches f * with at least linear rate of convergence. From (C.0.5) again, this further implies that the sequence {&#946; r } converges at least linearly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX D ADDITIONAL NUMERIC RESULTS WITH GAUSSIAN KERNEL</head><p>Under the same settings introduced in our simulation section, we compared the performance of lasso DCSVM and elastic-net DCSVM, using Gaussian kernel. The result is shown in Table <ref type="table">S</ref>.1. Again, we can see that the elastic-net DCSVM outperforms lasso DCSVM. We also conducted elastic-net DCSVM with Gaussian kernel on the same real datasets that we introduced in our real data section, and compared its performance with the performance of elastic-net SVM and elastic-net logistic regression. The result is displayed in Table <ref type="table">S</ref>  </p></div></body>
		</text>
</TEI>
