<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Asymptotics and Optimal Designs of SLOPE for Sparse Linear Regression</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10403504</idno>
					<idno type="doi">10.1109/ISIT.2019.8849836</idno>
					<title level='j'>IEEE International Symposium on Information Theory</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Hong Hu</author><author>Yue M. Lu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In sparse linear regression, the SLOPE estimator generalizes LASSO by assigning magnitude-dependent regularizations to different coordinates of the estimate. In this paper, we present an asymptotically exact characterization of the performance of SLOPE in the high-dimensional regime where the number of unknown parameters grows in proportion to the number of observations. Our asymptotic characterization enables us to derive optimal regularization sequences to either minimize the MSE or to maximize the power in variable selection under any given level of Type-I error. In both cases, we show that the optimal design can be recast as certain infinite-dimensional convex optimization problems, which have efficient and accurate finite-dimensional approximations. Numerical simulations verify our asymptotic predictions. They also demonstrate the superiority of our optimal design over LASSO and a regularization sequence previously proposed in the literature.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>In sparse linear regression, we seek to estimate a sparse vector &#946; &#8712; R p from</p><p>where A &#8712; R n&#215;p is the design matrix and w denotes the observation noise. In this paper, we study the sorted 1 penalization estimator (SLOPE) <ref type="bibr">[1]</ref> (see also <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>). Given a non-decreasing regularization sequence &#955; = [&#955; 1 , &#955; 2 , . . . , &#955; p ] with 0 &#8804; &#955; 1 &#8804; &#955; 2 &#8804; &#8226; &#8226; &#8226; &#8804; &#955; p , SLOPE estimates &#946; by solving the following optimization problem</p><p>where</p><p>is a reordering of the absolute values |x 1 | , |x 2 | , . . . , |x p | in increasing order. In <ref type="bibr">[1]</ref>, the regularization term J &#955; (x) def = p i=1 &#955; i |x| (i) is referred to as the "sorted 1 norm" of x. The same regularizer was independently developed in a different line of work <ref type="bibr">[2]</ref>- <ref type="bibr">[4]</ref>, where the motivation is to promote group selection in the presence of correlated covariates.</p><p>The classical LASSO estimator is a special case of SLOPE. It corresponds to using a constant regularization sequence, i.e., &#955; 1 = &#955; 2 = &#8226; &#8226; &#8226; = &#955; p = &#955;. However, with more general &#955;-sequences, SLOPE has the flexibility to penalize different coordinates of the estimate according to their magnitudes. This adaptivity endows SLOPE with some nice statistical properties that are not possessed by LASSO. For example, it is shown in <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref> that SLOPE achieves the minimax 2 estimation rate with high probability. In terms of testing, the authors of <ref type="bibr">[1]</ref> show that SLOPE controls false discovery rate (FDR) for orthogonal design matrices, which is not the case in LASSO <ref type="bibr">[7]</ref>. In addition, we note that the new regularizer J &#955; (x) is still a norm <ref type="bibr">[1]</ref>, <ref type="bibr">[3]</ref>. Thus, the optimization problem associated with SLOPE remains convex, and it can be efficiently solved by using e.g., proximal gradient descent <ref type="bibr">[1]</ref>, <ref type="bibr">[3]</ref>.</p><p>In the aforementioned studies on analyzing SLOPE, the performance of the estimator is given in terms of non-asymptotic probabilistic bounds. Such bounds provide very limited information about how to optimally design the regularization sequence &#955; in different settings, an important open question in the literature. In this paper, we provide two main contributions:</p><p>1) We obtain a characterization of SLOPE in the asymptotic regime: n, p &#8594; &#8734; and n/p &#8594; &#948;. Compared with the probabilistic bounds derived in previous work, our results are asymptotically exact. Similar asymptotic analysis has been done for LASSO <ref type="bibr">[8]</ref> and many other regularized linear regression problems <ref type="bibr">[9]</ref>- <ref type="bibr">[11]</ref>, but the main technical challenge in analyzing SLOPE is the nonseparability of J &#955; (x): it cannot be written as a sum of componentwise functions, i.e., J &#955; (x) = p i=1 J i (x i ). In our work, we overcome this challenge by showing that the proximal operator of J &#955; (x) is asymptotically separable.</p><p>2) Using our asymptotic characterization, we derive oracle optimal &#955; in two settings: (1) the optimal regularization sequence that minimizes the MSE E &#946;&#946; 2 ; and (2) the optimal sequence that achieves the highest possible power in testing and variable selection under a given level of Type-I error. In both cases, we show that the optimal design can be recast as certain infinite-dimensional convex optimization problems, which have efficient and accurate finite-dimensional approximations.</p><p>A caveat of our optimal design is that it requires knowing the limiting empirical measure of &#946; (e.g., the sparsity level and the distribution of its nonzero coefficients). For this reason, our results are oracle optimal. It provides the first step towards more practical optimal designs that are completely blind to &#946;.</p><p>The rest of the paper is organized as follows. In Sec. II, we first prove the asymptotic separability of the proximal operator associated with J &#955; (x). This property allows us to derive our main asymptotic characterization of SLOPE, summarized as Theorem 1. Based on this analysis, we present the optimal design of the regularization sequence in Sec. III. Numerical simulations verify our asymptotic characterizations. They also demonstrate the superiority of our optimal design over LASSO and a previous sequence design in the literature <ref type="bibr">[5]</ref>. Due to space constraints, we only state and illustrate the main results in this paper, and leave the technical proofs to <ref type="bibr">[12]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. MAIN ASYMPTOTIC RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Technical Assumptions</head><p>There are four main objects in the description of our model and algorithm: (1) the unknown sparse vector &#946;; (2) the design matrix A; (3) the noise vector w; and (4) the regularization sequence &#955;. Since we study the asymptotic limit (with p &#8594; &#8734;), we will consider a sequence of instances &#946; (p) , A (p) , w (p) , &#955; (p)  p&#8712;N with increasing dimensions p, where &#946; (p) , &#955; (p) &#8712; R p , A (p) &#8712; R n&#215;p and w (p) &#8712; R n . A sequence of vectors x (p) &#8712; R p indexed by p is called a converging sequence <ref type="bibr">[8]</ref> if its empirical measure &#181; p (x)</p><p>i ) converges weakly to a probability measure on R.</p><p>Our results are proved under the following assumptions: (A.1) The number of observations grows in proportion to p:</p><p>2) The number of nonzero elements in &#946; (p) grows in proportion to p: k/p &#8594; &#961; &#8712; (0, 1].</p><p>3) The elements of A (p) are i.i.d. Gaussian distribution:</p><p>&#8764; N (0, 1 n ). (A.4) {&#946; (p) } p , {w (p) } p and {&#955; (p) } p are converging sequences.</p><p>The distribution functions of the limiting measures are denoted by F &#946; , F w and F &#955; , respectively. Moreover, we have</p><p>, where the probability P(&#8226;) and the expectations E[&#8226;] are all computed with respect to the limiting measures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Asymptotics of the Proximal Operator of J &#955; (x)</head><p>We start by studying the proximal operator associated with the sorted 1 norm J &#955; (x). Given y &#8712; R p and a regularization sequence &#955; with 0 &#8804; &#955; 1 &#8804; &#955; 2 &#8804; &#8226; &#8226; &#8226; &#8804; &#955; p , the proximal operator is defined as the solution to the following optimization problem:</p><p>where</p><p>In the case of LASSO, which corresponds to choosing</p><p>the proximal operator is easy to characterize, as it is separable: [Prox &#955; (y)] i = sign(y i ) max(|y i |&#955;, 0). In other words, the ith element of Prox &#955; (y) is solely determined by y i . However, this separability property does not hold for a general regularization sequence. When p is finite, [Prox &#955; (y)] i depends not only on y i but also on other elements of y. This coupling makes it much harder to analyze the proximal operator. Fortunately, as we show below, when p &#8594; &#8734;, Prox &#955; (&#8226;) becomes asymptotically separable. Proposition 1: Let {y (p) } p and {&#955; (p) } p be two converging sequences. Denote by F y and F &#955; the distribution functions of their respective limiting measures. It holds that</p><p>where &#951;(&#8226;; F y , F &#955; ) is a scalar function that is determined by F y and F &#955; , and &#951;(y (p) ; F y , F &#955; ) denotes a coordinate-wise application of &#951;(&#8226;; F y , F &#955; ) on y (p) . The asymptotic separability of Prox &#955; (&#8226;) greatly facilitates our asymptotic analysis and design of SLOPE, since it allows us to reduce the original high-dimensional problem to an equivalent one-dimensional problem. In what follows, we refer to &#951;(&#8226;; F y , F &#955; ) as the limiting scalar function. In the appendix, we briefly present a procedure for constructing &#951;(&#8226;; F y , F &#955; ) from the limiting distribution functions F y and F &#955; . More details can be found in <ref type="bibr">[12]</ref>.</p><p>Example 1: We compare the proximal operator Prox &#955; (y) and the limiting scalar function &#951;(y; F y , F &#955; ), for two different &#955;-sequences shown in Fig. <ref type="figure">1</ref>(a) and Fig. <ref type="figure">1(c)</ref>. The red curves represent the limiting scalar functions obtained in Proposition 1, whereas the blue circles are sample points of (y i , [Prox &#955; (y)] i ), with y i &#8764; N (0, 1). For better visualization, we randomly sample 3% of all (y i , [Prox &#955; (y)] i ). It can be seen that under a moderate dimension p = 1024, the proximal operator can already be very accurately approximated by the limiting scalar function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Asymptotics of SLOPE</head><p>We are now ready to tackle the original optimization problem (2) associated with SLOPE. Our goal is to characterize the joint empirical measure of ( &#946;, &#946;):</p><p>Indeed, many quantities of interest, such as the MSE, type-I error, and FDR, are all functionals of this joint empirical measure. A function</p><p>where L is a positive constant. As in <ref type="bibr">[8]</ref>, we will depict the limit of &#181; p ( &#946;, &#946;) through its action on pseudo-Lipschiz functions.</p><p>Theorem 1: Assume (A.1) -(A.4) hold. For any pseudo-Lipschiz function &#968;, we have</p><p>(5) Here, B, Z are two independent random variables with B &#8764; F &#946; and Z &#8764; N (0, 1); &#951;(&#8226; ; F y , F &#964; &#955; ) is the limiting scalar function defined in Proposition 1, with F y denoting the distribution function of B + &#963;Z and F &#964; &#955; denoting that of &#964; &#955; for some &#964; &#8805; 0. Moreover, the scalar pair (&#963;, &#964; ) is the unique solution of the following equations:</p><p>Remark 1: Readers familiar with the asymptotic analysis of LASSO will recognize that the forms of ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>) look identical to the results of LASSO obtained in <ref type="bibr">[8]</ref>, <ref type="bibr">[11]</ref>. Indeed, the first part of our proof directly applies the framework of analyzing LASSO asymptotics using convex Gaussian min-max theorem (CMGT) <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>. Following <ref type="bibr">[11]</ref>, in the asymptotic regime, the limiting measure of SLOPE is determined by the following fixed point equations:</p><p>Note that ( <ref type="formula">8</ref>) and ( <ref type="formula">9</ref>) are similar to ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>), except that they involve an R p &#8594; R p proximal mapping: Prox &#964; &#955; (&#946;+&#963;Z). This is where Proposition 1 becomes useful. Using the asymptotic separability stated in that proposition, we can simplify ( <ref type="formula">8</ref>) and ( <ref type="formula">9</ref>) to the scalar equations given in ( <ref type="formula">6</ref>) and <ref type="bibr">(7)</ref>. Theorem 1 essentially says that the joint empirical measure of ( &#946;, &#946;) converges weakly to the law of (&#951;(B + &#963;Z; F y , F &#964; &#955; ), B). This means that although the original problem ( <ref type="formula">2</ref>) is high-dimensional, its asymptotic performance can be succinctly captured by merely scalars random variables. In (5), if we let &#968;(x, y) = (xy) 2 , we obtain the asymptotic MSE; by setting &#968;(x, y) = 1 y=0, x =0 , we can recover the type-I error. (Technically, 1 y=0,x =0 is not pseudo-Lipschiz. However, with additional justifications <ref type="bibr">[1]</ref>, one can show that the conclusion is still correct.)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. ORACLE OPTIMALITY OF SLOPE</head><p>In this section, we will study the optimal design of the regularization sequence in SLOPE. Using the asymptotic characterization presented in Sec. II, we will derive the optimal limiting distribution F &#955; to achieve the best estimation or testing performance, given the oracle knowledge of F &#946; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Estimation with Minimum MSE</head><p>We first turn to the problem of finding the optimal &#955;sequence which minimizes the MSE of slope estimator. Since we work in the asymptotic regime, it boils down to finding the optimal distribution F * &#955; such that</p><p>= arg min</p><p>where B &#8764; F &#946; and the second equality follows from Theorem 1. From <ref type="bibr">(6)</ref>, this is further equivalent to finding F &#955; to minimize &#963;. However, directly searching over F &#955; appears unwieldy, since &#963; as a functional of F &#955; is defined indirectly through a nonlinear fixed point equation.</p><p>To simplify this problem, we first note that in ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>), the influence of F &#955; on the solution (&#963;, &#964; ) is only through the limiting scalar function &#951;. Therefore, instead of optimizing over F &#955; , we can find the optimal &#951; * and then calculate the corresponding F * &#955; . The next question then becomes finding all possible &#951; that can be realized by some F &#955; . In fact, we can compute all possible &#951;(&#8226;) associated with any converging sequence y (p)  p&#8712;N . Let</p><p>) holds be the functional space that &#951; belongs to. We have the following result: Proposition 2: For any converging sequence y (p) p&#8712;N , we have</p><p>and M is a convex set. Moreover, for any &#951; &#8712; M, the corresponding distribution of the &#955;-sequence that yields &#951; can be represented by: &#955; &#8764; |Y |&#951;(|Y |), where Y follows the limiting distribution of y (p)  p&#8712;N . Remark 2: Proposition 2 is the key ingredient in our optimal design. It shows that, with different choices of F &#955; , we can reach any nondecreasing and odd function that is Lipschitz continuous with constant 1. Clearly, the soft-thresholding functions associated with LASSO belongs to M, but the set M is much richer. This is the essence of how SLOPE generalizes LASSO: it allows for more degrees of freedom in the regularization.</p><p>Due to Proposition 2, the optimization problem can be simplified to that of finding the optimal &#951; &#8712; M such that &#963; as obtained from ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>) is minimized. Specifically, we need to find <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>)} . <ref type="bibr">(10)</ref> Note that equations ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>) involve two variables &#963; and &#964; . It is not easy to handle them simultaneously. A simplification we can make is to first set &#964; to 1 and find the minimum &#963; such that the first equation ( <ref type="formula">6</ref>) and the inequality E B,Z [&#951; (B + &#963;Z; F y , F &#955; )] &#8804; &#948; hold. Once we  get &#963; min , the corresponding &#964; can then be obtained via <ref type="bibr">(7)</ref>:</p><p>) -1 and &#955; is in turn updated to be &#955;/&#964; . After this replacement, ( <ref type="formula">6</ref>) and ( <ref type="formula">7</ref>) will be both satisfied. It is not difficult to show that this procedure will lead to the same &#963; min as defined in <ref type="bibr">(10)</ref>. Therefore, the remaining task is to solve, for every candidate &#963;, the following problem:</p><p>Thanks to the convexity of M, we can show that ( <ref type="formula">11</ref>) is an infinite-dimensional convex optimization problem. In practice, we can discretize over R to solve a finite-dimensional approximation. If the minimum value of ( <ref type="formula">11</ref>) is smaller than &#948;(&#963; 2&#963; 2 w ), then it can be shown that &#963; min &lt; &#963; and vice versa. Clearly, we only need to search for &#963; min over a compact set:</p><p>, since from (6), we know &#963; 2 &#8805; &#963; 2 w and also</p><p>] to find the minimum &#963;. Once we find the optimal &#951;(y) and (&#963; min , &#964; ), we know from Proposition 2 that the corresponding &#955; can be represented as:</p><p>In Fig. <ref type="figure">2</ref>, we compare the MSEs achieved by our optimal design with those obtained by LASSO and the BHq sequences proposed <ref type="bibr">[5]</ref>, at different SNR and sparsity levels. For fair comparison, we optimize the parameters of the BHq and LASSO sequences. It can be seen from the figure that the empirical minimum MSEs match well with theoretical ones. We observe from Fig. <ref type="figure">2a</ref> that, under low SNRs, the BHq sequence can lead to very similar performance as the oracle optimal design. However, at higher SNR levels, the optimal design outperforms the BHq sequence, while it gets close to LASSO. To unravel the underlying reason for this, we plot in Fig. <ref type="figure">3</ref> the distributions of the &#955;-sequences associated with the optimal design and the BHq design, respectively. It turns out that, in the low SNR case, the optimal design and BHq have similar distributions; at higher SNRs, the distribution of the optimal design is close to a delta-like distribution similar to LASSO.</p><p>Note that for small sparsity-level &#961;, LASSO can outperform BHq and achieve performance close to that of the optimal sequence, but it is prone to higher bias when &#961; grows. From Fig. <ref type="figure">2b</ref>, we can find that LASSO's performance degrades much faster than the other two as &#961; increases, This is because LASSO's penalization is not adaptive to the underlying sparsity levels <ref type="bibr">[5]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Multiple Testing with Maximum Power</head><p>Next we consider using SLOPE for variable selection. For a given level of type-I error, we want to find the optimal regularization sequence to achieve the highest possible power.</p><p>As we have shown in the last section, in the asymptotic region, this is equivalent to optimizing over &#951; &#8712; M. Let y thresh = sup y&#8805;0 {y | &#951;(y) = 0}. It follows from Theorem 1 that, in order to ensure P type-I = &#945;, we need to have</p><p>, where &#934;(&#8226;) is the CDF of the standard normal distribution. Similarly, we can compute the power of the test as</p><p>)). It can be shown that for any fixed &#946;,</p><p>)) is a nonincreasing function of &#963;. Thus, under a given type-I error rate &#945;, maximizing the power is equivalent to minimizing &#963;.</p><p>Similar to the procedure described in Sec. III-A, we can traverse through a bounded set of &#963; to find &#963; min . The difference here is that we need to enforce additional constraints that guarantee P type-I = &#945;. We omit further details, which can be found in <ref type="bibr">[12]</ref>. In Fig. <ref type="figure">4</ref>, we compare the FDR curve of the optimal design with that of the BHq sequence. We verify that the optimal design indeed dominates the BHq sequence and that the empirical FDR curve matches well with theoretical prediction. &#8764; (1&#961;)&#948;(0) + &#961;N (&#181; 0 , &#963; 2 0 ) with &#961; = 0.25, &#181; 0 = 2.125 and &#963; 0 = 0, w i i.i.d.</p><p>&#8764; N (0, &#963; 2 ) with &#963; = 0.25. The results are average over 100 realizations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX</head><p>In this appendix, we briefly describe the procedure for constructing the limiting scalar function &#951;(&#8226;; F y , F &#955; ) in Proposition 1. The procedure, as summarized in Algorithm 1, can be viewed as the asymptotic limit of a fast algorithm proposed in <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref> for solving <ref type="bibr">(3)</ref>. Here, &#955;(y) can be intuitively understood as a function that assigns each y a regularization strength &#955; and G(y) just thresholds y using the assigned &#955;. This is exactly in the same spirit of magnitude-dependent regularization applied in SLOPE. The WHILE LOOP in Step 2-14 is essentially an adjustment of G(y) obtained in Step 1: within certain intervals [y L , y R ], the original G(y) is replaced by E (G(y) | y &#8712; [y L , y R ]). The WHILE LOOP ends until G(y) is nondecreasing.</p><p>For illustrations, we consider the simplest scenario when F Y is continuous and when G(y) as obtained in Step 1 is nondecreasing. In this case, WHILE LOOP in Step 2-14 will not be executed. It follows that &#951;(y; F y , F &#955; ) = sign(y) max{0, |y| -F -1 &#955; [F |y| (|y|)]}. Clearly, this reduces to the soft-thresholding function associate with LASSO, with the regularization parameter given by F -1 &#955; [F |y| (|y|)] &#8801; &#955;. Illustrations of several more complicated examples can be found in <ref type="bibr">[12]</ref>.    b Maximal nondecreasing interval (MNDI) is defined in the same way as MDI.</p></div></body>
		</text>
</TEI>
