SUPPLEMENT TO SLOPE IS ADAPTIVE TO UNKNOWN SPARSITY AND ASYMPTOTICALLY MINIMAX. By Weijie Su and Emmanuel Candès Stanford University
APPENDIX A: PROOFS OF TECHNICAL RESULTS

As is standard, we write $a_n \asymp b_n$ for two positive sequences $a_n$ and $b_n$ if there exist two constants $C_1$ and $C_2$, possibly depending on $q$, such that $C_1 a_n \le b_n \le C_2 a_n$ for all $n$. Also, we write $a_n \sim b_n$ if $a_n/b_n \to 1$.

A.1. Proofs for Section 2. We remind the reader that the proofs in this subsection rely on some lemmas to be stated later in the Appendix.

Proof of (2.1). For simplicity, denote by $\widehat\beta$ the full Lasso solution $\widehat\beta_{\mathrm{Lasso}}$, and by $\widehat b_S$ the solution to the reduced Lasso problem
$$\text{minimize}_{b \in \mathbb{R}^k} \ \tfrac12 \|y - X_S b\|^2 + \lambda \|b\|_1,$$
where $S$ is the support of the ground truth $\beta$. We show that (i)
$$\|X_{\bar S}' z\|_\infty \le (1 + c/2)\sqrt{2 \log p} \tag{A.1}$$
and (ii)
$$\|X_{\bar S}' X_S (\beta_S - \widehat b_S)\|_\infty < C_1 \sqrt{k \log^2 p / n} \tag{A.2}$$
for some constant $C_1$ both happen with probability tending to one. Now observe that
$$X_{\bar S}'(y - X_S \widehat b_S) = X_{\bar S}' z + X_{\bar S}' X_S(\beta_S - \widehat b_S).$$
Hence, combining (A.1) and (A.2) and using the fact that $k \log p / n \to 0$ gives
$$\|X_{\bar S}'(y - X_S \widehat b_S)\|_\infty \le \|X_{\bar S}' X_S(\beta_S - \widehat b_S)\|_\infty + \|X_{\bar S}' z\|_\infty \le C_1\sqrt{k \log^2 p/n} + (1 + c/2)\sqrt{2\log p} = o\big(\sqrt{2\log p}\big) + (1 + c/2)\sqrt{2\log p} < (1 + c)\sqrt{2\log p}$$

[Footnotes: Supported in part by a General Wang Yaowu Stanford Graduate Fellowship. Supported in part by NSF under grant CCF and by the Math + X Award from the Simons Foundation.]
with probability approaching one. This last inequality, together with the fact that $\widehat b_S$ obeys the KKT conditions for the reduced Lasso problem, implies that $\widehat b_S$ padded with zeros on $\bar S$ obeys the KKT conditions for the full Lasso problem and is, therefore, a solution.

We need to justify (A.1) and (A.2). First, Lemmas A.6 and A.5 imply (A.1). Next, to show (A.2), we rewrite the left-hand side of (A.2) as
$$X_{\bar S}' X_S(\beta_S - \widehat b_S) = X_{\bar S}' X_S (X_S' X_S)^{-1} \big( X_S'(y - X_S \widehat b_S) - X_S' z \big).$$
By Lemma A.7, we have that
$$\big\| X_S'(y - X_S \widehat b_S) - X_S' z \big\| \le \sqrt{k}\,\lambda + \|X_S' z\| \le \sqrt{k}\,\lambda + \sqrt{32 k \log(p/k)} \le C\sqrt{k \log p}$$
holds with probability at least $1 - e^{-n/2} - (2ek/p)^k$. In addition, Lemma A.11 with $t = 1/2$ gives
$$\|X_S (X_S' X_S)^{-1}\| = 1/\sigma_{\min}(X_S) \le \big(1 - \sqrt{k/n} - 1/2\big)^{-1} < 3$$
with probability at least $1 - e^{-n/8}$. Hence, from the last two inequalities it follows that
$$\big\| X_S (X_S' X_S)^{-1} \big( X_S'(y - X_S \widehat b_S) - X_S' z \big) \big\| \le C\sqrt{k \log p} \tag{A.3}$$
with probability at least $1 - e^{-n/2} - (2ek/p)^k - e^{-n/8}$. Since $X_{\bar S}$ is independent of $X_S (X_S' X_S)^{-1}\big(X_S'(y - X_S \widehat b_S) - X_S' z\big)$, Lemma A.6 gives
$$\Big\| X_{\bar S}' X_S(X_S'X_S)^{-1}\big(X_S'(y - X_S \widehat b_S) - X_S' z\big) \Big\|_\infty \le \sqrt{\frac{2\log p}{n}}\; \Big\| X_S(X_S'X_S)^{-1}\big(X_S'(y-X_S\widehat b_S) - X_S'z\big) \Big\|$$
with probability approaching one. Combining this with (A.3) gives (A.2).

Let $\widetilde b_S$ be the solution to
$$\text{minimize}_{b\in\mathbb{R}^k}\ \tfrac12\big\|\beta_S + X_S'z - b\big\|^2 + \lambda\|b\|_1.$$
To complete the proof of (2.1), it suffices to establish (i) that for any constant $\delta > 0$,
$$\sup_{\|\beta\|_0 \le k} P\Big( \|\widetilde b_S - \beta_S\|^2 > (1-\delta)\, 2(1+c)^2 k\log p \Big) \to 1, \tag{A.4}$$
and (ii)
$$\|\widehat b_S - \widetilde b_S\| = o_P\big(\|\widetilde b_S - \beta_S\|\big), \tag{A.5}$$
since (A.4) and (A.5) give
$$\sup_{\|\beta\|_0 \le k} P\Big( \|\widehat b_S - \beta_S\|^2 > (1-\delta)\, 2(1+c)^2 k \log p \Big) \to 1 \tag{A.6}$$
for each $\delta > 0$. Note that taking $\delta = 1/(1+c)^2$ in (A.6) and using the fact that $\widehat b_S$ (padded with zeros) is a solution to the Lasso with probability approaching one finish the proof.

Proof of (A.4). Let $\beta_i = M$ if $i \in S$ and otherwise zero (treat $M$ as a sufficiently large positive constant). For each $i \in S$, $\widetilde b_{S,i} = \beta_i + X_i'z - \lambda$, and
$$|\widetilde b_{S,i} - \beta_i| = |X_i'z - \lambda| \ge \lambda - |X_i'z|.$$
On the event $\{\max_{i\in S} |X_i'z| \le \lambda\}$, which happens with probability tending to one, this inequality gives
$$\|\widetilde b_S - \beta_S\|^2 \ge \sum_{i\in S}\big(\lambda - |X_i'z|\big)^2 = k\lambda^2 - 2\lambda\sum_{i\in S}|X_i'z| + \sum_{i\in S}(X_i'z)^2 = (1 + o_P(1))\, 2(1+c)^2 k\log p,$$
where we have used that both $\sum_{i\in S}(X_i'z)^2$ and $\sum_{i\in S}|X_i'z|$ are $O_P(k)$. This proves the claim.

Proof of (A.5). Apply Lemma 4.2 with $T$ replaced by $S$ (here each of $\widehat b_S$, $\widetilde b_S$ and $\beta$ is supported on $S$). Since $k/p \to 0$, for any constant $\delta > 0$, all the singular values of $X_S$ lie in $(1-\delta, 1+\delta)$ with overwhelming probability (see, for example, [5]). Consequently, Lemma 4.2 ensures (A.5).

Proof of (2.2). We assume $\sigma = 1$ and put $\lambda = \lambda^{\mathrm{BH}}$. As in the proof of Theorem 1.1, we decompose the total loss as
$$\|\widehat\beta_{\mathrm{Seq}} - \beta\|^2 = \|\widehat\beta_{\mathrm{Seq},S} - \beta_S\|^2 + \|\widehat\beta_{\mathrm{Seq},\bar S} - \beta_{\bar S}\|^2 = \|\widehat\beta_{\mathrm{Seq},S} - \beta_S\|^2 + \|\widehat\beta_{\mathrm{Seq},\bar S}\|^2.$$
The largest possible value of the loss off support is achieved when $y_{\bar S}$ is sequentially soft-thresholded by $(\lambda_{k+1}, \ldots, \lambda_p)$. Hence, by the proof of Lemma 3.3, we obtain
$$\mathbb{E}\,\|\widehat\beta_{\mathrm{Seq},\bar S}\|^2 = o\big(2k\log(p/k)\big)$$
for all $k$-sparse $\beta$.
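The scalar soft-thresholding identity invoked in the proof of (A.4) above can be illustrated with a minimal numerical sketch; note that `beta_i`, `noise`, and `lam` below are arbitrary illustrative values, not quantities from the paper.

```python
import numpy as np

def soft_threshold(x, lam):
    # the Lasso prox: shrink toward zero by lam and clip at zero
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# For a coordinate with a sufficiently large positive mean beta_i,
# soft thresholding y_i = beta_i + <X_i, z> gives beta_i + <X_i, z> - lam,
# so the per-coordinate error is exactly <X_i, z> - lam.
lam = 3.0
beta_i = 100.0   # plays the role of the "sufficiently large" constant
noise = 0.7      # stands in for X_i'z
fitted = soft_threshold(beta_i + noise, lam)
error = fitted - beta_i
```

Squaring `error` and summing over $k$ such coordinates reproduces the $k\lambda^2$ leading term of the on-support loss.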
Now, we turn to consider the loss on support. For any $i \in S$, the loss is at most
$$\big(|z_i| + \lambda_{r_i}\big)^2 = \lambda_{r_i}^2 + z_i^2 + 2|z_i|\,\lambda_{r_i}.$$
Summing the above over all $i \in S$ gives
$$\mathbb{E}\,\|\widehat\beta_{\mathrm{Seq},S} - \beta_S\|^2 \le \sum_{i=1}^k \lambda_i^2 + \mathbb{E}\sum_{i\in S} z_i^2 + 2\,\mathbb{E}\sum_{i\in S} |z_i|\,\lambda_{r_i}.$$
Note that the first term $\sum_{i=1}^k \lambda_i^2 = (1 + o(1))\, 2k\log(p/k)$, and the second term has expectation $\mathbb{E}\sum_{i\in S} z_i^2 = k = o(2k\log(p/k))$, so that it suffices to show that
$$\mathbb{E}\Big[\, 2\sum_{i\in S} |z_i|\,\lambda_{r_i} \Big] = o\big(2k\log(p/k)\big). \tag{A.7}$$
We emphasize that both $z_i$ and $r_i$ are random, so that $\{\lambda_{r_i}\}_{i\in S}$ and $\{|z_i|\}_{i\in S}$ may not be independent. Without loss of generality, assume $S = \{1, \ldots, k\}$ and, for $1 \le i \le k$, let $r_i'$ be the rank of the $i$th observation among the first $k$. Since $\lambda$ is nonincreasing and $r_i' \le r_i$, we have
$$\sum_{1\le i\le k} |z_i|\,\lambda_{r_i} \le \sum_{1\le i\le k} |z_i|\,\lambda_{r_i'} \le \sum_{1\le i\le k} |z|_{(i)}\,\lambda_i,$$
where $|z|_{(1)} \ge \cdots \ge |z|_{(k)}$ are the order statistics of $|z_1|, \ldots, |z_k|$. The second inequality follows from the fact that for any nonnegative sequences $\{a_i\}$ and $\{b_i\}$ with $\{b_i\}$ nonincreasing, $\sum_i a_i b_i \le \sum_i a_{(i)} b_i$. Therefore, letting $\zeta_1, \ldots, \zeta_k$ be i.i.d. $\mathcal{N}(0,1)$, (A.7) follows from the estimate
$$\sum_{i=1}^k \lambda_i\, \mathbb{E}\,|\zeta|_{(i)} = o\big(2k\log(p/k)\big). \tag{A.8}$$
To argue about (A.8), we work with the approximations $\lambda_i \asymp \sqrt{2\log(p/i)}$ and $\mathbb{E}\,|\zeta|_{(i)} = O\big(\sqrt{\log(2k/i)}\big)$ (see e.g. (A.15)), so that the claim is a consequence of
$$\sum_{i=1}^k \sqrt{\log(p/i)\,\log(2k/i)} = o\big(2k\log(p/k)\big),$$
which is justified as follows:
$$\sum_{i=1}^k \sqrt{\log(p/i)\,\log(2k/i)} \le \int_0^k \sqrt{\log(p/x)\,\log(2k/x)}\, dx \le \int_0^k \Big( \sqrt{\log(p/k)\,\log(2k/x)} + \log(2k/x) \Big) dx \le C_1 k + C_2 k\sqrt{\log(p/k)}$$
for some absolute constants $C_1, C_2$. Since $\log(p/k) \to \infty$, it is clear that the right-hand side of the above display is $o(2k\log(p/k))$.

A.2. Proofs for Section 3. To begin with, we derive a dual formulation of the SLOPE program (1.6), which provides a nice geometrical interpretation. This dual formulation will also be used in the proof of Lemma 4.3. Our exposition largely borrows from [2]. Rewrite (1.6) as
$$\text{minimize}_{b,r}\ \tfrac12\|r\|^2 + \sum_i \lambda_i |b|_{(i)} \quad \text{subject to } Xb + r = y, \tag{A.9}$$
whose Lagrangian is
$$L(b, r, \nu) := \tfrac12\|r\|^2 + \sum_i \lambda_i |b|_{(i)} - \nu'(Xb + r - y).$$
Hence, the dual objective is given by
$$\inf_{b,r} L(b,r,\nu) = \nu'y - \sup_r\Big\{\nu'r - \tfrac12\|r\|^2\Big\} - \sup_b\Big\{ (X'\nu)'b - \sum_i \lambda_i |b|_{(i)} \Big\} = \begin{cases} \nu'y - \tfrac12\|\nu\|^2, & \nu \in C_{\lambda,X},\\ -\infty, & \text{otherwise},\end{cases}$$
where $C_{\lambda,X} := \{\nu : X'\nu \text{ is majorized by } \lambda\}$ is a convex polytope. It thus follows that the dual reads
$$\text{maximize}_\nu\ \nu'y - \tfrac12\|\nu\|^2 \quad \text{subject to } \nu \in C_{\lambda,X}. \tag{A.10}$$
The equality $\nu'y - \tfrac12\|\nu\|^2 = -\tfrac12\|y - \nu\|^2 + \tfrac12\|y\|^2$ reveals that the dual solution $\widehat\nu$ is indeed the projection of $y$ onto $C_{\lambda,X}$. The minimization of
the Lagrangian over $r$ is attained at $r = \nu$. This implies that the primal solution $\widehat\beta$ and the dual solution $\widehat\nu$ obey
$$y - X\widehat\beta = \widehat\nu. \tag{A.11}$$
We turn to proving the facts.

Proof of Fact 3.1. Without loss of generality, suppose both $a$ and $b$ are nonnegative and arranged in nonincreasing order. Denote by $T_k(a)$ the sum of the first $k$ terms of $a$, with $T_0(a) \equiv 0$, and similarly for $b$. We have
$$\|a\|^2 = \sum_{k=1}^p a_k\big(T_k(a) - T_{k-1}(a)\big) = \sum_{k=1}^{p-1} T_k(a)(a_k - a_{k+1}) + a_p T_p(a) \le \sum_{k=1}^{p-1} T_k(b)(a_k - a_{k+1}) + a_p T_p(b) = \sum_{k=1}^p a_k b_k.$$
Similarly,
$$\|b\|^2 = \sum_{k=1}^{p-1} T_k(b)(b_k - b_{k+1}) + b_p T_p(b) \ge \sum_{k=1}^{p-1} T_k(a)(b_k - b_{k+1}) + b_p T_p(a) = \sum_{k=1}^p a_k b_k,$$
which proves the claim.

Proofs of Facts 3.2 and 3.3. Taking $X = I_p$ in the dual formulation, (A.11) immediately implies that $a - \mathrm{prox}_\lambda(a)$ is the projection of $a$ onto the polytope $C_{\lambda,I_p}$. By definition, $C_{\lambda,I_p}$ consists of all vectors majorized by $\lambda$. Hence, $a - \mathrm{prox}_\lambda(a)$ is always majorized by $\lambda$. In particular, if $a$ is majorized by $\lambda$, then the projection $a - \mathrm{prox}_\lambda(a)$ of $a$ is identical to $a$ itself. This gives $\mathrm{prox}_\lambda(a) = 0$.

Proof of Fact 3.4. Assume $a$ is nonnegative without loss of generality. It is intuitively obvious that $b \succeq a \Rightarrow \mathrm{prox}_\lambda(b) \succeq \mathrm{prox}_\lambda(a)$, where as usual $b \succeq a$ means that $b - a \in \mathbb{R}^p_+$. In other words, if the observations increase, the fitted values do not decrease. To save time, we directly verify this claim by using Algorithm 3 (FastProxSL1) from [2]. By the averaging step of that algorithm, we can see that for each $1 \le i, j \le p$,
$$\frac{\partial\,[\mathrm{prox}_\lambda(a)]_i}{\partial a_j} = \begin{cases} \dfrac{1}{\#\{1 \le k \le p : [\mathrm{prox}_\lambda(a)]_k = [\mathrm{prox}_\lambda(a)]_j\}}, & [\mathrm{prox}_\lambda(a)]_j = [\mathrm{prox}_\lambda(a)]_i > 0,\\[2mm] 0, & \text{otherwise}. \end{cases}$$
This holds for all $a \in \mathbb{R}^p$ except for a set of measure zero. The nonnegativity of $\partial[\mathrm{prox}_\lambda(a)]_i / \partial a_j$, along with the Lipschitz continuity of the prox, implies
7 SUPPLEMENT 7 the monotonicity property. A consequence is that [prox λ a] T does not decrease as we let a i for all i T. In the limit, [prox λ a] T monotonically converges to prox λ T at. This gives the desired inequality. As a remark, we point out that the proofs of Facts 3.2 and 3.3 suggest a very simple proof of Lemma 3.. Since a prox λ a is the projection of a onto C λ,ip, prox λ a is thus the distance between a and the polytope C λ,ip. Hence, it suffices to find a point in the polytope at a distance of a λ + away from a. The point b defined as b i = min{ a i, λ i } does the job. Now, we proceed to prove the preparatory lemmas for Theorem., namely, Lemmas A.3 and A.4. The first two lemmas below can be found in []. Lemma A.. Let U be a Betaa, b random variable. Then E log U = log Γa log Γa + b, where Γ denotes the Gamma function and log Γx is the derivative with respect to x. Lemma A.2. For any integer m, m log Γm = γ + j = log m + O, m j= where γ = is the Euler constant. Lemma A.3. Let ζ N 0, I p k. Under the assumptions of Theorem., for any constant A > 0, Ak E ζ 2k logp/k i λk+i BH i= Proof of Lemma A.3. Write λ i = λ BH i for simplicity. It is sufficient to prove a stronger version in which the order statistics ζ i come from p i.i.d. N 0,. The reason is that the order statistics will be stochastically larger, thus enlarging E 2 ζ i λk+i BH +, since ζ λ2 + is nondecreasing in ζ. Applying the bias-variance decomposition, we get A.2 E 2 ζ i λ k+i + E 2 ζ i λ k+i = Var ζi + 2 E ζ i λ k+i.
8 8 W. SU AND E. CANDÈS We proceed to control each term separately. For the variance, a direct application of Proposition 4.2 in [3] gives A.3 i= Var ζ i = O i logp/i for all i p/2. Hence, Ak Ak Var ζ i = O = o2k logp/k, i logp/i i= where the last step makes use of logp/k. It remains to show that A.4 Ak i= E ζi λ k+i 2 = o2k logp/k. Let U,..., U p be i.i.d. uniform random variables on 0, and U i be the i th smallest please note that for a change, the U i s are sorted in increasing order. We know that U i is distributed as Betai, p + i and that ζ i has the same distribution as Φ U i /2. Making use of Lemmas A. and A.2 then gives E ζ 2 i = E [ Φ U i /2 2] E [ 2 log2/u i ] = 2 log 2+2 where the second step follows from + o P 2 log2/u i Φ U i /2 2 2 log2/u i for i = op. As a result, A.5 E ζ i E ζ 2 i = + o 2 logp/i E ζ i = E ζ 2 i Var ζ i = + o 2 logp/i. Similarly, since k + i = op and q is constant, we have the approximation λ k+i = + o 2 logp/k + i, which together with A.5 reveals that A.6 E ζi λ k+i 2 +o 2 [ logp/i logp/k + i ] 2+o logp/i. p j=i j = +o2 logp/i,
9 SUPPLEMENT 9 The second term in the right-hand side contributes at most o Ak logp/ak = o 2k logp/k in the sum A.4. For the first term, we get [ logp/i logp/k + i ] 2 = log 2 + k/i [ logp/i + logp/k + i ] 2 = o log 2 +k/i. Hence, it contributes at most A.7 o Ak i= Ak log 2 + k/i o = ok i= A 0 k i k i k log 2 + /xdx log 2 + /xdx = o2k logp/k. Combining A.7, A.6 and A.4 concludes the proof. Lemma A.4. Let ζ N 0, I p k and A > 0 be any constant satisfying q + A/A <. Then, under the assumptions of Theorem., 2k logp/k p k i= Ak E ζ i λ BH k+i Proof of Lemma A.4. Again, write λ i = λ BH i for simplicity. As in the proof of Lemma A.3 we work on a stronger version by assuming ζ N 0, I p. Denote by q = q + A/A. For any u 0, let α u := P N 0, > λ k+i + u = 2Φ λ k+i u. Then P ζ i > λ k+i + u is just the tail probability of the binomial distribution with p trials and success probability α u. By the Chernoff bound, this probability is bounded as A.8 P ζ i > λ k+i + u e p KLi/p αu, where KLa b := a log a b Note that A.9 KLi/p b b a + a log b is the Kullback-Leibler divergence. = i/p b α u + i/p b i pb + for all 0 < b < i/p. Hence, from A.9 it follows that A.20 α0 KL α0 KL i/p α u KL i/p α 0 = b db i pb db α u α0 e uλ k+iα 0 = iuλ k+i p i pb db α 0 e uλ k+i,
10 0 W. SU AND E. CANDÈS where the second inequality makes use of α u e uλ k+iα 0. With the proviso that q + A/A < and i Ak, it follows that A.2 α 0 = qk + i/p q i/p. Hence, substituting A.2 into A.20, we see that A.8 yields P ζ i > λ k+i + u e p KL i p αu KL i p α 0 A.22 e p KL i p αu KL i p α 0 e p KL i p α 0 exp iuλ k+i + q i exp uλ k+i. With this preparation, we conlude the proof of our lemma as follows: E 2 ζ i λ k+i + = P ζ i λ k+i 2 + > x dx and plugging A.22 gives E ζ i λ k+i = 2 0 λ 2 k+i 0 2 λ 2 p This yields the upper bound p k = = 2 0 u exp i= Ak E ζ i λ k+i λ 2 p Since the integrand obeys lim x 0 0 = 0 0 P ζ i > λ k+i + xdx up ζ i > λ k+i + udu, iuλ k+i + q i exp uλ k+i du xe x q e x i dx xe x q e x i dx. p k i= Ak 0 2 Φ q/2 2 2 Φ q/2 2 xe x q e x i dx i= 0 xe x q e x e x q e x = q 0 xe x q e x i dx xe x q e x e x q e x dx.
and decays exponentially fast as $x \to \infty$, we conclude that $\sum_{i=Ak}^{p-k} \mathbb{E}\big(|\zeta|_{(i)} - \lambda_{k+i}\big)_+^2$ is bounded by a constant. This is a bit more than we need since $2k\log(p/k) \to \infty$.

A.3. Proofs for Section 4. In this paper, we often use the Borell inequality to show that
$$P\big(\|\mathcal{N}(0, I_n)\| > \sqrt{n} + t\big) \le \exp(-t^2/2).$$

Lemma A.5 (Borell's inequality). Let $\zeta \sim \mathcal{N}(0, I_n)$ and $f$ be an $L$-Lipschitz continuous function on $\mathbb{R}^n$. Then for every $t > 0$,
$$P\big(f(\zeta) > \mathbb{E} f(\zeta) + t\big) \le e^{-\frac{t^2}{2L^2}}.$$

Lemma A.6. Let $\zeta_1, \ldots, \zeta_p$ be i.i.d. $\mathcal{N}(0,1)$. Then
$$\max_{1\le i\le p} |\zeta_i| \le (1 + o(1))\sqrt{2\log p}$$
holds with probability approaching one.

The latter classical result can be proved in many different ways. Suffice it to say that it follows from a more subtle fact, namely, that
$$\sqrt{2\log p}\,\Big( \max_{1\le i\le p} \zeta_i - \sqrt{2\log p} + \frac{\log\log p + \log 4\pi}{2\sqrt{2\log p}} \Big)$$
converges weakly to a Gumbel distribution [4].

Proof of Lemma 4.3. Let $\beta^{\mathrm{lift}}$ be the lift of $\widehat b_T$ in the sense that $\beta^{\mathrm{lift}}_T = \widehat b_T$ and $\beta^{\mathrm{lift}}_{\bar T} = 0$, and let $|T| = m$. Further, set $\nu := y - X_T \widehat b_T = y - X\beta^{\mathrm{lift}}$. Applying (A.10) and (A.11) to the reduced SLOPE program, we get that $X_T'\nu$ is majorized by $\lambda_{[m]} := (\lambda_1, \ldots, \lambda_m)$. By the assumption, $X_{\bar T}'\nu$ is majorized by $\lambda_{-[m]} := (\lambda_{m+1}, \ldots, \lambda_p)$. Hence, $X'\nu$ (the concatenation of $X_T'\nu$ and $X_{\bar T}'\nu$) is majorized by $\lambda = (\lambda_{[m]}, \lambda_{-[m]})$. This confirms that $\nu$ is dual feasible with respect to the full SLOPE program. If additionally we show that
$$\tfrac12\|y - X\beta^{\mathrm{lift}}\|^2 + \sum_i \lambda_i |\beta^{\mathrm{lift}}|_{(i)} = \nu'y - \tfrac12\|\nu\|^2, \tag{A.23}$$
then strong duality implies that $\beta^{\mathrm{lift}}$ and $\nu$ must, respectively, be the optimal solutions to the full primal and dual. In fact, (A.23) is self-evident. The right-hand side is the optimal value of the reduced dual (i.e., replacing $X$ and $\lambda$ by $X_T$ and $\lambda_{[m]}$ in (A.10)), while the left-hand side agrees with the optimal value of the reduced primal since
$$\tfrac12\|y - X\beta^{\mathrm{lift}}\|^2 = \tfrac12\|y - X_T \widehat b_T\|^2 \quad \text{and} \quad \sum_{i=1}^p \lambda_i |\beta^{\mathrm{lift}}|_{(i)} = \sum_{i=1}^m \lambda_i |\widehat b_T|_{(i)}.$$
Since the reduced primal only has linear equality constraints and is clearly feasible, strong duality holds, and (A.23) follows from this.

Lemma A.7. Let $k' < p$ be any deterministic integer. Then
$$\sup_{|T| = k'} \|X_T' z\| \le \sqrt{32 k' \log(p/k')}$$
with probability at least $1 - e^{-n/2} - (2ek'/p)^{k'}$. Above, the supremum is taken over all the subsets of $\{1, \ldots, p\}$ with cardinality $k'$.

Proof of Lemma A.7. Conditional on $z$, it is easy to see that the entries of $X'z$ are distributed as i.i.d. centered Gaussian random variables with variance $\|z\|^2/n$. This observation enables us to write
$$X'z \overset{d}{=} \frac{\|z\|}{\sqrt{n}}\,(\zeta_1, \ldots, \zeta_p),$$
where $\zeta := (\zeta_1, \ldots, \zeta_p)$ consists of i.i.d. $\mathcal{N}(0,1)$ variables independent of $z$. Hence, it is sufficient to prove that
$$\|z\| \le 2\sqrt{n} \qquad \text{and} \qquad \zeta_{(1)}^2 + \cdots + \zeta_{(k')}^2 \le 8k'\log(p/k')$$
hold simultaneously with probability at least $1 - e^{-n/2} - (2ek'/p)^{k'}$, where $\zeta_{(1)}^2 \ge \cdots \ge \zeta_{(p)}^2$ are the squared entries sorted in decreasing order. From Lemma A.5, we know that $P(\|z\| > 2\sqrt{n}) \le e^{-n/2}$, so we just need to establish the other inequality. To this end, observe that
$$P\Big(\zeta_{(1)}^2 + \cdots + \zeta_{(k')}^2 > 8k'\log(p/k')\Big) \le \frac{\mathbb{E}\, e^{\frac14\left(\zeta_{(1)}^2 + \cdots + \zeta_{(k')}^2\right)}}{e^{2k'\log\frac{p}{k'}}} \le \sum_{i_1 < \cdots < i_{k'}} \frac{\mathbb{E}\, e^{\frac14\left(\zeta_{i_1}^2 + \cdots + \zeta_{i_{k'}}^2\right)}}{e^{2k'\log\frac{p}{k'}}} = \binom{p}{k'}\, 2^{k'/2}\, e^{-2k'\log\frac{p}{k'}} \le \Big(\frac{2ek'}{p}\Big)^{k'}.$$
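Lemma A.7's bound can be sanity-checked numerically. The sketch below uses arbitrary illustrative dimensions and computes the supremum over supports of size $k'$ exactly as the sum of the $k'$ largest squared correlations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k1 = 500, 2000, 10   # k1 plays the role of k' in Lemma A.7

# Gaussian design with i.i.d. N(0, 1/n) entries, independent noise z
X = rng.standard_normal((n, p)) / np.sqrt(n)
z = rng.standard_normal(n)

# sup over all |T| = k1 of ||X_T' z||^2 equals the sum of the k1
# largest values of (X_i' z)^2
corr2 = np.sort((X.T @ z) ** 2)[::-1]
sup_norm2 = corr2[:k1].sum()
bound = 32 * k1 * np.log(p / k1)
```

With these dimensions the lemma's exceptional event has negligible probability, so `sup_norm2` should land well below `bound`.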
We record an elementary result which simply follows from $\Phi^{-1}(1 - c/2) \le \sqrt{2\log(1/c)}$ for each $0 < c < 1$.

Lemma A.8. Fix $0 < q < 1$. Then for all $k \le p/2$,
$$\sum_{i=1}^k \big(\lambda_i^{\mathrm{BH}}\big)^2 \le C_q\, k\log(p/k)$$
for some constant $C_q > 0$.

In the next two lemmas, we use the BHq critical values $\lambda^{\mathrm{BH}}$ to majorize sequences of Gaussian order statistics. Again, $a \preceq b$ means that $b$ majorizes $a$.

Lemma A.9. Given any constant $c > 1/(1-q)$, suppose $\max\{ck, k + d\} \le k' < p$ for some deterministic sequence $d$ that diverges to $\infty$. Let $\zeta_1, \ldots, \zeta_{p-k'}$ be i.i.d. $\mathcal{N}(0,1)$. Then
$$\big(|\zeta|_{(k'-k+1)}, |\zeta|_{(k'-k+2)}, \ldots, |\zeta|_{(p-k')}\big) \preceq \big(\lambda^{\mathrm{BH}}_{k'+1}, \lambda^{\mathrm{BH}}_{k'+2}, \ldots, \lambda^{\mathrm{BH}}_p\big)$$
with probability approaching one.

Proof of Lemma A.9. It suffices to prove the stronger case where $\zeta \sim \mathcal{N}(0, I_p)$. Let $U_1, \ldots, U_p$ be i.i.d. uniform random variables on $[0,1]$ and $U_{(1)} \le \cdots \le U_{(p)}$ the corresponding order statistics. Since
$$\big(|\zeta|_{(k'-k+1)}, \ldots, |\zeta|_{(p-k')}\big) \overset{d}{=} \big(\Phi^{-1}(1 - U_{(k'-k+1)}/2), \ldots, \Phi^{-1}(1 - U_{(p-k')}/2)\big),$$
the conclusion would follow from
$$P\big(U_{(k'-k+j)} \ge q(k'+j)/p,\ \forall j \in \{1, \ldots, p-k'\}\big) \to 1.$$
Let $E_1, \ldots, E_{p+1}$ be i.i.d. exponential random variables with mean 1, and denote by $T_i = E_1 + \cdots + E_i$. Then the order statistics $U_{(i)}$ have the same joint distribution as $T_i/T_{p+1}$. Fixing an arbitrary constant $q' \in (q, 1 - 1/c)$, we have
$$P\big(U_{(k'-k+j)} \ge q(k'+j)/p,\ \forall j\big) \ge P\big(T_{k'-k+j} \ge q'(k'+j),\ \forall j\big) - P\big(T_{p+1} > q'p/q\big).$$
Since $P(T_{p+1} > q'p/q) \to 0$ by the law of large numbers, it is sufficient to prove
$$P\big(T_{k'-k+j} \ge q'(k'+j),\ \forall j \in \{1, \ldots, p-k'\}\big) \to 1. \tag{A.24}$$
14 4 W. SU AND E. CANDÈS This event can be rewritten as T k k+j T k k q j q k T k k for all j p k. Hence, A.24 is reduced to proving A.25 P min T j p k k k+j T k k q j q k T k k. As a random walk, T k k+j T k k q j has i.i.d. increments with mean q > 0 and variance. Thus min j p k T k k+j T k k q j converges weakly to a bounded random variable in distribution. Consequently, A.25 holds if one can demonstrate that q k T k k diverges to as p in probability. To see this, observe that q k T k k = q k k k k k T k k q c c k k T k k, where we use the fact k ck. Under our hypothesis q c/c <, the process {q ct/c T t : t N} is a random walk drifting towards. Recognizing that k k d, we see that q ck k/c T k k weakly diverges to since it corresponds to a position of the preceding random walk at t. This concludes the proof. Lemma A.0. Let ζ,..., ζ p k be i.i.d. N 0,. Then there exists a constant C q only depending on q such that log p ζ,..., ζ p k C q λ BH logp/k k+,..., λ BH p with probability tending to one as p and k/p 0. Proof of Lemma A.0. Let U,..., U p k be i.i.d. uniform random variables on [0, ] and replace ζ i by Φ U i /2. Note that Φ U i /2 2 log 2, λ BH k+i U 2 log 2p i k + i ; Hence, it suffices to prove that for some constant κ q, A.26 log2/u i logp/k κ q log p log2p/k + i
15 SUPPLEMENT 5 holds for all i =,..., p k with probability approaching one. Applying the representation given in the proof of Lemma A.9 and noting that T p+ = + o P p, we see that A.26 is implied by A.27 log3p/t i logp/k κ q log p log2p/k + i. We consider i 4 p and i > 4 p separately. Suppose first that i 4 p. In this case, Thus A.27 would follow from log2p/k + i = + o logp/k. log3p/t i = Olog p for all such i. This is, however, self-evident since T i E /p with probability e /p = o. Suppose now that i > 4 p. In this case, we make use of the fact that T i > i/2 p for all i with probability tending to one as p. Then we prove a stronger result, namely, log 3p i/2 p log p k κ q log p log 2p k + i. for all i > 4 p. This follows from the two observations below: log 3p i/2 p log p 2p, log {log i k + i min p i, log p }. k In the proofs of the next two lemmas, namely, Lemma A. and Lemma A.2, we introduce an orthogonal matrix Q R n n that obeys Qz = z, 0,..., 0. In the proofs, Q is further set to be measurable with respect to z. Hence, Q is independent of X. There are many options available to construct such a Q, including the Householder transformation. Set [ ] W = := QX, w W
where $w \in \mathbb{R}^p$ and $\bar W \in \mathbb{R}^{(n-1)\times p}$. The independence between $Q$ and $X$ suggests that $W$ is still a Gaussian random matrix, consisting of i.i.d. $\mathcal{N}(0, 1/n)$ entries. Note that
$$X_i'z = (QX_i)'(Qz) = \|z\|\,(QX_i)_1 = \|z\|\, w_i.$$
This implies that $S'$ is constructed as the union of $S$ and the $k' - k$ indices in $\{1, \ldots, p\} \setminus S$ with the largest $|w_i|$. Since $w$ and $\bar W$ are independent, we see that both $\bar W_{S'}$ and $\bar W_{\bar S'}$ are also Gaussian random matrices. These points are crucial in the proofs of the two lemmas.

Lemma A.11. Let $k < k' < \min\{n, p\}$ be any deterministic integer. Denote by $\sigma_{\min}$ and $\sigma_{\max}$, respectively, the smallest and the largest singular value of $X_{S'}$. Then for any $t > 0$,
$$\sigma_{\min} > \sqrt{(n-1)/n} - \sqrt{k'/n} - t$$
holds with probability at least $1 - e^{-nt^2/2}$. Furthermore,
$$\sigma_{\max} < \sqrt{(n-1)/n} + \sqrt{k'/n} + \sqrt{8k'\log(p/k')/n} + t$$
holds with probability at least $1 - e^{-nt^2/2} - (2ek'/p)^{k'}$.

Proof of Lemma A.11. Recall that $\bar W_{S'} \in \mathbb{R}^{(n-1)\times k'}$ is a Gaussian design with i.i.d. $\mathcal{N}(0, 1/n)$ entries. Since $W_{S'}$ and $X_{S'}$ have the same set of singular values, we consider $W_{S'}$. Classical theory on Wishart matrices (see [5], for example) asserts that (i) all the singular values of $\bar W_{S'}$ are larger than $\sqrt{(n-1)/n} - \sqrt{k'/n} - t$ with probability at least $1 - e^{-nt^2/2}$, and (ii) are all smaller than $\sqrt{(n-1)/n} + \sqrt{k'/n} + t$ with probability at least $1 - e^{-nt^2/2}$. Clearly, all the singular values of $W_{S'}$ are at least as large as $\sigma_{\min}(\bar W_{S'})$; thus, (i) yields the first claim. For the other, Lemma A.7 asserts that the event $\|w_{S'}\| \le \sqrt{8k'\log(p/k')/n}$ happens with probability at least $1 - (2ek'/p)^{k'}$. On this event,
$$\|W_{S'}\| \le \|\bar W_{S'}\| + \|w_{S'}\| \le \|\bar W_{S'}\| + \sqrt{8k'\log(p/k')/n},$$
where $\|\cdot\|$ denotes the spectral norm. Hence, (ii) gives
$$\|W_{S'}\| \le \sqrt{(n-1)/n} + \sqrt{k'/n} + t + \sqrt{8k'\log(p/k')/n}$$
with probability at least $1 - e^{-nt^2/2} - (2ek'/p)^{k'}$.
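The classical singular-value concentration cited in the proof of the singular-value lemma above can be checked directly on a simulated Gaussian design. This is a sketch with arbitrary dimensions; it exercises only the Wishart bounds from [5], not the extra $\sqrt{8k'\log(p/k')/n}$ correction coming from the selected first row.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k1, t = 2000, 100, 0.1   # k1 plays the role of k'

# n x k1 Gaussian design with i.i.d. N(0, 1/n) entries
W = rng.standard_normal((n, k1)) / np.sqrt(n)
sv = np.linalg.svd(W, compute_uv=False)

# each one-sided bound fails with probability at most exp(-n t^2 / 2)
lower = 1 - np.sqrt(k1 / n) - t
upper = 1 + np.sqrt(k1 / n) + t
```

With $n = 2000$ and $t = 0.1$, the exceptional probability $e^{-nt^2/2} = e^{-10}$ is negligible, so the sampled spectrum should sit inside $[{\tt lower}, {\tt upper}]$.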
17 SUPPLEMENT 7 Lemma A.2. Denote by b S the solution to the reduced SLOPE problem 4.6 with T = S and λ = λ ɛ. Keep the assumptions from Lemma A.9, and additionally assume k / min{n, p} 0. Then there exists a constant C q only depending on q such that X X S S β S b k S C q log p λ BH n k +,..., λbh p with probability tending to one. Proof of Lemma A.2. In this proof, C is a constant that only depends on q and whose value may change at each occurrence. Rearrange the objective term as X S X S β S b S = X S X S X S X S X S y X S bs X S z = X S Q QX S X S X S X S y X S bs X S z = X S Q ξ, where ξ := QX S X S X S X S y X S bs X S z. For future usage, note that ξ only depends on w and W S and is, therefore, independent of W S. We begin by bounding ξ. It follows from the KKT condition of SLOPE that X S y X S bs is majorized by λ [k ]. Hence, it follows from Fact 3. that A.28 X S y X S bs λ [k ]. Lemma A. with t = /2 gives A.29 X S X S X S /n k /n /2 < 2.0 with probability at least e n/8 for sufficiently large p, where in the last step we have used k /n 0. Hence, from A.28 and A.29 we get ξ X S X S X S X S y X S bs X S z 2.0 λ [k ] + 4 2k logp/k A ɛ C k logp/k = C k logp/k
18 8 W. SU AND E. CANDÈS with probability at least e n/2 2ek /p k e n/8 ; we used Lemma A.7 in the second line and Lemma A.8 in the third. A.30 will help us in finishing the proof. Write A.3 X X S S β S b S = X Q ξ = W ξ = w, 0 ξ+ 0, W S ξ. S S S It follows from Lemma A.9 that w S is majorized by λ BH k +, λbh k +2,..., λbh p / n in probability. As a result, the first term in the right-hand side obeys A.32 w, 0 k ξ = ξ S w ξ w C S S n log p λ BH k k +, λbh k +2,..., λbh p with probability tending to one. For the second term, by exploiting the independence between ξ and W S, we have 0, W S ξ = d ξ ξn 2 ζ,..., ζ p k, n where ζ,..., ζ p k are i.i.d. N 0, /n. Since k /p 0, applying Lemma A.0 gives log p ζ,..., ζ p k C λ BH logp/k k +,..., λbh p with probability approaching one. Hence, owing to A.30, A.33 0, W k S ξ C log p λ BH n k +,..., λbh p holds with probability approaching one. Finally, combining A.32 and A.33 gives that X X S S β S b S = w, 0 ξ + 0, W S ξ S k C n log p k k + log p λbh k n +,..., λbh p k C log p λ BH n k +,..., λbh p holds with probability tending to one.
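The BHq critical values $\lambda^{\mathrm{BH}}_i = \Phi^{-1}(1 - qi/(2p))$ used throughout this appendix, together with the sorted-$\ell_1$ prox, can be sketched numerically. The prox routine below is a stack-based pool-adjacent-violators pass in the spirit of FastProxSL1 from [2] (a sketch, not the paper's implementation); the final lines exercise Fact 3.3 (an input majorized by $\lambda$ has prox zero) and soft thresholding of a single spike.

```python
import numpy as np
from statistics import NormalDist

def lambda_bh(p, q):
    # BH critical values: lambda_i = Phi^{-1}(1 - q*i/(2p)), nonincreasing in i
    nd = NormalDist()
    return np.array([nd.inv_cdf(1 - q * i / (2 * p)) for i in range(1, p + 1)])

def prox_sorted_l1(y, lam):
    # prox of b -> sum_i lam_i * |b|_(i): sort |y| decreasingly, subtract lam,
    # then project onto the nonincreasing nonnegative cone by pooling
    # adjacent violators; finally undo the sort and restore signs
    sign = np.sign(y)
    a = np.abs(y)
    order = np.argsort(-a)
    v = a[order] - lam
    sums, lens = [], []
    for x in v:
        sums.append(x)
        lens.append(1)
        while len(sums) > 1 and sums[-2] / lens[-2] <= sums[-1] / lens[-1]:
            s, l = sums.pop(), lens.pop()
            sums[-1] += s
            lens[-1] += l
    out = np.maximum(np.concatenate(
        [np.full(l, s / l) for s, l in zip(sums, lens)]), 0.0)
    res = np.empty_like(out)
    res[order] = out
    return sign * res

lam = lambda_bh(5, q=0.2)                 # positive, strictly decreasing
majorized = np.full(5, lam[-1])           # its cumulative sums stay below lam's
maj_prox = prox_sorted_l1(majorized, lam)  # Fact 3.3 predicts exactly zero
spike = np.zeros(5)
spike[2] = lam[0] + 1.0
spike_prox = prox_sorted_l1(spike, lam)    # largest entry shrunk by lam[0]
```

The spike example illustrates that a single nonzero coordinate is simply soft-thresholded by the largest weight $\lambda_1$.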
19 A.4. Proofs for Section 5. SUPPLEMENT 9 Lemma A.3. Keep the assumptions from Lemma 5. and let ζ,..., ζ p be i.i.d. N 0,. Then in probability. # {2 i p : ζ i > τ + ζ } Proof of Lemma A.3. With probability tending to one, τ := τ + ζ also obeys τ / 2 log p and 2 log p τ. This shows that we only need to prove a simpler version of this lemma, namely, # { i p : ζ i > τ} in probability. Put = 2 log p τ = o 2 log p and a = Pξ > τ. Then, # { i p : ζ i > τ} is a binomial random variable with p trials and success probability a. Hence, it suffices to demonstrate that ap. To this end, note that a = Φτ τ 2π e τ2 2 2 log p e log p 2 /2+ 2 log p which gives ap = p 2 log p 2 log p e+o, 2 log p e +o 2 log p. Since in fact, it is sufficient to have bounded away from 0 from below, we have e +o 2 log p / 2 log p, as we wish. Proof of Lemma 5.. For sufficiently large p, 2 ɛ log p ɛ/2τ 2. Hence, it is sufficient to show P π β β 2 ɛ/2τ 2 0 uniformly for all estimators β. Letting I be the random coordinate, β β 2 = j I β 2 j + β I τ 2 = β 2 + τ 2 2τ β I, which is smaller than or equal to ɛ/2τ 2 if and only if β I 2 β 2 + ɛτ 2. 4τ
20 20 W. SU AND E. CANDÈS Denote by A = Ay; β the set of all i {,..., p} such that β i 2 β 2 + ɛτ 2 /4τ, and let b be the minimum value of these β i. Then which gives b 2 β 2 + ɛτ 2 4τ 2 A b 2 + ɛτ 2 4τ 2 2 A b 2 ɛτ 2, 4τ A.34 A 2/ɛ. Recall that β β 2 ɛ/2τ 2 if and only if I is among these A components. Hence, A.35 P π β β 2 ɛ/2τ 2 y = P π I A y = i A P π I = i y = eτy i p, i A i= eτy i where we use the fact that A is almost surely determined by y. Since A.35 is maximal if A is the set of indices with the largest y i s, A.35 and A.34 together yield P π β β 2 ɛ/2τ 2 P π y I = τ + z I is at least the 2/ɛ th largest among y,..., y p 0, where the last step is provided by Lemma A.3. Proof of Lemma 5.3. To closely follow the proof of Lemma 5., denote by A = Ay, X; β the set of all i {,..., p} such that β i 2 β 2 + ɛα 2 τ 2 /4ατ, and keep the same notation b as before. Then β β 2 ɛ/2α 2 τ 2 if and only if I A. Hence, A.36 P π β β 2 ɛ/2α 2 τ 2 y, X = P π I A y, X = i A P π I = i y, X = i A exp ατx i y α2 τ 2 X i 2 /2 p i= exp ατx i y α2 τ 2 X i 2 /2 and this quantity is maximal if A is the set of indices i with the largest values of X i y/α τ X i 2 /2. As shown in Lemma 5., A 2/ɛ, which
21 gives SUPPLEMENT 2 A.37 P π β β 2 ɛ/2α 2 τ 2 P π X Iy/α τ X I 2 /2 is at least the 2/ɛ th largest. We complete the proof by showing that the probability in the right-hand side of A.37 is negligible uniformly over all estimators β as p. By the independence between I and X, z, we can assume I = while evaluating this probability. With this in mind, we aim to show that there are sufficiently many i s such that X iy/α τ 2 X i 2 X y/α+ τ 2 X 2 = X i z/α + τx τ 2 X i 2 X z/α τ 2 X 2 is positive. Since X z/α + τ 2 X 2 = O P /α + τ 2 + OP / n, it suffices to show that A.38 { # 2 i p : X i z/α + τx τ 2 X i 2 > C α + τ 2 + C } 2τ 2/ɛ n holds with vanishing probability for all positive constants C, C 2. By the independence between X i and z/α + τx, we can replace z/α + τx by z/α + τx, 0,..., 0 in A.38. That is, X i z/α + τx τ 2 X i 2 d = z/α + τx X i, τ 2 X2 i, τ 2 X i, 2, where X i, R n is X i without the first entry. To this end, we point out that the following three events all happen with probability tending to one: A.39 #{2 i p : X i, }/p /2, max Xi, 2 2 log p i n, n z/α + τx log p /α. Making use of this and A.38, we only need to show that { N # 2 i 0.49p : nxi, log p/n > τ log p α n nxi, > τ + τ log p # { 2 i 0.49p : log p/n α n + τ 2 + C α + τ 2 + C } 2τ n + C α + C } 2τ n
22 22 W. SU AND E. CANDÈS obeys A.40 N 2/ɛ with vanishing probability. The first line of A.39 shows that there are at least 0.49p many i s such that X i, and we assume they correspond to indices 2 i 0.49p without loss of generality. Note that N is independent of all X i, s. Observe that τ := τ + τlog p/n + C /α + C 2 τ/ n log p = α + 2 τ + O log p/n /α n for sufficiently large p to ensure log p/n is small. Hence, plugging the specific choice of τ and using α, we obtain τ + 2 log p/n τ + O 2 log p log 2 log p + O, which reveals that 2 log0.49p τ = 2 log p τ + o. Since nxn,i are i.i.d. N 0,, Lemma A.3 validates A.40. Proof of Corollary.5. Let c > 0 be a sufficiently small constant to be determined later. It is sufficient to prove the claim with p replaced by a possibly smaller value given by p := min{ cn, p} if we knew that β i = 0 for p + i p, the loss of any estimator X β would not increase after projecting onto the linear space spanned by the first p columns. Hereafter, we assume X R n p and β R p. Observe that p = On implies p = Op and, therefore, A.4 logp /k logp/k. In particular, k/p 0 and n/ logp /k. This suggests that we can apply Theorem 5.4 to our problem, obtaining inf sup P β β 2 β β 0 k 2k logp /k > ɛ. for every constant ɛ > 0. Because of A.4, we also have A.42 inf sup P β β 2 β β 0 k 2k logp/k > ɛ for any ɛ > 0.
Since $p'/n \le c$, the smallest singular value of the Gaussian random matrix $X$ is at least $1 - \sqrt{c} + o_P(1)$ (see, for example, [5]). This result, together with (A.42), yields
$$\inf_{\widehat\beta}\ \sup_{\|\beta\|_0\le k} P\bigg( \frac{\|X\widehat\beta - X\beta\|^2}{2k\log(p/k)} > (1-\sqrt{c})^2(1-\epsilon') \bigg) \to 1$$
for each $\epsilon' > 0$. Finally, choose $c$ and $\epsilon'$ sufficiently small such that $(1-\sqrt{c})^2(1-\epsilon') > 1 - \epsilon$.

REFERENCES

[1] M. Abramowitz and I. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Courier Corporation, 2012.
[2] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE—adaptive variable selection via convex optimization. arXiv preprint, 2014.
[3] S. Boucheron and M. Thomas. Concentration inequalities for order statistics. Electronic Communications in Probability, 2012.
[4] L. de Haan and A. Ferreira. Extreme Value Theory: An Introduction. Springer Science & Business Media.
[5] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications, 2012.

Departments of Statistics, Stanford University, Stanford, CA, USA. E-mail: wjsu@stanford.edu
Departments of Statistics and Mathematics, Stanford University, Stanford, CA, USA. E-mail: candes@stanford.edu
Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued
More information3 Integration and Expectation
3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ
More informationFalse Discovery Rate
False Discovery Rate Peng Zhao Department of Statistics Florida State University December 3, 2018 Peng Zhao False Discovery Rate 1/30 Outline 1 Multiple Comparison and FWER 2 False Discovery Rate 3 FDR
More informationEstimates for probabilities of independent events and infinite series
Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences
More informationSupplementary Materials to Convex Banding of the Covariance Matrix
Supplementary Materials to Convex Banding of the Covariance Matrix Jacob Bien, Florentina Bunea, Luo Xiao May 25, 2015 A.1 Dual problem Define LΣ, A 1,..., A = 1 2 S Σ 2 F + λ W l A l, Σ. Observe that
More informationWeak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij
Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real
More informationABELIAN SELF-COMMUTATORS IN FINITE FACTORS
ABELIAN SELF-COMMUTATORS IN FINITE FACTORS GABRIEL NAGY Abstract. An abelian self-commutator in a C*-algebra A is an element of the form A = X X XX, with X A, such that X X and XX commute. It is shown
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationPOISSON PROCESSES 1. THE LAW OF SMALL NUMBERS
POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationOptimisation Combinatoire et Convexe.
Optimisation Combinatoire et Convexe. Low complexity models, l 1 penalties. A. d Aspremont. M1 ENS. 1/36 Today Sparsity, low complexity models. l 1 -recovery results: three approaches. Extensions: matrix
More informationInference for High Dimensional Robust Regression
Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More information4 Sums of Independent Random Variables
4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables
More informationRandom Bernstein-Markov factors
Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit
More informationC.7. Numerical series. Pag. 147 Proof of the converging criteria for series. Theorem 5.29 (Comparison test) Let a k and b k be positive-term series
C.7 Numerical series Pag. 147 Proof of the converging criteria for series Theorem 5.29 (Comparison test) Let and be positive-term series such that 0, for any k 0. i) If the series converges, then also
More informationLecture 21 Representations of Martingales
Lecture 21: Representations of Martingales 1 of 11 Course: Theory of Probability II Term: Spring 215 Instructor: Gordan Zitkovic Lecture 21 Representations of Martingales Right-continuous inverses Let
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More informationCHAPTER III THE PROOF OF INEQUALITIES
CHAPTER III THE PROOF OF INEQUALITIES In this Chapter, the main purpose is to prove four theorems about Hardy-Littlewood- Pólya Inequality and then gives some examples of their application. We will begin
More informationBALANCING GAUSSIAN VECTORS. 1. Introduction
BALANCING GAUSSIAN VECTORS KEVIN P. COSTELLO Abstract. Let x 1,... x n be independent normally distributed vectors on R d. We determine the distribution function of the minimum norm of the 2 n vectors
More informationMATH 117 LECTURE NOTES
MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set
More informationThe Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression
The Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression Cun-hui Zhang and Jian Huang Presenter: Quefeng Li Feb. 26, 2010 un-hui Zhang and Jian Huang Presenter: Quefeng The Sparsity
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationOn a class of stochastic differential equations in a financial network model
1 On a class of stochastic differential equations in a financial network model Tomoyuki Ichiba Department of Statistics & Applied Probability, Center for Financial Mathematics and Actuarial Research, University
More informationAn introduction to some aspects of functional analysis
An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms
More informationTrace Class Operators and Lidskii s Theorem
Trace Class Operators and Lidskii s Theorem Tom Phelan Semester 2 2009 1 Introduction The purpose of this paper is to provide the reader with a self-contained derivation of the celebrated Lidskii Trace
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationX n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)
14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence
More informationA Randomized Algorithm for the Approximation of Matrices
A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive
More informationLeast singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA)
Least singular value of random matrices Lewis Memorial Lecture / DIMACS minicourse March 18, 2008 Terence Tao (UCLA) 1 Extreme singular values Let M = (a ij ) 1 i n;1 j m be a square or rectangular matrix
More informationDynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor)
Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Matija Vidmar February 7, 2018 1 Dynkin and π-systems Some basic
More informationLinear Programming. Larry Blume Cornell University, IHS Vienna and SFI. Summer 2016
Linear Programming Larry Blume Cornell University, IHS Vienna and SFI Summer 2016 These notes derive basic results in finite-dimensional linear programming using tools of convex analysis. Most sources
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More informationMATH 205C: STATIONARY PHASE LEMMA
MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationAsymptotic behavior of infinity harmonic functions near an isolated singularity
Asymptotic behavior of infinity harmonic functions near an isolated singularity Ovidiu Savin, Changyou Wang, Yifeng Yu Abstract In this paper, we prove if n 2 x 0 is an isolated singularity of a nonegative
More informationOPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS
OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony
More informationSplitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches
Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationUNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems
UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction
More informationBrownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory
More informationFundamental Domains, Lattice Density, and Minkowski Theorems
New York University, Fall 2013 Lattices, Convexity & Algorithms Lecture 3 Fundamental Domains, Lattice Density, and Minkowski Theorems Lecturers: D. Dadush, O. Regev Scribe: D. Dadush 1 Fundamental Parallelepiped
More informationSPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS
SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint
More information21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)
10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC
More informationBranch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems
Branch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems Yongjia Song James R. Luedtke August 9, 2012 Abstract We study solution approaches for the design of reliably
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationLecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1
Random Walks and Brownian Motion Tel Aviv University Spring 011 Lecture date: May 0, 011 Lecture 9 Instructor: Ron Peled Scribe: Jonathan Hermon In today s lecture we present the Brownian motion (BM).
More informationOn the smoothness of the conjugacy between circle maps with a break
On the smoothness of the conjugacy between circle maps with a break Konstantin Khanin and Saša Kocić 2 Department of Mathematics, University of Toronto, Toronto, ON, Canada M5S 2E4 2 Department of Mathematics,
More informationRANDOM WALKS WHOSE CONCAVE MAJORANTS OFTEN HAVE FEW FACES
RANDOM WALKS WHOSE CONCAVE MAJORANTS OFTEN HAVE FEW FACES ZHIHUA QIAO and J. MICHAEL STEELE Abstract. We construct a continuous distribution G such that the number of faces in the smallest concave majorant
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationOptimization Theory. A Concise Introduction. Jiongmin Yong
October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization
More informationIf Y and Y 0 satisfy (1-2), then Y = Y 0 a.s.
20 6. CONDITIONAL EXPECTATION Having discussed at length the limit theory for sums of independent random variables we will now move on to deal with dependent random variables. An important tool in this
More informationErdős-Renyi random graphs basics
Erdős-Renyi random graphs basics Nathanaël Berestycki U.B.C. - class on percolation We take n vertices and a number p = p(n) with < p < 1. Let G(n, p(n)) be the graph such that there is an edge between
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY Uplink Downlink Duality Via Minimax Duality. Wei Yu, Member, IEEE (1) (2)
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY 2006 361 Uplink Downlink Duality Via Minimax Duality Wei Yu, Member, IEEE Abstract The sum capacity of a Gaussian vector broadcast channel
More informationOptimal Power Control in Decentralized Gaussian Multiple Access Channels
1 Optimal Power Control in Decentralized Gaussian Multiple Access Channels Kamal Singh Department of Electrical Engineering Indian Institute of Technology Bombay. arxiv:1711.08272v1 [eess.sp] 21 Nov 2017
More informationTHROUGHOUT this paper, we let C be a nonempty
Strong Convergence Theorems of Multivalued Nonexpansive Mappings and Maximal Monotone Operators in Banach Spaces Kriengsak Wattanawitoon, Uamporn Witthayarat and Poom Kumam Abstract In this paper, we prove
More informationHILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define
HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,
More informationSupplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data
Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department
More informationAsymptotics of minimax stochastic programs
Asymptotics of minimax stochastic programs Alexander Shapiro Abstract. We discuss in this paper asymptotics of the sample average approximation (SAA) of the optimal value of a minimax stochastic programming
More informationConcentration Inequalities
Chapter Concentration Inequalities I. Moment generating functions, the Chernoff method, and sub-gaussian and sub-exponential random variables a. Goal for this section: given a random variable X, how does
More informationLasso Screening Rules via Dual Polytope Projection
Journal of Machine Learning Research 6 (5) 63- Submitted /4; Revised 9/4; Published 5/5 Lasso Screening Rules via Dual Polytope Projection Jie Wang Department of Computational Medicine and Bioinformatics
More informationInformation-Theoretic Limits of Matrix Completion
Information-Theoretic Limits of Matrix Completion Erwin Riegler, David Stotz, and Helmut Bölcskei Dept. IT & EE, ETH Zurich, Switzerland Email: {eriegler, dstotz, boelcskei}@nari.ee.ethz.ch Abstract We
More informationAppendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)
Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem
More informationarxiv: v1 [math.st] 26 Jun 2011
The Shape of the Noncentral χ 2 Density arxiv:1106.5241v1 [math.st] 26 Jun 2011 Yaming Yu Department of Statistics University of California Irvine, CA 92697, USA yamingy@uci.edu Abstract A noncentral χ
More informationFALSE DISCOVERIES OCCUR EARLY ON THE LASSO PATH. Weijie Su Małgorzata Bogdan Emmanuel Candès
FALSE DISCOVERIES OCCUR EARLY ON THE LASSO PATH By Weijie Su Małgorzata Bogdan Emmanuel Candès Technical Report No. 2015-20 November 2015 Department of Statistics STANFORD UNIVERSITY Stanford, California
More informationZ Algorithmic Superpower Randomization October 15th, Lecture 12
15.859-Z Algorithmic Superpower Randomization October 15th, 014 Lecture 1 Lecturer: Bernhard Haeupler Scribe: Goran Žužić Today s lecture is about finding sparse solutions to linear systems. The problem
More informationSupremum of simple stochastic processes
Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there
More informationTHE DVORETZKY KIEFER WOLFOWITZ INEQUALITY WITH SHARP CONSTANT: MASSART S 1990 PROOF SEMINAR, SEPT. 28, R. M. Dudley
THE DVORETZKY KIEFER WOLFOWITZ INEQUALITY WITH SHARP CONSTANT: MASSART S 1990 PROOF SEMINAR, SEPT. 28, 2011 R. M. Dudley 1 A. Dvoretzky, J. Kiefer, and J. Wolfowitz 1956 proved the Dvoretzky Kiefer Wolfowitz
More informationOptimality Conditions for Constrained Optimization
72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationConvex Optimization Notes
Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =
More informationOn Backward Product of Stochastic Matrices
On Backward Product of Stochastic Matrices Behrouz Touri and Angelia Nedić 1 Abstract We study the ergodicity of backward product of stochastic and doubly stochastic matrices by introducing the concept
More informationDuality of multiparameter Hardy spaces H p on spaces of homogeneous type
Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Yongsheng Han, Ji Li, and Guozhen Lu Department of Mathematics Vanderbilt University Nashville, TN Internet Analysis Seminar 2012
More informationThe Codimension of the Zeros of a Stable Process in Random Scenery
The Codimension of the Zeros of a Stable Process in Random Scenery Davar Khoshnevisan The University of Utah, Department of Mathematics Salt Lake City, UT 84105 0090, U.S.A. davar@math.utah.edu http://www.math.utah.edu/~davar
More informationChapter 1. Preliminaries
Introduction This dissertation is a reading of chapter 4 in part I of the book : Integer and Combinatorial Optimization by George L. Nemhauser & Laurence A. Wolsey. The chapter elaborates links between
More informationMATH 426, TOPOLOGY. p 1.
MATH 426, TOPOLOGY THE p-norms In this document we assume an extended real line, where is an element greater than all real numbers; the interval notation [1, ] will be used to mean [1, ) { }. 1. THE p
More informationCS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018
CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 1 Overview This lecture mainly covers Recall the statistical theory of GANs
More informationStability and Robustness of Weak Orthogonal Matching Pursuits
Stability and Robustness of Weak Orthogonal Matching Pursuits Simon Foucart, Drexel University Abstract A recent result establishing, under restricted isometry conditions, the success of sparse recovery
More informationAgenda. Applications of semidefinite programming. 1 Control and system theory. 2 Combinatorial and nonconvex optimization
Agenda Applications of semidefinite programming 1 Control and system theory 2 Combinatorial and nonconvex optimization 3 Spectral estimation & super-resolution Control and system theory SDP in wide use
More informationTheorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1
Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be
More informationNotes 6 : First and second moment methods
Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative
More information