SUPPLEMENT TO SLOPE IS ADAPTIVE TO UNKNOWN SPARSITY AND ASYMPTOTICALLY MINIMAX. By Weijie Su and Emmanuel Candès Stanford University
APPENDIX A: PROOFS OF TECHNICAL RESULTS

As is standard, we write $a_n \asymp b_n$ for two positive sequences $a_n$ and $b_n$ if there exist two constants $C_1$ and $C_2$, possibly depending on $q$, such that $C_1 a_n \le b_n \le C_2 a_n$ for all $n$. Also, we write $a_n \sim b_n$ if $a_n/b_n \to 1$.

A.1. Proofs for Section 2. We remind the reader that the proofs in this subsection rely on some lemmas to be stated later in the Appendix.

Proof of (2.1). For simplicity, denote by $\widehat\beta$ the full Lasso solution $\widehat\beta_{\mathrm{Lasso}}$, and by $\widehat b_S$ the solution to the reduced Lasso problem
$$\text{minimize}_{b \in \mathbb{R}^k} \ \tfrac12 \|y - X_S b\|^2 + \lambda \|b\|_1,$$
where $S$ is the support of the ground truth $\beta$. We show that (i)
$$\|X_{\bar S}' z\|_\infty \le (1 + c/2)\sqrt{2 \log p} \tag{A.1}$$
and (ii)
$$\|X_{\bar S}' X_S (\beta_S - \widehat b_S)\|_\infty < C_1 \sqrt{k \log^2 p / n} \tag{A.2}$$
for some constant $C_1$ both happen with probability tending to one. Now observe that
$$X_{\bar S}'(y - X_S \widehat b_S) = X_{\bar S}' z + X_{\bar S}' X_S(\beta_S - \widehat b_S).$$
Hence, combining (A.1) and (A.2) and using the fact that $k \log p / n \to 0$ gives
$$\|X_{\bar S}'(y - X_S \widehat b_S)\|_\infty \le \|X_{\bar S}' X_S(\beta_S - \widehat b_S)\|_\infty + \|X_{\bar S}' z\|_\infty \le C_1\sqrt{k \log^2 p/n} + (1 + c/2)\sqrt{2\log p} = o\big(\sqrt{2\log p}\big) + (1 + c/2)\sqrt{2\log p} < (1 + c)\sqrt{2\log p}$$

[Footnotes: Supported in part by a General Wang Yaowu Stanford Graduate Fellowship. Supported in part by NSF under grant CCF and by the Math + X Award from the Simons Foundation.]
with probability approaching one. This last inequality, together with the fact that $\widehat b_S$ obeys the KKT conditions for the reduced Lasso problem, implies that $\widehat b_S$ padded with zeros on $\bar S$ obeys the KKT conditions for the full Lasso problem and is, therefore, a solution.

We need to justify (A.1) and (A.2). First, Lemmas A.6 and A.5 imply (A.1). Next, to show (A.2), we rewrite the left-hand side of (A.2) as
$$X_{\bar S}' X_S(\beta_S - \widehat b_S) = X_{\bar S}' X_S (X_S' X_S)^{-1} \big( X_S'(y - X_S \widehat b_S) - X_S' z \big).$$
By Lemma A.7, we have that
$$\big\| X_S'(y - X_S \widehat b_S) - X_S' z \big\| \le \sqrt{k}\,\lambda + \|X_S' z\| \le \sqrt{k}\,\lambda + \sqrt{32 k \log(p/k)} \le C\sqrt{k \log p}$$
holds with probability at least $1 - e^{-n/2} - (2ek/p)^k$. In addition, Lemma A.11 with $t = 1/2$ gives
$$\|X_S (X_S' X_S)^{-1}\| = 1/\sigma_{\min}(X_S) \le \big(1 - \sqrt{k/n} - 1/2\big)^{-1} < 3$$
with probability at least $1 - e^{-n/8}$. Hence, from the last two inequalities it follows that
$$\big\| X_S (X_S' X_S)^{-1} \big( X_S'(y - X_S \widehat b_S) - X_S' z \big) \big\| \le C\sqrt{k \log p} \tag{A.3}$$
with probability at least $1 - e^{-n/2} - (2ek/p)^k - e^{-n/8}$. Since $X_{\bar S}$ is independent of $X_S (X_S' X_S)^{-1}\big(X_S'(y - X_S \widehat b_S) - X_S' z\big)$, Lemma A.6 gives
$$\Big\| X_{\bar S}' X_S(X_S'X_S)^{-1}\big(X_S'(y - X_S \widehat b_S) - X_S' z\big) \Big\|_\infty \le \sqrt{\frac{2\log p}{n}}\; \Big\| X_S(X_S'X_S)^{-1}\big(X_S'(y-X_S\widehat b_S) - X_S'z\big) \Big\|$$
with probability approaching one. Combining this with (A.3) gives (A.2).

Let $\widetilde b_S$ be the solution to
$$\text{minimize}_{b\in\mathbb{R}^k}\ \tfrac12\big\|\beta_S + X_S'z - b\big\|^2 + \lambda\|b\|_1.$$
To complete the proof of (2.1), it suffices to establish (i) that for any constant $\delta > 0$,
$$\sup_{\|\beta\|_0 \le k} P\Big( \|\widetilde b_S - \beta_S\|^2 > (1-\delta)\, 2(1+c)^2 k\log p \Big) \to 1, \tag{A.4}$$
and (ii)
$$\|\widehat b_S - \widetilde b_S\| = o_P\big(\|\widetilde b_S - \beta_S\|\big), \tag{A.5}$$
since (A.4) and (A.5) give
$$\sup_{\|\beta\|_0 \le k} P\Big( \|\widehat b_S - \beta_S\|^2 > (1-\delta)\, 2(1+c)^2 k \log p \Big) \to 1 \tag{A.6}$$
for each $\delta > 0$. Note that taking $\delta = 1/(1+c)^2$ in (A.6) and using the fact that $\widehat b_S$ (padded with zeros) is a solution to the Lasso with probability approaching one finish the proof.

Proof of (A.4). Let $\beta_i = M$ if $i \in S$ and otherwise zero (treat $M$ as a sufficiently large positive constant). For each $i \in S$, $\widetilde b_{S,i} = \beta_i + X_i'z - \lambda$, and
$$|\widetilde b_{S,i} - \beta_i| = |X_i'z - \lambda| \ge \lambda - |X_i'z|.$$
On the event $\{\max_{i\in S} |X_i'z| \le \lambda\}$, which happens with probability tending to one, this inequality gives
$$\|\widetilde b_S - \beta_S\|^2 \ge \sum_{i\in S}\big(\lambda - |X_i'z|\big)^2 = k\lambda^2 - 2\lambda\sum_{i\in S}|X_i'z| + \sum_{i\in S}(X_i'z)^2 = (1 + o_P(1))\, 2(1+c)^2 k\log p,$$
where we have used that both $\sum_{i\in S}(X_i'z)^2$ and $\sum_{i\in S}|X_i'z|$ are $O_P(k)$. This proves the claim.

Proof of (A.5). Apply Lemma 4.2 with $T$ replaced by $S$ (here each of $\widehat b_S$, $\widetilde b_S$ and $\beta$ is supported on $S$). Since $k/p \to 0$, for any constant $\delta > 0$, all the singular values of $X_S$ lie in $(1-\delta, 1+\delta)$ with overwhelming probability (see, for example, [5]). Consequently, Lemma 4.2 ensures (A.5).

Proof of (2.2). We assume $\sigma = 1$ and put $\lambda = \lambda^{\mathrm{BH}}$. As in the proof of Theorem 1.1, we decompose the total loss as
$$\|\widehat\beta_{\mathrm{Seq}} - \beta\|^2 = \|\widehat\beta_{\mathrm{Seq},S} - \beta_S\|^2 + \|\widehat\beta_{\mathrm{Seq},\bar S} - \beta_{\bar S}\|^2 = \|\widehat\beta_{\mathrm{Seq},S} - \beta_S\|^2 + \|\widehat\beta_{\mathrm{Seq},\bar S}\|^2.$$
The largest possible value of the loss off support is achieved when $y_{\bar S}$ is sequentially soft-thresholded by $(\lambda_{k+1}, \ldots, \lambda_p)$. Hence, by the proof of Lemma 3.3, we obtain
$$\mathbb{E}\,\|\widehat\beta_{\mathrm{Seq},\bar S}\|^2 = o\big(2k\log(p/k)\big)$$
for all $k$-sparse $\beta$.
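The scalar soft-thresholding identity invoked in the proof of (A.4) above can be illustrated with a minimal numerical sketch; note that `beta_i`, `noise`, and `lam` below are arbitrary illustrative values, not quantities from the paper.

```python
import numpy as np

def soft_threshold(x, lam):
    # the Lasso prox: shrink toward zero by lam and clip at zero
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# For a coordinate with a sufficiently large positive mean beta_i,
# soft thresholding y_i = beta_i + <X_i, z> gives beta_i + <X_i, z> - lam,
# so the per-coordinate error is exactly <X_i, z> - lam.
lam = 3.0
beta_i = 100.0   # plays the role of the "sufficiently large" constant
noise = 0.7      # stands in for X_i'z
fitted = soft_threshold(beta_i + noise, lam)
error = fitted - beta_i
```

Squaring `error` and summing over $k$ such coordinates reproduces the $k\lambda^2$ leading term of the on-support loss.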
Now, we turn to consider the loss on support. For any $i \in S$, the loss is at most
$$\big(|z_i| + \lambda_{r_i}\big)^2 = \lambda_{r_i}^2 + z_i^2 + 2|z_i|\,\lambda_{r_i}.$$
Summing the above over all $i \in S$ gives
$$\mathbb{E}\,\|\widehat\beta_{\mathrm{Seq},S} - \beta_S\|^2 \le \sum_{i=1}^k \lambda_i^2 + \mathbb{E}\sum_{i\in S} z_i^2 + 2\,\mathbb{E}\sum_{i\in S} |z_i|\,\lambda_{r_i}.$$
Note that the first term $\sum_{i=1}^k \lambda_i^2 = (1 + o(1))\, 2k\log(p/k)$, and the second term has expectation $\mathbb{E}\sum_{i\in S} z_i^2 = k = o(2k\log(p/k))$, so that it suffices to show that
$$\mathbb{E}\Big[\, 2\sum_{i\in S} |z_i|\,\lambda_{r_i} \Big] = o\big(2k\log(p/k)\big). \tag{A.7}$$
We emphasize that both $z_i$ and $r_i$ are random, so that $\{\lambda_{r_i}\}_{i\in S}$ and $\{|z_i|\}_{i\in S}$ may not be independent. Without loss of generality, assume $S = \{1, \ldots, k\}$ and, for $1 \le i \le k$, let $r_i'$ be the rank of the $i$th observation among the first $k$. Since $\lambda$ is nonincreasing and $r_i' \le r_i$, we have
$$\sum_{1\le i\le k} |z_i|\,\lambda_{r_i} \le \sum_{1\le i\le k} |z_i|\,\lambda_{r_i'} \le \sum_{1\le i\le k} |z|_{(i)}\,\lambda_i,$$
where $|z|_{(1)} \ge \cdots \ge |z|_{(k)}$ are the order statistics of $|z_1|, \ldots, |z_k|$. The second inequality follows from the fact that for any nonnegative sequences $\{a_i\}$ and $\{b_i\}$ with $\{b_i\}$ nonincreasing, $\sum_i a_i b_i \le \sum_i a_{(i)} b_i$. Therefore, letting $\zeta_1, \ldots, \zeta_k$ be i.i.d. $\mathcal{N}(0,1)$, (A.7) follows from the estimate
$$\sum_{i=1}^k \lambda_i\, \mathbb{E}\,|\zeta|_{(i)} = o\big(2k\log(p/k)\big). \tag{A.8}$$
To argue about (A.8), we work with the approximations $\lambda_i \asymp \sqrt{2\log(p/i)}$ and $\mathbb{E}\,|\zeta|_{(i)} = O\big(\sqrt{\log(2k/i)}\big)$ (see e.g. (A.15)), so that the claim is a consequence of
$$\sum_{i=1}^k \sqrt{\log(p/i)\,\log(2k/i)} = o\big(2k\log(p/k)\big),$$
which is justified as follows:
$$\sum_{i=1}^k \sqrt{\log(p/i)\,\log(2k/i)} \le \int_0^k \sqrt{\log(p/x)\,\log(2k/x)}\, dx \le \int_0^k \Big( \sqrt{\log(p/k)\,\log(2k/x)} + \log(2k/x) \Big) dx \le C_1 k + C_2 k\sqrt{\log(p/k)}$$
for some absolute constants $C_1, C_2$. Since $\log(p/k) \to \infty$, it is clear that the right-hand side of the above display is $o(2k\log(p/k))$.

A.2. Proofs for Section 3. To begin with, we derive a dual formulation of the SLOPE program (1.6), which provides a nice geometrical interpretation. This dual formulation will also be used in the proof of Lemma 4.3. Our exposition largely borrows from [2]. Rewrite (1.6) as
$$\text{minimize}_{b,r}\ \tfrac12\|r\|^2 + \sum_i \lambda_i |b|_{(i)} \quad \text{subject to } Xb + r = y, \tag{A.9}$$
whose Lagrangian is
$$L(b, r, \nu) := \tfrac12\|r\|^2 + \sum_i \lambda_i |b|_{(i)} - \nu'(Xb + r - y).$$
Hence, the dual objective is given by
$$\inf_{b,r} L(b,r,\nu) = \nu'y - \sup_r\Big\{\nu'r - \tfrac12\|r\|^2\Big\} - \sup_b\Big\{ (X'\nu)'b - \sum_i \lambda_i |b|_{(i)} \Big\} = \begin{cases} \nu'y - \tfrac12\|\nu\|^2, & \nu \in C_{\lambda,X},\\ -\infty, & \text{otherwise},\end{cases}$$
where $C_{\lambda,X} := \{\nu : X'\nu \text{ is majorized by } \lambda\}$ is a convex polytope. It thus follows that the dual reads
$$\text{maximize}_\nu\ \nu'y - \tfrac12\|\nu\|^2 \quad \text{subject to } \nu \in C_{\lambda,X}. \tag{A.10}$$
The equality $\nu'y - \tfrac12\|\nu\|^2 = -\tfrac12\|y - \nu\|^2 + \tfrac12\|y\|^2$ reveals that the dual solution $\widehat\nu$ is indeed the projection of $y$ onto $C_{\lambda,X}$. The minimization of
the Lagrangian over $r$ is attained at $r = \nu$. This implies that the primal solution $\widehat\beta$ and the dual solution $\widehat\nu$ obey
$$y - X\widehat\beta = \widehat\nu. \tag{A.11}$$
We turn to proving the facts.

Proof of Fact 3.1. Without loss of generality, suppose both $a$ and $b$ are nonnegative and arranged in nonincreasing order. Denote by $T_k(a)$ the sum of the first $k$ terms of $a$, with $T_0(a) \equiv 0$, and similarly for $b$. We have
$$\|a\|^2 = \sum_{k=1}^p a_k\big(T_k(a) - T_{k-1}(a)\big) = \sum_{k=1}^{p-1} T_k(a)(a_k - a_{k+1}) + a_p T_p(a) \le \sum_{k=1}^{p-1} T_k(b)(a_k - a_{k+1}) + a_p T_p(b) = \sum_{k=1}^p a_k b_k.$$
Similarly,
$$\|b\|^2 = \sum_{k=1}^{p-1} T_k(b)(b_k - b_{k+1}) + b_p T_p(b) \ge \sum_{k=1}^{p-1} T_k(a)(b_k - b_{k+1}) + b_p T_p(a) = \sum_{k=1}^p a_k b_k,$$
which proves the claim.

Proofs of Facts 3.2 and 3.3. Taking $X = I_p$ in the dual formulation, (A.11) immediately implies that $a - \mathrm{prox}_\lambda(a)$ is the projection of $a$ onto the polytope $C_{\lambda,I_p}$. By definition, $C_{\lambda,I_p}$ consists of all vectors majorized by $\lambda$. Hence, $a - \mathrm{prox}_\lambda(a)$ is always majorized by $\lambda$. In particular, if $a$ is majorized by $\lambda$, then the projection $a - \mathrm{prox}_\lambda(a)$ of $a$ is identical to $a$ itself. This gives $\mathrm{prox}_\lambda(a) = 0$.

Proof of Fact 3.4. Assume $a$ is nonnegative without loss of generality. It is intuitively obvious that $b \succeq a \Rightarrow \mathrm{prox}_\lambda(b) \succeq \mathrm{prox}_\lambda(a)$, where as usual $b \succeq a$ means that $b - a \in \mathbb{R}^p_+$. In other words, if the observations increase, the fitted values do not decrease. To save time, we directly verify this claim by using Algorithm 3 (FastProxSL1) from [2]. By the averaging step of that algorithm, we can see that for each $1 \le i, j \le p$,
$$\frac{\partial\,[\mathrm{prox}_\lambda(a)]_i}{\partial a_j} = \begin{cases} \dfrac{1}{\#\{1 \le k \le p : [\mathrm{prox}_\lambda(a)]_k = [\mathrm{prox}_\lambda(a)]_j\}}, & [\mathrm{prox}_\lambda(a)]_j = [\mathrm{prox}_\lambda(a)]_i > 0,\\[2mm] 0, & \text{otherwise}. \end{cases}$$
This holds for all $a \in \mathbb{R}^p$ except for a set of measure zero. The nonnegativity of $\partial[\mathrm{prox}_\lambda(a)]_i / \partial a_j$, along with the Lipschitz continuity of the prox, implies
7 SUPPLEMENT 7 the monotonicity property. A consequence is that [prox λ a] T does not decrease as we let a i for all i T. In the limit, [prox λ a] T monotonically converges to prox λ T at. This gives the desired inequality. As a remark, we point out that the proofs of Facts 3.2 and 3.3 suggest a very simple proof of Lemma 3.. Since a prox λ a is the projection of a onto C λ,ip, prox λ a is thus the distance between a and the polytope C λ,ip. Hence, it suffices to find a point in the polytope at a distance of a λ + away from a. The point b defined as b i = min{ a i, λ i } does the job. Now, we proceed to prove the preparatory lemmas for Theorem., namely, Lemmas A.3 and A.4. The first two lemmas below can be found in []. Lemma A.. Let U be a Betaa, b random variable. Then E log U = log Γa log Γa + b, where Γ denotes the Gamma function and log Γx is the derivative with respect to x. Lemma A.2. For any integer m, m log Γm = γ + j = log m + O, m j= where γ = is the Euler constant. Lemma A.3. Let ζ N 0, I p k. Under the assumptions of Theorem., for any constant A > 0, Ak E ζ 2k logp/k i λk+i BH i= Proof of Lemma A.3. Write λ i = λ BH i for simplicity. It is sufficient to prove a stronger version in which the order statistics ζ i come from p i.i.d. N 0,. The reason is that the order statistics will be stochastically larger, thus enlarging E 2 ζ i λk+i BH +, since ζ λ2 + is nondecreasing in ζ. Applying the bias-variance decomposition, we get A.2 E 2 ζ i λ k+i + E 2 ζ i λ k+i = Var ζi + 2 E ζ i λ k+i.
8 8 W. SU AND E. CANDÈS We proceed to control each term separately. For the variance, a direct application of Proposition 4.2 in [3] gives A.3 i= Var ζ i = O i logp/i for all i p/2. Hence, Ak Ak Var ζ i = O = o2k logp/k, i logp/i i= where the last step makes use of logp/k. It remains to show that A.4 Ak i= E ζi λ k+i 2 = o2k logp/k. Let U,..., U p be i.i.d. uniform random variables on 0, and U i be the i th smallest please note that for a change, the U i s are sorted in increasing order. We know that U i is distributed as Betai, p + i and that ζ i has the same distribution as Φ U i /2. Making use of Lemmas A. and A.2 then gives E ζ 2 i = E [ Φ U i /2 2] E [ 2 log2/u i ] = 2 log 2+2 where the second step follows from + o P 2 log2/u i Φ U i /2 2 2 log2/u i for i = op. As a result, A.5 E ζ i E ζ 2 i = + o 2 logp/i E ζ i = E ζ 2 i Var ζ i = + o 2 logp/i. Similarly, since k + i = op and q is constant, we have the approximation λ k+i = + o 2 logp/k + i, which together with A.5 reveals that A.6 E ζi λ k+i 2 +o 2 [ logp/i logp/k + i ] 2+o logp/i. p j=i j = +o2 logp/i,
9 SUPPLEMENT 9 The second term in the right-hand side contributes at most o Ak logp/ak = o 2k logp/k in the sum A.4. For the first term, we get [ logp/i logp/k + i ] 2 = log 2 + k/i [ logp/i + logp/k + i ] 2 = o log 2 +k/i. Hence, it contributes at most A.7 o Ak i= Ak log 2 + k/i o = ok i= A 0 k i k i k log 2 + /xdx log 2 + /xdx = o2k logp/k. Combining A.7, A.6 and A.4 concludes the proof. Lemma A.4. Let ζ N 0, I p k and A > 0 be any constant satisfying q + A/A <. Then, under the assumptions of Theorem., 2k logp/k p k i= Ak E ζ i λ BH k+i Proof of Lemma A.4. Again, write λ i = λ BH i for simplicity. As in the proof of Lemma A.3 we work on a stronger version by assuming ζ N 0, I p. Denote by q = q + A/A. For any u 0, let α u := P N 0, > λ k+i + u = 2Φ λ k+i u. Then P ζ i > λ k+i + u is just the tail probability of the binomial distribution with p trials and success probability α u. By the Chernoff bound, this probability is bounded as A.8 P ζ i > λ k+i + u e p KLi/p αu, where KLa b := a log a b Note that A.9 KLi/p b b a + a log b is the Kullback-Leibler divergence. = i/p b α u + i/p b i pb + for all 0 < b < i/p. Hence, from A.9 it follows that A.20 α0 KL α0 KL i/p α u KL i/p α 0 = b db i pb db α u α0 e uλ k+iα 0 = iuλ k+i p i pb db α 0 e uλ k+i,
10 0 W. SU AND E. CANDÈS where the second inequality makes use of α u e uλ k+iα 0. With the proviso that q + A/A < and i Ak, it follows that A.2 α 0 = qk + i/p q i/p. Hence, substituting A.2 into A.20, we see that A.8 yields P ζ i > λ k+i + u e p KL i p αu KL i p α 0 A.22 e p KL i p αu KL i p α 0 e p KL i p α 0 exp iuλ k+i + q i exp uλ k+i. With this preparation, we conlude the proof of our lemma as follows: E 2 ζ i λ k+i + = P ζ i λ k+i 2 + > x dx and plugging A.22 gives E ζ i λ k+i = 2 0 λ 2 k+i 0 2 λ 2 p This yields the upper bound p k = = 2 0 u exp i= Ak E ζ i λ k+i λ 2 p Since the integrand obeys lim x 0 0 = 0 0 P ζ i > λ k+i + xdx up ζ i > λ k+i + udu, iuλ k+i + q i exp uλ k+i du xe x q e x i dx xe x q e x i dx. p k i= Ak 0 2 Φ q/2 2 2 Φ q/2 2 xe x q e x i dx i= 0 xe x q e x e x q e x = q 0 xe x q e x i dx xe x q e x e x q e x dx.
and decays exponentially fast as $x \to \infty$, we conclude that $\sum_{i=Ak}^{p-k} \mathbb{E}\big(|\zeta|_{(i)} - \lambda_{k+i}\big)_+^2$ is bounded by a constant. This is a bit more than we need since $2k\log(p/k) \to \infty$.

A.3. Proofs for Section 4. In this paper, we often use the Borell inequality to show that
$$P\big(\|\mathcal{N}(0, I_n)\| > \sqrt{n} + t\big) \le \exp(-t^2/2).$$

Lemma A.5 (Borell's inequality). Let $\zeta \sim \mathcal{N}(0, I_n)$ and $f$ be an $L$-Lipschitz continuous function on $\mathbb{R}^n$. Then for every $t > 0$,
$$P\big(f(\zeta) > \mathbb{E} f(\zeta) + t\big) \le e^{-\frac{t^2}{2L^2}}.$$

Lemma A.6. Let $\zeta_1, \ldots, \zeta_p$ be i.i.d. $\mathcal{N}(0,1)$. Then
$$\max_{1\le i\le p} |\zeta_i| \le (1 + o(1))\sqrt{2\log p}$$
holds with probability approaching one.

The latter classical result can be proved in many different ways. Suffice it to say that it follows from a more subtle fact, namely, that
$$\sqrt{2\log p}\,\Big( \max_{1\le i\le p} \zeta_i - \sqrt{2\log p} + \frac{\log\log p + \log 4\pi}{2\sqrt{2\log p}} \Big)$$
converges weakly to a Gumbel distribution [4].

Proof of Lemma 4.3. Let $\beta^{\mathrm{lift}}$ be the lift of $\widehat b_T$ in the sense that $\beta^{\mathrm{lift}}_T = \widehat b_T$ and $\beta^{\mathrm{lift}}_{\bar T} = 0$, and let $|T| = m$. Further, set $\nu := y - X_T \widehat b_T = y - X\beta^{\mathrm{lift}}$. Applying (A.10) and (A.11) to the reduced SLOPE program, we get that $X_T'\nu$ is majorized by $\lambda_{[m]} := (\lambda_1, \ldots, \lambda_m)$. By the assumption, $X_{\bar T}'\nu$ is majorized by $\lambda_{-[m]} := (\lambda_{m+1}, \ldots, \lambda_p)$. Hence, $X'\nu$ (the concatenation of $X_T'\nu$ and $X_{\bar T}'\nu$) is majorized by $\lambda = (\lambda_{[m]}, \lambda_{-[m]})$. This confirms that $\nu$ is dual feasible with respect to the full SLOPE program. If additionally we show that
$$\tfrac12\|y - X\beta^{\mathrm{lift}}\|^2 + \sum_i \lambda_i |\beta^{\mathrm{lift}}|_{(i)} = \nu'y - \tfrac12\|\nu\|^2, \tag{A.23}$$
then strong duality implies that $\beta^{\mathrm{lift}}$ and $\nu$ must, respectively, be the optimal solutions to the full primal and dual. In fact, (A.23) is self-evident. The right-hand side is the optimal value of the reduced dual (i.e., replacing $X$ and $\lambda$ by $X_T$ and $\lambda_{[m]}$ in (A.10)), while the left-hand side agrees with the optimal value of the reduced primal since
$$\tfrac12\|y - X\beta^{\mathrm{lift}}\|^2 = \tfrac12\|y - X_T \widehat b_T\|^2 \quad \text{and} \quad \sum_{i=1}^p \lambda_i |\beta^{\mathrm{lift}}|_{(i)} = \sum_{i=1}^m \lambda_i |\widehat b_T|_{(i)}.$$
Since the reduced primal only has linear equality constraints and is clearly feasible, strong duality holds, and (A.23) follows from this.

Lemma A.7. Let $k' < p$ be any deterministic integer. Then
$$\sup_{|T| = k'} \|X_T' z\| \le \sqrt{32 k' \log(p/k')}$$
with probability at least $1 - e^{-n/2} - (2ek'/p)^{k'}$. Above, the supremum is taken over all the subsets of $\{1, \ldots, p\}$ with cardinality $k'$.

Proof of Lemma A.7. Conditional on $z$, it is easy to see that the entries of $X'z$ are distributed as i.i.d. centered Gaussian random variables with variance $\|z\|^2/n$. This observation enables us to write
$$X'z \overset{d}{=} \frac{\|z\|}{\sqrt{n}}\,(\zeta_1, \ldots, \zeta_p),$$
where $\zeta := (\zeta_1, \ldots, \zeta_p)$ consists of i.i.d. $\mathcal{N}(0,1)$ variables independent of $z$. Hence, it is sufficient to prove that
$$\|z\| \le 2\sqrt{n} \qquad \text{and} \qquad \zeta_{(1)}^2 + \cdots + \zeta_{(k')}^2 \le 8k'\log(p/k')$$
hold simultaneously with probability at least $1 - e^{-n/2} - (2ek'/p)^{k'}$, where $\zeta_{(1)}^2 \ge \cdots \ge \zeta_{(p)}^2$ are the squared entries sorted in decreasing order. From Lemma A.5, we know that $P(\|z\| > 2\sqrt{n}) \le e^{-n/2}$, so we just need to establish the other inequality. To this end, observe that
$$P\Big(\zeta_{(1)}^2 + \cdots + \zeta_{(k')}^2 > 8k'\log(p/k')\Big) \le \frac{\mathbb{E}\, e^{\frac14\left(\zeta_{(1)}^2 + \cdots + \zeta_{(k')}^2\right)}}{e^{2k'\log\frac{p}{k'}}} \le \sum_{i_1 < \cdots < i_{k'}} \frac{\mathbb{E}\, e^{\frac14\left(\zeta_{i_1}^2 + \cdots + \zeta_{i_{k'}}^2\right)}}{e^{2k'\log\frac{p}{k'}}} = \binom{p}{k'}\, 2^{k'/2}\, e^{-2k'\log\frac{p}{k'}} \le \Big(\frac{2ek'}{p}\Big)^{k'}.$$
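Lemma A.7's bound can be sanity-checked numerically. The sketch below uses arbitrary illustrative dimensions and computes the supremum over supports of size $k'$ exactly as the sum of the $k'$ largest squared correlations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k1 = 500, 2000, 10   # k1 plays the role of k' in Lemma A.7

# Gaussian design with i.i.d. N(0, 1/n) entries, independent noise z
X = rng.standard_normal((n, p)) / np.sqrt(n)
z = rng.standard_normal(n)

# sup over all |T| = k1 of ||X_T' z||^2 equals the sum of the k1
# largest values of (X_i' z)^2
corr2 = np.sort((X.T @ z) ** 2)[::-1]
sup_norm2 = corr2[:k1].sum()
bound = 32 * k1 * np.log(p / k1)
```

With these dimensions the lemma's exceptional event has negligible probability, so `sup_norm2` should land well below `bound`.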
We record an elementary result which simply follows from $\Phi^{-1}(1 - c/2) \le \sqrt{2\log(1/c)}$ for each $0 < c < 1$.

Lemma A.8. Fix $0 < q < 1$. Then for all $k \le p/2$,
$$\sum_{i=1}^k \big(\lambda_i^{\mathrm{BH}}\big)^2 \le C_q\, k\log(p/k)$$
for some constant $C_q > 0$.

In the next two lemmas, we use the BHq critical values $\lambda^{\mathrm{BH}}$ to majorize sequences of Gaussian order statistics. Again, $a \preceq b$ means that $b$ majorizes $a$.

Lemma A.9. Given any constant $c > 1/(1-q)$, suppose $\max\{ck, k + d\} \le k' < p$ for some deterministic sequence $d$ that diverges to $\infty$. Let $\zeta_1, \ldots, \zeta_{p-k'}$ be i.i.d. $\mathcal{N}(0,1)$. Then
$$\big(|\zeta|_{(k'-k+1)}, |\zeta|_{(k'-k+2)}, \ldots, |\zeta|_{(p-k')}\big) \preceq \big(\lambda^{\mathrm{BH}}_{k'+1}, \lambda^{\mathrm{BH}}_{k'+2}, \ldots, \lambda^{\mathrm{BH}}_p\big)$$
with probability approaching one.

Proof of Lemma A.9. It suffices to prove the stronger case where $\zeta \sim \mathcal{N}(0, I_p)$. Let $U_1, \ldots, U_p$ be i.i.d. uniform random variables on $[0,1]$ and $U_{(1)} \le \cdots \le U_{(p)}$ the corresponding order statistics. Since
$$\big(|\zeta|_{(k'-k+1)}, \ldots, |\zeta|_{(p-k')}\big) \overset{d}{=} \big(\Phi^{-1}(1 - U_{(k'-k+1)}/2), \ldots, \Phi^{-1}(1 - U_{(p-k')}/2)\big),$$
the conclusion would follow from
$$P\big(U_{(k'-k+j)} \ge q(k'+j)/p,\ \forall j \in \{1, \ldots, p-k'\}\big) \to 1.$$
Let $E_1, \ldots, E_{p+1}$ be i.i.d. exponential random variables with mean 1, and denote by $T_i = E_1 + \cdots + E_i$. Then the order statistics $U_{(i)}$ have the same joint distribution as $T_i/T_{p+1}$. Fixing an arbitrary constant $q' \in (q, 1 - 1/c)$, we have
$$P\big(U_{(k'-k+j)} \ge q(k'+j)/p,\ \forall j\big) \ge P\big(T_{k'-k+j} \ge q'(k'+j),\ \forall j\big) - P\big(T_{p+1} > q'p/q\big).$$
Since $P(T_{p+1} > q'p/q) \to 0$ by the law of large numbers, it is sufficient to prove
$$P\big(T_{k'-k+j} \ge q'(k'+j),\ \forall j \in \{1, \ldots, p-k'\}\big) \to 1. \tag{A.24}$$
14 4 W. SU AND E. CANDÈS This event can be rewritten as T k k+j T k k q j q k T k k for all j p k. Hence, A.24 is reduced to proving A.25 P min T j p k k k+j T k k q j q k T k k. As a random walk, T k k+j T k k q j has i.i.d. increments with mean q > 0 and variance. Thus min j p k T k k+j T k k q j converges weakly to a bounded random variable in distribution. Consequently, A.25 holds if one can demonstrate that q k T k k diverges to as p in probability. To see this, observe that q k T k k = q k k k k k T k k q c c k k T k k, where we use the fact k ck. Under our hypothesis q c/c <, the process {q ct/c T t : t N} is a random walk drifting towards. Recognizing that k k d, we see that q ck k/c T k k weakly diverges to since it corresponds to a position of the preceding random walk at t. This concludes the proof. Lemma A.0. Let ζ,..., ζ p k be i.i.d. N 0,. Then there exists a constant C q only depending on q such that log p ζ,..., ζ p k C q λ BH logp/k k+,..., λ BH p with probability tending to one as p and k/p 0. Proof of Lemma A.0. Let U,..., U p k be i.i.d. uniform random variables on [0, ] and replace ζ i by Φ U i /2. Note that Φ U i /2 2 log 2, λ BH k+i U 2 log 2p i k + i ; Hence, it suffices to prove that for some constant κ q, A.26 log2/u i logp/k κ q log p log2p/k + i
15 SUPPLEMENT 5 holds for all i =,..., p k with probability approaching one. Applying the representation given in the proof of Lemma A.9 and noting that T p+ = + o P p, we see that A.26 is implied by A.27 log3p/t i logp/k κ q log p log2p/k + i. We consider i 4 p and i > 4 p separately. Suppose first that i 4 p. In this case, Thus A.27 would follow from log2p/k + i = + o logp/k. log3p/t i = Olog p for all such i. This is, however, self-evident since T i E /p with probability e /p = o. Suppose now that i > 4 p. In this case, we make use of the fact that T i > i/2 p for all i with probability tending to one as p. Then we prove a stronger result, namely, log 3p i/2 p log p k κ q log p log 2p k + i. for all i > 4 p. This follows from the two observations below: log 3p i/2 p log p 2p, log {log i k + i min p i, log p }. k In the proofs of the next two lemmas, namely, Lemma A. and Lemma A.2, we introduce an orthogonal matrix Q R n n that obeys Qz = z, 0,..., 0. In the proofs, Q is further set to be measurable with respect to z. Hence, Q is independent of X. There are many options available to construct such a Q, including the Householder transformation. Set [ ] W = := QX, w W
where $w \in \mathbb{R}^p$ and $\bar W \in \mathbb{R}^{(n-1)\times p}$. The independence between $Q$ and $X$ suggests that $W$ is still a Gaussian random matrix, consisting of i.i.d. $\mathcal{N}(0, 1/n)$ entries. Note that
$$X_i'z = (QX_i)'(Qz) = \|z\|\,(QX_i)_1 = \|z\|\, w_i.$$
This implies that $S'$ is constructed as the union of $S$ and the $k' - k$ indices in $\{1, \ldots, p\} \setminus S$ with the largest $|w_i|$. Since $w$ and $\bar W$ are independent, we see that both $\bar W_{S'}$ and $\bar W_{\bar S'}$ are also Gaussian random matrices. These points are crucial in the proofs of the two lemmas.

Lemma A.11. Let $k < k' < \min\{n, p\}$ be any deterministic integer. Denote by $\sigma_{\min}$ and $\sigma_{\max}$, respectively, the smallest and the largest singular value of $X_{S'}$. Then for any $t > 0$,
$$\sigma_{\min} > \sqrt{(n-1)/n} - \sqrt{k'/n} - t$$
holds with probability at least $1 - e^{-nt^2/2}$. Furthermore,
$$\sigma_{\max} < \sqrt{(n-1)/n} + \sqrt{k'/n} + \sqrt{8k'\log(p/k')/n} + t$$
holds with probability at least $1 - e^{-nt^2/2} - (2ek'/p)^{k'}$.

Proof of Lemma A.11. Recall that $\bar W_{S'} \in \mathbb{R}^{(n-1)\times k'}$ is a Gaussian design with i.i.d. $\mathcal{N}(0, 1/n)$ entries. Since $W_{S'}$ and $X_{S'}$ have the same set of singular values, we consider $W_{S'}$. Classical theory on Wishart matrices (see [5], for example) asserts that (i) all the singular values of $\bar W_{S'}$ are larger than $\sqrt{(n-1)/n} - \sqrt{k'/n} - t$ with probability at least $1 - e^{-nt^2/2}$, and (ii) are all smaller than $\sqrt{(n-1)/n} + \sqrt{k'/n} + t$ with probability at least $1 - e^{-nt^2/2}$. Clearly, all the singular values of $W_{S'}$ are at least as large as $\sigma_{\min}(\bar W_{S'})$; thus, (i) yields the first claim. For the other, Lemma A.7 asserts that the event $\|w_{S'}\| \le \sqrt{8k'\log(p/k')/n}$ happens with probability at least $1 - (2ek'/p)^{k'}$. On this event,
$$\|W_{S'}\| \le \|\bar W_{S'}\| + \|w_{S'}\| \le \|\bar W_{S'}\| + \sqrt{8k'\log(p/k')/n},$$
where $\|\cdot\|$ denotes the spectral norm. Hence, (ii) gives
$$\|W_{S'}\| \le \sqrt{(n-1)/n} + \sqrt{k'/n} + t + \sqrt{8k'\log(p/k')/n}$$
with probability at least $1 - e^{-nt^2/2} - (2ek'/p)^{k'}$.
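The classical singular-value concentration cited in the proof of the singular-value lemma above can be checked directly on a simulated Gaussian design. This is a sketch with arbitrary dimensions; it exercises only the Wishart bounds from [5], not the extra $\sqrt{8k'\log(p/k')/n}$ correction coming from the selected first row.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k1, t = 2000, 100, 0.1   # k1 plays the role of k'

# n x k1 Gaussian design with i.i.d. N(0, 1/n) entries
W = rng.standard_normal((n, k1)) / np.sqrt(n)
sv = np.linalg.svd(W, compute_uv=False)

# each one-sided bound fails with probability at most exp(-n t^2 / 2)
lower = 1 - np.sqrt(k1 / n) - t
upper = 1 + np.sqrt(k1 / n) + t
```

With $n = 2000$ and $t = 0.1$, the exceptional probability $e^{-nt^2/2} = e^{-10}$ is negligible, so the sampled spectrum should sit inside $[{\tt lower}, {\tt upper}]$.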
17 SUPPLEMENT 7 Lemma A.2. Denote by b S the solution to the reduced SLOPE problem 4.6 with T = S and λ = λ ɛ. Keep the assumptions from Lemma A.9, and additionally assume k / min{n, p} 0. Then there exists a constant C q only depending on q such that X X S S β S b k S C q log p λ BH n k +,..., λbh p with probability tending to one. Proof of Lemma A.2. In this proof, C is a constant that only depends on q and whose value may change at each occurrence. Rearrange the objective term as X S X S β S b S = X S X S X S X S X S y X S bs X S z = X S Q QX S X S X S X S y X S bs X S z = X S Q ξ, where ξ := QX S X S X S X S y X S bs X S z. For future usage, note that ξ only depends on w and W S and is, therefore, independent of W S. We begin by bounding ξ. It follows from the KKT condition of SLOPE that X S y X S bs is majorized by λ [k ]. Hence, it follows from Fact 3. that A.28 X S y X S bs λ [k ]. Lemma A. with t = /2 gives A.29 X S X S X S /n k /n /2 < 2.0 with probability at least e n/8 for sufficiently large p, where in the last step we have used k /n 0. Hence, from A.28 and A.29 we get ξ X S X S X S X S y X S bs X S z 2.0 λ [k ] + 4 2k logp/k A ɛ C k logp/k = C k logp/k
18 8 W. SU AND E. CANDÈS with probability at least e n/2 2ek /p k e n/8 ; we used Lemma A.7 in the second line and Lemma A.8 in the third. A.30 will help us in finishing the proof. Write A.3 X X S S β S b S = X Q ξ = W ξ = w, 0 ξ+ 0, W S ξ. S S S It follows from Lemma A.9 that w S is majorized by λ BH k +, λbh k +2,..., λbh p / n in probability. As a result, the first term in the right-hand side obeys A.32 w, 0 k ξ = ξ S w ξ w C S S n log p λ BH k k +, λbh k +2,..., λbh p with probability tending to one. For the second term, by exploiting the independence between ξ and W S, we have 0, W S ξ = d ξ ξn 2 ζ,..., ζ p k, n where ζ,..., ζ p k are i.i.d. N 0, /n. Since k /p 0, applying Lemma A.0 gives log p ζ,..., ζ p k C λ BH logp/k k +,..., λbh p with probability approaching one. Hence, owing to A.30, A.33 0, W k S ξ C log p λ BH n k +,..., λbh p holds with probability approaching one. Finally, combining A.32 and A.33 gives that X X S S β S b S = w, 0 ξ + 0, W S ξ S k C n log p k k + log p λbh k n +,..., λbh p k C log p λ BH n k +,..., λbh p holds with probability tending to one.
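The BHq critical values $\lambda^{\mathrm{BH}}_i = \Phi^{-1}(1 - qi/(2p))$ used throughout this appendix, together with the sorted-$\ell_1$ prox, can be sketched numerically. The prox routine below is a stack-based pool-adjacent-violators pass in the spirit of FastProxSL1 from [2] (a sketch, not the paper's implementation); the final lines exercise Fact 3.3 (an input majorized by $\lambda$ has prox zero) and soft thresholding of a single spike.

```python
import numpy as np
from statistics import NormalDist

def lambda_bh(p, q):
    # BH critical values: lambda_i = Phi^{-1}(1 - q*i/(2p)), nonincreasing in i
    nd = NormalDist()
    return np.array([nd.inv_cdf(1 - q * i / (2 * p)) for i in range(1, p + 1)])

def prox_sorted_l1(y, lam):
    # prox of b -> sum_i lam_i * |b|_(i): sort |y| decreasingly, subtract lam,
    # then project onto the nonincreasing nonnegative cone by pooling
    # adjacent violators; finally undo the sort and restore signs
    sign = np.sign(y)
    a = np.abs(y)
    order = np.argsort(-a)
    v = a[order] - lam
    sums, lens = [], []
    for x in v:
        sums.append(x)
        lens.append(1)
        while len(sums) > 1 and sums[-2] / lens[-2] <= sums[-1] / lens[-1]:
            s, l = sums.pop(), lens.pop()
            sums[-1] += s
            lens[-1] += l
    out = np.maximum(np.concatenate(
        [np.full(l, s / l) for s, l in zip(sums, lens)]), 0.0)
    res = np.empty_like(out)
    res[order] = out
    return sign * res

lam = lambda_bh(5, q=0.2)                 # positive, strictly decreasing
majorized = np.full(5, lam[-1])           # its cumulative sums stay below lam's
maj_prox = prox_sorted_l1(majorized, lam)  # Fact 3.3 predicts exactly zero
spike = np.zeros(5)
spike[2] = lam[0] + 1.0
spike_prox = prox_sorted_l1(spike, lam)    # largest entry shrunk by lam[0]
```

The spike example illustrates that a single nonzero coordinate is simply soft-thresholded by the largest weight $\lambda_1$.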
19 A.4. Proofs for Section 5. SUPPLEMENT 9 Lemma A.3. Keep the assumptions from Lemma 5. and let ζ,..., ζ p be i.i.d. N 0,. Then in probability. # {2 i p : ζ i > τ + ζ } Proof of Lemma A.3. With probability tending to one, τ := τ + ζ also obeys τ / 2 log p and 2 log p τ. This shows that we only need to prove a simpler version of this lemma, namely, # { i p : ζ i > τ} in probability. Put = 2 log p τ = o 2 log p and a = Pξ > τ. Then, # { i p : ζ i > τ} is a binomial random variable with p trials and success probability a. Hence, it suffices to demonstrate that ap. To this end, note that a = Φτ τ 2π e τ2 2 2 log p e log p 2 /2+ 2 log p which gives ap = p 2 log p 2 log p e+o, 2 log p e +o 2 log p. Since in fact, it is sufficient to have bounded away from 0 from below, we have e +o 2 log p / 2 log p, as we wish. Proof of Lemma 5.. For sufficiently large p, 2 ɛ log p ɛ/2τ 2. Hence, it is sufficient to show P π β β 2 ɛ/2τ 2 0 uniformly for all estimators β. Letting I be the random coordinate, β β 2 = j I β 2 j + β I τ 2 = β 2 + τ 2 2τ β I, which is smaller than or equal to ɛ/2τ 2 if and only if β I 2 β 2 + ɛτ 2. 4τ
20 20 W. SU AND E. CANDÈS Denote by A = Ay; β the set of all i {,..., p} such that β i 2 β 2 + ɛτ 2 /4τ, and let b be the minimum value of these β i. Then which gives b 2 β 2 + ɛτ 2 4τ 2 A b 2 + ɛτ 2 4τ 2 2 A b 2 ɛτ 2, 4τ A.34 A 2/ɛ. Recall that β β 2 ɛ/2τ 2 if and only if I is among these A components. Hence, A.35 P π β β 2 ɛ/2τ 2 y = P π I A y = i A P π I = i y = eτy i p, i A i= eτy i where we use the fact that A is almost surely determined by y. Since A.35 is maximal if A is the set of indices with the largest y i s, A.35 and A.34 together yield P π β β 2 ɛ/2τ 2 P π y I = τ + z I is at least the 2/ɛ th largest among y,..., y p 0, where the last step is provided by Lemma A.3. Proof of Lemma 5.3. To closely follow the proof of Lemma 5., denote by A = Ay, X; β the set of all i {,..., p} such that β i 2 β 2 + ɛα 2 τ 2 /4ατ, and keep the same notation b as before. Then β β 2 ɛ/2α 2 τ 2 if and only if I A. Hence, A.36 P π β β 2 ɛ/2α 2 τ 2 y, X = P π I A y, X = i A P π I = i y, X = i A exp ατx i y α2 τ 2 X i 2 /2 p i= exp ατx i y α2 τ 2 X i 2 /2 and this quantity is maximal if A is the set of indices i with the largest values of X i y/α τ X i 2 /2. As shown in Lemma 5., A 2/ɛ, which
21 gives SUPPLEMENT 2 A.37 P π β β 2 ɛ/2α 2 τ 2 P π X Iy/α τ X I 2 /2 is at least the 2/ɛ th largest. We complete the proof by showing that the probability in the right-hand side of A.37 is negligible uniformly over all estimators β as p. By the independence between I and X, z, we can assume I = while evaluating this probability. With this in mind, we aim to show that there are sufficiently many i s such that X iy/α τ 2 X i 2 X y/α+ τ 2 X 2 = X i z/α + τx τ 2 X i 2 X z/α τ 2 X 2 is positive. Since X z/α + τ 2 X 2 = O P /α + τ 2 + OP / n, it suffices to show that A.38 { # 2 i p : X i z/α + τx τ 2 X i 2 > C α + τ 2 + C } 2τ 2/ɛ n holds with vanishing probability for all positive constants C, C 2. By the independence between X i and z/α + τx, we can replace z/α + τx by z/α + τx, 0,..., 0 in A.38. That is, X i z/α + τx τ 2 X i 2 d = z/α + τx X i, τ 2 X2 i, τ 2 X i, 2, where X i, R n is X i without the first entry. To this end, we point out that the following three events all happen with probability tending to one: A.39 #{2 i p : X i, }/p /2, max Xi, 2 2 log p i n, n z/α + τx log p /α. Making use of this and A.38, we only need to show that { N # 2 i 0.49p : nxi, log p/n > τ log p α n nxi, > τ + τ log p # { 2 i 0.49p : log p/n α n + τ 2 + C α + τ 2 + C } 2τ n + C α + C } 2τ n
22 22 W. SU AND E. CANDÈS obeys A.40 N 2/ɛ with vanishing probability. The first line of A.39 shows that there are at least 0.49p many i s such that X i, and we assume they correspond to indices 2 i 0.49p without loss of generality. Note that N is independent of all X i, s. Observe that τ := τ + τlog p/n + C /α + C 2 τ/ n log p = α + 2 τ + O log p/n /α n for sufficiently large p to ensure log p/n is small. Hence, plugging the specific choice of τ and using α, we obtain τ + 2 log p/n τ + O 2 log p log 2 log p + O, which reveals that 2 log0.49p τ = 2 log p τ + o. Since nxn,i are i.i.d. N 0,, Lemma A.3 validates A.40. Proof of Corollary.5. Let c > 0 be a sufficiently small constant to be determined later. It is sufficient to prove the claim with p replaced by a possibly smaller value given by p := min{ cn, p} if we knew that β i = 0 for p + i p, the loss of any estimator X β would not increase after projecting onto the linear space spanned by the first p columns. Hereafter, we assume X R n p and β R p. Observe that p = On implies p = Op and, therefore, A.4 logp /k logp/k. In particular, k/p 0 and n/ logp /k. This suggests that we can apply Theorem 5.4 to our problem, obtaining inf sup P β β 2 β β 0 k 2k logp /k > ɛ. for every constant ɛ > 0. Because of A.4, we also have A.42 inf sup P β β 2 β β 0 k 2k logp/k > ɛ for any ɛ > 0.
Since $p'/n \le c$, the smallest singular value of the Gaussian random matrix $X$ is at least $1 - \sqrt{c} + o_P(1)$ (see, for example, [5]). This result, together with (A.42), yields
$$\inf_{\widehat\beta}\ \sup_{\|\beta\|_0\le k} P\bigg( \frac{\|X\widehat\beta - X\beta\|^2}{2k\log(p/k)} > (1-\sqrt{c})^2(1-\epsilon') \bigg) \to 1$$
for each $\epsilon' > 0$. Finally, choose $c$ and $\epsilon'$ sufficiently small such that $(1-\sqrt{c})^2(1-\epsilon') > 1 - \epsilon$.

REFERENCES

[1] M. Abramowitz and I. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Courier Corporation, 2012.
[2] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE—adaptive variable selection via convex optimization. arXiv preprint, 2014.
[3] S. Boucheron and M. Thomas. Concentration inequalities for order statistics. Electronic Communications in Probability, 2012.
[4] L. de Haan and A. Ferreira. Extreme Value Theory: An Introduction. Springer Science & Business Media.
[5] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications, 2012.

Departments of Statistics, Stanford University, Stanford, CA, USA. E-mail: wjsu@stanford.edu
Departments of Statistics and Mathematics, Stanford University, Stanford, CA, USA. E-mail: candes@stanford.edu
Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued
More information3 Integration and Expectation
3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ
More informationFalse Discovery Rate
False Discovery Rate Peng Zhao Department of Statistics Florida State University December 3, 2018 Peng Zhao False Discovery Rate 1/30 Outline 1 Multiple Comparison and FWER 2 False Discovery Rate 3 FDR
More informationEstimates for probabilities of independent events and infinite series
Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences
More informationSupplementary Materials to Convex Banding of the Covariance Matrix
Supplementary Materials to Convex Banding of the Covariance Matrix Jacob Bien, Florentina Bunea, Luo Xiao May 25, 2015 A.1 Dual problem Define LΣ, A 1,..., A = 1 2 S Σ 2 F + λ W l A l, Σ. Observe that
More informationWeak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij
Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real
More informationABELIAN SELF-COMMUTATORS IN FINITE FACTORS
ABELIAN SELF-COMMUTATORS IN FINITE FACTORS GABRIEL NAGY Abstract. An abelian self-commutator in a C*-algebra A is an element of the form A = X X XX, with X A, such that X X and XX commute. It is shown
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationPOISSON PROCESSES 1. THE LAW OF SMALL NUMBERS
POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationOptimisation Combinatoire et Convexe.
Optimisation Combinatoire et Convexe. Low complexity models, l 1 penalties. A. d Aspremont. M1 ENS. 1/36 Today Sparsity, low complexity models. l 1 -recovery results: three approaches. Extensions: matrix
More informationInference for High Dimensional Robust Regression
Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More information4 Sums of Independent Random Variables
4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables
More informationRandom Bernstein-Markov factors
Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit
More informationC.7. Numerical series. Pag. 147 Proof of the converging criteria for series. Theorem 5.29 (Comparison test) Let a k and b k be positive-term series
C.7 Numerical series Pag. 147 Proof of the converging criteria for series Theorem 5.29 (Comparison test) Let and be positive-term series such that 0, for any k 0. i) If the series converges, then also
More informationLecture 21 Representations of Martingales
Lecture 21: Representations of Martingales 1 of 11 Course: Theory of Probability II Term: Spring 215 Instructor: Gordan Zitkovic Lecture 21 Representations of Martingales Right-continuous inverses Let
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More informationCHAPTER III THE PROOF OF INEQUALITIES
CHAPTER III THE PROOF OF INEQUALITIES In this Chapter, the main purpose is to prove four theorems about Hardy-Littlewood- Pólya Inequality and then gives some examples of their application. We will begin
More informationBALANCING GAUSSIAN VECTORS. 1. Introduction
BALANCING GAUSSIAN VECTORS KEVIN P. COSTELLO Abstract. Let x 1,... x n be independent normally distributed vectors on R d. We determine the distribution function of the minimum norm of the 2 n vectors
More informationMATH 117 LECTURE NOTES
MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set
More informationThe Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression
The Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression Cun-hui Zhang and Jian Huang Presenter: Quefeng Li Feb. 26, 2010 un-hui Zhang and Jian Huang Presenter: Quefeng The Sparsity
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationOn a class of stochastic differential equations in a financial network model
1 On a class of stochastic differential equations in a financial network model Tomoyuki Ichiba Department of Statistics & Applied Probability, Center for Financial Mathematics and Actuarial Research, University
More informationAn introduction to some aspects of functional analysis
An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms
More informationTrace Class Operators and Lidskii s Theorem
Trace Class Operators and Lidskii s Theorem Tom Phelan Semester 2 2009 1 Introduction The purpose of this paper is to provide the reader with a self-contained derivation of the celebrated Lidskii Trace
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationX n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)
14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence
More informationA Randomized Algorithm for the Approximation of Matrices
A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive
More informationLeast singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA)
Least singular value of random matrices Lewis Memorial Lecture / DIMACS minicourse March 18, 2008 Terence Tao (UCLA) 1 Extreme singular values Let M = (a ij ) 1 i n;1 j m be a square or rectangular matrix
More informationDynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor)
Dynkin (λ-) and π-systems; monotone classes of sets, and of functions with some examples of application (mainly of a probabilistic flavor) Matija Vidmar February 7, 2018 1 Dynkin and π-systems Some basic
More informationLinear Programming. Larry Blume Cornell University, IHS Vienna and SFI. Summer 2016
Linear Programming Larry Blume Cornell University, IHS Vienna and SFI Summer 2016 These notes derive basic results in finite-dimensional linear programming using tools of convex analysis. Most sources
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More informationMATH 205C: STATIONARY PHASE LEMMA
MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationAsymptotic behavior of infinity harmonic functions near an isolated singularity
Asymptotic behavior of infinity harmonic functions near an isolated singularity Ovidiu Savin, Changyou Wang, Yifeng Yu Abstract In this paper, we prove if n 2 x 0 is an isolated singularity of a nonegative
More informationOPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS
OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony
More informationSplitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches
Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationUNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems
UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction
More informationBrownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory
More informationFundamental Domains, Lattice Density, and Minkowski Theorems
New York University, Fall 2013 Lattices, Convexity & Algorithms Lecture 3 Fundamental Domains, Lattice Density, and Minkowski Theorems Lecturers: D. Dadush, O. Regev Scribe: D. Dadush 1 Fundamental Parallelepiped
More informationSPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS
SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint
More information21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)
10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC
More informationBranch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems
Branch-and-cut Approaches for Chance-constrained Formulations of Reliable Network Design Problems Yongjia Song James R. Luedtke August 9, 2012 Abstract We study solution approaches for the design of reliably
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationLecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1
Random Walks and Brownian Motion Tel Aviv University Spring 011 Lecture date: May 0, 011 Lecture 9 Instructor: Ron Peled Scribe: Jonathan Hermon In today s lecture we present the Brownian motion (BM).
More informationOn the smoothness of the conjugacy between circle maps with a break
On the smoothness of the conjugacy between circle maps with a break Konstantin Khanin and Saša Kocić 2 Department of Mathematics, University of Toronto, Toronto, ON, Canada M5S 2E4 2 Department of Mathematics,
More informationRANDOM WALKS WHOSE CONCAVE MAJORANTS OFTEN HAVE FEW FACES
RANDOM WALKS WHOSE CONCAVE MAJORANTS OFTEN HAVE FEW FACES ZHIHUA QIAO and J. MICHAEL STEELE Abstract. We construct a continuous distribution G such that the number of faces in the smallest concave majorant
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationOptimization Theory. A Concise Introduction. Jiongmin Yong
October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization
More informationIf Y and Y 0 satisfy (1-2), then Y = Y 0 a.s.
20 6. CONDITIONAL EXPECTATION Having discussed at length the limit theory for sums of independent random variables we will now move on to deal with dependent random variables. An important tool in this
More informationErdős-Renyi random graphs basics
Erdős-Renyi random graphs basics Nathanaël Berestycki U.B.C. - class on percolation We take n vertices and a number p = p(n) with < p < 1. Let G(n, p(n)) be the graph such that there is an edge between
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY Uplink Downlink Duality Via Minimax Duality. Wei Yu, Member, IEEE (1) (2)
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY 2006 361 Uplink Downlink Duality Via Minimax Duality Wei Yu, Member, IEEE Abstract The sum capacity of a Gaussian vector broadcast channel
More informationOptimal Power Control in Decentralized Gaussian Multiple Access Channels
1 Optimal Power Control in Decentralized Gaussian Multiple Access Channels Kamal Singh Department of Electrical Engineering Indian Institute of Technology Bombay. arxiv:1711.08272v1 [eess.sp] 21 Nov 2017
More informationTHROUGHOUT this paper, we let C be a nonempty
Strong Convergence Theorems of Multivalued Nonexpansive Mappings and Maximal Monotone Operators in Banach Spaces Kriengsak Wattanawitoon, Uamporn Witthayarat and Poom Kumam Abstract In this paper, we prove
More informationHILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define
HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,
More informationSupplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data
Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department
More informationAsymptotics of minimax stochastic programs
Asymptotics of minimax stochastic programs Alexander Shapiro Abstract. We discuss in this paper asymptotics of the sample average approximation (SAA) of the optimal value of a minimax stochastic programming
More informationConcentration Inequalities
Chapter Concentration Inequalities I. Moment generating functions, the Chernoff method, and sub-gaussian and sub-exponential random variables a. Goal for this section: given a random variable X, how does
More informationLasso Screening Rules via Dual Polytope Projection
Journal of Machine Learning Research 6 (5) 63- Submitted /4; Revised 9/4; Published 5/5 Lasso Screening Rules via Dual Polytope Projection Jie Wang Department of Computational Medicine and Bioinformatics
More informationInformation-Theoretic Limits of Matrix Completion
Information-Theoretic Limits of Matrix Completion Erwin Riegler, David Stotz, and Helmut Bölcskei Dept. IT & EE, ETH Zurich, Switzerland Email: {eriegler, dstotz, boelcskei}@nari.ee.ethz.ch Abstract We
More informationAppendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)
Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem
More informationarxiv: v1 [math.st] 26 Jun 2011
The Shape of the Noncentral χ 2 Density arxiv:1106.5241v1 [math.st] 26 Jun 2011 Yaming Yu Department of Statistics University of California Irvine, CA 92697, USA yamingy@uci.edu Abstract A noncentral χ
More informationFALSE DISCOVERIES OCCUR EARLY ON THE LASSO PATH. Weijie Su Małgorzata Bogdan Emmanuel Candès
FALSE DISCOVERIES OCCUR EARLY ON THE LASSO PATH By Weijie Su Małgorzata Bogdan Emmanuel Candès Technical Report No. 2015-20 November 2015 Department of Statistics STANFORD UNIVERSITY Stanford, California
More informationZ Algorithmic Superpower Randomization October 15th, Lecture 12
15.859-Z Algorithmic Superpower Randomization October 15th, 014 Lecture 1 Lecturer: Bernhard Haeupler Scribe: Goran Žužić Today s lecture is about finding sparse solutions to linear systems. The problem
More informationSupremum of simple stochastic processes
Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there
More informationTHE DVORETZKY KIEFER WOLFOWITZ INEQUALITY WITH SHARP CONSTANT: MASSART S 1990 PROOF SEMINAR, SEPT. 28, R. M. Dudley
THE DVORETZKY KIEFER WOLFOWITZ INEQUALITY WITH SHARP CONSTANT: MASSART S 1990 PROOF SEMINAR, SEPT. 28, 2011 R. M. Dudley 1 A. Dvoretzky, J. Kiefer, and J. Wolfowitz 1956 proved the Dvoretzky Kiefer Wolfowitz
More informationOptimality Conditions for Constrained Optimization
72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)
More informationEstimation of large dimensional sparse covariance matrices
Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)
More informationConvex Optimization Notes
Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =
More informationOn Backward Product of Stochastic Matrices
On Backward Product of Stochastic Matrices Behrouz Touri and Angelia Nedić 1 Abstract We study the ergodicity of backward product of stochastic and doubly stochastic matrices by introducing the concept
More informationDuality of multiparameter Hardy spaces H p on spaces of homogeneous type
Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Yongsheng Han, Ji Li, and Guozhen Lu Department of Mathematics Vanderbilt University Nashville, TN Internet Analysis Seminar 2012
More informationThe Codimension of the Zeros of a Stable Process in Random Scenery
The Codimension of the Zeros of a Stable Process in Random Scenery Davar Khoshnevisan The University of Utah, Department of Mathematics Salt Lake City, UT 84105 0090, U.S.A. davar@math.utah.edu http://www.math.utah.edu/~davar
More informationChapter 1. Preliminaries
Introduction This dissertation is a reading of chapter 4 in part I of the book : Integer and Combinatorial Optimization by George L. Nemhauser & Laurence A. Wolsey. The chapter elaborates links between
More informationMATH 426, TOPOLOGY. p 1.
MATH 426, TOPOLOGY THE p-norms In this document we assume an extended real line, where is an element greater than all real numbers; the interval notation [1, ] will be used to mean [1, ) { }. 1. THE p
More informationCS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018
CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 1 Overview This lecture mainly covers Recall the statistical theory of GANs
More informationStability and Robustness of Weak Orthogonal Matching Pursuits
Stability and Robustness of Weak Orthogonal Matching Pursuits Simon Foucart, Drexel University Abstract A recent result establishing, under restricted isometry conditions, the success of sparse recovery
More informationAgenda. Applications of semidefinite programming. 1 Control and system theory. 2 Combinatorial and nonconvex optimization
Agenda Applications of semidefinite programming 1 Control and system theory 2 Combinatorial and nonconvex optimization 3 Spectral estimation & super-resolution Control and system theory SDP in wide use
More informationTheorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1
Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be
More informationNotes 6 : First and second moment methods
Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative
More information