arxiv: v1 [stat.ml] 5 Jan 2015
|
|
- Elmer Jackson
- 5 years ago
- Views:
Transcription
1 To appear on the Annals of Statistics SUPPLEMENTARY MATERIALS FOR INNOVATED INTERACTION SCREENING FOR HIGH-DIMENSIONAL NONLINEAR CLASSIFICATION arxiv: v1 [stat.ml] 5 Jan 015 By Yingying Fan, Yinfei Kong, Daoi Li and Zemin Zheng University of Southern California APPENDIX A: PROOFS FOR PROPOSITION 1 AND MAIN LEMMAS In this section, we prove all key Lemmas in the order they appear in the main text. Additional Lemmas and their proofs are provided in Appendix B. Deviation of sub-gaussian distribution. Recall that a random vector w = (W 1,,W p ) T R p is sub-gaussian if there exist some positive constants a and b such that P( v T w > t) aexp( bt ) for any t > 0 and any vector v R p satisfying v = 1. Suppose w = (W 1,,W p ) is sub-gaussian with constants a, b, mean µ and covariance matrix Σ. Let w 1,,w n be n independent copies of w. Since w is sub-gaussian, by Lemma 10, we have w µ is also sub-gaussian. Then there exists some constants ã and b such that P( v T (w i µ)(w i µ) T v > x) ãexp( bx) for all x > 0 and v = 1, which implies that E{exp[tv T (w i µ)(w i µ) T v]} < for all 0 < t < b and v = 1. Similar as in the proof of Lemma 3 in [1], we know that there exist some constants C > 0 and ρ > 0 depending on ã and b, such that for all 0 < x < ρ and any unit vector v = 1, (A.1) P{ (1/n) v T (w i µ)(w i µ) T v v T Σv > x} Ce nxρ/. 1
2 Y. FAN, Y. KONG, D. LI AND Z. ZHENG A.1. Proof of Proposition 1. The main idea is to prove that our test statistic D converges to its population counterpart D uniformly over. Since Condition 3 ensures that D is bounded from below for A 1 and D = 0 for A c 1, the uniform convergence of D will imply the results in Proposition 1. We proceed to prove the uniform convergence of D to D. To this end, we decompose the difference between them into the following five terms: (A.) D D log( σ /σ ) log [ (1) +π ( σ ) /(σ (1) ) ] log [ () +(1 π) ( σ ) /(σ () ) ] + n 1 /n π log [ ( σ (1) ) ] + n /n (1 π) log [ ( σ () ) ]. We will establish successively the deviation bounds of the terms on the right hand side above. The same notation C will be used to denote a generic constant without loss of generality. By Lemma 7, the estimators σ converge to σ uniformly over all = 1,,p, with probability at least 1 pexp( C τ 1,p n1 κ ). Define this event as E. We will condition on the event E hereafter. Since x 1 n log(1+x n) 1 as x n 0, it follows that log( σ /σ )/( σ /σ 1) 1 uniformlyfor all as n. Thus,uniformlyover all = 1,,p, with sufficiently large n, we have the following bound for the first term log( σ /σ ) on the right hand side of (A.) (A.3) P( log( σ /σ ) > 4 1 cn κ E) P( σ /σ 1 > 8 1 cn κ E), where constants c and κ are defined in Condition 3. Then (A.3) together with (A.7) in the proof of Lemma 7 entails that (A.4) P( log( σ /σ ) > 4 1 cn κ E) exp( C τ 1,p n1 κ ). Using the same arguments, for either k = 1 or, we can prove that (A.5) P( log[( σ (k) ) /(σ (k) ) ] > 4 1 cn κ E) exp( C τ 1,p n1 κ ). By the proof of Lemma 6, we know that τ,p πλ max (Ω 1 )+(1 π)λ max (Ω 1 Σ Ω 1 ). Since ( σ (1) ) and ( σ () ) can be bounded from above by λ max (Ω 1 ) and λ max (Ω 1 Σ Ω 1 ), respectively, we know that ( σ (k) ) can be bounded from
3 INNOVATED INTERACTION SCREENING 3 above by π 1 (1 π) 1 τ,p for k = 1,. As n 1 = n i, by Hoeffding s inequality we get (A.6) P( n 1 /n π log[( σ (1) ) ] > 4 1 cn κ E) P( 1 cn κ i π > n 4 log[π 1 (1 π) 1 τ,p ] E) nc n κ exp( 16log [π 1 (1 π) 1 τ,p ] ) exp{ Cn1 κ /log ( τ,p )}. Similarly, we have (A.7) P( n /n (1 π) log[( σ () ) ] > 4 1 cn κ E) exp{ Cn 1 κ /log ( τ,p )}. In view of (A.), we have (A.8) P( D D > cn κ E) P( log( σ /σ ) > 4 1 cn κ E)+P( log[( σ (1) ) /(σ (1) ) ] > 4 1 cn κ E) + P( log[( σ (1) ) /(σ (1) ) ] > 4 1 cn κ E)+P( n 1 n π log[( σ(1) ) ] > 4 1 cn κ E) + P( n n (1 π) log[( σ() ) ] > 4 1 cn κ E). Combining the probability bounds in (A.4), (A.5), (A.6) and (A.7) gives P( D D > cn κ E) 3exp( C τ 1,p n1 κ )+exp{ Cn 1 κ /log ( τ,p )} It follows that exp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}. P( max 1 p D D > cn κ E) p P( D D > cn κ E) =1 pexp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}, which is the deviation of the statistic D from its population counterpart D. By Lemma 7, P(E c ) pexp( C τ 1,p n1 κ ). Thus we have P(max 1 p D D > cn κ ) P(max 1 p D D > cn κ E)+P(E c ) pexp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}.
4 4 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Therefore, for any p satisfying logp = O(n γ ) with γ > 0, γ +κ < 1 and τ 1,p +log ( τ,p ) = o(n 1 κ γ ), it follows that P( max 1 p D D > cn κ ) exp{n γ Cn 1 κ /[ τ 1,p +log ( τ,p )]} exp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}. By Condition 3 and its discussion, we know that D 3cn κ when A 1, and D = 0 otherwise. It follows that {min D < cn κ } {max D > cn κ } { max D D > cn κ }, A 1 A c 1 {1,,p} whichshowsthatwithprobabilityatleast1 exp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}, min D cn κ and max D cn κ A 1 A c 1 for sufficiently large n. As the same conditions hold for the covariance matrix Σ and the data after the second transformation, the results above also apply to the covariates in A with the test statistics calculated based on data transformed by Ω. This completes the proof of Proposition 1. A.. Lemma 1 and its proof. Lemma 1. Under model setting () and conditions in Theorem 1, for sufficiently large n, with probability at least 1 pexp( C τ 1,p n1 κ ), it holds that max 1 p ˆσ / σ 1 T n,p/6 for some positive constant C. Proof. Let Z = 1 n ( z 1,, z p ) with z = n z i/n for the original data matrix without transformation. We will first bound the term ˆσ σ by writing it as ˆσ σ = e T [ Ω 1 (Z Z) T (Z Z) Ω 1 Ω 1 (Z Z) T (Z Z)Ω 1 ]e /n. Through a further decomposition, it gives ˆσ σ = et ( Ω 1 Ω 1 )(Z Z) T (Z Z)( Ω 1 Ω 1 )e /n (A.9) +e T ( Ω 1 Ω 1 )(Z Z) T (Z Z)Ω 1 e /n.
5 INNOVATED INTERACTION SCREENING 5 We will then bound the two terms on the right hand side above separately. Recall that max denotes the componentwise infinity norm for a matrix. For the first term, we have e T ( Ω 1 Ω 1 )(Z Z) T (Z Z)( Ω 1 Ω 1 )e /n (Z Z) T (Z Z) max ( Ω 1 Ω 1 )e 1 /n (Z Z) T (Z Z) max [(K p +K p) C 1 K p (logp)/n] /n = ( (Z Z) T (Z Z) max /n) [(K p +K p ) C 1 K4 p (logp)/n], where the second inequality follows from the definition of acceptable estimator and the fact that Ω 1 Ω 1 is (K p +K p )-sparse. For the second term, similarly we get e T ( Ω 1 Ω 1 )(Z Z) T (Z Z)Ω 1 e /n (Z Z) T (Z Z) max ( Ω 1 Ω 1 )e 1 Ω 1 e 1 /n ( (Z Z) T (Z Z) max /n) [(K p +K p )C 1K p (logp)/n] Kp Ω 1 max. Since Ω 1 max is assumed to be upper bounded in Condition 4, and the above two bounds are independent of the index, in view of (A.9), we know that there exists some constant C such that (A.10) max 1 p ˆσ σ C( (Z Z) T (Z Z) max /n)[(k p +K p )K3 p (logp)/n] max{(k p +K p)k p (logp)/n,1}. This together with the definition of T n,p before Theorem 1 ensures that max 1 p ˆσ σ C( (Z Z) T (Z Z) max /n) T n,p τ 1,p /( C 1 τ,p ). By Lemma 6, σ are uniformly bounded from below by τ 1,p. By Lemma 7, max 1 p σ /σ 1 cn κ /8. Combining these two results entails that for n large enough, σ (1 cn κ /8)σ > τ 1,p/, uniformly for all, with probability at least 1 pexp( C τ 1,p n1 κ ). Denote by E the event that the results in Lemma 7 hold. By Lemma 8, when C 1 1c 3 C, we have (A.11) P(max 1 p ˆσ / σ 1 > T n,p/6 E) P(max 1 p ˆσ σ > T n,p τ 1,p /1 E) P( (Z Z) T (Z Z) max /n > c 3 τ,p E) p exp( C n).
6 6 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Under the conditions in Theorem 1, we have logp = o(n γ ) with γ > 0 and γ +κ < 1. It follows that P( max 1 p ˆσ / σ 1 > T n,p /6) P( max 1 p ˆσ / σ 1 > T n,p /6 E)+P(E c ) p exp( C n)+pexp( C τ 1,pn 1 κ ) = pexp( C τ 1,pn 1 κ ), where we use the same notation C here to denote a generic constant without lost of generality. This completes the proof of Lemma 1. A.3. Lemma and its proof. Lemma. Under Condition 5, we have (A.1) Cn 1 Xδ +pen( θ) n 1 ε T X δ 1 +pen(θ 0 ), where C is some positive constant depending on the positive constant π min in Condition 6, and δ = θ θ 0 is the estimation error for the regularized estimator θ defined in (18), and ε = y E(y X) with y = ( 1,, n ) T. Proof. Definel n (θ) = n 1 n l n(x T i θ, i)wherel n (x T θ, ) = x T θ+ log[1+exp(x T θ)]. Then, in matrix form, l n (θ) can be rewritten as l n (θ) = n 1 {y T Xθ 1 T b(xθ)}, where y = ( 1,, n ) T is an n-dimensional response vector with i {0,1}, X = (x 1,,x n ) T = ( x 1,, x p ) is an n p augmented design matrix, 1 is an n-dimensional vector with each component being one, b(β) = (b(β 1 ),,b(β n )) T is a vector-valued function with β i = x T i θ and b(u) = log[1+exp(u)]. By the definition of θ, we have l n (θ)+pen(θ) l n (θ 0 )+pen(θ 0 ) where θ 0 is the true regression coefficient vector of θ. Rearranging terms yields (A.13) n 1 1 T [b(x θ) b(xθ 0 )]+pen( θ) n 1 y T Xδ +pen(θ 0 ), where δ = θ θ 0 is the estimation error. Applying Taylor expansion to the function of 1 T b(x θ) at θ 0 gives (A.14) 1 T [b(x θ) b(xθ 0 )] = [b (Xθ 0 )] T Xδ + 1 δ T X T HXδ, where H = H(X, θ 1,, θ n ) = diag{b (x T 1 θ 1 ),,b (x T n θ n )} is an n n diagonal matrix with θ i R p lying on the line segment adoining θ and θ 0, i = 1,,n. Combining (A.13) and (A.14) and rearranging terms yield (A.15) (n) 1 δ T X T HXδ +pen( θ) n 1 ε T Xδ +pen(θ 0 ),
7 INNOVATED INTERACTION SCREENING 7 where ε = y E(y X) = y b (Xθ 0 ). The right hand side of the above inequality can be bounded as (A.16) n 1 ε T Xδ +pen(θ 0 ) n 1 ε T X δ 1 +pen(θ 0 ). By Condition 5, we have δ T X T HXδ C Xδ for some positive constant C, which depends on the constant π min in Condition 5. This inequality, together with (A.15) and (A.16), completes the proof. A.4. Lemma 3 and its proof. Lemma 3. Assume that Condition 1 holds. If log(p) = o(n), then with probability 1 O(p c 1 ), we have n 1 ε T X 1 c 0 log(p)/n, where c0 is some positive constant and ε = y E(y X) with y = ( 1,, n ) T. Proof. Recall that X = (x 1,,x n ) T = ( x 1,, x p ). An application of the Bonferroni inequality gives that (A.17) P( n 1 ε T X > λ 0 ) p =1 P( n 1 ε T x > λ 0 ) for any λ 0. The key idea is to bound P( n 1 ε T x > λ 0 ). To this end, consider the following three cases. Case 1: = 1. In this case, x = 1, where 1 is a n-dimensional vector with each component being one. Recall that ε = (ε 1,,ε n ) T with ε i = i E( i x i ) and i {0,1}. So 1 ε i 1 for i = 1,,n. Thus, by Hoeffding s inequality [3], we have P( n 1 ε T x > λ 0 ) = P( n 1 ε T 1 > λ 0 ) exp ( nλ 0 /). Case : p + 1. In this case, x = (Z 1, 1,,Z n, 1 ) T. Thus, n 1 ε T x = n 1 n ε iz i, 1. From Lemma 9, under Condition 1, we have that z = (Z 1,,Z p ) T is sub-gaussian, that is, there exist some positive constants a 1 and b 1 such that P( v T z > t) a 1 exp( b 1 t )
8 8 Y. FAN, Y. KONG, D. LI AND Z. ZHENG for any vector v R p satisfying v = 1 and any t > 0. Therefore, for any p + 1, taking v = e 1 being a unit vector with the ( 1) component being one and zero elsewhere in the inequality above gives P( Z 1 > t) a 1 exp( b 1 t ) holds uniformly for all p+1. By Lemma 11, we have E(e b 1Z 1 / ) 1+a 1. This, together with the inequality ab (a +b )/ for any a,b > 0, gives e (b 1/) ε i Z i, 1 e b 1(ε i +Z i, 1 )/4 = e b 1ε i /4 e b 1Z i, 1 /4 (A.18) (e b 1ε i / +e b 1Z i, 1 / )/ (e b 1/ +1+a 1 )/ for all 1 i n and p + 1, where we have used the fact that 1 ε i 1 in the last inequality. Thus, it follows from Lemma 1 that for any 0 < λ 0 1, there exist some positive constants a and b such that P( n 1 ε T x λ 0 ) = P( n 1 ε i Z i, 1 λ 0 ) a exp( b nλ 0) for all p+1. Case 3: p + p. In this case, x = (Z 1,k Z 1,l,,Z n,k Z n,l ) T for some 1 k l p. We can use similar arguments for Case to bound P( n 1 ε T x λ 0 ). Similarly to (A.18), we have e (b 1/) ε i Z ik Z il e (b 1/) Z ik Z il e b 1(Z ik +Z il )/4 = e b 1Z ik /4 e b 1Z il /4 (e b 1Z ik / +e b 1Z il / )/ 1+a 1, for all 1 i n and all 1 k l p. This together with Lemma 1 gives that for any 0 < λ 0 1, there exist some positive constants a 3 and b 3 such that P( n 1 ε T x λ 0 ) = P( n 1 ε i Z ik Z il λ 0 ) a 3 exp( b 3 nλ 0 ) for all p+ p. Combining Cases 1-3 above yields (A.19) P( n 1 ε T x λ 0 ) a 4 exp( b 4 nλ 0) for any 0 λ 0 1, wherea 4 = max{,a,a 3 } and b 4 = min{1/,b,b 3 }. Let λ 0 = 1 c 0 log(p)/n with some positive constant c0. Since log(p) = o(n),
9 INNOVATED INTERACTION SCREENING 9 we have 0 < λ 0 1 for all sufficiently large n. In view of (A.17) and (A.19), we have P( n 1 ε T X > 1 c 0 log(p)/n) pa4 exp{ 4 1 b 4 c 0log(p)} 3a 4 p (4 1 b 4 c 0 ), wherec 0 > 8/b 4 andwehaveusedthefactthat p = 1+p+p(p+1)/ 3p. Thus, we conclude that n 1 ε T X c 0 log(p)/n holds with probability at least 1 O(p c 1 ) with c 1 = 4 1 b 4 c 0 > 0. A.5. Lemma 4 and its proof. Lemma 4. Assume that there exists some constant φ > 0 such that (A.0) δ T Σδ φ δ T Sδ S for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S, where Σ = E(x T x). If both z (1) and z () are sub-gaussian, 5s 1/ + 4λ 1 1 λ θ 0 = O(n ξ/ ), and log(p) = o(n 1/ ξ ) with constant 0 ξ < 1/4, then with probability at least 1 O(p c ), n 1/ Xδ (φ/) δ S holds for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S when n is sufficiently large. Proof. The idea is to show that the desired inequality holds conditioning on an event and the probability of this event occurring is at most O(p c ) with some positive constant c. Conditioning the event E 4 = { n 1 X T X Σ } < C 1 n ξ where positive constant C 1 will be specified later, we have δ T (n 1 X T X Σ)δ < C 1 n ξ δ 1 = C 1n ξ ( δ S 1 + δ S c 1 ) C 1 n ξ (s 1/ δ S + δ S c 1 ), where the last inequality follows from the Cauchy-Schwarz inequality. The following arguments are all conditioning on the event E 4. Thus, for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S, we obtain δ T (n 1 X T X Σ)δ C 1 n ξ (5s 1/ +4λ 1 1 λ θ 0 ) δ S.
10 10 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Since 5s 1/ + 4λ 1 1 λ θ 0 = O(n ξ/ ), there exists some positive constant C such that 5s 1/ +4λ 1 1 λ θ 0 C n ξ/. Thus, δ T (n 1 X T X Σ)δ C 1 C δ S for any δ R p satisfying δ S c 1 4(s 1/ + λ 1 1 λ θ 0 ) δ S. Note that n 1 δ T X T Xδ = δ T (n 1 X T X Σ)δ+δ T Σδ. This, together with the above inequality and the assumption (A.0), yields n 1 δ T X T Xδ C 1 C δ S +δ T Σδ (φ C 1 C ) δ S. Choose C 1 = 3φ /(4C ). Thus, with probability 1 P(Ec 4 ), we have n 1/ Xδ (φ/) δ S for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S. It remains to show that P(E4 c) O(p c ) with some positive constant c. For any matrix A, denote by A the entrywise matrix infinity norm of A and (A) kl the (k,l) entry of A. Since both z (1) and z () are sub-gaussian, by Lemmas 9, 10, 11, and 13, we have P(E4) c = P { n 1 X T X Σ K /(C)n ξ} p p k=1 l=1 P{ (n 1 X T X Σ) kl K /(C )n ξ } p aexp( 4 1 bc K n1/ ξ ) O(p c ) for all n sufficiently large, where a,b, c are positive constants and the last inequality holds since p = 1+p+p(p+1)/ and log(p) = o(n 1/ ξ ). This completes the proof of Lemma 4. A.6. Lemma 5 and its proof. Lemma 5. Assume that w = (W 1,,W p ) T R p is sub-gaussian. Then for any positive constant c 1, there exists some positive constant C such that { } P max W > C log(p) = O(p c 1 ). 1 p Proof. Since w = (W 1,,W p ) T is sub-gaussian, there exist some positive constants c 1 and c such that P( v T w > t) c 1 exp( c t ) for any
11 INNOVATED INTERACTION SCREENING 11 v R p satisfying v = 1 and any t > 0. Taking v be a unit vector with th component 1 and all other components 0 yields P( W > t) c 1 exp( c t ) for all 1 p. An application of the Bonferroni inequality gives P( max 1 p W > t) p P( W > t) p c 1 exp( c t ). =1 If we choose t = C log(p) with some positive constant C = 1+c 1 / c, } then we have P {max 1 p W > C log(p) = O(p c 1 ). This completes the proof of Lemma 5. APPENDIX B: PROOFS FOR SECONARY LEMMAS B.1. Lemma 6 and its proof. Lemma 6. Under Condition, it holds that τ 1,p λ min [cov( z)] λ max [cov( z)] τ,p, where τ 1,p = {πτ 1,p +(1 π)τ 1τ,p } 1 and τ,p = {πτ 1 1 +(1 π)τ 1 τ,p+ π(1 π)τ 1 µ 1 } exp(1). Proof. In order to prove Lemma 6, we will first calculate the covariance matrix of z = z (1) + (1 )z (). Recall that µ = E(z () ) = 0 and is a Bernoulli variable taking value 1 with probability π. We will apply the formula cov(z) = cov[e(z )] + E[cov(z )] to calculate the covariance matrix of z. For the first term cov[e(z )], we can calculate it as E(z ) = E(z (1) )+(1 )E(z () ) = µ 1, which gives cov[e(z )] = cov( )µ 1 µ T 1 = π(1 π)µ 1µ T 1. For the second term E[cov(z )], since =, (1 ) = 1 and (1 ) = 0, we have cov(z ) = E(zz T ) E(z )E(z ) T = {E(z (1) z (1)T ) µ 1 µ T 1}+(1 ) E(z () z ()T ) = Σ 1 +(1 )Σ.
12 1 Y. FAN, Y. KONG, D. LI AND Z. ZHENG After taking expectation on both sides, we get E[cov(z )] = πσ 1 +(1 π)σ. Thus we have cov(z) = πσ 1 +(1 π)σ +π(1 π)µ 1 µ T 1. Recall that z = Ω 1 z. It follows that cov( z) = Ω 1 cov(z)ω 1 = πω 1 +(1 π)ω 1 Σ Ω 1 +π(1 π)ω 1 µ 1 µ T 1Ω 1. Therefore, under Condition, we get λ max [cov( z)] πλ max (Ω 1 )+(1 π)λ max (Ω 1 Σ Ω 1 )+π(1 π)λ max (Ω 1 µ 1 µ T 1 Ω 1) πτ 1 1 +(1 π)τ 1 τ,p +π(1 π)τ 1 µ 1 ; λ min [cov( z)] πλ min (Ω 1 )+(1 π)λ min (Ω 1 Σ Ω 1 )+π(1 π)λ min (Ω 1 µ 1 µ T 1 Ω 1) πτ 1,p +(1 π)τ 1τ,p. It completes the proof of Lemma 6. B.. Lemma 7 and its proof. Lemma 7. Under model setting () and conditions in Proposition 1, for sufficiently large n, with probability at least 1 pexp( C τ 1,p n1 κ ), it holds that max 1 p σ /σ 1 cn κ /8, for some positive constant C, where c and κ are constants defined in Condition 3. Proof. Wewillfirstdecompose σ σ intoseveral terms,andthenprove deviation bounds for each term. Denote Ω 1 by the operator norm of Ω 1. Note that under Condition, Ω 1 is bounded from above by constant τ 1 1. So z(k) = Ω 1 z (k) for k = 1, are also sub-gaussian distributed. Recall that Z = ZΩ 1 is the transformed data matrix. Denote by Z = ( z i ) n p. Then z i are independent and identically distributed across i with mixture sub-gaussian distribution and variance σ. Since σ is the pooled sample variance estimate for the th transformed feature Z, we have σ = ( z i z ) /n,
13 INNOVATED INTERACTION SCREENING 13 where z = n z i/n is the pooled sample mean estimate for Z. Let µ = E( z i ) and µ (1) = E( z (1) i ). It is clear that µ = π µ (1). By some simple calculation, we have the following decomposition for σ σ, σ σ = ([ z i µ ] σ )/n ( z µ ). Since z i = i z (1) i +(1 i ) z () i, we have z i µ = i ( z (1) i µ (1) )+(1 i ) z () i +( i π) µ (1), where z (1) i µ(1) and z () i are sub-gaussian distributed with mean 0 and variances (σ (1) ) and (σ () ), respectively. By the proof of Lemma 6, we know that σ = π(σ(1) ) + (1 π)(σ () ) + π(1 π)( µ (1) ). As i = i, (1 i ) = 1 i and i (1 i ) = 0, by replacing z i with i z (1) i +(1 i) z () i and spreading out the terms, we can further decompose σ σ as σ σ =1 n { i [( z (1) i ) (σ (1) ) ]+(1 i )[( z () i ) (σ () ) ] +[( i π) π(1 π)]( µ (1) ) +( i π)[(σ (1) ) (σ () ) ] +( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ]} ( z µ ). Let S = {1 i n : i = 1} and ε n be any positive sequence such that ε n 0 and ε n τ 1,p 0 as n. It follows from the above decomposition
14 14 Y. FAN, Y. KONG, D. LI AND Z. ZHENG that (A.1) P( σ σ > ε nσ /) P(1 n i S +P( 1 n i S c [( z () i ) (σ () ) ] > ε nσ 1 ) [( z (1) i ) (σ (1) ) ] > ε nσ 1 ) +P( 1 n [( i π) π(1 π)]( µ (1) ) > ε nσ 1 ) +P( 1 n ( +P( 1 n i nπ)[(σ (1) ) (σ () ) ] > ε nσ 1 ) ( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ] > ε nσ 1 ) +P{( z µ ) > ε nσ 1 }. We will bound the six terms on the right hand side above one by one using some deviation results. Thesame notation C will beusedto denote a generic positive constant without loss of generality. Recall that n 1 = n i and n = n n 1. By Lemma 6, σ are uniformly boundedfrombelowby τ 1,p.Thus,bythedeviationofsub-Gaussianin(A.1), conditioning on any realization of = ( 1,, n ) T, we obtain P( 1 n i S P( 1 n 1 i S exp( Cε n τ 1,p [( z (1) i ) (σ (1) ) ] > ε nσ 1 ) [( z (1) i ) (σ (1) ) ] > ε n τ 1,p 1 ) Cexp( ρε n τ 1,pn 1 /1 ) i ). Since ε n τ 1,p 0 as n and n i is a Binomial random variable with probability of success π, for sufficiently large n such that ε n τ 1,p is small
15 INNOVATED INTERACTION SCREENING 15 enough, taking expectation on both sides above yields (A.) P( 1 n i S [( z (1) i ) (σ (1) ) ] > ε nσ 1 ) E{exp( Cε n τ 1,p = {1 π +πexp( Cε n τ 1,p )}n = exp{nln[1 π +πexp( Cε n τ 1,p )]} i )} exp{ nπ[1 exp( Cε n τ 1,p )]} exp( nπcε n τ 1,p ) = exp( Cε n τ 1,p n), where we have used the inequalities that log(1+x) x for any x > 1, and 1 exp( x) x for sufficiently small x > 0. This gives an upper bound on the first term in (A.1). Similar to (A.), the second term in (A.1) can be bounded as (A.3) P( 1 n i S c [( z () i ) (σ () ) ] > ε nσ 1 ) exp( Cε n τ 1,p n). As z (1) is sub-gaussian distributed, it follows from Lemma 11 that µ (1) are uniformly bounded from above by some positive constant across. Since i are independently Bernoulli distributed with success probability π, we know that E{( i π) } = π(1 π) and ( i π) are i.i.d. and bounded from above by 1. By Hoeffding s inequality, we have the following bound for the third term in (A.1), (A.4) P( 1 n [( i π) π(1 π)]( µ (1) ) > ε nσ 1 ) P( 1 n ( i π) π(1 π) > ε n τ 1,p ε 1( µ (1) exp( n τ 1,p n ) ) 1 ( µ (1) ) 4) exp( Cε n τ 1,pn). Due to the fact that σ = π(σ(1) ) + (1 π)(σ () ) + π(1 π)( µ (1) ), we have σ π(1 π)[(σ (1) ) + (σ () ) ]. Similar to (A.4), by applying
16 16 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Hoeffding s inequality, the fourth term in (A.1) can be bounded as (A.5) P( 1 n ( i nπ)[(σ (1) ) (σ () ) ] > ε nσ 1 ) P( 1 σ i π [ n π(1 π) ] > ε nσ 1 ) P( 1 n exp( nπ (1 π) ε n 1 ) exp( Cε n τ 1,p n). For the fifth term in (A.1), we first decompose it as (A.6) P( 1 n i π > π(1 π)ε n ) 1 ( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ] > ε nσ 1 ) P( 1 n (1 π)( z (1) i ) π z () i ] > ε nσ i S i S c 1 µ (1) ) P( 1 n (1 π)( z (1) i ) > ε nσ i S 4 µ (1) )+P( 1 n π z () i > ε nσ i S c 4 µ (1) ). Recall that σ π(σ(1) ) and max 1 p µ (1) can be bounded from above by some positive constant. Applying Bernstein s inequality to the sum of independent sub-gaussian random variables yields P( 1 n (1 π)( z (1) i ) > ε nσ i S 4 µ (1) ) P( 1 ( z (1) i ) πεn σ > n 1 i S σ (1) 48(1 π) µ (1) ) πε exp( n τ 1,pn 1 96 (1 π) ( µ (1) ) ) exp( Cεn τ 1,p i ). Applying the same argument as in (A.), taking expectation on both sides above then we can obtain P( 1 n i S (1 π)( z (1) i ) > ε nσ 4 µ (1) ) exp( Cε n τ 1,p n).
17 INNOVATED INTERACTION SCREENING 17 Similarly we have ( 1 P n π z () i > ε ) nσ i S c 4 µ (1) exp( Cε n τ 1,pn). In view of (A.6), the two bounds we obtained yield (A.7) P( 1 n ( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ] > ε nσ 1 ) exp( Cε n τ 1,pn), which is the upper bound for the fifth term in (A.1). Applying Bernstein s inequality similarly to the sixth term in (A.1) gives P{( z µ ) > ε nσ 1 } P( z µ εn > σ 1 ) exp( ε nn 4 ). Combining the six bounds for the terms in (A.1) that we have obtained yields (A.8) P( σ /σ 1 > ε n /) P( σ σ > ε n σ/) exp( Cε n τ 1,pn). It follows that (A.9) P(max 1 p σ /σ 1 > ε n/) P( σ /σ 1 > ε n/) 1 p pexp( Cε n τ 1,p n). Let ε n = cn κ /4 with constants c and κ defined in Condition 3. Since τ 1,p 1, we know that ε n τ 1,p 0 as n. Thus we can replace ε n with cn κ /4 in (A.9) and get (A.30) P( max 1 p σ /σ 1 > cn κ /8) pexp( C τ 1,pn 1 κ ). This completes the proof of Lemma 7. B.3. Lemma 8 and its proof. Lemma 8. Under model setting() and Condition, for sufficiently large n, with probability at least 1 p exp( C n), it holds that (Z Z) T (Z Z) max /n c 3 τ,p for some positive constants C and c 3 >, where Z = 1 n ( z 1,, z p ) with z = n z i/n and 1 n the n 1 column vector with all components 1.
18 18 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Proof. Thedeviation boundof (Z Z) T (Z Z) max /n can beobtained by bounding each component of (Z Z) T (Z Z). Recall that µ = E( z ) = πµ (1) with µ (1) = E(z (1) i ). Note that the th diagonal component of (Z Z) T (Z Z)/n can be bounded as (A.31) (z i z ) /n = (z i µ ) /n ( z µ ) (z i µ ) /n, Let Z = (z i ) n p, then the components of each row in Z are independent and identically distributed (i.i.d.) with a mixture sub-gaussian distribution which can be written as z i = i z (1) i + (1 i )z () i, where i are i.i.d. Bernoulli random variables. Denote by Σ 1 = (σ (1) i ) p p and Σ = (σ () i ) p p. Then for each = 1,,p, the random variables z (1) i µ (1) and z () i are independent across i with mean 0 and variances σ (1) and σ (), respectively. We will then bound n (z i µ ) /n by some deviation results. The same notation C will be used to denote a generic constant without loss of generality. For any 1 i n, we have the following decomposition for n (z i µ ) /n, (z i µ ) /n = n 1 { i (z (1) (A.3) i µ (1) ) +(1 i )(z () i ) +( i π) (µ (1) +( i π)µ (1) [ i (z (1) i µ (1) )+(1 i )(z () i )]}. Recall that S = {1 i n : i = 1}, n 1 = n i and n = n n 1. By Condition, both σ (1) and σ () can be uniformly bounded from above by τ,p across. For the first term in (A.3) and any positive constant c 3, applying the argument with sub-gaussian deviation and Taylor expansion similarly as in (A.) gives P(n 1 { i (z (1) P(n 1 i S i µ (1) ) } > τ,p +c 3 ) P(n 1 i S [(z (1) i µ (1) ) σ (1) ] > c 3) exp( C n). Similarly, we have for the second term in (A.3), P(n 1 (1 i )(z () i ) > τ,p +c 3 ) exp( C n). ) (z (1) i µ (1) ) > σ (1) +c 3)
19 INNOVATED INTERACTION SCREENING 19 Since µ (1) are uniformly bounded from above by some positive constant across from Lemma 11, similar to (A.4), by Hoeffding s inequality, we have the following bound for the third term in (A.3), P{n 1 ( i π) (µ (1) ) > π(1 π)(µ (1) ) +c 3 } exp( C n). For the last term in (A.3), similar to (A.7), by Bernstein s inequality we have P{n 1 ( i π)µ (1) [ i (z (1) i µ (1) )+(1 i )(z () i )] > c 3} exp( C n). In view of (A.3), by the similar argument as in (A.1), combining the four bounds we have obtained gives P( (z i µ ) /n > τ,p +C 3 ) exp( C n), where C 3 = π(1 π)(µ (1) ) + 4c 3. As shown in (A.31), n (z i z ) /n can be bounded from above by n (z i µ ) /n. Thus we have P( (z i z ) /n > τ,p +C 3 ) exp( C n). Then for the (i,)th component of (Z Z) T (Z Z), by the Cauchy- Schwarz inequality, it follows that P( (z ki z i )(z k z )/n > τ,p +C 3 ) k=1 P{[n 1 (z ki z i ) ][n 1 (z k z ) ] > [τ,p +C 3 ] } k=1 k=1 P{n 1 (z ki z i ) > τ,p +C 3 }+P{n 1 (z k z ) > τ,p +C 3 } k=1 exp( C n). Sincetheaboveinequalityholdsforany(i,)th componentof(z Z) T (Z Z), similar to (A.9), we conclude that k=1 P( (Z Z) T (Z Z) max /n > τ,p +C 3 ) p exp( C n)
20 0 Y. FAN, Y. KONG, D. LI AND Z. ZHENG As τ,p is allowed to diverge, we can choose some constant c 3 > such that c 3 τ,p > τ,p +C 3, which gives (A.33) P( (Z Z) T (Z Z) max /n > c 3 τ,p ) p exp( C n). It completes the proof of Lemma 8. B.4. Lemma 9 and its proof. Lemma 9. Assume that z (1) R p and z () R p are sub-gaussian, and follows a Bernoulli distribution with probability of success π. Let z = z (1) +(1 )z (). Then z is also sub-gaussian. Proof. Since z (1) R p is sub-gaussian, there exists some positive constants a 1 and b 1 such that P( v T z (1) > t) a 1 exp( b 1 t ) for any vector v R p satisfying v = 1 and any t > 0. Similarly, there exists some positive constants a and b such that P( v T z () > t) a exp( b t ) for any vector v R p satisfying v = 1 and any t > 0 since z () R p is sub-gaussian. Let a 3 = max{a 1,a } and b 3 = min{b 1,b }. Then we have P( v T z (k) > t) a 3 exp( b 3 t ) for k = 1,. This, together with the law of total probability, yields P( v T z > t) =P( v T z > t = 1)P( = 1)+P( v T z > t = 0)P( = 0) =πp( v T z (1) > t)+(1 π)p( v T z () > t) a 3 exp( b 3 t ). Thus z is sub-gaussian. B.5. Lemma 10 and its proof. Lemma 10. Ifarandomvectorw=(W 1,,W p ) T R p issub-gaussian, then so is w E(w). Proof. If w is sub-gaussian, then there exist some positive constants a 1 and b 1 such that P( v T w > t) a 1 exp( b 1 t )
21 INNOVATED INTERACTION SCREENING 1 for any vector v R p satisfying v = 1 and any t > 0. Let µ = E(w). Thus from the Cauchy-Schwarz inequality and Lemma 11, we have (v T µ) = E (v T w) E[(v T w) ] (1+a 1 )c 1 with c = b 1 /. Note that ( a b ) ( a + b ) (a +b ) for any a and b. Thus, for any vector v R p satisfying v = 1 and any t > 0, P{ v T [w E(w)] > t} P{ v T w + v T µ > t} P{(v T w) +(v T µ) > t } P{e c(vt w) > e ct / c(v T µ) } e ct /+c(v T µ) E[e c(vt w) ] (1+a 1 )e 1+a1 e ct / = a exp( b t ) with a = (1 + a 1 )e 1+a 1 and b = c/ = b 1 /4. Thus w E(w) is sub- Gaussian. B.6. Lemma 11 and its proof. Lemma 11. Let W be a nonnegative random variable such that P(W > t) aexp( bt ) for any t > 0, where a and b are positive constants. Then E[exp( 1 bw )] 1 + a and E(W m ) (1 + a)(/b) m m! for any integer m 0. Proof. Denote by F(t) the cumulative distribution function of W. Then for all x > 0, we have 1 F(t) = P(W > t) aexp( bt ). For any constant 0 < c < b, integration by parts yields E(e cw ) = e ct d[1 F(t)] = 1+ cate (b c)t dt = 1+ ca b c. Thus letting c = b/ gives E[exp( 1 bw )] 1+a. Note that m b m E(W m ) m! k=0 0 cte ct [1 F(t)]dt E( 1 bw ) k = E[exp( 1 bw )] 1+a. k! This implies E(W m ) (1+a)(/b) m m! for any integer m 0. This completes the proof of Lemma 11.
22 Y. FAN, Y. KONG, D. LI AND Z. ZHENG B.7. Lemma 1. Lemma 1 (Lemma 8 of Hao and Zhang []). Let W 1,,W n be independent random variables with zero mean. If E[exp(T 0 W i α )] c 1 for some constants T 0 > 0, c 1 > 0 and 0 < α 1, then there exist positive constants c and c 3 such that for any 0 < ε 1. P( n 1 W i > ε) c exp( c 3 n α ε ) B.8. Lemma 13 and its proof. Lemma 13. Assume that max 1 p E[exp( c 1 Z1 )] c holds with some positive constants c 1 and c. If for each 1 p, the random variables Z 1,,Z n are independent and identically distributed, then { } P n 1 [Z i E(Z i )] ε c 3 exp( c 4 nε ) { } P n 1 [Z i Z ik E(Z i Z ik )] ε c 3 exp( c 4 nε ) { } P n 1 [Z i Z ik Z il E(Z i Z ik Z il )] ε c 3 exp( c 4 n /3 ε ) { } P n 1 [Z ik Z il Z ik Z il E(Z ik Z il Z ik Z il )] ε c 3 exp( c 4 n 1/ ε ) for any 0 < ε < 1, where 1,k,l,k,l p, and c 3 and c 4 are generic positive constants which may vary from line to line. Proof. The idea of the proof is to use Lemma 1 and similar to that for Lemma 9 in Hao and Zhang []. So we omit the details here. REFERENCES [1] Cai, T., Zhang, C.-H., and Zhou, H.(008). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38, [] Hao, N. and Zhang, H. H. (014). Interaction Screening for Ultra-High Dimensional Data. J. Amer. Statist. Assoc., to appear. [3] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58,
23 INNOVATED INTERACTION SCREENING 3 Data Sciences and Operations Department Marshall School of Business University of Southern California Los Angeles, CA USA fanyingy@marshall.usc.edu daoili@marshall.usc.edu Department of Mathematics University of Southern California Los Angeles, CA USA zeminzhe@usc.edu Department of Preventive Medicine Keck School of Medicine University of Southern California Los Angeles, CA USA yinfeiko@usc.edu
SUPPLEMENTARY MATERIAL TO INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION
IPDC SUPPLEMENTARY MATERIAL TO INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION By Yinfei Kong, Daoji Li, Yingying Fan and Jinchi Lv California State University
More informationINNOVATED INTERACTION SCREENING FOR HIGH-DIMENSIONAL NONLINEAR CLASSIFICATION 1
The Annals of Statistics 2015, Vol. 43, No. 3, 1243 1272 DOI: 10.1214/14-AOS1308 Institute of Mathematical Statistics, 2015 INNOVATED INTERACTION SCREENING FOR HIGH-DIMENSIONAL NONLINEAR CLASSIFICATION
More informationHigh-dimensional covariance estimation based on Gaussian graphical models
High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationSupplementary Materials: Martingale difference correlation and its use in high dimensional variable screening by Xiaofeng Shao and Jingsi Zhang
Supplementary Materials: Martingale difference correlation and its use in high dimensional variable screening by Xiaofeng Shao and Jingsi Zhang The supplementary material contains some additional simulation
More informationAssumptions for the theoretical properties of the
Statistica Sinica: Supplement IMPUTED FACTOR REGRESSION FOR HIGH- DIMENSIONAL BLOCK-WISE MISSING DATA Yanqing Zhang, Niansheng Tang and Annie Qu Yunnan University and University of Illinois at Urbana-Champaign
More informationUC Berkeley Department of Electrical Engineering and Computer Sciences. EECS 126: Probability and Random Processes
UC Berkeley Department of Electrical Engineering and Computer Sciences EECS 6: Probability and Random Processes Problem Set 3 Spring 9 Self-Graded Scores Due: February 8, 9 Submit your self-graded scores
More informationarxiv: v1 [stat.me] 11 May 2016
Submitted to the Annals of Statistics INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION arxiv:1605.03315v1 [stat.me] 11 May 2016 By Yinfei Kong, Daoji Li, Yingying
More information2. Matrix Algebra and Random Vectors
2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns
More informationExercises and Answers to Chapter 1
Exercises and Answers to Chapter The continuous type of random variable X has the following density function: a x, if < x < a, f (x), otherwise. Answer the following questions. () Find a. () Obtain mean
More informationSubmitted to the Brazilian Journal of Probability and Statistics
Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a
More informationDISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania
Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the
More informationSTAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song
STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April
More informationAn Introduction to Signal Detection and Estimation - Second Edition Chapter III: Selected Solutions
An Introduction to Signal Detection and Estimation - Second Edition Chapter III: Selected Solutions H. V. Poor Princeton University March 17, 5 Exercise 1: Let {h k,l } denote the impulse response of a
More informationConvergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit
Convergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit Evan Kwiatkowski, Jan Mandel University of Colorado Denver December 11, 2014 OUTLINE 2 Data Assimilation Bayesian Estimation
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationThe Hilbert Space of Random Variables
The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2
More information04. Random Variables: Concepts
University of Rhode Island DigitalCommons@URI Nonequilibrium Statistical Physics Physics Course Materials 215 4. Random Variables: Concepts Gerhard Müller University of Rhode Island, gmuller@uri.edu Creative
More informationUniversal examples. Chapter The Bernoulli process
Chapter 1 Universal examples 1.1 The Bernoulli process First description: Bernoulli random variables Y i for i = 1, 2, 3,... independent with P [Y i = 1] = p and P [Y i = ] = 1 p. Second description: Binomial
More informationThe largest eigenvalues of the sample covariance matrix. in the heavy-tail case
The largest eigenvalues of the sample covariance matrix 1 in the heavy-tail case Thomas Mikosch University of Copenhagen Joint work with Richard A. Davis (Columbia NY), Johannes Heiny (Aarhus University)
More informationGaussian Elimination for Linear Systems
Gaussian Elimination for Linear Systems Tsung-Ming Huang Department of Mathematics National Taiwan Normal University October 3, 2011 1/56 Outline 1 Elementary matrices 2 LR-factorization 3 Gaussian elimination
More informationOptimal rates of convergence for estimating Toeplitz covariance matrices
Probab. Theory Relat. Fields 2013) 156:101 143 DOI 10.1007/s00440-012-0422-7 Optimal rates of convergence for estimating Toeplitz covariance matrices T. Tony Cai Zhao Ren Harrison H. Zhou Received: 17
More informationCorrelation dimension for self-similar Cantor sets with overlaps
F U N D A M E N T A MATHEMATICAE 155 (1998) Correlation dimension for self-similar Cantor sets with overlaps by Károly S i m o n (Miskolc) and Boris S o l o m y a k (Seattle, Wash.) Abstract. We consider
More informationTail bound inequalities and empirical likelihood for the mean
Tail bound inequalities and empirical likelihood for the mean Sandra Vucane 1 1 University of Latvia, Riga 29 th of September, 2011 Sandra Vucane (LU) Tail bound inequalities and EL for the mean 29.09.2011
More informationThe Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression
The Sparsity and Bias of The LASSO Selection In High-Dimensional Linear Regression Cun-hui Zhang and Jian Huang Presenter: Quefeng Li Feb. 26, 2010 un-hui Zhang and Jian Huang Presenter: Quefeng The Sparsity
More informationarxiv: v1 [math.pr] 22 May 2008
THE LEAST SINGULAR VALUE OF A RANDOM SQUARE MATRIX IS O(n 1/2 ) arxiv:0805.3407v1 [math.pr] 22 May 2008 MARK RUDELSON AND ROMAN VERSHYNIN Abstract. Let A be a matrix whose entries are real i.i.d. centered
More informationTHE DVORETZKY KIEFER WOLFOWITZ INEQUALITY WITH SHARP CONSTANT: MASSART S 1990 PROOF SEMINAR, SEPT. 28, R. M. Dudley
THE DVORETZKY KIEFER WOLFOWITZ INEQUALITY WITH SHARP CONSTANT: MASSART S 1990 PROOF SEMINAR, SEPT. 28, 2011 R. M. Dudley 1 A. Dvoretzky, J. Kiefer, and J. Wolfowitz 1956 proved the Dvoretzky Kiefer Wolfowitz
More informationGaussian vectors and central limit theorem
Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables
More informationLIST OF FORMULAS FOR STK1100 AND STK1110
LIST OF FORMULAS FOR STK1100 AND STK1110 (Version of 11. November 2015) 1. Probability Let A, B, A 1, A 2,..., B 1, B 2,... be events, that is, subsets of a sample space Ω. a) Axioms: A probability function
More information4 Sums of Independent Random Variables
4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables
More informationReconstruction from Anisotropic Random Measurements
Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013
More informationAdditive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535
Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June
More informationA One-to-One Code and Its Anti-Redundancy
A One-to-One Code and Its Anti-Redundancy W. Szpankowski Department of Computer Science, Purdue University July 4, 2005 This research is supported by NSF, NSA and NIH. Outline of the Talk. Prefix Codes
More informationSelected Exercises on Expectations and Some Probability Inequalities
Selected Exercises on Expectations and Some Probability Inequalities # If E(X 2 ) = and E X a > 0, then P( X λa) ( λ) 2 a 2 for 0 < λ
More informationESTIMATION IN A SEMI-PARAMETRIC TWO-STAGE RENEWAL REGRESSION MODEL
Statistica Sinica 19(29): Supplement S1-S1 ESTIMATION IN A SEMI-PARAMETRIC TWO-STAGE RENEWAL REGRESSION MODEL Dorota M. Dabrowska University of California, Los Angeles Supplementary Material This supplement
More informationBahadur representations for bootstrap quantiles 1
Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by
More information5: MULTIVARATE STATIONARY PROCESSES
5: MULTIVARATE STATIONARY PROCESSES 1 1 Some Preliminary Definitions and Concepts Random Vector: A vector X = (X 1,..., X n ) whose components are scalarvalued random variables on the same probability
More informationInvertibility of symmetric random matrices
Invertibility of symmetric random matrices Roman Vershynin University of Michigan romanv@umich.edu February 1, 2011; last revised March 16, 2012 Abstract We study n n symmetric random matrices H, possibly
More informationarxiv: v5 [math.na] 16 Nov 2017
RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem
More informationRandom Variables. P(x) = P[X(e)] = P(e). (1)
Random Variables Random variable (discrete or continuous) is used to derive the output statistical properties of a system whose input is a random variable or random in nature. Definition Consider an experiment
More informationA Multiple Testing Approach to the Regularisation of Large Sample Correlation Matrices
A Multiple Testing Approach to the Regularisation of Large Sample Correlation Matrices Natalia Bailey 1 M. Hashem Pesaran 2 L. Vanessa Smith 3 1 Department of Econometrics & Business Statistics, Monash
More informationDe-biasing the Lasso: Optimal Sample Size for Gaussian Designs
De-biasing the Lasso: Optimal Sample Size for Gaussian Designs Adel Javanmard USC Marshall School of Business Data Science and Operations department Based on joint work with Andrea Montanari Oct 2015 Adel
More informationPart IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015
Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.
More informationSupplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017
Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION By Degui Li, Peter C. B. Phillips, and Jiti Gao September 017 COWLES FOUNDATION DISCUSSION PAPER NO.
More informationarxiv: v1 [math.pr] 22 Dec 2018
arxiv:1812.09618v1 [math.pr] 22 Dec 2018 Operator norm upper bound for sub-gaussian tailed random matrices Eric Benhamou Jamal Atif Rida Laraki December 27, 2018 Abstract This paper investigates an upper
More informationNotes on Random Vectors and Multivariate Normal
MATH 590 Spring 06 Notes on Random Vectors and Multivariate Normal Properties of Random Vectors If X,, X n are random variables, then X = X,, X n ) is a random vector, with the cumulative distribution
More information{η : η=linear combination of 1, Z 1,, Z n
If 3. Orthogonal projection. Conditional expectation in the wide sense Let (X n n be a sequence of random variables with EX n = σ n and EX n 0. EX k X j = { σk, k = j 0, otherwise, (X n n is the sequence
More informationPh.D. Qualifying Exam Friday Saturday, January 6 7, 2017
Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Put your solution to each problem on a separate sheet of paper. Problem 1. (5106) Let X 1, X 2,, X n be a sequence of i.i.d. observations from a
More informationProbability Background
Probability Background Namrata Vaswani, Iowa State University August 24, 2015 Probability recap 1: EE 322 notes Quick test of concepts: Given random variables X 1, X 2,... X n. Compute the PDF of the second
More informationLecture 2: Review of Basic Probability Theory
ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent
More informationLectures 6: Degree Distributions and Concentration Inequalities
University of Washington Lecturer: Abraham Flaxman/Vahab S Mirrokni CSE599m: Algorithms and Economics of Networks April 13 th, 007 Scribe: Ben Birnbaum Lectures 6: Degree Distributions and Concentration
More informationChapter 7. Basic Probability Theory
Chapter 7. Basic Probability Theory I-Liang Chern October 20, 2016 1 / 49 What s kind of matrices satisfying RIP Random matrices with iid Gaussian entries iid Bernoulli entries (+/ 1) iid subgaussian entries
More informationRATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University
Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active
More information1 Exercises for lecture 1
1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )
More informationPhenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012
Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM
More informationAsymptotic Properties of an Approximate Maximum Likelihood Estimator for Stochastic PDEs
Asymptotic Properties of an Approximate Maximum Likelihood Estimator for Stochastic PDEs M. Huebner S. Lototsky B.L. Rozovskii In: Yu. M. Kabanov, B. L. Rozovskii, and A. N. Shiryaev editors, Statistics
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2
MA 575 Linear Models: Cedric E Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2 1 Revision: Probability Theory 11 Random Variables A real-valued random variable is
More informationLecture 22 Girsanov s Theorem
Lecture 22: Girsanov s Theorem of 8 Course: Theory of Probability II Term: Spring 25 Instructor: Gordan Zitkovic Lecture 22 Girsanov s Theorem An example Consider a finite Gaussian random walk X n = n
More informationSpectral representations and ergodic theorems for stationary stochastic processes
AMS 263 Stochastic Processes (Fall 2005) Instructor: Athanasios Kottas Spectral representations and ergodic theorems for stationary stochastic processes Stationary stochastic processes Theory and methods
More informationTHE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX
THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX MARK RUDELSON AND ROMAN VERSHYNIN Abstract. We prove an optimal estimate on the smallest singular value of a random subgaussian matrix, valid
More informationNonparametric Bayesian Uncertainty Quantification
Nonparametric Bayesian Uncertainty Quantification Lecture 1: Introduction to Nonparametric Bayes Aad van der Vaart Universiteit Leiden, Netherlands YES, Eindhoven, January 2017 Contents Introduction Recovery
More informationarxiv: v1 [math.ap] 17 May 2017
ON A HOMOGENIZATION PROBLEM arxiv:1705.06174v1 [math.ap] 17 May 2017 J. BOURGAIN Abstract. We study a homogenization question for a stochastic divergence type operator. 1. Introduction and Statement Let
More informationAP Calculus Testbank (Chapter 9) (Mr. Surowski)
AP Calculus Testbank (Chapter 9) (Mr. Surowski) Part I. Multiple-Choice Questions n 1 1. The series will converge, provided that n 1+p + n + 1 (A) p > 1 (B) p > 2 (C) p >.5 (D) p 0 2. The series
More informationTHE EXPONENTIAL DISTRIBUTION ANALOG OF THE GRUBBS WEAVER METHOD
THE EXPONENTIAL DISTRIBUTION ANALOG OF THE GRUBBS WEAVER METHOD ANDREW V. SILLS AND CHARLES W. CHAMP Abstract. In Grubbs and Weaver (1947), the authors suggest a minimumvariance unbiased estimator for
More informationHigh-dimensional Covariance Estimation Based On Gaussian Graphical Models
High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix
More informationTHE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX
THE SMALLEST SINGULAR VALUE OF A RANDOM RECTANGULAR MATRIX MARK RUDELSON AND ROMAN VERSHYNIN Abstract. We prove an optimal estimate of the smallest singular value of a random subgaussian matrix, valid
More informationONLINE APPENDIX TO: NONPARAMETRIC IDENTIFICATION OF THE MIXED HAZARD MODEL USING MARTINGALE-BASED MOMENTS
ONLINE APPENDIX TO: NONPARAMETRIC IDENTIFICATION OF THE MIXED HAZARD MODEL USING MARTINGALE-BASED MOMENTS JOHANNES RUF AND JAMES LEWIS WOLTER Appendix B. The Proofs of Theorem. and Proposition.3 The proof
More informationLecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1
Random Walks and Brownian Motion Tel Aviv University Spring 011 Lecture date: May 0, 011 Lecture 9 Instructor: Ron Peled Scribe: Jonathan Hermon In today s lecture we present the Brownian motion (BM).
More information6.1 Moment Generating and Characteristic Functions
Chapter 6 Limit Theorems The power statistics can mostly be seen when there is a large collection of data points and we are interested in understanding the macro state of the system, e.g., the average,
More informationarxiv: v6 [math.st] 25 Nov 2010
Confidence Interval for the Mean of a Bounded Random Variable and Its Applications in Point arxiv:080.458v6 [math.st] 5 Nov 010 Estimation Xinjia Chen November, 010 Abstract In this article, we derive
More informationRandom matrices: Distribution of the least singular value (via Property Testing)
Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued
More informationA simple branching process approach to the phase transition in G n,p
A simple branching process approach to the phase transition in G n,p Béla Bollobás Department of Pure Mathematics and Mathematical Statistics Wilberforce Road, Cambridge CB3 0WB, UK b.bollobas@dpmms.cam.ac.uk
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of
More informationMULTI-RESTRAINED STIRLING NUMBERS
MULTI-RESTRAINED STIRLING NUMBERS JI YOUNG CHOI DEPARTMENT OF MATHEMATICS SHIPPENSBURG UNIVERSITY SHIPPENSBURG, PA 17257, U.S.A. Abstract. Given positive integers n, k, and m, the (n, k)-th m- restrained
More informationLarge Sample Theory. Consider a sequence of random variables Z 1, Z 2,..., Z n. Convergence in probability: Z n
Large Sample Theory In statistics, we are interested in the properties of particular random variables (or estimators ), which are functions of our data. In ymptotic analysis, we focus on describing the
More information7. MULTIVARATE STATIONARY PROCESSES
7. MULTIVARATE STATIONARY PROCESSES 1 1 Some Preliminary Definitions and Concepts Random Vector: A vector X = (X 1,..., X n ) whose components are scalar-valued random variables on the same probability
More informationRandom regular digraphs: singularity and spectrum
Random regular digraphs: singularity and spectrum Nick Cook, UCLA Probability Seminar, Stanford University November 2, 2015 Universality Circular law Singularity probability Talk outline 1 Universality
More informationEcon 5150: Applied Econometrics Dynamic Demand Model Model Selection. Sung Y. Park CUHK
Econ 5150: Applied Econometrics Dynamic Demand Model Model Selection Sung Y. Park CUHK Simple dynamic models A typical simple model: y t = α 0 + α 1 y t 1 + α 2 y t 2 + x tβ 0 x t 1β 1 + u t, where y t
More informationMarch 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.
Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)
More informationOn the convergence of sequences of random variables: A primer
BCAM May 2012 1 On the convergence of sequences of random variables: A primer Armand M. Makowski ECE & ISR/HyNet University of Maryland at College Park armand@isr.umd.edu BCAM May 2012 2 A sequence a :
More informationNETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA
Statistica Sinica 213): Supplement NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA Hokeun Sun 1, Wei Lin 2, Rui Feng 2 and Hongzhe Li 2 1 Columbia University and 2 University
More informationRandom Methods for Linear Algebra
Gittens gittens@acm.caltech.edu Applied and Computational Mathematics California Institue of Technology October 2, 2009 Outline The Johnson-Lindenstrauss Transform 1 The Johnson-Lindenstrauss Transform
More information8 Laws of large numbers
8 Laws of large numbers 8.1 Introduction We first start with the idea of standardizing a random variable. Let X be a random variable with mean µ and variance σ 2. Then Z = (X µ)/σ will be a random variable
More information1 Math 241A-B Homework Problem List for F2015 and W2016
1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More information19.1 Problem setup: Sparse linear regression
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In
More informationEE Applications of Convex Optimization in Signal Processing and Communications Dr. Andre Tkacenko, JPL Third Term
EE 150 - Applications of Convex Optimization in Signal Processing and Communications Dr. Andre Tkacenko JPL Third Term 2011-2012 Due on Thursday May 3 in class. Homework Set #4 1. (10 points) (Adapted
More informationEstimation of Treatment Effects under Essential Heterogeneity
Estimation of Treatment Effects under Essential Heterogeneity James Heckman University of Chicago and American Bar Foundation Sergio Urzua University of Chicago Edward Vytlacil Columbia University March
More informationMOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES
J. Korean Math. Soc. 47 1, No., pp. 63 75 DOI 1.4134/JKMS.1.47..63 MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES Ke-Ang Fu Li-Hua Hu Abstract. Let X n ; n 1 be a strictly stationary
More informationChapter 6: Random Processes 1
Chapter 6: Random Processes 1 Yunghsiang S. Han Graduate Institute of Communication Engineering, National Taipei University Taiwan E-mail: yshan@mail.ntpu.edu.tw 1 Modified from the lecture notes by Prof.
More informationChp 4. Expectation and Variance
Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.
More informationLecture 16: Sample quantiles and their asymptotic properties
Lecture 16: Sample quantiles and their asymptotic properties Estimation of quantiles (percentiles Suppose that X 1,...,X n are i.i.d. random variables from an unknown nonparametric F For p (0,1, G 1 (p
More informationINTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION 1
The Annals of Statistics 017, Vol. 45, No., 897 9 DOI: 10.114/16-AOS1474 Institute of Mathematical Statistics, 017 INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION
More informationSUPPLEMENT TO PAPER CONVERGENCE OF ADAPTIVE AND INTERACTING MARKOV CHAIN MONTE CARLO ALGORITHMS
Submitted to the Annals of Statistics SUPPLEMENT TO PAPER CONERGENCE OF ADAPTIE AND INTERACTING MARKO CHAIN MONTE CARLO ALGORITHMS By G Fort,, E Moulines and P Priouret LTCI, CNRS - TELECOM ParisTech,
More informationChapter 4 : Expectation and Moments
ECE5: Analysis of Random Signals Fall 06 Chapter 4 : Expectation and Moments Dr. Salim El Rouayheb Scribe: Serge Kas Hanna, Lu Liu Expected Value of a Random Variable Definition. The expected or average
More informationA remark on the maximum eigenvalue for circulant matrices
IMS Collections High Dimensional Probability V: The Luminy Volume Vol 5 (009 79 84 c Institute of Mathematical Statistics, 009 DOI: 04/09-IMSCOLL5 A remark on the imum eigenvalue for circulant matrices
More informationAsymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University
More informationELEMENTS OF PROBABILITY THEORY
ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable
More informationFormulas for probability theory and linear models SF2941
Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms
More informationOn prediction and density estimation Peter McCullagh University of Chicago December 2004
On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating
More information