arxiv: v1 [stat.ml] 5 Jan 2015

Size: px

Start display at page:

Download "arxiv: v1 [stat.ml] 5 Jan 2015"

Elmer Jackson
5 years ago
Views:

1 To appear on the Annals of Statistics SUPPLEMENTARY MATERIALS FOR INNOVATED INTERACTION SCREENING FOR HIGH-DIMENSIONAL NONLINEAR CLASSIFICATION arxiv: v1 [stat.ml] 5 Jan 015 By Yingying Fan, Yinfei Kong, Daoi Li and Zemin Zheng University of Southern California APPENDIX A: PROOFS FOR PROPOSITION 1 AND MAIN LEMMAS In this section, we prove all key Lemmas in the order they appear in the main text. Additional Lemmas and their proofs are provided in Appendix B. Deviation of sub-gaussian distribution. Recall that a random vector w = (W 1,,W p ) T R p is sub-gaussian if there exist some positive constants a and b such that P( v T w > t) aexp( bt ) for any t > 0 and any vector v R p satisfying v = 1. Suppose w = (W 1,,W p ) is sub-gaussian with constants a, b, mean µ and covariance matrix Σ. Let w 1,,w n be n independent copies of w. Since w is sub-gaussian, by Lemma 10, we have w µ is also sub-gaussian. Then there exists some constants ã and b such that P( v T (w i µ)(w i µ) T v > x) ãexp( bx) for all x > 0 and v = 1, which implies that E{exp[tv T (w i µ)(w i µ) T v]} < for all 0 < t < b and v = 1. Similar as in the proof of Lemma 3 in [1], we know that there exist some constants C > 0 and ρ > 0 depending on ã and b, such that for all 0 < x < ρ and any unit vector v = 1, (A.1) P{ (1/n) v T (w i µ)(w i µ) T v v T Σv > x} Ce nxρ/. 1

2 Y. FAN, Y. KONG, D. LI AND Z. ZHENG A.1. Proof of Proposition 1. The main idea is to prove that our test statistic D converges to its population counterpart D uniformly over. Since Condition 3 ensures that D is bounded from below for A 1 and D = 0 for A c 1, the uniform convergence of D will imply the results in Proposition 1. We proceed to prove the uniform convergence of D to D. To this end, we decompose the difference between them into the following five terms: (A.) D D log( σ /σ ) log [ (1) +π ( σ ) /(σ (1) ) ] log [ () +(1 π) ( σ ) /(σ () ) ] + n 1 /n π log [ ( σ (1) ) ] + n /n (1 π) log [ ( σ () ) ]. We will establish successively the deviation bounds of the terms on the right hand side above. The same notation C will be used to denote a generic constant without loss of generality. By Lemma 7, the estimators σ converge to σ uniformly over all = 1,,p, with probability at least 1 pexp( C τ 1,p n1 κ ). Define this event as E. We will condition on the event E hereafter. Since x 1 n log(1+x n) 1 as x n 0, it follows that log( σ /σ )/( σ /σ 1) 1 uniformlyfor all as n. Thus,uniformlyover all = 1,,p, with sufficiently large n, we have the following bound for the first term log( σ /σ ) on the right hand side of (A.) (A.3) P( log( σ /σ ) > 4 1 cn κ E) P( σ /σ 1 > 8 1 cn κ E), where constants c and κ are defined in Condition 3. Then (A.3) together with (A.7) in the proof of Lemma 7 entails that (A.4) P( log( σ /σ ) > 4 1 cn κ E) exp( C τ 1,p n1 κ ). Using the same arguments, for either k = 1 or, we can prove that (A.5) P( log[( σ (k) ) /(σ (k) ) ] > 4 1 cn κ E) exp( C τ 1,p n1 κ ). By the proof of Lemma 6, we know that τ,p πλ max (Ω 1 )+(1 π)λ max (Ω 1 Σ Ω 1 ). Since ( σ (1) ) and ( σ () ) can be bounded from above by λ max (Ω 1 ) and λ max (Ω 1 Σ Ω 1 ), respectively, we know that ( σ (k) ) can be bounded from

3 INNOVATED INTERACTION SCREENING 3 above by π 1 (1 π) 1 τ,p for k = 1,. As n 1 = n i, by Hoeffding s inequality we get (A.6) P( n 1 /n π log[( σ (1) ) ] > 4 1 cn κ E) P( 1 cn κ i π > n 4 log[π 1 (1 π) 1 τ,p ] E) nc n κ exp( 16log [π 1 (1 π) 1 τ,p ] ) exp{ Cn1 κ /log ( τ,p )}. Similarly, we have (A.7) P( n /n (1 π) log[( σ () ) ] > 4 1 cn κ E) exp{ Cn 1 κ /log ( τ,p )}. In view of (A.), we have (A.8) P( D D > cn κ E) P( log( σ /σ ) > 4 1 cn κ E)+P( log[( σ (1) ) /(σ (1) ) ] > 4 1 cn κ E) + P( log[( σ (1) ) /(σ (1) ) ] > 4 1 cn κ E)+P( n 1 n π log[( σ(1) ) ] > 4 1 cn κ E) + P( n n (1 π) log[( σ() ) ] > 4 1 cn κ E). Combining the probability bounds in (A.4), (A.5), (A.6) and (A.7) gives P( D D > cn κ E) 3exp( C τ 1,p n1 κ )+exp{ Cn 1 κ /log ( τ,p )} It follows that exp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}. P( max 1 p D D > cn κ E) p P( D D > cn κ E) =1 pexp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}, which is the deviation of the statistic D from its population counterpart D. By Lemma 7, P(E c ) pexp( C τ 1,p n1 κ ). Thus we have P(max 1 p D D > cn κ ) P(max 1 p D D > cn κ E)+P(E c ) pexp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}.

4 4 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Therefore, for any p satisfying logp = O(n γ ) with γ > 0, γ +κ < 1 and τ 1,p +log ( τ,p ) = o(n 1 κ γ ), it follows that P( max 1 p D D > cn κ ) exp{n γ Cn 1 κ /[ τ 1,p +log ( τ,p )]} exp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}. By Condition 3 and its discussion, we know that D 3cn κ when A 1, and D = 0 otherwise. It follows that {min D < cn κ } {max D > cn κ } { max D D > cn κ }, A 1 A c 1 {1,,p} whichshowsthatwithprobabilityatleast1 exp{ Cn 1 κ /[ τ 1,p +log ( τ,p )]}, min D cn κ and max D cn κ A 1 A c 1 for sufficiently large n. As the same conditions hold for the covariance matrix Σ and the data after the second transformation, the results above also apply to the covariates in A with the test statistics calculated based on data transformed by Ω. This completes the proof of Proposition 1. A.. Lemma 1 and its proof. Lemma 1. Under model setting () and conditions in Theorem 1, for sufficiently large n, with probability at least 1 pexp( C τ 1,p n1 κ ), it holds that max 1 p ˆσ / σ 1 T n,p/6 for some positive constant C. Proof. Let Z = 1 n ( z 1,, z p ) with z = n z i/n for the original data matrix without transformation. We will first bound the term ˆσ σ by writing it as ˆσ σ = e T [ Ω 1 (Z Z) T (Z Z) Ω 1 Ω 1 (Z Z) T (Z Z)Ω 1 ]e /n. Through a further decomposition, it gives ˆσ σ = et ( Ω 1 Ω 1 )(Z Z) T (Z Z)( Ω 1 Ω 1 )e /n (A.9) +e T ( Ω 1 Ω 1 )(Z Z) T (Z Z)Ω 1 e /n.

5 INNOVATED INTERACTION SCREENING 5 We will then bound the two terms on the right hand side above separately. Recall that max denotes the componentwise infinity norm for a matrix. For the first term, we have e T ( Ω 1 Ω 1 )(Z Z) T (Z Z)( Ω 1 Ω 1 )e /n (Z Z) T (Z Z) max ( Ω 1 Ω 1 )e 1 /n (Z Z) T (Z Z) max [(K p +K p) C 1 K p (logp)/n] /n = ( (Z Z) T (Z Z) max /n) [(K p +K p ) C 1 K4 p (logp)/n], where the second inequality follows from the definition of acceptable estimator and the fact that Ω 1 Ω 1 is (K p +K p )-sparse. For the second term, similarly we get e T ( Ω 1 Ω 1 )(Z Z) T (Z Z)Ω 1 e /n (Z Z) T (Z Z) max ( Ω 1 Ω 1 )e 1 Ω 1 e 1 /n ( (Z Z) T (Z Z) max /n) [(K p +K p )C 1K p (logp)/n] Kp Ω 1 max. Since Ω 1 max is assumed to be upper bounded in Condition 4, and the above two bounds are independent of the index, in view of (A.9), we know that there exists some constant C such that (A.10) max 1 p ˆσ σ C( (Z Z) T (Z Z) max /n)[(k p +K p )K3 p (logp)/n] max{(k p +K p)k p (logp)/n,1}. This together with the definition of T n,p before Theorem 1 ensures that max 1 p ˆσ σ C( (Z Z) T (Z Z) max /n) T n,p τ 1,p /( C 1 τ,p ). By Lemma 6, σ are uniformly bounded from below by τ 1,p. By Lemma 7, max 1 p σ /σ 1 cn κ /8. Combining these two results entails that for n large enough, σ (1 cn κ /8)σ > τ 1,p/, uniformly for all, with probability at least 1 pexp( C τ 1,p n1 κ ). Denote by E the event that the results in Lemma 7 hold. By Lemma 8, when C 1 1c 3 C, we have (A.11) P(max 1 p ˆσ / σ 1 > T n,p/6 E) P(max 1 p ˆσ σ > T n,p τ 1,p /1 E) P( (Z Z) T (Z Z) max /n > c 3 τ,p E) p exp( C n).

6 6 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Under the conditions in Theorem 1, we have logp = o(n γ ) with γ > 0 and γ +κ < 1. It follows that P( max 1 p ˆσ / σ 1 > T n,p /6) P( max 1 p ˆσ / σ 1 > T n,p /6 E)+P(E c ) p exp( C n)+pexp( C τ 1,pn 1 κ ) = pexp( C τ 1,pn 1 κ ), where we use the same notation C here to denote a generic constant without lost of generality. This completes the proof of Lemma 1. A.3. Lemma and its proof. Lemma. Under Condition 5, we have (A.1) Cn 1 Xδ +pen( θ) n 1 ε T X δ 1 +pen(θ 0 ), where C is some positive constant depending on the positive constant π min in Condition 6, and δ = θ θ 0 is the estimation error for the regularized estimator θ defined in (18), and ε = y E(y X) with y = ( 1,, n ) T. Proof. Definel n (θ) = n 1 n l n(x T i θ, i)wherel n (x T θ, ) = x T θ+ log[1+exp(x T θ)]. Then, in matrix form, l n (θ) can be rewritten as l n (θ) = n 1 {y T Xθ 1 T b(xθ)}, where y = ( 1,, n ) T is an n-dimensional response vector with i {0,1}, X = (x 1,,x n ) T = ( x 1,, x p ) is an n p augmented design matrix, 1 is an n-dimensional vector with each component being one, b(β) = (b(β 1 ),,b(β n )) T is a vector-valued function with β i = x T i θ and b(u) = log[1+exp(u)]. By the definition of θ, we have l n (θ)+pen(θ) l n (θ 0 )+pen(θ 0 ) where θ 0 is the true regression coefficient vector of θ. Rearranging terms yields (A.13) n 1 1 T [b(x θ) b(xθ 0 )]+pen( θ) n 1 y T Xδ +pen(θ 0 ), where δ = θ θ 0 is the estimation error. Applying Taylor expansion to the function of 1 T b(x θ) at θ 0 gives (A.14) 1 T [b(x θ) b(xθ 0 )] = [b (Xθ 0 )] T Xδ + 1 δ T X T HXδ, where H = H(X, θ 1,, θ n ) = diag{b (x T 1 θ 1 ),,b (x T n θ n )} is an n n diagonal matrix with θ i R p lying on the line segment adoining θ and θ 0, i = 1,,n. Combining (A.13) and (A.14) and rearranging terms yield (A.15) (n) 1 δ T X T HXδ +pen( θ) n 1 ε T Xδ +pen(θ 0 ),

7 INNOVATED INTERACTION SCREENING 7 where ε = y E(y X) = y b (Xθ 0 ). The right hand side of the above inequality can be bounded as (A.16) n 1 ε T Xδ +pen(θ 0 ) n 1 ε T X δ 1 +pen(θ 0 ). By Condition 5, we have δ T X T HXδ C Xδ for some positive constant C, which depends on the constant π min in Condition 5. This inequality, together with (A.15) and (A.16), completes the proof. A.4. Lemma 3 and its proof. Lemma 3. Assume that Condition 1 holds. If log(p) = o(n), then with probability 1 O(p c 1 ), we have n 1 ε T X 1 c 0 log(p)/n, where c0 is some positive constant and ε = y E(y X) with y = ( 1,, n ) T. Proof. Recall that X = (x 1,,x n ) T = ( x 1,, x p ). An application of the Bonferroni inequality gives that (A.17) P( n 1 ε T X > λ 0 ) p =1 P( n 1 ε T x > λ 0 ) for any λ 0. The key idea is to bound P( n 1 ε T x > λ 0 ). To this end, consider the following three cases. Case 1: = 1. In this case, x = 1, where 1 is a n-dimensional vector with each component being one. Recall that ε = (ε 1,,ε n ) T with ε i = i E( i x i ) and i {0,1}. So 1 ε i 1 for i = 1,,n. Thus, by Hoeffding s inequality [3], we have P( n 1 ε T x > λ 0 ) = P( n 1 ε T 1 > λ 0 ) exp ( nλ 0 /). Case : p + 1. In this case, x = (Z 1, 1,,Z n, 1 ) T. Thus, n 1 ε T x = n 1 n ε iz i, 1. From Lemma 9, under Condition 1, we have that z = (Z 1,,Z p ) T is sub-gaussian, that is, there exist some positive constants a 1 and b 1 such that P( v T z > t) a 1 exp( b 1 t )

8 8 Y. FAN, Y. KONG, D. LI AND Z. ZHENG for any vector v R p satisfying v = 1 and any t > 0. Therefore, for any p + 1, taking v = e 1 being a unit vector with the ( 1) component being one and zero elsewhere in the inequality above gives P( Z 1 > t) a 1 exp( b 1 t ) holds uniformly for all p+1. By Lemma 11, we have E(e b 1Z 1 / ) 1+a 1. This, together with the inequality ab (a +b )/ for any a,b > 0, gives e (b 1/) ε i Z i, 1 e b 1(ε i +Z i, 1 )/4 = e b 1ε i /4 e b 1Z i, 1 /4 (A.18) (e b 1ε i / +e b 1Z i, 1 / )/ (e b 1/ +1+a 1 )/ for all 1 i n and p + 1, where we have used the fact that 1 ε i 1 in the last inequality. Thus, it follows from Lemma 1 that for any 0 < λ 0 1, there exist some positive constants a and b such that P( n 1 ε T x λ 0 ) = P( n 1 ε i Z i, 1 λ 0 ) a exp( b nλ 0) for all p+1. Case 3: p + p. In this case, x = (Z 1,k Z 1,l,,Z n,k Z n,l ) T for some 1 k l p. We can use similar arguments for Case to bound P( n 1 ε T x λ 0 ). Similarly to (A.18), we have e (b 1/) ε i Z ik Z il e (b 1/) Z ik Z il e b 1(Z ik +Z il )/4 = e b 1Z ik /4 e b 1Z il /4 (e b 1Z ik / +e b 1Z il / )/ 1+a 1, for all 1 i n and all 1 k l p. This together with Lemma 1 gives that for any 0 < λ 0 1, there exist some positive constants a 3 and b 3 such that P( n 1 ε T x λ 0 ) = P( n 1 ε i Z ik Z il λ 0 ) a 3 exp( b 3 nλ 0 ) for all p+ p. Combining Cases 1-3 above yields (A.19) P( n 1 ε T x λ 0 ) a 4 exp( b 4 nλ 0) for any 0 λ 0 1, wherea 4 = max{,a,a 3 } and b 4 = min{1/,b,b 3 }. Let λ 0 = 1 c 0 log(p)/n with some positive constant c0. Since log(p) = o(n),

9 INNOVATED INTERACTION SCREENING 9 we have 0 < λ 0 1 for all sufficiently large n. In view of (A.17) and (A.19), we have P( n 1 ε T X > 1 c 0 log(p)/n) pa4 exp{ 4 1 b 4 c 0log(p)} 3a 4 p (4 1 b 4 c 0 ), wherec 0 > 8/b 4 andwehaveusedthefactthat p = 1+p+p(p+1)/ 3p. Thus, we conclude that n 1 ε T X c 0 log(p)/n holds with probability at least 1 O(p c 1 ) with c 1 = 4 1 b 4 c 0 > 0. A.5. Lemma 4 and its proof. Lemma 4. Assume that there exists some constant φ > 0 such that (A.0) δ T Σδ φ δ T Sδ S for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S, where Σ = E(x T x). If both z (1) and z () are sub-gaussian, 5s 1/ + 4λ 1 1 λ θ 0 = O(n ξ/ ), and log(p) = o(n 1/ ξ ) with constant 0 ξ < 1/4, then with probability at least 1 O(p c ), n 1/ Xδ (φ/) δ S holds for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S when n is sufficiently large. Proof. The idea is to show that the desired inequality holds conditioning on an event and the probability of this event occurring is at most O(p c ) with some positive constant c. Conditioning the event E 4 = { n 1 X T X Σ } < C 1 n ξ where positive constant C 1 will be specified later, we have δ T (n 1 X T X Σ)δ < C 1 n ξ δ 1 = C 1n ξ ( δ S 1 + δ S c 1 ) C 1 n ξ (s 1/ δ S + δ S c 1 ), where the last inequality follows from the Cauchy-Schwarz inequality. The following arguments are all conditioning on the event E 4. Thus, for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S, we obtain δ T (n 1 X T X Σ)δ C 1 n ξ (5s 1/ +4λ 1 1 λ θ 0 ) δ S.

10 10 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Since 5s 1/ + 4λ 1 1 λ θ 0 = O(n ξ/ ), there exists some positive constant C such that 5s 1/ +4λ 1 1 λ θ 0 C n ξ/. Thus, δ T (n 1 X T X Σ)δ C 1 C δ S for any δ R p satisfying δ S c 1 4(s 1/ + λ 1 1 λ θ 0 ) δ S. Note that n 1 δ T X T Xδ = δ T (n 1 X T X Σ)δ+δ T Σδ. This, together with the above inequality and the assumption (A.0), yields n 1 δ T X T Xδ C 1 C δ S +δ T Σδ (φ C 1 C ) δ S. Choose C 1 = 3φ /(4C ). Thus, with probability 1 P(Ec 4 ), we have n 1/ Xδ (φ/) δ S for any δ R p satisfying δ S c 1 4(s 1/ +λ 1 1 λ θ 0 ) δ S. It remains to show that P(E4 c) O(p c ) with some positive constant c. For any matrix A, denote by A the entrywise matrix infinity norm of A and (A) kl the (k,l) entry of A. Since both z (1) and z () are sub-gaussian, by Lemmas 9, 10, 11, and 13, we have P(E4) c = P { n 1 X T X Σ K /(C)n ξ} p p k=1 l=1 P{ (n 1 X T X Σ) kl K /(C )n ξ } p aexp( 4 1 bc K n1/ ξ ) O(p c ) for all n sufficiently large, where a,b, c are positive constants and the last inequality holds since p = 1+p+p(p+1)/ and log(p) = o(n 1/ ξ ). This completes the proof of Lemma 4. A.6. Lemma 5 and its proof. Lemma 5. Assume that w = (W 1,,W p ) T R p is sub-gaussian. Then for any positive constant c 1, there exists some positive constant C such that { } P max W > C log(p) = O(p c 1 ). 1 p Proof. Since w = (W 1,,W p ) T is sub-gaussian, there exist some positive constants c 1 and c such that P( v T w > t) c 1 exp( c t ) for any

11 INNOVATED INTERACTION SCREENING 11 v R p satisfying v = 1 and any t > 0. Taking v be a unit vector with th component 1 and all other components 0 yields P( W > t) c 1 exp( c t ) for all 1 p. An application of the Bonferroni inequality gives P( max 1 p W > t) p P( W > t) p c 1 exp( c t ). =1 If we choose t = C log(p) with some positive constant C = 1+c 1 / c, } then we have P {max 1 p W > C log(p) = O(p c 1 ). This completes the proof of Lemma 5. APPENDIX B: PROOFS FOR SECONARY LEMMAS B.1. Lemma 6 and its proof. Lemma 6. Under Condition, it holds that τ 1,p λ min [cov( z)] λ max [cov( z)] τ,p, where τ 1,p = {πτ 1,p +(1 π)τ 1τ,p } 1 and τ,p = {πτ 1 1 +(1 π)τ 1 τ,p+ π(1 π)τ 1 µ 1 } exp(1). Proof. In order to prove Lemma 6, we will first calculate the covariance matrix of z = z (1) + (1 )z (). Recall that µ = E(z () ) = 0 and is a Bernoulli variable taking value 1 with probability π. We will apply the formula cov(z) = cov[e(z )] + E[cov(z )] to calculate the covariance matrix of z. For the first term cov[e(z )], we can calculate it as E(z ) = E(z (1) )+(1 )E(z () ) = µ 1, which gives cov[e(z )] = cov( )µ 1 µ T 1 = π(1 π)µ 1µ T 1. For the second term E[cov(z )], since =, (1 ) = 1 and (1 ) = 0, we have cov(z ) = E(zz T ) E(z )E(z ) T = {E(z (1) z (1)T ) µ 1 µ T 1}+(1 ) E(z () z ()T ) = Σ 1 +(1 )Σ.

12 1 Y. FAN, Y. KONG, D. LI AND Z. ZHENG After taking expectation on both sides, we get E[cov(z )] = πσ 1 +(1 π)σ. Thus we have cov(z) = πσ 1 +(1 π)σ +π(1 π)µ 1 µ T 1. Recall that z = Ω 1 z. It follows that cov( z) = Ω 1 cov(z)ω 1 = πω 1 +(1 π)ω 1 Σ Ω 1 +π(1 π)ω 1 µ 1 µ T 1Ω 1. Therefore, under Condition, we get λ max [cov( z)] πλ max (Ω 1 )+(1 π)λ max (Ω 1 Σ Ω 1 )+π(1 π)λ max (Ω 1 µ 1 µ T 1 Ω 1) πτ 1 1 +(1 π)τ 1 τ,p +π(1 π)τ 1 µ 1 ; λ min [cov( z)] πλ min (Ω 1 )+(1 π)λ min (Ω 1 Σ Ω 1 )+π(1 π)λ min (Ω 1 µ 1 µ T 1 Ω 1) πτ 1,p +(1 π)τ 1τ,p. It completes the proof of Lemma 6. B.. Lemma 7 and its proof. Lemma 7. Under model setting () and conditions in Proposition 1, for sufficiently large n, with probability at least 1 pexp( C τ 1,p n1 κ ), it holds that max 1 p σ /σ 1 cn κ /8, for some positive constant C, where c and κ are constants defined in Condition 3. Proof. Wewillfirstdecompose σ σ intoseveral terms,andthenprove deviation bounds for each term. Denote Ω 1 by the operator norm of Ω 1. Note that under Condition, Ω 1 is bounded from above by constant τ 1 1. So z(k) = Ω 1 z (k) for k = 1, are also sub-gaussian distributed. Recall that Z = ZΩ 1 is the transformed data matrix. Denote by Z = ( z i ) n p. Then z i are independent and identically distributed across i with mixture sub-gaussian distribution and variance σ. Since σ is the pooled sample variance estimate for the th transformed feature Z, we have σ = ( z i z ) /n,

13 INNOVATED INTERACTION SCREENING 13 where z = n z i/n is the pooled sample mean estimate for Z. Let µ = E( z i ) and µ (1) = E( z (1) i ). It is clear that µ = π µ (1). By some simple calculation, we have the following decomposition for σ σ, σ σ = ([ z i µ ] σ )/n ( z µ ). Since z i = i z (1) i +(1 i ) z () i, we have z i µ = i ( z (1) i µ (1) )+(1 i ) z () i +( i π) µ (1), where z (1) i µ(1) and z () i are sub-gaussian distributed with mean 0 and variances (σ (1) ) and (σ () ), respectively. By the proof of Lemma 6, we know that σ = π(σ(1) ) + (1 π)(σ () ) + π(1 π)( µ (1) ). As i = i, (1 i ) = 1 i and i (1 i ) = 0, by replacing z i with i z (1) i +(1 i) z () i and spreading out the terms, we can further decompose σ σ as σ σ =1 n { i [( z (1) i ) (σ (1) ) ]+(1 i )[( z () i ) (σ () ) ] +[( i π) π(1 π)]( µ (1) ) +( i π)[(σ (1) ) (σ () ) ] +( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ]} ( z µ ). Let S = {1 i n : i = 1} and ε n be any positive sequence such that ε n 0 and ε n τ 1,p 0 as n. It follows from the above decomposition

14 14 Y. FAN, Y. KONG, D. LI AND Z. ZHENG that (A.1) P( σ σ > ε nσ /) P(1 n i S +P( 1 n i S c [( z () i ) (σ () ) ] > ε nσ 1 ) [( z (1) i ) (σ (1) ) ] > ε nσ 1 ) +P( 1 n [( i π) π(1 π)]( µ (1) ) > ε nσ 1 ) +P( 1 n ( +P( 1 n i nπ)[(σ (1) ) (σ () ) ] > ε nσ 1 ) ( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ] > ε nσ 1 ) +P{( z µ ) > ε nσ 1 }. We will bound the six terms on the right hand side above one by one using some deviation results. Thesame notation C will beusedto denote a generic positive constant without loss of generality. Recall that n 1 = n i and n = n n 1. By Lemma 6, σ are uniformly boundedfrombelowby τ 1,p.Thus,bythedeviationofsub-Gaussianin(A.1), conditioning on any realization of = ( 1,, n ) T, we obtain P( 1 n i S P( 1 n 1 i S exp( Cε n τ 1,p [( z (1) i ) (σ (1) ) ] > ε nσ 1 ) [( z (1) i ) (σ (1) ) ] > ε n τ 1,p 1 ) Cexp( ρε n τ 1,pn 1 /1 ) i ). Since ε n τ 1,p 0 as n and n i is a Binomial random variable with probability of success π, for sufficiently large n such that ε n τ 1,p is small

15 INNOVATED INTERACTION SCREENING 15 enough, taking expectation on both sides above yields (A.) P( 1 n i S [( z (1) i ) (σ (1) ) ] > ε nσ 1 ) E{exp( Cε n τ 1,p = {1 π +πexp( Cε n τ 1,p )}n = exp{nln[1 π +πexp( Cε n τ 1,p )]} i )} exp{ nπ[1 exp( Cε n τ 1,p )]} exp( nπcε n τ 1,p ) = exp( Cε n τ 1,p n), where we have used the inequalities that log(1+x) x for any x > 1, and 1 exp( x) x for sufficiently small x > 0. This gives an upper bound on the first term in (A.1). Similar to (A.), the second term in (A.1) can be bounded as (A.3) P( 1 n i S c [( z () i ) (σ () ) ] > ε nσ 1 ) exp( Cε n τ 1,p n). As z (1) is sub-gaussian distributed, it follows from Lemma 11 that µ (1) are uniformly bounded from above by some positive constant across. Since i are independently Bernoulli distributed with success probability π, we know that E{( i π) } = π(1 π) and ( i π) are i.i.d. and bounded from above by 1. By Hoeffding s inequality, we have the following bound for the third term in (A.1), (A.4) P( 1 n [( i π) π(1 π)]( µ (1) ) > ε nσ 1 ) P( 1 n ( i π) π(1 π) > ε n τ 1,p ε 1( µ (1) exp( n τ 1,p n ) ) 1 ( µ (1) ) 4) exp( Cε n τ 1,pn). Due to the fact that σ = π(σ(1) ) + (1 π)(σ () ) + π(1 π)( µ (1) ), we have σ π(1 π)[(σ (1) ) + (σ () ) ]. Similar to (A.4), by applying

16 16 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Hoeffding s inequality, the fourth term in (A.1) can be bounded as (A.5) P( 1 n ( i nπ)[(σ (1) ) (σ () ) ] > ε nσ 1 ) P( 1 σ i π [ n π(1 π) ] > ε nσ 1 ) P( 1 n exp( nπ (1 π) ε n 1 ) exp( Cε n τ 1,p n). For the fifth term in (A.1), we first decompose it as (A.6) P( 1 n i π > π(1 π)ε n ) 1 ( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ] > ε nσ 1 ) P( 1 n (1 π)( z (1) i ) π z () i ] > ε nσ i S i S c 1 µ (1) ) P( 1 n (1 π)( z (1) i ) > ε nσ i S 4 µ (1) )+P( 1 n π z () i > ε nσ i S c 4 µ (1) ). Recall that σ π(σ(1) ) and max 1 p µ (1) can be bounded from above by some positive constant. Applying Bernstein s inequality to the sum of independent sub-gaussian random variables yields P( 1 n (1 π)( z (1) i ) > ε nσ i S 4 µ (1) ) P( 1 ( z (1) i ) πεn σ > n 1 i S σ (1) 48(1 π) µ (1) ) πε exp( n τ 1,pn 1 96 (1 π) ( µ (1) ) ) exp( Cεn τ 1,p i ). Applying the same argument as in (A.), taking expectation on both sides above then we can obtain P( 1 n i S (1 π)( z (1) i ) > ε nσ 4 µ (1) ) exp( Cε n τ 1,p n).

17 INNOVATED INTERACTION SCREENING 17 Similarly we have ( 1 P n π z () i > ε ) nσ i S c 4 µ (1) exp( Cε n τ 1,pn). In view of (A.6), the two bounds we obtained yield (A.7) P( 1 n ( i π) µ (1) [ i ( z (1) i )+(1 i ) z () i ] > ε nσ 1 ) exp( Cε n τ 1,pn), which is the upper bound for the fifth term in (A.1). Applying Bernstein s inequality similarly to the sixth term in (A.1) gives P{( z µ ) > ε nσ 1 } P( z µ εn > σ 1 ) exp( ε nn 4 ). Combining the six bounds for the terms in (A.1) that we have obtained yields (A.8) P( σ /σ 1 > ε n /) P( σ σ > ε n σ/) exp( Cε n τ 1,pn). It follows that (A.9) P(max 1 p σ /σ 1 > ε n/) P( σ /σ 1 > ε n/) 1 p pexp( Cε n τ 1,p n). Let ε n = cn κ /4 with constants c and κ defined in Condition 3. Since τ 1,p 1, we know that ε n τ 1,p 0 as n. Thus we can replace ε n with cn κ /4 in (A.9) and get (A.30) P( max 1 p σ /σ 1 > cn κ /8) pexp( C τ 1,pn 1 κ ). This completes the proof of Lemma 7. B.3. Lemma 8 and its proof. Lemma 8. Under model setting() and Condition, for sufficiently large n, with probability at least 1 p exp( C n), it holds that (Z Z) T (Z Z) max /n c 3 τ,p for some positive constants C and c 3 >, where Z = 1 n ( z 1,, z p ) with z = n z i/n and 1 n the n 1 column vector with all components 1.

18 18 Y. FAN, Y. KONG, D. LI AND Z. ZHENG Proof. Thedeviation boundof (Z Z) T (Z Z) max /n can beobtained by bounding each component of (Z Z) T (Z Z). Recall that µ = E( z ) = πµ (1) with µ (1) = E(z (1) i ). Note that the th diagonal component of (Z Z) T (Z Z)/n can be bounded as (A.31) (z i z ) /n = (z i µ ) /n ( z µ ) (z i µ ) /n, Let Z = (z i ) n p, then the components of each row in Z are independent and identically distributed (i.i.d.) with a mixture sub-gaussian distribution which can be written as z i = i z (1) i + (1 i )z () i, where i are i.i.d. Bernoulli random variables. Denote by Σ 1 = (σ (1) i ) p p and Σ = (σ () i ) p p. Then for each = 1,,p, the random variables z (1) i µ (1) and z () i are independent across i with mean 0 and variances σ (1) and σ (), respectively. We will then bound n (z i µ ) /n by some deviation results. The same notation C will be used to denote a generic constant without loss of generality. For any 1 i n, we have the following decomposition for n (z i µ ) /n, (z i µ ) /n = n 1 { i (z (1) (A.3) i µ (1) ) +(1 i )(z () i ) +( i π) (µ (1) +( i π)µ (1) [ i (z (1) i µ (1) )+(1 i )(z () i )]}. Recall that S = {1 i n : i = 1}, n 1 = n i and n = n n 1. By Condition, both σ (1) and σ () can be uniformly bounded from above by τ,p across. For the first term in (A.3) and any positive constant c 3, applying the argument with sub-gaussian deviation and Taylor expansion similarly as in (A.) gives P(n 1 { i (z (1) P(n 1 i S i µ (1) ) } > τ,p +c 3 ) P(n 1 i S [(z (1) i µ (1) ) σ (1) ] > c 3) exp( C n). Similarly, we have for the second term in (A.3), P(n 1 (1 i )(z () i ) > τ,p +c 3 ) exp( C n). ) (z (1) i µ (1) ) > σ (1) +c 3)

19 INNOVATED INTERACTION SCREENING 19 Since µ (1) are uniformly bounded from above by some positive constant across from Lemma 11, similar to (A.4), by Hoeffding s inequality, we have the following bound for the third term in (A.3), P{n 1 ( i π) (µ (1) ) > π(1 π)(µ (1) ) +c 3 } exp( C n). For the last term in (A.3), similar to (A.7), by Bernstein s inequality we have P{n 1 ( i π)µ (1) [ i (z (1) i µ (1) )+(1 i )(z () i )] > c 3} exp( C n). In view of (A.3), by the similar argument as in (A.1), combining the four bounds we have obtained gives P( (z i µ ) /n > τ,p +C 3 ) exp( C n), where C 3 = π(1 π)(µ (1) ) + 4c 3. As shown in (A.31), n (z i z ) /n can be bounded from above by n (z i µ ) /n. Thus we have P( (z i z ) /n > τ,p +C 3 ) exp( C n). Then for the (i,)th component of (Z Z) T (Z Z), by the Cauchy- Schwarz inequality, it follows that P( (z ki z i )(z k z )/n > τ,p +C 3 ) k=1 P{[n 1 (z ki z i ) ][n 1 (z k z ) ] > [τ,p +C 3 ] } k=1 k=1 P{n 1 (z ki z i ) > τ,p +C 3 }+P{n 1 (z k z ) > τ,p +C 3 } k=1 exp( C n). Sincetheaboveinequalityholdsforany(i,)th componentof(z Z) T (Z Z), similar to (A.9), we conclude that k=1 P( (Z Z) T (Z Z) max /n > τ,p +C 3 ) p exp( C n)

20 0 Y. FAN, Y. KONG, D. LI AND Z. ZHENG As τ,p is allowed to diverge, we can choose some constant c 3 > such that c 3 τ,p > τ,p +C 3, which gives (A.33) P( (Z Z) T (Z Z) max /n > c 3 τ,p ) p exp( C n). It completes the proof of Lemma 8. B.4. Lemma 9 and its proof. Lemma 9. Assume that z (1) R p and z () R p are sub-gaussian, and follows a Bernoulli distribution with probability of success π. Let z = z (1) +(1 )z (). Then z is also sub-gaussian. Proof. Since z (1) R p is sub-gaussian, there exists some positive constants a 1 and b 1 such that P( v T z (1) > t) a 1 exp( b 1 t ) for any vector v R p satisfying v = 1 and any t > 0. Similarly, there exists some positive constants a and b such that P( v T z () > t) a exp( b t ) for any vector v R p satisfying v = 1 and any t > 0 since z () R p is sub-gaussian. Let a 3 = max{a 1,a } and b 3 = min{b 1,b }. Then we have P( v T z (k) > t) a 3 exp( b 3 t ) for k = 1,. This, together with the law of total probability, yields P( v T z > t) =P( v T z > t = 1)P( = 1)+P( v T z > t = 0)P( = 0) =πp( v T z (1) > t)+(1 π)p( v T z () > t) a 3 exp( b 3 t ). Thus z is sub-gaussian. B.5. Lemma 10 and its proof. Lemma 10. Ifarandomvectorw=(W 1,,W p ) T R p issub-gaussian, then so is w E(w). Proof. If w is sub-gaussian, then there exist some positive constants a 1 and b 1 such that P( v T w > t) a 1 exp( b 1 t )

21 INNOVATED INTERACTION SCREENING 1 for any vector v R p satisfying v = 1 and any t > 0. Let µ = E(w). Thus from the Cauchy-Schwarz inequality and Lemma 11, we have (v T µ) = E (v T w) E[(v T w) ] (1+a 1 )c 1 with c = b 1 /. Note that ( a b ) ( a + b ) (a +b ) for any a and b. Thus, for any vector v R p satisfying v = 1 and any t > 0, P{ v T [w E(w)] > t} P{ v T w + v T µ > t} P{(v T w) +(v T µ) > t } P{e c(vt w) > e ct / c(v T µ) } e ct /+c(v T µ) E[e c(vt w) ] (1+a 1 )e 1+a1 e ct / = a exp( b t ) with a = (1 + a 1 )e 1+a 1 and b = c/ = b 1 /4. Thus w E(w) is sub- Gaussian. B.6. Lemma 11 and its proof. Lemma 11. Let W be a nonnegative random variable such that P(W > t) aexp( bt ) for any t > 0, where a and b are positive constants. Then E[exp( 1 bw )] 1 + a and E(W m ) (1 + a)(/b) m m! for any integer m 0. Proof. Denote by F(t) the cumulative distribution function of W. Then for all x > 0, we have 1 F(t) = P(W > t) aexp( bt ). For any constant 0 < c < b, integration by parts yields E(e cw ) = e ct d[1 F(t)] = 1+ cate (b c)t dt = 1+ ca b c. Thus letting c = b/ gives E[exp( 1 bw )] 1+a. Note that m b m E(W m ) m! k=0 0 cte ct [1 F(t)]dt E( 1 bw ) k = E[exp( 1 bw )] 1+a. k! This implies E(W m ) (1+a)(/b) m m! for any integer m 0. This completes the proof of Lemma 11.

22 Y. FAN, Y. KONG, D. LI AND Z. ZHENG B.7. Lemma 1. Lemma 1 (Lemma 8 of Hao and Zhang []). Let W 1,,W n be independent random variables with zero mean. If E[exp(T 0 W i α )] c 1 for some constants T 0 > 0, c 1 > 0 and 0 < α 1, then there exist positive constants c and c 3 such that for any 0 < ε 1. P( n 1 W i > ε) c exp( c 3 n α ε ) B.8. Lemma 13 and its proof. Lemma 13. Assume that max 1 p E[exp( c 1 Z1 )] c holds with some positive constants c 1 and c. If for each 1 p, the random variables Z 1,,Z n are independent and identically distributed, then { } P n 1 [Z i E(Z i )] ε c 3 exp( c 4 nε ) { } P n 1 [Z i Z ik E(Z i Z ik )] ε c 3 exp( c 4 nε ) { } P n 1 [Z i Z ik Z il E(Z i Z ik Z il )] ε c 3 exp( c 4 n /3 ε ) { } P n 1 [Z ik Z il Z ik Z il E(Z ik Z il Z ik Z il )] ε c 3 exp( c 4 n 1/ ε ) for any 0 < ε < 1, where 1,k,l,k,l p, and c 3 and c 4 are generic positive constants which may vary from line to line. Proof. The idea of the proof is to use Lemma 1 and similar to that for Lemma 9 in Hao and Zhang []. So we omit the details here. REFERENCES [1] Cai, T., Zhang, C.-H., and Zhou, H.(008). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38, [] Hao, N. and Zhang, H. H. (014). Interaction Screening for Ultra-High Dimensional Data. J. Amer. Statist. Assoc., to appear. [3] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58,

23 INNOVATED INTERACTION SCREENING 3 Data Sciences and Operations Department Marshall School of Business University of Southern California Los Angeles, CA USA fanyingy@marshall.usc.edu daoili@marshall.usc.edu Department of Mathematics University of Southern California Los Angeles, CA USA zeminzhe@usc.edu Department of Preventive Medicine Keck School of Medicine University of Southern California Los Angeles, CA USA yinfeiko@usc.edu

SUPPLEMENTARY MATERIAL TO INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION

IPDC SUPPLEMENTARY MATERIAL TO INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION By Yinfei Kong, Daoji Li, Yingying Fan and Jinchi Lv California State University