Supplementary Materials: Martingale difference correlation and its use in high dimensional variable screening by Xiaofeng Shao and Jingsi Zhang
The supplementary material contains some additional simulation results in Section 8 and proofs of Theorems 3-6 in Section 9. For the sake of readership and completeness, we also provide a brief description of each model setting.

8 Additional Simulation Results

8.1 Example 1

We adopt the simple linear model from Fan and Lv (2008): $Y = 5X_1 + 5X_2 + 5X_3 + \epsilon$. The predictor vector $(X_1, \dots, X_p)$ is drawn from a multivariate normal distribution $N(0, \Sigma)$ whose covariance matrix $\Sigma = (\sigma_{ij})_{p \times p}$ has entries $\sigma_{ii} = 1$, $i = 1, \dots, p$, and $\sigma_{ij} = \rho$, $i \neq j$. The error term $\epsilon$ is independently generated from the standard normal distribution. We consider several combinations of $(p, n, \rho)$: $p = 100, 1000$, $n = 20, 50, 70$, and $\rho = 0, 0.1, 0.5, 0.9$.
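A minimal sketch of this setting, assuming only what the text states (the equicorrelated Gaussian design and the linear model); the SIS ranking by absolute marginal Pearson correlation is shown for illustration and is not the authors' code:

```python
import numpy as np

def generate_example1(n, p, rho, seed=0):
    """Fan-Lv model Y = 5*X1 + 5*X2 + 5*X3 + eps with equicorrelated
    N(0, Sigma) predictors, sigma_ij = rho for i != j (rho >= 0)."""
    rng = np.random.default_rng(seed)
    # Equicorrelated Gaussians via the one-factor representation
    # X_j = sqrt(rho)*Z0 + sqrt(1 - rho)*Z_j.
    z0 = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, p))
    X = np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z
    Y = 5 * (X[:, 0] + X[:, 1] + X[:, 2]) + rng.standard_normal(n)
    return X, Y

def sis_rank(X, Y):
    """Rank predictors by absolute marginal Pearson correlation (SIS)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    corr = Xc.T @ Yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc))
    return np.argsort(-np.abs(corr))

X, Y = generate_example1(n=200, p=100, rho=0.0)
top3 = set(sis_rank(X, Y)[:3])  # with this much signal, should recover {0, 1, 2}
```

MDC-SIS and DC-SIS replace the Pearson correlation in `sis_rank` with the martingale difference correlation and distance correlation, respectively; the ranking-and-truncation logic is unchanged.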
Table 11: $P_a$ for Example 1 with d = 5, 10, 15 and p = 1000 (methods SIS, DC-SIS, MDC-SIS and SIRS at $\rho = 0, 0.1, 0.5, 0.9$; the numerical entries did not survive extraction and are not reproduced here).
Table 12: $P_a$ for Example 1 with d = n under several SNRs (signal-to-noise ratios) (methods SIS, DC-SIS, MDC-SIS and SIRS at $\rho = 0, 0.1, 0.5, 0.9$ for each SNR level; numerical entries not reproduced here).
Table 13: $P_a$ for Example 1 with p = 3000 and d = n/log(n) (methods SIS, DC-SIS, MDC-SIS and SIRS at $\rho = 0, 0.1, 0.5, 0.9$; numerical entries not reproduced here).
8.2 Example 2

In this example, we consider two nonlinear additive models, which have been analyzed in Meier, van de Geer and Bühlmann (2009) and Fan, Feng and Song (2011). Let $g_1(x) = x$, $g_2(x) = (2x-1)^2$, $g_3(x) = \sin(2\pi x)/(2 - \sin(2\pi x))$, and $g_4(x) = 0.1\sin(2\pi x) + 0.2\cos(2\pi x) + 0.3\sin^2(2\pi x) + 0.4\cos^3(2\pi x) + 0.5\sin^3(2\pi x)$. The following cases are studied:

Case 2.a: $Y = 5g_1(X_1) + 3g_2(X_2) + 4g_3(X_3) + 6g_4(X_4) + \epsilon$, where the covariates $X_j$, $j = 1, \dots, p$, are simulated as iid Unif(0,1), and $\epsilon$ is independent of the covariates and follows the standard normal distribution.

Case 2.b: The covariates and the error term are simulated as in Case 2.a, but the model structure is more involved, with 8 additional active predictor variables:
\[
Y = g_1(X_1) + g_2(X_2) + g_3(X_3) + g_4(X_4) + 1.5g_1(X_5) + 1.5g_2(X_6) + 1.5g_3(X_7) + 1.5g_4(X_8) + 2g_1(X_9) + 2g_2(X_{10}) + 2g_3(X_{11}) + 2g_4(X_{12}) + \epsilon.
\]

Table 14: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Example 2 with p = 2000 and n = 200 (methods SIS, DC-SIS, MDC-SIS, NIS and SIRS under Cases 2.a and 2.b; numerical entries not reproduced here).
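The additive functions and the Case 2.a sampler can be sketched as follows (the $g$-functions and coefficients are taken from the text; the sampler itself is illustrative, not the authors' code):

```python
import numpy as np

def g1(x):
    return x

def g2(x):
    return (2 * x - 1) ** 2

def g3(x):
    return np.sin(2 * np.pi * x) / (2 - np.sin(2 * np.pi * x))

def g4(x):
    s, c = np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)
    return 0.1 * s + 0.2 * c + 0.3 * s**2 + 0.4 * c**3 + 0.5 * s**3

def generate_case_2a(n=200, p=2000, seed=0):
    """Case 2.a: Y = 5 g1(X1) + 3 g2(X2) + 4 g3(X3) + 6 g4(X4) + eps,
    with X_j iid Unif(0,1) and eps standard normal."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    Y = (5 * g1(X[:, 0]) + 3 * g2(X[:, 1])
         + 4 * g3(X[:, 2]) + 6 * g4(X[:, 3])
         + rng.standard_normal(n))
    return X, Y

X, Y = generate_case_2a(n=100, p=500)
```

Note that $g_2$ vanishes at $x = 1/2$ and $g_3$ is strongly nonlinear, which is what makes marginal-correlation screening struggle on $X_2$ and $X_3$ relative to the model-free procedures.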
Table 15: The proportions of $P_s$ and $P_a$ for Example 2 with d = n/log n, p = 2000 and n = 200 (methods SIS, DC-SIS, MDC-SIS, NIS and SIRS; proportions reported for $X_1$ through $X_{12}$ and for ALL; numerical entries not reproduced here).

8.3 Example 4

This example consists of three cases:

Case 4.a: $Y = 2(X_1 + 0.8X_2 + 0.6X_3 + 0.4X_4 + 0.2X_5) + \exp(X_{20} + X_{21} + X_{22})\,\epsilon$.

Case 4.b: $Y = X_1 + 0.8\sin(X_2) + 0.6\exp(X_3) + 0.4X_4 + 0.2X_5 + \exp(X_{20} + X_{21} + X_{22})\,\epsilon$.

Case 4.c: $Y = X_1 X_2 + 0.6X_3 + 0.4X_4 + 0.2X_5 + \exp(X_{20} + X_{21} + X_{22})\,\epsilon$.

In the above models, the error $\epsilon \sim N(0, 1)$ and is independent of the covariates. The predictor vector follows the multivariate normal distribution with the correlation structure described in Example 1, but with $\rho = 0.8$. All models in this example are heteroscedastic, with the number of active variables being 5 at the median (i.e., $\tau = 0.5$) but 8 for other $\tau$'s. Case 4.a is adapted from an example used in Zhu et al. (2011). Cases 4.b and 4.c are modified versions of Case 4.a obtained by including nonlinear structure and interaction terms, respectively. We report S, $P_s$ and $P_a$ with d = n/log n for all three methods in Tables 6 and 7. The tables below report the minimum model size and the proportions of $P_s$ and $P_a$ for Cases 4.a, 4.b and 4.c with varying degrees of signal-to-noise ratio.
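The heteroscedastic structure of Case 4.a can be sketched numerically. The coefficients below follow the Zhu et al. (2011) example as reconstructed above; extraction damaged the original display, so treat them as an assumption rather than the authors' exact specification:

```python
import numpy as np

def generate_case_4a(n=4000, p=1000, rho=0.8, seed=0):
    """Heteroscedastic model adapted from Zhu et al. (2011); coefficients
    are illustrative:
    Y = 2*(X1 + 0.8*X2 + 0.6*X3 + 0.4*X4 + 0.2*X5)
        + exp(X20 + X21 + X22) * eps,   eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, p))
    X = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * z  # equicorrelated, rho = 0.8
    mean = 2 * (X[:, 0] + 0.8*X[:, 1] + 0.6*X[:, 2] + 0.4*X[:, 3] + 0.2*X[:, 4])
    scale = np.exp(X[:, 19] + X[:, 20] + X[:, 21])  # X20, X21, X22 (1-based)
    Y = mean + scale * rng.standard_normal(n)
    return X, Y

X, Y = generate_case_4a()
s = X[:, 19] + X[:, 20] + X[:, 21]
resid = Y - 2 * (X[:, 0] + 0.8*X[:, 1] + 0.6*X[:, 2] + 0.4*X[:, 3] + 0.2*X[:, 4])
# The residual spread grows with X20 + X21 + X22: those variables shift the
# conditional scale but not the conditional median, so they are active only
# away from tau = 0.5.
hi, lo = resid[s > np.median(s)], resid[s <= np.median(s)]
```

This is why the quantile-based procedures (SISQ, MDC-SISQ, QaSIS) are run at several $\tau$ levels: the scale variables $X_{20}, X_{21}, X_{22}$ are invisible at the median.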
Table 16: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Case 4.a with different SNR settings (methods SISQ, MDC-SISQ and QaSIS at several quantile levels $\tau$, together with DC-SIS and SIRS, for n = 100 and n = 200 under c = 0.5 and c = 2; numerical entries not reproduced here).

Table 17: The proportions of $P_s$ and $P_a$ for Case 4.a with d = n/log n and different SNR settings (proportions reported for $X_1$-$X_5$, $X_{20}$-$X_{22}$ and ALL; same methods and settings as Table 16; numerical entries not reproduced here).

Table 18: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Case 4.b with different SNR settings (same methods and settings as Table 16; numerical entries not reproduced here).

Table 19: The proportions of $P_s$ and $P_a$ for Case 4.b with d = n/log n and different SNR settings (same layout as Table 17; numerical entries not reproduced here).

Table 20: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Case 4.c with different SNR settings (same methods and settings as Table 16; numerical entries not reproduced here).

Table 21: The proportions of $P_s$ and $P_a$ for Case 4.c with d = n/log n and different SNR settings (same layout as Table 17; numerical entries not reproduced here).
8.4 Examples in He, Wang and Hong (2013)

Example HWH1 (additive model, n = 400, p = 1000). This example is adapted from Fan et al. (2011). Let $g_1(x) = x$, $g_2(x) = (2x-1)^2$, $g_3(x) = \sin(2\pi x)/(2 - \sin(2\pi x))$, and $g_4(x) = 0.1\sin(2\pi x) + 0.2\cos(2\pi x) + 0.3\sin^2(2\pi x) + 0.4\cos^3(2\pi x) + 0.5\sin^3(2\pi x)$. The following cases are studied:

Case 1.a: $Y = 5g_1(X_1) + 3g_2(X_2) + 4g_3(X_3) + 6g_4(X_4) + \sqrt{1.74}\,\epsilon$, where the vector of covariates X is generated from the multivariate normal distribution $N(0, \Sigma)$ with $\sigma_{ij} = \rho^{|i-j|}$. In Case 1.a we consider $\rho = 0$.

Case 1.b: same as Case 1.a except that $\rho = 0.8$.

Case 1.c: same as Case 1.b except that $\epsilon \sim$ Cauchy.

Example HWH2 (index model, n = 200, p = 2000). This example is adapted from Zhu et al. (2011). The random data are generated from $Y = 2(X_1 + 0.8X_2 + 0.6X_3 + 0.4X_4 + 0.2X_5) + \exp(X_{20} + X_{21} + X_{22})\,\epsilon$. This model is heteroscedastic: the number of active variables is 5 at the median but 8 elsewhere.

Example HWH3 (a more complex structure, n = 400, p = 5000).

Case 3.a: $Y = 2(X_1^2 + X_2^2) + \exp((X_1 + X_2 + X_{18} + X_{19} + \cdots + X_{30})/10)\,\epsilon$, where $\epsilon \sim N(0, 1)$ and X follows the multivariate normal distribution with the correlation structure described in Case 1.b. In this case, the number of active variables is 2 at the median but 15 elsewhere.

Case 3.b: same as Case 3.a, but with $2(X_1^2 + X_2^2)$ replaced by $2((X_1 + 1)^2 + (X_2 + 2)^2)$.
Table 22: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Example HWH1 (methods SISQ, MDC-SISQ and QaSIS at several quantile levels $\tau$, together with DC-SIS, SIRS and NIS, under Cases 1.a, 1.b and 1.c; numerical entries not reproduced here).

Table 23: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Example HWH2 (same methods as Table 22; numerical entries not reproduced here).

Table 24: The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size S for Example HWH3 (same methods as Table 22, under Cases 3.a and 3.b; numerical entries not reproduced here).
9 Technical Appendix

Proof of Theorem 3: Write $\mathrm{MDD}^2(V|U) = \|g_{V,U}(s) - g_V\, g_U(s)\|^2$ and $\mathrm{MDD}^2_n(V|U) = \|\xi_n(s)\|^2$, where
\[
\xi_n(s) = \frac{1}{n}\sum_{k=1}^n V_k e^{i\langle s, U_k\rangle} - \frac{1}{n}\sum_{k=1}^n V_k \cdot \frac{1}{n}\sum_{k=1}^n e^{i\langle s, U_k\rangle}.
\]
After an elementary transformation, $\xi_n(s)$ can be expressed as
\[
\xi_n(s) = \frac{1}{n}\sum_{k=1}^n \tilde U_k \tilde V_k - \frac{1}{n}\sum_{k=1}^n \tilde U_k \cdot \frac{1}{n}\sum_{k=1}^n \tilde V_k,
\]
where $\tilde U_k = \exp\{i\langle s, U_k\rangle\} - E[\exp\{i\langle s, U\rangle\}]$ and $\tilde V_k = V_k - E(V)$. Define the region $D(\delta) = \{s : \delta \le |s|_q \le 1/\delta\}$ for each $\delta > 0$, and let $\mathrm{MDD}^2_{n,\delta}(V|U) = \int_{D(\delta)} |\xi_n(s)|^2\, dw$, where $dw = w(s)\,ds$ and $w(s) = 1/(c_q |s|_q^{1+q})$. For any fixed $\delta > 0$, the weight function $w(s)$ is bounded on $D(\delta)$. Hence $\mathrm{MDD}^2_{n,\delta}(V|U)$ is a combination of V-statistics with finite expectation. By the SLLN for V-statistics, it follows that almost surely
\[
\lim_{n\to\infty} \mathrm{MDD}^2_{n,\delta}(V|U) = \mathrm{MDD}^2_{\cdot,\delta}(V|U) = \int_{D(\delta)} |g_{U,V}(s) - g_U(s) g_V|^2\, dw.
\]
Obviously $\mathrm{MDD}^2_{\cdot,\delta}(V|U)$ converges to $\mathrm{MDD}^2(V|U)$ as $\delta$ tends to zero. Now, it remains to show that
\[
\limsup_{\delta\to 0}\ \limsup_{n\to\infty}\ \big|\mathrm{MDD}^2_n(V|U) - \mathrm{MDD}^2_{n,\delta}(V|U)\big| = 0.
\]
For each $\delta > 0$,
\[
\mathrm{MDD}^2_n(V|U) - \mathrm{MDD}^2_{n,\delta}(V|U) = \int_{|s|_q < \delta} |\xi_n(s)|^2\, dw + \int_{|s|_q > 1/\delta} |\xi_n(s)|^2\, dw. \tag{8}
\]
For $z = (z_1, z_2, \dots, z_q) \in \mathbb{R}^q$, define the function $G(y) = \int_{|z|_q < y} \frac{1 - \cos z_1}{|z|_q^{1+q}}\, dz$. Clearly $G(y)$ is bounded by $c_q$ and $\lim_{y\to 0} G(y) = 0$. Applying the Cauchy-Schwarz inequality, we obtain
\[
|\xi_n(s)|^2 \le \frac{4}{n}\sum_{k=1}^n |\tilde U_k|^2 \cdot \frac{1}{n}\sum_{k=1}^n |\tilde V_k|^2. \tag{9}
\]
Hence the first summand in (8) satisfies
\[
\int_{|s|_q < \delta} |\xi_n(s)|^2\, dw \le \frac{4}{n}\sum_{k=1}^n \int_{|s|_q < \delta} \frac{|\tilde U_k|^2}{c_q |s|_q^{1+q}}\, ds \cdot \frac{1}{n}\sum_{k=1}^n |\tilde V_k|^2 \le \frac{4}{n}\sum_{k=1}^n 2E_U\{|U_k - U|_q\, G(|U_k - U|_q\, \delta)\} \cdot \frac{2}{n}\sum_{k=1}^n \big(V_k^2 + E(V^2)\big),
\]
where we used the inequalities $|a - b|^2 \le 2(a^2 + b^2)$ for $a, b \in \mathbb{R}$ and $(E(V))^2 \le E(V^2)$, as well as the fact that $\int_{|s|_q<\delta} \frac{|\tilde U_k|^2}{c_q |s|_q^{1+q}}\, ds \le 2E_U\{|U_k - U|_q\, G(|U_k - U|_q\, \delta)\}$, as presented in Székely et al. (2007). By the SLLN,
\[
\limsup_{n\to\infty} \int_{|s|_q<\delta} |\xi_n(s)|^2\, dw \le 8E\{|U_1 - U_2|_q\, G(|U_1 - U_2|_q\, \delta)\} \cdot 4E(V^2) \quad \text{a.s.}
\]
Therefore, by the Lebesgue dominated convergence theorem,
\[
\limsup_{\delta\to 0}\ \limsup_{n\to\infty} \int_{|s|_q<\delta} |\xi_n(s)|^2\, dw = 0 \quad \text{a.s.}
\]
Now, consider the second summand in (8). Using the fact that $|\tilde U_k|^2 \le 4$ and the inequality (9) again, we can derive that
\[
\int_{|s|_q > 1/\delta} |\xi_n(s)|^2\, dw \le 16 \int_{|s|_q > 1/\delta} \frac{1}{c_q |s|_q^{1+q}}\, ds \cdot \frac{1}{n}\sum_{k=1}^n |V_k - E(V)|^2 \le 16\, h(\delta)\, \frac{2}{n}\sum_{k=1}^n \{V_k^2 + E(V^2)\},
\]
where $h(\delta) = \int_{|s|_q > 1/\delta} \frac{1}{c_q |s|_q^{1+q}}\, ds$ goes to zero as $\delta \to 0$; compare page 2778 of Székely et al. (2007). Thus, almost surely $\limsup_{\delta\to 0} \limsup_{n\to\infty} \int_{|s|_q>1/\delta} |\xi_n(s)|^2\, dw = 0$, which implies that $\mathrm{MDD}_n(V|U) \to \mathrm{MDD}(V|U)$ a.s. The consistency of $\mathrm{MDC}_n(V|U)$ follows from the facts that $\mathrm{Var}_n(V) \to \mathrm{Var}(V)$ (SLLN) and $\mathrm{dVar}_n(U) \to \mathrm{dVar}(U)$ a.s. (Theorem 2 in Székely et al. (2007)). The proof is then complete.

Proof of Theorem 4: The argument is similar to that presented in the proofs of Theorem 5 and Corollary 2 of Székely et al. (2007).

(a) Define the process $\Gamma_n(s) = \sqrt{n}\,\xi_n(s) = \sqrt{n}\,\big(g^n_{U,V}(s) - g^n_U(s)\, g^n_V\big)$. After some straightforward calculation, we can derive that $E[\Gamma_n(s)] = 0$ and
\[
E[\Gamma_n(s)\overline{\Gamma_n(s_0)}] = \Big(\tfrac{n-1}{n}\Big)^2 F(s - s_0) + \tfrac{n-1}{n}\, g_U(s - s_0)\Big[\tfrac{1}{n}E(V^2) - (EV)^2\Big] + \tfrac{n-1}{n}\Big[(EV)^2 + \tfrac{n-2}{n}E(V^2)\Big] g_U(s)\overline{g_U(s_0)} - \Big(\tfrac{n-1}{n}\Big)^2 F(s)\overline{g_U(s_0)} - \Big(\tfrac{n-1}{n}\Big)^2 g_U(s)\overline{F(s_0)}.
\]
In particular,
\[
E|\Gamma_n(s)|^2 = \tfrac{n-1}{n}\, E(V^2)\Big(1 + \tfrac{n-2}{n}|g_U(s)|^2\Big) - \tfrac{n-1}{n}(EV)^2\big(1 - |g_U(s)|^2\big) - \Big(\tfrac{n-1}{n}\Big)^2\big[F(s)\overline{g_U(s)} + g_U(s)\overline{F(s)}\big].
\]
In the sequel, we construct a sequence of random variables $\{Q_n(\delta)\}$ such that

(i) $Q_n(\delta) \xrightarrow{D} Q(\delta)$ for each $\delta > 0$;
(ii) $\limsup_{n\to\infty} E|Q_n(\delta) - \|\Gamma_n\|^2| \to 0$ as $\delta \to 0$;
(iii) $E|Q(\delta) - \|\Gamma\|^2| \to 0$ as $\delta \to 0$.

Then the weak convergence of $\|\Gamma_n\|^2$ to $\|\Gamma\|^2$ follows from the converging-together theorem of Resnick (1999). Following the construction in Székely et al. (2007), we define
\[
Q_n(\delta) = \int_{D(\delta)} |\Gamma_n(s)|^2\, dw \quad\text{and}\quad Q(\delta) = \int_{D(\delta)} |\Gamma(s)|^2\, dw.
\]
Given $\epsilon = 1/p > 0$, $p \in \mathbb{N}$, choose a partition $\{D_k\}_{k=1}^N$ of $D(\delta)$ into $N = N(\epsilon)$ measurable sets with diameter at most $\epsilon$. Then $Q_n(\delta) = \sum_{k=1}^N \int_{D_k} |\Gamma_n(s)|^2\, dw$ and $Q(\delta) = \sum_{k=1}^N \int_{D_k} |\Gamma(s)|^2\, dw$. Define
\[
Q_n^p(\delta) = \sum_{k=1}^N \int_{D_k} |\Gamma_n(s_0(k))|^2\, dw \quad\text{and}\quad Q^p(\delta) = \sum_{k=1}^N \int_{D_k} |\Gamma(s_0(k))|^2\, dw,
\]
where $\{s_0(k)\}_{k=1}^N$ is a set of distinct points such that $s_0(k) \in D_k$. By the multivariate CLT and the continuous mapping theorem, $Q_n^p(\delta) \xrightarrow{D} Q^p(\delta)$ for any $p \in \mathbb{N}$. Then, in view of the converging-together theorem of Resnick (1999), (i) holds if we can show
\[
\limsup_{p\to\infty} E|Q^p(\delta) - Q(\delta)| = 0 \tag{10}
\]
and
\[
\limsup_{p\to\infty}\ \limsup_{n\to\infty} E|Q_n^p(\delta) - Q_n(\delta)| = 0. \tag{11}
\]
Let $\beta_n(\epsilon) = \sup_{s,s_0} E\big||\Gamma_n(s)|^2 - |\Gamma_n(s_0)|^2\big|$ and $\beta(\epsilon) = \sup_{s,s_0} E\big||\Gamma(s)|^2 - |\Gamma(s_0)|^2\big|$, where the supremum is taken over all $s$ and $s_0$ under the restrictions $\delta < |s|_q, |s_0|_q < 1/\delta$ and $|s - s_0|_q < \epsilon$. In view of the form of $\mathrm{Cov}_\Gamma(s, s_0)$ (defined after Theorem 3) and by applying the Cauchy-Schwarz inequality, we derive that
\[
\beta(\epsilon) = \sup_{s,s_0} E\big|(\Gamma(s) - \Gamma(s_0))\overline{\Gamma(s)} + \overline{\Gamma(s_0)}(\Gamma(s) - \Gamma(s_0))\big| \le \sup_{s,s_0} E^{1/2}|\Gamma(s) - \Gamma(s_0)|^2 \big(E^{1/2}|\Gamma(s)|^2 + E^{1/2}|\Gamma(s_0)|^2\big) \le C \sup_{s,s_0} E^{1/2}|\Gamma(s) - \Gamma(s_0)|^2 \le C \sup_{s,s_0} \big|\mathrm{Cov}_\Gamma(s,s) - \mathrm{Cov}_\Gamma(s,s_0) - \mathrm{Cov}_\Gamma(s_0,s) + \mathrm{Cov}_\Gamma(s_0,s_0)\big|^{1/2}.
\]
Since $g_U(s)$ and $F(s)$ are uniformly continuous in $s \in \mathbb{R}^q$, it can easily be shown that $\beta(\epsilon) \to 0$ as $\epsilon \to 0$. To show (10), we note that
\[
E|Q^p(\delta) - Q(\delta)| = E\Big|\int_{D(\delta)} |\Gamma(s)|^2\, dw - \sum_{k=1}^N \int_{D_k} |\Gamma(s_0(k))|^2\, dw\Big| = E\Big|\sum_{k=1}^N \int_{D_k} \big(|\Gamma(s)|^2 - |\Gamma(s_0(k))|^2\big)\, dw\Big| \le \beta(1/p) \int_{D(\delta)} \frac{1}{c_q |s|_q^{1+q}}\, ds \to 0, \quad\text{as } p \to \infty.
\]
Using exactly the same argument, we can show (11), and thus (i) holds. On the other hand,
\[
E\Big|\int_{D(\delta)} |\Gamma_n(s)|^2\, dw - \int_{\mathbb{R}^q} |\Gamma_n(s)|^2\, dw\Big| = E\int_{|s|_q < \delta} |\Gamma_n(s)|^2\, dw + E\int_{|s|_q > 1/\delta} |\Gamma_n(s)|^2\, dw.
\]
Following similar steps as in the proof of Theorem 3, we can derive that for any small $\epsilon$ there exist $\delta_0$ and $n_0$ such that when $n \ge n_0$ and $\delta \le \delta_0$, $E\int_{|s|_q<\delta} |\Gamma_n(s)|^2\, dw < \epsilon$ and $E\int_{|s|_q>1/\delta} |\Gamma_n(s)|^2\, dw < \epsilon$. Thus we complete our proof of (ii). A similar argument also applies to $Q(\delta)$, so (iii) holds. Therefore $n\,\mathrm{MDD}^2_n(V|U) = \|\Gamma_n\|^2 \xrightarrow{D} \|\Gamma\|^2$.

(b) According to the first assertion, under the assumption that $\mathrm{MDC}(V|U) = 0$, $n\,\mathrm{MDD}^2_n(V|U)$ converges in distribution to the quadratic form $\|\Gamma\|^2$. Note that
\[
E\|\Gamma\|^2 = \int_{\mathbb{R}^q} \mathrm{Cov}_\Gamma(s,s)\, dw = \int_{\mathbb{R}^q} \Big\{[E(V^2) - (EV)^2]\big(1 - |g_U(s)|^2\big) + 2E(V^2)|g_U(s)|^2 - F(s)\overline{g_U(s)} - g_U(s)\overline{F(s)}\Big\}\, dw.
\]
Under the assumption that $E(V^2|U) = E(V^2)$, $F(s) = E(V^2)\, g_U(s)$, which implies that $E\|\Gamma\|^2 = E|U - U'|_q\, [E(V^2) - (EV)^2]$. By the SLLN for V-statistics, $S_n \to E|U - U'|_q\, [E(V^2) - (EV)^2]$ a.s. Therefore $n\,\mathrm{MDD}^2_n(V|U)/S_n \xrightarrow{D} Q$, where $E[Q] = 1$ and $Q$ is a nonnegative quadratic form of centered Gaussian random variables, following the argument in the proof of Corollary 2 of Székely et al. (2007).
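The simplification in part (b) can be made explicit; this is a routine check, with the last step using the standard weight-function integral identity from Székely et al. (2007). When $F(s) = E(V^2)\, g_U(s)$,
\[
2E(V^2)|g_U(s)|^2 - F(s)\overline{g_U(s)} - g_U(s)\overline{F(s)} = 2E(V^2)|g_U(s)|^2 - 2E(V^2)|g_U(s)|^2 = 0,
\]
so the integrand reduces to $[E(V^2) - (EV)^2](1 - |g_U(s)|^2)$, and
\[
E\|\Gamma\|^2 = [E(V^2) - (EV)^2] \int_{\mathbb{R}^q} \frac{1 - |g_U(s)|^2}{c_q |s|_q^{1+q}}\, ds = [E(V^2) - (EV)^2]\, E|U - U'|_q.
\]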
(c) Suppose that $\mathrm{MDD}(V|U) > 0$. Then Theorem 3 implies that $\mathrm{MDD}^2_n(V|U) \to \mathrm{MDD}^2(V|U) > 0$ a.s., and therefore $n\,\mathrm{MDD}^2_n(V|U) \to \infty$ a.s. By the SLLN, $S_n$ converges to a constant, and therefore $n\,\mathrm{MDD}^2_n(V|U)/S_n \to \infty$.

Proof of Theorem 5: Our argument basically follows that in the proof of Theorem 1 of Li, Zhong and Zhu (2012), with a slight modification. For the sake of completeness, we present the details. In our proof, the positive constant $C$ is generic and its value may vary from place to place. We shall first show the uniform consistency of $\widehat\omega_j = (\mathrm{MDC}^j_n)^2$ under assumption (A1). Due to the similarity of its numerator and denominator, we only deal with its numerator, i.e., the uniform consistency of $(\mathrm{MDD}^j_n)^2$. Let $S^j_1 = E[Y Y' |X_j - X_j'|]$, $S^j_2 = E[Y Y']\, E|X_j - X_j'|$ and $S^j_3 = E[Y Y'' |X_j - X_j'|]$, where $(X_j', Y')$ and $(X_j'', Y'')$ are iid copies of $(X_j, Y)$. Correspondingly, denote their sample counterparts as
\[
S^j_{1n} = \frac{1}{n^2}\sum_{k,l=1}^n Y_k Y_l |X_{jk} - X_{jl}|, \qquad S^j_{2n} = \frac{1}{n^2}\sum_{k,l=1}^n Y_k Y_l \cdot \frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}|, \qquad S^j_{3n} = \frac{1}{n^3}\sum_{k,l,h=1}^n Y_k Y_h |X_{jk} - X_{jl}|.
\]
According to the proofs of Theorems 1 and 2, $\mathrm{MDD}^j$ and $\mathrm{MDD}^j_n$ can be expressed as $(\mathrm{MDD}^j)^2 = -S^j_1 - S^j_2 + 2S^j_3$ and $(\mathrm{MDD}^j_n)^2 = -S^j_{1n} - S^j_{2n} + 2S^j_{3n}$. We shall establish the consistency result for each part respectively.

Part I: Consistency of $S^j_{1n}$. Define the U-statistic $\widetilde S^j_{1n} = \{n(n-1)\}^{-1}\sum_{k\neq l} Y_k Y_l |X_{jk} - X_{jl}|$ with the kernel function $h_1(X_{jk}, Y_k; X_{jl}, Y_l) = Y_k Y_l |X_{jk} - X_{jl}|$. First, we shall show that the uniform consistency of $S^j_{1n}$ can be derived from that of $\widetilde S^j_{1n}$. By the Cauchy-Schwarz inequality,
\[
|S^j_1| = \big|E[Y Y' |X_j - X_j'|]\big| \le \big\{E[(Y Y')^2]\, E[|X_j - X_j'|^2]\big\}^{1/2} \le \big\{(E(Y^4))^{1/2} (E[(Y')^4])^{1/2}\, 4E(X_j^2)\big\}^{1/2} = 2\big(E(X_j^2)\, E(Y^4)\big)^{1/2}.
\]
Under assumption (A1), $\sup_p \max_{1\le j\le p} |S^j_1| < \infty$, i.e., $\{S^j_1\}_{j=1}^p$ are uniformly bounded. Thus, for any $\epsilon > 0$, there exists a sufficiently large $n$ such that $|S^j_1|/n \le \epsilon$ for any $j = 1, \dots, p$ (in the case $\epsilon = c n^{-\kappa}$, as will be specified later, this still holds). Then,
\[
P\big(|S^j_{1n} - S^j_1| \ge 2\epsilon\big) = P\Big(\Big|\tfrac{n-1}{n}\big(\widetilde S^j_{1n} - S^j_1\big) - \tfrac{1}{n} S^j_1\Big| \ge 2\epsilon\Big) \le P\big(|\widetilde S^j_{1n} - S^j_1| + |S^j_1|/n \ge 2\epsilon\big) \le P\big(|\widetilde S^j_{1n} - S^j_1| \ge \epsilon\big).
\]
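As a quick numerical sanity check on the decomposition $(\mathrm{MDD}^j_n)^2 = -S^j_{1n} - S^j_{2n} + 2S^j_{3n}$ (a sketch for scalar $X$, not part of the proof), the V-statistic can be computed both through the centered-distance form and through the three $S$-terms:

```python
import numpy as np

def mdd_sq(X, Y):
    """MDD_n^2(Y|X) via the centered-distance V-statistic:
    -(1/n^2) * sum_{k,l} (Y_k - Ybar)(Y_l - Ybar)|X_k - X_l|."""
    a = np.abs(X[:, None] - X[None, :])   # pairwise distances |X_k - X_l|
    yc = Y - Y.mean()
    return -(yc[:, None] * yc[None, :] * a).mean()

def mdd_sq_parts(X, Y):
    """The same quantity through -S1 - S2 + 2*S3."""
    a = np.abs(X[:, None] - X[None, :])
    s1 = (np.outer(Y, Y) * a).mean()
    s2 = np.outer(Y, Y).mean() * a.mean()
    # S3 = n^{-3} sum_{k,l,h} Y_k Y_h |X_k - X_l|
    #    = Ybar * n^{-2} sum_{k,l} Y_k |X_k - X_l|
    s3 = Y.mean() * (Y[:, None] * a).mean()
    return -s1 - s2 + 2 * s3

rng = np.random.default_rng(1)
X = rng.standard_normal(300)
Y = X**2 + 0.5 * rng.standard_normal(300)  # E(Y|X) depends on X
v1 = mdd_sq(X, Y)
v2 = mdd_sq_parts(X, Y)
```

The two forms agree up to floating-point error, and the statistic is nonnegative, consistent with its representation as $\int |\xi_n(s)|^2\, dw$ in the proof of Theorem 3.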
Next, we shall establish the uniform consistency of $\widetilde S^j_{1n}$ based on the theory of U-statistics. Write
\[
\widetilde S^j_{1n} = \{n(n-1)\}^{-1}\sum_{k\neq l} h_1 I\{|h_1| \le M\} + \{n(n-1)\}^{-1}\sum_{k\neq l} h_1 I\{|h_1| > M\} = \widetilde S^j_{1n,1} + \widetilde S^j_{1n,2}.
\]
Correspondingly, its population counterpart can be decomposed as $S^j_1 = E[h_1 I\{|h_1| \le M\}] + E[h_1 I\{|h_1| > M\}] = S^j_{1,1} + S^j_{1,2}$. Note that $\widetilde S^j_{1n,1}$ and $\widetilde S^j_{1n,2}$ are unbiased estimators of $S^j_{1,1}$ and $S^j_{1,2}$, respectively. To show the consistency of $\widetilde S^j_{1n,1}$, we note that any U-statistic can be expressed as an average of averages of iid random variables; see Serfling (1980, Section 5.1.6). Denote $m = \lfloor n/2 \rfloor$ and define
\[
\Omega(X_{j1}, Y_1; \dots; X_{jn}, Y_n) = \frac{1}{m}\sum_{r=0}^{m-1} h_1^{(r)} I\{|h_1^{(r)}| \le M\},
\]
where $h_1^{(r)} = h_1(X_{j,1+2r}, Y_{1+2r}; X_{j,2+2r}, Y_{2+2r})$. Then we have $\widetilde S^j_{1n,1} = (n!)^{-1}\sum_{n!} \Omega(X_{ji_1}, Y_{i_1}; \dots; X_{ji_n}, Y_{i_n})$, where $\sum_{n!}$ denotes summation over all $n!$ permutations $(i_1, \dots, i_n)$ of $(1, \dots, n)$. By Jensen's inequality, for $t > 0$,
\[
E[\exp(t \widetilde S^j_{1n,1})] = E\Big[\exp\Big\{t (n!)^{-1}\sum_{n!} \Omega(X_{ji_1}, Y_{i_1}; \dots; X_{ji_n}, Y_{i_n})\Big\}\Big] \le (n!)^{-1}\sum_{n!} \prod_{r=0}^{m-1} E\big[\exp\big(t h_1^{(r)} I\{|h_1^{(r)}| \le M\}/m\big)\big] = E^m\big[\exp\big(t h_1 I\{|h_1| \le M\}/m\big)\big],
\]
which entails that
\[
P(\widetilde S^j_{1n,1} - S^j_{1,1} \ge \epsilon) \le \exp(-t\epsilon)\exp(-t S^j_{1,1})\, E[\exp(t \widetilde S^j_{1n,1})] \le \exp(-t\epsilon)\, E^m\big\{\exp\big[t\big(h_1 I\{|h_1| \le M\} - S^j_{1,1}\big)/m\big]\big\} \le \exp(-t\epsilon)\exp\{t^2 M^2/(2m)\},
\]
where we have applied Markov's inequality and Hoeffding's inequality (see Lemma 1 of Li, Zhong and Zhu (2012)) in the first and third inequalities above, respectively. Choosing $t = \epsilon m/M^2$ and utilizing the symmetry of the U-statistic, we obtain $P(|\widetilde S^j_{1n,1} - S^j_{1,1}| \ge \epsilon) \le 2\exp\{-\epsilon^2 m/(2M^2)\}$.

Next, we turn to the other part, $\widetilde S^j_{1n,2}$. With the Cauchy-Schwarz inequality and Markov's inequality,
\[
(S^j_{1,2})^2 = \big(E[h_1 I\{|h_1| > M\}]\big)^2 \le E[h_1^2]\, P\{|h_1| > M\} \le E[h_1^2]\, E[|h_1|^q]\, M^{-q}
\]
for any $q \in \mathbb{N}$. By applying the inequality $|ab| \le (a^2 + b^2)/2$, $a, b \in \mathbb{R}$, twice, we get $|h_1(X_{jk}, Y_k; X_{jl}, Y_l)| \le Y_k^2 Y_l^2/2 + \tfrac{1}{2}|X_{jk} - X_{jl}|^2 \le Y_k^4 + Y_l^4 + X_{jk}^2 + X_{jl}^2$, which yields $E[|h_1|^q] \le 4^{q-1} E\big[Y_k^{4q} + Y_l^{4q} + X_{jk}^{2q} + X_{jl}^{2q}\big] < \infty$ by the $C_r$ inequality and assumption (A1). Thus, if we choose $M = n^\gamma$ for $0 < \gamma < 1/2 - \kappa$, then $|S^j_{1,2}| \le \epsilon/2$ for sufficiently large $n$ (in the case $\epsilon = c n^{-\kappa}$, as will be specified later, $q$ can be any integer greater than $2\kappa/\gamma$). Hence, $P(|\widetilde S^j_{1n,2} - S^j_{1,2}| \ge \epsilon) \le P(|\widetilde S^j_{1n,2}| \ge \epsilon/2)$. Since the event $\{|\widetilde S^j_{1n,2}| \ge \epsilon/2\}$ implies the event $\{Y_k^4 + X_{jk}^2 \ge M/2 \text{ for some } 1 \le k \le n\}$, we have that
\[
P\{|\widetilde S^j_{1n,2}| \ge \epsilon/2\} \le P\Big(\bigcup_{k=1}^n \{Y_k^4 + X_{jk}^2 \ge M/2\}\Big) \le \sum_{k=1}^n P\big(\{Y_k^4 + X_{jk}^2 \ge M/2\}\big) = n P\big(\{Y_k^4 + X_{jk}^2 \ge M/2\}\big),
\]
where we have applied Bonferroni's inequality in the second inequality above. Invoking assumption (A1) and Markov's inequality, there must exist a constant $C$ such that $P(Y_k^4 + X_{jk}^2 \ge M/2) \le P(Y_k^2 \ge \sqrt{M}/2) + P(X_{jk}^2 \ge M/4) \le C\exp(-s\sqrt{M}/2)$ for any $j$, $k$, and $s \in (0, 2s_0]$. Consequently, for sufficiently large $n$,
\[
\max_{1\le j\le p} P(|\widetilde S^j_{1n,2} - S^j_{1,2}| \ge \epsilon) \le \max_{1\le j\le p} P(|\widetilde S^j_{1n,2}| \ge \epsilon/2) \le \max_{1\le j\le p} n P(Y_k^4 + X_{jk}^2 \ge M/2) \le C n \exp(-s\sqrt{M}/2).
\]
In combination with the convergence result for $\widetilde S^j_{1n,1}$, we get that for large enough $n$,
\[
P(|S^j_{1n} - S^j_1| \ge 4\epsilon) \le P(|\widetilde S^j_{1n} - S^j_1| \ge 2\epsilon) \le P(|\widetilde S^j_{1n,1} - S^j_{1,1}| \ge \epsilon) + P(|\widetilde S^j_{1n,2} - S^j_{1,2}| \ge \epsilon) \le 2\exp(-\epsilon^2 n^{1-2\gamma}/4) + C n \exp(-s n^{\gamma/2}/2).
\]

Part II: Consistency of $S^j_{2n}$. Write $S^j_{2n} = S^j_{2n,1}\, S^j_{2n,2}$, where $S^j_{2n,1} = n^{-2}\sum_{k,l=1}^n |X_{jk} - X_{jl}|$ and $S^j_{2n,2} = n^{-2}\sum_{k,l=1}^n Y_k Y_l$. Similarly, write its population counterpart as $S^j_2 = S^j_{2,1}\, S^j_{2,2}$, where $S^j_{2,1} = E|X_j - X_j'|$ and $S^j_{2,2} = E(Y Y')$. Following arguments similar to those in Part I, we can show that
\[
P(|S^j_{2n,1} - S^j_{2,1}| \ge 4\epsilon) \le 2\exp(-\epsilon^2 n^{1-2\gamma}/4) + C n \exp(-s n^{2\gamma}/4),
\]
\[
P(|S^j_{2n,2} - S^j_{2,2}| \ge 4\epsilon) \le 2\exp(-\epsilon^2 n^{1-2\gamma}/4) + C n \exp(-s n^{\gamma}).
\]
Assumption (A1) ensures that $S^j_{2,1} = E|X_j - X_j'| \le (E|X_j - X_j'|^2)^{1/2} \le [4E(|X_j|^2)]^{1/2}$ and $|S^j_{2,2}| = |E(Y Y')| \le \tfrac{1}{2}E(Y^2 + Y'^2) = E(Y^2)$ are both uniformly bounded. Let $C$ be a sufficiently large constant which satisfies
\[
C > \max\Big(\{S^j_{2,1}\}_{j=1}^p,\ \{|S^j_{2,2}|\}_{j=1}^p,\ \{E[\exp(sX_j^2)]\}_{j=1}^p,\ E[\exp(sY^2)],\ 1\Big) \quad\text{for } s \in (0, 2s_0].
\]
Note that
\[
S^j_{2n} - S^j_2 = S^j_{2n,1} S^j_{2n,2} - S^j_{2,1} S^j_{2,2} = (S^j_{2n,1} - S^j_{2,1})(S^j_{2n,2} - S^j_{2,2}) + S^j_{2,1}(S^j_{2n,2} - S^j_{2,2}) + S^j_{2,2}(S^j_{2n,1} - S^j_{2,1}).
\]
Therefore, by utilizing the above inequalities repeatedly, we can show that
\[
P\big(|(S^j_{2n,1} - S^j_{2,1})(S^j_{2n,2} - S^j_{2,2})| \ge \epsilon\big) \le P\big(|S^j_{2n,1} - S^j_{2,1}| \ge \sqrt{\epsilon}\big) + P\big(|S^j_{2n,2} - S^j_{2,2}| \ge \sqrt{\epsilon}\big) \le 4\exp(-\epsilon n^{1-2\gamma}/64) + 2Cn\exp(-s n^{\gamma}),
\]
\[
P\big(|S^j_{2,1}(S^j_{2n,2} - S^j_{2,2})| \ge \epsilon\big) \le P\big(|S^j_{2n,2} - S^j_{2,2}| \ge \epsilon/C\big) \le 2\exp\big(-\epsilon^2 n^{1-2\gamma}/(64C^2)\big) + Cn\exp(-s n^{\gamma}),
\]
and
\[
P\big(|S^j_{2,2}(S^j_{2n,1} - S^j_{2,1})| \ge \epsilon\big) \le P\big(|S^j_{2n,1} - S^j_{2,1}| \ge \epsilon/C\big) \le 2\exp\big(-\epsilon^2 n^{1-2\gamma}/(64C^2)\big) + Cn\exp(-s n^{2\gamma}/4).
\]
It follows from Bonferroni's inequality that
\[
P(|S^j_{2n} - S^j_2| \ge 3\epsilon) \le P\big(|(S^j_{2n,1} - S^j_{2,1})(S^j_{2n,2} - S^j_{2,2})| \ge \epsilon\big) + P\big(|S^j_{2,1}(S^j_{2n,2} - S^j_{2,2})| \ge \epsilon\big) + P\big(|S^j_{2,2}(S^j_{2n,1} - S^j_{2,1})| \ge \epsilon\big) \le 8\exp\big(-\epsilon^2 n^{1-2\gamma}/(64C^2)\big) + 4Cn\exp(-s n^{\gamma}).
\]

Part III: Consistency of $S^j_{3n}$. Define the corresponding U-statistic
\[
\widetilde S^j_{3n} = \{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}\big[Y_k Y_l |X_{jk} - X_{jh}| + Y_k Y_h |X_{jk} - X_{jl}| + Y_l Y_k |X_{jl} - X_{jh}| + Y_l Y_h |X_{jl} - X_{jk}| + Y_h Y_k |X_{jh} - X_{jl}| + Y_h Y_l |X_{jh} - X_{jk}|\big] = 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h} h_3(X_{jk}, Y_k; X_{jl}, Y_l; X_{jh}, Y_h),
\]
where $h_3(X_{jk}, Y_k; X_{jl}, Y_l; X_{jh}, Y_h)$ is the kernel function. Following the same argument used to deal with $\widetilde S^j_{1n}$, we write $\widetilde S^j_{3n}$ as
\[
\widetilde S^j_{3n} = 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h} h_3 I(|h_3| \le M) + 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h} h_3 I(|h_3| > M) = \widetilde S^j_{3n,1} + \widetilde S^j_{3n,2},
\]
and its population counterpart as $S^j_3 = E[h_3 I\{|h_3| \le M\}] + E[h_3 I\{|h_3| > M\}] = S^j_{3,1} + S^j_{3,2}$. By the same argument used for $\widetilde S^j_{1n,1}$, we can show that $P(|\widetilde S^j_{3n,1} - S^j_{3,1}| \ge \epsilon) \le 2\exp\{-\epsilon^2 m'/(2M^2)\}$, where $m' = \lfloor n/3 \rfloor$ owing to the fact that $\widetilde S^j_{3n}$ is a third-order U-statistic. Now it remains to establish the uniform convergence of the other part, $\widetilde S^j_{3n,2}$. Note that $|h_3(X_{jk}, Y_k; X_{jl}, Y_l; X_{jh}, Y_h)| \le Y_k^4 + Y_l^4 + Y_h^4 + X_{jk}^2 + X_{jh}^2 + X_{jl}^2$, so the event $\{|\widetilde S^j_{3n,2}| \ge \epsilon/2\}$ implies the event $\{Y_k^4 + X_{jk}^2 > M/3 \text{ for some } 1 \le k \le n\}$. Therefore, following a similar argument as presented in Part I, we have
\[
P(|\widetilde S^j_{3n,2} - S^j_{3,2}| \ge \epsilon) \le P(|\widetilde S^j_{3n,2}| \ge \epsilon/2) \le P\Big(\bigcup_{k=1}^n \{Y_k^4 + X_{jk}^2 \ge M/3\}\Big) \le Cn\exp(-s\sqrt{M}/\sqrt{6})
\]
for any $j$, $k$ and $s \in (0, 2s_0]$. Combining the two convergence results for $\widetilde S^j_{3n,1}$ and $\widetilde S^j_{3n,2}$ with $M = n^\gamma$ for some $0 < \gamma < 1/2 - \kappa$, it follows that
\[
P(|\widetilde S^j_{3n} - S^j_3| \ge 2\epsilon) \le 2\exp(-\epsilon^2 n^{1-2\gamma}/6) + Cn\exp(-s n^{\gamma/2}/\sqrt{6}).
\]
Note that
\[
S^j_{3n} - S^j_3 = \frac{(n-1)(n-2)}{n^2}\big(\widetilde S^j_{3n} - S^j_3\big) - \frac{3n-2}{n^2}\, S^j_3 + \frac{n-1}{n^2}\big(\widetilde S^j_{1n} - S^j_1\big) + \frac{n-1}{n^2}\, S^j_1.
\]
Following a similar argument to that for $S^j_1$, we can show that $S^j_3$ is also uniformly bounded in $j$. Therefore, for sufficiently large $n$, $(3n-2)|S^j_3|/n^2$ and $(n-1)|S^j_1|/n^2$ are both smaller than $\epsilon$ (in the case $\epsilon = cn^{-\kappa}$, this also holds). Then,
\[
P(|S^j_{3n} - S^j_3| \ge 4\epsilon) \le P(|\widetilde S^j_{3n} - S^j_3| \ge \epsilon) + P(|\widetilde S^j_{1n} - S^j_1| \ge \epsilon) \le 4\exp(-\epsilon^2 n^{1-2\gamma}/24) + 2Cn\exp(-s n^{\gamma/2}/\sqrt{6}).
\]
This, together with the consistency results in Parts I and II, yields that
\[
P\big\{|(2S^j_{3n} - S^j_{1n} - S^j_{2n}) - (2S^j_3 - S^j_1 - S^j_2)| \ge \epsilon\big\} \le P\Big(|S^j_{3n} - S^j_3| \ge \tfrac{\epsilon}{4}\Big) + P\Big(|S^j_{2n} - S^j_2| \ge \tfrac{\epsilon}{4}\Big) + P\Big(|S^j_{1n} - S^j_1| \ge \tfrac{\epsilon}{4}\Big) = O\big\{\exp(-c_1 \epsilon^2 n^{1-2\gamma}) + n\exp(-c_2 n^{\gamma/2})\big\}
\]
for some positive constants $c_1$ and $c_2$, and the bound is uniform with respect to $j = 1, \dots, p$. Analyzing the denominator of $\widehat\omega_j$ would generate the same form of convergence rate, so we omit the details here. Let $\epsilon = cn^{-\kappa}$, where $\kappa$ satisfies $0 < \kappa + \gamma < 1/2$. We then have
\[
P\big\{\max_{1\le j\le p} |\widehat\omega_j - \omega_j| \ge cn^{-\kappa}\big\} \le p \max_{1\le j\le p} P\big\{|\widehat\omega_j - \omega_j| \ge cn^{-\kappa}\big\} \le O\big(p[\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma/2})]\big),
\]
which finishes our proof of the first part of the theorem. If $\mathcal{D} \not\subseteq \widehat{\mathcal{D}}$, then there exists some $j \in \mathcal{D}$ such that $\widehat\omega_j < cn^{-\kappa}$. According to assumption (A2), this particular $j$ would make $|\widehat\omega_j - \omega_j| \ge cn^{-\kappa}$, which implies that
\[
A := \{\mathcal{D} \not\subseteq \widehat{\mathcal{D}}\} \subseteq \{|\widehat\omega_j - \omega_j| \ge cn^{-\kappa} \text{ for some } j \in \mathcal{D}\} =: B
\]
and hence $B^c \subseteq A^c$. Therefore,
\[
P(A^c) \ge P(B^c) = 1 - P(B) = 1 - P\big(|\widehat\omega_j - \omega_j| \ge cn^{-\kappa} \text{ for some } j \in \mathcal{D}\big) \ge 1 - s_n \max_{j\in\mathcal{D}} P\big(|\widehat\omega_j - \omega_j| \ge cn^{-\kappa}\big) \ge 1 - O\big(s_n[\exp\{-c_1 n^{1-2(\kappa+\gamma)}\} + n\exp(-c_2 n^{\gamma/2})]\big),
\]
where the first inequality above is due to Bonferroni's inequality. The proof is thus complete.

Proof of Theorem 6: We shall show the uniform consistency of $\widehat\omega_j(\widehat W) := \mathrm{MDC}^j_n(\widehat W)^2$ under assumptions (B1) and (B2). Due to the similarity of its numerator and denominator, we only deal with the numerator part, i.e., the consistency of $\mathrm{MDD}^j_n(\widehat W)^2$. First we demonstrate the consistency of $\mathrm{MDD}^j_n(W)^2$, and then study the difference between $\mathrm{MDD}^j_n(W)^2$ and $\mathrm{MDD}^j_n(\widehat W)^2$. Since $W$ and $\widehat W$ are uniformly bounded, we can adopt the argument in the proof of Theorem 1 of Li, Zhong and Zhu (2012) (see also the proof of Theorem 5 for a slightly modified argument, where the bound can be slightly improved under the assumption that the response variable is bounded) and get that for any $\gamma \in (0, 1/2 - \kappa)$ there exist positive constants $c_1$ and $c_2$ such that for a sufficiently small $\epsilon$ (say $\epsilon = cn^{-\kappa}$, as will be specified later),
\[
P\big(|\mathrm{MDD}^j_n(W)^2 - \mathrm{MDD}^j(W)^2| \ge \epsilon\big) \le C[\exp\{-c_1 \epsilon^2 n^{1-2\gamma}\} + n\exp(-c_2 n^{\gamma})]. \tag{12}
\]
Next we analyze the difference between $\mathrm{MDD}^j_n(W)^2$ and $\mathrm{MDD}^j_n(\widehat W)^2$. Denote
\[
\widehat T^j_{1n} = \frac{1}{n^2}\sum_{k,l=1}^n \widehat W_k \widehat W_l |X_{jk} - X_{jl}|, \qquad \widehat T^j_{2n} = \frac{1}{n^2}\sum_{k,l=1}^n \widehat W_k \widehat W_l \cdot \frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}|, \qquad \widehat T^j_{3n} = \frac{1}{n^3}\sum_{k,l,h=1}^n \widehat W_k \widehat W_h |X_{jk} - X_{jl}|.
\]
Similarly, $T^j_{1n}$, $T^j_{2n}$ and $T^j_{3n}$ are defined with $\{\widehat W_k\}_{k=1}^n$ replaced by $\{W_k\}_{k=1}^n$. Let $C_0 = \tau + 1$. By using the triangle inequality and the boundedness of $W_k$ and $\widehat W_k$, we can derive that
\[
|\mathrm{MDD}^j_n(\widehat W)^2 - \mathrm{MDD}^j_n(W)^2| \le |\widehat T^j_{1n} - T^j_{1n}| + |\widehat T^j_{2n} - T^j_{2n}| + 2|\widehat T^j_{3n} - T^j_{3n}|
\]
\[
= \Big|\frac{1}{n^2}\sum_{k,l=1}^n [\widehat W_k\widehat W_l - W_k W_l]\,|X_{jk} - X_{jl}|\Big| + \Big|\frac{1}{n^2}\sum_{k,l=1}^n [\widehat W_k\widehat W_l - W_k W_l]\Big| \cdot \frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}| + 2\Big|\frac{1}{n^3}\sum_{k,l,h=1}^n [\widehat W_k\widehat W_h - W_k W_h]\,|X_{jk} - X_{jl}|\Big|
\]
\[
\le \frac{1}{n^2}\sum_{k,l=1}^n \big|\widehat W_k(\widehat W_l - W_l) + W_l(\widehat W_k - W_k)\big|\,|X_{jk} - X_{jl}| + \frac{1}{n^2}\sum_{k,l=1}^n \big|\widehat W_k(\widehat W_l - W_l) + W_l(\widehat W_k - W_k)\big| \cdot \frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}| + \frac{2}{n^3}\sum_{k,l,h=1}^n \big|\widehat W_k(\widehat W_h - W_h) + W_h(\widehat W_k - W_k)\big|\,|X_{jk} - X_{jl}|
\]
\[
\le \frac{4C_0}{n^2}\sum_{k,l=1}^n |\widehat W_l - W_l|\,|X_{jk} - X_{jl}| + \frac{4C_0}{n}\sum_{k=1}^n |\widehat W_k - W_k| \cdot \frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}| =: \Delta_1 + \Delta_2.
\]
We first treat $\Delta_1$. By the Cauchy-Schwarz inequality, we have
\[
(\Delta_1)^2 \le 16C_0^2\, \frac{1}{n}\sum_{l=1}^n |\widehat W_l - W_l|^2 \cdot \frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}|^2 = 16C_0^2\, \frac{1}{n}\sum_{l=1}^n |\widehat W_l - W_l|^2 \Big[\frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}|^2 - E|X_{j1} - X_{j2}|^2\Big] + 16C_0^2\, \frac{1}{n}\sum_{l=1}^n |\widehat W_l - W_l|^2\, E|X_{j1} - X_{j2}|^2 =: D_1 + D_2.
\]
Noting that $n^{-1}\sum_{l=1}^n |\widehat W_l - W_l|^2 \le 4C_0^2$, we then have
\[
P(|D_1| \ge \epsilon^2/2) \le P\Big(\Big|\frac{1}{n^2}\sum_{k,l=1}^n |X_{jk} - X_{jl}|^2 - E|X_{j1} - X_{j2}|^2\Big| \ge \epsilon^2/(128 C_0^4)\Big) \le C\big(\exp(-c_1 \epsilon^4 n^{1-2\gamma}) + n\exp(-c_2 n^{2\gamma})\big)
\]
for some positive constants $c_1$ and $c_2$, based on equation (B.7) in Li, Zhong and Zhu (2012).
Under assumption (B2), there exists a positive constant $C_2 < \infty$ such that $E|X_{j1} - X_{j2}|^2 \le 4E|X_j|^2 < C_2$. Then, by Proposition 2,
\[
P(D_2 \ge \epsilon^2/2) \le P\Big(\frac{1}{n}\sum_{l=1}^n |\widehat W_l - W_l|^2 \ge \epsilon^2/(32 C_0^2 C_2)\Big) \le C\exp(-n c_3 \epsilon^4)
\]
for small enough $\epsilon$ and some $c_3 > 0$. Combining the probability bounds derived for $D_1$ and $D_2$,
\[
P(\Delta_1 \ge \epsilon) \le C\big(\exp(-c_1 \epsilon^4 n^{1-2\gamma}) + n\exp(-c_2 n^{2\gamma}) + \exp(-c_3 n\epsilon^4)\big) \le C\big(\exp(-c_1 \epsilon^4 n^{1-2\gamma}) + n\exp(-c_2 n^{2\gamma})\big),
\]
where the third term on the right-hand side can be absorbed into the first term. In a similar fashion, we can derive that $P(\Delta_2 \ge \epsilon) \le C\big(\exp(-c_1 \epsilon^2 n^{1-2\gamma}) + n\exp(-c_2 n^{2\gamma})\big)$ for some positive constants $c_1$, $c_2$. Consequently, in view of (12), we have that
\[
P\big(|\mathrm{MDD}^j_n(\widehat W)^2 - \mathrm{MDD}^j(W)^2| \ge 3\epsilon\big) \le P\big(|\mathrm{MDD}^j_n(W)^2 - \mathrm{MDD}^j(W)^2| \ge \epsilon\big) + P(\Delta_1 \ge \epsilon) + P(\Delta_2 \ge \epsilon) \le C\big(\exp(-c_1 \epsilon^4 n^{1-2\gamma}) + n\exp(-c_2 n^{\gamma})\big)
\]
for a sufficiently small $\epsilon$ and some positive constants $c_1$, $c_2$. The analysis of the denominator of $\mathrm{MDC}^j_n(\widehat W)^2$ would generate a similar form of convergence rate. Therefore, if we set $\epsilon = cn^{-\kappa}$, where $\kappa$ satisfies $0 < 2\kappa + \gamma < 1/2$, we would have
\[
P\big\{\max_{1\le j\le p} |\widehat\omega_j(\widehat W) - \omega_j(W)| \ge cn^{-\kappa}\big\} \le p\max_{1\le j\le p} P\big\{|\widehat\omega_j(\widehat W) - \omega_j(W)| \ge cn^{-\kappa}\big\} \le C\big(p[\exp(-c_1 n^{1-2(\gamma+2\kappa)}) + n\exp(-c_2 n^{\gamma})]\big),
\]
which proves the first assertion. The second assertion follows from the same argument used in proving the second statement of Theorem 5. The proof is complete.
References

Fan, J., Feng, Y., and Song, R. (2011), "Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models," Journal of the American Statistical Association, 106.

Fan, J., and Lv, J. (2008), "Sure Independence Screening for Ultra-High Dimensional Feature Space" (with discussions and rejoinder), Journal of the Royal Statistical Society, Series B, 70.

He, X., Wang, L., and Hong, H. G. (2013), "Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data," Annals of Statistics, 41.

Li, R., Zhong, W., and Zhu, L. (2012), "Feature Screening via Distance Correlation Learning," Journal of the American Statistical Association, 107.

Meier, L., van de Geer, S., and Bühlmann, P. (2009), "High-Dimensional Additive Modeling," Annals of Statistics, 37.

Resnick, S. I. (1999), A Probability Path, Boston: Birkhäuser.

Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, New York: John Wiley & Sons.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007), "Measuring and Testing Dependence by Correlation of Distances," Annals of Statistics, 35.

Székely, G. J., and Rizzo, M. L. (2009), "Brownian Distance Covariance," The Annals of Applied Statistics, 3.

Zhu, L., Li, L., Li, R., and Zhu, L. (2011), "Model-Free Feature Screening for Ultrahigh-Dimensional Data," Journal of the American Statistical Association, 106.
More informationPreliminary Exam: Probability 9:00am 2:00pm, Friday, January 6, 2012
Preliminary Exam: Probability 9:00am 2:00pm, Friday, January 6, 202 The exam lasts from 9:00am until 2:00pm, with a walking break every hour. Your goal on this exam should be to demonstrate mastery of
More informationAdditive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535
Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June
More informationarxiv: v1 [stat.me] 11 May 2016
Submitted to the Annals of Statistics INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION arxiv:1605.03315v1 [stat.me] 11 May 2016 By Yinfei Kong, Daoji Li, Yingying
More informationarxiv: v5 [math.na] 16 Nov 2017
RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem
More informationSurvival impact index and ultrahigh-dimensional model-free screening with survival outcomes
Survival impact index and ultrahigh-dimensional model-free screening with survival outcomes Jialiang Li, National University of Singapore Qi Zheng, University of Louisville Limin Peng, Emory University
More informationERRATA: Probabilistic Techniques in Analysis
ERRATA: Probabilistic Techniques in Analysis ERRATA 1 Updated April 25, 26 Page 3, line 13. A 1,..., A n are independent if P(A i1 A ij ) = P(A 1 ) P(A ij ) for every subset {i 1,..., i j } of {1,...,
More informationPartial martingale difference correlation
Electronic Journal of Statistics Vol. 9 (2015) 1492 1517 ISSN: 1935-7524 DOI: 10.1214/15-EJS1047 Partial martingale difference correlation Trevor Park, Xiaofeng Shao and Shun Yao University of Illinois
More informationWiener Measure and Brownian Motion
Chapter 16 Wiener Measure and Brownian Motion Diffusion of particles is a product of their apparently random motion. The density u(t, x) of diffusing particles satisfies the diffusion equation (16.1) u
More informationBrownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory
More informationCHAPTER 3: LARGE SAMPLE THEORY
CHAPTER 3 LARGE SAMPLE THEORY 1 CHAPTER 3: LARGE SAMPLE THEORY CHAPTER 3 LARGE SAMPLE THEORY 2 Introduction CHAPTER 3 LARGE SAMPLE THEORY 3 Why large sample theory studying small sample property is usually
More informationn E(X t T n = lim X s Tn = X s
Stochastic Calculus Example sheet - Lent 15 Michael Tehranchi Problem 1. Let X be a local martingale. Prove that X is a uniformly integrable martingale if and only X is of class D. Solution 1. If If direction:
More informationApproximate interval estimation for EPMC for improved linear discriminant rule under high dimensional frame work
Hiroshima Statistical Research Group: Technical Report Approximate interval estimation for PMC for improved linear discriminant rule under high dimensional frame work Masashi Hyodo, Tomohiro Mitani, Tetsuto
More informationGeneralizing Distance Covariance to Measure. and Test Multivariate Mutual Dependence
Generalizing Distance Covariance to Measure and Test Multivariate Mutual Dependence arxiv:1709.02532v5 [math.st] 25 Feb 2018 Ze Jin, David S. Matteson February 27, 2018 Abstract We propose three new measures
More informationCONDITIONAL MEAN AND QUANTILE DEPENDENCE TESTING IN HIGH DIMENSION
Submitted to the Annals of Statistics CONDITIONAL MEAN AND QUANTILE DEPENDENCE TESTING IN HIGH DIMENSION By Xianyang Zhang, Shun Yao, and Xiaofeng Shao Texas A&M University and University of Illinois at
More informationFeature Selection for Varying Coefficient Models With Ultrahigh Dimensional Covariates
Feature Selection for Varying Coefficient Models With Ultrahigh Dimensional Covariates Jingyuan Liu, Runze Li and Rongling Wu Abstract This paper is concerned with feature screening and variable selection
More information4 Sums of Independent Random Variables
4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables
More informationDoléans measures. Appendix C. C.1 Introduction
Appendix C Doléans measures C.1 Introduction Once again all random processes will live on a fixed probability space (Ω, F, P equipped with a filtration {F t : 0 t 1}. We should probably assume the filtration
More informationBrownian Motion. 1 Definition Brownian Motion Wiener measure... 3
Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................
More informationStatistica Sinica Preprint No: SS
Statistica Sinica Preprint No: SS-2018-0176 Title A Lack-Of-Fit Test with Screening in Sufficient Dimension Reduction Manuscript ID SS-2018-0176 URL http://www.stat.sinica.edu.tw/statistica/ DOI 10.5705/ss.202018.0176
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationSubmitted to the Brazilian Journal of Probability and Statistics
Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a
More informationMultivariate Gaussian Distribution. Auxiliary notes for Time Series Analysis SF2943. Spring 2013
Multivariate Gaussian Distribution Auxiliary notes for Time Series Analysis SF2943 Spring 203 Timo Koski Department of Mathematics KTH Royal Institute of Technology, Stockholm 2 Chapter Gaussian Vectors.
More informationSTOCHASTIC GEOMETRY BIOIMAGING
CENTRE FOR STOCHASTIC GEOMETRY AND ADVANCED BIOIMAGING 2018 www.csgb.dk RESEARCH REPORT Anders Rønn-Nielsen and Eva B. Vedel Jensen Central limit theorem for mean and variogram estimators in Lévy based
More informationAdditional proofs and details
Statistica Sinica: Supplement NONPARAMETRIC MODEL CHECKS OF SINGLE-INDEX ASSUMPTIONS Samuel Maistre 1,2,3 and Valentin Patilea 1 1 CREST (Ensai) 2 Université de Lyon 3 Université de Strasbourg Supplementary
More informationItô s formula. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Itô s formula Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Itô s formula Probability Theory
More information2 Two-Point Boundary Value Problems
2 Two-Point Boundary Value Problems Another fundamental equation, in addition to the heat eq. and the wave eq., is Poisson s equation: n j=1 2 u x 2 j The unknown is the function u = u(x 1, x 2,..., x
More information4 Uniform convergence
4 Uniform convergence In the last few sections we have seen several functions which have been defined via series or integrals. We now want to develop tools that will allow us to show that these functions
More informationAn Introduction to Signal Detection and Estimation - Second Edition Chapter III: Selected Solutions
An Introduction to Signal Detection and Estimation - Second Edition Chapter III: Selected Solutions H. V. Poor Princeton University March 17, 5 Exercise 1: Let {h k,l } denote the impulse response of a
More informationP (A G) dp G P (A G)
First homework assignment. Due at 12:15 on 22 September 2016. Homework 1. We roll two dices. X is the result of one of them and Z the sum of the results. Find E [X Z. Homework 2. Let X be a r.v.. Assume
More informationLecture 16: Sample quantiles and their asymptotic properties
Lecture 16: Sample quantiles and their asymptotic properties Estimation of quantiles (percentiles Suppose that X 1,...,X n are i.i.d. random variables from an unknown nonparametric F For p (0,1, G 1 (p
More informationOn the martingales obtained by an extension due to Saisho, Tanemura and Yor of Pitman s theorem
On the martingales obtained by an extension due to Saisho, Tanemura and Yor of Pitman s theorem Koichiro TAKAOKA Dept of Applied Physics, Tokyo Institute of Technology Abstract M Yor constructed a family
More informationOnline Appendix. j=1. φ T (ω j ) vec (EI T (ω j ) f θ0 (ω j )). vec (EI T (ω) f θ0 (ω)) = O T β+1/2) = o(1), M 1. M T (s) exp ( isω)
Online Appendix Proof of Lemma A.. he proof uses similar arguments as in Dunsmuir 979), but allowing for weak identification and selecting a subset of frequencies using W ω). It consists of two steps.
More informationDistance between multinomial and multivariate normal models
Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating
More informationGaussian vectors and central limit theorem
Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables
More informationDETERMINATION OF THE BLOW-UP RATE FOR THE SEMILINEAR WAVE EQUATION
DETERMINATION OF THE LOW-UP RATE FOR THE SEMILINEAR WAVE EQUATION y FRANK MERLE and HATEM ZAAG Abstract. In this paper, we find the optimal blow-up rate for the semilinear wave equation with a power nonlinearity.
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationSingle Index Quantile Regression for Heteroscedastic Data
Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University JSM, 2015 E. Christou, M. G. Akritas (PSU) SIQR JSM, 2015
More informationConcentration behavior of the penalized least squares estimator
Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar
More informationRegularity of the density for the stochastic heat equation
Regularity of the density for the stochastic heat equation Carl Mueller 1 Department of Mathematics University of Rochester Rochester, NY 15627 USA email: cmlr@math.rochester.edu David Nualart 2 Department
More informationAsymptotic Statistics-III. Changliang Zou
Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (
More informationInternational Journal of Pure and Applied Mathematics Volume 21 No , THE VARIANCE OF SAMPLE VARIANCE FROM A FINITE POPULATION
International Journal of Pure and Applied Mathematics Volume 21 No. 3 2005, 387-394 THE VARIANCE OF SAMPLE VARIANCE FROM A FINITE POPULATION Eungchun Cho 1, Moon Jung Cho 2, John Eltinge 3 1 Department
More informationOn Expected Gaussian Random Determinants
On Expected Gaussian Random Determinants Moo K. Chung 1 Department of Statistics University of Wisconsin-Madison 1210 West Dayton St. Madison, WI 53706 Abstract The expectation of random determinants whose
More informationThe Central Limit Theorem Under Random Truncation
The Central Limit Theorem Under Random Truncation WINFRIED STUTE and JANE-LING WANG Mathematical Institute, University of Giessen, Arndtstr., D-3539 Giessen, Germany. winfried.stute@math.uni-giessen.de
More informationLecture Notes 3 Convergence (Chapter 5)
Lecture Notes 3 Convergence (Chapter 5) 1 Convergence of Random Variables Let X 1, X 2,... be a sequence of random variables and let X be another random variable. Let F n denote the cdf of X n and let
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationCombinatorial Dimension in Fractional Cartesian Products
Combinatorial Dimension in Fractional Cartesian Products Ron Blei, 1 Fuchang Gao 1 Department of Mathematics, University of Connecticut, Storrs, Connecticut 0668; e-mail: blei@math.uconn.edu Department
More information3 Integration and Expectation
3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ
More informationA Note on the Central Limit Theorem for the Eigenvalue Counting Function of Wigner and Covariance Matrices
A Note on the Central Limit Theorem for the Eigenvalue Counting Function of Wigner and Covariance Matrices S. Dallaporta University of Toulouse, France Abstract. This note presents some central limit theorems
More informationSTAT 7032 Probability Spring Wlodek Bryc
STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,
More informationSPECTRAL GAP FOR ZERO-RANGE DYNAMICS. By C. Landim, S. Sethuraman and S. Varadhan 1 IMPA and CNRS, Courant Institute and Courant Institute
The Annals of Probability 996, Vol. 24, No. 4, 87 902 SPECTRAL GAP FOR ZERO-RANGE DYNAMICS By C. Landim, S. Sethuraman and S. Varadhan IMPA and CNRS, Courant Institute and Courant Institute We give a lower
More informationSTAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song
STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April
More informationRisk-Minimality and Orthogonality of Martingales
Risk-Minimality and Orthogonality of Martingales Martin Schweizer Universität Bonn Institut für Angewandte Mathematik Wegelerstraße 6 D 53 Bonn 1 (Stochastics and Stochastics Reports 3 (199, 123 131 2
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationLecture 5 Channel Coding over Continuous Channels
Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From
More informationConvergence at first and second order of some approximations of stochastic integrals
Convergence at first and second order of some approximations of stochastic integrals Bérard Bergery Blandine, Vallois Pierre IECN, Nancy-Université, CNRS, INRIA, Boulevard des Aiguillettes B.P. 239 F-5456
More informationStein s Method and Characteristic Functions
Stein s Method and Characteristic Functions Alexander Tikhomirov Komi Science Center of Ural Division of RAS, Syktyvkar, Russia; Singapore, NUS, 18-29 May 2015 Workshop New Directions in Stein s method
More informationGeneralized Gaussian Bridges of Prediction-Invertible Processes
Generalized Gaussian Bridges of Prediction-Invertible Processes Tommi Sottinen 1 and Adil Yazigi University of Vaasa, Finland Modern Stochastics: Theory and Applications III September 1, 212, Kyiv, Ukraine
More informationHomework # , Spring Due 14 May Convergence of the empirical CDF, uniform samples
Homework #3 36-754, Spring 27 Due 14 May 27 1 Convergence of the empirical CDF, uniform samples In this problem and the next, X i are IID samples on the real line, with cumulative distribution function
More informationA FULL-NEWTON STEP INFEASIBLE-INTERIOR-POINT ALGORITHM COMPLEMENTARITY PROBLEMS
Yugoslav Journal of Operations Research 25 (205), Number, 57 72 DOI: 0.2298/YJOR3055034A A FULL-NEWTON STEP INFEASIBLE-INTERIOR-POINT ALGORITHM FOR P (κ)-horizontal LINEAR COMPLEMENTARITY PROBLEMS Soodabeh
More informationConsidering our result for the sum and product of analytic functions, this means that for (a 0, a 1,..., a N ) C N+1, the polynomial.
Lecture 3 Usual complex functions MATH-GA 245.00 Complex Variables Polynomials. Construction f : z z is analytic on all of C since its real and imaginary parts satisfy the Cauchy-Riemann relations and
More informationEstimation of the Bivariate and Marginal Distributions with Censored Data
Estimation of the Bivariate and Marginal Distributions with Censored Data Michael Akritas and Ingrid Van Keilegom Penn State University and Eindhoven University of Technology May 22, 2 Abstract Two new
More information(2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS
(2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS Svetlana Janković and Miljana Jovanović Faculty of Science, Department of Mathematics, University
More informationResearch Article Exponential Inequalities for Positively Associated Random Variables and Applications
Hindawi Publishing Corporation Journal of Inequalities and Applications Volume 008, Article ID 38536, 11 pages doi:10.1155/008/38536 Research Article Exponential Inequalities for Positively Associated
More informationResearch Article Existence and Uniqueness Theorem for Stochastic Differential Equations with Self-Exciting Switching
Discrete Dynamics in Nature and Society Volume 211, Article ID 549651, 12 pages doi:1.1155/211/549651 Research Article Existence and Uniqueness Theorem for Stochastic Differential Equations with Self-Exciting
More informationExercises Measure Theoretic Probability
Exercises Measure Theoretic Probability 2002-2003 Week 1 1. Prove the folloing statements. (a) The intersection of an arbitrary family of d-systems is again a d- system. (b) The intersection of an arbitrary
More informationAn introduction to some aspects of functional analysis
An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms
More information9 Brownian Motion: Construction
9 Brownian Motion: Construction 9.1 Definition and Heuristics The central limit theorem states that the standard Gaussian distribution arises as the weak limit of the rescaled partial sums S n / p n of
More informationKai Lai Chung
First Prev Next Go To Go Back Full Screen Close Quit 1 Kai Lai Chung 1917-29 Mathematicians are more inclined to build fire stations than to put out fires. Courses from Chung First Prev Next Go To Go Back
More informationWeak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij
Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real
More informationLinear Ordinary Differential Equations
MTH.B402; Sect. 1 20180703) 2 Linear Ordinary Differential Equations Preliminaries: Matrix Norms. Denote by M n R) the set of n n matrix with real components, which can be identified the vector space R
More informationI forgot to mention last time: in the Ito formula for two standard processes, putting
I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More informationSEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES
SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES S.S. DRAGOMIR Abstract. The main aim of this paper is to establish some connections that exist between the numerical
More informationMOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES
J. Korean Math. Soc. 47 1, No., pp. 63 75 DOI 1.4134/JKMS.1.47..63 MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES Ke-Ang Fu Li-Hua Hu Abstract. Let X n ; n 1 be a strictly stationary
More informationExponential martingales: uniform integrability results and applications to point processes
Exponential martingales: uniform integrability results and applications to point processes Alexander Sokol Department of Mathematical Sciences, University of Copenhagen 26 September, 2012 1 / 39 Agenda
More informationINTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION 1
The Annals of Statistics 017, Vol. 45, No., 897 9 DOI: 10.114/16-AOS1474 Institute of Mathematical Statistics, 017 INTERACTION PURSUIT IN HIGH-DIMENSIONAL MULTI-RESPONSE REGRESSION VIA DISTANCE CORRELATION
More informationLecture 12. F o s, (1.1) F t := s>t
Lecture 12 1 Brownian motion: the Markov property Let C := C(0, ), R) be the space of continuous functions mapping from 0, ) to R, in which a Brownian motion (B t ) t 0 almost surely takes its value. Let
More informationThe largest eigenvalues of the sample covariance matrix. in the heavy-tail case
The largest eigenvalues of the sample covariance matrix 1 in the heavy-tail case Thomas Mikosch University of Copenhagen Joint work with Richard A. Davis (Columbia NY), Johannes Heiny (Aarhus University)
More informationDetecting instants of jumps and estimating intensity of jumps from continuous or discrete data
Detecting instants of jumps and estimating intensity of jumps from continuous or discrete data Denis Bosq 1 Delphine Blanke 2 1 LSTA, Université Pierre et Marie Curie - Paris 6 2 LMA, Université d'avignon
More informationKrzysztof Burdzy University of Washington. = X(Y (t)), t 0}
VARIATION OF ITERATED BROWNIAN MOTION Krzysztof Burdzy University of Washington 1. Introduction and main results. Suppose that X 1, X 2 and Y are independent standard Brownian motions starting from 0 and
More informationSupplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data
Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department
More informationFast-slow systems with chaotic noise
Fast-slow systems with chaotic noise David Kelly Ian Melbourne Courant Institute New York University New York NY www.dtbkelly.com May 1, 216 Statistical properties of dynamical systems, ESI Vienna. David
More informationExistence and Uniqueness
Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect
More informationExercises in stochastic analysis
Exercises in stochastic analysis Franco Flandoli, Mario Maurelli, Dario Trevisan The exercises with a P are those which have been done totally or partially) in the previous lectures; the exercises with
More informationarxiv: v2 [math.co] 20 Jun 2018
ON ORDERED RAMSEY NUMBERS OF BOUNDED-DEGREE GRAPHS MARTIN BALKO, VÍT JELÍNEK, AND PAVEL VALTR arxiv:1606.0568v [math.co] 0 Jun 018 Abstract. An ordered graph is a pair G = G, ) where G is a graph and is
More information