Empirical Likelihood Tests for High-dimensional Data
Department of Statistics and Actuarial Science, University of Waterloo, Canada
ICSA-Canada Chapter 2013 Symposium, Toronto, August 2-3, 2013
Based on joint work with Liang Peng and Rongmao Zhang
Introduction. In this talk we discuss empirical likelihood ratio tests for high-dimensional data. Let $X_i = (X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, be iid random vectors with mean $\mu = (\mu_1, \ldots, \mu_p)$ and covariance $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$. Here $p$ may depend on $n$. If $p$ is fixed, this is the traditional statistical setting; if $p \to \infty$, it is the high-dimensional setting.
Typical testing questions:
- Testing $\mu = \mu_0$ (one-sample mean test).
- Testing equality of two sample means.
- Testing $\Sigma = \Sigma_0$ (covariance matrix test).
- Testing equality of two sample covariance matrices.
We focus on testing covariance matrices.
Problem setup. Let $X_i = (X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, be iid random vectors with mean $\mu = (\mu_1, \ldots, \mu_p)$ and covariance $\Sigma = (\sigma_{ij})_{1 \le i,j \le p}$.
Testing the covariance matrix:
$$H_0\colon \Sigma = \Sigma_0 \quad \text{against} \quad H_1\colon \Sigma \neq \Sigma_0. \qquad (1)$$
Testing bandedness:
$$H_0\colon \sigma_{ij} = 0 \ \text{for all } |i - j| \ge \tau. \qquad (2)$$
The setting is non-parametric, with no information about sparsity.
Literature.
Testing (1) for fixed $p$: the traditional likelihood ratio test; scaled distance measure tests (John (1971, 1972) and Nagao (1973)).
Testing (1) for divergent $p$: Ledoit and Wolf (2002) for normal $X_i$ and Chen, Zhang and Zhong (2010) for a factor model. Specific models are imposed, and restrictions are put on $p$: $p$ has to go to infinity as $n \to \infty$.
Testing (2) for divergent $p$: Cai and Jiang (2011). Their test statistic, the coherence, converges slowly; normality is assumed; restrictions are put on $p$ and $\tau$.
Testing (2) for divergent $p$: Qiu and Chen (2012). Similar to Chen, Zhang and Zhong (2010): specific models and restrictions.
Our goal: build a test statistic that works for both (1) and (2); loosen the conditions on $p$; get rid of specific models.
First we assume $\mu$ is known. The case when $\mu$ is unknown is very similar.
Basic observation. $\Sigma = \Sigma_0$ is equivalent to
$$D^2 := \|\Sigma - \Sigma_0\|_F^2 = \mathrm{tr}\big((\Sigma - \Sigma_0)^2\big) = 0.$$
We can construct our test based on an estimator of $D^2$.
A natural estimator. For $i = 1, \ldots, n$, define the $p \times p$ matrix $Y_i = (X_i - \mu)(X_i - \mu)^T$ and the estimator
$$e(\Gamma) = \mathrm{tr}\big((Y_1 - \Gamma)(Y_2 - \Gamma)\big).$$
Then $E[Y_1] = \Sigma$ and $E[e(\Sigma_0)] = D^2$, so $E[e(\Sigma_0)] = 0$ is equivalent to $\Sigma = \Sigma_0$.
We need independent copies of $(Y_1, Y_2)$.
Splitting the sample. Let $N = \lfloor n/2 \rfloor$. For $i = 1, \ldots, N$, we define
$$e_i(\Sigma) = \mathrm{tr}\big((Y_i - \Sigma)(Y_{N+i} - \Sigma)\big).$$
It is very difficult to estimate the variance of $e_i$; the empirical likelihood method automatically captures the asymptotic variance.
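To make the construction concrete, here is a minimal numpy sketch of the split-sample statistics, assuming $\mu$ is known; for later convenience it also returns the second statistic $v_i(\Sigma)$ introduced two slides below. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def split_sample_stats(X, mu, Sigma0):
    """Split-sample statistics e_i(Sigma0) and v_i(Sigma0), i = 1..N.

    X      : (n, p) data matrix of iid rows
    mu     : known mean vector, shape (p,)
    Sigma0 : hypothesized covariance, shape (p, p)
    """
    n, p = X.shape
    N = n // 2                                   # N = floor(n/2)
    Z = X - mu                                   # centred observations
    ones = np.ones(p)
    e = np.empty(N)
    v = np.empty(N)
    for i in range(N):
        Y1 = np.outer(Z[i], Z[i])                # Y_i = (X_i - mu)(X_i - mu)^T
        Y2 = np.outer(Z[N + i], Z[N + i])        # Y_{N+i}
        # tr(AB) = sum(A * B) elementwise for symmetric A, B
        e[i] = np.sum((Y1 - Sigma0) * (Y2 - Sigma0))
        v[i] = ones @ (Y1 + Y2 - 2.0 * Sigma0) @ ones
    return e, v
```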
Define the empirical likelihood ratio function with the constraint (estimating equation) $E[e_1(\Sigma)] = 0$:
$$L_0(\Sigma) = \sup\Big\{ \prod_{i=1}^N (N p_i) : \sum_{i=1}^N p_i = 1,\ \sum_{i=1}^N p_i e_i(\Sigma) = 0,\ p_i \ge 0 \Big\}.$$
Under some regularity conditions and $H_0$, $-2 \log L_0(\Sigma_0)$ converges weakly to $\chi^2_1$. This seems good, but...
Shortfall of the test based on $L_0$. When $\|\Sigma - \Sigma_0\|_F^2$ is small, $E[e_1(\Sigma_0)]$ is very close to 0 (at the rate $\|\Sigma - \Sigma_0\|_F^2$). In this case the test based on $L_0$ has poor power. Later we will see this in the power analysis.
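As a computational aside, the profile empirical likelihood with estimating equations (Qin and Lawless (1994)) reduces to a Lagrange-multiplier equation: $p_i = 1/(N(1 + \lambda^T g_i))$ with $\lambda$ solving $\sum_i g_i/(1 + \lambda^T g_i) = 0$. A hedged sketch for a general $k$-dimensional constraint vector (so it also covers the two-constraint statistic defined next); this is an illustration, not the authors' code.

```python
import numpy as np

def el_log_ratio(g, max_iter=50, tol=1e-10):
    """-2 log EL ratio for H0: E[g_i] = 0, with g an (N, k) array.

    Solves sum_i g_i / (1 + lam' g_i) = 0 for the multiplier lam
    by damped Newton, then returns 2 * sum_i log(1 + lam' g_i).
    """
    N, k = g.shape
    lam = np.zeros(k)
    for _ in range(max_iter):
        w = 1.0 + g @ lam                        # 1 + lam' g_i
        score = (g / w[:, None]).sum(axis=0)     # constraint to be zeroed
        if np.max(np.abs(score)) < tol:
            break
        jac = -(g.T * (1.0 / w**2)) @ g          # d(score)/d(lam)
        step = np.linalg.solve(jac, score)
        t = 1.0                                  # damping: keep p_i in (0, 1]
        while t > 1e-8 and np.any(1.0 + g @ (lam - t * step) < 1.0 / N):
            t /= 2.0
        lam = lam - t * step
    return 2.0 * np.sum(np.log1p(g @ lam))
```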
Secondary constraint. We add one more constraint, which is easier to break under $H_1$. The choice of the second linear constraint can be arbitrary. With no prior information, the following statistics $v_i(\Sigma)$:
$$v_i(\Sigma) = \mathbf{1}_p^T (Y_i + Y_{N+i} - 2\Sigma) \mathbf{1}_p$$
can be used in a constraint $E[v_1(\Sigma_0)] = 0$.
We define the empirical likelihood function with the two constraints as
$$L_1(\Sigma_0) = \sup\Big\{ \prod_{i=1}^N (N p_i) : \sum_{i=1}^N p_i = 1,\ \sum_{i=1}^N p_i \begin{pmatrix} e_i(\Sigma_0) \\ v_i(\Sigma_0) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ p_i \ge 0 \Big\}.$$
Theorem 1. Suppose $e_1(\Sigma_0)$ and $v_1(\Sigma_0)$ satisfy a regularity condition (P). Then under $H_0$, $-2 \log L_1(\Sigma_0)$ converges in distribution to $\chi^2_2$ as $n \to \infty$.
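Combining the two sketches above, the test of Theorem 1 would be carried out roughly as follows (X, mu and Sigma0 are placeholders for the data, the known mean and the hypothesized covariance):

```python
from scipy.stats import chi2

e, v = split_sample_stats(X, mu, Sigma0)   # sketch from an earlier slide
g = np.column_stack([e, v])                # rows g_i = (e_i(Sigma0), v_i(Sigma0))
stat = el_log_ratio(g)                     # -2 log L_1(Sigma0)
p_value = chi2.sf(stat, df=2)              # Theorem 1: chi^2_2 limit under H0
```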
CLT condition (similar to the Lyapunov condition).
(P) We say a statistic $T$ with sample size $n$ satisfies condition (P) if $E T^2 > 0$ and, for some $\delta > 0$,
$$\frac{E|T|^{2+\delta}}{(E T^2)^{1+\delta/2}} = o\big(n^{(\delta + \min(\delta, 2))/4}\big).$$
For example, if $E(T^4)/(E(T^2))^2 = o(n)$, then $T$ satisfies (P) with $\delta = 2$.
Remark 1. To prove Theorem 1, it is sufficient to show that condition (P) guarantees that the sample
$$t_i = \Big( \frac{e_i(\Sigma_0)}{\sqrt{\mathrm{Var}(e_i(\Sigma_0))}},\ \frac{v_i(\Sigma_0)}{\sqrt{\mathrm{Var}(v_i(\Sigma_0))}} \Big)^T$$
satisfies a CLT and $t_i^2$ satisfies a LLN, with a controlled maximum.
Remark 2. When $\mu$ is unknown, just replace $\mu$ in $Y_i$ by the sample mean; the theorem still holds with one extra moment condition.
Remark 3. In the factor model considered by Chen, Zhang and Zhong (2010), $e_1(\Sigma_0)$ and $v_1(\Sigma_0)$ satisfy (P). Under this model, our test allows $p$ to diverge arbitrarily fast or stay finite.
Testing bandedness. The problem is testing
$$H_0\colon \sigma_{ij} = 0 \ \text{for all } |i - j| \ge \tau. \qquad (3)$$
Here we consider $\mu$ known.
We are interested in the information carried by the black squares of $\Sigma$ (the off-band entries in the figure) and we ignore the stars (the within-band entries).
Basic observation. $H_0$ is equivalent to all the black-square entries of $\Sigma$ being 0.
Define the $\tau$-off-diagonal upper triangular matrix $M^{(\tau)}$ of a matrix $M$:
$$(M^{(\tau)})_{ij} = \begin{cases} M_{ij}, & j \ge i + \tau; \\ 0, & j < i + \tau. \end{cases}$$
$H_0$ is equivalent to $\mathrm{tr}\big((\Sigma^{(\tau)})^T \Sigma^{(\tau)}\big) = 0$.
For $i = 1, \ldots, N$, let
$$e_i = \mathrm{tr}\big((Y_i^{(\tau)})^T Y_{N+i}^{(\tau)}\big), \qquad v_i = \mathbf{1}_p^T \big(Y_i^{(\tau)} + Y_{N+i}^{(\tau)}\big) \mathbf{1}_p.$$
We define the empirical likelihood function as
$$L_2 = \sup\Big\{ \prod_{i=1}^N (N p_i) : \sum_{i=1}^N p_i = 1,\ \sum_{i=1}^N p_i \begin{pmatrix} e_i \\ v_i \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ p_i \ge 0 \Big\}.$$
Here we omit the $\Sigma_0$ in $e_i$ and $v_i$.
Theorem 2. Suppose that $e_1$ and $v_1$ satisfy (P). Then under $H_0$ in (3), $-2 \log L_2$ converges in distribution to $\chi^2_2$ as $n \to \infty$.
The method can be used to test some other structures as well. One interesting application is testing assumptions about, or estimates of, sparsity.
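For the bandedness test, the $\tau$-off-diagonal part $M^{(\tau)}$ is exactly numpy's np.triu(M, k=tau) (entries with $j \ge i + \tau$). A minimal sketch of $e_i$ and $v_i$ under the same conventions as before (illustrative names):

```python
import numpy as np

def bandedness_stats(X, mu, tau):
    """Split-sample statistics for H0: sigma_ij = 0 for all |i - j| >= tau."""
    n, p = X.shape
    N = n // 2
    Z = X - mu
    ones = np.ones(p)
    e = np.empty(N)
    v = np.empty(N)
    for i in range(N):
        Y1 = np.triu(np.outer(Z[i], Z[i]), k=tau)          # Y_i^{(tau)}
        Y2 = np.triu(np.outer(Z[N + i], Z[N + i]), k=tau)  # Y_{N+i}^{(tau)}
        e[i] = np.sum(Y1 * Y2)                 # tr((Y_i^{(tau)})' Y_{N+i}^{(tau)})
        v[i] = ones @ (Y1 + Y2) @ ones         # 1'(Y_i^{(tau)} + Y_{N+i}^{(tau)})1
    return e, v
```

The statistic $-2 \log L_2$ is then el_log_ratio(np.column_stack([e, v])) from the earlier sketch, calibrated against $\chi^2_2$ as in Theorem 2.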
Remark 4. (1) In the Gaussian model used by Cai and Jiang (2011), $e_1$ and $v_1$ satisfy (P) provided that
$$\tau = o\Big( \textstyle\sum_{1 \le i,j \le p} \sigma_{ij} \Big/ \big( \sum_{1 \le i,j \le p} \sigma_{ij} \big)^{1/2} \Big).$$
(2) With moment or boundedness conditions, the Gaussian assumption can be removed.
Power analysis. Denote $\pi_{11} = E(e_1(\Sigma)^2)$, $\pi_{22} = E(v_1(\Sigma)^2)$, $\zeta_1 = \mathrm{tr}((\Sigma - \Sigma_0)^2)/\sqrt{\pi_{11}}$ and $\zeta_2 = 2\,\mathbf{1}_p^T (\Sigma - \Sigma_0) \mathbf{1}_p / \sqrt{\pi_{22}}$. For most models we discuss,
$$\zeta_1 = O\Big( \frac{\mathrm{tr}((\Sigma - \Sigma_0)^2)}{\mathrm{tr}(\Sigma^2)} \Big) \quad \text{and} \quad \zeta_2 = O\Big( \frac{\mathbf{1}_p^T (\Sigma - \Sigma_0) \mathbf{1}_p}{\mathbf{1}_p^T \Sigma^2 \mathbf{1}_p} \Big).$$
Theorem 3. In addition to the conditions of Theorem 1, if $H_1\colon \Sigma \neq \Sigma_0$ holds, then
$$P\{-2 \log L_1(\Sigma_0) > \xi_{1-\alpha}\} = P\{\chi^2_{2,\nu} > \xi_{1-\alpha}\} + o(1)$$
for any level $\alpha$ as $n \to \infty$, where $\xi_{1-\alpha}$ is the $(1-\alpha)$-quantile of $\chi^2_2$ and $\chi^2_{2,\nu}$ is the noncentral chi-square distribution with two degrees of freedom and noncentrality parameter $\nu = N(\zeta_1^2 + \zeta_2^2)$.
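The local power approximation of Theorem 3 is straightforward to evaluate numerically; a small sketch using scipy's noncentral chi-square (zeta1 and zeta2 are placeholders for the drift terms above):

```python
from scipy.stats import chi2, ncx2

def approx_power(zeta1, zeta2, N, alpha=0.05):
    """P{chi^2_{2,nu} > xi_{1-alpha}} with nu = N (zeta1^2 + zeta2^2)."""
    xi = chi2.ppf(1.0 - alpha, df=2)       # chi^2_2 critical value
    nu = N * (zeta1**2 + zeta2**2)         # noncentrality from Theorem 3
    return ncx2.sf(xi, df=2, nc=nu)
```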
Remark 5. From the above power analysis, the new test rejects the null hypothesis with probability tending to one when $\max(\sqrt{n}\,\zeta_1, \sqrt{n}\,|\zeta_2|) \to \infty$. Note that the test given in Chen, Zhang and Zhong (2010) requires $n \zeta_1 \to \infty$. Our test may have better or worse power in different settings. The same results hold for the test in Theorem 2.
It is clear that our tests are powerful when $\Sigma - \Sigma_0$ is dense, and not powerful when $\Sigma - \Sigma_0$ is sparse.
With only the first constraint $E[e_1(\Sigma_0)] = 0$, the test (which requires $\sqrt{n}\,\zeta_1 \to \infty$) has worse power than the test in Chen, Zhang and Zhong (2010).
The test in Cai and Jiang (2011) is good when $\Sigma - \Sigma_0$ is sparse but powerless when $\Sigma - \Sigma_0$ is dense, since its power depends on $\|\Sigma - \Sigma_0\|_{\max}$.
Simulations. We assume a dense model and a local alternative. We compare with Chen, Zhang and Zhong (2010) for testing covariance matrices and with Cai and Jiang (2011) for testing bandedness. The ELT has a biased size for small $n$, so we also give a bootstrap-calibrated version (BCEL).
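The slides do not spell out the calibration scheme; one plausible version (an assumption on our part, not necessarily the authors' procedure) recomputes the EL statistic on resamples of null-centred constraint values and uses an empirical quantile as critical value:

```python
import numpy as np

def bootstrap_critical_value(g, alpha=0.05, B=1000, seed=None):
    """Hypothetical bootstrap calibration for the -2 log EL statistic.

    g : (N, 2) constraint values (e_i, v_i) evaluated at Sigma_0.
    """
    rng = np.random.default_rng(seed)
    g0 = g - g.mean(axis=0)                  # centre: impose E[g] = 0 (the null)
    N = g.shape[0]
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)     # resample pairs with replacement
        stats[b] = el_log_ratio(g0[idx])     # EL sketch from an earlier slide
    return np.quantile(stats, 1.0 - alpha)   # calibrated critical value
```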
Table: Testing covariance matrices (rejection rates at nominal level 0.05)

                      δ = 0                        δ = 1
(n, p)       EL      BCEL    CZZ         EL      BCEL    CZZ
(50, 25)     0.127   0.054   0.053       0.296   0.118   0.219
(50, 50)     0.148   0.065   0.067       0.324   0.136   0.216
(50, 100)    0.138   0.068   0.038       0.317   0.125   0.212
(50, 200)    0.168   0.081   0.041       0.310   0.113   0.221
(50, 400)    0.151   0.071   0.045       0.342   0.145   0.242
(50, 800)    0.154   0.064   0.041       0.337   0.137   0.219
(200, 25)    0.065   0.048   0.052       0.348   0.305   0.179
(200, 50)    0.058   0.052   0.041       0.336   0.298   0.162
(200, 100)   0.068   0.054   0.059       0.353   0.319   0.179
(200, 200)   0.056   0.051   0.058       0.358   0.322   0.155
(200, 400)   0.069   0.064   0.051       0.374   0.343   0.180
(200, 800)   0.058   0.047   0.050       0.366   0.338   0.182
Table: Testing bandedness (rejection rates at nominal level 0.05)

                      δ = 0                        δ = 1
(n, p)       EL      BCEL    CJ          EL      BCEL    CJ
(50, 25)     0.118   0.036   0.015       0.272   0.093   0.017
(50, 50)     0.124   0.049   0.010       0.266   0.097   0.018
(50, 100)    0.126   0.057   0.005       0.268   0.099   0.004
(50, 200)    0.128   0.058   0.003       0.268   0.100   0.001
(50, 400)    0.113   0.053   0.002       0.282   0.121   0.001
(50, 800)    0.128   0.062   0.001       0.281   0.109   0.000
(200, 25)    0.078   0.062   0.019       0.288   0.253   0.034
(200, 50)    0.074   0.059   0.033       0.323   0.286   0.020
(200, 100)   0.057   0.053   0.019       0.332   0.304   0.044
(200, 200)   0.066   0.046   0.024       0.293   0.263   0.032
(200, 400)   0.061   0.052   0.020       0.336   0.304   0.016
(200, 800)   0.053   0.046   0.026       0.317   0.297   0.025
Conclusion. The new technique:
- works for non-parametric models;
- allows arbitrary $p$;
- requires only moment conditions;
- avoids estimating the asymptotic variance;
- always has limiting distribution $\chi^2_2$;
- can be applied to testing a sample mean, two-sample means, and two-sample covariance matrices under the high-dimensional framework.
Shortfalls:
- the number of observations is effectively reduced by half;
- the power is good in the dense setting but not in the sparse setting;
- the optimal choice of the second constraint is unknown.
References I
Cai, T. and Jiang, T. (2011). Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices. Ann. Statist. 39, 1496–1525.
Chen, S., Zhang, L. and Zhong, P. (2010). Tests for high-dimensional covariance matrices. J. Amer. Statist. Assoc. 105, 810–819.
Ledoit, O. and Wolf, M. (2002). Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann. Statist. 30, 1081–1102.
References II
Nagao, H. (1973). On some test criteria for covariance matrix. Ann. Statist. 1, 700–709.
Qin, J. and Lawless, J. F. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22, 300–325.
Qiu, Y. and Chen, S. (2012). Test for bandedness of high-dimensional covariance matrices and bandwidth estimation. Ann. Statist. 40, 1285–1314.
Thank you!