
High-Dimensional Properties of Information Criteria and Their Efficient Criteria for Multivariate Linear Regression Models with Covariance Structures

Tetsuro Sakurai and Yasunori Fujikoshi

Center of General Education, Tokyo University of Science, Suwa, Toyohira, Chino, Nagano, Japan
Department of Mathematics, Graduate School of Science, Hiroshima University, 1-3-1 Kagamiyama, Higashi Hiroshima, Hiroshima, Japan

Abstract

In this paper, we first consider high-dimensional consistency properties of information criteria IC_{g,d} and their efficient criteria EC_{g,d} for the selection of variables in a multivariate regression model with covariance structures. The covariance structures considered are (1) the independent covariance structure, (2) the uniform covariance structure and (3) the autoregressive covariance structure. Sufficient conditions for IC_{g,d} and EC_{g,d} to be consistent are derived under a high-dimensional asymptotic framework in which the sample size n and the number p of response variables are both large, with p/n → c ∈ (0, ∞). Our results are checked numerically by conducting a Monte Carlo simulation. Next we discuss high-dimensional properties of AIC and BIC for selecting (4) the independent covariance structure with different variances and (5) no covariance structure, in addition to the covariance structures (1)-(3). Some tendencies are pointed out through a numerical experiment.

AMS 2000 subject classification: primary 62H12; secondary 62H30

Key Words and Phrases: Consistency property, Information criteria, Covariance structures, Efficient criteria, High-dimensional asymptotic framework, Multivariate linear regression.

1. Introduction

We consider a multivariate linear regression of p response variables y_1, ..., y_p on a subset of k explanatory variables x_1, ..., x_k. Suppose that there are n observations on y = (y_1, ..., y_p)' and x = (x_1, ..., x_k)', and let Y : n × p and X : n × k be the observation matrices of y and x with the sample size n, respectively. The multivariate linear regression model including all the explanatory variables is written as

Y ~ N_{n×p}(XΘ, Σ ⊗ I_n),   (1.1)

where Θ is a k × p unknown matrix of regression coefficients and Σ is a p × p unknown covariance matrix. The notation N_{n×p}(·, ·) means the matrix normal distribution such that the mean of Y is XΘ and the covariance matrix of vec(Y) is Σ ⊗ I_n; equivalently, the rows of Y are independently normal with the same covariance matrix Σ. Here, vec(Y) is the np-dimensional column vector obtained by stacking the columns of Y on top of one another. We assume that rank(X) = k. We consider the problem of selecting the best model from a collection of candidate models specified by a linear regression of y on subvectors of x. Our interest is to examine consistency properties of information criteria and their efficient criteria when p/n → c ∈ (0, ∞). When Σ is an unknown positive definite matrix, it has been pointed out (see, e.g., Yanagihara et al. (2015), Fujikoshi et al. (2014), etc.) that AIC and Cp have consistency properties when p/n → c ∈ (0, 1), under some conditions, but BIC is not necessarily consistent. Related to high-dimensional data, it is important to consider selection of regression variables in the case that p is larger than n and k is also large. When p is large, it is natural to consider a covariance structure, since a covariance matrix with no covariance structure involves many unknown parameters. One way is to consider a sparse method or a joint regularization

of the regression parameters and the inverse covariance matrix; see, e.g., Rothman et al. (2010). As another approach, one may select the regression variables under an assumed simple covariance structure. Related to the latter approach, it is also natural to select an appropriate simple covariance structure from a set of candidate simple covariance structures. In this paper, we first consider the variable selection problem under some simple covariance structures, namely (1) the independent covariance structure, (2) the uniform covariance structure and (3) the autoregressive covariance structure, based on a model selection approach. We study consistency properties of information criteria including AIC and BIC. These information criteria have a computational problem when k becomes large. In order to avoid this problem, we consider their efficient criteria based on Zhao et al. (1986) and Nishii et al. (1988). It is shown that the efficient criteria are also consistent in a high-dimensional situation. Our results are checked numerically by conducting a Monte Carlo simulation. Next we discuss AIC and BIC for selecting (4) the independent covariance structure with different variances and (5) no covariance structure, in addition to the covariance structures (1)-(3). In a high-dimensional situation, it is noted through a simulation experiment that the covariance structures except for (1) are identified by AIC and BIC.

The present paper is organized as follows. In Section 2, we present notations and preliminaries. In Sections 3, 4 and 5 we show high-dimensional consistencies of information criteria under the covariance structures (1), (2) and (3), respectively. These are also studied numerically. In Section 6, their efficient criteria are shown to be consistent under the same conditions as in the case of information criteria. In Section 7, we discuss selection of covariance structures by AIC and BIC. In Section 8, our conclusions are discussed. The proofs of our results are given in the Appendix.

2. Notations and Preliminaries

This paper is concerned with selection of explanatory variables in the multivariate regression model (1.1). Suppose that j denotes a subset of ω = {1, ..., k} containing k_j elements, and X_j denotes the n × k_j matrix consisting of the columns of X indexed by the elements of j. Then X_ω = X. Further, we assume that the covariance matrix Σ has a covariance structure Σ_g. Then a generic candidate model can be expressed as

M_{g,j} : Y ~ N_{n×p}(X_j Θ_j, Σ_{g,j} ⊗ I_n),   (2.1)

where Θ_j is a k_j × p unknown matrix of regression coefficients. We assume that rank(X) = k (< n). As a model selection method, we use a generalized criterion of AIC (Akaike (1973)). When Σ_{g,j} is a p × p unknown covariance matrix, the AIC (see, e.g., Bedrick and Tsai (1994), Fujikoshi and Satoh (1997)) for M_{g,j} is given by

AIC_{g,j} = n log |Σ̂_{g,j}| + np(log 2π + 1) + 2{k_j p + p(p + 1)/2},   (2.2)

where nΣ̂_{g,j} = Y'(I_n − P_j)Y and P_j = X_j(X_j'X_j)^{-1}X_j'. The part n log |Σ̂_{g,j}| + np(log 2π + 1) is −2 log max_{M_{g,j}} f(Y; Θ_j, Σ_{g,j}), where f(Y; Θ_j, Σ_{g,j}) is the density function of Y under M_{g,j}. The AIC was introduced as an asymptotically unbiased estimator of the risk function defined as the expected log-predictive-likelihood or, equivalently, the Kullback-Leibler information, for a candidate model M_{g,j}; see, e.g., Fujikoshi and Satoh (1997). When j = ω, the model M_{g,ω} is called the full model. Note that Σ̂_{g,ω} and P_ω are defined from Σ̂_{g,j} and P_j with j = ω, k_ω = k and X_ω = X. In this paper, we first consider the case that the covariance matrix Σ belongs to one of the following three classes:

(1) Independent covariance structure (IND): Σ_v = σ_v² I_p,
(2) Uniform covariance structure (UNIF): Σ_u = σ_u²(ρ_u^{1−δ_ij}), 1 ≤ i, j ≤ p,
(3) Autoregressive covariance structure (AUTO): Σ_a = σ_a²(ρ_a^{|i−j|}), 1 ≤ i, j ≤ p.
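To make (2.2) concrete, the criterion can be computed directly from the residual matrix Y'(I_n − P_j)Y. The following minimal sketch (Python with NumPy; the function name and the toy setup are our own illustration, not from the paper) evaluates the criterion for a candidate subset j of columns of X, with d = 2 giving AIC and d = log n giving BIC.

```python
import numpy as np

def aic_unstructured(Y, X, cols, d=2.0):
    """Information criterion (2.2)/(2.4) for the candidate model using the
    columns of X listed in `cols`, with an unstructured covariance matrix.
    d = 2 gives AIC, d = log(n) gives BIC.  Requires n - k_j > p so that the
    residual covariance estimate is positive definite."""
    n, p = Y.shape
    Xj = X[:, cols]
    Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)    # projection onto span(Xj)
    Sigma_hat = Y.T @ (np.eye(n) - Pj) @ Y / n    # MLE of Sigma under M_{g,j}
    kj = len(cols)
    m = kj * p + p * (p + 1) / 2                  # free parameters
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    return n * logdet + n * p * (np.log(2 * np.pi) + 1) + d * m
```

With d = 0 the function returns the −2·max-log-likelihood part alone, which is non-increasing as the candidate subset grows; the penalty d·m is what makes overspecified models unattractive.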

Our candidate model can be expressed as (2.1) with Σ_{v,j}, Σ_{u,j} or Σ_{a,j} for Σ_{g,j}. For deriving the maximum likelihood under M_{g,j}, we shall use the fact that for any positive definite Σ_{g,j},

−2 log max_{Θ_j} f(Y; Θ_j, Σ_{g,j}) = n log |Σ_{g,j}| + np log 2π + min_{Θ_j} tr Σ_{g,j}^{-1}(Y − X_j Θ_j)'(Y − X_j Θ_j)   (2.3)
= n log |Σ_{g,j}| + np log 2π + tr Σ_{g,j}^{-1} Y'(I_n − P_j)Y.

Let Σ̂_{g,j} be the quantity minimizing the right-hand side of (2.3). Then, in our problem, it satisfies tr Σ̂_{g,j}^{-1} Y'(I_n − P_j)Y = np. We consider a general information criterion defined by

IC_{g,d,j} = −2 log f(Y; Θ̂_j, Σ̂_{g,j}) + d m_{g,j} = n log |Σ̂_{g,j}| + np(log 2π + 1) + d m_{g,j},   (2.4)

where m_{g,j} is the number of independent unknown parameters under M_{g,j}, and d is a positive constant which may depend on n. When d = 2 and d = log n,

IC_{g,2,j} = AIC_{g,j},   IC_{g,log n,j} = BIC_{g,j}.

Such a general information criterion was considered in a univariate regression model by Nishii (1984). For each of the three covariance structures, we consider selecting the best model from all the candidate models or a subset of them. Let F be the set of all the candidate models, denoted by {{1}, ..., {k}, {1, 2}, ..., {1, ..., k}}, or its subfamily. Then, our model selection criterion is to select the model M_j, or the subset j, minimizing IC_{g,d,j}, which is written as

ĵ_{IC_{g,d}} = arg min_{j ∈ F} IC_{g,d,j}.   (2.5)

For studying consistency properties of ĵ_{IC_{g,d}}, it is assumed that the true model M_{g,∗} is included in the full model, i.e.,

M_{g,∗} : Y ~ N_{n×p}(XΘ_∗, Σ_{g,∗} ⊗ I_n).   (2.6)

Let us denote the minimum model including the true model by M_{g,j∗}. The true mean of Y is expressed as XΘ_∗ = X_{j∗}Θ_{j∗} for some k_{j∗} × p matrix Θ_{j∗}, so the notation X_{j∗}Θ_{j∗} is also used for the true mean of Y. Let F be separated into two sets: the set of overspecified models, F_+ = {j ∈ F | j∗ ⊆ j}, and the set of underspecified models, F_− = F_+^c ∩ F. Here we list some of our main assumptions:

A1 (The true model): M_{g,∗} ∈ F.
A2 (The asymptotic framework): p → ∞, n → ∞, p/n → c ∈ (0, ∞).

It is said that a general model selection criterion ĵ_{IC_{g,d}} is consistent if

lim_{p/n → c ∈ (0, ∞)} Pr(ĵ_{IC_{g,d}} = j∗) = 1.

In order to obtain ĵ_{IC_{g,d}}, we must calculate IC_{g,d} for all the subsets of j_ω = {1, 2, ..., k}, i.e., 2^k values of IC_{g,d}. This becomes an extensive computation as k becomes large. As a method of overcoming this weak point, we consider the EC criterion based on Zhao et al. (1986) and Nishii et al. (1988). Let j_(i) be the subset of j_ω omitting the element i (1 ≤ i ≤ k). Then EC_{g,d} is defined to select

ĵ_{EC_{g,d}} = {i ∈ j_ω | IC_{g,d,j_(i)} > IC_{g,d,j_ω}, i = 1, ..., k}.   (2.7)

In Section 6 we shall show that ĵ_{EC_{g,d}} has a consistency property.
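The computational appeal of (2.7) is that it needs only k + 1 criterion evaluations (the full model plus the k leave-one-out models) instead of 2^k. A minimal sketch of the selection rule is given below (Python; `ic` is assumed to be any callable that returns IC_{g,d,j} for a tuple of variable indices, such as the criteria of Sections 3-5 — this interface is our own, not from the paper).

```python
def efficient_selection(ic, k):
    """Efficient criterion (2.7): keep variable i iff dropping it from the
    full set j_omega increases the criterion.  Only k + 1 evaluations of
    `ic` are needed, versus 2^k for exhaustive minimization as in (2.5)."""
    full = tuple(range(k))
    ic_full = ic(full)
    selected = []
    for i in range(k):
        j_minus_i = tuple(c for c in full if c != i)
        if ic(j_minus_i) > ic_full:   # removing i hurts => i is needed
            selected.append(i)
    return selected
```

The consistency results of Section 6 say that, under the same conditions as for IC_{g,d}, this cheap rule recovers j∗ with probability tending to one.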

3. IC under the Independent Covariance Structure

In this section we consider the problem of selecting the regression variables in the multivariate regression model under the assumption that the covariance matrix has an independent covariance structure. A generic candidate model can be expressed as

M_{v,j} : Y ~ N_{n×p}(X_j Θ_j, Σ_{v,j} ⊗ I_n),   (3.1)

where Σ_{v,j} = σ_{v,j}² I_p and σ_{v,j}² > 0. Then, we have

−2 log f(Y; Θ_j, σ_{v,j}²) = np log(2π) + np log σ_{v,j}² + (1/σ_{v,j}²) tr(Y − X_j Θ_j)'(Y − X_j Θ_j).

Therefore, it is easily seen that the maximum likelihood estimators of Θ_j and σ_{v,j}² under M_{v,j} are given by

Θ̂_j = (X_j'X_j)^{-1}X_j'Y,   σ̂_{v,j}² = (1/(np)) tr Y'(I_n − P_j)Y.   (3.2)

The information criterion (2.4) is given by

IC_{v,d,j} = np log σ̂_{v,j}² + np(log 2π + 1) + d m_{v,j},   (3.3)

where d is a positive constant and m_{v,j} = k_j p + 1. Assume that the true model is expressed as

M_{v,∗} : Y ~ N_{n×p}(XΘ_∗, σ_{v,∗}² I_p ⊗ I_n),   (3.4)

and denote the minimum model including the true model M_{v,∗} by M_{v,j∗}. In general, Y'(I_n − P_j)Y is distributed as a noncentral Wishart distribution, more precisely

Y'(I_n − P_j)Y ~ W_p(n − k_j, Σ_{v,∗}; (X_{j∗}Θ_{j∗})'(I_n − P_j)X_{j∗}Θ_{j∗}),

which implies the following lemma.
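The estimators (3.2) and the criterion (3.3) are simple to compute. The sketch below (Python with NumPy; the function name is ours) implements IC_{v,d,j}; again d = 2 gives AIC_v and d = log n gives BIC_v.

```python
import numpy as np

def ic_v(Y, X, cols, d=2.0):
    """IC_{v,d,j} of (3.3) under the independent covariance structure
    Sigma = sigma^2 I_p, with sigma^2_hat = tr Y'(I-P_j)Y/(np) as in (3.2)
    and m_{v,j} = k_j p + 1 free parameters."""
    n, p = Y.shape
    Xj = X[:, cols]
    Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)
    sigma2 = np.trace(Y.T @ (np.eye(n) - Pj) @ Y) / (n * p)
    return n * p * np.log(sigma2) + n * p * (np.log(2 * np.pi) + 1) + d * (len(cols) * p + 1)
```

As in the unstructured case, σ̂²_{v,j} is non-increasing in the candidate subset, so with d = 0 the criterion always favors the largest model; d > 1 is exactly what Theorem 3.1 requires to rule out overfitting.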

Lemma 3.1. Under (3.4), np σ̂_{v,j}²/σ_{v,∗}² is distributed as a noncentral chi-square distribution χ²_{(n−k_j)p}(δ_{v,j}²), where

δ_{v,j}² = (1/σ_{v,∗}²) tr(X_{j∗}Θ_{j∗})'(I_n − P_j)X_{j∗}Θ_{j∗} = (1/σ_{v,∗}²) tr(X_{j∗}Θ_{j∗})'(P_ω − P_j)X_{j∗}Θ_{j∗}.   (3.5)

When j ∈ F_+, δ_{v,j}² = 0.

For a sufficient condition for consistency of IC_{v,d}, we assume

A3v: For any j ∈ F_−, δ_{v,j}² = O(np), and lim_{p/n→c} (1/(np)) δ_{v,j}² = η_{v,j}² > 0.   (3.6)

Theorem 3.1. Suppose that the assumptions A1, A2 and A3v are satisfied. Then, the information criterion IC_{v,d} defined by (3.3) is consistent if d > 1 and d/n → 0.

AIC and BIC satisfy the conditions d > 1 and d/n → 0, and we have the following result.

Corollary 3.1. Under the assumptions A1, A2 and A3v, AIC and BIC are consistent.

In the following we numerically examine the validity of our claims. The true model was assumed to be M_{v,∗} : Y ~ N_{n×p}(XΘ_∗, σ_{v,∗}² I_p ⊗ I_n). Here Θ_∗ : 3 × p was determined by random numbers from the uniform distribution on (1, 2), i.e., i.i.d. from U(1, 2). The first column of X_ω : n × 10 is 1_n, and the other elements are i.i.d. from U(−1, 1). The true variance was set as σ_{v,∗}² = 2. The five candidate models M_{j_α}, α = 1, 2, ..., 5 were considered, where j_α = {1, ..., α}. We studied selection percentages of the true model over 10⁴ replications under AIC and BIC for

(n, p) = (50, 15), (100, 30), (200, 60), (50, 100), (100, 200), (200, 400).

The results are given in Tables 3.1 and 3.2. It is seen that the true model is selected in all the cases except the case (n, p) = (50, 15) for AIC_v; in that case the selection percentage is not 100, but it is very high.

Table 3.1. Selection percentages of AIC_v and BIC_v for p/n = 0.3

j | AIC_v: (50,15) (100,30) (200,60) | BIC_v: (50,15) (100,30) (200,60)

Table 3.2. Selection percentages of AIC_v and BIC_v for p/n = 2

j | AIC_v: (50,100) (100,200) (200,400) | BIC_v: (50,100) (100,200) (200,400)

4. IC under the Uniform Covariance Structure

In this section we consider the model selection criterion when the covariance matrix has a uniform covariance structure

Σ_u = σ_u²(ρ_u^{1−δ_ij}) = σ_u²{(1 − ρ_u)I_p + ρ_u 1_p 1_p'}.   (4.1)

The covariance structure is expressed as

Σ_u = τ_1 (I_p − (1/p)G_p) + τ_2 (1/p)G_p,

where τ_1 = σ_u²(1 − ρ_u), τ_2 = σ_u²{1 + (p − 1)ρ_u}, G_p = 1_p 1_p', and 1_p = (1, ..., 1)'. Noting that the matrices I_p − (1/p)G_p and (1/p)G_p are orthogonal idempotent matrices, we have

|Σ_u| = τ_2 τ_1^{p−1},   Σ_u^{-1} = (1/τ_1)(I_p − (1/p)G_p) + (1/τ_2)(1/p)G_p.

Now we consider the model M_{u,j} given by

M_{u,j} : Y ~ N_{n×p}(X_j Θ_j, Σ_{u,j} ⊗ I_n),   (4.2)

where Σ_{u,j} = τ_{1j}(I_p − (1/p)G_p) + τ_{2j}(1/p)G_p. Let H = (h_1, H_2) be an orthogonal matrix where h_1 = p^{−1/2} 1_p, and let U_j = H'W_j H, W_j = Y'(I_n − P_j)Y. Let the density function of Y under M_{u,j} be denoted by f(Y; Θ_j, τ_{1j}, τ_{2j}). From (2.3) we have

g(τ_{1j}, τ_{2j}) = −2 log max_{Θ_j} f(Y; Θ_j, τ_{1j}, τ_{2j}) = np log(2π) + n log τ_{2j} + n(p − 1) log τ_{1j} + tr Ψ_j^{-1} U_j,

where Ψ_j = diag(τ_{2j}, τ_{1j}, ..., τ_{1j}). Then, it can be shown that the maximum likelihood estimators of τ_{1j} and τ_{2j} under M_{u,j} are given by

τ̂_{1j} = (1/(n(p − 1))) tr D_1 U_j = (1/(n(p − 1))) tr H_2'Y'(I_n − P_j)YH_2,
τ̂_{2j} = (1/n) tr D_2 U_j = (1/n) h_1'Y'(I_n − P_j)Yh_1,

where D_1 = diag(0, 1, ..., 1) and D_2 = diag(1, 0, ..., 0). The number of independent parameters under M_{u,j} is m_{u,j} = k_j p + 2. Noting that Ψ_j is diagonal, we can get the information criterion (2.4) given as

IC_{u,d,j} = n(p − 1) log τ̂_{1j} + n log τ̂_{2j} + np(log 2π + 1) + d(k_j p + 2).   (4.3)
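Since H_2 H_2' = I_p − h_1 h_1', the estimators can be computed without constructing H_2 explicitly. The sketch below (Python with NumPy; the function name is ours) computes τ̂_{1j} and τ̂_{2j}; a useful sanity check is the identity tr Σ̂_{u,j}^{-1} W_j = np, stated below (2.3).

```python
import numpy as np

def uniform_mles(Y, X, cols):
    """MLEs tau1_hat, tau2_hat of Section 4.  Uses
    tr H2' W H2 = tr (I_p - G_p/p) W  and  h1' W h1 = 1_p' W 1_p / p,
    so the orthogonal complement H2 is never formed."""
    n, p = Y.shape
    Xj = X[:, cols]
    Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)
    W = Y.T @ (np.eye(n) - Pj) @ Y                 # W_j = Y'(I-P_j)Y
    one = np.ones((p, 1))
    h1 = one / np.sqrt(p)
    tau2 = (h1.T @ W @ h1).item() / n              # (1/n) h1' W h1
    G = one @ one.T
    tau1 = np.trace((np.eye(p) - G / p) @ W) / (n * (p - 1))
    return tau1, tau2
```

Plugging Σ̂_{u,j}^{-1} = (1/τ̂_{1j})(I_p − G_p/p) + (1/τ̂_{2j})(G_p/p) into tr Σ̂_{u,j}^{-1} W_j gives n(p − 1) + n = np exactly, which the test below verifies numerically.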

Assume that the true model is expressed as

M_{u,∗} : Y ~ N_{n×p}(XΘ_∗, Σ_{u,∗} ⊗ I_n),   (4.4)

where Σ_{u,∗} = τ_{1∗}(I_p − (1/p)G_p) + τ_{2∗}(1/p)G_p, and denote the minimum model including the true model M_{u,∗} by M_{u,j∗}. In general, it holds that U_j ~ W_p(n − k_j, Ψ_∗; Δ_j), where

Δ_j = H'(X_{j∗}Θ_{j∗})'(I_n − P_j)X_{j∗}Θ_{j∗}H,   Ψ_∗ = diag(τ_{2∗}, τ_{1∗}, ..., τ_{1∗}).

Therefore, we have the following lemma (see, e.g., Fujikoshi et al. (2010)).

Lemma 4.1. Under the true model (4.4), it holds that:
(1) n(p − 1)τ̂_{1j}/τ_{1∗} is distributed as a noncentral chi-square distribution χ²_{(p−1)(n−k_j)}(δ_{1j}²), where δ_{1j}² = (1/τ_{1∗}) tr H_2'(X_{j∗}Θ_{j∗})'(I_n − P_j)(X_{j∗}Θ_{j∗})H_2.
(2) nτ̂_{2j}/τ_{2∗} is distributed as a noncentral chi-square distribution χ²_{n−k_j}(δ_{2j}²), where δ_{2j}² = (1/τ_{2∗}) h_1'(X_{j∗}Θ_{j∗})'(I_n − P_j)(X_{j∗}Θ_{j∗})h_1.
(3) If j ∈ F_+, then δ_{1j}² = 0 and δ_{2j}² = 0.

For a sufficient condition for consistency of IC_{u,d}, we assume

A3u: For any j ∈ F_−, δ_{1j}² = O(np), δ_{2j}² = O(n), and lim_{p/n→c} (1/(np)) δ_{1j}² = η_{1j}² > 0, lim_{p/n→c} (1/n) δ_{2j}² = η_{2j}² > 0.   (4.5)

Theorem 4.1. Suppose that the assumptions A1, A2 and A3u are satisfied. Then, the information criterion IC_{u,d} defined by (4.3) is consistent if d > 1 and d/n → 0.

Corollary 4.1. Under the assumptions A1, A2 and A3u, AIC and BIC are consistent.

We tried a numerical experiment under the same setting as for the independent covariance structure, except for the covariance structure itself. The true uniform covariance structure was set as the one with σ_{u,∗}² = 2, ρ_{u,∗} = 0.2. The results are given in Tables 4.1 and 4.2. In general, it seems that AIC selects the true model with high probability even in a finite setting. However, BIC does not always select the true model when (n, p) is small, though it has a consistency property.

Table 4.1. Selection percentages of AIC and BIC for p/n = 0.3

j | AIC: (50,15) (100,30) (200,60) | BIC: (50,15) (100,30) (200,60)

Table 4.2. Selection percentages of AIC and BIC for p/n = 2

j | AIC: (50,100) (100,200) (200,400) | BIC: (50,100) (100,200) (200,400)

5. IC under the Autoregressive Covariance Structure

In this section we consider the model selection criterion when the covariance matrix Σ has an autoregressive covariance structure

Σ_a = σ_a²(ρ_a^{|i−j|}), 1 ≤ i, j ≤ p.   (5.1)

Then, it is well known (see, e.g., Fujikoshi et al. (1990)) that

|Σ_a| = (σ_a²)^p (1 − ρ_a²)^{p−1},   Σ_a^{-1} = (1/(σ_a²(1 − ρ_a²)))(ρ_a² C_1 − 2ρ_a C_2 + C_0),

where C_0 = I_p, C_1 = diag(0, 1, ..., 1, 0), and C_2 is the p × p symmetric matrix whose (i, j) element is 1/2 if |i − j| = 1 and 0 otherwise. Now we consider the model M_{a,j} given by

M_{a,j} : Y ~ N_{n×p}(X_j Θ_j, Σ_{a,j} ⊗ I_n),   (5.2)

where Σ_{a,j} = σ_{a,j}²(ρ_{a,j}^{|i−j|}). Then, from (2.3) the maximum likelihood estimator of Θ_j is given by Θ̂_j = (X_j'X_j)^{-1}X_j'Y, and the maximum likelihood estimators of ρ_{a,j} and σ_{a,j}² can be obtained by minimizing

−2 log f(Y; Θ̂_j, σ_{a,j}², ρ_{a,j}) = np log(2π) + np log σ_{a,j}² + n(p − 1) log(1 − ρ_{a,j}²) + (1/(σ_{a,j}²(1 − ρ_{a,j}²))) tr(ρ_{a,j}² C_1 − 2ρ_{a,j} C_2 + C_0)Y'(I_n − P_j)Y

with respect to σ_{a,j}² and ρ_{a,j}. Therefore, the maximum likelihood estimators of σ_{a,j}² and ρ_{a,j} are given (see Fujikoshi et al. (1990)) through the following

two equations:

(1) σ̂_{a,j}² = ((n − k_j)/(np)) (1/(1 − ρ̂_{a,j}²)) (a_{1j} ρ̂_{a,j}² − 2a_{2j} ρ̂_{a,j} + a_{0j}),   (5.3)
(2) (p − 1)a_{1j} ρ̂_{a,j}³ − (p − 2)a_{2j} ρ̂_{a,j}² − (p a_{1j} + a_{0j}) ρ̂_{a,j} + p a_{2j} = 0,   (5.4)

where a_{ij} = tr C_i S_j, i = 0, 1, 2, and S_j = (n − k_j)^{-1} Y'(I_n − P_j)Y. Then, the information criterion IC_{a,d,j} can be written as

IC_{a,d,j} = np log σ̂_{a,j}² + n(p − 1) log(1 − ρ̂_{a,j}²) + np(log 2π + 1) + d(k_j p + 2).   (5.5)

Note that the maximum likelihood estimators ρ̂_{a,j} and σ̂_{a,j}² are expressed in terms of a_{0j}, a_{1j} and a_{2j}, or

b_{0j} = tr C_0 W_j,   b_{1j} = tr C_1 W_j,   b_{2j} = tr C_2 W_j,   (5.6)

where

W_j = (n − k_j)S_j = Y'(I_n − P_j)Y.   (5.7)

Assume that the true model is expressed as

M_{a,∗} : Y ~ N_{n×p}(XΘ_∗, Σ_{a,∗} ⊗ I_n),   (5.8)

where Σ_{a,∗} = σ_{a,∗}²(ρ_{a,∗}^{|i−j|}), and denote the minimum model including the true model M_{a,∗} by M_{a,j∗}. Then,

W_j = Y'(I_n − P_j)Y ~ W_p(n − k_j, Σ_{a,∗}; Ω_j),

where Ω_j = (X_{j∗}Θ_{j∗})'(I_n − P_j)X_{j∗}Θ_{j∗}. Relating to the noncentrality matrix Ω_j, we use the following three quantities:

δ_{ij} = tr C_i Ω_j,   i = 0, 1, 2.   (5.9)

As a sufficient condition for consistency, we assume

A3a: For any j ∈ F_−, the order of each element of Ω_j is O(n), δ_{ij} = O(np), and lim_{p/n→c} (1/(np)) δ_{ij} = η_{ij}² > 0, i = 0, 1, 2.   (5.10)
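The likelihood equations (5.3)-(5.4) are easy to solve numerically: (5.4) is a cubic in ρ̂_{a,j}, and (5.3) then gives σ̂_{a,j}². The sketch below (Python with NumPy; the function name and root-selection rule are ours — when several roots lie in (−1, 1), the one maximizing the likelihood should be taken) illustrates this.

```python
import numpy as np

def ar1_mles(S, n, kj):
    """Sketch of the ML equations (5.3)-(5.4) under the AR(1) structure.
    S is the p x p matrix S_j = Y'(I_n - P_j)Y/(n - k_j); C_0 = I_p,
    C_1 = diag(0,1,...,1,0), and C_2 has (i,j) element 1/2 when |i-j| = 1,
    as in Section 5."""
    p = S.shape[0]
    C1 = np.diag([0.0] + [1.0] * (p - 2) + [0.0])
    C2 = 0.5 * (np.diag(np.ones(p - 1), 1) + np.diag(np.ones(p - 1), -1))
    a0, a1, a2 = np.trace(S), np.trace(C1 @ S), np.trace(C2 @ S)
    # cubic (5.4): (p-1)a1 r^3 - (p-2)a2 r^2 - (p a1 + a0) r + p a2 = 0
    roots = np.roots([(p - 1) * a1, -(p - 2) * a2, -(p * a1 + a0), p * a2])
    real = roots[np.abs(roots.imag) < 1e-8].real
    rho = real[np.abs(real) < 1][0]        # admissible root in (-1, 1)
    # equation (5.3)
    sigma2 = (n - kj) / (n * p) * (a1 * rho**2 - 2 * a2 * rho + a0) / (1 - rho**2)
    return sigma2, rho
```

A quick check: if S is replaced by the population matrix Σ_a itself, then a_{0} = pσ², a_{1} = (p − 2)σ² and a_{2} = (p − 1)σ²ρ, and substituting ρ̂ = ρ into (5.4) gives exactly zero, so the cubic recovers ρ.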

Theorem 5.1. Suppose that the assumptions A1, A2 and A3a are satisfied. Then, the information criterion IC_{a,d} defined by (5.5) is consistent if d > 1 and d/n → 0.

Corollary 5.1. Under the assumptions A1, A2 and A3a, AIC and BIC are consistent.

We tried a numerical experiment under the same setting as in the independent and uniform covariance structure cases, except for the covariance structure. Here the true covariance structure was set as the autoregressive covariance structure Σ_{a,∗} = σ_{a,∗}²(ρ_{a,∗}^{|i−j|}) with σ_{a,∗}² = 2, ρ_{a,∗} = 0.2. The results are given in Tables 5.1 and 5.2. It is seen that AIC and BIC select the true model in all the cases.

Table 5.1. Selection percentages of AIC and BIC for p/n = 0.3

j | AIC: (50,15) (100,30) (200,60) | BIC: (50,15) (100,30) (200,60)

Table 5.2. Selection percentages of AIC and BIC for p/n = 2

j | AIC: (50,100) (100,200) (200,400) | BIC: (50,100) (100,200) (200,400)

6. Consistency Properties of EC

In this section we are interested in consistency properties of the three efficient criteria EC_{v,d}, EC_{u,d} and EC_{a,d} defined from IC_{v,d}, IC_{u,d} and IC_{a,d} through (2.7). Note that consistency properties of IC_{v,d}, IC_{u,d} and IC_{a,d} are given in Theorems 3.1, 4.1 and 5.1. Using these consistency properties, it is expected that EC_{v,d}, EC_{u,d} and EC_{a,d} have similar consistency properties. In fact, the following result holds.

Theorem 6.1. Suppose that the assumptions A1 and A2 are satisfied. Then, it holds that:
(1) the efficient criterion EC_{v,d} is consistent under the assumption A3v if d > 1 and d/n → 0;
(2) the efficient criterion EC_{u,d} is consistent under the assumption A3u if d > 1 and d/n → 0;
(3) the efficient criterion EC_{a,d} is consistent under the assumption A3a if d > 1 and d/n → 0.

In order to examine the validity of the results and the speed of convergence, we tried a numerical experiment. The simulation settings are similar to the cases of IC_{v,d}, IC_{u,d} and IC_{a,d} except for the following points. The total number of explanatory variables to be selected was changed to 15 from 10. The true covariance structures were set as follows for IND (independent covariance structure), UNIF (uniform covariance structure) and AUTO (autoregressive covariance structure):

IND: Σ_∗ = σ_{v,∗}² I_p, σ_{v,∗}² = 2.
UNIF: Σ_∗ = σ_{u,∗}²(ρ_{u,∗}^{1−δ_ij}), σ_{u,∗}² = 2, ρ_{u,∗} = 0.9.
AUTO: Σ_∗ = σ_{a,∗}²(ρ_{a,∗}^{|i−j|}), σ_{a,∗}² = 2, ρ_{a,∗} = 0.9.

Let EC_A and EC_B be the efficient criteria based on AIC and BIC, respectively. Selection rates of these criteria are given in Tables 6.1-6.4 for each of the three covariance structures. In the tables the column x_i denotes the selection rate for the i-th explanatory variable x_i. The 'Under',

'True' and 'Over' columns denote the underspecified models, the true model and the overspecified models, respectively.

Table 6.1. Selection percentages of EC_A and EC_B for (n, p) = (20, 10)

n = 20, p = 10 | Under | True | Over | x_1 | x_2 | x_3 | x_4 | x_5
EC_A: IND, UNIF, AUTO; EC_B: IND, UNIF, AUTO

Table 6.2. Selection percentages of EC_A and EC_B for (n, p) = (200, 100)

n = 200, p = 100 | Under | True | Over | x_1 | x_2 | x_3 | x_4 | x_5
EC_A: IND, UNIF, AUTO; EC_B: IND, UNIF, AUTO

Table 6.3. Selection percentages of EC_A and EC_B for (n, p) = (10, 20)

n = 10, p = 20 | Under | True | Over | j = 1 | j = 2 | j = 3 | j = 4 | j = 5
EC_A: IND, UNIF, AUTO; EC_B: IND, UNIF, AUTO

Table 6.4. Selection percentages of EC_A and EC_B for (n, p) = (100, 200)

n = 100, p = 200 | Under | True | Over | j = 1 | j = 2 | j = 3 | j = 4 | j = 5
EC_A: IND, UNIF, AUTO; EC_B: IND, UNIF, AUTO

7. Selection of Covariance Structures

In this section we consider high-dimensional properties of AIC and BIC for selection of covariance structures. The covariance structures considered are (1) IND (independent covariance structure), (2) UNIF (uniform covariance structure), (3) AUTO (autoregressive covariance structure), and (4) NOST (no covariance structure). For the first three covariance structures, we use the same notation as in Sections 3, 4 and 5. The covariance matrix under NOST is denoted by Σ_{m,j}. When E(Y) = X_jΘ_j, the AIC criteria for these four models are expressed as

A_{v,j} = np log σ̂_{vj}² + np(log 2π + 1) + 2(jp + 1),
A_{u,j} = n log τ̂_{2j} + n(p − 1) log τ̂_{1j} + np(log 2π + 1) + 2(jp + 2),
A_{a,j} = np log σ̂_{aj}² + n(p − 1) log(1 − ρ̂_{aj}²) + np(log 2π + 1) + 2(jp + 2),
A_{m,j} = n log |Σ̂_j| + np(log 2π + 1) + 2{jp + p(p + 1)/2}.

The expressions for σ̂_{vj}², τ̂_{1j}, τ̂_{2j}, σ̂_{aj}² and ρ̂_{aj}² are given in Sections 3, 4 and 5. The BIC criteria are defined from the AIC criteria by replacing the factor 2 multiplying the number of independent parameters with log n. In order to see the high-dimensional behaviors of AIC and BIC, simulation experiments were done. For the multivariate regression models, we selected k_{j∗} = 3, k_ω = 5.
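Since all these criteria share the constant np(log 2π + 1) and differ only in the fitted log-likelihood term and the parameter count, they can be compared directly on the same residual matrix W_j = Y'(I_n − P_j)Y. The sketch below (Python with NumPy; the function name is ours) computes the criteria for IND, for the diagonal structure with different variances (DIAG, introduced later in this section), and for NOST; the UNIF and AUTO criteria can be added using the estimators of Sections 4 and 5.

```python
import numpy as np

def structure_aics(Y, X, cols):
    """AIC values for three covariance structures of Section 7 computed
    from the same residual matrix: IND (sigma^2 I_p), DIAG
    (diag(sigma_1^2,...,sigma_p^2)) and NOST (unstructured Sigma)."""
    n, p = Y.shape
    j = len(cols)
    Xj = X[:, cols]
    Pj = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)
    W = Y.T @ (np.eye(n) - Pj) @ Y
    const = n * p * (np.log(2 * np.pi) + 1)
    aic = {}
    aic["IND"] = n * p * np.log(np.trace(W) / (n * p)) + const + 2 * (j * p + 1)
    aic["DIAG"] = n * np.sum(np.log(np.diag(W) / n)) + const + 2 * (j * p + p)
    sign, logdet = np.linalg.slogdet(W / n)
    aic["NOST"] = n * logdet + const + 2 * (j * p + p * (p + 1) / 2)
    return aic
```

Because the three families are nested (IND ⊂ DIAG ⊂ NOST), the −2·max-log-likelihood parts are ordered NOST ≤ DIAG ≤ IND (Hadamard's inequality and the AM-GM inequality); only the penalties can reverse the ranking of the penalized criteria.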

The multivariate regression model was set in the same way as in Sections 3, 4 and 5. For the first covariance structure, σ² = 2. For the second and third covariance structures, σ² = 2, ρ = 0.9. For the no-structured case, the true covariance matrix Σ_∗ = (σ_ij) was set as follows: σ_ii = σ²(1 + i/p), i = 1, ..., p, and the other elements of Σ_∗ = (σ_ij) are i.i.d. from U(0.3, 0.9). The simulation results are given in Table 7.1.

Table 7.1. Selection rates of AIC for the four covariance structures

TRUE | IND UNIF AUTO NOST (n = 20, p = 10) | IND UNIF AUTO NOST (n = 200, p = 100)
Rows: IND, UNIF, AUTO, NOST

It is seen that though it is difficult to select the IND covariance structure correctly, the other three models are correctly selected. The IND covariance structure can be seen as a limit of the UNIF and AUTO covariance matrices, so it seems difficult to select the IND covariance structure correctly. On the other hand, we consider another independent covariance structure with different variances, given by Σ = diag(σ_1², ..., σ_p²), whose structure is denoted by DIAG. The AIC is given by

AIC_{5j} = n Σ_{i=1}^p log σ̂_{ii}² + np(log 2π + 1) + 2(jp + p),

where σ̂_{ii}² = n^{-1} e_i'W_j e_i, W_j = Y'(I_n − P_j)Y, and e_i is the p-component vector with i-th component 1 and the other components zero. In our simulation

experiment, the variances were defined by σ_i² = σ²(1 + i/p), i = 1, ..., p, where σ² = 2. The simulation result is given in Table 7.2.

Table 7.2. Selection rates of AIC for the five covariance structures

n = 20, p = 10: TRUE | IND UNIF AUTO NOST DIAG (rows: IND, UNIF, AUTO, NOST, DIAG)
n = 200, p = 100: TRUE | IND UNIF AUTO NOST DIAG (rows: IND, UNIF, AUTO, NOST, DIAG)

It is seen that AIC is consistent for selection of the five covariance structures in a high-dimensional situation. Similar experiments were done for BIC. The result for selection of the five covariance structures is given in Table 7.3. It seems that BIC converges to the true model more quickly than AIC, except for NOST: BIC chooses UNIF when the truth is NOST. This will come from the fact that our setting for NOST is near UNIF.

Table 7.3. Selection rates of BIC for the five covariance structures

n = 20, p = 10: TRUE | IND UNIF AUTO NOST DIAG (rows: IND, UNIF, AUTO, NOST, DIAG)
n = 200, p = 100: TRUE | IND UNIF AUTO NOST DIAG (rows: IND, UNIF, AUTO, NOST, DIAG)

8. Concluding Remarks

In this paper, we first considered the selection of regression variables in a p-variate regression model with one of three covariance structures: (1) IND (independent covariance structure), (2) UNIF (uniform covariance structure), and (3) AUTO (autoregressive covariance structure). As a selection method, a general information criterion IC_{g,d} was considered for each of the three covariance structures, where d is a positive constant which may depend on the sample size n. When d = 2 and d = log n, IC_{g,d} becomes AIC and BIC, respectively. Under a high-dimensional asymptotic framework p/n → c ∈ (0, ∞), it was shown that IC_{g,d} with g = v, u or a is consistent under the assumption A3g if d/n → 0 and d > 1. Further, in order to avoid the computational problem of IC_{g,d}, we studied EC_{g,d}. It was pointed out that EC_{g,d} has a consistency property similar to that of IC_{g,d}. The results were obtained by assuming normality; it is left to extend them to the case of non-normality.

Next, under the multivariate regression model, we examined the selection of covariance structures. The covariance structures picked up are (4) NOST (no covariance structure) and (5) DIAG (independent covariance structure with different variances), in addition to the three covariance structures (1), (2) and (3). The two criteria AIC and BIC were examined through a simulation experiment in a high-dimensional setting. It was seen that (i) the four covariance structures except IND can be selected correctly by using AIC, and (ii) BIC converges to the true model more quickly than AIC, except for NOST. The proofs of theoretical results on these properties will be given in a future work.

Appendix: The Proofs of Theorems 3.1, 4.1 and 5.1

A1. Outline of Our Proofs and a Preliminary Lemma

First we explain an outline of our proofs. In general, let F be a finite set of candidate models j (or M_j). Assume that j∗ is the minimum model including the true model and j∗ ∈ F. Let T_j(p, n) be a general criterion for model j, which depends on p and n. The best model chosen by minimizing T_j(p, n) is written as

ĵ_T(p, n) = arg min_{j ∈ F} T_j(p, n).

Suppose that we are interested in the asymptotic behavior of ĵ_T(p, n) when p/n tends to c > 0. In order to show consistency of T_j(p, n), we may check a sufficient condition such that for any j ∈ F with j ≠ j∗, there exists a sequence {a_{p,n}} with a_{p,n} > 0 such that

a_{p,n}{T_j(p, n) − T_{j∗}(p, n)} →_p b_j > 0.

In fact, the condition implies that for any j ∈ F with j ≠ j∗,

P(ĵ_T(p, n) = j) ≤ P(T_j(p, n) < T_{j∗}(p, n)) → 0,

and

P(ĵ_T(p, n) = j∗) = 1 − Σ_{j ≠ j∗, j ∈ F} P(ĵ_T(p, n) = j) → 1.

For the proofs of Theorems 3.1, 4.1 and 5.1, we use the following lemma frequently.

Lemma A.1. Suppose that a p × p symmetric random matrix W is distributed as a noncentral Wishart distribution W_p(n − k, Σ; Ω). Let A be a given p × p symmetric matrix. We consider the asymptotic behavior of tr AW when p and n are large with p/n → c ∈ (0, ∞), where k is fixed. Suppose that

lim (1/p) tr AΣ = a² > 0,   lim (1/(np)) tr AΩ = η² > 0.   (A.1)

Then, it holds that

T_{p,n} = (1/(np)) tr AW →_p a² + η².   (A.2)

Proof. Let m = n − k. We may write W as

W = Σ^{1/2}(z_1 z_1' + ... + z_m z_m')Σ^{1/2},

where z_i ~ N_p(ζ_i, I_p), i = 1, ..., m, and the z_i are independent. Here, Ω = μ_1 μ_1' + ... + μ_m μ_m' and ζ_i = Σ^{−1/2} μ_i, i = 1, ..., m. Note that tr AW is expressed as a quadratic form of z = (z_1', ..., z_m')' as follows:

tr AW = z'Bz,   where B = I_m ⊗ Σ^{1/2} A Σ^{1/2}.

Note that z ~ N_{mp}(ζ, I_{mp}), where ζ = (ζ_1', ..., ζ_m')'. Then, it is known (see, e.g., Gupta) that for any symmetric matrix B,

E[z'Bz] = tr B + ζ'Bζ,   Var(z'Bz) = 2 tr B² + 4 ζ'B²ζ.

Especially, when B = I_m ⊗ Σ^{1/2} A Σ^{1/2}, we have

E[z'Bz] = m tr AΣ + tr AΩ,   Var(z'Bz) = 2m tr(AΣ)² + 4 tr AΣAΩ.

Under the assumption (A.1), we have

E(T_{p,n}) = (m/(np)) tr AΣ + (1/(np)) tr AΩ → a² + η².

Further,

Var(T_{p,n}) = (2m/(np)²) tr(AΣ)² + (4/(np)²) tr AΣAΩ ≤ (2m/(np)²)(tr AΣ)² + (4/(np)²) tr AΣ tr AΩ → 0.

These imply our conclusion.

In the special case A = I_p and Σ = σ² I_p, the assumptions in (A.1) become

lim (1/p) tr AΣ = lim (1/p) pσ² = σ²,   lim (1/(np)) tr AΩ = lim (1/(np)) tr Ω = η².

The conclusion is that

T_{p,n} = (1/(np)) tr W →_p σ²(1 + δ²),   (A.3)

where δ² = (1/σ²)η². Note that (1/σ²) tr W ~ χ²_{(n−k)p}(δ²), where δ² now denotes the noncentrality parameter. Therefore, under a high-dimensional asymptotic framework p/n → c ∈ (0, ∞),

(1/(np)) χ²_{(n−k)p}(δ²) →_p 1 + η²,   (A.4)

if (1/(np))δ² → η². More generally, we use the following result:

(1/(np)) χ²_{(n−k)(p−h)}(δ²) →_p 1 + η²,   (A.5)

if (1/(np))δ² → η², where k and h are constants, or more generally constants satisfying k/n → 0 and h/n → 0. Further, we use the following property of the noncentral chi-square distribution:

(1/n) χ²_{n−k}(δ²) →_p 1 + η²,   (A.6)

if (1/n)δ² → η².
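The concentration in Lemma A.1 and (A.3)-(A.4) is easy to check empirically. The sketch below (Python with NumPy; the dimensions, seed and variable names are our own) draws a single central Wishart matrix W ~ W_p(n − k, σ²I_p) and verifies that tr W/(np) is close to its limit σ² when p and n are both large with p/n fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of Lemma A.1 / (A.3) in the central case (Omega = 0):
# for W ~ W_p(n - k, sigma^2 I_p), T = tr(W)/(np) concentrates at sigma^2.
n, p, k, sigma2 = 200, 100, 3, 2.0
Z = rng.standard_normal((n - k, p)) * np.sqrt(sigma2)   # rows ~ N_p(0, sigma^2 I_p)
W = Z.T @ Z                                             # W ~ W_p(n - k, sigma^2 I_p)
T = np.trace(W) / (n * p)
# E[T] = ((n - k)/n) * sigma^2, and the fluctuation of T is O(1/sqrt(np)),
# matching the variance bound in the proof of Lemma A.1.
```

Here tr W/σ² is exactly a χ²_{(n−k)p} variable, so the standard deviation of T is of order σ²√(2/((n−k)p)), which is about 0.02 in this configuration.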

A2. The Proof of Theorem 3.1

Using (3.3) we can write

IC_{v,d,j} − IC_{v,d,j∗} = np log σ̂_{v,j}² − np log σ̂_{v,j∗}² + d(k_j − k_{j∗})p.

Lemma 3.1 shows that

(np/σ_∗²) σ̂_j² ~ χ²_{(n−k_j)p}(δ_j²),   δ_j² = (1/σ_∗²) tr(X_{j∗}Θ_{j∗})'(I_n − P_j)(X_{j∗}Θ_{j∗}).

In particular, (np/σ_∗²) σ̂_{j∗}² ~ χ²_{(n−k_{j∗})p}. First, consider the case j ∈ F_−. Under the assumption A3v,

(1/(np)) (np/σ_∗²) σ̂_j² = ((n − k_j)p/(np)) · (1/((n − k_j)p)) (np/σ_∗²) σ̂_j² →_p 1 + η_j²,
(1/(np)) (np/σ_∗²) σ̂_{j∗}² = ((n − k_{j∗})p/(np)) · (1/((n − k_{j∗})p)) (np/σ_∗²) σ̂_{j∗}² →_p 1.

Therefore

(1/(np)) (IC_{v,d,j} − IC_{v,d,j∗}) = log σ̂_j² − log σ̂_{j∗}² + (d/n)(k_j − k_{j∗})
= log((1/(np))(np/σ_∗²) σ̂_j²) − log((1/(np))(np/σ_∗²) σ̂_{j∗}²) + (d/n)(k_j − k_{j∗})
→_p log(1 + η_j²) + 0 = log(1 + η_j²) > 0,

when d/n → 0. Next, consider the case j ∈ F_+ with j ≠ j∗. Then

IC_{v,d,j} − IC_{v,d,j∗} = np log(σ̂_{v,j}²/σ̂_{v,j∗}²) + d(k_j − k_{j∗})p.

Further,

log(σ̂_{j∗}²/σ̂_j²) = log(tr Y'(I_n − P_{j∗})Y / tr Y'(I_n − P_j)Y)
= log(1 + tr Y'(P_j − P_{j∗})Y / tr Y'(I_n − P_j)Y)
= log(1 + χ²_{(k_j−k_{j∗})p} / χ²_{(n−k_j)p}).

Using the fact that χ²_m/m →_p 1 as m → ∞, we have

n log(σ̂_{j∗}²/σ̂_j²) = n log(1 + ((k_j − k_{j∗})/(n − k_j)) · [χ²_{(k_j−k_{j∗})p}/((k_j − k_{j∗})p)] / [χ²_{(n−k_j)p}/((n − k_j)p)]) →_p k_j − k_{j∗},

and hence

(1/p){IC_{v,d,j} − IC_{v,d,j∗}} = n log(σ̂_j²/σ̂_{j∗}²) + d(k_j − k_{j∗}) →_p −(k_j − k_{j∗}) + d(k_j − k_{j∗}) = (d − 1)(k_j − k_{j∗}) > 0,

if d > 1.

A3. The Proof of Theorem 4.1

Using (4.3) we can write

IC_{u,d,j} − IC_{u,d,j∗} = n(p − 1)(log τ̂_{1j} − log τ̂_{1j∗}) + n(log τ̂_{2j} − log τ̂_{2j∗}) + d(k_j − k_{j∗})p.

Lemma 4.1 shows that for a general j,

n(p − 1) τ̂_{1j}/τ_{1∗} ~ χ²_{(p−1)(n−k_j)}(δ_{1j}²),   (A.7)
n τ̂_{2j}/τ_{2∗} ~ χ²_{n−k_j}(δ_{2j}²),   (A.8)

where

δ_{1j}² = (1/τ_{1∗}) tr H_2'(X_{j∗}Θ_{j∗})'(I_n − P_j)(X_{j∗}Θ_{j∗})H_2,
δ_{2j}² = (1/τ_{2∗}) h_1'(X_{j∗}Θ_{j∗})'(I_n − P_j)(X_{j∗}Θ_{j∗})h_1.

Note that if j ∈ F_+, δ_{1j}² = 0 and δ_{2j}² = 0. First consider the case j ∈ F_−. Then, using (A.5) and (A.7), we have

τ̂_{1j}/τ̂_{1j∗} = χ²_{(p−1)(n−k_j)}(δ_{1j}²)/χ²_{(p−1)(n−k_{j∗})} →_p 1 + η_{1j}².

Similarly, using (A.6) and (A.8), we have

τ̂_{2j}/τ̂_{2j∗} = χ²_{n−k_j}(δ_{2j}²)/χ²_{n−k_{j∗}} →_p 1 + η_{2j}².

These results imply the following:

(1/(np))(IC_{u,d,j} − IC_{u,d,j∗}) = (n(p − 1)/(np)) log(τ̂_{1j}/τ̂_{1j∗}) + (1/p) log(τ̂_{2j}/τ̂_{2j∗}) + (d/n)(k_j − k_{j∗}) →_p log(1 + η_{1j}²) > 0,

if d/n → 0. Next, consider the case j ∈ F_+ with j ≠ j∗. Then

log(τ̂_{1j∗}/τ̂_{1j}) = log(tr H_2'Y'(I_n − P_{j∗})YH_2 / tr H_2'Y'(I_n − P_j)YH_2)
= log(1 + tr H_2'Y'(P_j − P_{j∗})YH_2 / tr H_2'Y'(I_n − P_j)YH_2)
= log(1 + χ²_{(k_j−k_{j∗})(p−1)} / χ²_{(n−k_j)(p−1)}) ≈ (k_j − k_{j∗})/(n − k_j).

Similarly,

log(τ̂_{2j∗}/τ̂_{2j}) = log(h_1'Y'(I_n − P_{j∗})Yh_1 / h_1'Y'(I_n − P_j)Yh_1)
= log(1 + h_1'Y'(P_j − P_{j∗})Yh_1 / h_1'Y'(I_n − P_j)Yh_1)
= log(1 + χ²_{k_j−k_{j∗}} / χ²_{n−k_j}) ≈ (k_j − k_{j∗})/(n − k_j).

These results imply the following:

(1/p)(IC_{u,d,j} − IC_{u,d,j∗}) →_p −(k_j − k_{j∗}) + d(k_j − k_{j∗}) = (d − 1)(k_j − k_{j∗}) > 0,

if d > 1. Combining these with the result in the case j ∈ F_−, we get Theorem 4.1.

A4. The Proof of Theorem 5

In this section, for notational simplicity, we write $\Sigma_a$, $\sigma^2_a$ and $\rho_a$ simply as $\Sigma$, $\sigma^2$ and $\rho$, respectively. Using (5.5) we can write
$$
\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*}
= np \log(\hat\sigma^2_{a,j}/\hat\sigma^2_{a,j_*})
+ n(p-1)\log\left\{(1-\hat\rho^2_{a,j})/(1-\hat\rho^2_{a,j_*})\right\}
+ d(k_j - k_{j_*})p.
$$
Further, using (5.3), the above expression can be expressed as
$$
\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*}
= np \log\{(n-k_j)/(n-k_{j_*})\}
- n \log\left\{(1-\hat\rho^2_{j})/(1-\hat\rho^2_{j_*})\right\}
+ np \log\left\{\frac{a_{1j}\hat\rho^2_{j} - 2a_{2j}\hat\rho_{j} + a_{0j}}{a_{1j_*}\hat\rho^2_{j_*} - 2a_{2j_*}\hat\rho_{j_*} + a_{0j_*}}\right\}
+ d(k_j - k_{j_*})p. \qquad (A.9)
$$
Here, the maximum likelihood estimators $\hat\sigma^2_j$ and $\hat\rho_j$ are defined in terms of
$$
a_{ij} = \mathrm{tr}\,C_iS_j, \quad i = 0, 1, 2, \qquad
(n-k_j)S_j = W_j = Y'(I_n-P_j)Y \sim W_p(n-k_j, \Sigma;\, \Omega_j).
$$
For the definition of $C_i$ and $\Omega_j$, see Section 5. We also use the notation
$$
b_{ij} = \mathrm{tr}\,C_iW_j = (n-k_j)a_{ij}, \quad i = 0, 1, 2.
$$
First, consider the case $j \not\supseteq j_*$. Under the assumption A3a, it is possible to obtain the asymptotic behavior of $a_{ij}$ and $b_{ij}$, which is given later. It is shown that
$$
\hat\rho_j = \frac{a_{2j}}{a_{1j}} + O_p(p^{-1}),
$$
and, more precisely,
$$
\hat\rho_j = \frac{a_{2j}}{a_{1j}}
+ \frac{1}{p}\left\{\frac{a_{2j}(a_{0j}a_{1j} - a_{2j}^2)}{a_{1j}(a_{1j}-a_{2j})(a_{1j}+a_{2j})}\right\}
+ O_p(p^{-2}).
$$
These imply that
$$
\log(1-\hat\rho^2_j) = \log\left(1 - \frac{a^2_{2j}}{a^2_{1j}}\right) + O_p(p^{-1}),
\qquad
\log(a_{1j}\hat\rho^2_j - 2a_{2j}\hat\rho_j + a_{0j}) = \log\left(a_{0j} - \frac{a^2_{2j}}{a_{1j}}\right) + O_p(p^{-2}).
$$
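At the population level, the approximation $\hat\rho_j \approx a_{2j}/a_{1j}$ can be checked directly. In the sketch below, the explicit forms of $C_0$, $C_1$, $C_2$ are assumptions chosen to be consistent with the trace limits used later in this proof ($\mathrm{tr}\,C_0\Sigma = p\sigma^2$, $\mathrm{tr}\,C_1\Sigma = (p-2)\sigma^2$, $\mathrm{tr}\,C_2\Sigma = (p-1)\sigma^2\rho$), not the paper's definitions:

```python
import numpy as np

# For the AR(1) structure Sigma = sigma^2 (rho^{|i-j|}), the population
# trace ratio tr(C_2 Sigma) / tr(C_1 Sigma) approaches rho as p grows,
# mirroring the expansion rho_hat = a_2 / a_1 + O_p(1/p).
sigma2, rho = 2.0, 0.6

def trace_ratios(p):
    idx = np.arange(p)
    Sigma = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
    C0 = np.eye(p)                                     # identity
    C1 = np.diag([0.0] + [1.0] * (p - 2) + [0.0])      # interior diagonal
    C2 = 0.5 * (np.eye(p, k=1) + np.eye(p, k=-1))      # first off-diagonals
    t0, t1, t2 = (np.trace(C @ Sigma) for C in (C0, C1, C2))
    return t0 / p, t1 / p, t2 / p  # -> sigma^2, sigma^2, sigma^2 * rho

for p in (10, 100, 1000):
    t0, t1, t2 = trace_ratios(p)
    print(p, t2 / t1)  # approximates rho for large p
```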

The above expansions hold also for $j = j_*$. Substituting these results into (A.9) and noting $b_{ij} = (n-k_j)a_{ij}$, we have
$$
\frac{1}{np}\left(\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*}\right)
= -\frac{1}{p}\left\{\log\left(1-\frac{b^2_{2j}}{b^2_{1j}}\right) - \log\left(1-\frac{b^2_{2j_*}}{b^2_{1j_*}}\right)\right\}
+ \left\{\log\left(b_{0j}-\frac{b^2_{2j}}{b_{1j}}\right) - \log\left(b_{0j_*}-\frac{b^2_{2j_*}}{b_{1j_*}}\right)\right\}
+ \frac{d}{n}(k_j - k_{j_*}) + O_p(p^{-2}). \qquad (A.10)
$$
Using Lemma A.1, under the assumption A3a we have
$$
\frac{1}{np}b_{0j} = \frac{(n-k_j)p}{np}\cdot\frac{b_{0j}}{(n-k_j)p} \ \xrightarrow{p}\ \sigma^2 + \eta^2_{0j},
\qquad
\frac{1}{np}b_{1j} \ \xrightarrow{p}\ \sigma^2 + \eta^2_{1j},
\qquad
\frac{1}{np}b_{2j} \ \xrightarrow{p}\ \sigma^2\rho + \eta^2_{2j},
$$
since
$$
\frac{1}{p}\mathrm{tr}\,C_0\Sigma = \sigma^2,
\qquad
\frac{1}{p}\mathrm{tr}\,C_1\Sigma = \frac{p-2}{p}\sigma^2 \to \sigma^2,
\qquad
\frac{1}{p}\mathrm{tr}\,C_2\Sigma = \frac{p-1}{p}\sigma^2\rho \to \sigma^2\rho.
$$
The asymptotic behaviors of $b_{ij_*}$ are obtained from these by putting $\eta_{ij} = 0$. These imply that, if $d/n \to 0$,
$$
\frac{1}{np}\left(\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*}\right)
\ \xrightarrow{p}\
\log\left(\sigma^2 + \eta^2_{0j} - \frac{(\sigma^2\rho + \eta^2_{2j})^2}{\sigma^2 + \eta^2_{1j}}\right) - \log \sigma^2(1-\rho^2).
$$
Now we show that
$$
\log\left(\sigma^2 + \eta^2_{0j} - \frac{(\sigma^2\rho + \eta^2_{2j})^2}{\sigma^2 + \eta^2_{1j}}\right) - \log \sigma^2(1-\rho^2) > 0. \qquad (A.11)
$$
Note that
$$
\frac{1}{np}\left(\mathrm{tr}\,C_0\Omega_j - \mathrm{tr}\,C_1\Omega_j\right)
= \frac{1}{np}\,\mathrm{tr}(C_0 - C_1)\Omega_j
= \frac{1}{np}\,(\text{the sum of the } (1,1) \text{ and } (p,p) \text{ elements of } \Omega_j) \to 0,
$$

which implies $\eta^2_{0j} = \eta^2_{1j}$. Using this equality, we can express
$$
\log\left(\sigma^2 + \eta^2_{0j} - \frac{(\sigma^2\rho + \eta^2_{2j})^2}{\sigma^2 + \eta^2_{0j}}\right) - \log \sigma^2(1-\rho^2)
= -\log\left(1 + \frac{\eta^2_{0j}}{\sigma^2}\right)
+ \log\left(1 + \frac{\eta^2_{0j} - \eta^2_{2j}}{\sigma^2(1-\rho)}\right)
+ \log\left(1 + \frac{\eta^2_{0j} + \eta^2_{2j}}{\sigma^2(1+\rho)}\right).
$$
The quantity $\eta^2_{0j} - \eta^2_{2j}$ in the second term can be expressed as the limit of
$$
\frac{1}{np}\left(\mathrm{tr}\,C_0\Omega_j - \mathrm{tr}\,C_2\Omega_j\right) = \frac{1}{np}\,\mathrm{tr}(C_0 - C_2)\Omega_j.
$$
Here, the matrix $C_0 - C_2$ is positive definite, since
$$
x'(C_0 - C_2)x = x_1^2 - x_1x_2 + x_2^2 - x_2x_3 + \cdots - x_{p-1}x_p + x_p^2
= \frac{1}{2}\left\{x_1^2 + (x_1 - x_2)^2 + \cdots + (x_{p-1} - x_p)^2 + x_p^2\right\} > 0,
$$
except for the case $x_1 = \cdots = x_p = 0$. Therefore $\eta^2_{0j} - \eta^2_{2j} \ge 0$, and similarly $\eta^2_{0j} + \eta^2_{2j} \ge 0$. These inequalities imply that the product of the last two factors is at least $1 + \eta^2_{0j}/\sigma^2$, since $\eta^2_{0j}(1+\rho^2) \ge 2\eta^2_{2j}\rho$; hence the right-hand side is positive, and $(\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*})/(np)$ converges in probability to a positive constant if $d/n \to 0$.

Next, consider the case $j \supset j_*$. From (A.10) we have
$$
\frac{1}{p}\left(\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*}\right)
= -\frac{n}{p}\left\{\log\left(1-\frac{b^2_{2j}}{b^2_{1j}}\right) - \log\left(1-\frac{b^2_{2j_*}}{b^2_{1j_*}}\right)\right\}
+ n\left\{\log\left(b_{0j}-\frac{b^2_{2j}}{b_{1j}}\right) - \log\left(b_{0j_*}-\frac{b^2_{2j_*}}{b_{1j_*}}\right)\right\}
+ d(k_j - k_{j_*}) + O_p(p^{-1}).
$$
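The quadratic-form identity used for (A.11) can be verified numerically. The sketch below uses the explicit forms $C_0 = I_p$ and $C_2$ with $1/2$ on the first off-diagonals, which are assumptions consistent with the identity above:

```python
import numpy as np

# Verify x'(C0 - C2)x = (1/2){x_1^2 + sum (x_i - x_{i+1})^2 + x_p^2},
# which shows C0 - C2 is positive definite.
rng = np.random.default_rng(1)
p = 8
C0 = np.eye(p)
C2 = 0.5 * (np.eye(p, k=1) + np.eye(p, k=-1))

for _ in range(5):
    x = rng.standard_normal(p)
    quad = x @ (C0 - C2) @ x
    telescoped = 0.5 * (x[0]**2 + np.sum(np.diff(x)**2) + x[-1]**2)
    print(quad, telescoped)  # the two expressions agree

mineig = np.linalg.eigvalsh(C0 - C2).min()
print(mineig > 0)  # smallest eigenvalue is positive
```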

Then it is easily seen that
$$
\frac{n}{p}\left\{\log\left(1-\frac{b^2_{2j}}{b^2_{1j}}\right) - \log\left(1-\frac{b^2_{2j_*}}{b^2_{1j_*}}\right)\right\}
\ \xrightarrow{p}\ \frac{1}{c}\left\{\log(1-\rho^2) - \log(1-\rho^2)\right\} = 0.
$$
Note that
$$
n\left\{\log\left(b_{0j}-\frac{b^2_{2j}}{b_{1j}}\right) - \log\left(b_{0j_*}-\frac{b^2_{2j_*}}{b_{1j_*}}\right)\right\}
= n\left\{-(\log b_{1j} - \log b_{1j_*}) + \log(b_{0j}b_{1j} - b^2_{2j}) - \log(b_{0j_*}b_{1j_*} - b^2_{2j_*})\right\}.
$$
Further, we use the following relation between $b_{ij}$ and $b_{ij_*}$:
$$
b_{ij} = \mathrm{tr}\,C_iY'(I_n-P_j)Y = \mathrm{tr}\,C_iY'(I_n-P_{j_*})Y - \mathrm{tr}\,C_iY'(P_j-P_{j_*})Y = b_{ij_*} - b_{ij+},
$$
where $b_{ij+} = \mathrm{tr}\,C_iY'(P_j-P_{j_*})Y$. Since $Y'(P_j-P_{j_*})Y \sim W_p(k_j-k_{j_*}, \Sigma)$, it holds that
$$
\frac{b_{0j+}}{(k_j-k_{j_*})p} \ \xrightarrow{p}\ \sigma^2,
\qquad
\frac{b_{1j+}}{(k_j-k_{j_*})p} \ \xrightarrow{p}\ \sigma^2,
\qquad
\frac{b_{2j+}}{(k_j-k_{j_*})p} \ \xrightarrow{p}\ \sigma^2\rho.
$$
Therefore, noting that $b_{ij_*} = O_p(np)$ and $b_{ij+} = O_p(p)$, we have
$$
n(\log b_{1j} - \log b_{1j_*}) = n\log\left(1 - \frac{b_{1j+}}{b_{1j_*}}\right) \approx -\frac{n b_{1j+}}{b_{1j_*}},
$$
and hence
$$
n(\log b_{1j} - \log b_{1j_*}) \ \xrightarrow{p}\ -(k_j - k_{j_*}).
$$
Consider
$$
f = n\left\{\log(b_{0j}b_{1j} - b^2_{2j}) - \log(b_{0j_*}b_{1j_*} - b^2_{2j_*})\right\}.
$$
Substituting $b_{ij} = b_{ij_*} - b_{ij+}$ into this expression, we have $f = n\log(1-h) \approx -nh$,

where
$$
h = \frac{b_{0j_*}b_{1j+} + b_{1j_*}b_{0j+} - 2b_{2j_*}b_{2j+} - b_{0j+}b_{1j+} + b^2_{2j+}}{b_{0j_*}b_{1j_*} - b^2_{2j_*}}.
$$
Considering the limiting value of each term in $h$, we have
$$
f \ \xrightarrow{p}\ -\frac{k_j - k_{j_*}}{1-\rho^2} - \frac{k_j - k_{j_*}}{1-\rho^2} + \frac{2(k_j - k_{j_*})\rho^2}{1-\rho^2} = -2(k_j - k_{j_*}).
$$
Summarizing the above results, for $j \supset j_*$ and $d > 1$,
$$
\frac{1}{p}\left(\mathrm{IC}_{a,d,j} - \mathrm{IC}_{a,d,j_*}\right)
\ \xrightarrow{p}\ (k_j - k_{j_*}) - 2(k_j - k_{j_*}) + d(k_j - k_{j_*}) = (d-1)(k_j - k_{j_*}) > 0.
$$
This completes the proof.

A5. The Proof of Theorem 6

For notational simplicity, we denote $\mathrm{IC}_{g,d,j}$ by $\mathrm{IC}(j)$. Note that $j_\omega = \{1, 2, \ldots, k\}$ and $k$ is finite. Without loss of generality, we may assume that the true model is $j_* = \{1, \ldots, b\}$ with $b = k_{j_*}$. Under the assumptions A1, A2 and A3, it was shown that our information criterion $\hat j_{\mathrm{IC}}$ has a high-dimensional consistency property. The consistency property was shown by proving that (1) for $j \in \mathcal{F}_-$,
$$
m_1\left\{\mathrm{IC}(j) - \mathrm{IC}(j_*)\right\} \ \xrightarrow{p}\ \gamma_j > 0, \qquad (A.12)
$$
and (2) for $j \in \mathcal{F}_+$ and $j \neq j_*$,
$$
m_2\left\{\mathrm{IC}(j) - \mathrm{IC}(j_*)\right\} \ \xrightarrow{p}\ (d-1)(k_j - k_{j_*}) > 0, \qquad (A.13)
$$
where $d > 1$, $m_1 = np$ and $m_2 = p$.

From the definition of our efficient criterion $\hat j_{\mathrm{EC}}$, we have
$$
P(\hat j_{\mathrm{EC}} = j_*)
= P\left(\mathrm{IC}(j_{(1)}) - \mathrm{IC}(j_\omega) > 0, \ldots, \mathrm{IC}(j_{(b)}) - \mathrm{IC}(j_\omega) > 0,\
\mathrm{IC}(j_{(b+1)}) - \mathrm{IC}(j_\omega) < 0, \ldots, \mathrm{IC}(j_{(k)}) - \mathrm{IC}(j_\omega) < 0\right)
$$
$$
= P\left(\bigcap_{i=1}^{b}\left\{\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) > 0\right\}
\cap \bigcap_{i=b+1}^{k}\left\{\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) < 0\right\}\right).
$$
Applying the Bonferroni inequality $P(\cap_i A_i) \ge 1 - \sum_i \{1 - P(A_i)\}$, we have
$$
P(\hat j_{\mathrm{EC}} = j_*)
\ge 1 - \sum_{i \in j_*}\left\{1 - P(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) > 0)\right\}
- \sum_{i \in j_\omega \setminus j_*}\left\{1 - P(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) < 0)\right\}. \qquad (A.14)
$$
Now, consider evaluating the following probabilities: $P(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) > 0)$ for $i \in j_*$, and $P(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) < 0)$ for $i \in j_\omega \setminus j_*$. When $i \in j_*$, $j_{(i)} \in \mathcal{F}_-$, and hence, using (A.12), it holds that
$$
m_1^{-1}\left(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega)\right)
= m_1^{-1}\left(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_*)\right) - m_1^{-1}\left(\mathrm{IC}(j_\omega) - \mathrm{IC}(j_*)\right)
\ \xrightarrow{p}\ \gamma_i + 0 = \gamma_i > 0.
$$
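The bound used here is the elementary Bonferroni inequality $P(\cap_i A_i) \ge 1 - \sum_i\{1 - P(A_i)\}$: since $k$ is fixed, it suffices that each individual selection probability tends to one. A Monte Carlo illustration follows; the correlated events and all numeric values are arbitrary choices for demonstration:

```python
import numpy as np

# Empirically, the frequency of the intersection of events always dominates
# the Bonferroni lower bound 1 - sum_i (1 - freq_i), even for dependent events.
rng = np.random.default_rng(2)
k, reps = 5, 20000
# Correlated Gaussian coordinates with unit variances (equicorrelation 0.5):
L = np.linalg.cholesky(0.5 * np.eye(k) + 0.5 * np.ones((k, k)))
Z = rng.standard_normal((reps, k)) @ L.T
events = Z < 2.0                       # A_i = {Z_i < 2}

p_all = np.mean(events.all(axis=1))    # P(A_1 ∩ ... ∩ A_k), estimated
bound = 1 - np.sum(1 - events.mean(axis=0))  # Bonferroni lower bound
print(p_all, bound)
```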

When $i \in j_\omega \setminus j_*$, $j_{(i)} \in \mathcal{F}_+$, and hence, using (A.13), it holds that
$$
m_2^{-1}\left(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega)\right)
= m_2^{-1}\left(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_*)\right) - m_2^{-1}\left(\mathrm{IC}(j_\omega) - \mathrm{IC}(j_*)\right)
\ \xrightarrow{p}\ (d-1)(k-1-b) - (d-1)(k-b) = -(d-1) < 0.
$$
These results imply that
$$
\lim P(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) > 0) = 1 \ \ (i \in j_*),
\qquad
\lim P(\mathrm{IC}(j_{(i)}) - \mathrm{IC}(j_\omega) < 0) = 1 \ \ (i \in j_\omega \setminus j_*).
$$
Using these results, we can see that the right-hand side of (A.14) tends to
$$
1 - \sum_{i \in j_*}\{1 - 1\} - \sum_{i \in j_\omega \setminus j_*}\{1 - 1\} = 1,
$$
and hence $P(\hat j_{\mathrm{EC}} = j_*) \to 1$. This completes the proof of Theorem 6.

Acknowledgements

The second author's research is partially supported by the Ministry of Education, Science, Sports, and Culture, a Grant-in-Aid for Scientific Research (C), 16K00047.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (eds. B. N. Petrov and F. Csáki), Akadémiai Kiadó, Budapest.

[2] Bedrick, E. J. and Tsai, C.-L. (1994). Model selection for multivariate regression in small samples. Biometrics, 50.

[3] Fujikoshi, Y., Kanda, T. and Tanimura, N. (1990). The growth curve model with an autoregressive covariance structure. Ann. Inst. Statist. Math., 42.

[4] Fujikoshi, Y. and Satoh, K. (1997). Modified AIC and Cp in multivariate linear regression. Biometrika, 84.

[5] Fujikoshi, Y., Sakurai, T. and Yanagihara, H. (2013). Consistency of high-dimensional AIC-type and Cp-type criteria in multivariate linear regression. J. Multivariate Anal., 149.

[6] Fujikoshi, Y., Ulyanov, V. V. and Shimizu, R. (2010). Multivariate Statistics: High-Dimensional and Large-Sample Approximations. Wiley, Hoboken, N.J.

[7] Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12.

[8] Nishii, R., Bai, Z. D. and Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Math. J., 18.

[9] Rothman, A., Levina, E. and Zhu, J. (2010). Sparse multivariate regression with covariance estimation. J. Comput. Graph. Statist., 19.

[10] Yanagihara, H., Wakaki, H. and Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electron. J. Stat., 9.

[11] Zhao, L. C., Krishnaiah, P. R. and Bai, Z. D. (1986). On detection of the number of signals in presence of white noise. J. Multivariate Anal., 20.


MLES & Multivariate Normal Theory

MLES & Multivariate Normal Theory Merlise Clyde September 6, 2016 Outline Expectations of Quadratic Forms Distribution Linear Transformations Distribution of estimates under normality Properties of MLE s Recap Ŷ = ˆµ is an unbiased estimate

More information

AR-order estimation by testing sets using the Modified Information Criterion

AR-order estimation by testing sets using the Modified Information Criterion AR-order estimation by testing sets using the Modified Information Criterion Rudy Moddemeijer 14th March 2006 Abstract The Modified Information Criterion (MIC) is an Akaike-like criterion which allows

More information

Tests for Covariance Matrices, particularly for High-dimensional Data

Tests for Covariance Matrices, particularly for High-dimensional Data M. Rauf Ahmad Tests for Covariance Matrices, particularly for High-dimensional Data Technical Report Number 09, 200 Department of Statistics University of Munich http://www.stat.uni-muenchen.de Tests for

More information

Approaches for Multiple Disease Mapping: MCAR and SANOVA

Approaches for Multiple Disease Mapping: MCAR and SANOVA Approaches for Multiple Disease Mapping: MCAR and SANOVA Dipankar Bandyopadhyay Division of Biostatistics, University of Minnesota SPH April 22, 2015 1 Adapted from Sudipto Banerjee s notes SANOVA vs MCAR

More information

Quadratic forms in skew normal variates

Quadratic forms in skew normal variates J. Math. Anal. Appl. 73 (00) 558 564 www.academicpress.com Quadratic forms in skew normal variates Arjun K. Gupta a,,1 and Wen-Jang Huang b a Department of Mathematics and Statistics, Bowling Green State

More information

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data Yujun Wu, Marc G. Genton, 1 and Leonard A. Stefanski 2 Department of Biostatistics, School of Public Health, University of Medicine

More information

Bootstrap prediction and Bayesian prediction under misspecified models

Bootstrap prediction and Bayesian prediction under misspecified models Bernoulli 11(4), 2005, 747 758 Bootstrap prediction and Bayesian prediction under misspecified models TADAYOSHI FUSHIKI Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569,

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Estimation of parametric functions in Downton s bivariate exponential distribution

Estimation of parametric functions in Downton s bivariate exponential distribution Estimation of parametric functions in Downton s bivariate exponential distribution George Iliopoulos Department of Mathematics University of the Aegean 83200 Karlovasi, Samos, Greece e-mail: geh@aegean.gr

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices

Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices Takahiro Nishiyama a,, Masashi Hyodo b, Takashi Seo a, Tatjana Pavlenko c a Department of Mathematical

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Estimating prediction error in mixed models

Estimating prediction error in mixed models Estimating prediction error in mixed models benjamin saefken, thomas kneib georg-august university goettingen sonja greven ludwig-maximilians-university munich 1 / 12 GLMM - Generalized linear mixed models

More information

On corrections of classical multivariate tests for high-dimensional data

On corrections of classical multivariate tests for high-dimensional data On corrections of classical multivariate tests for high-dimensional data Jian-feng Yao with Zhidong Bai, Dandan Jiang, Shurong Zheng Overview Introduction High-dimensional data and new challenge in statistics

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

On corrections of classical multivariate tests for high-dimensional data. Jian-feng. Yao Université de Rennes 1, IRMAR

On corrections of classical multivariate tests for high-dimensional data. Jian-feng. Yao Université de Rennes 1, IRMAR Introduction a two sample problem Marčenko-Pastur distributions and one-sample problems Random Fisher matrices and two-sample problems Testing cova On corrections of classical multivariate tests for high-dimensional

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

Math 181B Homework 1 Solution

Math 181B Homework 1 Solution Math 181B Homework 1 Solution 1. Write down the likelihood: L(λ = n λ X i e λ X i! (a One-sided test: H 0 : λ = 1 vs H 1 : λ = 0.1 The likelihood ratio: where LR = L(1 L(0.1 = 1 X i e n 1 = λ n X i e nλ

More information

Variable selection in multiple regression with random design

Variable selection in multiple regression with random design Variable selection in multiple regression with random design arxiv:1506.08022v1 [math.st] 26 Jun 2015 Alban Mbina Mbina 1, Guy Martial Nkiet 2, Assi Nguessan 3 1 URMI, Université des Sciences et Techniques

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Empirical Likelihood Tests for High-dimensional Data

Empirical Likelihood Tests for High-dimensional Data Empirical Likelihood Tests for High-dimensional Data Department of Statistics and Actuarial Science University of Waterloo, Canada ICSA - Canada Chapter 2013 Symposium Toronto, August 2-3, 2013 Based on

More information

Model selection using penalty function criteria

Model selection using penalty function criteria Model selection using penalty function criteria Laimonis Kavalieris University of Otago Dunedin, New Zealand Econometrics, Time Series Analysis, and Systems Theory Wien, June 18 20 Outline Classes of models.

More information

Estimation of Threshold Cointegration

Estimation of Threshold Cointegration Estimation of Myung Hwan London School of Economics December 2006 Outline Model Asymptotics Inference Conclusion 1 Model Estimation Methods Literature 2 Asymptotics Consistency Convergence Rates Asymptotic

More information

Order Selection for Vector Autoregressive Models

Order Selection for Vector Autoregressive Models IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 2, FEBRUARY 2003 427 Order Selection for Vector Autoregressive Models Stijn de Waele and Piet M. T. Broersen Abstract Order-selection criteria for vector

More information

Lecture 20: Linear model, the LSE, and UMVUE

Lecture 20: Linear model, the LSE, and UMVUE Lecture 20: Linear model, the LSE, and UMVUE Linear Models One of the most useful statistical models is X i = β τ Z i + ε i, i = 1,...,n, where X i is the ith observation and is often called the ith response;

More information

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data Journal of Multivariate Analysis 78, 6282 (2001) doi:10.1006jmva.2000.1939, available online at http:www.idealibrary.com on Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Institute of Statistics and Econometrics Georg-August-University Göttingen Department of Statistics

More information

Cross-sectional space-time modeling using ARNN(p, n) processes

Cross-sectional space-time modeling using ARNN(p, n) processes Cross-sectional space-time modeling using ARNN(p, n) processes W. Polasek K. Kakamu September, 006 Abstract We suggest a new class of cross-sectional space-time models based on local AR models and nearest

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

F & B Approaches to a simple model

F & B Approaches to a simple model A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 215 http://www.astro.cornell.edu/~cordes/a6523 Lecture 11 Applications: Model comparison Challenges in large-scale surveys

More information

Bayesian Estimation of Prediction Error and Variable Selection in Linear Regression

Bayesian Estimation of Prediction Error and Variable Selection in Linear Regression Bayesian Estimation of Prediction Error and Variable Selection in Linear Regression Andrew A. Neath Department of Mathematics and Statistics; Southern Illinois University Edwardsville; Edwardsville, IL,

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 21 Model selection Choosing the best model among a collection of models {M 1, M 2..., M N }. What is a good model? 1. fits the data well (model

More information

Robust model selection criteria for robust S and LT S estimators

Robust model selection criteria for robust S and LT S estimators Hacettepe Journal of Mathematics and Statistics Volume 45 (1) (2016), 153 164 Robust model selection criteria for robust S and LT S estimators Meral Çetin Abstract Outliers and multi-collinearity often

More information