A high-dimensional bias-corrected AIC for selecting response variables in multivariate calibration

Size: px

Start display at page:

Download "A high-dimensional bias-corrected AIC for selecting response variables in multivariate calibration"

Miranda Mitchell
5 years ago
Views:

1 A high-dimensional bias-corrected AIC for selecting response variables in multivariate calibration Ryoya Oda, Yoshie Mima, Hirokazu Yanagihara and Yasunori Fujikoshi Department of Mathematics, Graduate School of Science, Hiroshima University -3- Kagamiyama, Higashi-Hiroshima, Hiroshima , Japan Last Modified: December 2, 208 Abstract In a multivariate linear regression with a p-dimensional response vector y and a q- dimensional explanatory vector x, we consider a multivariate calibration problem requiring the estimation of an unknown explanatory vector x 0 corresponding to a response vector y 0, based on y 0 and n-samples of x and y. We propose a high-dimensional bias-corrected Akaike s information criterion HAIC C for selecting response variables. To correct the bias between a risk function and its estimator, we use a hybrid-high-dimensional asymptotic framework such that n tends to but p/n does not exceed. Through numerical experiments, we verify that the HAIC C performs better than a formal AIC. Introduction Multivariate linear regression is very widely used in various fields. Let y = y,..., y p and x = x,..., x q be a p-dimensional response vector and a q-dimensional nonstochastic explanatory vector, respectively. A multivariate linear regression is written as y = α + B x + ε, where α is a p-dimensional unknown vector of intercept coefficients, B is a q p unknown matrix of regression coefficients, and ε is a p-dimensional error vector. We assume that ε is distributed according to the p-dimensional normal distribution with a mean vector 0 p, which is a p-dimensional vector of zeros, and a p p covariance matrix Σ. Further, assume that there are n observations y, x,..., y n, x n, which are expressed as Y = y,..., y n : n p and X = x,..., x n : n q in matrix notation, and X is centralized X n = 0 q, where n is the n-dimensional vector of ones. Our model selection criterion is defined for the case that rankx = q p and n p q 2 > 0. Corresponding author. oda.stat@gmail.com Current address: Institute of educational foundation Fukuyama Akenohoshi, 3-4- Nishifukatsucho, Fukuyama-shi, Hiroshima , Japan.

2 In actual empirical contexts, calibration is often needed and is widely applied e.g., Alves and Poppi, 206; Bro, 2003; Carvalho et al., 205. A typical calibration problem can be described as follows. Suppose that a value y 0 of y in has been observed, but the corresponding value x 0 of x is unknown. Then the problem is to make an inference about an unknown vector x 0, based on y 0, Y, and X. Brown 982 summarized various aspects of this problem. There is considerable literature on point estimators of x 0 and confidence regions for x 0 in the controlled calibration model. Another calibration problem involves removing redundant response variables in estimating x 0. Generally, estimating a reduced model on a subset of the available data leads to an inferior fit. This will be the case when all the parameters are known. However, it is understood that when some response variables are redundant, we can expect better prediction by neglecting these redundant response variables. In this vein, two approaches are salient: one based on test procedures for a redundancy hypothesis and the other based on model selection procedures for selecting response variables. Brown 982 proposed a procedure based on a test for redundancy of response variables. In the context of model selection, Fujikoshi and Nishii 986 proposed a formal Akaike s information criterion AIC for the redundancy of response variables in estimating x 0. In the AIC, the goodness of fit of a model is measured by the Kullback-Leibler KL discrepancy function. The best model is defined as that where the risk function defined by the expected KL loss function is lowest among all the redundancy models. The AIC is defined by adding an estimator of the bias between the risk function and the expectation of a negative twofold maximum log-likelihood to the negative twofold maximum log-likelihood, and the AIC is regarded as an estimator of the risk function. Following Akaike 973, Fujikoshi and Nishii 986 formally regarded the twofold number of parameters in the redundancy model as an estimator of bias. However, it is known that when the sample size n is small or the number of redundancy models is large, a formal AIC does not estimate the risk function well and tends to choose a model such that many response variables are not redundant. Therefore, it is important to correct the bias in order to better estimate the risk function. For selecting explanatory variables in an ordinary linear regression, Bedrick and Tsai 994, Hurvich and Tsai 989, and Sugiura 978 corrected the bias for overspecified models. However, these corrections did not consider the selection of response variables in multivariate calibration. In recent years, it has become commonplace to analyze high-dimensional data such that not only the sample size n but also the dimension p are large. Bedrick and Tsai 994, Hurvich and Tsai 989 and Sugiura 978 used a large sample LS asymptotic framework such that n only tends to in order to correct bias. However, corrections using the LS asymptotic framework become inadequate when the response dimension p is large. It is known that corrections using the high-dimensional HD asymptotic framework such that both n and p tend to are better than those using the LS asymptotic framework. Fujikoshi et al. 204 offered a bias correction using the HD asymptotic framework in the context of selecting explanatory variables in a multivariate linear regression. It can be expected that such a correction performs well when both n and p are large but is inadequate when p is small. However, it can be difficult for analysts to decide whether p is large or small; thus, when p is moderate, it can be burdensome to determine whether corrected AICs by the LS or HD asymptotic framework should be applied. In this paper, we correct the bias under a hybrid-high-dimensional HHD asymptotic frame- 2

3 work: n, p c [0,. n It should be noted that the LS and HD asymptotic frameworks are included in the HHD asymptotic framework as special cases. Thus, it is expected that the bias-corrected AIC determined by using the HHD asymptotic framework is a better estimator of the risk function regardless of the size of p. We refer to this bias-corrected AIC as the high-dimensional bias-corrected AIC HAIC C. The remainder of the paper is organized as follows. In section 2, we introduce a framework for selecting response variables in multivariate calibration. In section 3, the HAIC C is proposed and we obtain an asymptotic property of the HAIC C. In section 4, we explore, and verify, the performance of the HAIC C through conducting numerical simulations. relegated to the Appendix. Technical details are 2 A framework for selecting response variables 2. A redundancy hypothesis of response variables Let Z = n, X and Θ = α, B be n q + and q + p matrices. multivariate linear regression for Y and X is written as Then, the Y = ZΘ + E, 2 where E = ε,..., ε n is the n p error matrix and ε,..., ε n are mutually independent and identically distributed according to N p 0 p, Σ assuming that Σ is positive definite. To introduce the redundancy hypothesis of response variables by Fujikoshi and Nishii 986, we prepare a classical estimator of x 0. If the population parameters α, B, and Σ are known, then the classical estimator x 0 is written as x 0 = arg min x 0 {y 0 α B x 0 Σ y 0 α B x 0 } = BΣ B BΣ y 0 α. 3 We partition y = y, y 2. Without loss of generality, let y and y 2 be the p -dimensional and p p -dimensional vectors of the candidate redundant variables and non-redundant variables, respectively. To define the redundancy hypothesis, we consider only the case q p < p. We express the partitions of y 0, α, B, and Σ corresponding to the division of y = y, y 2 as follows: y 0 = y 0,, y 0,2, α = α, α 2 Σ Σ 2, B = B, B 2, Σ =, 4 Σ 2 Σ 22 where y 0, and y 0,2 are p -dimensional and p p -dimensional vectors, α and α 2 are p - dimensional and p p -dimensional vectors, B and B 2 are q p and q p p matrices, Σ and Σ 22 are the p p and p p p p covariance matrices of y and y 2, and Σ 2 is the p p p covariance matrix of y and y 2. Let C = BΣ B BΣ. We also partition the classical estimator x 0 in 3 as x 0 = Cy 0 α = C y 0, α + C 2 y 0,2 α 2, 3

4 where C and C 2 are the q p and q p p submatrices of C = C, C 2. From Fujikoshi and Nishii 986, the hypothesis such that y 2 is redundant for estimating x 0 is written as C 2 = B 2 B Γ = O q,p p, 5 where Γ = Σ Σ 2 and O q,p p is the q p p matrix of zeros. 2.2 Models, Risk, and Bias Let Y and Y 2 be the n p and n p p partitioned matrices of Y = Y, Y 2, and let Θ = α, B, Θ 2 = α 2, B 2, and Θ 2 = Θ 2 Θ Γ. From a property of a conditional distribution of a multivariate normal distribution e.g., Srivastava and Khatri, 979, we can express 2 as follows: Y N n p ZΘ, Σ I n, Y 2 Y N n p p Z Θ 2 + Y Γ, Σ 22 I n, where Σ ab c = Σ ab Σ ac Σ cc Σ ca. Under hypothesis 5, it is straightforward to observe that Z Θ 2 = n δ, where δ = α 2 Γ α. Therefore, the candidate model M such that y 2 is redundant is expressed as Y N n p ZΘ, Σ I n, Y 2 Y N n p p n δ + Y Γ, Σ 22 I n. 6 Let S = Y, Y 2, X I n J n Y, Y 2, X be n times as large as the sample covariance matrix of y, x, where J n = n n n. We partition S as follows: S S 2 S x S = S 2 S 22 S 2x, S x S x2 S xx where the sizes of S, S 22, and S xx are p p, p p p p, and q q, respectively. Let fy ; Θ, Σ be the probability density function of N n p ZΘ, Σ I n. Then, the log-likelihood function is derived as lθ, Σ; Y, X = 2 log fy ; Θ, Σ = [ np log 2π + n log Σ + tr{σ Y ZΘ Y ZΘ} ]. 2 By maximizing lθ, Σ; Y, X under model 6, we can obtain the maximum likelihood estimators MLEs of Σ, Σ 22, α, α 2, B, B 2, Θ, Θ 2, Γ, and δ as follows the proof is given in Appendix A: Σ = n Y I n P Z Y, Σ22 = n Y 2 I n P n,y Y 2, α = ȳ S x Sxx x, B = Sxx S x, Θ = Z Z Z Y = α B α 2 = ȳ 2 S 2 S S xs xx x, B2 = Sxx S x S S 2, ȳ, Θ2 = 2 x Sxx S x S S 2 Sxx S x S S, 2 Γ = S S 2, δ = ȳ2 Γ ȳ, 4

5 where P A = AA A A for a matrix A, ȳ = n Y n, ȳ 2 = n Y 2 n, and x = n X n. We assume that the data are generated from the following true model M : Y N n p, Σ I n, y 0 N p ξ, Σ, where is an n p true mean matrix, ξ is a p-dimensional true mean vector, and Σ is a p p true covariance matrix assuming that Σ is positive definite. Under 6, we partition the true covariance matrix Σ in the same way as Σ, as follows: Σ Σ 2 Σ =. Σ 2 Let Γ = Σ Σ 2. We state that the candidate model M is an overspecified model when M satisfies Σ 22 = ZΘ, B 2 B Γ = O q,p p, 7 where Θ = α, B is a q+ p true unknown matrix, α is the p-dimensional true unknown vector of intercept coefficients, B is the q p true unknown matrix of regression coefficients, and B and B 2 are q p and q p p submatrices of B = B, B 2. Let LΘ, Σ be the expected negative twofold log-likelihood function as follows: LΘ, Σ = E Y [ 2lΘ, Σ; Y, X] = 2lΘ, Σ;, X + ntrσ Σ, 8 where EY is the expectation with respect to Y under the true model M. Then, we define the risk function R KL as R KL = EY [L Θ, Σ]. This type of risk function is essentially the same as the expected KL discrepancy function between the true model and a redundancy model. Although the best model is defined as the candidate model with the smallest risk function, we need to estimate the risk function because the risk function includes unknown parameters. Therefore, we usually estimate R KL as 2l Θ, Σ; Y, X, which is expressed as 2l Θ, Σ; Y, X = np{log 2π + } + nlog Σ + log Σ 22, 9 under a candidate model M the proof of 9 is given in Appendix A. However, when R KL is estimated as 9, there is the following bias between the risk function and 2l Θ, Σ; Y, X: B KL = R KL E Y [ 2l Θ, Σ; Y, X]. 0 Although 2l Θ, Σ; Y, X + B KL may be considered as a criterion, note that the bias B KL also includes unknown parameters. Therefore, we usually consider estimating B KL by using its estimator. 5

6 3 Main results 3. High-Dimensional Bias-Corrected AIC Let the number of parameters in the candidate model M be h M = {p + qp + pp + /2}. Following Akaike 973, by approximating B KL as 2h M, the formal AIC is given by { np{log2π + } + nlog AIC = Σ + log Σ h M q p < p np{log2π + } + n log Σ, ω + 2h M p = p where Σ ω = n Y I n P Z Y is the MLE of Σ in the model where none of the response variables y are redundant. This formal AIC has been given by Fujikoshi and Nishii 986 and the model which minimizes the AIC should be selected. However, since the AIC cannot approximate the risk function well when n is small or when n and p are both large, the AIC may not perform adequately in general terms. Therefore, it is important to consider a new criterion that has a better approximation than the AIC in such cases. Correcting the bias B KL in the HHD asymptotic framework, we propose a high-dimensional bias-corrected AIC HAIC C as follows: np{log 2π + } + nlog Σ + log Σ 22 + mn, p q p < p HAIC C = np log2π + n log Σ npn + q + ω + p = p n p q 2 Here, mn, p is defined by mn, p = m n, p + m 2 n, p, m n, p = np + np n + q + + np q + 2n + q + n p n p 2 + n2 p q n p 3 m 2 n, p = + np p n + + 2np p n + n p n p 2 + 4n2 p p n p 3 + np p [ { q n } ] q + 2 n pn p n p n p + nn + p p { p q } n pn p n p n p + 4n{np { q + p }p p n pn p { + 8n2 p p p n pn p n 2 n pn p {2q + τ τ 2 τ 3 }, and τ, τ 2 and τ 3 are given by n p 2 + n p 2 + n pn p }. 2 n p 3 + n p 3 + n pn p 2 + n p 2 n p }, 3 τ = p p { tr{se S e + S h } p q }, τ 2 = n τ 2, 4 n p n p τ 3 = np p n p 2 { tr[{se S e + S h } 2 ] p q } 5 6

7 where S e = n ˆΣ = Y I n P Z Y and S h = Y P X Y. Practically, to propose the HAIC C when q p < p, we took the following step. We expanded the bias B KL in overspecified models the result is given in Appendix B, Proposition B.2. Since there are unknown parameters included in the expanded term, we replaced the unknown parameters as τ, τ 2 and τ 3. Therefore, the HAIC C is proposed using mn, p instead of B KL. When p = p, the redundancy model means the usual multivariate linear regression such that all response variables y are not redundant. Therefore, from Bedrick and Tsai 994, the bias for p = p is exactly equal to np + {npn + q + }/n p q Asymptotic property of HAIC C To present an asymptotic property of the HAIC C, we start by offering some notation and delineating assumptions. Let Ξ be the p p constant matrix given by Ξ = B X XB. 6 Note that Σ /2 Ξ Σ /2 is the p p symmetric matrix and rankσ /2 Ξ Σ /2 min {p, q} = q. Therefore, its spectral decomposition can be expressed as Σ /2 Ξ Σ /2 = H Λ O p q,q O q,p q O p q,p q H, 7 where H is a p p orthogonal matrix, and Λ is a q q diagonal matrix of which the a-th diagonal element is a singular value λ a, i.e., Λ = diagλ,..., λ q with λ λ q 0. Let G be a q q random matrix distributed according to the non-central Wishart distribution with n degrees of freedom, covariance matrix Φ, and non-central parameter matrix ΛΦ, that is, where Φ is defined by We prepare the following assumptions for the moments of G: Assumption A. E[trG 2 ] = O. G W q n, Φ ; ΛΦ, 8 Φ = n I q + Λ. 9 Assumption A2. E[trG 4 ] = O, E[trG 2 2 ] = On + λ q, E[trG 8 ] = On + λ q 3. Note that Φ = On + λ q and ΛΦ = O because Φ 2 = ΛΦ 2 = q n + λ i 2 q n + λ q 2 = On + λ q 2, q 2 λ i qλ 2 n + λ i n + λ 2 = O, i= i= where is the Frobenius norm. Therefore, Assumptions A and A2 are natural. We obtain the asymptotic unbiasedness of the HAIC C for R KL as Theorem 3.. Theorem 3. is directly derived from 0, Proposition B.2 in Appendix B, and Proposition C.2 in Appendix C. 7

8 Theorem 3.. Suppose that q p < p, and Assumptions A and A2 hold. Then, under overspecified models 7, we have R KL E Y [HAIC C ] = Opn n + λ q /2, as n and p/n c [0,, where λ q is the q-th singular value of Σ /2 Ξ Σ /2. From Theorem 3., the HAIC C approximates the risk function well in cases where n is not large and when n and p are both large. Thus, it is expected that the HAIC C will work well. 4 Numerical simulations We conduct numerical simulations to show that the HAIC C in 2 works better than the AIC in. The p redundancy models j j =,..., p were prepared for Monte Carlo simulations with 0,000 iterations. Here, model j denotes a model such that y,..., y j are not redundant and y j+,..., y p are redundant. Suppose that the true model is the model such that p -response variables y,..., y p are not redundant. The data Y were generated from the true model N n p ZΘ, Σ I n, where Z = n, X, Θ = α, B, and we gave α = p and Σ = 0.8I p p p. We constructed X and B as follows. First, we independently generated u,..., u n from U,, where Ua, b denotes the uniform distribution with the range a, b. Using u,..., u n, we constructed X as X = I n J n X 0, where i, j-th element of X 0 is defined by u j i i n, j q. Second, let B and B 2 be submatrices of B in 4 when p = p. Similarly, Σ and Σ 2 are submatrices of Σ. Using the above notation, we constructed B 2 as B 2 = B Σ Σ 2, where the l l p -th column vectors and p - th column vector of B are defined by, 2,..., 2 and, 2p,..., 2p, respectively. Next, we set q = 2, and data were generated for various combinations of n, p, and p. In order to compare the performances of the HAIC C and AIC, the following three properties were considered: Ratio of the information criterion IC average to the risk function R KL : E Y [IC]/R KL. The probability of selecting the true model: the frequency with which the true model is chosen as the best model. The KL information of the predicted values of the best model chosen by the information criterion, which is defined by KL = np {E Y [L Θ best, Σ best ] LΘ, Σ }, where LΘ, Σ is given by 8 and Θ best and Σ best are the MLEs of Θ and Σ, respectively, under the best model. A high-performance model selector is considered to be a model selection criterion where the ratio of the information criterion average to the risk function is close to, where there is a high probability of selecting the true model, and where there is a small amount of KL information. Figures -3 show the ratio of the average of each criterion, i.e., the HAIC C and the AIC to R KL. From Figures -3, it can be observed that the values for the HAIC C are closer to than those for 8

9 the AIC. Therefore, the HAIC C can approximate the risk exactly, but the AIC underestimates this risk. Moreover, as the number of dimensions p increases, the bias of the AIC increases. Table shows the probabilities of selecting the true model and the amount of KL information for the HAIC C and AIC. Therein, it is clear that the probability for the HAIC C is higher than that for the AIC. On the other hand, the KL information for the HAIC C nearly equals that for the AIC, except when p = 80 and p = 60, when it is significantly smaller than that for the AIC. p = 0, p = 5 p = 20, p = 5 p = 20, p = 0 p = 40, p = 5 p = 40, p = 20 Figure : Ratio of the average of each criterion to the risk function with n = 50 9

10 p = 0, p = 5 p = 40, p = 5 p = 40, p = 20 p = 80, p = 5 p = 80, p = 40 Figure 2: Ratio of the average of each criterion to the risk function with n = 00 0

11 p = 0, p = 5 p = 80, p = 5 p = 80, p = 40 p = 60, p = 5 p = 60, p = 80 Figure 3: Ratio of the average of each criterion to the risk function with n = 200

12 Table : Probabilities of selecting the true model and the KL information of the predicted values of the best model Probabilities % KL n p p HAIC C AIC HAIC C AIC Acknowledgments Ryoya Oda was supported by a Research Fellowship for Young Scientists from the Japan Society for the Promotion of Science. Hirokazu Yanagihara and Yasunori Fujikoshi were both partially supported by the Ministry of Education, Science, Sports, and Culture through Grantsin-Aid for Scientific Research C, #8K0345 and #6K00047, respectively. Appendix A Deriving MLEs and proof of 9 To derive MLEs, we use the following lemma e.g., Fujikoshi et al., 200, A.2.: Lemma A.. Let Y be an n p known matrix and A be an n k known matrix of rank k. Consider a function of p p positive definite matrix Σ and k p matrix Θ given by gθ, Σ = m log Σ + tr{y AΘΣ Y AΘ }, where m > 0 and all elements of Θ are bounded. Then, gθ, Σ takes the minimum at Θ = Θ = A A A Y, Σ = Σ = m Y I n P A Y, 2

13 and the minimum value is given by m log Σ + mp. First, we derive MLEs of Σ, Σ 22, α, α 2, B, B 2, Θ, Θ 2, Γ, and δ. From 6, we have 2lΘ, Σ; Y, X = np log 2π + n log Σ + tr{y ZΘ Σ Y ZΘ } + n log Σ 22 + tr{y 2 n δ Y ΓΣ 22 Y 2 n δ Y Γ }. Let g Θ, Σ = n log Σ +tr{y ZΘ Σ Y ZΘ } and g 2 δ, Γ, Σ 22 = n log Σ 22 + tr{y 2 n δ Y ΓΣ 22 Y 2 n δ Y Γ }. The set of unknown parameters {Θ, Σ} exhibits one-to-one correspondence with the set {α, B, C 2, δ, Γ, Σ, Σ 22 }, and each set of parameters for g and g 2 is separated. Thus, we only consider each minimization of g and g 2. Note that rankz = q + and rank{ n, Y } = p +. Then, from Lemma A., we can state that min g Θ, Σ = n log Σ + tr{y Z Θ Σ Y Z Θ } Θ,Σ = n log Σ + np, min δ,γ,σ 22 g 2 δ, Γ, Σ 22 = n log Σ 22 + tr{y 2 n δ Y Γ Σ 22 Y 2 n δ Y Γ } = n log Σ 22 + np p, where Θ = Z Z Z Y, Σ = n Y I n P Z Y, δ, Γ = { n, Y n, Y } n, Y Y 2 and Σ 22 = n Y 2 I n P n,y Y 2. Therefore, the MLEs of α, α 2, B, B 2, and Θ 2 can also be derived from Θ and δ. Next, we show 9. From Lemma A., it is clearly that 2l Θ, Σ; Y, X = np{log 2π + } + nlog Σ + log Σ 22. B Calculating and expanding bias Here, we present propositions for calculating and expanding the bias B KL in 0. B. Calculating bias and its proof Proposition B.. Suppose that q p < p. In overspecified models 7, we obtain the exact expression of the bias as follows: B KL = np + np n + q + n p q 2 + nn + p p n p 2 where Ξ is defined in 6. + nn + p p EY [trs n p 2 Σ ] + np p n p 2 E Y [trs Ξ ], B. The proof of Proposition B. is presented as follows. To calculate the bias, we describe a lemma for distributions of some statistics the proof is given in Appendix D. 3

14 Lemma B.. Suppose that q p < p. Let Σ 22 be the MLE of Σ 22 under model 6, and let [ Γ ] T = {Y I n J n Y } /2 {Y I n J n Y } Y I n J n Ω Γ Σ /2 22, B.2 ] [ δ U = { n, Y n, Y } /2 { n, Y Γ n, Y } n, Y Ω + Y Γ Σ /2 22, B.3 where Ω = 2 Γ, and and 2 are the n p and n p p partitioned matrices of =, 2. Then, T and U are independent of Y, and n Σ 22 and δ, Γ are mutually conditional independent under Y, and n Σ 22 Y W p p n p, Σ 22 ; Ω, T N p p p O p,p p, I p p I p, U N p+ p p O p +,p p, I p p I p +, where Ω = Ω I n P n,y Ω. Moreover, if the model is overspecified model 7, then n Σ 22 and Y are mutually independent, n Σ 22 and δ, Γ are also mutually independent, and n Σ 22 W p p n p, Σ 22, T = {Y I n J n Y } /2 Γ Γ Σ /2 U = { n, Y n, Y } /2 δ δ Γ Γ 22, Σ /2 22, where δ = α 2 Γ α and α and α 2 are the p - and p p -dimensional subvectors of α = α, α 2. From the general formula of the determinant of a partitioned matrix e.g., Lütkepohl, 997, , log Σ = log Σ + log Σ 22 holds. Under overspecified models 7, L Θ, Σ can be expressed as L Θ, Σ = np log 2π + nlog Σ + log Σ 22 + ntrσ Σ + tr{ Σ Θ Θ Z Z Θ Θ }. Thus, from the above equation and 9, the bias B KL in 0 can be expressed as [ B KL = np + EY ntrσ Σ + tr{ Σ Θ Θ Z Z Θ ] Θ }. We separate the second and third terms in the above equation into four terms in order to calculate the bias. It follows from the general formula for the inverse of a block matrix e.g., Harville, 997, Theorem 8.5. that Σ Σ O p,p p = Γ + O p p,p O p p,p p I p p Σ 22 Γ, I p p. Then, we can derive another expression of the bias B KL as follows: [ { }] B KL = np + ney [trσ Σ ] + ne Y tr Σ 22 Γ Γ, I p p Σ I p p 4

15 [ + EY tr{ Σ Θ Θ Z Z Θ ] Θ } [ { + E Y tr Σ 22 Γ, I p p Θ Θ Z Z Θ Θ Γ I p p }] = np + i + ii + iii + iv say, B.4 where Θ and Θ 2 are the q + p and q + p p submatrices of Θ = Θ, Θ 2. Now, we calculate i, ii, iii, and iv in B.4. Starting with i in B.4, it can be stated that nσ /2 Σ Σ /2 is distributed according to W p n q, I p. Thus, by using the expectation of an inverted Wishart distribution see, Watamori, 990, we have i = ne Y [trσ Σ ] = n 2 p n p q 2. B.5 Now we calculate ii in B.4. From Lemma B., by using the expectation of an inverted Wishart distribution e.g., Fujikoshi et al., 200, Theorem 2.2.7, the expectation of EY [ Σ 22 ] can be calculated as EY [ Σ 22 ] = ne Y [n Σ 22 n ] = n p 2 Σ 22. From the above equation and the fact that Σ 22 is independent of Γ, ii can be expressed as [ }] n 2 Γ ii = n p 2 E Y tr {Σ Σ 22 Γ, I p p. By the definition of T in B.2, we have Γ Σ Σ 22 Γ, I p p I p p = Σ Γ I p p Σ 22 Γ, I p p I p p {Y I n J n Y } /2 T Σ / Σ Γ {Y I n J n Y } /2 T Σ /2 22 O p p,p O p p,p p {Y I n J n Y } /2 T Σ / Σ Γ {Y I n J n Y } /2 T Σ /2 22 O p p,p O p p,p p {Y I n J n Y } /2 T T {Y I n J n Y } /2 O p,p p + Σ O p p,p O p p,p p = ii.a + ii.b + ii.c + ii.d say. From the general formula for the inverse of a block matrix, it is straightforward to observe that tr{ii.a} = p p. Since T and Y are mutually independent, we can calculate the expectations of tr{ii.b} and tr{ii.c} as follows: E Y [tr{ii.b}] = 0, E Y [tr{ii.c}] = 0. 5

16 Note that E Y [T T ] = p p I p. Then, from the independence of T and Y, we can calculate tr{ii.d} as E Y [tr{ii.d}] = p p E Y [ tr { Σ {Y I n J n Y } }]. Hence, ii can be expressed as ii = n2 p p n p 2 { + E Y [ tr { Σ {Y I n J n Y } }]}. B.6 Now we move on to calculating iii in B.4. Recall that n Σ = Y I n P Z Y and Θ = Z Z Z Y. Since it is straightforward to observe that Z Z Z I n P Z = O q+,n, n Σ and Θ are mutually independent from Cochran s Theorem e.g., Fujikoshi et al., 200, Theorem Thus, by using a property of conditional expectations, we obtain n iii = n p q 2 E Y = n n p q 2 trp ZtrI p = np q + n p q 2. [ tr{σ P ZY ZΘ P Z Y ZΘ } ] B.7 Finally, we calculate iv in B.4. Recall that Θ is expressed by α, α 2, B, and B 2. Then, we can express Γ, I p p Θ Θ as Γ, I p p Θ Θ = Γ α α B B, I p p α 2 α 2 B 2 B 2 = α 2 α 2 Γ α α, B 2 B 2 Γ B B. From the definitions of δ, α, and α 2, we have δ = α 2 Γ α. Thus, α 2 α 2 Γ α α = α 2 Γ α α 2 Γ α = δ δ + α 2 Γ α α 2 Γ α B.8 = δ δ + Γ Γ α { } =, α δ δ. B.9 Γ Γ Since it is straightforward that B 2 B Γ = Oq,p p and it is prudent that B 2 B Γ = O q,p p under overspecified models, we obtain B 2 B 2 B B Γ = B 2 B Γ B2 B Γ + B Γ Γ δ δ = 0 q, B. B.0 Γ Γ By using B.9 and B.0, we can express B.8 as Γ, I p p Θ Θ δ δ α = Γ Γ 0 q B 6

17 = Σ /2 22 U { n, Y n, Y } /2 α, 0 q B where U is defined in B.3. Thus, by using the above equation and the independence of U and Σ 22, iv can be expressed as follows: [ }] iv = np p n p 2 E Y tr {{ n, Y n, Y } n nα nα nα α + B. X XB From the general formula for the inverse of a block matrix, { n, Y n, Y } is expressed as { n, Y n, Y } n + ȳ = S ȳ ȳ S S ȳ S. Note that S and ȳ are mutually independent and E Y [ȳ ] = α. Thus, we can calculate iv as follows: iv = np p n p 2 = np p n p 2 [ [ { + E Y tr {nȳ α ȳ α + B X XB }S }]] [ [ { + E Y tr {Σ + B X XB }S }]]. B. Therefore, from B.4, B.5, B.6, B.7, and B., Proposition B. can be derived. B.2 Results for expanding bias and its proof Proposition B.2. Suppose that q p < p and Assumption A both hold. Then, under overspecified models 7, we have B KL = mn, p + Op max {n 2, n + λ q 3/2 }, B.2 as n, p/n c [0,, where λ q is the q-th diagonal element of Λ defined in 7. Here, mn, p is the constant given by mn, p = m n, p + m 2 n, p, where m n, p is defined in 3, and m 2 n, p is given by m 2 n, p = n2 p p { 2q + trφ ntrφ 2 ntrφ 2 }, n pn p in which Φ is defined by 9. The proof of Proposition B.2 is presented as follows. We use the following lemma the proof is given in Appendix D. Lemma B.2. Suppose that n q 2 > 0. Let G be random matrices defined by G = n + λ q G I q, 7

18 where G is defined by 8, λ q are given by 7, and Φ = n I q + Λ is defined by 9. Then, using the HHD asymptotic framework, we have G = O p, and G can be expanded as G = I q G + n + λq n + λ G2 + R G, q B.3 where R G = n + λ q 3/2 G G3 = O p n + λ q 3/2. Further, when Assumption A holds, we have E[trD Φ R G ] = On n + λ q 3/2, B.4 where D = + n I q + n Λ and Λ is given in 7. We focus on the case that q < p < p because the proof is simpler where p = q. Let V = H Σ /2 S Σ /2 H, B.5 where H is the p p orthogonal matrix given by 7. It is straightforward that V is distributed according to W p n, I p ; Λ, where Λ is given by Λ O q,p q Λ =. B.6 O p q,q O p q,p q Further, we partition V as V V 2 V =, V 2 V 22 where the sizes of V and V 22 are q q and p q p q, respectively. Let Ṽ be the p p random matrix given by Ṽ = Φ /2 V Φ /2. B.7 Since Φ = E Y [V ], it is straightforward that Ṽ is distributed according to W q n, Φ ; ΛΦ. Now, we calculate the parts of the expectations in B.. Using the definitions in 7 and B.5, we have nn + p p EY [trs n p 2 Σ ] + np p n p 2 E Y [trs Ξ ] = n2 p p [ n p 2 E Y trdv ], B.8 where D and the partitioned expression are given by D = + n I p + n Λ = D O q,p q O p q,q D 22 By applying the general formula for the inverse of a block matrix to V, trdv can be expressed as trdv = trd V + trd V V 2V 22 V 2V + trd 22V 22. Recall that V W p n, I p ; Λ. By using the properties of a partitioned non-central Wishart matrix see Kabe, 964, we can state that V W q n, I q ; Λ, V 22 W p qn q. 8

19 , I p q, V 2 V /2 N p q p O p q,p, I p I p q, and V, V 22 and V 2 V /2 are mutually independent. Thus, we can calculate the expectations of trd V V 2V22 V 2V and trd 22 V22 as follows: E Y [trd V V 2V 22 V 2V ] = p 2 n p 2 E Y [trd V ], EY [trd 22 V22 ] = n p 2 trd 22. Hence, from the above equations, E Y [ trdv ] is expressed as E Y [ trdv ] = n q 2 n p 2 E Y [trd V ] + n p 2 trd 22. B.9 From the definition of Ṽ in B.7, EY [trd V ] is expressed as E Y [trd Φ Ṽ ]. Let L = n + λ q Ṽ I q. Then, by using B.3 and B.4 in Lemma B.2, we can expand EY [trd V ] as E Y [trd V ] = trd Φ + n + λ q E Y [trd Φ L 2 ] + On n + λ q 3/2. B.20 From the expectation of a non-central Wishart distribution, n + λ q E Y [trd Φ L 2 ] can be calculated as EY [trd Φ L 2 ] n + λ q = 2 trφ 3 n 5 trφ 2 + n From B.2, we can express B.20 as E Y [trd V ] = trφ 2 trφ trφ 2 2 trφ trφ 2 n n 2q + trφ + q n n trd Φ. B.2 2q + trφ + q n n + On n + λ q 3/2. B.22 Thus, by using B.8, B.9, B.22, and Proposition B., we can expand the bias as follows: B KL = np + np n + q + n p q 2 + nn + p p n p 2 + n2 n q 2p p n p 2n p 2 + n 2 p p n p 2n p 2 { trφ 2 trφ 2 + 2q + trφ + q } n n + p q + Opn + λ q 3/2. n Note that n p q 2 = n p n p 2 = n p { + q + 2 n p + } q + 22 n p 2 + On 3, }, { + 2 n p + 4 n p 2 + On 3 9

20 n p 2 = { + 2 n p n p 2n p 2 = n p + } 4 n p 2 + On 3, { + 2 n p + 2 n p + n pn p 4 + n pn p + 8 n p n p 3 8 n p 2 n p + 8 n pn p 2 + On 4 4 n p n p 2 }. Therefore, by using the above equations, we can derive B.2. C Asymptotic properties of τ, τ 2 and τ 3 Here we present the results for expanding τ, τ 2 and τ 3 defined by 4 and 5. C. Expanding τ, τ 2, and τ 3 and the proof Proposition C.. Suppose that q p < p. Then, we have τ = p p trφ + O p pn /2 n + λ q, τ 2 = np p trφ 2 + O p pn /2 n + λ q 2, τ 3 = np p trφ 2 + O p pn /2 n + λ q 2, as n, p/n c [0,, where Φ is defined by 9. The proof of Proposition C. is presented as follows. To expand τ, τ 2 and τ 3 by using a HHD asymptotic framework, we use the following lemma which was essentially obtained in Wakaki et al. 204: Lemma C.. Let F h = F F and F e be independently distributed according to W p q, I p ; M M and W p n, I p, respectively. Here, F is a q p random matrix distributed according to N q p M, I p I q. Put B = F F, W = B /2 F F e F B /2. Then, B and W are independently distributed according to W q p, I q ; MM and W q n p+q, I q, respectively. Further, the nonzero eigenvalues of F h Fe we have tr{f e F e + F h } = tr{w W + B } + p q, are the same as those of BW, and tr [ {F e F e + F h } 2] = tr [ {W W + B } 2] + p q. We consider the distributions of S e and S h defined in 5. Since Y is distributed according to N n p ZΘ, Σ I n, it is straightforward that S e and S h are mutually independent and S e W p n q, Σ, S h W p q, Σ ; Ξ, 20

21 where Ξ is given by 6. Let T e = H Σ /2 S eσ /2 H, T h = H Σ /2 S hσ /2 H, where H is the p p orthogonal matrix defined in 7. non-central Wishart matrices that It follows from the properties of T e W p n q, I p, T h W p q, I p ; Λ, where Λ is defined by B.6. Then, we can express tr{s e S e +S h } and tr[{s e S e +S h } 2 ] as tr{s e S e + S h } = tr{t e T e + T h }, tr[{s e S e + S h } 2 ] = tr [ {T e T e + T h } 2]. First, we express tr{s e S e + S h } and tr[{s e S e + S h } 2 ] as functions of q q matrices in order to examine their asymptotic behaviors. From the definition of the non-central Wishart distribution, a different expression for T h is given by T h = U h U h, where U h N q p Λ, I p I q and Λ is the q p partitioned matrix of Λ = Λ, O p,p q satisfying Λ Λ = Λ. Let W = U h U h, W 2 = W /2 U h T e U h W /2. C. Then, from Lemma C., W and W 2 are independently distributed according to W q p, I q ; Λ and W q n p, I q, respectively, and we have tr{s e S e + S h } = tr{w 2 W + W 2 } + p q, tr[{s e S e + S h } 2 ] = tr [ {W 2 W + W 2 } 2] + p q. Next, we expand tr{w 2 W + W 2 } and tr [ {W 2 W + W 2 } 2]. Let W 3 = Φ /2 W + W 2 Φ /2. C.2 Since W and W 2 are mutually independent, we can state that W 3 W q n, Φ ; ΛΦ. Therefore, using B.3 in Lemma B.2, W 3 can be expanded as W 3 = I q + O p n + λ q /2. C.3 By calculating the expectation and variance of W 2, we can also expand W 2 as Thus, Φ /2 W 2 Φ /2 is expressed as W 2 = n p I q + O p n /2. Φ /2 W 2 Φ /2 = n p Φ + O p n /2 n + λ q. C.4 From C.3 and C.4, the following equations can be derived: tr{w 2 W + W 2 } = trφ /2 W 2 Φ /2 W 3 = n p trφ + O p n /2 n + λ q, tr{w 2 W + W 2 } 2 = n p 2 trφ 2 + O p n 3/2 n + λ q 2, 2

22 tr [ {W 2 W + W 2 } 2] = tr{φ /2 W 2 Φ /2 W 3 2 } Thus, we can derive the following equations: = n p 2 trφ 2 + O p n 3/2 n + λ q 2. p p n p tr{w 2 W + W 2 } = p p trφ + O p pn /2 n + λ q, np p n p 2 tr{w 2W + W 2 } 2 = np p trφ 2 + O p pn /2 n + λ q 2, np p n p 2 tr [ {W 2 W + W 2 } 2] = np p trφ 2 + O p pn /2 n + λ q 2. Therefore, we can derive the result of Proposition C.. C.2 Expanding the expectations of τ, τ 2, and τ 3 Proposition C.2. Suppose that q p < p and Assumptions A and A2 hold. Then, we have E Y [ τ ] = p p trφ + Opn n + λ q /2, E Y [ τ 2 ] = np p trφ 2 + Opn n + λ q /2, E Y [ τ 3 ] = np p trφ 2 + Opn n + λ q /2, as n, p/n c [0,, where Φ is defined by 9. The proof of Proposition C.2 is presented as follows. Let W 3 = n + λ q W 3 I q, where W 3 is given by C.2. Then, from Lemma B.2, we can observe that W3 is expanded as W3 = I q R W3, where R W3 = n + λ q /2 W3 W 3. We consider EY [tr{w 2W + W 2 }], EY [tr{w 2W + W 2 } 2 ] and EY [tr[{w 2W + W 2 } 2 ]], where W and W 2 are given by C.. By using R W3, we have E Y [tr{w 2 W + W 2 }] = E Y [trφ /2 W 2 Φ /2 ] E Y [trφ /2 W 2 Φ /2 R W3 ], E Y [tr{w 2 W + W 2 } 2 ] = E Y [trφ /2 W 2 Φ /2 2 ] 2E Y [trφ /2 W 2 Φ /2 trφ /2 W 2 Φ /2 R W3 ] C.5 + E Y [trφ /2 W 2 Φ /2 R W3 2 ], C.6 E Y [tr[{w 2 W + W 2 } 2 ]] = E Y [tr{φ /2 W 2 Φ /2 2 }] 2E Y [tr{φ /2 W 2 Φ /2 2 R W3 }] + E Y [tr{φ /2 W 2 Φ /2 R W3 2 }]. C.7 Since Φ /2 W 2 Φ /2 is distributed according to W q n p, Φ, we can calculate the expectation as follows: E Y [trφ /2 W 2 Φ /2 ] = n p trφ + On + λ q, E Y [trφ /2 W 2 Φ /2 2 ] = n p 2 trφ 2 + Onn + λ q 2, E Y [tr{φ /2 W 2 Φ /2 2 }] = n p 2 trφ 2 + Onn + λ q 2. C.8 C.9 C.0 22

23 By using the Cauchy-Schwarz inequality, we have E Y [trφ /2 W 2 Φ /2 R W3 ] = EY [trφ /2 W 2 Φ /2 W W 3 ] n + λq 3 n + λq E Y [tr{φ /2 W 2 Φ /2 4 }] /2 E Y [tr W 4 3 ] /2 E Y [trw 2 3 ] /2 = On + λ q /2, C. E Y [trφ /2 W 2 Φ /2 trφ /2 W 2 Φ /2 R W3 ] n + λq E Y [tr{φ /2 W 2 Φ /2 8 }] /4 E Y [tr W 4 3 ] /4 E Y [trw 2 3 ] /2 = On + λ q /2, C.2 E Y [trφ /2 W 2 Φ /2 R W3 2 ] n + λ q E Y [tr{φ /2 W 2 Φ /2 4 } 2 ] /4 E Y [tr{ W 4 3 } 2 ] /4 E Y [tr{w 2 3 } 2 ] /2 = On + λ q /2, C.3 E Y [tr{φ /2 W 2 Φ /2 2 R W3 }] n + λq E Y [tr{φ /2 W 2 Φ /2 4 }] /2 E Y [tr W 8 3 ] /4 E Y [trw 4 3 ] /4 = On + λ q /2, C.4 E Y [tr{φ /2 W 2 Φ /2 R W3 2 }] n + λ q 2 E Y [tr{φ /2 W 2 Φ /2 4 }] /2 E Y [tr W 8 3 ] /2 E Y [trw 8 3 ] /2 = On + λ q /2. C.5 Therefore, from C.5-C.5, we have E Y [ τ ] = p p trφ + Opn n + λ q /2, E Y [ τ 2 ] = np p trφ 2 + Opn n + λ q /2, E Y [ τ 3 ] = np p trφ 2 + Opn n + λ q /2. D Proofs of Lemma B. and B.2 D. Proof of Lemma B. From a property of a conditional distribution of a multivariate normal distribution e.g., Srivastava and Khatri, 979, we have Y 2 Y N n p pω + Y Γ, Σ 22 I n. Then, another expression of the true model M is given as follows: Y N n p, Σ I n, Y 2 Y N n p pω + Y Γ, Σ 22 I n. 23

24 On the other hand, we can express Y 2 as Y 2 = Ω + Y Γ + AΣ /2 22, D. where A and Y are mutually independent, and A is distributed according to N n p p O n,p p, I p p I n. By using D., n Σ 22, Γ, and δ, Γ are expressed as n Σ 22 = Ω + AΣ /2 22 I n P n,y Ω + AΣ /2 22. D.2 Γ = {Y I n J n Y } Y I n J n Ω + AΣ / Γ, δ = { n, Y Γ n, Y } n, Y Ω + Y Γ + AΣ /2 22. Hence, the conditional distribution of n Σ 22, Γ, and δ, Γ under Y are n Σ 22 Y W p p n p, Σ 22 ; Ω, Γ Y N p p p {Y I n J n Y } Y I n J n Ω + Γ, Σ 22 {Y I n J n Y }, δ Γ Y N p+ p p { n, Y n, Y } n, Y Ω + Y Γ, Σ 22 { n, Y n, Y }. From the above equations, we can state that T and U are distributed according to N p p p O p,p p, I p p I p and N p + p p O p+,p p, I p p I p+, respectively, and are independent of Y. Next, we show that n Σ 22 and δ, Γ are mutually conditionally independent under Y. It is straightforward that { n, Y n, Y } n, Y I n P n,y = O p +,n. D.3 Hence, from Cochran s Theorem, the conditional independence of n Σ 22 and δ, Γ is obtained. Finally, we consider the case that the model is overspecified. By the definition of overspecified models in 7, the equation Ω = n δ holds. Thus, from D.2, we can state that n Σ 22 = Σ /2 22 A I n P n,y AΣ /2 22 W p p n p, Σ 22, and n Σ 22 and Y are mutually independent. Also, T and U are expressed as T = {Y I n J n Y } /2 Γ Γ Σ /2 U = { n, Y n, Y } /2 δ δ Γ Γ 22, Σ /2 22, and n Σ 22 and δ, ˆΓ are mutually independent from D.3. D.2 Proof of Lemma B.2 This proof is based on an idea in Hashiyama et al From the expectation of a noncentral Wishart distribution, we can calculate the expectations as E[G] = I q and E [ G I q 2] = n trφ 2 n trφ 2 + 2q + trφ 24

25 = On + λ q. The above equation implies that G = O p. By applying Taylor expansion to G, we derive G = I q + G = I q G + n + λq n + λq n + λ G2 + R G, q D.4 where R G is a remainder term in the Taylor expansion. First, we calculate an explicit form of R G. From D.4, the following equation can be derived: I q = GG = I q + = I q + GR G G I q n + λq n + λ q 3/2 G 3. From the above equation, R G can be expressed as R G = G + n + λq n + λ G2 + R G q n + λ q 3/2 G G3. Next, we show B.4. By using the Cauchy-Schwarz inequality, we have E[ trd Φ R G ] n + λ q 3/2 { E[trD 2 Φ 2 G6 ]E[trG 2 ]} /2. We can observe that D Φ = On because nd Φ 2 = q 2 n + + λi = O. i= n + λ i By using results for expectations of the multivariate normal random vector e.g., Gupta and Nagar, 2000, we can calculate E[trD 2 Φ 2 G6 ] = On 2. Therefore, we derive B.4. References [] Akaike, H Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory eds. B. N. Petrov & F. Csáki, pp Akadémiai Kiadó, Budapest. [2] Alves, J. C. L. & Poppi, R. J Quantification of conventional and advanced biofuels contents in diesel fuel blends using near-infrared spectroscopy and multivariate calibration. Fuel, 65, [3] Bedrick, E. J. & Tsai, C.-L Model selection for multivariate regression in small samples. Biometrics, 50, [4] Bro, R Multivariate calibration: What is in chemometrics for the analytical chemist? Anal. Chim. Acta, 500, [5] Brown, P. J Multivariate calibration. J. R. Statist. Soc., B,

26 [6] Carvalho, B. M. A., Carvalho, L. M., Coimbra, J. S. R., Minim, L. A., Barcellos, E. S., Júnior, W. F. S., Detmann, E. & Carvalho, G. G. P Rapid detection of whey in milk powder samples by spectrophotometric and multivariate calibration. Food Chem., 74, 7. [7] Fujikoshi, Y. and Nishii, R Selection of variables in multivariate regression. Hiroshima Math. J., 3, [8] Fujikoshi, Y., Sakurai, T. & Yanagihara, H Consistency of high-dimensional AICtype and C p -type criteria in multivariate linear regression. J. Multivariate Anal., 23, [9] Fujikoshi, Y., Ulyanov, V. V. & Shimizu, R Multivariate Statistics: High- Dimensional and Large-Sample Approximations. John Wiley & Sons, Inc., Hoboken, New Jersey. [0] Gupta, A. & Nagar, D Matrix Variate Distributions. Chapman and Hall/CRC, Boca Raton, FL. [] Harville, D. A Matrix Algebra from a Statistician s Perspective. Springer-Verlag, New York. [2] Hashiyama, Y., Yanagihara, Y. & Fujikoshi, Y Jackknife bias correction of the AIC for selecting variables in canonical correlation analysis under model misspecification. Linear Algebra Appl., 455, [3] Hurvich, C. M. & Tsai, C.-L Regression and time series model selection in small samples. Biometrika, 76, [4] Kabe, D. G A note on the Bartlett decomposition of a Wishart matrix. J. Roy. Statist. Soc. Ser. B, 26, [5] Lütkepohl, H Handbook of Matrices. Wiley, Chichester. [6] Srivastava, M. S. & Khatri, C. G An Introduction to Multivariate Statistics. North- Holland, New York. [7] Sugiura, N Further analysis of the data by Akaike s information criterion and the finite corrections, Comm. Statist. Theory Methods, A7, [8] Wakaki, H., Fujikoshi, Y. & Ulyanov, V. V Asymptotic expansions of the distributions of MANOVA test statistics when the dimension is large. Hiroshima Math. J., 44, [9] Watamori, Y On the moments of traces of Wishart and inverted Wishart matrices. South African Statist. J., 24,

Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis

Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis Yasunori Fujikoshi and Tetsuro Sakurai Department of Mathematics, Graduate School of Science,