Selection of Model Selection Criteria for Multivariate Ridge Regression
(Last Modified: January, 24)

Isamu Nagai
Department of Mathematics, Graduate School of Science, Hiroshima University
1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8626, Japan
E-mail: d09348@hiroshima-u.ac.jp

Abstract

In the present study, we consider the selection of model selection criteria for multivariate ridge regression. There are several model selection criteria for selecting the ridge parameter in multivariate ridge regression, e.g., the C_p criterion and the modified C_p (MC_p) criterion. We propose the generalized C_p (GC_p) criterion, which includes the C_p and MC_p criteria as special cases. The GC_p criterion is specified by a non-negative parameter λ, which is referred to as the penalty parameter. We attempt to select an optimal penalty parameter such that the predictive mean square error (PMSE) of the predictor of ridge regression after optimizing the ridge parameter is minimized. Through numerical experiments, we verify that the proposed optimization methods exhibit better performance than the conventional optimization methods, i.e., optimizing only the ridge parameter by minimizing the C_p or MC_p criterion.

Key words: Asymptotic expansion; Generalized C_p criterion; Model selection criterion; Multivariate linear regression model; Ridge regression; Selection of the model selection criterion.

1. Introduction

In the present paper, we deal with a multivariate linear regression model with n observations of a p-dimensional vector of response variables and a k-dimensional vector of regressors (for more detailed information, see, for example, Srivastava, 2002, Chapter 9; Timm, 2002, Chapter 4). Let Y = (y_1, ..., y_n)', X, and E = (ε_1, ..., ε_n)' be the n × p matrix of response variables, the n × k matrix of non-stochastic centered explanatory variables (i.e., X'1_n = 0_k) of rank(X) = k (< n), and the n × p matrix of error variables, respectively, where n is the sample size, 1_n is an n-dimensional vector of ones, and 0_k is a k-dimensional vector of zeros. Suppose that n − k − p − 2 > 0 and ε_1, ..., ε_n are i.i.d. N_p(0_p, Σ), where Σ

is a p × p unknown covariance matrix. Then, the matrix form of the multivariate linear regression model is expressed as Y = 1_n μ' + XΞ + E, where μ is a p-dimensional unknown location vector and Ξ is a k × p unknown regression coefficient matrix. This model can also be expressed as Y ~ N_{n×p}(1_n μ' + XΞ, Σ ⊗ I_n). Note that X is centered. The maximum likelihood or least squares (LS) estimators of μ and Ξ are given by μ̂ = Y'1_n/n and Ξ̂ = (X'X)^{-1}X'Y, respectively. Since μ̂ and Ξ̂ are simple and are the unbiased estimators of μ and Ξ, the LS estimators are widely used in actual data analysis (see, e.g., Dien et al., 2006; Sârbu et al., 2008; Saxén and Sundell, 2006; Skagerberg, MacGregor, and Kiparissides, 1992; Yoshimoto, Yanagihara, and Ninomiya, 2005). However, it is well known that the estimator of Ξ becomes unstable when multicollinearity occurs in X. In order to avoid this problem, ridge regression was proposed by Hoerl and Kennard (1970) for the case p = 1. Several studies extended this univariate ridge regression to the multivariate case, e.g., Brown and Zidek (1980), Haitovsky (1987), and Yanagihara and Satoh (2010). The ridge regression estimator of Ξ is given as

Ξ̂_θ = M_θ^{-1} X'Y, where M_θ = X'X + θ I_k,

and θ is a nonnegative value, which is referred to as a ridge parameter. Since the estimate Ξ̂_θ depends strongly on the value of the ridge parameter, the optimization of θ is an important problem in ridge regression. An optimal θ is commonly determined by minimizing the predictive mean square error (PMSE) of the predictor of Y, i.e., Ŷ_θ = 1_n μ̂' + X Ξ̂_θ. However, we cannot directly use the PMSE to optimize θ, because unknown parameters are included in the PMSE. Hence, we adopt an optimization method using a model selection criterion, i.e., an estimator of the PMSE, instead of the unknown PMSE. As an estimator of the PMSE, Yanagihara and Satoh (2010) proposed a C_p criterion. This criterion includes, as special cases, the C_p criterion for selecting variables in a univariate linear model proposed by Mallows (1973; 1995) and that for selecting variables in a multivariate linear model proposed by Sparks, Coutsourides, and Troskie (1983). Yanagihara and Satoh (2010) also proposed the modified C_p (MC_p) criterion such that the bias of the C_p criterion for choosing the ridge parameter with respect to the PMSE is completely corrected under a fixed θ. This criterion coincides with the bias-corrected C_p criterion proposed by Fujikoshi and Satoh (1997) when θ = 0.
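To make the estimator concrete, here is a minimal NumPy sketch of the multivariate ridge computation described above; it assumes a column-centered design matrix X (n × k) and response matrix Y (n × p), and the function and variable names are ours, not from the paper.

```python
import numpy as np

def multivariate_ridge(X, Y, theta):
    """Multivariate ridge estimate Xi_theta = (X'X + theta I_k)^{-1} X'Y.

    X is assumed to be centered column-wise (X'1_n = 0_k); theta >= 0.
    """
    n, k = X.shape
    M_theta = X.T @ X + theta * np.eye(k)        # M_theta = X'X + theta I_k
    Xi_hat = np.linalg.solve(M_theta, X.T @ Y)   # ridge coefficient matrix (k x p)
    mu_hat = Y.mean(axis=0)                      # location estimate, Y'1_n / n
    Y_pred = mu_hat + X @ Xi_hat                 # predictor 1_n mu' + X Xi_theta
    return Xi_hat, mu_hat, Y_pred
```

Setting theta = 0 recovers the least squares estimator (X'X)^{-1}X'Y, while larger theta shrinks the coefficients toward zero.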

The MC_p criterion has several desirable properties as an estimator of the PMSE, as described by, e.g., Fujikoshi, Yanagihara, and Wakaki (2005) and Yanagihara and Satoh (2010). Unfortunately, optimizing θ by minimizing MC_p, i.e., an unbiased estimator of the PMSE, does not always minimize the PMSE of Ŷ_θ. This indicates that there will be an optimal model selection criterion for selecting θ. Thus, we propose a generalized C_p (GC_p) criterion that includes the C_p and MC_p criteria as special cases (originally, the GC_p criterion was proposed by Atkinson, 1980, for selecting variables in the univariate linear model). The GC_p criterion is specified by a non-negative parameter λ, which is referred to as the penalty parameter. From the viewpoint of making the PMSE of the predictor of Y after optimizing θ small, we select the optimal penalty parameter λ; this is essentially the selection of the model selection criterion. In the present paper, we optimize λ by the following three methods:

(Double optimization): We optimize θ and λ simultaneously by minimizing GC_p and the penalty selection criterion, respectively.
(Optimization of λ with an approximated value of an optimal θ): We optimize λ by minimizing the penalty selection criterion constructed from the approximated value of the optimal θ.
(Asymptotic optimization of λ): We calculate an asymptotically optimal λ from an asymptotic expansion of the PMSE, and then estimate this asymptotically optimal λ.

Through the optimization of the model selection criterion, we can perform a reasonable optimization of θ. The remainder of the present paper is organized as follows: In Section 2, we propose the GC_p criterion, which includes the criteria proposed by Yanagihara and Satoh (2010) as special cases. In Section 3, we propose three optimization methods for λ. In Section 4, we compare the optimization methods by conducting numerical studies. Finally, technical details are provided in the Appendix.

2. Generalized C_p Criterion

In this section, we propose the GC_p criterion for optimizing the ridge parameter θ, which includes the C_p and MC_p criteria proposed by Yanagihara and Satoh (2010). Moreover, we present several mathematical properties of the θ obtained by minimizing the GC_p criterion. The PMSE of Ŷ_θ is defined as

PMSE[Ŷ_θ] = E_Y[E_U[tr{(U − Ŷ_θ)'(U − Ŷ_θ) Σ^{-1}}]],

where U is a random variable matrix that is independent of Y and has the same distribution as Y. The C_p criterion proposed by Yanagihara and Satoh (2010) is a rough estimator of the PMSE of Ŷ_θ, defined by

C_p(θ) = tr(W_θ S^{-1}) + 2p tr(M_θ^{-1} M_0),

where W_θ is the residual sum of squares matrix defined by W_θ = (Y − Ŷ_θ)'(Y − Ŷ_θ), and S is an unbiased estimator of Σ defined by S = W_0/(n − k − 1). From the definition of the C_p criterion, the first term of C_p measures the closeness of the ridge regression to the data, and the second term evaluates the penalty for the complexity of the ridge regression. However, the C_p criterion has a bias with respect to the PMSE. The MC_p criterion proposed by Yanagihara and Satoh (2010) is an exact unbiased estimator of the PMSE. By neglecting terms that are independent of θ, MC_p is defined as

MC_p(θ) = c_M tr(W_θ S^{-1}) + 2p tr(M_θ^{-1} M_0),

where c_M = 1 − (p + 1)/(n − k − 1). By comparing the two criteria, we can see that the difference between C_p and MC_p is the coefficient in front of tr(W_θ S^{-1}).

Thus, we can generalize the model selection criterion for optimizing the ridge parameter as

GC_p(θ, λ) = λ tr(W_θ S^{-1}) + 2p tr(M_θ^{-1} M_0),   (2.1)

where λ is a non-negative parameter. Note that GC_p(θ, 1) = C_p(θ) and GC_p(θ, c_M) = MC_p(θ). In this criterion, the penalty for the complexity of the model, which is the second term of (2.1), becomes large when λ becomes small. This means that λ controls the penalty for the complexity of the model in the criterion (2.1). Hence, we can regard λ as a penalty parameter. In the present paper, we consider the optimization of λ in order to obtain an optimal θ that further reduces the PMSE. When λ is fixed, the optimized ridge parameter θ̂(λ) is obtained as

θ̂(λ) = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ).   (2.2)

Since θ̂(λ) is a minimizer of GC_p(θ, λ), the following equation holds:

∂GC_p(θ, λ)/∂θ |_{θ = θ̂(λ)} = 0.   (2.3)

Note that θ̂(λ) changes with λ.
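The criterion (2.1) and the one-dimensional minimization (2.2) can be carried out numerically; the following sketch, under the same assumptions as the previous one, evaluates GC_p(θ, λ) and minimizes it over θ on a bounded interval with SciPy. The finite upper bound stands in for θ = ∞ and is an implementation choice, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gcp_criterion(theta, lam, X, Y):
    """GC_p(theta, lambda) = lambda * tr(W_theta S^{-1}) + 2p tr(M_theta^{-1} M_0)."""
    n, k = X.shape
    p = Y.shape[1]
    Yc = Y - Y.mean(axis=0)                       # remove the location term 1_n mu'
    M0 = X.T @ X
    M_theta = M0 + theta * np.eye(k)
    resid = Yc - X @ np.linalg.solve(M_theta, X.T @ Yc)   # Y - Yhat_theta
    W_theta = resid.T @ resid
    resid0 = Yc - X @ np.linalg.solve(M0, X.T @ Yc)
    S = (resid0.T @ resid0) / (n - k - 1)         # unbiased estimator of Sigma
    return lam * np.trace(np.linalg.solve(S, W_theta)) \
        + 2 * p * np.trace(np.linalg.solve(M_theta, M0))

def theta_hat(lam, X, Y, upper=1e6):
    """theta_hat(lambda): minimizer of GC_p(., lambda) over a bounded interval."""
    res = minimize_scalar(gcp_criterion, bounds=(0.0, upper),
                          args=(lam, X, Y), method='bounded')
    return res.x
```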

Here, we obtain the following mathematical properties of θ̂(λ) (the proof is provided in Appendix A.1):

Theorem 2.1. Let

(z_1, ..., z_k)' = Q'X'Y S^{-1/2},   (2.4)
r_{λ,j} = (λ ||z_j||² − p d_j)/(p d_j²), (j = 1, ..., k),   (2.5)

where z_j is a p-dimensional vector, Q is a k × k orthogonal matrix that diagonalizes X'X, i.e., Q'X'XQ = D = diag(d_1, ..., d_k), d_i (i = 1, ..., k) are the eigenvalues of X'X, and r_{λ,1}^+ ≥ ... ≥ r_{λ,m}^+ (m ≤ k) are the positive values among r_{λ,1}, ..., r_{λ,k}. Then, θ̂(λ) has the following properties:

1. θ̂(λ) is a monotonic decreasing function with respect to λ.
2. θ̂(λ) is not 0 when λ ∈ [0, ∞).
3. θ̂(λ) > (r_{λ,1}^+)^{-1} when r_{λ,1}^+ exists; θ̂(λ) = ∞ when r_{λ,1}^+ does not exist, i.e., when max_{j=1,...,k} r_{λ,j} ≤ 0.
4. θ̂(∞) = 0 and θ̂(0) = ∞.
5. θ̂(λ) = ∞ for any λ < min_{j=1,...,k} p d_j/||z_j||².

We suppose that d_i = O(n). However, we must use an iterative computational algorithm to optimize θ because we cannot obtain θ̂(λ) in closed form. In order to reduce the number of computational tasks, we consider approximating θ̂(λ) using an asymptotic expansion. Equation (2.3) implies an asymptotic expansion of the GC_p criterion, and from this expansion we obtain the asymptotic expansion of θ̂(λ) as follows:

Theorem 2.2. θ̂(λ) can be expanded as θ̂(λ) = θ_{(L)}(λ) + O_p(n^{-L}), where

θ_{(L)}(λ) = p b_1/(λ a_1) + (1/(2λ a_1)) Σ_{l=1}^{L−1} n^{-l} (−1)^{l+1} (l + 1) θ_{(L−1)}^l(λ) {λ(l + 2) a_{l+1} θ_{(L−1)}(λ) − 2p b_{l+1}}.   (2.6)

Here, θ_{(0)}(λ) = 0, V = X'Y S^{-1} Y'X, a_j = n^j tr(V M_0^{-(j+2)}), b_j = n^j tr(M_0^{-j}), and θ_{(L)}^l(λ) denotes {θ_{(L)}(λ)}^l. The proof of this theorem is given in Appendix A.2. Note that θ_{(L)}(λ) can be used as an approximated value of θ̂(λ). There is a one-to-one correspondence between θ_{(1)}(λ) = p b_1/(λ a_1) and λ, and θ_{(1)}(λ) satisfies properties 1, 2, and 4 in Theorem 2.1.
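A sketch of the first-order approximation θ_{(1)}(λ) = p b_1/(λ a_1) from Theorem 2.2, with a_1 and b_1 computed from their definitions; as above, the helper names are illustrative and the assumptions are the same as in the earlier sketches.

```python
import numpy as np

def theta_first_order(lam, X, Y):
    """First-order approximation theta_(1)(lambda) = p * b1 / (lambda * a1)."""
    n, k = X.shape
    p = Y.shape[1]
    Yc = Y - Y.mean(axis=0)
    M0_inv = np.linalg.inv(X.T @ X)
    resid0 = Yc - X @ M0_inv @ X.T @ Yc
    S = (resid0.T @ resid0) / (n - k - 1)
    V = X.T @ Yc @ np.linalg.solve(S, Yc.T @ X)   # V = X'Y S^{-1} Y'X
    a1 = n * np.trace(V @ np.linalg.matrix_power(M0_inv, 3))   # a_1 = n tr(V M_0^{-3})
    b1 = n * np.trace(M0_inv)                                   # b_1 = n tr(M_0^{-1})
    return p * b1 / (lam * a1)
```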

3. Optimization of the Penalty in the GC_p Criterion

3.1. Double optimization of θ and λ

In the previous section, we considered the model selection criterion for selecting θ, which can be regarded as an estimator of PMSE[Ŷ_θ]. By minimizing this estimator of the PMSE of Ŷ_θ, we expect to reduce the PMSE of Ŷ_θ. However, since the optimal ridge parameter changes with the data, it is important to reduce not the PMSE of Ŷ_θ but rather the PMSE of Ŷ_θ̂, i.e., of the predictor of Y after optimizing θ. In this section, we consider optimizing λ using PMSE[Ŷ_θ̂(λ)], where Ŷ_θ̂(λ) = 1_n μ̂' + X Ξ̂_θ̂(λ). Without loss of generality, we can assume that the covariance matrix of y_i is I_p in PMSE[Ŷ_θ̂(λ)]. Therefore, from Efron (2004), we obtain PMSE[Ŷ_θ̂(λ)], which is a function of λ, as follows:

PMSE[Ŷ_θ̂(λ)] = E_Y[tr(W_θ̂(λ) Σ^{-1})] + 2 E_Y[Σ_{i=1}^n Σ_{j=1}^p ∂(Ŷ_θ̂(λ))_{ij}/∂(Y)_{ij}],

where (A)_{ij} is the (i, j)th element of A. Since θ̂(λ) depends on (Y)_{ij}, we can see that

∂(Ŷ_θ̂(λ))_{ij}/∂(Y)_{ij} = ∂(Ŷ_θ)_{ij}/∂(Y)_{ij} |_{θ=θ̂(λ)} + ∂(Ŷ_θ)_{ij}/∂θ |_{θ=θ̂(λ)} · ∂θ̂(λ)/∂(Y)_{ij}.   (3.1)

The first term of the above equation is calculated as

∂(Ŷ_θ)_{ij}/∂(Y)_{ij} |_{θ=θ̂(λ)} = ∂(1_n μ̂')_{ij}/∂(Y)_{ij} + ∂(X M_θ^{-1} X'Y)_{ij}/∂(Y)_{ij} |_{θ=θ̂(λ)} = ∂(1_n μ̂')_{ij}/∂(Y)_{ij} + (X M_θ̂(λ)^{-1} X')_{ii}.

Note that Σ_{i=1}^n Σ_{j=1}^p ∂(1_n μ̂')_{ij}/∂(Y)_{ij} = p and Σ_{i=1}^n Σ_{j=1}^p (X M_θ̂(λ)^{-1} X')_{ii} = p tr(M_θ̂(λ)^{-1} M_0). Next, we consider the second term of (3.1). Note that

∂(Ŷ_θ)_{ij}/∂θ = ∂(X M_θ^{-1} X'Y)_{ij}/∂θ = −(X M_θ^{-2} X'Y)_{ij}.

Hence, we derive

PMSE[Ŷ_θ̂(λ)] = E_Y[tr(W_θ̂(λ) Σ^{-1})] + 2p{E_Y[tr(M_θ̂(λ)^{-1} M_0)] + 1} − 2 E_Y[Σ_{i=1}^n Σ_{j=1}^p (X M_θ̂(λ)^{-2} X'Y)_{ij} ∂θ̂(λ)/∂(Y)_{ij}].   (3.2)

Based on this result, we need only obtain ∂θ̂(λ)/∂(Y)_{ij} in order to calculate PMSE[Ŷ_θ̂(λ)]. This derivative leads to the following theorem (the proof is given in Appendix A.3):

Theorem 3.1. The PMSE of Ŷ_θ̂(λ) is expressed as

PMSE[Ŷ_θ̂(λ)] = E_Y[tr(W_θ̂(λ) Σ^{-1}) + 2p{tr(M_θ̂(λ)^{-1} M_0) + 1} + 4B(θ̂(λ))],   (3.3)

where

B(θ) = λθ tr(V M_θ^{-5} M_0) / {λ tr(V M_θ^{-3}) − 3λθ tr(V M_θ^{-4}) + 2p tr(M_θ^{-3} M_0)}.

By neglecting the terms that are independent of λ, we define the penalty selection criteria for optimizing λ as follows:

Definition 3.1. The penalty selection criteria for optimizing λ are defined as

C_p^#(λ) = tr(W_θ̂(λ) S^{-1}) + 2p tr(M_θ̂(λ)^{-1} M_0) + 4B(θ̂(λ)),
MC_p^#(λ) = c_M tr(W_θ̂(λ) S^{-1}) + 2p tr(M_θ̂(λ)^{-1} M_0) + 4B(θ̂(λ)),

where θ̂(λ) is given by (2.2) and c_M = 1 − (p + 1)/(n − k − 1).

Here, note that C_p^#(λ) is obtained by substituting S for Σ when we neglect the terms that are independent of λ in (3.3). However, a bias arises because S^{-1} is not an unbiased estimator of Σ^{-1} (see, e.g., Siotani, Hayakawa, and Fujikoshi, 1985). Based on the results reported by Yanagihara and Satoh (2010), we correct the bias of E_Y[tr(W_θ̂(λ) S^{-1})] to E_Y[tr(W_θ̂(λ) Σ^{-1})] under fixed θ̂(λ). Finally, we define MC_p^#(λ) by neglecting terms that are independent of θ and λ. Using these criteria, λ and θ are optimized as follows:

λ̂_C^# = arg min_{λ ∈ [0, ∞]} C_p^#(λ) and θ̂(λ̂_C^#) = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂_C^#),
λ̂_M^# = arg min_{λ ∈ [0, ∞]} MC_p^#(λ) and θ̂(λ̂_M^#) = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂_M^#).

These optimization methods are similar to those reported by Ye (1998) and Shen and Ye (2002).

3.2. Optimization of λ with the approximated θ̂(λ)

In the previous subsection, we proposed penalty selection criteria for selecting λ. These criteria are constructed from the optimal θ obtained by minimizing the GC_p criterion. This indicates that we need to repeat the optimization of θ until the optimal λ is obtained; hence, a large number of computational tasks are required for such an optimization. In this subsection, we try to reduce the number of computational tasks by using the approximated θ̂(λ) given by (2.6). Thus, we propose the penalty selection criterion obtained when the approximated θ̂(λ) is used. For this purpose, we calculate ∂θ_{(L)}(λ)/∂(Y)_{ij}. The following lemma is useful for obtaining such a derivative (the proof is provided in Appendix A.4):

Lemma 3.1. For any l ≥ 1, the first derivative of a_l with respect to (Y)_{ij} is calculated as

∂a_l/∂(Y)_{ij} = 2 n^l (S^{-1} Y'X M_0^{-(l+2)} X' (I_n − Y S^{-1} Y' H/(n − k − 1)))_{ji},

where H = I_n − 1_n 1_n'/n − X M_0^{-1} X'.

By using this lemma and (2.6), we obtain the following theorem:

Theorem 3.2. The PMSE of Ŷ_{θ_{(L)}(λ)} is expressed as

PMSE[Ŷ_{θ_{(L)}(λ)}] = E_Y[tr(W_{θ_{(L)}(λ)} Σ^{-1}) + 2p{tr(M_{θ_{(L)}(λ)}^{-1} M_0) + 1}] + 2 E_Y[B_1(θ_{(L)}(λ))],

where B_1(θ_{(L)}(λ)) = −Σ_{i=1}^n Σ_{j=1}^p (X M_{θ_{(L)}(λ)}^{-2} X'Y)_{ij} ∂θ_{(L)}(λ)/∂(Y)_{ij}; its explicit form, which follows from (2.6) and Lemma 3.1, is written in terms of a_l, b_l, and traces of the form tr(M_0^{-(l+2)} M_{θ_{(L)}(λ)}^{-2} V).

The proof of this theorem is presented in Appendix A.5. When θ_{(1)}(λ) is used, we obtain

PMSE[Ŷ_{θ_{(1)}(λ)}] = E_Y[tr(W_{θ_{(1)}(λ)} Σ^{-1}) + 2p{tr(M_{θ_{(1)}(λ)}^{-1} M_0) + 1}] + E_Y[(4n/a_1) θ_{(1)}(λ) tr(M_0^{-2} M_{θ_{(1)}(λ)}^{-2} V)].

Thus, by neglecting terms that are independent of λ, the penalty selection criteria with θ_{(1)}(λ) are defined as follows:

Definition 3.2. The penalty selection criteria for optimizing λ when θ_{(1)}(λ) is used are defined as

C_p^{(1)}(λ) = tr(W_{θ_{(1)}(λ)} S^{-1}) + 2p tr(M_{θ_{(1)}(λ)}^{-1} M_0) + (4n/a_1) θ_{(1)}(λ) tr(M_0^{-2} M_{θ_{(1)}(λ)}^{-2} V),
MC_p^{(1)}(λ) = c_M tr(W_{θ_{(1)}(λ)} S^{-1}) + 2p tr(M_{θ_{(1)}(λ)}^{-1} M_0) + (4n/a_1) θ_{(1)}(λ) tr(M_0^{-2} M_{θ_{(1)}(λ)}^{-2} V),

where θ_{(1)}(λ) = p b_1/(λ a_1) and c_M = 1 − (p + 1)/(n − k − 1).
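The criteria of Definition 3.2 can be evaluated without any inner optimization over θ; the sketch below computes C_p^{(1)}(λ) and MC_p^{(1)}(λ) for a given λ, under the same conventions and assumptions as the earlier sketches (illustrative only).

```python
import numpy as np

def cp1_criteria(lam, X, Y):
    """C_p^{(1)}(lambda) and MC_p^{(1)}(lambda) from Definition 3.2."""
    n, k = X.shape
    p = Y.shape[1]
    Yc = Y - Y.mean(axis=0)
    M0 = X.T @ X
    M0_inv = np.linalg.inv(M0)
    resid0 = Yc - X @ M0_inv @ X.T @ Yc
    S = (resid0.T @ resid0) / (n - k - 1)
    V = X.T @ Yc @ np.linalg.solve(S, Yc.T @ X)
    a1 = n * np.trace(V @ np.linalg.matrix_power(M0_inv, 3))
    b1 = n * np.trace(M0_inv)
    theta1 = p * b1 / (lam * a1)                  # theta_(1)(lambda)
    M_theta_inv = np.linalg.inv(M0 + theta1 * np.eye(k))
    resid = Yc - X @ M_theta_inv @ X.T @ Yc
    W = resid.T @ resid
    fit = np.trace(np.linalg.solve(S, W))         # tr(W_theta S^{-1})
    penalty = 2 * p * np.trace(M_theta_inv @ M0)  # 2p tr(M_theta^{-1} M_0)
    correction = 4 * n * theta1 * np.trace(
        np.linalg.matrix_power(M0_inv, 2)
        @ np.linalg.matrix_power(M_theta_inv, 2) @ V) / a1
    c_M = 1 - (p + 1) / (n - k - 1)
    return fit + penalty + correction, c_M * fit + penalty + correction
```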

Similar to MC_p^#(λ), MC_p^{(1)}(λ) can be regarded as a simply bias-corrected C_p^{(1)}(λ). At least when θ_{(1)}(λ) = 0, MC_p^{(1)}(λ) completely corrects the bias of C_p^{(1)}(λ). If we use a θ_{(L)}(λ) other than θ_{(1)}(λ), the penalty selection criteria become more complicated as L increases. As an example, we describe the penalty selection criteria for θ_{(2)}(λ) in Appendix A.6. From the viewpoint of application, C_p^{(1)}(λ) and MC_p^{(1)}(λ) are useful because they are the simplest criteria among all L. When we use C_p^{(1)}(λ) and MC_p^{(1)}(λ), the optimal θ and λ are given as follows:

λ̂_C^{(1)} = arg min_{λ ∈ [0, ∞]} C_p^{(1)}(λ) and θ̂(λ̂_C^{(1)}) = θ_{(1)}(λ̂_C^{(1)}),
λ̂_M^{(1)} = arg min_{λ ∈ [0, ∞]} MC_p^{(1)}(λ) and θ̂(λ̂_M^{(1)}) = θ_{(1)}(λ̂_M^{(1)}).

3.3. Asymptotic optimization of λ

In the previous subsections, we proposed penalty selection criteria. When such criteria are used to optimize λ, we must perform an iterative procedure. In this subsection, we consider the non-iterative optimization of λ. This requires the calculation of an asymptotically optimal λ, which minimizes an asymptotic expansion of PMSE[Ŷ_θ̂(λ)] over λ ∈ [0, ∞]. The following theorem gives such an asymptotically optimal value of λ (the proof is provided in Appendix A.7):

Theorem 3.3. An asymptotically optimal λ, i.e., a λ that minimizes PMSE[Ŷ_θ̂(λ)] asymptotically, is given by

λ* = E_Y[a_1*/a_1²] / {E_Y[1/a_1] − 2 E_Y[a_2/a_1²]/(p b_1)},

where V* = X'Y Σ^{-1} Y'X and a_j* = n^j tr(V* M_0^{-(j+2)}).

By replacing a_1* with a_1, we estimate λ* as follows:

λ̂_0 = {1 − 2 tr(M_0^{-4} V)/(p tr(M_0^{-3} V) tr(M_0^{-1}))}^{-1}.   (3.4)

Note that E_Y[a_1*] = c_M E_Y[a_1] holds. Hence, we can estimate E_Y[a_1*] by c_M a_1. This implies a new estimator of λ*, given by λ̂_M = c_M λ̂_0. When we use λ̂_0 and λ̂_M, the optimal θ is given by

θ̂(λ̂_0) = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂_0), where λ̂_0 is given in (3.4),
θ̂(λ̂_M) = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂_M), where λ̂_M = c_M λ̂_0.
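The non-iterative estimates λ̂_0 of (3.4) and λ̂_M = c_M λ̂_0 require only traces of fixed matrices; a small sketch follows, with the same assumptions as before.

```python
import numpy as np

def lambda_hat0(X, Y):
    """lambda_hat0 = {1 - 2 tr(M0^{-4} V) / (p tr(M0^{-3} V) tr(M0^{-1}))}^{-1},
    together with lambda_hat_M = c_M * lambda_hat0."""
    n, k = X.shape
    p = Y.shape[1]
    Yc = Y - Y.mean(axis=0)
    M0_inv = np.linalg.inv(X.T @ X)
    resid0 = Yc - X @ M0_inv @ X.T @ Yc
    S = (resid0.T @ resid0) / (n - k - 1)
    V = X.T @ Yc @ np.linalg.solve(S, Yc.T @ X)
    num = 2 * np.trace(np.linalg.matrix_power(M0_inv, 4) @ V)
    den = p * np.trace(np.linalg.matrix_power(M0_inv, 3) @ V) * np.trace(M0_inv)
    lam0 = 1.0 / (1.0 - num / den)
    lam_M = (1 - (p + 1) / (n - k - 1)) * lam0    # lambda_hat_M = c_M * lambda_hat0
    return lam0, lam_M
```

The resulting λ̂_0 or λ̂_M can then be passed to the GC_p minimization sketched in Section 2 to obtain θ̂(λ̂_0) or θ̂(λ̂_M).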

4. Numerical Study

In this section, we conduct numerical studies to compare the PMSEs of the predictors of Y consisting of the ridge regression estimators with optimized ridge and penalty parameters. Let R_q and Δ_q(ρ) be q × q matrices defined by R_q = diag(1, ..., q) and (Δ_q(ρ))_{ij} = ρ^{|i−j|}. The explanatory matrix X was generated from X = W Ψ^{1/2}, where Ψ = R_k^{1/2} Δ_k(ρ_x) R_k^{1/2} and W is an n × k matrix whose elements were generated independently from the uniform distribution on (−1, 1). The k × p unknown regression coefficient matrix Ξ was defined by Ξ = δ F Ξ_i, where δ is a constant, F = diag(1_κ', 0_{k−κ}') is a k × k matrix, and Ξ_i is the first five rows of Ξ_0 when k = 5, Ξ_0 when k = 10, and Ξ_1 when k = 15, where (entries listed row by row)

Ξ_0:
0.850 0.657 0.259
0.2753 0.2432 0.87
0.393 0.2926 0.67
0.2754 0.2608 0.766
0.2693 0.264 0.2066
0.0676 0.0663 0.056
0.2239 0.297 0.880
0.0352 0.0346 0.0305
0.3240 0.399 0.2868
0.3747 0.3727 0.3554

Ξ_1:
.3794 0.0645 0.0330
0.0766 0.024 0.043
0.268 0.396 0.095
0.469 0.2589 0.798
0.238 0.488 0.082
0.240 0.463 0.2
0.3002 0.2364 0.950
0.55 0.0953 0.082
0.2774 0.2395 0.209
0.3392 0.3072 0.2807
0.006 0.007 0.000
0.0438 0.0408 0.038
0.387 0.3039 0.2904
0.0529 0.050 0.0493
0.2505 0.245 0.2399

Here, δ controls the scale of the regression coefficient matrix, and F controls the number of non-zero regression coefficients via κ (the dimension of the true model). The values of the elements of Ξ_0 and Ξ_1, which are the essential regression coefficient matrices, are the same as in Lawless (1981). Simulated data values Y were generated repeatedly from N_{n×3}(XΞ, Σ ⊗ I_n) under several choices of n, k, κ, δ, ρ_y, and ρ_x, where Σ = R_3^{1/2} Δ_3(ρ_y) R_3^{1/2}, and the number of repetitions was 1,000. At each repetition, we evaluated r(XΞ, Ŷ_θ̂) = tr{(XΞ − Ŷ_θ̂)'(XΞ − Ŷ_θ̂) Σ^{-1}}, where Ŷ_θ̂ = 1_n μ̂' + X Ξ̂_θ̂ is the predicted value of Y obtained by each method. The average of np + r(XΞ, Ŷ_θ̂) across the 1,000 repetitions was regarded as the PMSE of Ŷ_θ̂. In the simulation, a standardized X was used to estimate the regression coefficients.
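For reference, the following sketch generates one data set according to the design just described, assuming p = 3 and a user-supplied coefficient matrix Xi; the Cholesky factor is used as one convenient choice of square root for Ψ, and the random-number handling is ours, not the paper's.

```python
import numpy as np

def delta_mat(q, rho):
    """(Delta_q(rho))_{ij} = rho^{|i-j|}."""
    idx = np.arange(q)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def generate_data(n, k, Xi, rho_x, rho_y, rng):
    """Generate (X, Y) with X = W Psi^{1/2} and Y ~ N_{n x 3}(X Xi, Sigma (x) I_n)."""
    p = 3
    Rk_half = np.diag(np.sqrt(np.arange(1, k + 1)))   # R_k^{1/2}, R_k = diag(1,...,k)
    Psi = Rk_half @ delta_mat(k, rho_x) @ Rk_half
    R3_half = np.diag(np.sqrt(np.arange(1, p + 1)))
    Sigma = R3_half @ delta_mat(p, rho_y) @ R3_half
    W = rng.uniform(-1.0, 1.0, size=(n, k))
    X = W @ np.linalg.cholesky(Psi).T                 # one choice of square root of Psi
    X = X - X.mean(axis=0)                            # center the design, X'1_n = 0_k
    E = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = X @ Xi + E
    return X, Y

# Example: rng = np.random.default_rng(0); X, Y = generate_data(30, 5, Xi, 0.2, 0.2, rng)
```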

Recall that GC_p(θ, λ) is defined in (2.1). Here, λ and θ are optimized by the following methods:

Method 1: θ̂ = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂) with λ̂ = λ̂_0, where λ̂_0 is defined in (3.4).
Method 2: θ̂ = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂) with λ̂ = λ̂_M = c_M λ̂_0, where c_M = 1 − (p + 1)/(n − k − 1).
Method 3: θ̂ = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂) with λ̂ = arg min_{λ ∈ [0, ∞]} C_p^#(λ), where C_p^#(λ) is given by Definition 3.1.
Method 4: θ̂ = arg min_{θ ∈ [0, ∞]} GC_p(θ, λ̂) with λ̂ = arg min_{λ ∈ [0, ∞]} MC_p^#(λ), where MC_p^#(λ) is given by Definition 3.1.
Method 5: θ̂ = θ_{(1)}(λ̂) with λ̂ = arg min_{λ ∈ [0, ∞]} C_p^{(1)}(λ), where θ_{(1)}(λ) and C_p^{(1)}(λ) are defined in (2.6) and Definition 3.2, respectively.
Method 6: θ̂ = θ_{(1)}(λ̂) with λ̂ = arg min_{λ ∈ [0, ∞]} MC_p^{(1)}(λ), where MC_p^{(1)}(λ) is given by Definition 3.2.

For the purpose of comparison with the proposed methods, we prepare the conventional optimization methods:

Method 7: θ̂_C = arg min_{θ ∈ [0, ∞]} C_p(θ) = arg min_{θ ∈ [0, ∞]} GC_p(θ, 1).
Method 8: θ̂_M = arg min_{θ ∈ [0, ∞]} MC_p(θ) = arg min_{θ ∈ [0, ∞]} GC_p(θ, c_M), where c_M = 1 − (p + 1)/(n − k − 1).

In Methods 3 through 8, the fminsearch function in Matlab is used to find the minimizer of the penalty selection criterion or the model selection criterion. In the fminsearch function, the Nelder-Mead simplex method (see, e.g., Lagarias et al., 1998) is used to search for the minimizing value. When Methods 1 through 4 are used, the optimal θ is also searched for using the fminsearch function. The computational speeds of Methods 1 and 2 are the same as those of Methods 5 and 6. Furthermore, the computational speeds of Methods 5 and 6 are almost the same as those of Methods 7 and 8, because these four methods optimize one parameter. It is easy to predict that the computational speeds of Methods 3 and 4 are slower than those of the other methods, because Methods 3 and 4 optimize two parameters simultaneously. In this paper, we propose Methods 1 through 6, which can be regarded as estimation methods for the optimal λ. To obtain the optimal λ, denoted λ*, which minimizes the PMSE, we divided [0, 2] into 100 parts and used each grid point; we then computed r(XΞ, Ŷ_θ̂) for each grid point in each repetition. After the 1,000 repetitions, we computed the average of these values for each grid point, which is regarded as the main term of the PMSE of Ŷ_θ̂, and λ* was obtained by comparing these averages. To compare λ̂, i.e., the estimate of λ* obtained by each of Methods 1 through 6, Figure 1 shows box plots of λ̂ over the 1,000 repetitions for several situations; the horizontal line indicates λ*.

means the λ. Tables through 4 show the averages of (ˆλ λ ) 2 across 000 repetitions, which is referred as the mean squared error (MSE) of ˆλ, for each method. In the Theorem 2.2, we derived the expansion for ˆ(λ) and we suggested to use the first term of the expansion which is referred as () (λ). To compare the ˆ(λ) and () (λ), we show the scatter plots in some situations when we fix λ as or 2 in Figure 2. In each scatter plot, the line means the 45-degree line which means the line of ˆ(λ) = () (λ). When the scatter plot close up this line, () (λ) closes to ˆ(λ). Tables 7 through 2 show the simulation results obtained for PMSE[Ŷˆ]/{p(n +k + )} 00 for the cases in which (k, n) = (5, 30), (5, 50), (0, 30), (0, 50), (5, 30), and (5, 50), respectively, where p(n + k + ) is the PMSE of the predictor of Y derived using the LS estimators. We note p = 3 in numerical studies. In the tables, bold font indicates the minimized PMSE, and italic font indicates the second-smallest PMSE. Please insert figures and 2 around here Please insert tables, 2, 3, 4, 5, 6, 7, 8, 9, 0, and 2 around here From the figure shows the box plots for ˆλ for each method, we can see that the dispersion of ˆλ and the differences between ˆλ and λ. Methods 2, 4 and 6 are always smaller value than Methods, 3 and 5. The dispersion of Methods 2, 4 and 6 are smaller than Methods, 3 and 5. This facts mean that the correction of each method make the optimized value and dispersion smaller. We note that the dispersions of Methods and 2 are smaller than other methods. When ρ y and ρ x are small, our optimization method is nearly equal to λ. From the tables through 6, we can see that the numerical evaluation for each method. When k and δ are zeros, Method 6 is the best and Method 5 is the second best. Methods 2 and 4 are the best and the second best when κ and δ is small. When κ is equal to k = 0 and δ is large, Method 6 is the best method. On the other hand, when k = κ = 5 and δ is large, Method 6 or 2 is the best in ρ y is small or large. Consequently, Method 6 and 5 was, on average, the best and the second best method except k = 5. When k = 5, Method 5 was the best. Hence we recommend using Method 6 to optimize the penalty parameter λ. From the figure 2 shows the scatter plot for () (λ) and ˆ(λ), we can see the dispersion in each situation. We note that λ become large, the difference between ˆ(λ) and () (λ) becomes small. Also when ρ y or n become large, the difference between ˆ(λ) and () (λ) becomes small. On the other hand, the difference between ˆ(λ) and () (λ) becomes large ρ x is large. In almost case, () (λ) is smaller than ˆ(λ). This fact is corresponding the result in the Theorem 2.2 since ˆ(λ) = () (λ) + O(n ). When the ρ y or δ become large, the dispersion of ˆ(λ) becomes small. The each value of ˆ(λ) and () (λ) become small when ρ y, ρ x, δ or λ becomes large. 2

Based on the simulations, we can see that all of the methods improved the PMSEs of the LS estimators in almost all cases. All of the methods greatly improved the PMSE when n becomes small or k becomes large. Moreover, the improvement in the PMSE of the proposed method increases as ρ_y decreases. The improvement in the PMSE when κ ≠ 0 and δ ≠ 0 of the proposed method increases as ρ_x increases. Comparison of the several methods reveals that Methods 2 and 4 were better than Methods 1 and 3, respectively, in almost all cases when ρ_x is large. When k and ρ_x become large, Methods 5 and 6 provide a greater improvement in PMSE than Methods 3 and 4. When k becomes small and n becomes large, Methods 4 and 6 improve the PMSE more than Methods 2 and 4 in most cases. Occasionally, Method 7 improves the PMSE more than Method 8, especially when κ and δ become large. Consequently, Method 6 was, on average, the best method. In particular, it strongly improved the PMSE when δ and κ are small. Based on these results, we recommend using Method 6 to optimize the multivariate ridge regression.

Acknowledgments

The author wishes to express his deepest gratitude to Dr. Hirokazu Yanagihara of Hiroshima University for his valuable advice and encouragement and for introducing the author to various fields of mathematical statistics. In addition, the author would like to thank Professor Hirofumi Wakaki for previous comments and Professor Yasunori Fujikoshi of Hiroshima University for providing helpful comments and suggestions. The author also thanks Dr. Kengo Kato of Hiroshima University for his encouragement and comments. Furthermore, the author sincerely thanks Professor Megu Ohtaki, Dr. Kenichi Satoh, and Dr. Tetsuji Tonda of Hiroshima University for their encouragement and valuable comments and for introducing the author to various fields of mathematical statistics. The author would also like to thank Dr. Tomoyuki Akita, Dr. Shinobu Fujii, and Dr. Kunihiko Taniguchi of Hiroshima University for their advice and encouragement. Special thanks are due to Professor Takashi Kanda of Hiroshima Institute of Technology for his encouragement and his recommendation to attend Hiroshima University.

Appendix

A.1. Proof of Theorem 2.1

In this subsection, we prove Theorem 2.1, which shows the properties of θ̂(λ). Using d_j

and z j in (2.4), we can write tr(w S ) and tr(m M 0 ) in (2.) as g() = tr(w S ) = tr(y Y S ) 2 h() = tr(m M 0 ) = k j= Since d j > 0 and 0, we have d j d j +. k j= z j 2 d j + + k j= z j 2 d j (d j + ) 2 n ˆµ S ˆµ, ġ() = g() = 2 k j= z j 2 0, (A.) (d j + ) 3 with equality if and only if = 0 or, and ḣ() = h() = k j= d j 0, (A.2) (d j + ) 2 with equality if and only if. Therefore, g() and h() are strictly monotonic increasing and decreasing functions of [0, ], respectively. Since GC p (, λ) = λg() + 2ph(), these results imply that ˆ( ) = arg min [0, ] GC p (, ) = arg min [0, ] tr(w S ) = 0, ˆ(0) = arg min GC p (, 0) = arg min tr(m M 0 ) =. [0, ] [0, ] On the other hand, from (A.) and (A.2), we derive GC p (, λ) = GC p (, λ) = 2 k j= pd 2 j(r λ,j ) (d j + ) 3 = k ϕ j ( λ), j= where r λ,j is given by (2.5). Note that ϕ j ( λ) 0 when [0, (r + λ, ) ]. Therefore, GC p (, λ) < 0 when [0, (r + λ, ) ]. These imply that GC p (, λ) is a monotonic decreasing function with respect to [0, (r + λ, ) ]. Thus, ˆ(λ) > (r + λ, ). On the other hand, if max j=,...,k r λ,j 0 is satisfied, ϕ j ( λ) 0 holds for any. This fact means GC p (, λ) is a monotonic decreasing function with respect to [0, ]. Hence, we can see that ˆ(λ) = when max j=,...,k r λ,j 0. Since max j=,...,k r λ,j 0 holds when λ < min j=,...,k pd j / z j 2. Thus, ˆ(λ) = when λ < min j=,...,k pd j / z j 2 is satisfied. Using Equation (2.3) and GC p (, λ) = λġ() + 2pḣ(), we have where GC λ p (ˆ(λ), λ) = ġ(ˆ(λ)) + ˆ(λ) GC p (ˆ(λ), λ) = 0, λ (A.3) GC p (, λ) = 2 GC p (, λ)/( 2 ). Since ˆ(λ) satisfies (2.2), i.e., ˆ(λ) is the minimizer of GC p (, λ), GC p (, λ) is a convex function around the neighborhood of ˆ(λ). Hence, we 4

have GC p (ˆ(λ), λ) > 0. Using this result and Equation (A.3), we obtain ˆ(λ) λ = ġ(ˆ(λ)) GC p (ˆ(λ), λ). We derive ġ(ˆ(λ)) 0 because ġ() is a strictly monotonic decreasing function of [0, ]. Hence, ˆ(λ)/( λ) 0 is obtained. This implies that ˆ(λ) is a monotonic decreasing function with respect to λ. A.2. Proof of Theorem 2.2 In this subsection, we present the proof of Theorem 2.2, which describes the expansion of ˆ(λ). In order to prove this theorem, we expand the GC p criterion in (2.) under fixed λ. Recall that X n = 0 k, (n k )S = Y (I n n n/n XM 0 X )Y, and Q X XQ = D. Hence, we derive W S = Y (I n n n/n XM X ) 2 Y S = (n k )I p + Y X(M 0 2M + M M 0 M )X Y S = (n k )I p + Y XQD /2 {I k D(D + I k ) } 2 D /2 Q X Y S. Based on this result, the GC p criterion is expressed as GC p (, λ) = λ(n k )p + λ 2 tr{d /2 Q V QD /2 (D + I k ) 2 } + 2p{D(D + I k ) }. Letting t j = (D /2 Q V QD /2 ) jj, we obtain GC p (, λ) = λ(n k )p + { k ( λ + ) 2 ) } 2 t i + 2p ( + di. d i d 2 i i= By Taylor expansion around = 0, we have GC p (, λ) = λ(n k )p + λ k i= = (λ(n k ) + 2k)p + = (λ(n k ) + 2k)p + ( ) l+ l l+ t i + 2p l= d l+ i { λ( ) l+ l l+ l= l= k i= ( k i= ) ( ) l+ l l= d l i t i 2p( ) l+ l d l+ i k i= ( ) l+ l {λltr(v M (l+2) 0 ) 2ptr(M l 0 )}. Recall that a j = n j tr(v M (j+2) 0 ) and b j = n j tr(m j 0 ). It follows that GC p (, λ) = (λ(n k ) + 2k)p + lim L L ( ) l+ l {λla n l l 2pb l }. l= } d l i 5

Then, the following equation is derived: GC p(, λ) = lim L L ( ) l+ l l {λ(l + )a n l l 2pb l }. l= Using the above equation and ˆ(λ) satisfying (2.3), we obtain the equation in Theorem 2.2. A.3. Proof of Theorem 3. In this subsection, we prove Theorem 3., which shows the risk function with respect to λ. Recall that g() = tr(w S ), h() = tr(m M 0 ) and GC p (, λ) = λg() + 2ph(). Since ˆ(λ) satisfies (2.3), we obtain ( 0 = λ g() + 2p h() ) =ˆ(λ) =ˆ(λ) = ˆ(λ) { } λ g(ˆ(λ)) + 2pḧ(ˆ(λ)) + λ ġ(ˆ(λ)), where ġ(ˆ(λ)) = g()/( ) =ˆ(λ), g(ˆ(λ)) = 2 g()/( ) 2 =ˆ(λ), and ḣ(ˆ(λ)) = h()/( ) =ˆ(λ). Thus, we obtain because, from Appendix A., ˆ(λ) = λ( ġ(ˆ(λ)))/( ), GC p (ˆ(λ), λ) By simple calculation, we have ġ(ˆ(λ)) = 2ˆ(λ)tr(M 3 ˆ(λ) V ). GC p (ˆ(λ), λ) = λ g(ˆ(λ)) + 2pḧ(ˆ(λ)) > 0. As in the proof of Lemma 3., which is given in Appendix A.4, we obtain tr(m 3 ˆ(λ) V )/() = 2(S Y XM 3 ˆ(λ) X {I n Y S Y H/(n k )}) ji. Hence, we derive n i= p j= because X n = 0 k and M 3 ˆ(λ) M 0M 2 ˆ(λ) GC p (, λ) = λġ() + (XM 2 ˆ(λ) X Y ) ˆ(λ) 4λˆ(λ)tr(V M 5 ˆ(λ) M 0) ij =, GC p (ˆ(λ), λ) 2pḣ() = 2{λtr(M 3 = M 5 ˆ(λ) M 0. V ) ptr(m 2 M 0 )} and By simple calculation, we have GC p (ˆ(λ), λ) = 2{λtr(M 3 4 3 V ) 3λˆ(λ)tr(M V ) + 2ptr(M ˆ(λ) ˆ(λ) ˆ(λ) M 0)}. Thus, the theorem is proved. A.4. Proof of Lemma 3. In this subsection, we prove Lemma 3., which shows the derivative of a l for any l. Since a l = n l tr(v M (l+2) 0 ), M 0 = X X, and V = X Y S Y X, we need only obtain the 6

derivative of V. We can see that S = S S S = n k S (e j p e i nhy + Y He i n e j p)s, where e i n is the n-dimensional vector, the ith element of which is one and other elements of which are zeros. Thus, we obtain V = X e i n e (Y ) j ps Y X + X Y S e j p e i nx ij n k X Y S (e j p e i nhy + Y He i n e j p)s Y X. From a l /( ) = n l tr{m (l+2) 0 ( V )/( )}, we derive this lemma. From (3.2), we obtain A.5. Proof of Theorem 3.2 PMSE[Ŷ (L) (λ) ] = E Y [tr(w (L) (λ) Σ ) + 2p{tr(M M (L) (λ) 0) + }] [ n p 2E Y (XM 2 (L) (λ) X Y ) ] (L) (λ) ij. (Y ) i= j= ij Hence, we need only calculate (L) (λ)/( ). Using Theorem 2.2, we derive the derivative as follows: (L) (λ) = () (λ) + [ 2λa L l=0 n l ( )l+ (l + ) ] (L )(λ){λ(l l + 2)a l+ (L ) (λ) 2pb l+ }. Recall that (0) (λ) = 0. From Lemma 3., we have () (λ) = pb λa 2 a = 2n () (λ) a ( )) S Y XM0 3 X (I n n k Y S Y H, ji 7

and [ = n λa 2 L l=0 + L 2λa L ] n l ( )l+ (l + ) (L )(λ){λ(l l + 2)a l+ (L ) (λ) 2pb l+ } )) 2λa l=0 ( ( S Y XM0 3 X I n Y S Y H n k n l ( )l+ (l + ) l (L )(λ){λ(l + 2)a l+ (L ) (λ) 2pb l+ } l=0 n l ( )l+ (l + ) l (L ) {λ(l + )(l + 2)a l+ (L ) (λ) 2plb l+ } (L ) + n L ( ) l+ (l + )(l + 2) l+ a l=0 Thus, we obtain n p (XM 2 (L) (λ) X Y ) ij i= j= [ 2λa = n tr(m λa 2 0 2 M 2 i= j= L l=0 (L ) ji ( S Y XM (l+3) 0 X (I n n k Y S Y H n l ( )l+ (l + ) ] (L )(λ){λ(l l + 2)a l+ (L ) (λ) 2pb l+ } V ) L (L) (λ) l=0 )) n l ( )l+ (l + ) l (L )(λ){λ(l + 2)a l+ (L ) (λ) 2pb l+ } + L 2λa n l ( )l+ (l + ) l (L ) {λ(l + )(l + 2)a l+ (L ) (λ) 2plb l+ } l=0 n p (XM 2 (L) (λ) X Y ) (L ) ij + n L ( ) l+ (l + )(l + 2) l+ a and l=0 n i= p j= (l+2) (L ) tr(m0 M 2 (XM 2 (L) (λ) X Y ) ij () (λ) (L) (λ) V ), We derive this theorem by substituting these results into (3.2). = 2n () (λ) tr(m0 2 M 2 a V ). (L) (λ) A.6. The criteria for optimizing λ when we use (2) (λ) In this subsection, we calculate the criterion for optimizing λ when we use (2) (λ). From (2.6), we obtain (2) (λ) = () (λ) + () (λ){3λa 2 () (λ) 2pb 2 } nλa = () (λ) + { n ()(λ) 2 3 a 2 2 b } 2. a b 8. ji

Since b l = n l tr(m0 l ) does not depend on (Y ) ij, we derive { ()(λ) 2 (2) (λ) = () (λ) = () (λ) = () (λ) + n { 2 () (λ) + n { + 2 () (λ) n ( 3 a 2 ( 3 a 2 a 2 b 2 b ( 3 a 2 a 2 b 2 b )} 2 b ) } 2 + 3 2 a b ()(λ) a 2/a ) } + 3 () 2 (λ) a 2 /a. n Hence, the third term of (3.) is obtained using the result of () (λ)/( ) as follows: n p (XM 2 (2) (λ) X Y ) (2) (λ) ij (Y ) i= j= ij = 2n { () (λ) + 2 ( () (λ) 3 a 2 2 b ) } ( ) 2 tr M0 2 M 2 a n a b V (2) (λ) + 3 () 2 (λ) n p ( ) (XM 2 a 2 a na 2 (2) (λ) X Y ) ij a a 2. i= j= From Lemma 3., we obtain 3 () 2 (λ) n p ( ) (XM 2 a 2 a (2) (λ) X Y ) ij a a 2 na 2 i= = 6 2 () (λ) ( Thus, we have n a 2 j= n i= p (XM 2 (2) (λ) X Y ) ij j= S Y XM 3 0 ( ( ) na M0 a 2 I k X I n = 6 () 2 (λ) tr{xm 2 a 2 V M (2) (λ) 0 3 (na M0 a 2 I k )X } = 6 () 2 (λ) tr{m a 2 0 2 M 2 V (na (2) (λ) M0 a 2 I k )}. i= p j= (XM 2 (2) (λ) X Y ) ij (2) (λ) = 2n () (λ) tr(m0 2 M 2 a V ) + 4 () 2 (λ) (2) (λ) a + 6 () 2 (λ) tr{m a 2 0 2 M 2 V (a (2) (λ) 2I k na M0 )} = 2n () (λ) tr(m0 2 M 2 a 8 () 2 (λ)b 2 tr(m0 2 M 2 a b (2) (λ) V ) + 8 2 () (λ)a 2 a 2 (2) (λ) V ) 6n 2 () (λ) 9 a )) n k Y S Y H ji ( 3 a 2 2 b ) 2 tr(m0 2 M 2 a b V ) (2) (λ) tr(m 2 0 M 2 (2) (λ) V ) tr(m 3 0 M 2 (2) (λ) V ).

When we use Theorem 3.2, we obtain the same result. Using this result, we obtain the C p type criteria for optimizing λ are defined as C p (2) (λ) = tr(w () (λ) S ) + 2ptr(M M (2) λ 0) and + 4n () (λ) tr(m0 2 M 2 a 6 () 2 (λ)b 2 tr(m0 2 M 2 a b (2) (λ) V ) + 36 2 () (λ)a 2 a 2 (2) (λ) V ) 2n 2 () (λ) a tr(m 2 0 M 2 (2) (λ) V ) tr(m 3 0 M 2 (2) (λ) V ), MC p (2) (λ) = c M tr(w () (λ) S ) + 2ptr(M M (2) λ 0) + 4n () (λ) tr(m0 2 M 2 a 6 () 2 (λ)b 2 tr(m0 2 M 2 a b (2) (λ) V ) + 36 2 () (λ)a 2 a 2 (2) (λ) V ) 2n 2 () (λ) a A.7. Proof of Theorem 3.3 tr(m 2 0 M 2 (2) (λ) V ) tr(m 3 0 M 2 (2) (λ) V ). In this subsection, we show an asymptotic expansion PMSE[Ŷˆ(λ) ] of and calculation to obtain ˆλ 0. Since PMSE[Ŷˆ(λ) ] is obtained as (3.3), we consider expanding each term for obtaining ˆλ 0. We obtain Wˆ(λ) Σ = Y (I n n n/n XM ˆ(λ) X ) 2 Y Σ = (n k )SΣ + Y X(M 0 2M ˆ(λ) + M ˆ(λ) M 0M ˆ(λ) )X Y Σ = (n k )SΣ + Y XQD /2 (I k D{D + ˆ(λ)I k ) } 2 D /2 Q X Y Σ, because Y (I n n n/n XM 0 X )Y = (n k )S, X n = 0 k and Q X XQ = D. Hence, we obtain tr(wˆ(λ) Σ ) = (n k )tr(sσ ) + 2 tr{(d + I k ) 2 D Q V Q}. Since S is an unbiased estimator of Σ, we have E Y [tr(wˆ(λ) Σ )] = (n k )p + E Y ( ) 2 k ˆ(λ) (Q V Q) jj d j + ˆ(λ). d j j= Then, since d i = O(n) and V = O p (n 2 ), we can expand the above equation as follows: [ k ] ˆ E Y [tr(wˆ(λ) Σ 2 (λ) )] = (n k )p + E Y (Q V Q) d 3 jj + O p (n 2 ) j= j [ ] a = (n k )p + E ˆ 2 (λ) Y + O p (n 2 ). n 20

From a simple calculation by Taylor expansion, and noting that a_j = O_p(1) and b_j = O(1), we derive

tr(M_θ̂(λ)^{-1} M_0) = k − b_1 θ̂(λ)/n + O_p(n^{-2}),
λ θ̂(λ) tr(V M_θ̂(λ)^{-5} M_0) = λ θ̂(λ) a_2/n² + O_p(n^{-3}),
λ tr(V M_θ̂(λ)^{-3}) = λ a_1/n + O_p(n^{-2}),
λ θ̂(λ) tr(V M_θ̂(λ)^{-4}) = λ θ̂(λ) a_2/n² + O_p(n^{-3}),
tr(M_θ̂(λ)^{-3} M_0) = b_2/n² + O_p(n^{-3}).

By substituting these results into PMSE[Ŷ_θ̂(λ)] in (3.3), we obtain the asymptotic expansion of PMSE[Ŷ_θ̂(λ)] as follows:

PMSE[Ŷ_θ̂(λ)] = (n + k + 1)p + E_Y[a_1* θ̂²(λ)/n − 2p b_1 θ̂(λ)/n + 4 a_2 θ̂(λ)/(n a_1)] + O_p(n^{-2}).

From (2.6), which is proved in Appendix A.2, θ̂(λ) = p b_1/(λ a_1) + O_p(n^{-1}). Hence, we consider minimizing the following approximated PMSE:

PMSE[Ŷ_θ̂(λ)] = (n + k + 1)p + E_Y[(p b_1/(n a_1)){p b_1 a_1*/(λ² a_1) − 2p b_1/λ + 4 a_2/(λ a_1)}] + O(n^{-2}).

Hence, we obtain the asymptotically optimal λ, which minimizes the second term of the above equation, as in Theorem 3.3.

References

[1] A. C. Atkinson, A note on the generalized information criterion for choice of a model, Biometrika, 67 (1980), 413-418.
[2] P. J. Brown and J. V. Zidek, Adaptive multivariate ridge regression, Ann. Statist., 8 (1980), 64-74.
[3] B. Efron, The estimation of prediction error: covariance penalties and cross-validation, J. Amer. Statist. Assoc., 99 (2004), 619-632.
[4] S. J. V. Dien, S. Iwatani, Y. Usuda and K. Matsui, Theoretical analysis of amino acid-producing Escherichia coli using a stoichiometric model and multivariate linear regression, J. Biosci. Bioeng., 102 (2006), 34-40.
[5] Y. Fujikoshi and K. Satoh, Modified AIC and C_p in multivariate linear regression, Biometrika, 84 (1997), 707-716.

[6] Y. Fujikoshi, H. Yanagihara and H. Wakaki, Bias corrections of some criteria for selecting multivariate linear models in a general nonnormal case, Amer. J. Math. Management Sci., 25 (2005), 221-258.
[7] Y. Haitovsky, On multivariate ridge regression, Biometrika, 74 (1987), 563-570.
[8] A. E. Hoerl and R. W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12 (1970), 55-67.
[9] J. C. Lagarias, J. A. Reeds, M. H. Wright and P. E. Wright, Convergence properties of the Nelder-Mead simplex method in low dimensions, SIAM J. Optim., 9 (1998), 112-147.
[10] J. F. Lawless, Mean squared error properties of generalized ridge estimators, J. Amer. Statist. Assoc., 76 (1981), 462-466.
[11] C. L. Mallows, Some comments on C_p, Technometrics, 15 (1973), 661-675.
[12] C. L. Mallows, More comments on C_p, Technometrics, 37 (1995), 362-372.
[13] C. Sârbu, C. Onisor, M. Posa, S. Kevresan and K. Kuhajda, Modeling and prediction (correction) of partition coefficients of bile acids and their derivatives by multivariate regression methods, Talanta, 75 (2008), 651-657.
[14] R. Saxén and J. Sundell, 137Cs in freshwater fish in Finland since 1986: a statistical analysis with multivariate linear regression models, J. Environ. Radioactiv., 87 (2006), 62-76.
[15] X. Shen and J. Ye, Adaptive model selection, J. Amer. Statist. Assoc., 97 (2002), 210-221.
[16] M. Siotani, T. Hayakawa and Y. Fujikoshi, Modern Multivariate Statistical Analysis: A Graduate Course and Handbook, American Sciences Press, Columbus, Ohio, 1985.
[17] B. Skagerberg, J. MacGregor and C. Kiparissides, Multivariate data analysis applied to low-density polyethylene reactors, Chemometr. Intell. Lab. Syst., 14 (1992), 341-356.
[18] R. S. Sparks, D. Coutsourides and L. Troskie, The multivariate C_p, Comm. Statist. A-Theory Methods, 12 (1983), 775-793.
[19] M. S. Srivastava, Methods of Multivariate Statistics, John Wiley & Sons, New York, 2002.

[20] N. H. Timm, Applied Multivariate Analysis, Springer-Verlag, New York, 2002.
[21] H. Yanagihara and K. Satoh, An unbiased C_p criterion for multivariate ridge regression, J. Multivariate Anal., 101 (2010), 1226-1238.
[22] J. Ye, On measuring and correcting the effects of data mining and model selection, J. Amer. Statist. Assoc., 93 (1998), 120-131.
[23] A. Yoshimoto, H. Yanagihara and Y. Ninomiya, Finding factors affecting a forest stand growth through multivariate linear modeling, J. Jpn. For. Res., 87 (2005), 504-512 (in Japanese).

[Figure 1 appears here: box plots of λ̂ for Methods 1 through 6 in eight settings of (n, k, κ, δ, ρ_y, ρ_x), including (30, 5, 0, 1, 0.2, 0.2), (30, 5, 0, 1, 0.2, 0.95), (30, 5, 0, 1, 0.95, 0.2), (30, 5, 0, 1, 0.95, 0.95), (30, 5, 5, 1, 0.2, 0.2), (30, 5, 0, 3, 0.2, 0.2), (50, 5, 0, 1, 0.2, 0.2), and (30, 10, 5, 3, 0.2, 0.2).]

Figure 1. The box plots for λ̂ in the case of (n, k, κ, δ, ρ_y, ρ_x).

Table. MSE of ˆλ in which (k, n) = (5, 30) Method κ δ ρ y ρ x 2 3 4 5 6 0 0 0.2 0.2.470.04.475.45 0.039 0.04 0.95.644.35 2.599 2.5 0.036 0.03 0.95 0.2.327 0.896.239 0.907 0.033 0.03 0.95.593.092 2.482 2.050 0.035 0.02 5 0.2 0.2 0.030 0.002 0.009 0.05 0.8 0.33 0.95 0.305 0.4 0.424 0.266 0.378 0.445 0.95 0.2 0.006 0.06 0.00 0.032 0.055 0.67 0.95 0.340 0.37 0.327 0.69 0.270 0.338 3 0.2 0.2 0.008 0.02 0.004 0.07 0.00 0.030 0.95.577.095.26 0.85 0.303 0.54 0.95 0.2.322 0.98.293 0.894.253 0.86 0.95.56.053.359 0.927 0.677 0.397 Average 0.928 0.624.039 0.782 0.272 0.23 Table 2. MSE of ˆλ in which (k, n) = (5, 50) Method κ δ ρ y ρ x 2 3 4 5 6 0 0 0.2 0.2.394.48.504.277 0.027 0.05 0.95.654.363 2.645 2.397 0.022 0.02 0.95 0.2.64 0.94.28 0.998 0.023 0.07 0.95.656.364 2.680 2.436 0.020 0.00 5 0.2 0.2 0.00 0.00 0.003 0.003 0.076 0.36 0.95 0.288 0.76 0.43 0.30 0.406 0.442 0.95 0.2 0.003 0.003 0.00 0.008 0.05 0.048 0.95 0.232 0.33 0.87 0.2 0.322 0.377 3 0.2 0.2 0.000 0.03 0.00 0.06 0.002 0.02 0.95.550.28.28.040 0.460 0.337 0.95 0.2.289.065.276.054.253.033 0.95.483.226.335.00 0.928 0.73 Average 0.894 0.726.045 0.897 0.296 0.265 25

Table 3. MSE of ˆλ in which (k, n) = (0, 30) Method κ δ ρ y ρ x 2 3 4 5 6 0 0 0.2 0.2.293 0.798 0.977 0.773 0.06 0.003 0.95.375 0.849.60.26 0.06 0.004 0.95 0.2.292 0.798 0.930 0.70 0.07 0.004 0.95.332 0.85.539. 0.07 0.004 5 0.2 0.2 0.070 0.00 0.028 0.0 0.3 0.46 0.95 0.328 0.04 0.375 0.202 0.3 0.355 0.95 0.2 0.056 0.000 0.020 0.007 0.45 0.284 0.95 0.303 0.090 0.283 0.27 0.308 0.36 3 0.2 0.2 0.030 0.004 0.0 0.05 0.07 0.088 0.95 0.8 0.03 0.0 0.029 0.329 0.438 0.95 0.2 0.08 0.0 0.007 0.022 0.002 0.054 0.95.369 0.853.30 0.673 0.24 0.06 0 0.2 0.2 0.035 0.003 0.00 0.09 0.267 0.450 0.95 0.283 0.079 0.277 0.4 0.339 0.392 0.95 0.2 0.043 0.002 0.05 0.04 0.2 0.277 0.95 0.303 0.090 0.25 0.07 0.282 0.348 3 0.2 0.2.299 0.80.230 0.755 0.936 0.535 0.95.4 0.879.5 0.657 0.22 0.043 0.95 0.2.307 0.85.259 0.777.059 0.624 0.95.399 0.872.59 0.693 0.233 0.094 Average 0.686 0.395 0.65 0.403 0.255 0.246 26

Table 4. MSE of ˆλ in which (k, n) = (0, 50) Method κ δ ρ y ρ x 2 3 4 5 6 0 0 0.2 0.2.73 0.940 0.977 0.862 0.06 0.007 0.95.282.028.562.409 0.008 0.004 0.95 0.2.29 0.902 0.890 0.75 0.04 0.007 0.95.285.03.598.403 0.007 0.003 5 0.2 0.2 0.02 0.000 0.005 0.004 0.200 0.287 0.95 0.203 0. 0.83 0.26 0.407 0.437 0.95 0.2 0.007 0.00 0.003 0.004 0.058 0.5 0.95 0.82 0.096 0.20 0.067 0.372 0.44 3 0.2 0.2 0.005 0.002 0.003 0.003 0.00 0.05 0.95 0.04 0.000 0.03 0.024 0.492 0.60 0.95 0.2 0.006 0.00 0.004 0.002 0.00 0.006 0.95.26.06.38 0.97 0.455 0.328 0 0.2 0.2 0.004 0.002 0.00 0.008 0.58 0.248 0.95 0.7 0.087 0.02 0.054 0.426 0.465 0.95 0.2 0.007 0.00 0.004 0.003 0.047 0.04 0.95 0.23 0.054 0.063 0.022 0.430 0.487 3 0.2 0.2.224 0.986.27 0.979.066 0.8444 0.95.325.067.6 0.92 0.229 0.47 0.95 0.2.2 0.976.206 0.97.24 0.897 0.95.304.050.94 0.95 0.459 0.325 Average 0.596 0.468 0.572 0.474 0.298 0.287 27

Table 5. MSE of ˆλ in which (k, n) = (5, 30) Method κ δ ρ y ρ x 2 3 4 5 6 0 0 0.2 0.2.3 0.660.046 0.76 0.04 0.002 0.95.54 0.762.364 0.940 0.0 0.00 0.95 0.2.27 0.594 0.922 0.540 0.02 0.004 0.95.462 0.726.326 0.86 0.03 0.003 5 0.2 0.2 0.285 0.042 0.63 0.036 0.242 0.328 0.95 0.864 0.328 0.77 0.356 0.080 0.096 0.95 0.2 0.030 0.024 0.03 0.048 0.207 0.47 0.95 0.36 0.044 0.20 0.049 0.338 0.42 3 0.2 0.2 0.02 0.047 0.005 0.069 0.8 0.473 0.95 0.232 0.07 0.4 0.048 0.427 0.522 0.95 0.2 0.004 0.063 0.003 0.068 0.00 0. 0.95 0.39 0.002 0.043 0.08 0.224 0.390 0 0.2 0.2 0.099 0.00 0.038 0.020 0.336 0.506 0.95 0.475 0. 0.340 0.099 0.253 0.296 0.95 0.2 0.044 0.04 0.08 0.033 0.097 0.30 0.95 0.35 0.044 0.69 0.038 0.35 0.398 3 0.2 0.2.308 0.667.39 0.562 0.62 0.249 0.95.542 0.787.45 0.547 0.057 0.03 0.95 0.2 0.002 0.073 0.00 0.084 0.002 0.8 0.95 0.4 0.002 0.026 0.024 0.20 0.376 5 0.2 0.2 0.07 0.005 0.025 0.023 0.260 0.462 0.95 0.468 0.09 0.307 0.074 0.244 0.292 0.95 0.2 0.033 0.020 0.03 0.040 0.074 0.267 0.95.540 0.786.42 0.540 0.050 0.0 3 0.2 0.2.259 0.642.46 0.57 0.84 0.374 0.95.527 0.779.2 0.556 0.04 0.029 0.95 0.2 0.006 0.056 0.004 0.063 0.00 0.083 0.95 0.2 0.002 0.033 0.08 0.49 0.305 Average 0.583 0.264 0.450 0.25 0.92 0.246 28

Table 6. MSE of ˆλ in which (k, n) = (5, 50) Method κ δ ρ y ρ x 2 3 4 5 6 0 0 0.2 0.2.66 0.903 0.867 0.753 0.006 0.002 0.95.256 0.973.247.059 0.004 0.00 0.95 0.2.67 0.904 0.86 0.70 0.007 0.003 0.95.254 0.972.63 0.997 0.004 0.002 5 0.2 0.2 0.023 0.00 0.023 0.025 0.520 0.625 0.95 0.289 0.63 0.20 0.4 0.38 0.340 0.95 0.2 0.002 0.007 0.00 0.009 0.073 0.54 0.95 0.044 0.006 0.00 0.006 0.555 0.630 3 0.2 0.2 0.000 0.05 0.000 0.07 0.053 0.26 0.95 0.035 0.003 0.0 0.007 0.557 0.645 0.95 0.2 0.006 0.04 0.006 0.042 0.03 0.059 0.95 0.004 0.005 0.000 0.05 0.7 0.205 0 0.2 0.2 0.02 0.000 0.004 0.005 0.278 0.392 0.95 0.74 0.080 0.078 0.036 0.44 0.472 0.95 0.2 0.004 0.004 0.003 0.006 0.034 0.095 0.95 0.063 0.04 0.02 0.005 0.452 0.532 3 0.2 0.2.7 0.92.58 0.900 0.946 0.74 0.95.26 0.982.069 0.87 0.26 00.073 0.95 0.2 0.003 0.034 0.003 0.034 0.008 0.046 0.95 0.004 0.005 0.000 0.04 0.088 0.68 5 0.2 0.2 0.006 0.003 0.00 0.0 0.232 0.345 0.95 0.42 0.059 0.060 0.020 0.449 0.494 0.95 0.2 0.002 0.007 0.00 0.009 0.030 0.088 0.95 0.053 0.00 0.05 0.002 0.442 0.523 3 0.2 0.2.97 0.932.84 0.920.002 0.76 0.95.270 0.989.056 0.82 0.92 0.22 0.95 0.2 0.007 0.002 0.007 0.002 0.003 0.005 0.95 0.06 0.000 0.006 0.003 0.05 0.5 Average 0.380 0.287 0.322 0.263 0.250 0.276 29

[Figure 2 appears here: scatter plots of θ_{(1)}(λ) against θ̂(λ) in eight settings of (n, k, κ, δ, ρ_y, ρ_x, λ), including (30, 5, 0, 1, 0.2, 0.2, 1), (30, 5, 0, 1, 0.2, 0.2, 2), (30, 5, 0, 1, 0.95, 0.2, 1), (30, 5, 0, 3, 0.2, 0.2, 1), (50, 5, 0, 1, 0.2, 0.2, 1), (30, 5, 0, 1, 0.2, 0.95, 1), (30, 5, 0, 1, 0.95, 0.95, 2), and (30, 10, 5, 3, 0.2, 0.2, 1); each panel shows the 45-degree line.]

Figure 2. The scatter plots for θ_{(1)}(λ) and θ̂(λ) in the case of (n, k, κ, δ, ρ_y, ρ_x, λ).

Table 7. Simulation results for the case in which (k, n) = (5, 30) Method κ δ ρ y ρ x 2 3 4 5 6 7 8 0 0 0.2 0.2 88.34 87.53 88.00 87.6 87.30 86.76 87.44 86.88 0.95 90.24 89.05 90.52 89.32 88.34 87.46 88.63 87.73 0.95 0.2 88.49 87.64 88.5 87.28 87.39 86.82 87.53 86.94 0.95 90.25 89.03 90.37 89.5 88.28 87.4 88.56 87.67 5 0.2 0.2 94.80 94.72 94.8 94.90 94.87 95.0 94.72 94.9 0.95 92.7 9.54 92.09 9.48 9.25 90.96 9.34 9.02 0.95 0.2 97.34 97.3 97.34 97.36 97.37 97.46 97.29 97.4 0.95 93.47 92.94 93.47 93.00 92.74 92.43 92.76 92.45 3 0.2 0.2 98.92 98.93 98.92 98.94 98.92 98.95 98.9 98.98 0.95 95.44 95.23 95.69 95.75 95.40 95.43 95.9 95.2 0.95 0.2 99.37 99.37 99.37 99.38 99.37 99.38 99.37 99.40 0.95 97.6 97.54 97.66 97.82 97.72 97.86 97.54 97.67 Average 93.87 93.40 93.87 93.46 93.25 93.00 93.27 93.02 Table 8. Simulation results for the case in which (k, n) = (5, 50) Method κ δ ρ y ρ x 2 3 4 5 6 7 8 0 0 0.2 0.2 92.22 9.94 92.06 9.70 9.63 9.45 9.73 9.54 0.95 93.29 92.9 93.52 93.08 92.0 9.80 92.27 9.98 0.95 0.2 92.22 9.96 92.08 9.72 9.67 9.49 9.76 9.58 0.95 93.24 92.83 93.42 93.0 92.05 9.78 92.2 9.93 5 0.2 0.2 97.46 97.45 97.46 97.48 97.52 97.57 97.46 97.52 0.95 94.96 94.79 95.00 94.82 94.54 94.46 94.57 94.48 0.95 0.2 98.75 98.76 98.76 98.77 98.77 98.79 98.77 98.8 0.95 96.9 96.06 96.28 96.20 95.96 95.92 95.9 95.87 3 0.2 0.2 99.57 99.57 99.57 99.57 99.57 99.57 99.57 99.58 0.95 97.70 97.67 97.80 97.84 97.78 97.83 97.66 97.70 0.95 0.2 99.79 99.80 99.79 99.80 99.79 99.80 99.80 99.8 0.95 98.73 98.73 98.74 98.75 98.77 98.80 98.74 98.78 Average 96.8 96.04 96.2 96.06 95.84 95.77 95.87 95.80 3

Table 9. Simulation results for the case in which (k, n) = (0, 30) Method κ δ ρ y ρ x 2 3 4 5 6 7 8 0 0 0.2 0.2 79.07 77.26 78.0 76.70 77.69 76.47 77.86 76.56 0.95 8.9 78.6 80.54 78.24 78.86 77.4 79.3 77.28 0.95 0.2 78.78 77.07 77.90 76.56 77.48 76.33 77.64 76.4 0.95 8.0 78.44 80.29 77.72 78.72 76.89 79.03 77.08 5 0.2 0.2 87.4 86.60 87.2 87.07 86.75 86.88 86.72 86.76 0.95 83.49 8.63 82.99 8.4 8.83 80.74 82.0 80.86 0.95 0.2 9.60 9.36 9.45 9.52 9.46 9.66 9.38 9.56 0.95 85.2 83.47 84.66 83.2 83.70 82.59 83.87 82.68 3 0.2 0.2 95.74 95.7 95.7 95.80 95.72 95.85 95.67 95.83 0.95 88.4 87.52 88.34 87.83 87.67 87.2 87.73 87.23 0.95 0.2 97.50 97.5 97.50 97.56 97.50 97.56 97.47 97.58 0.95 9.85 9.67 9.9 92.63 9.77 92.8 9.66 92.02 0 0.2 0.2 92.02 9.74 9.93 9.93 9.85 92.2 9.76 9.98 0.95 84.62 82.96 84.27 82.9 83.4 82.8 83.29 82.28 0.95 0.2 94.53 94.37 94.50 94.57 94.47 94.68 94.36 94.57 0.95 86.9 84.74 85.8 84.66 84.9 84.02 85.03 84.0 3 0.2 0.2 98.57 98.64 98.59 98.7 98.59 98.72 98.57 98.77 0.95 9.50 90.99 9.59 9.67 9.2 9.23 9.06 9.2 0.95 0.2 99.69 99.73 99.70 99.76 99.70 99.76 99.70 99.8 0.95 94.5 93.83 94.20 94.37 94.00 94.9 93.87 94.02 Average 89. 88.9 88.85 88.24 88.35 87.92 88.39 87.93 32

Table 0. Simulation results for the case in which (k, n) = (0, 50) Method κ δ ρ y ρ x 2 3 4 5 6 7 8 0 0 0.2 0.2 84.93 84.48 84.60 84.09 84.4 84.09 84.5 84.7 0.95 86.22 85.43 85.59 84.96 84.99 84.48 85.9 84.63 0.95 0.2 84.98 84.50 84.62 84.3 84.44 84.0 84.54 84.8 0.95 86.20 85.42 85.58 85.00 85.04 84.60 85.20 84.68 5 0.2 0.2 92.65 92.60 92.67 92.72 92.66 92.74 92.60 92.66 0.95 88.43 87.94 88.4 87.77 87.70 87.40 87.82 87.48 0.95 0.2 95.43 95.43 95.44 95.47 95.45 95.5 95.42 95.50 0.95 90.03 89.69 89.94 89.68 89.55 89.3 89.6 89.37 3 0.2 0.2 97.90 97.92 97.90 97.93 97.90 97.93 97.90 97.95 0.95 93.7 93.09 93.36 93.52 93.5 93.22 93.09 93.4 0.95 0.2 99.27 99.28 99.27 99.28 99.27 99.28 99.27 99.30 0.95 95.5 95.47 95.49 95.53 95.54 95.6 95.48 95.54 0 0.2 0.2 96.50 96.47 96.53 96.55 96.52 96.57 96.47 96.53 0.95 89.7 89.33 89.58 89.32 89.6 88.94 89.23 89.00 0.95 0.2 97.64 97.64 97.66 97.68 97.66 97.70 97.64 97.70 0.95 9.07 90.75 9.09 90.9 90.63 90.48 90.67 90.5 3 0.2 0.2 99.44 99.44 99.44 99.44 99.44 99.44 99.44 99.45 0.95 95.66 95.64 95.85 95.97 95.75 95.89 95.65 95.77 0.95 0.2 99.46 99.46 99.46 99.47 99.46 99.47 99.46 99.47 0.95 97.3 97.29 97.35 97.38 97.36 97.43 97.30 97.37 Average 93.08 92.86 92.98 92.84 92.80 92.7 92.83 92.72 33