An Overview on the Shrinkage Properties of Partial Least Squares Regression


Nicole Krämer
TU Berlin, Department of Computer Science and Electrical Engineering, Franklinstr. 28/29, D-10587 Berlin

Summary

The aim of this paper is twofold. In the first part, we recapitulate the main results regarding the shrinkage properties of Partial Least Squares (PLS) regression. In particular, we give an alternative proof of the shape of the PLS shrinkage factors. It is well known that some of the factors are > 1. We discuss in detail the effect of shrinkage factors on the Mean Squared Error of linear estimators and argue that these results cannot be extended to PLS directly, as PLS is nonlinear. In the second part, we investigate the effect of the shrinkage factors empirically. In particular, experiments on simulated and real-world data show that bounding the absolute value of the PLS shrinkage factors by 1 seems to lead to a lower Mean Squared Error.

Keywords: linear regression, biased estimators, mean squared error

1 Introduction

In this paper, we give a detailed overview of the shrinkage properties of Partial Least Squares (PLS) regression. It is well known (Frank & Friedman 1993) that we can express the PLS estimator obtained after m steps in the following way:

$$\hat\beta^{(m)}_{PLS} = \sum_i f^{(m)}(\lambda_i)\, z_i,$$

where $z_i$ is the component of the Ordinary Least Squares (OLS) estimator along the $i$th principal component of the covariance matrix $X^t X$ and $\lambda_i$ is the corresponding eigenvalue. The quantities $f^{(m)}(\lambda_i)$ are called shrinkage factors. We show that these factors are determined by a tridiagonal matrix (which depends on the input-output data $(X, y)$) and can be calculated in a recursive way. Combining the results of Butler & Denham (2000) and Phatak & de Hoog (2002), we give a simpler and clearer proof of the shape of the shrinkage factors of PLS and derive some of their properties. In particular, we reproduce the fact that some of the values $f^{(m)}(\lambda_i)$ are greater than 1. This was first proved by Butler & Denham (2000).

We argue that these "peculiar shrinkage properties" (Butler & Denham 2000) do not necessarily imply that the Mean Squared Error (MSE) of the PLS estimator is worse than the MSE of the OLS estimator. In the case of deterministic shrinkage factors, i.e. factors that do not depend on the output y, any value $f^{(m)}(\lambda_i) > 1$ is of course undesirable. But in the case of PLS, the shrinkage factors are stochastic: they also depend on y. In particular, bounding the absolute value of the shrinkage factors by 1 might not automatically yield a lower MSE, in contrast to what was conjectured e.g. in Frank & Friedman (1993). Having issued this warning, we explore whether bounding the shrinkage factors leads to a lower MSE or not. It is very difficult to derive theoretical results, as the quantities of interest, $\hat\beta^{(m)}_{PLS}$ and $f^{(m)}(\lambda_i)$, depend on y in a complicated, nonlinear way. As a substitute, we study the problem on several artificial data sets and one real-world example. It turns out that in most cases, the MSE of the bounded version of PLS is indeed smaller than that of PLS.

2 Preliminaries

We consider the multivariate linear regression model

$$y = X\beta + \varepsilon \qquad (1)$$

with

$$\mathrm{cov}(y) = \sigma^2 I_n. \qquad (2)$$

Here, $I_n$ is the identity matrix of dimension n. The number of variables is p, the number of examples is n. For simplicity, we assume that X and y are scaled to have zero mean, so we do not have to worry about intercepts. We have

$$X \in \mathbb{R}^{n\times p}, \quad A := X^t X \sim \mathrm{cov}(X) \in \mathbb{R}^{p\times p}, \quad y \in \mathbb{R}^n, \quad b := X^t y \sim \mathrm{cov}(X, y) \in \mathbb{R}^p.$$

We set $p = \mathrm{rk}(A) = \mathrm{rk}(X)$. The singular value decomposition of X is of the form

$$X = V \Sigma U^t$$

with $V \in \mathbb{R}^{n\times p}$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{p\times p}$, $U \in \mathbb{R}^{p\times p}$. The columns of U and V are mutually orthogonal, that is, $U^t U = I_p$ and $V^t V = I_p$. We set $\lambda_i = \sigma_i^2$ and $\Lambda = \Sigma^2$. The eigendecomposition of A is

$$A = U \Lambda U^t = \sum_{i=1}^p \lambda_i u_i u_i^t.$$

The eigenvalues $\lambda_i$ of A (and of any other matrix) are ordered as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0$. The Moore-Penrose inverse of a matrix M is denoted by $M^-$.

The Ordinary Least Squares (OLS) estimator $\hat\beta_{OLS}$ is the solution of the optimization problem

$$\arg\min_\beta \|y - X\beta\|.$$

If there is no unique solution (which is in general the case for $p > n$), the OLS estimator is the solution with minimal norm. Set

$$s = \Sigma V^t y. \qquad (3)$$

The OLS estimator is given by the formula

$$\hat\beta_{OLS} = (X^t X)^- X^t y = U \Lambda^- \Sigma V^t y = U \Lambda^- s = \sum_{i=1}^p \frac{v_i^t y}{\sqrt{\lambda_i}}\, u_i.$$

We define

$$z_i = \frac{v_i^t y}{\sqrt{\lambda_i}}\, u_i.$$

This implies

$$\hat\beta_{OLS} = \sum_{i=1}^p z_i.$$

Set

$$K^{(m)} := \left( A^0 b, Ab, \ldots, A^{m-1} b \right) \in \mathbb{R}^{p\times m}.$$

The columns of $K^{(m)}$ are called the Krylov sequence of A and b. The space spanned by the columns of $K^{(m)}$ is called the Krylov space of A and b and is denoted by $\mathcal{K}^{(m)}$. Krylov spaces are closely related to the Lanczos algorithm (Lanczos 1950), a method for approximating eigenvalues of the matrix A. We exploit the relationship between PLS and Krylov spaces in the subsequent sections. An excellent overview of the connections of PLS to the Lanczos method (and the conjugate gradient algorithm) can be found in Phatak & de Hoog (2002). Set

$$M := \{\lambda_i \mid s_i \neq 0\} = \{\lambda_i \neq 0 \mid v_i^t y \neq 0\}$$

(the vector s is defined in (3)) and $m^* := |M|$. It follows easily that $m^* \le p = \mathrm{rk}(X)$. The inequality is strict if A has non-zero eigenvalues of multiplicity > 1 or if there is a principal component $v_i$ that is not correlated with y, i.e. $v_i^t y = 0$. The quantity $m^*$ is also called the grade of b with respect to A. We state a standard result on the dimension of the Krylov spaces associated to A and b.
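To fix the notation, here is a minimal numpy sketch (my own illustration, not code from the paper) of the OLS estimator computed through the SVD and its components $z_i$:

```python
import numpy as np

def ols_components(X, y):
    """Sketch: OLS estimator via the SVD X = V Sigma U^t and its components
    z_i = (v_i^t y / sqrt(lambda_i)) u_i."""
    # numpy returns X = V_np diag(sigma) Ut_np, which matches X = V Sigma U^t above.
    V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
    U = Ut.T
    lam = sigma ** 2                       # eigenvalues lambda_i of A = X^t X
    z = (V.T @ y / sigma) * U              # column i is z_i
    beta_ols = z.sum(axis=1)               # beta_OLS = sum_i z_i
    return beta_ols, z, lam
```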

Proposition 1. We have

$$\dim \mathcal{K}^{(m)} = \begin{cases} m & m \le m^* \\ m^* & m > m^*. \end{cases}$$

In particular,

$$\dim \mathcal{K}^{(m^*)} = \dim \mathcal{K}^{(m^*+1)} = \ldots = \dim \mathcal{K}^{(p)} = m^*. \qquad (4)$$

Finally, let us introduce the following notation. For any set S of vectors, we denote by $P_S$ the projection onto the space spanned by S. It follows that $P_S = S (S^t S)^- S^t$.

3 Partial Least Squares

We only give a sketchy introduction to the PLS method. More details can be found e.g. in Höskuldsson (1988) or Rosipal & Krämer (2006). The main idea is to extract m orthogonal components from the predictor space X and fit the response y to these components. In this sense, PLS is similar to Principal Components Regression (PCR). The difference is that PCR extracts components that explain the variance in the predictor space, whereas PLS extracts components that have a high covariance with y. The quantity m is called the number of PLS steps or the number of PLS components.

We now formalize this idea. The first latent component $t_1$ is a linear combination $t_1 = X w_1$ of the predictor variables. The vector $w_1$ is usually called the weight vector. We want to find a component with maximal covariance with y, that is, we want to compute

$$w_1 = \arg\max_{\|w\|=1} \mathrm{cov}(Xw, y) = \arg\max_{\|w\|=1} w^t b. \qquad (5)$$

Using Lagrangian multipliers, we conclude that the solution $w_1$ is, up to a factor, equal to $X^t y = b$. Subsequent components $t_2, t_3, \ldots$ are chosen such that they maximize (5) and that all components are mutually orthogonal. We ensure orthogonality by deflating the original predictor variables X. That is, we only consider the part of X that is orthogonal to all components $t_j$ for $j < i$:

$$X_i = X - P_{t_1, \ldots, t_{i-1}} X.$$

We then replace X by $X_i$ in (5). This version of PLS is called the NIPALS algorithm (Wold 1975).

Algorithm 2 (NIPALS algorithm). After setting $X_1 = X$, the weight vectors $w_i$ and the components $t_i$ of PLS are determined by iteratively computing

$$w_i = X_i^t y \quad \text{(weight vector)}$$
$$t_i = X_i w_i \quad \text{(component)}$$
$$X_{i+1} = X_i - P_{t_i} X_i \quad \text{(deflation)}$$

The final estimator $\hat y$ for $X\beta$ is

$$\hat y = P_{t_1, \ldots, t_m}\, y = \sum_{j=1}^m P_{t_j}\, y.$$

The last equality follows as the PLS components $t_1, \ldots, t_m$ are mutually orthogonal. We denote by $W^{(m)}$ the matrix that consists of the weight vectors $w_1, \ldots, w_m$ defined in Algorithm 2:

$$W^{(m)} = (w_1, \ldots, w_m). \qquad (6)$$

The PLS components $t_j$ and the weight vectors are linked in the following way (see e.g. Höskuldsson (1988)):

$$(t_1, \ldots, t_m) = X W^{(m)} R^{(m)}$$

with an invertible bidiagonal matrix $R^{(m)}$. Plugging this into the formula for $\hat y$, we obtain

$$\hat y = X W^{(m)} \left( (W^{(m)})^t X^t X W^{(m)} \right)^- (W^{(m)})^t X^t y.$$

It can be shown (Helland 1988) that the space spanned by the vectors $w_i$ ($i = 1, \ldots, m$) equals the Krylov space $\mathcal{K}^{(m)}$ defined in Section 2. More precisely, $W^{(m)}$ is an orthogonal basis of $\mathcal{K}^{(m)}$ that is obtained by a Gram-Schmidt procedure. This implies:

Proposition 3 (Helland 1988). The PLS estimator obtained after m steps can be expressed in the following way:

$$\hat\beta^{(m)}_{PLS} = K^{(m)} \left[ (K^{(m)})^t A K^{(m)} \right]^- (K^{(m)})^t b. \qquad (7)$$

An equivalent formulation is that the PLS estimator $\hat\beta^{(m)}_{PLS}$ for $\beta$ is the solution of the constrained minimization problem

$$\arg\min_\beta \|y - X\beta\| \quad \text{subject to} \quad \beta \in \mathcal{K}^{(m)}.$$
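The deflation steps of Algorithm 2 translate directly into code. The following numpy sketch is my own rendering of the iteration described above; normalization conventions for the weight vectors vary between implementations, here they are scaled to unit length as in (5):

```python
import numpy as np

def nipals_pls(X, y, m):
    """Sketch of Algorithm 2: returns the weight matrix W^(m), the components
    t_1,...,t_m and the fitted values P_{t_1,...,t_m} y."""
    Xi = X.copy()
    W, T = [], []
    for _ in range(m):
        w = Xi.T @ y                                   # weight vector w_i = X_i^t y
        w = w / np.linalg.norm(w)                      # normalize, ||w|| = 1 as in (5)
        t = Xi @ w                                     # component t_i = X_i w_i
        Xi = Xi - np.outer(t, t @ Xi) / (t @ t)        # deflation X_{i+1} = X_i - P_{t_i} X_i
        W.append(w)
        T.append(t)
    W, T = np.array(W).T, np.array(T).T
    # Fitted values: projection of y onto the span of the (mutually orthogonal) components.
    y_hat = sum((t @ y) / (t @ t) * t for t in T.T)
    return W, T, y_hat
```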

Of course, the orthogonal basis $W^{(m)}$ of the Krylov space only exists if $\dim \mathcal{K}^{(m)} = m$, which might not be true for all $m \le p$. The maximal number for which this holds is $m^*$ (see Proposition 1). Note however that it follows from (4) that

$$\mathcal{K}^{(m^*-1)} \subset \mathcal{K}^{(m^*)} = \mathcal{K}^{(m^*+1)} = \ldots = \mathcal{K}^{(p)},$$

and the solution of the optimization problem does not change anymore. Hence there is no loss of generality if we make the assumption that

$$\dim \mathcal{K}^{(m)} = m. \qquad (8)$$

Remark 4. We have $\hat\beta^{(m^*)}_{PLS} = \hat\beta_{OLS}$.

Proof. This result is well known and is usually proven using the fact that after the maximal number of steps the vectors $t_1, \ldots, t_{m^*}$ span the same space as the columns of X. Here we present an algebraic proof that exploits the relationship between PLS and Krylov spaces. We show that the OLS estimator is an element of $\mathcal{K}^{(m^*)}$, that is, $\hat\beta_{OLS} = \pi_{OLS}(A)\, b$ for a polynomial $\pi_{OLS}$ of degree $m^* - 1$. We define this polynomial via the $m^*$ equations

$$\pi_{OLS}(\lambda_i) = \frac{1}{\lambda_i}, \quad \lambda_i \in M.$$

In matrix notation, this equals

$$\pi_{OLS}(\Lambda)\, s = \Lambda^- s. \qquad (9)$$

Using (3) and (9), we conclude that

$$\pi_{OLS}(A)\, b = U \pi_{OLS}(\Lambda) \Sigma V^t y = U \pi_{OLS}(\Lambda)\, s = U \Lambda^- s = \hat\beta_{OLS}.$$

Set

$$D^{(m)} = (W^{(m)})^t A W^{(m)} \in \mathbb{R}^{m\times m}.$$

Proposition 5. The matrix $D^{(m)}$ is symmetric and positive semidefinite. Furthermore, $D^{(m)}$ is tridiagonal: $d_{ij} = 0$ for $|i - j| \ge 2$.

Proof. The first two statements are obvious. Let $j \ge i + 2$. As $w_i \in \mathcal{K}^{(i)}$, the vector $A w_i$ lies in the subspace $\mathcal{K}^{(i+1)}$. As $j > i + 1$, the vector $w_j$ is orthogonal to $\mathcal{K}^{(i+1)}$, in other words $d_{ji} = \langle w_j, A w_i \rangle = 0$. As $D^{(m)}$ is symmetric, we also have $d_{ij} = 0$, which proves the assertion.
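Proposition 3 and Proposition 5 can be checked numerically. The sketch below (assuming $\dim \mathcal{K}^{(m)} = m$; all function names are my own) computes the PLS estimator from the Krylov matrix as in (7) and the matrix $D^{(m)}$, which comes out symmetric and tridiagonal up to rounding:

```python
import numpy as np

def pls_via_krylov(X, y, m):
    """Sketch of Propositions 3 and 5: PLS estimator from the Krylov matrix,
    plus the tridiagonal matrix D^(m) = (W^(m))^t A W^(m)."""
    A, b = X.T @ X, X.T @ y
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    # Equation (7): beta_PLS^(m) = K [K^t A K]^- K^t b.
    beta_pls = K @ np.linalg.pinv(K.T @ A @ K) @ K.T @ b
    # Orthonormal basis W^(m) of the Krylov space via QR (Gram-Schmidt).
    W, _ = np.linalg.qr(K)
    D = W.T @ A @ W               # symmetric and tridiagonal (Proposition 5)
    return beta_pls, D
```

For m = m*, the same computation reproduces the OLS estimator, in line with Remark 4.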

4 Tridiagonal matrices

We see in Section 6 that the matrices $D^{(m)}$ and their eigenvalues determine the shrinkage factors of the PLS estimator. To prove this, we now list some properties of $D^{(m)}$.

Definition 6. A symmetric tridiagonal matrix D is called unreduced if all subdiagonal entries are non-zero, i.e. $d_{i,i+1} \neq 0$ for all i.

Theorem 7 (Parlett 1998). All eigenvalues of an unreduced tridiagonal matrix are distinct.

Set

$$D^{(m)} = \begin{pmatrix}
a_1 & b_1 & & \\
b_1 & a_2 & \ddots & \\
& \ddots & \ddots & b_{m-1} \\
& & b_{m-1} & a_m
\end{pmatrix}.$$

Proposition 8. If $\dim \mathcal{K}^{(m)} = m$, the matrix $D^{(m)}$ is unreduced. More precisely, $b_i > 0$ for all $i \in \{1, \ldots, m-1\}$.

Proof. Set $p_i = A^{i-1} b$ and denote by $w_1, \ldots, w_m$ the basis (6) obtained by Gram-Schmidt. Its existence is guaranteed as we assume that $\dim \mathcal{K}^{(m)} = m$. We have to show that $b_i = \langle w_i, A w_{i-1} \rangle > 0$. As the length of $w_i$ does not change the sign of $b_i$, we can assume that the vectors $w_i$ are not normalized to have length 1. By definition,

$$w_i = p_i - \sum_{k=1}^{i-1} \frac{\langle p_i, w_k \rangle}{\langle w_k, w_k \rangle}\, w_k. \qquad (10)$$

As the vectors $w_i$ are pairwise orthogonal, it follows that $\langle w_i, p_i \rangle = \langle w_i, w_i \rangle > 0$.

We conclude that

$$b_i = \langle w_i, A w_{i-1} \rangle
\overset{(10)}{=} \left\langle w_i,\; A p_{i-1} - \sum_{k=1}^{i-2} \frac{\langle p_{i-1}, w_k \rangle}{\langle w_k, w_k \rangle}\, A w_k \right\rangle
= \langle w_i, A p_{i-1} \rangle
= \langle w_i, p_i \rangle
= \langle w_i, w_i \rangle > 0.$$

Here the third equality holds as $\langle w_i, A w_k \rangle = 0$ for $k \le i-2$ (Proposition 5), the fourth as $A p_{i-1} = p_i$, and the last was established above.

Note that the matrix $D^{(m-1)}$ is obtained from $D^{(m)}$ by deleting the last column and row of $D^{(m)}$. It follows that we can give a recursive formula for the characteristic polynomials of $D^{(m)}$. With $\chi^{(m)} := \chi_{D^{(m)}}$ we have

$$\chi^{(m)}(\lambda) = (a_m - \lambda)\, \chi^{(m-1)}(\lambda) - b_{m-1}^2\, \chi^{(m-2)}(\lambda). \qquad (11)$$

We want to deduce properties of the eigenvalues of $D^{(m)}$ and A and explore their relationship. Denote the eigenvalues of $D^{(m)}$ by

$$\mu^{(m)}_1 > \ldots > \mu^{(m)}_m \ge 0. \qquad (12)$$

Remark 9. All eigenvalues of $D^{(m^*)}$ are eigenvalues of A.

Proof. First note that $A|_{\mathcal{K}^{(m^*)}}: \mathcal{K}^{(m^*)} \to \mathcal{K}^{(m^*+1)} = \mathcal{K}^{(m^*)}$. As the columns of the matrix $W^{(m^*)}$ form an orthonormal basis of $\mathcal{K}^{(m^*)}$, the matrix $D^{(m^*)} = (W^{(m^*)})^t A W^{(m^*)}$ represents $A|_{\mathcal{K}^{(m^*)}}$ with respect to this basis. As any eigenvalue of $A|_{\mathcal{K}^{(m^*)}}$ is obviously an eigenvalue of A, the proof is complete.
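The three-term recursion (11) is straightforward to evaluate. The following sketch (my own helper, not from the paper) computes the characteristic polynomials of a symmetric tridiagonal matrix from its diagonal a and subdiagonal b, and can be checked against numerically computed eigenvalues:

```python
import numpy as np

def char_poly(a, b, lam):
    """Evaluate chi^(1),...,chi^(m) of the tridiagonal matrix with diagonal a and
    subdiagonal b at the point lam, via recursion (11)."""
    chi = [1.0, a[0] - lam]                    # chi^(0) = 1, chi^(1)(lam) = a_1 - lam
    for k in range(1, len(a)):
        chi.append((a[k] - lam) * chi[-1] - b[k - 1] ** 2 * chi[-2])
    return chi[1:]

# Sanity check: chi^(m) vanishes at every eigenvalue of D^(m).
a, b = np.array([2.0, 3.0, 4.0]), np.array([1.0, 0.5])
D = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
for mu in np.linalg.eigvalsh(D):
    print(round(char_poly(a, b, mu)[-1], 10))  # ~0 up to rounding
```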

The following theorem is a special form of the Cauchy Interlace Theorem. In this version, we use a general result from Parlett (1998) and exploit the tridiagonal structure of $D^{(m)}$.

Theorem 10. Each interval $\left[\mu^{(m)}_{m-j}, \mu^{(m)}_{m-(j+1)}\right]$ ($j = 0, \ldots, m-2$) contains a different eigenvalue of $D^{(m+k)}$ ($k \ge 1$). In addition, there is a different eigenvalue of $D^{(m+k)}$ outside the open interval $\left(\mu^{(m)}_m, \mu^{(m)}_1\right)$. This theorem ensures in particular that there is a different eigenvalue of A in each interval $\left[\mu^{(m)}_k, \mu^{(m)}_{k-1}\right]$. Theorem 10 holds independently of assumption (8).

Proof. By definition, for $k \ge 1$,

$$D^{(m+k)} = \begin{pmatrix} D^{(m)} & \Delta^t \\ \Delta & \ast \end{pmatrix},$$

where the only non-zero entry of $\Delta \in \mathbb{R}^{k\times m}$ is $b_m$ in its upper right corner; in other words, $D^{(m)}$ is the leading principal $m \times m$ submatrix of $D^{(m+k)}$. An application of the corresponding interlacing theorem in Parlett (1998) gives the desired result.

Lemma 11. If $D^{(m)}$ is unreduced, the eigenvalues of $D^{(m)}$ and the eigenvalues of $D^{(m-1)}$ are distinct.

Proof. Suppose the two matrices have a common eigenvalue $\lambda$. It follows from (11) and the fact that $D^{(m)}$ is unreduced that $\lambda$ is an eigenvalue of $D^{(m-2)}$. Repeating this argument, we deduce that $a_1$ is an eigenvalue of $D^{(2)}$, a contradiction, as $0 = \chi^{(2)}(a_1) = -b_1^2 < 0$.

Remark 12. In general it is not true that $D^{(m)}$ and a submatrix $D^{(k)}$ have distinct eigenvalues. Consider the case where $a_i = c$ for all i. Using equation (11), we conclude that c is an eigenvalue of every submatrix $D^{(m)}$ with m odd.

Proposition 13. If $\dim \mathcal{K}^{(m)} = m$, we have $\det\left(D^{(m-1)}\right) \neq 0$.

Proof. The matrix $D^{(m)}$ is positive semidefinite, hence all eigenvalues of $D^{(m)}$ are $\ge 0$. In other words, $\det\left(D^{(m-1)}\right) \neq 0$ if and only if its smallest eigenvalue $\mu^{(m-1)}_{m-1}$ is $> 0$. Using Theorem 10, we have

$$\mu^{(m-1)}_{m-1} \ge \mu^{(m)}_m \ge 0.$$

As $\dim \mathcal{K}^{(m)} = m$, the matrix $D^{(m)}$ is unreduced, which implies that $D^{(m)}$ and $D^{(m-1)}$ have no common eigenvalues (see Lemma 11). We can therefore replace the first $\ge$ by $>$, i.e. the smallest eigenvalue of $D^{(m-1)}$ is $> 0$.

It is well known that the matrices $D^{(m)}$ are closely related to the so-called Rayleigh-Ritz procedure, a method that is used to approximate eigenvalues. For details consult e.g. Parlett (1998).

5 What is shrinkage?

We presented two estimators for the regression parameter $\beta$, OLS and PLS, which also define estimators for $X\beta$ via $\hat y = X\hat\beta$. One possibility to evaluate the quality of an estimator is to determine its Mean Squared Error (MSE). In general, the MSE of an estimator $\hat\theta$ for a vector-valued parameter $\theta$ is defined as

$$\mathrm{MSE}(\hat\theta) = E\left[ \mathrm{trace}\left( (\hat\theta - \theta)(\hat\theta - \theta)^t \right) \right] = E\left[ (\hat\theta - \theta)^t (\hat\theta - \theta) \right] = \left( E[\hat\theta] - \theta \right)^t \left( E[\hat\theta] - \theta \right) + E\left[ \left( \hat\theta - E[\hat\theta] \right)^t \left( \hat\theta - E[\hat\theta] \right) \right].$$

This is the well-known bias-variance decomposition of the MSE. The first part is the squared bias and the second part is the variance term.

We start by investigating the class of linear estimators, i.e. estimators that are of the form $\hat\theta = Sy$ for a matrix S that does not depend on y. It follows immediately from the regression model (1) and (2) that for a linear estimator,

$$E[\hat\theta] = SX\beta, \qquad \mathrm{var}[\hat\theta] = \sigma^2\, \mathrm{trace}(SS^t).$$

The OLS estimators are linear:

$$\hat\beta_{OLS} = (X^tX)^- X^t y, \qquad \hat y_{OLS} = X (X^tX)^- X^t y.$$

Note that the estimator $\hat y_{OLS}$ is simply the projection $P_X$ of y onto the space spanned by the columns of X. The estimator $\hat y_{OLS}$ is unbiased, as

$$E[\hat y_{OLS}] = P_X X\beta = X\beta.$$

The estimator $\hat\beta_{OLS}$ is only unbiased if $\beta \in \mathrm{range}(X^tX)$:

$$E[\hat\beta_{OLS}] = E\left[ (X^tX)^- X^t y \right] = (X^tX)^- X^t E[y] = (X^tX)^- X^t X\beta = \beta.$$

Let us now have a closer look at the variance term. It follows directly from $\mathrm{trace}(P_X P_X^t) = \mathrm{rk}(X) = p$ that

$$\mathrm{var}(\hat y_{OLS}) = \sigma^2 p.$$

For $\hat\beta_{OLS}$ we have

$$(X^tX)^- X^t \left( (X^tX)^- X^t \right)^t = (X^tX)^- = U \Lambda^- U^t,$$

hence

$$\mathrm{var}\left( \hat\beta_{OLS} \right) = \sigma^2 \sum_{i=1}^p \frac{1}{\lambda_i}. \qquad (13)$$

We conclude that the MSE of the estimator $\hat\beta_{OLS}$ depends on the eigenvalues $\lambda_1, \ldots, \lambda_p$ of $A = X^tX$. Small eigenvalues of A correspond to directions in X that have a low variance. Equation (13) shows that if some eigenvalues are small, the variance of $\hat\beta_{OLS}$ is very high, which leads to a high MSE. One possibility to (hopefully) decrease the MSE is to modify the OLS estimator by shrinking the directions of the OLS estimator that are responsible for a high variance. This of course introduces bias. We shrink the OLS estimator in the hope that the increase in bias is small compared to the decrease in variance. In general, a shrinkage estimator for $\beta$ is of the form

$$\hat\beta_{shr} = \sum_{i=1}^p f(\lambda_i)\, z_i,$$

where f is some real-valued function. The values $f(\lambda_i)$ are called shrinkage factors. Examples are:

Principal Component Regression:

$$f(\lambda_i) = \begin{cases} 1 & i\text{th principal component included} \\ 0 & \text{otherwise} \end{cases}$$

and Ridge Regression:

$$f(\lambda_i) = \frac{\lambda_i}{\lambda_i + \lambda}$$

with $\lambda > 0$ the Ridge parameter. We illustrate in Section 6 that PLS is a shrinkage estimator as well. It turns out that the shrinkage behavior of PLS regression is rather complicated.

Let us investigate in which way the MSE of the estimator is influenced by the shrinkage factors. If the shrinkage estimators are linear, i.e. the shrinkage factors do not depend on y, this is an easy task. Let us first write the shrinkage estimator in matrix notation. We have

$$\hat\beta_{shr} = S_{shr}\, y = U \Sigma^- D_{shr} V^t y.$$

The diagonal matrix $D_{shr}$ has entries $f(\lambda_i)$. The shrinkage estimator for y is

$$\hat y_{shr} = X S_{shr}\, y = V \Sigma \Sigma^- D_{shr} V^t y.$$

We calculate the variance of these estimators:

$$\mathrm{trace}\left( S_{shr} S_{shr}^t \right) = \mathrm{trace}\left( U \Sigma^- D_{shr} D_{shr} \Sigma^- U^t \right) = \mathrm{trace}\left( \Sigma^- D_{shr} \Sigma^- D_{shr} \right) = \sum_{i=1}^p \frac{(f(\lambda_i))^2}{\lambda_i}$$

and

$$\mathrm{trace}\left( X S_{shr} S_{shr}^t X^t \right) = \mathrm{trace}\left( V \Sigma\Sigma^- D_{shr} \Sigma\Sigma^- D_{shr} V^t \right) = \mathrm{trace}\left( \Sigma\Sigma^- D_{shr} \Sigma\Sigma^- D_{shr} \right) = \sum_{i=1}^p (f(\lambda_i))^2.$$

Next, we calculate the bias of the two shrinkage estimators. We have

$$E[S_{shr}\, y] = S_{shr} X\beta = U \Sigma D_{shr} \Sigma^- U^t \beta.$$

It follows that

$$\mathrm{bias}^2\left( \hat\beta_{shr} \right) = \left( E[S_{shr} y] - \beta \right)^t \left( E[S_{shr} y] - \beta \right) = \left( U^t\beta \right)^t \left( \Sigma D_{shr} \Sigma^- - I_p \right)^t \left( \Sigma D_{shr} \Sigma^- - I_p \right) \left( U^t\beta \right) = \sum_{i=1}^p (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2.$$

Replacing $S_{shr}$ by $X S_{shr}$, it is easy to show that

$$\mathrm{bias}^2\left( \hat y_{shr} \right) = \sum_{i=1}^p \lambda_i (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2.$$

Proposition 14. For the shrinkage estimators $\hat\beta_{shr}$ and $\hat y_{shr}$ defined above, we have

$$\mathrm{MSE}\left( \hat\beta_{shr} \right) = \sum_{i=1}^p (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2 + \sigma^2 \sum_{i=1}^p \frac{(f(\lambda_i))^2}{\lambda_i},$$

$$\mathrm{MSE}\left( \hat y_{shr} \right) = \sum_{i=1}^p \lambda_i (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2 + \sigma^2 \sum_{i=1}^p (f(\lambda_i))^2.$$

If the shrinkage factors are deterministic, i.e. they do not depend on y, any value $f(\lambda_i) \neq 1$ increases the bias. Values $|f(\lambda_i)| < 1$ decrease the variance, whereas values $|f(\lambda_i)| > 1$ increase the variance. Hence an absolute value > 1 is always undesirable. The situation might be different for stochastic shrinkage factors. We discuss this in the following section.

Note that there is a different notion of shrinkage, namely that the $L_2$-norm of an estimator is smaller than the $L_2$-norm of the OLS estimator. Why is this a desirable property? Let us again consider the case of linear estimators. Set $\hat\theta_i = S_i y$ for $i = 1, 2$. We have

$$\|\hat\theta_i\|^2 = y^t S_i^t S_i y.$$

The property that for all $y \in \mathbb{R}^n$

$$\|\hat\theta_1\|^2 \le \|\hat\theta_2\|^2$$

is equivalent to the condition that $S_1^t S_1 - S_2^t S_2$ is negative semidefinite. The trace of a negative semidefinite matrix is $\le 0$. Furthermore, $\mathrm{trace}(S_i^t S_i) = \mathrm{trace}(S_i S_i^t)$, so we conclude that

$$\mathrm{var}\left( \hat\theta_1 \right) \le \mathrm{var}\left( \hat\theta_2 \right).$$
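As a small illustration of Proposition 14, the MSE of a linear shrinkage estimator with deterministic factors can be evaluated directly from the eigenvalues and the quantities $u_i^t\beta$. This is a sketch under the stated model assumptions; the shrinkage functions below are my own encodings of the two examples given earlier:

```python
import numpy as np

# Shrinkage factor functions for the two examples in the text (eigenvalues are
# assumed to be sorted in descending order, as returned by the SVD).
ridge_factors = lambda lam, ridge_par=1.0: lam / (lam + ridge_par)
pcr_factors = lambda lam, k=2: (np.arange(len(lam)) < k).astype(float)

def mse_linear_shrinkage(X, beta, sigma2, factor_fn):
    """Sketch of Proposition 14: MSE of a linear shrinkage estimator with
    deterministic shrinkage factors, for beta_shr and y_shr."""
    _, sv, Ut = np.linalg.svd(X, full_matrices=False)
    lam = sv ** 2                               # eigenvalues lambda_i of X^t X
    alpha = Ut @ beta                           # u_i^t beta
    f = factor_fn(lam)
    mse_beta = np.sum((f - 1) ** 2 * alpha ** 2) + sigma2 * np.sum(f ** 2 / lam)
    mse_yhat = np.sum(lam * (f - 1) ** 2 * alpha ** 2) + sigma2 * np.sum(f ** 2)
    return mse_beta, mse_yhat
```

For the OLS factors $f(\lambda_i) \equiv 1$ the bias terms vanish and the variance terms reduce to (13) and $\sigma^2 p$, which is a quick consistency check.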

It is known (de Jong 1995) that

$$\left\| \hat\beta^{(1)}_{PLS} \right\|^2 \le \left\| \hat\beta^{(2)}_{PLS} \right\|^2 \le \ldots \le \left\| \hat\beta^{(m^*)}_{PLS} \right\|^2 = \left\| \hat\beta_{OLS} \right\|^2.$$

6 The shrinkage factors of PLS

In this section, we give a simpler and clearer proof of the shape of the shrinkage factors of PLS. Basically, we combine the results of Butler & Denham (2000) and Phatak & de Hoog (2002). In the rest of the section, we assume that $m < m^*$, as the shrinkage factors for $\hat\beta^{(m^*)}_{PLS} = \hat\beta_{OLS}$ are trivial, i.e. $f^{(m^*)}(\lambda_i) = 1$.

By definition of the PLS estimator, $\hat\beta^{(m)}_{PLS} \in \mathcal{K}^{(m)}$. Hence there is a polynomial $\pi$ of degree $m-1$ with $\hat\beta^{(m)}_{PLS} = \pi(A)\, b$. Recall that the eigenvalues of $D^{(m)}$ are denoted by $\mu^{(m)}_i$. Set

$$f^{(m)}(\lambda) := 1 - \prod_{i=1}^m \left( 1 - \frac{\lambda}{\mu^{(m)}_i} \right) = 1 - \frac{1}{\chi^{(m)}(0)}\, \chi^{(m)}(\lambda).$$

As $f^{(m)}(0) = 0$, there is a polynomial $\pi^{(m)}$ of degree $m-1$ such that

$$f^{(m)}(\lambda) = \lambda\, \pi^{(m)}(\lambda). \qquad (14)$$

Proposition 15 (Phatak & de Hoog 2002). Suppose that $m < m^*$. We have

$$\hat\beta^{(m)}_{PLS} = \pi^{(m)}(A)\, b.$$

Proof (Phatak & de Hoog 2002). Using either equation (14) or the Cayley-Hamilton theorem (recall Proposition 13), it is easy to prove that

$$\left( D^{(m)} \right)^{-1} = \pi^{(m)}\left( D^{(m)} \right).$$

We plug this into equation (7) and obtain

$$\hat\beta^{(m)}_{PLS} = W^{(m)} \pi^{(m)}\left( (W^{(m)})^t A W^{(m)} \right) (W^{(m)})^t b.$$

Recall that the columns of $W^{(m)}$ form an orthonormal basis of $\mathcal{K}^{(m)}$. It follows that $W^{(m)} (W^{(m)})^t$ is the operator that projects onto the space $\mathcal{K}^{(m)}$. In particular,

$$W^{(m)} (W^{(m)})^t A^j b = A^j b$$

for $j = 1, \ldots, m-1$. This implies that $\hat\beta^{(m)}_{PLS} = \pi^{(m)}(A)\, b$.

Using (14), we can immediately conclude the following corollary.

Corollary 16 (Phatak & de Hoog 2002). Suppose that $\dim \mathcal{K}^{(m)} = m$. If we denote by $z_i$ the component of $\hat\beta_{OLS}$ along the ith eigenvector of A, then

$$\hat\beta^{(m)}_{PLS} = \sum_{i=1}^p f^{(m)}(\lambda_i)\, z_i,$$

with $f^{(m)}(\lambda)$ defined in (14).

We now show that some of the shrinkage factors of PLS are $\ge 1$.

Theorem 17 (Butler & Denham 2000). For each $m \le m^* - 1$, we can decompose the interval $[\lambda_p, \lambda_1]$ into $m + 1$ disjoint intervals¹ $I_1 \le I_2 \le \ldots \le I_{m+1}$ such that

$$f^{(m)}(\lambda_i) \begin{cases} \le 1 & \lambda_i \in I_j \text{ and } j \text{ odd} \\ \ge 1 & \lambda_i \in I_j \text{ and } j \text{ even.} \end{cases}$$

Proof. Set $g^{(m)}(\lambda) = 1 - f^{(m)}(\lambda)$. It follows from the definition of $f^{(m)}$ that the zeros of $g^{(m)}(\lambda)$ are $\mu^{(m)}_m, \ldots, \mu^{(m)}_1$. As $D^{(m)}$ is unreduced, all eigenvalues are distinct. Set $\mu^{(m)}_0 = \lambda_1$ and $\mu^{(m)}_{m+1} = \lambda_p$ and define $I_j = \,]\mu^{(m)}_{m+2-j}, \mu^{(m)}_{m+1-j}[$ for $j = 1, \ldots, m+1$. By definition, $g^{(m)}(0) = 1$. Hence $g^{(m)}(\lambda)$ is non-negative on the intervals $I_j$ if j is odd and $g^{(m)}$ is non-positive on the intervals $I_j$ if j is even. It follows from Theorem 10 that all intervals $I_j$ contain at least one eigenvalue $\lambda_i$ of A.

In general it is not true that $f^{(m)}(\lambda_i) \le 1$ for all $\lambda_i$ and $m = 1, \ldots, m^*$. Using the example in Remark 12 and the fact that $f^{(m)}(\lambda_i) = 1$ is equivalent to the condition that $\lambda_i$ is an eigenvalue of $D^{(m)}$, it is easy to construct a counterexample.

¹ We say that $I_j \le I_k$ if $\sup I_j \le \inf I_k$.
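Corollary 16 suggests a direct way to compute the PLS shrinkage factors from the eigenvalues of $D^{(m)}$. The sketch below is my own implementation of this formula; it assumes distinct, non-zero eigenvalues so that the orderings of the eigendecomposition and the SVD are unambiguous:

```python
import numpy as np

def pls_shrinkage_factors(X, y, m):
    """Sketch of Corollary 16: compute f^(m)(lambda_i) = 1 - prod_j (1 - lambda_i / mu_j^(m))."""
    A, b = X.T @ X, X.T @ y
    lam = np.linalg.eigvalsh(A)[::-1]            # eigenvalues lambda_1 >= ... >= lambda_p
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    W, _ = np.linalg.qr(K)                       # orthonormal basis of the Krylov space
    mu = np.linalg.eigvalsh(W.T @ A @ W)         # eigenvalues mu_i^(m) of D^(m)
    factors = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)
    return factors, lam
```

In line with Theorem 17, some of the computed factors typically turn out to be larger than 1.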

Using some of the results of Section 4, we can however deduce that some factors are indeed $\neq 1$. As all eigenvalues of $D^{(m^*-1)}$ and $D^{(m^*)}$ are distinct (Lemma 11), we see that $f^{(m^*-1)}(\lambda_i) \neq 1$ for all i. In particular,

$$f^{(m^*-1)}(\lambda_1) \begin{cases} < 1 & m^* \text{ odd} \\ > 1 & m^* \text{ even.} \end{cases}$$

More generally, using Lemma 11, we conclude that $f^{(m-1)}(\lambda_i) = 1$ and $f^{(m)}(\lambda_i) = 1$ cannot hold simultaneously. In practice, i.e. calculated on a data set, the shrinkage factors seem to be $\neq 1$ all of the time. Furthermore,

$$0 \le f^{(m)}(\lambda_p) < 1.$$

To prove this, we set $g^{(m)}(\lambda) = 1 - f^{(m)}(\lambda)$. We have $g^{(m)}(0) = 1$. Furthermore, the smallest positive zero of $g^{(m)}(\lambda)$ is $\mu^{(m)}_m$, and it follows from Theorem 10 and Lemma 11 that $\lambda_p < \mu^{(m)}_m$. Hence $g^{(m)}(\lambda_p) \in \,]0, 1]$.

Using Theorem 10, it is possible to bound the quantities $\frac{\lambda_p}{\mu^{(m)}_i}$ and $1 - \frac{\lambda_i}{\mu^{(m)}_i}$ more precisely. From this we can derive bounds on the shrinkage factors. We do not pursue this further; readers who are interested in the bounds should consult Lingjaerde & Christopherson (2000). Instead, we have a closer look at the MSE of the PLS estimator. In Section 5, we showed that a value $f^{(m)}(\lambda_i) > 1$ is not desirable, as both the bias and the variance of the estimator increase. Note however that in the case of PLS, the factors $f^{(m)}(\lambda_i)$ are stochastic; they depend on y in a nonlinear way. The variance of the PLS estimator for the ith principal component is

$$\mathrm{var}\left( f^{(m)}(\lambda_i)\, \frac{v_i^t y}{\sqrt{\lambda_i}} \right)$$

with both $f^{(m)}(\lambda_i)$ and $\frac{v_i^t y}{\sqrt{\lambda_i}}$ depending on y.

Among others, Frank & Friedman (1993) propose to truncate the shrinkage factors of the PLS estimator in the following way. Set

$$\tilde f^{(m)}(\lambda_i) = \begin{cases} +1 & f^{(m)}(\lambda_i) > +1 \\ -1 & f^{(m)}(\lambda_i) < -1 \\ f^{(m)}(\lambda_i) & \text{otherwise} \end{cases}$$

and define a new estimator:

$$\hat\beta^{(m)}_{TRN} := \sum_{i=1}^p \tilde f^{(m)}(\lambda_i)\, z_i. \qquad (15)$$

If the shrinkage factors are deterministic numbers, this will improve the MSE (cf. Section 5). But in the case of stochastic shrinkage factors, the situation might be different. Let us suppose for a moment that $f^{(m)}(\lambda_i) = \frac{\sqrt{\lambda_i}}{v_i^t y}$. It follows that

$$0 = \mathrm{var}\left( f^{(m)}(\lambda_i)\, \frac{v_i^t y}{\sqrt{\lambda_i}} \right) \le \mathrm{var}\left( \tilde f^{(m)}(\lambda_i)\, \frac{v_i^t y}{\sqrt{\lambda_i}} \right),$$

so it is not clear whether the truncated estimator TRN leads to a lower MSE, as is conjectured e.g. in Frank & Friedman (1993). The assumption that $f^{(m)}(\lambda_i) = \frac{\sqrt{\lambda_i}}{v_i^t y}$ is of course purely hypothetical. It is not clear whether the shrinkage factors behave this way. It is hard, if not infeasible, to derive statistical properties of the PLS estimator or its shrinkage factors, as they depend on y in a complicated, nonlinear way. As an alternative, we compare the two different estimators on different data sets.

7 Experiments

In this section, we explore the difference between the methods PLS and TRN. We investigate several artificial datasets and one real world example.

Simulation

We compare the MSE of the two methods, PLS and truncated PLS, on 27 different artificial data sets. We use a setting similar to the one in Frank & Friedman (1993). For each data set, the number of examples is n = 50. We consider three different numbers of predictor variables: p = 5, 40, 100.
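The truncated estimator (15) only requires clipping the factors before recombining the components $z_i$. The following self-contained sketch is my own code, with the same eigenvalue-ordering assumption as in the previous sketch:

```python
import numpy as np

def truncated_pls(X, y, m):
    """Sketch of the TRN estimator (15): PLS shrinkage factors clipped to [-1, +1].
    Assumes distinct eigenvalues so that the SVD and eigendecomposition orderings match."""
    V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
    U, lam = Ut.T, sigma ** 2
    z = (V.T @ y / sigma) * U                            # OLS components z_i
    A, b = X.T @ X, X.T @ y
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    W, _ = np.linalg.qr(K)
    mu = np.linalg.eigvalsh(W.T @ A @ W)                 # eigenvalues of D^(m)
    f = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)   # f^(m)(lambda_i)
    return z @ np.clip(f, -1.0, 1.0)                     # truncation as in (15)
```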

The input data X is chosen according to a multivariate normal distribution with zero mean and covariance matrix C. We consider three different covariance matrices:

$$C_1 = I_p, \qquad (C_2)_{ij} = \frac{1}{|i-j| + 1}, \qquad (C_3)_{ij} = \begin{cases} 1 & i = j \\ 0.7 & i \neq j. \end{cases}$$

The matrices $C_1$, $C_2$ and $C_3$ correspond to no, moderate and high collinearity respectively. The regression vector $\beta$ is a randomly chosen vector $\beta \in \{0, 1\}^p$. In addition, we consider three different signal-to-noise ratios:

$$\mathrm{stnr} = \frac{\mathrm{var}(X\beta)}{\sigma^2} = 1, 3, 7.$$

This yields 3 x 3 x 3 = 27 different parameter settings. For each setting, we estimate the MSE of the two methods as follows. For $k = 1, \ldots, K = 200$, we generate y according to (1) and (2). We determine for each method and each m the respective estimator $\hat\beta_k$ and define

$$\widehat{\mathrm{MSE}}(\hat\beta) = \frac{1}{K} \sum_{k=1}^K \left( \hat\beta_k - \beta \right)^t \left( \hat\beta_k - \beta \right).$$

If there are more predictor variables than examples, this approach is not sensible, as the true regression vector $\beta$ is not identifiable: different regression vectors $\beta_1 \neq \beta_2$ can lead to $X\beta_1 = X\beta_2$. Hence for p = 100, we estimate the MSE of $\hat y$ for the two methods. We display the estimated MSE of the method TRN as a fraction of the estimated MSE of the method PLS, i.e. for each m we display

$$\mathrm{MSE\text{-}RATIO} = \frac{\mathrm{MSE}\left( \hat\beta^{(m)}_{TRN} \right)}{\mathrm{MSE}\left( \hat\beta^{(m)}_{PLS} \right)}.$$

(As already mentioned, we display the MSE-RATIO for $\hat y$ in the case p = 100.) The results are displayed in Figures 1, 2 and 3. In order to have a compact representation, we consider the averaged MSE-RATIOS for different parameter settings. E.g. we fix a degree of collinearity (say high collinearity) and display the averaged MSE-RATIO over the three different signal-to-noise ratios. The results for all 27 data sets are shown in the tables in the appendix.
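The simulation design can be reproduced along the following lines. This is my own implementation of the description above, not the original simulation code:

```python
import numpy as np

def simulate_setting(n=50, p=5, collinearity="high", stnr=3, rng=None):
    """Sketch of the simulation design in Section 7: Gaussian X with covariance
    C_1, C_2 or C_3, binary beta, and noise variance matching the given stnr."""
    rng = np.random.default_rng() if rng is None else rng
    if collinearity == "no":
        C = np.eye(p)                                           # C_1 = I_p
    elif collinearity == "moderate":
        idx = np.arange(p)
        C = 1.0 / (np.abs(idx[:, None] - idx[None, :]) + 1.0)   # (C_2)_ij = 1/(|i-j|+1)
    else:
        C = np.full((p, p), 0.7) + 0.3 * np.eye(p)              # (C_3): 1 on diag, 0.7 off
    X = rng.multivariate_normal(np.zeros(p), C, size=n)
    X -= X.mean(axis=0)                                         # center, as assumed in Section 2
    beta = rng.integers(0, 2, size=p).astype(float)             # beta in {0,1}^p
    sigma2 = np.var(X @ beta) / stnr                            # stnr = var(X beta) / sigma^2
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y - y.mean(), beta, sigma2
```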

Figure 1: MSE-RATIO for p = 5. The figures show the averaged MSE-RATIO for different parameter settings. Left: comparison for high (solid line), moderate (dotted line) and no (dashed line) collinearity. Right: comparison for stnr 1 (solid line), 3 (dotted line) and 7 (dashed line).

Figure 2: MSE-RATIO for p = 40.

There are several observations. The MSE of TRN is lower almost all of the time. The decrease of the MSE is particularly large if the number of components m is small, but > 1. For larger m, the difference decreases. This is not surprising, as for large m, the difference between the PLS estimator and the OLS estimator decreases. Hence we expect the difference between TRN and PLS to become smaller. The reduction of the MSE is particularly prominent in complex situations, i.e. in situations with collinearity in X or with a low signal-to-noise ratio.

Figure 3: MSE-RATIO for p = 100. In this case, we display the MSE-RATIO for ŷ instead of β. Only the first 20 components are displayed.

Another feature, which cannot be deduced from Figures 1, 2 and 3 but from the tables in the appendix, is the fact that the optimal numbers of components

$$m^{opt}_{PLS} = \arg\min_m \mathrm{MSE}\left( \hat\beta^{(m)}_{PLS} \right), \qquad m^{opt}_{TRN} = \arg\min_m \mathrm{MSE}\left( \hat\beta^{(m)}_{TRN} \right)$$

are equal almost all of the time. This is also true if we consider the MSE of ŷ. We can benefit from this if we want to select an optimal model for truncated PLS. We return to this subject in Section 8.

Real world data

In this example, we consider the near infrared spectra (NIR) of n = 171 meat samples, measured at p = 100 different wavelengths. This data set is taken from the StatLib datasets archive. The task is to predict the fat content of a meat sample on the basis of its NIR spectrum. We choose this dataset as PLS is widely used in the chemometrics field. In this type of application, we usually observe a lot of predictor variables which are highly correlated. We estimate the MSE of the two methods, PLS and truncated PLS, by computing the 10-fold cross-validated error of the two estimators. The results are displayed in Figure 4. Again, TRN is better almost all of the time, although the difference is small. Note furthermore that the optimal numbers of components are almost identical for the two methods: we have $m^{opt}_{PLS} = 15$ and $m^{opt}_{TRN} = 16$.
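The cross-validated comparison can be organized as in the following generic sketch (my own code; `estimator` stands for any map from training data and number of components to a coefficient vector, e.g. the PLS or TRN sketches above):

```python
import numpy as np

def cv_error(X, y, m, estimator, n_folds=10, rng=None):
    """Sketch of a 10-fold cross-validated test error for a given estimator."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta = estimator(X[train], y[train], m)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errors)
```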

Figure 4: 10-fold cross-validated test error for the Tecator data set. The solid line corresponds to PLS, the dashed line corresponds to truncated PLS.

8 Discussion

We saw in Section 7 that bounding the absolute value of the PLS shrinkage factors by one seems to improve the MSE of the estimator. So should we now discard PLS and always use TRN instead? There are (at least) two objections. Firstly, it would be somewhat rash to rely on the results of a small-scale simulation study. Secondly, TRN is computationally more expensive than PLS: we need the full singular value decomposition of X, and in each step, we have to compute the PLS estimator and adjust its shrinkage factors by hand. However, the experiments suggest that it can be worthwhile to compare PLS and truncated PLS. We pointed out in Section 7 that the two methods do not seem to differ much in terms of the optimal number of components. In order to reduce the computational costs of truncated PLS, we therefore suggest the following strategy. We first compute the optimal PLS model on a training set and choose the optimal model with the help of a model selection criterion. In a second step, we truncate the shrinkage factors of the optimal model. We then use a validation set in order to quantify the difference between PLS and TRN and choose the method with the lower validation error.

References

Butler, N. & Denham, M. (2000), 'The Peculiar Shrinkage Properties of Partial Least Squares Regression', Journal of the Royal Statistical Society, Series B 62(3).

de Jong, S. (1995), 'PLS shrinks', Journal of Chemometrics 9.

Frank, I. & Friedman, J. (1993), 'A Statistical View of some Chemometrics Regression Tools', Technometrics 35.

Helland, I. (1988), 'On the Structure of Partial Least Squares Regression', Communications in Statistics, Simulation and Computation 17(2).

Höskuldsson, A. (1988), 'PLS Regression Methods', Journal of Chemometrics 2.

Lanczos, C. (1950), 'An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators', Journal of Research of the National Bureau of Standards 45.

Lingjaerde, O. & Christopherson, N. (2000), 'Shrinkage Structures of Partial Least Squares', Scandinavian Journal of Statistics 27.

Parlett, B. (1998), The Symmetric Eigenvalue Problem, Society for Industrial and Applied Mathematics.

Phatak, A. & de Hoog, F. (2002), 'Exploiting the Connection between PLS, Lanczos, and Conjugate Gradients: Alternative Proofs of some Properties of PLS', Journal of Chemometrics 16.

Rosipal, R. & Krämer, N. (2006), 'Overview and Recent Advances in Partial Least Squares', in Subspace, Latent Structure and Feature Selection Techniques, Lecture Notes in Computer Science, Springer.

Wold, H. (1975), 'Path Models with Latent Variables: The NIPALS Approach', in H. B. et al., ed., Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, Academic Press.

A Appendix: Results of the simulation study

We display the results of the simulation study that is described in Section 7. The following tables show the MSE-RATIO for β as well as for ŷ. In addition to the MSE ratio, we display the optimal number of components for each method. It is interesting to see that the two quantities are the same almost all of the time.

Table 1: MSE-RATIO of β for p = 5. The first two rows display the setting of the parameters (collinearity: no/med./high, and stnr). The rows entitled 1-4 display the MSE ratio for the respective number of components; the last two rows display m^opt_PLS and m^opt_TRN.

Table 2: MSE-RATIO of ŷ for p = 5.

Table 3: MSE-RATIO of β for p = 40.

Table 4: MSE-RATIO of ŷ for p = 40.

Table 5: MSE-RATIO of ŷ for p = 100. We only display the results for the first 20 components, as the MSE-RATIO equals 1 (up to 4 digits after the decimal point) for the remaining components.


Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008

Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Exam 2 will be held on Tuesday, April 8, 7-8pm in 117 MacMillan What will be covered The exam will cover material from the lectures

More information

Lecture 5 Singular value decomposition

Lecture 5 Singular value decomposition Lecture 5 Singular value decomposition Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University, tieli@pku.edu.cn

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

UNIT 6: The singular value decomposition.

UNIT 6: The singular value decomposition. UNIT 6: The singular value decomposition. María Barbero Liñán Universidad Carlos III de Madrid Bachelor in Statistics and Business Mathematical methods II 2011-2012 A square matrix is symmetric if A T

More information

22m:033 Notes: 7.1 Diagonalization of Symmetric Matrices

22m:033 Notes: 7.1 Diagonalization of Symmetric Matrices m:33 Notes: 7. Diagonalization of Symmetric Matrices Dennis Roseman University of Iowa Iowa City, IA http://www.math.uiowa.edu/ roseman May 3, Symmetric matrices Definition. A symmetric matrix is a matrix

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

2 Garrett: `A Good Spectral Theorem' 1. von Neumann algebras, density theorem The commutant of a subring S of a ring R is S 0 = fr 2 R : rs = sr; 8s 2

2 Garrett: `A Good Spectral Theorem' 1. von Neumann algebras, density theorem The commutant of a subring S of a ring R is S 0 = fr 2 R : rs = sr; 8s 2 1 A Good Spectral Theorem c1996, Paul Garrett, garrett@math.umn.edu version February 12, 1996 1 Measurable Hilbert bundles Measurable Banach bundles Direct integrals of Hilbert spaces Trivializing Hilbert

More information

Foundations of Matrix Analysis

Foundations of Matrix Analysis 1 Foundations of Matrix Analysis In this chapter we recall the basic elements of linear algebra which will be employed in the remainder of the text For most of the proofs as well as for the details, the

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016 Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Review problems for MA 54, Fall 2004.

Review problems for MA 54, Fall 2004. Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on

More information

COMS 4771 Lecture Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso

COMS 4771 Lecture Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso COMS 477 Lecture 6. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso / 2 Fixed-design linear regression Fixed-design linear regression A simplified

More information

Learning with Singular Vectors

Learning with Singular Vectors Learning with Singular Vectors CIS 520 Lecture 30 October 2015 Barry Slaff Based on: CIS 520 Wiki Materials Slides by Jia Li (PSU) Works cited throughout Overview Linear regression: Given X, Y find w:

More information

Lecture 14 Singular Value Decomposition

Lecture 14 Singular Value Decomposition Lecture 14 Singular Value Decomposition 02 November 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Goals for today singular value decomposition condition numbers application to mean squared errors

More information

Principal Component Analysis and Linear Discriminant Analysis

Principal Component Analysis and Linear Discriminant Analysis Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29

More information