An Overview on the Shrinkage Properties of Partial Least Squares Regression


Nicole Krämer
TU Berlin, Department of Computer Science and Electrical Engineering, Franklinstr. 28/29, D-10587 Berlin

Summary

The aim of this paper is twofold. In the first part, we recapitulate the main results regarding the shrinkage properties of Partial Least Squares (PLS) regression. In particular, we give an alternative proof of the shape of the PLS shrinkage factors. It is well known that some of the factors are > 1. We discuss in detail the effect of shrinkage factors on the Mean Squared Error of linear estimators and argue that these results cannot be extended to PLS directly, as PLS is nonlinear. In the second part, we investigate the effect of the shrinkage factors empirically. In particular, experiments on simulated and real-world data show that bounding the absolute value of the PLS shrinkage factors by 1 seems to lead to a lower Mean Squared Error.

Keywords: linear regression, biased estimators, mean squared error

1 Introduction

In this paper, we give a detailed overview of the shrinkage properties of Partial Least Squares (PLS) regression. It is well known (Frank & Friedman 1993) that we can express the PLS estimator obtained after m steps in the following way:

$$\hat\beta^{(m)}_{PLS} = \sum_i f^{(m)}(\lambda_i)\, z_i,$$

where $z_i$ is the component of the Ordinary Least Squares (OLS) estimator along the $i$th principal component of the covariance matrix $X^t X$ and $\lambda_i$ is the corresponding eigenvalue. The quantities $f^{(m)}(\lambda_i)$ are called shrinkage factors. We show that these factors are determined by a tridiagonal matrix (which depends on the input-output data $(X, y)$) and can be calculated in a recursive way. Combining the results of Butler & Denham (2000) and Phatak & de Hoog (2002), we give a simpler and clearer proof of the shape of the shrinkage factors of PLS and derive some of their properties. In particular, we reproduce the fact that some of the values $f^{(m)}(\lambda_i)$ are greater than 1. This was first proved by Butler & Denham (2000).

We argue that these "peculiar shrinkage properties" (Butler & Denham 2000) do not necessarily imply that the Mean Squared Error (MSE) of the PLS estimator is worse than the MSE of the OLS estimator. In the case of deterministic shrinkage factors, i.e. factors that do not depend on the output y, any value $f^{(m)}(\lambda_i) > 1$ is of course undesirable. But in the case of PLS, the shrinkage factors are stochastic: they also depend on y. In particular, bounding the absolute value of the shrinkage factors by 1 might not automatically yield a lower MSE, in contrast to what was conjectured e.g. in Frank & Friedman (1993). Having issued this warning, we explore whether bounding the shrinkage factors leads to a lower MSE or not. It is very difficult to derive theoretical results, as the quantities of interest, $\hat\beta^{(m)}_{PLS}$ and $f^{(m)}(\lambda_i)$, depend on y in a complicated, nonlinear way. As a substitute, we study the problem on several artificial data sets and one real-world example. It turns out that in most cases, the MSE of the bounded version of PLS is indeed smaller than that of PLS.

2 Preliminaries

We consider the multivariate linear regression model

$$y = X\beta + \varepsilon \qquad (1)$$

with

$$\mathrm{cov}(y) = \sigma^2 I_n. \qquad (2)$$

Here, $I_n$ is the identity matrix of dimension n. The number of variables is p, the number of examples is n. For simplicity, we assume that X and y are scaled to have zero mean, so we do not have to worry about intercepts. We have

$$X \in \mathbb{R}^{n\times p}, \quad A := X^t X \sim \mathrm{cov}(X) \in \mathbb{R}^{p\times p}, \quad y \in \mathbb{R}^n, \quad b := X^t y \sim \mathrm{cov}(X, y) \in \mathbb{R}^p.$$

We set $p = \mathrm{rk}(A) = \mathrm{rk}(X)$. The singular value decomposition of X is of the form

$$X = V \Sigma U^t$$

with $V \in \mathbb{R}^{n\times p}$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{p\times p}$, $U \in \mathbb{R}^{p\times p}$. The columns of U and V are mutually orthogonal, that is, $U^t U = I_p$ and $V^t V = I_p$. We set $\lambda_i = \sigma_i^2$ and $\Lambda = \Sigma^2$. The eigendecomposition of A is

$$A = U \Lambda U^t = \sum_{i=1}^p \lambda_i u_i u_i^t.$$

The eigenvalues $\lambda_i$ of A (and of any other matrix) are ordered as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0$. The Moore-Penrose inverse of a matrix M is denoted by $M^-$.

The Ordinary Least Squares (OLS) estimator $\hat\beta_{OLS}$ is the solution of the optimization problem

$$\arg\min_\beta \|y - X\beta\|.$$

If there is no unique solution (which is in general the case for $p > n$), the OLS estimator is the solution with minimal norm. Set

$$s = \Sigma V^t y. \qquad (3)$$

The OLS estimator is given by the formula

$$\hat\beta_{OLS} = (X^t X)^- X^t y = U \Lambda^- \Sigma V^t y = U \Lambda^- s = \sum_{i=1}^p \frac{v_i^t y}{\sqrt{\lambda_i}}\, u_i.$$

We define

$$z_i = \frac{v_i^t y}{\sqrt{\lambda_i}}\, u_i.$$

This implies

$$\hat\beta_{OLS} = \sum_{i=1}^p z_i.$$

Set

$$K^{(m)} := \left( A^0 b, Ab, \ldots, A^{m-1} b \right) \in \mathbb{R}^{p\times m}.$$

The columns of $K^{(m)}$ are called the Krylov sequence of A and b. The space spanned by the columns of $K^{(m)}$ is called the Krylov space of A and b and is denoted by $\mathcal{K}^{(m)}$. Krylov spaces are closely related to the Lanczos algorithm (Lanczos 1950), a method for approximating eigenvalues of the matrix A. We exploit the relationship between PLS and Krylov spaces in the subsequent sections. An excellent overview of the connections of PLS to the Lanczos method (and the conjugate gradient algorithm) can be found in Phatak & de Hoog (2002). Set

$$M := \{\lambda_i \mid s_i \neq 0\} = \{\lambda_i \neq 0 \mid v_i^t y \neq 0\}$$

(the vector s is defined in (3)) and $m^* := |M|$. It follows easily that $m^* \le p = \mathrm{rk}(X)$. The inequality is strict if A has non-zero eigenvalues of multiplicity > 1 or if there is a principal component $v_i$ that is not correlated with y, i.e. $v_i^t y = 0$. The quantity $m^*$ is also called the grade of b with respect to A. We state a standard result on the dimension of the Krylov spaces associated to A and b.
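To fix the notation, here is a minimal numpy sketch (my own illustration, not code from the paper) of the OLS estimator computed through the SVD and its components $z_i$:

```python
import numpy as np

def ols_components(X, y):
    """Sketch: OLS estimator via the SVD X = V Sigma U^t and its components
    z_i = (v_i^t y / sqrt(lambda_i)) u_i."""
    # numpy returns X = V_np diag(sigma) Ut_np, which matches X = V Sigma U^t above.
    V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
    U = Ut.T
    lam = sigma ** 2                       # eigenvalues lambda_i of A = X^t X
    z = (V.T @ y / sigma) * U              # column i is z_i
    beta_ols = z.sum(axis=1)               # beta_OLS = sum_i z_i
    return beta_ols, z, lam
```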

Proposition 1. We have

$$\dim \mathcal{K}^{(m)} = \begin{cases} m & m \le m^* \\ m^* & m > m^*. \end{cases}$$

In particular,

$$\dim \mathcal{K}^{(m^*)} = \dim \mathcal{K}^{(m^*+1)} = \ldots = \dim \mathcal{K}^{(p)} = m^*. \qquad (4)$$

Finally, let us introduce the following notation. For any set S of vectors, we denote by $P_S$ the projection onto the space spanned by S. It follows that $P_S = S (S^t S)^- S^t$.

3 Partial Least Squares

We only give a sketchy introduction to the PLS method. More details can be found e.g. in Höskuldsson (1988) or Rosipal & Krämer (2006). The main idea is to extract m orthogonal components from the predictor space X and fit the response y to these components. In this sense, PLS is similar to Principal Components Regression (PCR). The difference is that PCR extracts components that explain the variance in the predictor space, whereas PLS extracts components that have a high covariance with y. The quantity m is called the number of PLS steps or the number of PLS components.

We now formalize this idea. The first latent component $t_1$ is a linear combination $t_1 = X w_1$ of the predictor variables. The vector $w_1$ is usually called the weight vector. We want to find a component with maximal covariance with y, that is, we want to compute

$$w_1 = \arg\max_{\|w\|=1} \mathrm{cov}(Xw, y) = \arg\max_{\|w\|=1} w^t b. \qquad (5)$$

Using Lagrangian multipliers, we conclude that the solution $w_1$ is, up to a factor, equal to $X^t y = b$. Subsequent components $t_2, t_3, \ldots$ are chosen such that they maximize (5) and that all components are mutually orthogonal. We ensure orthogonality by deflating the original predictor variables X. That is, we only consider the part of X that is orthogonal to all components $t_j$ for $j < i$:

$$X_i = X - P_{t_1, \ldots, t_{i-1}} X.$$

We then replace X by $X_i$ in (5). This version of PLS is called the NIPALS algorithm (Wold 1975).

Algorithm 2 (NIPALS algorithm). After setting $X_1 = X$, the weight vectors $w_i$ and the components $t_i$ of PLS are determined by iteratively computing

$$w_i = X_i^t y \quad \text{(weight vector)}$$
$$t_i = X_i w_i \quad \text{(component)}$$
$$X_{i+1} = X_i - P_{t_i} X_i \quad \text{(deflation)}$$

The final estimator $\hat y$ for $X\beta$ is

$$\hat y = P_{t_1, \ldots, t_m}\, y = \sum_{j=1}^m P_{t_j}\, y.$$

The last equality follows as the PLS components $t_1, \ldots, t_m$ are mutually orthogonal. We denote by $W^{(m)}$ the matrix that consists of the weight vectors $w_1, \ldots, w_m$ defined in Algorithm 2:

$$W^{(m)} = (w_1, \ldots, w_m). \qquad (6)$$

The PLS components $t_j$ and the weight vectors are linked in the following way (see e.g. Höskuldsson (1988)):

$$(t_1, \ldots, t_m) = X W^{(m)} R^{(m)}$$

with an invertible bidiagonal matrix $R^{(m)}$. Plugging this into the formula for $\hat y$, we obtain

$$\hat y = X W^{(m)} \left( (W^{(m)})^t X^t X W^{(m)} \right)^- (W^{(m)})^t X^t y.$$

It can be shown (Helland 1988) that the space spanned by the vectors $w_i$ ($i = 1, \ldots, m$) equals the Krylov space $\mathcal{K}^{(m)}$ defined in Section 2. More precisely, $W^{(m)}$ is an orthogonal basis of $\mathcal{K}^{(m)}$ that is obtained by a Gram-Schmidt procedure. This implies:

Proposition 3 (Helland 1988). The PLS estimator obtained after m steps can be expressed in the following way:

$$\hat\beta^{(m)}_{PLS} = K^{(m)} \left[ (K^{(m)})^t A K^{(m)} \right]^- (K^{(m)})^t b. \qquad (7)$$

An equivalent formulation is that the PLS estimator $\hat\beta^{(m)}_{PLS}$ for $\beta$ is the solution of the constrained minimization problem

$$\arg\min_\beta \|y - X\beta\| \quad \text{subject to} \quad \beta \in \mathcal{K}^{(m)}.$$
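The deflation steps of Algorithm 2 translate directly into code. The following numpy sketch is my own rendering of the iteration described above; normalization conventions for the weight vectors vary between implementations, here they are scaled to unit length as in (5):

```python
import numpy as np

def nipals_pls(X, y, m):
    """Sketch of Algorithm 2: returns the weight matrix W^(m), the components
    t_1,...,t_m and the fitted values P_{t_1,...,t_m} y."""
    Xi = X.copy()
    W, T = [], []
    for _ in range(m):
        w = Xi.T @ y                                   # weight vector w_i = X_i^t y
        w = w / np.linalg.norm(w)                      # normalize, ||w|| = 1 as in (5)
        t = Xi @ w                                     # component t_i = X_i w_i
        Xi = Xi - np.outer(t, t @ Xi) / (t @ t)        # deflation X_{i+1} = X_i - P_{t_i} X_i
        W.append(w)
        T.append(t)
    W, T = np.array(W).T, np.array(T).T
    # Fitted values: projection of y onto the span of the (mutually orthogonal) components.
    y_hat = sum((t @ y) / (t @ t) * t for t in T.T)
    return W, T, y_hat
```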

Of course, the orthogonal basis $W^{(m)}$ of the Krylov space only exists if $\dim \mathcal{K}^{(m)} = m$, which might not be true for all $m \le p$. The maximal number for which this holds is $m^*$ (see Proposition 1). Note however that it follows from (4) that

$$\mathcal{K}^{(m^*-1)} \subset \mathcal{K}^{(m^*)} = \mathcal{K}^{(m^*+1)} = \ldots = \mathcal{K}^{(p)},$$

and the solution of the optimization problem does not change anymore. Hence there is no loss of generality if we make the assumption that

$$\dim \mathcal{K}^{(m)} = m. \qquad (8)$$

Remark 4. We have $\hat\beta^{(m^*)}_{PLS} = \hat\beta_{OLS}$.

Proof. This result is well known and is usually proven using the fact that after the maximal number of steps the vectors $t_1, \ldots, t_{m^*}$ span the same space as the columns of X. Here we present an algebraic proof that exploits the relationship between PLS and Krylov spaces. We show that the OLS estimator is an element of $\mathcal{K}^{(m^*)}$, that is, $\hat\beta_{OLS} = \pi_{OLS}(A)\, b$ for a polynomial $\pi_{OLS}$ of degree $m^* - 1$. We define this polynomial via the $m^*$ equations

$$\pi_{OLS}(\lambda_i) = \frac{1}{\lambda_i}, \quad \lambda_i \in M.$$

In matrix notation, this equals

$$\pi_{OLS}(\Lambda)\, s = \Lambda^- s. \qquad (9)$$

Using (3) and (9), we conclude that

$$\pi_{OLS}(A)\, b = U \pi_{OLS}(\Lambda) \Sigma V^t y = U \pi_{OLS}(\Lambda)\, s = U \Lambda^- s = \hat\beta_{OLS}.$$

Set

$$D^{(m)} = (W^{(m)})^t A W^{(m)} \in \mathbb{R}^{m\times m}.$$

Proposition 5. The matrix $D^{(m)}$ is symmetric and positive semidefinite. Furthermore, $D^{(m)}$ is tridiagonal: $d_{ij} = 0$ for $|i - j| \ge 2$.

Proof. The first two statements are obvious. Let $j \ge i + 2$. As $w_i \in \mathcal{K}^{(i)}$, the vector $A w_i$ lies in the subspace $\mathcal{K}^{(i+1)}$. As $j > i + 1$, the vector $w_j$ is orthogonal to $\mathcal{K}^{(i+1)}$, in other words $d_{ji} = \langle w_j, A w_i \rangle = 0$. As $D^{(m)}$ is symmetric, we also have $d_{ij} = 0$, which proves the assertion.
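Proposition 3 and Proposition 5 can be checked numerically. The sketch below (assuming $\dim \mathcal{K}^{(m)} = m$; all function names are my own) computes the PLS estimator from the Krylov matrix as in (7) and the matrix $D^{(m)}$, which comes out symmetric and tridiagonal up to rounding:

```python
import numpy as np

def pls_via_krylov(X, y, m):
    """Sketch of Propositions 3 and 5: PLS estimator from the Krylov matrix,
    plus the tridiagonal matrix D^(m) = (W^(m))^t A W^(m)."""
    A, b = X.T @ X, X.T @ y
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    # Equation (7): beta_PLS^(m) = K [K^t A K]^- K^t b.
    beta_pls = K @ np.linalg.pinv(K.T @ A @ K) @ K.T @ b
    # Orthonormal basis W^(m) of the Krylov space via QR (Gram-Schmidt).
    W, _ = np.linalg.qr(K)
    D = W.T @ A @ W               # symmetric and tridiagonal (Proposition 5)
    return beta_pls, D
```

For m = m*, the same computation reproduces the OLS estimator, in line with Remark 4.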

4 Tridiagonal matrices

We see in Section 6 that the matrices $D^{(m)}$ and their eigenvalues determine the shrinkage factors of the PLS estimator. To prove this, we now list some properties of $D^{(m)}$.

Definition 6. A symmetric tridiagonal matrix D is called unreduced if all subdiagonal entries are non-zero, i.e. $d_{i,i+1} \neq 0$ for all i.

Theorem 7 (Parlett 1998). All eigenvalues of an unreduced tridiagonal matrix are distinct.

Set

$$D^{(m)} = \begin{pmatrix}
a_1 & b_1 & & \\
b_1 & a_2 & \ddots & \\
& \ddots & \ddots & b_{m-1} \\
& & b_{m-1} & a_m
\end{pmatrix}.$$

Proposition 8. If $\dim \mathcal{K}^{(m)} = m$, the matrix $D^{(m)}$ is unreduced. More precisely, $b_i > 0$ for all $i \in \{1, \ldots, m-1\}$.

Proof. Set $p_i = A^{i-1} b$ and denote by $w_1, \ldots, w_m$ the basis (6) obtained by Gram-Schmidt. Its existence is guaranteed as we assume that $\dim \mathcal{K}^{(m)} = m$. We have to show that $b_i = \langle w_i, A w_{i-1} \rangle > 0$. As the length of $w_i$ does not change the sign of $b_i$, we can assume that the vectors $w_i$ are not normalized to have length 1. By definition,

$$w_i = p_i - \sum_{k=1}^{i-1} \frac{\langle p_i, w_k \rangle}{\langle w_k, w_k \rangle}\, w_k. \qquad (10)$$

As the vectors $w_i$ are pairwise orthogonal, it follows that $\langle w_i, p_i \rangle = \langle w_i, w_i \rangle > 0$.

We conclude that

$$b_i = \langle w_i, A w_{i-1} \rangle
\overset{(10)}{=} \left\langle w_i,\; A p_{i-1} - \sum_{k=1}^{i-2} \frac{\langle p_{i-1}, w_k \rangle}{\langle w_k, w_k \rangle}\, A w_k \right\rangle
= \langle w_i, A p_{i-1} \rangle
= \langle w_i, p_i \rangle
= \langle w_i, w_i \rangle > 0.$$

Here the third equality holds as $\langle w_i, A w_k \rangle = 0$ for $k \le i-2$ (Proposition 5), the fourth as $A p_{i-1} = p_i$, and the last was established above.

Note that the matrix $D^{(m-1)}$ is obtained from $D^{(m)}$ by deleting the last column and row of $D^{(m)}$. It follows that we can give a recursive formula for the characteristic polynomials of $D^{(m)}$. With $\chi^{(m)} := \chi_{D^{(m)}}$ we have

$$\chi^{(m)}(\lambda) = (a_m - \lambda)\, \chi^{(m-1)}(\lambda) - b_{m-1}^2\, \chi^{(m-2)}(\lambda). \qquad (11)$$

We want to deduce properties of the eigenvalues of $D^{(m)}$ and A and explore their relationship. Denote the eigenvalues of $D^{(m)}$ by

$$\mu^{(m)}_1 > \ldots > \mu^{(m)}_m \ge 0. \qquad (12)$$

Remark 9. All eigenvalues of $D^{(m^*)}$ are eigenvalues of A.

Proof. First note that $A|_{\mathcal{K}^{(m^*)}}: \mathcal{K}^{(m^*)} \to \mathcal{K}^{(m^*+1)} = \mathcal{K}^{(m^*)}$. As the columns of the matrix $W^{(m^*)}$ form an orthonormal basis of $\mathcal{K}^{(m^*)}$, the matrix $D^{(m^*)} = (W^{(m^*)})^t A W^{(m^*)}$ represents $A|_{\mathcal{K}^{(m^*)}}$ with respect to this basis. As any eigenvalue of $A|_{\mathcal{K}^{(m^*)}}$ is obviously an eigenvalue of A, the proof is complete.
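The three-term recursion (11) is straightforward to evaluate. The following sketch (my own helper, not from the paper) computes the characteristic polynomials of a symmetric tridiagonal matrix from its diagonal a and subdiagonal b, and can be checked against numerically computed eigenvalues:

```python
import numpy as np

def char_poly(a, b, lam):
    """Evaluate chi^(1),...,chi^(m) of the tridiagonal matrix with diagonal a and
    subdiagonal b at the point lam, via recursion (11)."""
    chi = [1.0, a[0] - lam]                    # chi^(0) = 1, chi^(1)(lam) = a_1 - lam
    for k in range(1, len(a)):
        chi.append((a[k] - lam) * chi[-1] - b[k - 1] ** 2 * chi[-2])
    return chi[1:]

# Sanity check: chi^(m) vanishes at every eigenvalue of D^(m).
a, b = np.array([2.0, 3.0, 4.0]), np.array([1.0, 0.5])
D = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)
for mu in np.linalg.eigvalsh(D):
    print(round(char_poly(a, b, mu)[-1], 10))  # ~0 up to rounding
```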

The following theorem is a special form of the Cauchy Interlace Theorem. In this version, we use a general result from Parlett (1998) and exploit the tridiagonal structure of $D^{(m)}$.

Theorem 10. Each interval $\left[\mu^{(m)}_{m-j}, \mu^{(m)}_{m-(j+1)}\right]$ ($j = 0, \ldots, m-2$) contains a different eigenvalue of $D^{(m+k)}$ ($k \ge 1$). In addition, there is a different eigenvalue of $D^{(m+k)}$ outside the open interval $\left(\mu^{(m)}_m, \mu^{(m)}_1\right)$. This theorem ensures in particular that there is a different eigenvalue of A in each interval $\left[\mu^{(m)}_k, \mu^{(m)}_{k-1}\right]$. Theorem 10 holds independently of assumption (8).

Proof. By definition, for $k \ge 1$,

$$D^{(m+k)} = \begin{pmatrix} D^{(m)} & \Delta^t \\ \Delta & \ast \end{pmatrix},$$

where the only non-zero entry of $\Delta \in \mathbb{R}^{k\times m}$ is $b_m$ in its upper right corner; in other words, $D^{(m)}$ is the leading principal $m \times m$ submatrix of $D^{(m+k)}$. An application of the corresponding interlacing theorem in Parlett (1998) gives the desired result.

Lemma 11. If $D^{(m)}$ is unreduced, the eigenvalues of $D^{(m)}$ and the eigenvalues of $D^{(m-1)}$ are distinct.

Proof. Suppose the two matrices have a common eigenvalue $\lambda$. It follows from (11) and the fact that $D^{(m)}$ is unreduced that $\lambda$ is an eigenvalue of $D^{(m-2)}$. Repeating this argument, we deduce that $a_1$ is an eigenvalue of $D^{(2)}$, a contradiction, as $0 = \chi^{(2)}(a_1) = -b_1^2 < 0$.

Remark 12. In general it is not true that $D^{(m)}$ and a submatrix $D^{(k)}$ have distinct eigenvalues. Consider the case where $a_i = c$ for all i. Using equation (11), we conclude that c is an eigenvalue of every submatrix $D^{(m)}$ with m odd.

Proposition 13. If $\dim \mathcal{K}^{(m)} = m$, we have $\det\left(D^{(m-1)}\right) \neq 0$.

Proof. The matrix $D^{(m)}$ is positive semidefinite, hence all eigenvalues of $D^{(m)}$ are $\ge 0$. In other words, $\det\left(D^{(m-1)}\right) \neq 0$ if and only if its smallest eigenvalue $\mu^{(m-1)}_{m-1}$ is $> 0$. Using Theorem 10, we have

$$\mu^{(m-1)}_{m-1} \ge \mu^{(m)}_m \ge 0.$$

As $\dim \mathcal{K}^{(m)} = m$, the matrix $D^{(m)}$ is unreduced, which implies that $D^{(m)}$ and $D^{(m-1)}$ have no common eigenvalues (see Lemma 11). We can therefore replace the first $\ge$ by $>$, i.e. the smallest eigenvalue of $D^{(m-1)}$ is $> 0$.

It is well known that the matrices $D^{(m)}$ are closely related to the so-called Rayleigh-Ritz procedure, a method that is used to approximate eigenvalues. For details consult e.g. Parlett (1998).

5 What is shrinkage?

We presented two estimators for the regression parameter $\beta$, OLS and PLS, which also define estimators for $X\beta$ via $\hat y = X\hat\beta$. One possibility to evaluate the quality of an estimator is to determine its Mean Squared Error (MSE). In general, the MSE of an estimator $\hat\theta$ for a vector-valued parameter $\theta$ is defined as

$$\mathrm{MSE}(\hat\theta) = E\left[ \mathrm{trace}\left( (\hat\theta - \theta)(\hat\theta - \theta)^t \right) \right] = E\left[ (\hat\theta - \theta)^t (\hat\theta - \theta) \right] = \left( E[\hat\theta] - \theta \right)^t \left( E[\hat\theta] - \theta \right) + E\left[ \left( \hat\theta - E[\hat\theta] \right)^t \left( \hat\theta - E[\hat\theta] \right) \right].$$

This is the well-known bias-variance decomposition of the MSE. The first part is the squared bias and the second part is the variance term.

We start by investigating the class of linear estimators, i.e. estimators that are of the form $\hat\theta = Sy$ for a matrix S that does not depend on y. It follows immediately from the regression model (1) and (2) that for a linear estimator,

$$E[\hat\theta] = SX\beta, \qquad \mathrm{var}[\hat\theta] = \sigma^2\, \mathrm{trace}(SS^t).$$

The OLS estimators are linear:

$$\hat\beta_{OLS} = (X^tX)^- X^t y, \qquad \hat y_{OLS} = X (X^tX)^- X^t y.$$

Note that the estimator $\hat y_{OLS}$ is simply the projection $P_X$ of y onto the space spanned by the columns of X. The estimator $\hat y_{OLS}$ is unbiased, as

$$E[\hat y_{OLS}] = P_X X\beta = X\beta.$$

The estimator $\hat\beta_{OLS}$ is only unbiased if $\beta \in \mathrm{range}(X^tX)$:

$$E[\hat\beta_{OLS}] = E\left[ (X^tX)^- X^t y \right] = (X^tX)^- X^t E[y] = (X^tX)^- X^t X\beta = \beta.$$

Let us now have a closer look at the variance term. It follows directly from $\mathrm{trace}(P_X P_X^t) = \mathrm{rk}(X) = p$ that

$$\mathrm{var}(\hat y_{OLS}) = \sigma^2 p.$$

For $\hat\beta_{OLS}$ we have

$$(X^tX)^- X^t \left( (X^tX)^- X^t \right)^t = (X^tX)^- = U \Lambda^- U^t,$$

hence

$$\mathrm{var}\left( \hat\beta_{OLS} \right) = \sigma^2 \sum_{i=1}^p \frac{1}{\lambda_i}. \qquad (13)$$

We conclude that the MSE of the estimator $\hat\beta_{OLS}$ depends on the eigenvalues $\lambda_1, \ldots, \lambda_p$ of $A = X^tX$. Small eigenvalues of A correspond to directions in X that have a low variance. Equation (13) shows that if some eigenvalues are small, the variance of $\hat\beta_{OLS}$ is very high, which leads to a high MSE. One possibility to (hopefully) decrease the MSE is to modify the OLS estimator by shrinking the directions of the OLS estimator that are responsible for a high variance. This of course introduces bias. We shrink the OLS estimator in the hope that the increase in bias is small compared to the decrease in variance. In general, a shrinkage estimator for $\beta$ is of the form

$$\hat\beta_{shr} = \sum_{i=1}^p f(\lambda_i)\, z_i,$$

where f is some real-valued function. The values $f(\lambda_i)$ are called shrinkage factors. Examples are:

Principal Component Regression:

$$f(\lambda_i) = \begin{cases} 1 & i\text{th principal component included} \\ 0 & \text{otherwise} \end{cases}$$

and Ridge Regression:

$$f(\lambda_i) = \frac{\lambda_i}{\lambda_i + \lambda}$$

with $\lambda > 0$ the Ridge parameter. We illustrate in Section 6 that PLS is a shrinkage estimator as well. It turns out that the shrinkage behavior of PLS regression is rather complicated.

Let us investigate in which way the MSE of the estimator is influenced by the shrinkage factors. If the shrinkage estimators are linear, i.e. the shrinkage factors do not depend on y, this is an easy task. Let us first write the shrinkage estimator in matrix notation. We have

$$\hat\beta_{shr} = S_{shr}\, y = U \Sigma^- D_{shr} V^t y.$$

The diagonal matrix $D_{shr}$ has entries $f(\lambda_i)$. The shrinkage estimator for y is

$$\hat y_{shr} = X S_{shr}\, y = V \Sigma \Sigma^- D_{shr} V^t y.$$

We calculate the variance of these estimators:

$$\mathrm{trace}\left( S_{shr} S_{shr}^t \right) = \mathrm{trace}\left( U \Sigma^- D_{shr} D_{shr} \Sigma^- U^t \right) = \mathrm{trace}\left( \Sigma^- D_{shr} \Sigma^- D_{shr} \right) = \sum_{i=1}^p \frac{(f(\lambda_i))^2}{\lambda_i}$$

and

$$\mathrm{trace}\left( X S_{shr} S_{shr}^t X^t \right) = \mathrm{trace}\left( V \Sigma\Sigma^- D_{shr} \Sigma\Sigma^- D_{shr} V^t \right) = \mathrm{trace}\left( \Sigma\Sigma^- D_{shr} \Sigma\Sigma^- D_{shr} \right) = \sum_{i=1}^p (f(\lambda_i))^2.$$

Next, we calculate the bias of the two shrinkage estimators. We have

$$E[S_{shr}\, y] = S_{shr} X\beta = U \Sigma D_{shr} \Sigma^- U^t \beta.$$

It follows that

$$\mathrm{bias}^2\left( \hat\beta_{shr} \right) = \left( E[S_{shr} y] - \beta \right)^t \left( E[S_{shr} y] - \beta \right) = \left( U^t\beta \right)^t \left( \Sigma D_{shr} \Sigma^- - I_p \right)^t \left( \Sigma D_{shr} \Sigma^- - I_p \right) \left( U^t\beta \right) = \sum_{i=1}^p (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2.$$

Replacing $S_{shr}$ by $X S_{shr}$, it is easy to show that

$$\mathrm{bias}^2\left( \hat y_{shr} \right) = \sum_{i=1}^p \lambda_i (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2.$$

Proposition 14. For the shrinkage estimators $\hat\beta_{shr}$ and $\hat y_{shr}$ defined above, we have

$$\mathrm{MSE}\left( \hat\beta_{shr} \right) = \sum_{i=1}^p (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2 + \sigma^2 \sum_{i=1}^p \frac{(f(\lambda_i))^2}{\lambda_i},$$

$$\mathrm{MSE}\left( \hat y_{shr} \right) = \sum_{i=1}^p \lambda_i (f(\lambda_i) - 1)^2 \left( u_i^t\beta \right)^2 + \sigma^2 \sum_{i=1}^p (f(\lambda_i))^2.$$

If the shrinkage factors are deterministic, i.e. they do not depend on y, any value $f(\lambda_i) \neq 1$ increases the bias. Values $|f(\lambda_i)| < 1$ decrease the variance, whereas values $|f(\lambda_i)| > 1$ increase the variance. Hence an absolute value > 1 is always undesirable. The situation might be different for stochastic shrinkage factors. We discuss this in the following section.

Note that there is a different notion of shrinkage, namely that the $L_2$-norm of an estimator is smaller than the $L_2$-norm of the OLS estimator. Why is this a desirable property? Let us again consider the case of linear estimators. Set $\hat\theta_i = S_i y$ for $i = 1, 2$. We have

$$\|\hat\theta_i\|^2 = y^t S_i^t S_i y.$$

The property that for all $y \in \mathbb{R}^n$

$$\|\hat\theta_1\|^2 \le \|\hat\theta_2\|^2$$

is equivalent to the condition that $S_1^t S_1 - S_2^t S_2$ is negative semidefinite. The trace of a negative semidefinite matrix is $\le 0$. Furthermore, $\mathrm{trace}(S_i^t S_i) = \mathrm{trace}(S_i S_i^t)$, so we conclude that

$$\mathrm{var}\left( \hat\theta_1 \right) \le \mathrm{var}\left( \hat\theta_2 \right).$$
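As a small illustration of Proposition 14, the MSE of a linear shrinkage estimator with deterministic factors can be evaluated directly from the eigenvalues and the quantities $u_i^t\beta$. This is a sketch under the stated model assumptions; the shrinkage functions below are my own encodings of the two examples given earlier:

```python
import numpy as np

# Shrinkage factor functions for the two examples in the text (eigenvalues are
# assumed to be sorted in descending order, as returned by the SVD).
ridge_factors = lambda lam, ridge_par=1.0: lam / (lam + ridge_par)
pcr_factors = lambda lam, k=2: (np.arange(len(lam)) < k).astype(float)

def mse_linear_shrinkage(X, beta, sigma2, factor_fn):
    """Sketch of Proposition 14: MSE of a linear shrinkage estimator with
    deterministic shrinkage factors, for beta_shr and y_shr."""
    _, sv, Ut = np.linalg.svd(X, full_matrices=False)
    lam = sv ** 2                               # eigenvalues lambda_i of X^t X
    alpha = Ut @ beta                           # u_i^t beta
    f = factor_fn(lam)
    mse_beta = np.sum((f - 1) ** 2 * alpha ** 2) + sigma2 * np.sum(f ** 2 / lam)
    mse_yhat = np.sum(lam * (f - 1) ** 2 * alpha ** 2) + sigma2 * np.sum(f ** 2)
    return mse_beta, mse_yhat
```

For the OLS factors $f(\lambda_i) \equiv 1$ the bias terms vanish and the variance terms reduce to (13) and $\sigma^2 p$, which is a quick consistency check.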

It is known (de Jong 1995) that

$$\left\| \hat\beta^{(1)}_{PLS} \right\|^2 \le \left\| \hat\beta^{(2)}_{PLS} \right\|^2 \le \ldots \le \left\| \hat\beta^{(m^*)}_{PLS} \right\|^2 = \left\| \hat\beta_{OLS} \right\|^2.$$

6 The shrinkage factors of PLS

In this section, we give a simpler and clearer proof of the shape of the shrinkage factors of PLS. Basically, we combine the results of Butler & Denham (2000) and Phatak & de Hoog (2002). In the rest of the section, we assume that $m < m^*$, as the shrinkage factors for $\hat\beta^{(m^*)}_{PLS} = \hat\beta_{OLS}$ are trivial, i.e. $f^{(m^*)}(\lambda_i) = 1$.

By definition of the PLS estimator, $\hat\beta^{(m)}_{PLS} \in \mathcal{K}^{(m)}$. Hence there is a polynomial $\pi$ of degree $m-1$ with $\hat\beta^{(m)}_{PLS} = \pi(A)\, b$. Recall that the eigenvalues of $D^{(m)}$ are denoted by $\mu^{(m)}_i$. Set

$$f^{(m)}(\lambda) := 1 - \prod_{i=1}^m \left( 1 - \frac{\lambda}{\mu^{(m)}_i} \right) = 1 - \frac{1}{\chi^{(m)}(0)}\, \chi^{(m)}(\lambda).$$

As $f^{(m)}(0) = 0$, there is a polynomial $\pi^{(m)}$ of degree $m-1$ such that

$$f^{(m)}(\lambda) = \lambda\, \pi^{(m)}(\lambda). \qquad (14)$$

Proposition 15 (Phatak & de Hoog 2002). Suppose that $m < m^*$. We have

$$\hat\beta^{(m)}_{PLS} = \pi^{(m)}(A)\, b.$$

Proof (Phatak & de Hoog 2002). Using either equation (14) or the Cayley-Hamilton theorem (recall Proposition 13), it is easy to prove that

$$\left( D^{(m)} \right)^{-1} = \pi^{(m)}\left( D^{(m)} \right).$$

We plug this into equation (7) and obtain

$$\hat\beta^{(m)}_{PLS} = W^{(m)} \pi^{(m)}\left( (W^{(m)})^t A W^{(m)} \right) (W^{(m)})^t b.$$

Recall that the columns of $W^{(m)}$ form an orthonormal basis of $\mathcal{K}^{(m)}$. It follows that $W^{(m)} (W^{(m)})^t$ is the operator that projects onto the space $\mathcal{K}^{(m)}$. In particular,

$$W^{(m)} (W^{(m)})^t A^j b = A^j b$$

for $j = 1, \ldots, m-1$. This implies that $\hat\beta^{(m)}_{PLS} = \pi^{(m)}(A)\, b$.

Using (14), we can immediately conclude the following corollary.

Corollary 16 (Phatak & de Hoog 2002). Suppose that $\dim \mathcal{K}^{(m)} = m$. If we denote by $z_i$ the component of $\hat\beta_{OLS}$ along the ith eigenvector of A, then

$$\hat\beta^{(m)}_{PLS} = \sum_{i=1}^p f^{(m)}(\lambda_i)\, z_i,$$

with $f^{(m)}(\lambda)$ defined in (14).

We now show that some of the shrinkage factors of PLS are $\ge 1$.

Theorem 17 (Butler & Denham 2000). For each $m \le m^* - 1$, we can decompose the interval $[\lambda_p, \lambda_1]$ into $m + 1$ disjoint intervals¹ $I_1 \le I_2 \le \ldots \le I_{m+1}$ such that

$$f^{(m)}(\lambda_i) \begin{cases} \le 1 & \lambda_i \in I_j \text{ and } j \text{ odd} \\ \ge 1 & \lambda_i \in I_j \text{ and } j \text{ even.} \end{cases}$$

Proof. Set $g^{(m)}(\lambda) = 1 - f^{(m)}(\lambda)$. It follows from the definition of $f^{(m)}$ that the zeros of $g^{(m)}(\lambda)$ are $\mu^{(m)}_m, \ldots, \mu^{(m)}_1$. As $D^{(m)}$ is unreduced, all eigenvalues are distinct. Set $\mu^{(m)}_0 = \lambda_1$ and $\mu^{(m)}_{m+1} = \lambda_p$ and define $I_j = \,]\mu^{(m)}_{m+2-j}, \mu^{(m)}_{m+1-j}[$ for $j = 1, \ldots, m+1$. By definition, $g^{(m)}(0) = 1$. Hence $g^{(m)}(\lambda)$ is non-negative on the intervals $I_j$ if j is odd and $g^{(m)}$ is non-positive on the intervals $I_j$ if j is even. It follows from Theorem 10 that all intervals $I_j$ contain at least one eigenvalue $\lambda_i$ of A.

In general it is not true that $f^{(m)}(\lambda_i) \le 1$ for all $\lambda_i$ and $m = 1, \ldots, m^*$. Using the example in Remark 12 and the fact that $f^{(m)}(\lambda_i) = 1$ is equivalent to the condition that $\lambda_i$ is an eigenvalue of $D^{(m)}$, it is easy to construct a counterexample.

¹ We say that $I_j \le I_k$ if $\sup I_j \le \inf I_k$.
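Corollary 16 suggests a direct way to compute the PLS shrinkage factors from the eigenvalues of $D^{(m)}$. The sketch below is my own implementation of this formula; it assumes distinct, non-zero eigenvalues so that the orderings of the eigendecomposition and the SVD are unambiguous:

```python
import numpy as np

def pls_shrinkage_factors(X, y, m):
    """Sketch of Corollary 16: compute f^(m)(lambda_i) = 1 - prod_j (1 - lambda_i / mu_j^(m))."""
    A, b = X.T @ X, X.T @ y
    lam = np.linalg.eigvalsh(A)[::-1]            # eigenvalues lambda_1 >= ... >= lambda_p
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    W, _ = np.linalg.qr(K)                       # orthonormal basis of the Krylov space
    mu = np.linalg.eigvalsh(W.T @ A @ W)         # eigenvalues mu_i^(m) of D^(m)
    factors = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)
    return factors, lam
```

In line with Theorem 17, some of the computed factors typically turn out to be larger than 1.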

Using some of the results of Section 4, we can however deduce that some factors are indeed $\neq 1$. As all eigenvalues of $D^{(m^*-1)}$ and $D^{(m^*)}$ are distinct (Lemma 11), we see that $f^{(m^*-1)}(\lambda_i) \neq 1$ for all i. In particular,

$$f^{(m^*-1)}(\lambda_1) \begin{cases} < 1 & m^* \text{ odd} \\ > 1 & m^* \text{ even.} \end{cases}$$

More generally, using Lemma 11, we conclude that $f^{(m-1)}(\lambda_i) = 1$ and $f^{(m)}(\lambda_i) = 1$ cannot hold simultaneously. In practice, i.e. calculated on a data set, the shrinkage factors seem to be $\neq 1$ all of the time. Furthermore,

$$0 \le f^{(m)}(\lambda_p) < 1.$$

To prove this, we set $g^{(m)}(\lambda) = 1 - f^{(m)}(\lambda)$. We have $g^{(m)}(0) = 1$. Furthermore, the smallest positive zero of $g^{(m)}(\lambda)$ is $\mu^{(m)}_m$, and it follows from Theorem 10 and Lemma 11 that $\lambda_p < \mu^{(m)}_m$. Hence $g^{(m)}(\lambda_p) \in \,]0, 1]$.

Using Theorem 10, it is possible to bound the quantities $\frac{\lambda_p}{\mu^{(m)}_i}$ and $1 - \frac{\lambda_i}{\mu^{(m)}_i}$ more precisely. From this we can derive bounds on the shrinkage factors. We do not pursue this further; readers who are interested in the bounds should consult Lingjaerde & Christopherson (2000). Instead, we have a closer look at the MSE of the PLS estimator. In Section 5, we showed that a value $f^{(m)}(\lambda_i) > 1$ is not desirable, as both the bias and the variance of the estimator increase. Note however that in the case of PLS, the factors $f^{(m)}(\lambda_i)$ are stochastic; they depend on y in a nonlinear way. The variance of the PLS estimator for the ith principal component is

$$\mathrm{var}\left( f^{(m)}(\lambda_i)\, \frac{v_i^t y}{\sqrt{\lambda_i}} \right)$$

with both $f^{(m)}(\lambda_i)$ and $\frac{v_i^t y}{\sqrt{\lambda_i}}$ depending on y.

Among others, Frank & Friedman (1993) propose to truncate the shrinkage factors of the PLS estimator in the following way. Set

$$\tilde f^{(m)}(\lambda_i) = \begin{cases} +1 & f^{(m)}(\lambda_i) > +1 \\ -1 & f^{(m)}(\lambda_i) < -1 \\ f^{(m)}(\lambda_i) & \text{otherwise} \end{cases}$$

and define a new estimator:

$$\hat\beta^{(m)}_{TRN} := \sum_{i=1}^p \tilde f^{(m)}(\lambda_i)\, z_i. \qquad (15)$$

If the shrinkage factors are deterministic numbers, this will improve the MSE (cf. Section 5). But in the case of stochastic shrinkage factors, the situation might be different. Let us suppose for a moment that $f^{(m)}(\lambda_i) = \frac{\sqrt{\lambda_i}}{v_i^t y}$. It follows that

$$0 = \mathrm{var}\left( f^{(m)}(\lambda_i)\, \frac{v_i^t y}{\sqrt{\lambda_i}} \right) \le \mathrm{var}\left( \tilde f^{(m)}(\lambda_i)\, \frac{v_i^t y}{\sqrt{\lambda_i}} \right),$$

so it is not clear whether the truncated estimator TRN leads to a lower MSE, as is conjectured e.g. in Frank & Friedman (1993). The assumption that $f^{(m)}(\lambda_i) = \frac{\sqrt{\lambda_i}}{v_i^t y}$ is of course purely hypothetical. It is not clear whether the shrinkage factors behave this way. It is hard, if not infeasible, to derive statistical properties of the PLS estimator or its shrinkage factors, as they depend on y in a complicated, nonlinear way. As an alternative, we compare the two different estimators on different data sets.

7 Experiments

In this section, we explore the difference between the methods PLS and TRN. We investigate several artificial datasets and one real world example.

Simulation

We compare the MSE of the two methods, PLS and truncated PLS, on 27 different artificial data sets. We use a setting similar to the one in Frank & Friedman (1993). For each data set, the number of examples is n = 50. We consider three different numbers of predictor variables: p = 5, 40, 100.
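The truncated estimator (15) only requires clipping the factors before recombining the components $z_i$. The following self-contained sketch is my own code, with the same eigenvalue-ordering assumption as in the previous sketch:

```python
import numpy as np

def truncated_pls(X, y, m):
    """Sketch of the TRN estimator (15): PLS shrinkage factors clipped to [-1, +1].
    Assumes distinct eigenvalues so that the SVD and eigendecomposition orderings match."""
    V, sigma, Ut = np.linalg.svd(X, full_matrices=False)
    U, lam = Ut.T, sigma ** 2
    z = (V.T @ y / sigma) * U                            # OLS components z_i
    A, b = X.T @ X, X.T @ y
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    W, _ = np.linalg.qr(K)
    mu = np.linalg.eigvalsh(W.T @ A @ W)                 # eigenvalues of D^(m)
    f = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)   # f^(m)(lambda_i)
    return z @ np.clip(f, -1.0, 1.0)                     # truncation as in (15)
```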

The input data X is chosen according to a multivariate normal distribution with zero mean and covariance matrix C. We consider three different covariance matrices:

$$C_1 = I_p, \qquad (C_2)_{ij} = \frac{1}{|i-j| + 1}, \qquad (C_3)_{ij} = \begin{cases} 1 & i = j \\ 0.7 & i \neq j. \end{cases}$$

The matrices $C_1$, $C_2$ and $C_3$ correspond to no, moderate and high collinearity respectively. The regression vector $\beta$ is a randomly chosen vector $\beta \in \{0, 1\}^p$. In addition, we consider three different signal-to-noise ratios:

$$\mathrm{stnr} = \frac{\mathrm{var}(X\beta)}{\sigma^2} = 1, 3, 7.$$

This yields 3 x 3 x 3 = 27 different parameter settings. For each setting, we estimate the MSE of the two methods as follows. For $k = 1, \ldots, K = 200$, we generate y according to (1) and (2). We determine for each method and each m the respective estimator $\hat\beta_k$ and define

$$\widehat{\mathrm{MSE}}(\hat\beta) = \frac{1}{K} \sum_{k=1}^K \left( \hat\beta_k - \beta \right)^t \left( \hat\beta_k - \beta \right).$$

If there are more predictor variables than examples, this approach is not sensible, as the true regression vector $\beta$ is not identifiable: different regression vectors $\beta_1 \neq \beta_2$ can lead to $X\beta_1 = X\beta_2$. Hence for p = 100, we estimate the MSE of $\hat y$ for the two methods. We display the estimated MSE of the method TRN as a fraction of the estimated MSE of the method PLS, i.e. for each m we display

$$\mathrm{MSE\text{-}RATIO} = \frac{\mathrm{MSE}\left( \hat\beta^{(m)}_{TRN} \right)}{\mathrm{MSE}\left( \hat\beta^{(m)}_{PLS} \right)}.$$

(As already mentioned, we display the MSE-RATIO for $\hat y$ in the case p = 100.) The results are displayed in Figures 1, 2 and 3. In order to have a compact representation, we consider the averaged MSE-RATIOS for different parameter settings. E.g. we fix a degree of collinearity (say high collinearity) and display the averaged MSE-RATIO over the three different signal-to-noise ratios. The results for all 27 data sets are shown in the tables in the appendix.
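The simulation design can be reproduced along the following lines. This is my own implementation of the description above, not the original simulation code:

```python
import numpy as np

def simulate_setting(n=50, p=5, collinearity="high", stnr=3, rng=None):
    """Sketch of the simulation design in Section 7: Gaussian X with covariance
    C_1, C_2 or C_3, binary beta, and noise variance matching the given stnr."""
    rng = np.random.default_rng() if rng is None else rng
    if collinearity == "no":
        C = np.eye(p)                                           # C_1 = I_p
    elif collinearity == "moderate":
        idx = np.arange(p)
        C = 1.0 / (np.abs(idx[:, None] - idx[None, :]) + 1.0)   # (C_2)_ij = 1/(|i-j|+1)
    else:
        C = np.full((p, p), 0.7) + 0.3 * np.eye(p)              # (C_3): 1 on diag, 0.7 off
    X = rng.multivariate_normal(np.zeros(p), C, size=n)
    X -= X.mean(axis=0)                                         # center, as assumed in Section 2
    beta = rng.integers(0, 2, size=p).astype(float)             # beta in {0,1}^p
    sigma2 = np.var(X @ beta) / stnr                            # stnr = var(X beta) / sigma^2
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y - y.mean(), beta, sigma2
```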

Figure 1: MSE-RATIO for p = 5. The figures show the averaged MSE-RATIO for different parameter settings. Left: comparison for high (solid line), moderate (dotted line) and no (dashed line) collinearity. Right: comparison for stnr 1 (solid line), 3 (dotted line) and 7 (dashed line).

Figure 2: MSE-RATIO for p = 40.

There are several observations. The MSE of TRN is lower almost all of the time. The decrease of the MSE is particularly large if the number of components m is small, but > 1. For larger m, the difference decreases. This is not surprising, as for large m, the difference between the PLS estimator and the OLS estimator decreases. Hence we expect the difference between TRN and PLS to become smaller. The reduction of the MSE is particularly prominent in complex situations, i.e. in situations with collinearity in X or with a low signal-to-noise ratio.

Figure 3: MSE-RATIO for p = 100. In this case, we display the MSE-RATIO for ŷ instead of β. Only the first 20 components are displayed.

Another feature, which cannot be deduced from Figures 1, 2 and 3 but from the tables in the appendix, is the fact that the optimal numbers of components

$$m^{opt}_{PLS} = \arg\min_m \mathrm{MSE}\left( \hat\beta^{(m)}_{PLS} \right), \qquad m^{opt}_{TRN} = \arg\min_m \mathrm{MSE}\left( \hat\beta^{(m)}_{TRN} \right)$$

are equal almost all of the time. This is also true if we consider the MSE of ŷ. We can benefit from this if we want to select an optimal model for truncated PLS. We return to this subject in Section 8.

Real world data

In this example, we consider the near infrared spectra (NIR) of n = 171 meat samples, measured at p = 100 different wavelengths. This data set is taken from the StatLib datasets archive. The task is to predict the fat content of a meat sample on the basis of its NIR spectrum. We choose this dataset as PLS is widely used in the chemometrics field. In this type of application, we usually observe a lot of predictor variables which are highly correlated. We estimate the MSE of the two methods, PLS and truncated PLS, by computing the 10-fold cross-validated error of the two estimators. The results are displayed in Figure 4. Again, TRN is better almost all of the time, although the difference is small. Note furthermore that the optimal numbers of components are almost identical for the two methods: we have $m^{opt}_{PLS} = 15$ and $m^{opt}_{TRN} = 16$.
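The cross-validated comparison can be organized as in the following generic sketch (my own code; `estimator` stands for any map from training data and number of components to a coefficient vector, e.g. the PLS or TRN sketches above):

```python
import numpy as np

def cv_error(X, y, m, estimator, n_folds=10, rng=None):
    """Sketch of a 10-fold cross-validated test error for a given estimator."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta = estimator(X[train], y[train], m)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errors)
```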

Figure 4: 10-fold cross-validated test error for the Tecator data set. The solid line corresponds to PLS, the dashed line corresponds to truncated PLS.

8 Discussion

We saw in Section 7 that bounding the absolute value of the PLS shrinkage factors by one seems to improve the MSE of the estimator. So should we now discard PLS and always use TRN instead? There are (at least) two objections. Firstly, it would be somewhat rash to rely on the results of a small-scale simulation study. Secondly, TRN is computationally more expensive than PLS: we need the full singular value decomposition of X, and in each step, we have to compute the PLS estimator and adjust its shrinkage factors by hand. However, the experiments suggest that it can be worthwhile to compare PLS and truncated PLS. We pointed out in Section 7 that the two methods do not seem to differ much in terms of the optimal number of components. In order to reduce the computational costs of truncated PLS, we therefore suggest the following strategy. We first compute the optimal PLS model on a training set and choose the optimal model with the help of a model selection criterion. In a second step, we truncate the shrinkage factors of the optimal model. We then use a validation set in order to quantify the difference between PLS and TRN and choose the method with the lower validation error.

References

Butler, N. & Denham, M. (2000), 'The Peculiar Shrinkage Properties of Partial Least Squares Regression', Journal of the Royal Statistical Society, Series B 62(3).

de Jong, S. (1995), 'PLS shrinks', Journal of Chemometrics 9.

Frank, I. & Friedman, J. (1993), 'A Statistical View of some Chemometrics Regression Tools', Technometrics 35.

Helland, I. (1988), 'On the Structure of Partial Least Squares Regression', Communications in Statistics, Simulation and Computation 17(2).

Höskuldsson, A. (1988), 'PLS Regression Methods', Journal of Chemometrics 2.

Lanczos, C. (1950), 'An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators', Journal of Research of the National Bureau of Standards 45.

Lingjaerde, O. & Christopherson, N. (2000), 'Shrinkage Structures of Partial Least Squares', Scandinavian Journal of Statistics 27.

Parlett, B. (1998), The Symmetric Eigenvalue Problem, Society for Industrial and Applied Mathematics.

Phatak, A. & de Hoog, F. (2002), 'Exploiting the Connection between PLS, Lanczos, and Conjugate Gradients: Alternative Proofs of some Properties of PLS', Journal of Chemometrics 16.

Rosipal, R. & Krämer, N. (2006), 'Overview and Recent Advances in Partial Least Squares', in Subspace, Latent Structure and Feature Selection Techniques, Lecture Notes in Computer Science, Springer.

Wold, H. (1975), 'Path Models with Latent Variables: The NIPALS Approach', in H. B. et al., ed., Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, Academic Press.

A Appendix: Results of the simulation study

We display the results of the simulation study that is described in Section 7. The following tables show the MSE-RATIO for β as well as for ŷ. In addition to the MSE ratio, we display the optimal number of components for each method. It is interesting to see that the two quantities are the same almost all of the time.

Table 1: MSE-RATIO of β for p = 5. The first two rows display the setting of the parameters (collinearity: no/med./high, and stnr). The rows entitled 1-4 display the MSE ratio for the respective number of components; the last two rows display m^opt_PLS and m^opt_TRN.

Table 2: MSE-RATIO of ŷ for p = 5.

Table 3: MSE-RATIO of β for p = 40.

Table 4: MSE-RATIO of ŷ for p = 40.

Table 5: MSE-RATIO of ŷ for p = 100. We only display the results for the first 20 components, as the MSE-RATIO equals 1 (up to 4 digits after the decimal point) for the remaining components.


Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008

Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Exam 2 will be held on Tuesday, April 8, 7-8pm in 117 MacMillan What will be covered The exam will cover material from the lectures

More information

Lecture 5 Singular value decomposition

Lecture 5 Singular value decomposition Lecture 5 Singular value decomposition Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University, tieli@pku.edu.cn

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

UNIT 6: The singular value decomposition.

UNIT 6: The singular value decomposition. UNIT 6: The singular value decomposition. María Barbero Liñán Universidad Carlos III de Madrid Bachelor in Statistics and Business Mathematical methods II 2011-2012 A square matrix is symmetric if A T

More information

22m:033 Notes: 7.1 Diagonalization of Symmetric Matrices

22m:033 Notes: 7.1 Diagonalization of Symmetric Matrices m:33 Notes: 7. Diagonalization of Symmetric Matrices Dennis Roseman University of Iowa Iowa City, IA http://www.math.uiowa.edu/ roseman May 3, Symmetric matrices Definition. A symmetric matrix is a matrix

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

2 Garrett: `A Good Spectral Theorem' 1. von Neumann algebras, density theorem The commutant of a subring S of a ring R is S 0 = fr 2 R : rs = sr; 8s 2

2 Garrett: `A Good Spectral Theorem' 1. von Neumann algebras, density theorem The commutant of a subring S of a ring R is S 0 = fr 2 R : rs = sr; 8s 2 1 A Good Spectral Theorem c1996, Paul Garrett, garrett@math.umn.edu version February 12, 1996 1 Measurable Hilbert bundles Measurable Banach bundles Direct integrals of Hilbert spaces Trivializing Hilbert

More information

Foundations of Matrix Analysis

Foundations of Matrix Analysis 1 Foundations of Matrix Analysis In this chapter we recall the basic elements of linear algebra which will be employed in the remainder of the text For most of the proofs as well as for the details, the

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016 Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Review problems for MA 54, Fall 2004.

Review problems for MA 54, Fall 2004. Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on

More information

COMS 4771 Lecture Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso

COMS 4771 Lecture Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso COMS 477 Lecture 6. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso / 2 Fixed-design linear regression Fixed-design linear regression A simplified

More information

Learning with Singular Vectors

Learning with Singular Vectors Learning with Singular Vectors CIS 520 Lecture 30 October 2015 Barry Slaff Based on: CIS 520 Wiki Materials Slides by Jia Li (PSU) Works cited throughout Overview Linear regression: Given X, Y find w:

More information

Lecture 14 Singular Value Decomposition

Lecture 14 Singular Value Decomposition Lecture 14 Singular Value Decomposition 02 November 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Goals for today singular value decomposition condition numbers application to mean squared errors

More information

Principal Component Analysis and Linear Discriminant Analysis

Principal Component Analysis and Linear Discriminant Analysis Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29

More information