Principal Components Analysis (PCA): Relationship Between a Linear Combination of Variables and Axis Rotation for PCA
- Phoebe Ward
Principal Components Analysis: uses one group of variables (we will call this X). In PCA, we want to construct p independent random variables from the original p variables. We can do this via an axis rotation obtained by using linear combinations of the original p variables.

Example of a rotation, for p = 2: [Graph in class]

Point B has x1 = 8 and x2 = 5 based on the axes x1 and x2 used to define this point, and x1 is orthogonal to x2.

- Say we rotate x1 by θ = 20°, and call this new axis z1.
- z1 is a linear combination of x1 and x2:
  z1 = cos(θ11) x1 + cos(θ12) x2, where θ11 = θ and θ12 = 90° − θ
- A second axis, z2, is then orthogonal (90°) to z1:
  z2 = cos(θ21) x1 + cos(θ22) x2, where θ21 = 90° + θ and θ22 = θ

Therefore, a rotation of the axes creates a new set of axes which are linear combinations of the original ones. The point can be redefined using these new axes.
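The rotation above can be sketched numerically. This is a minimal check, assuming point B = (8, 5) and θ = 20° (the angle implied by the cos(70°) entry that survives later in the notes); a rigid rotation must leave the point's distance from the origin unchanged.

```python
import math

# Point B and the assumed rotation angle (20 degrees).
theta = math.radians(20)
x1, x2 = 8.0, 5.0

# z1 = cos(theta11) x1 + cos(theta12) x2, with theta11 = 20, theta12 = 70
z1 = math.cos(theta) * x1 + math.cos(math.radians(70)) * x2
# z2 = cos(theta21) x1 + cos(theta22) x2, with theta21 = 110, theta22 = 20
z2 = math.cos(math.radians(110)) * x1 + math.cos(theta) * x2

# A rigid rotation preserves the distance of the point from the origin.
length_before = math.hypot(x1, x2)   # sqrt(8^2 + 5^2) = sqrt(89)
length_after = math.hypot(z1, z2)
```

The new coordinates (z1, z2) describe the same point B on the rotated axes, so length_before and length_after agree.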
Using matrices instead of angles:

[ z1  z2 ] = [ x1  x2 ] [ cos(θ11)  cos(θ21) ]
                        [ cos(θ12)  cos(θ22) ]

[ z1  z2 ] = [ x1  x2 ] [ a11  a12 ]
                        [ a21  a22 ]

Z = XA

If A'A = I, then a rotation occurs, and A is an orthogonal matrix. For a proper rotation, |A| = 1. For an improper rotation (rotation and reflection), |A| = −1. Therefore, a linear transformation Z = XA is a rigid rotation if A is an orthogonal matrix and |A| = 1.

Looking at the transformation matrix A:

A = [ a11  a12 ]
    [ a21  a22 ]

- Σi aij² = 1 for every column j, from 1 to p columns (only 2 columns for this example); i.e., multiply any column by itself and you will get 1. This means that the transformation matrix, A, is normal. NOTE: This does not have anything to do with the normal distribution!
- Σi aij aik = 0 for any pair of columns (j and k); i.e., multiply any two columns and you will get zero. This means that A is orthogonal (90°).
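These two column conditions can be checked directly. The sketch below builds the 20° rotation matrix (the angle is inferred from the cos(70°) entry in the worked example) and verifies unit-length columns, zero column cross products, and determinant +1 (a proper rotation).

```python
import math

# A = [[cos 20, cos 110], [cos 70, cos 20]] for the 20-degree example.
c20, c70, c110 = (math.cos(math.radians(a)) for a in (20, 70, 110))
A = [[c20, c110],
     [c70, c20]]

col_norm_1 = A[0][0]**2 + A[1][0]**2         # column 1 times itself -> 1
col_norm_2 = A[0][1]**2 + A[1][1]**2         # column 2 times itself -> 1
col_dot = A[0][0]*A[0][1] + A[1][0]*A[1][1]  # column 1 times column 2 -> 0
det = A[0][0]*A[1][1] - A[0][1]*A[1][0]      # +1 for a proper rotation
```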
For the example, using a 20° orthogonal rotation:

A = [ cos(θ11)  cos(θ21) ] = [ cos(20°)  cos(110°) ]
    [ cos(θ12)  cos(θ22) ]   [ cos(70°)  cos(20°)  ]

Checking the columns: a11² + a21² = cos²(20°) + cos²(70°) = 1; a12² + a22² = 1; and a11 a12 + a21 a22 = 0.

Using linear transformations for p variables and n observations:

PCA uses principal axis rotation, based on a linear combination of the p variables, to create new variables, Zj. There will be p new Z variables from the p original X variables. The Zj:
- are known as the principal components.
- are created by obtaining a linear combination of the p original variables.
- will be orthogonal to each other, even though the original X variables were not.

For principal component one:

z_i1 = v11 x_i1 + v21 x_i2 + v31 x_i3 + … + v_p1 x_ip

The v_jk are the coefficients (multipliers) used in the linear transformation of the X's to the Z's. z_i1 represents the principal component score for the ith observation on principal component 1 (sometimes called PC1 in textbooks). For n observations, there will be n values for each X variable, and n values (principal component scores) for each z.

There will be p of these principal components, each one having a different set of multipliers: the v_jk differ for each principal component. In matrix form:

Z   =   X   V
(n×p) (n×p) (p×p)

- each column of X is a variable
- each column of Z is a principal component
- each row of X is a measure of the X variables on one observation
- each row of Z is the principal component scores for one observation
- each column of V represents the multipliers for a particular PC (one multiplier for each of the p original X variables)

V will be an orthogonal matrix, and therefore the transformation of X represents a rotation of X.

Centroid: This is a vector of the average values. For the p X's, there will be p averages. If the average X's are put into the equations for the principal components, you will get the average principal component score:

z̄1 = v11 x̄1 + v21 x̄2 + v31 x̄3 + … + v_p1 x̄p

Choosing a rotation: What rotation should be used? What then are the v_jk values?

Objective of PCA:
Step 1. Obtain PC1 that gives the maximum variance accounted for.
Step 2. PC2 is the next maximum variance, but is orthogonal to PC1 (remember the Gram-Schmidt process?).
Step 3. PC3 is the next maximum variance, but is orthogonal to both PC1 and PC2.
Step 4. Continue until there are p principal components from the p original X variables. [graph in class]
Variances of the X's versus variances of the Z's:

There will be a direct relationship between the covariance matrix for the Z's and the covariance matrix for the X's. There will be a direct relationship between the correlation matrix for the Z's and the correlation matrix for the X's. Why? And what is this relationship?

Z   =   X   V
(n×p) (n×p) (p×p)

Let Sz = the corrected sums of squares and cross products matrix for Z. Since

Sz = Z'Z − (Z'1)(1'Z)/n

substitute Z with XV, and simplify:

Sz = (XV)'(XV) − ((XV)'1)(1'(XV))/n

Since (XV)' = V'X':

Sz = (V'X')(XV) − (V'X'1)(1'XV)/n
Sz = V'(X'X)V − V'((X'1)(1'X)/n)V
Sz = V'( X'X − (X'1)(1'X)/n )V
Sz = V' Sx V

where Sx = the corrected sums of squares and cross products matrix for X. Since the covariance of X is Sx/(n−1), and the covariance for Z is Sz/(n−1):

Cov(Z) = V' Cov(X) V

In many textbooks, the symbol Σx is used for the covariance of X, and the symbol Λ is used for the covariance of Z; then:

Λ = V' Σx V
Eigenvalues and eigenvectors:

The principal components will be independent (orthogonal). Therefore, all covariances will be equal to zero. The covariance matrix for the Z's will be a diagonal matrix, with each element representing the variance for a principal component:

Λ = [ λ1  0   …  0  ]
    [ 0   λ2  …  0  ]
    [ …   …   …  …  ]
    [ 0   0   …  λp ]

For example, λ1 is the variance for principal component 1. These are also called eigenvalues, or latent roots, or characteristic roots of the covariance matrix for the X variables, Σx. The transformation matrix used to make a linear transformation of the X variables by rotating the axes (V, the multipliers) contains the eigenvectors, or latent vectors, or characteristic vectors (one column, a vector, for each principal component).

Therefore: the eigenvector matrix, V, is used to transform the covariance matrix of X to get the more simple covariance matrix for Z. This is the same as transforming the X variables using a linear model of these multipliers (eigenvector values) to get the Z variables (principal components). Getting the eigenvectors and eigenvalues is also called diagonalizing the matrix.

Since V is an orthogonal matrix (rigid rotation), the Z's will be uncorrelated (orthogonal), and it may be easier to interpret the Z variables than to interpret the X variables. Also:

Σx V = V Λ   and   x' Σx x = z' Λ z

SO there is no loss of information. The same information is just represented on new basis vectors! Simply transformed data.
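The diagonalization V' Σx V = Λ can be sketched with small matrices. The values below are taken from the 2×2 worked example later in the notes, as reconstructed (Sx = [[10, 12], [12, 20]], with eigenvalues 28 and 2 and unit eigenvectors (2, 3)/√13 and (−3, 2)/√13).

```python
import math

s = math.sqrt(13.0)
V = [[2/s, -3/s],          # columns are the two unit eigenvectors
     [3/s,  2/s]]
Sx = [[10.0, 12.0],
      [12.0, 20.0]]

def matmul(A, B):
    # 2x2 matrix product, enough for this sketch.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Vt = [[V[0][0], V[1][0]],  # transpose of V
      [V[0][1], V[1][1]]]

Lam = matmul(matmul(Vt, Sx), V)   # should come out as diag(28, 2)
```

The off-diagonal entries of Lam vanish (up to rounding), which is exactly the statement that the rotated variables are uncorrelated.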
If the X variables are linearly dependent, then |Σx| = 0, and Σx⁻¹ doesn't exist (singular matrix, not of full rank). THEN, at least one of the eigenvalues (the diagonal elements of Λ) will be zero. Since:

Σ_{j=1}^{p} var(Zj) = Σ_{j=1}^{p} λj = Σ_{j=1}^{p} var(Xj) = trace(Σx)

If the matrix is not full rank, we can reduce the number of principal components, as some of them will have 0 variance; we will still represent all of the X variables' variances. We can further reduce by eliminating those principal components that have very low variance; we will still represent most of the variance in the X variables.

Uses of PCA:

PCA, then:
- Can be used to create independent variables, Z, from a set of correlated variables, X.
- Can be used to reduce the number of dimensions, from p to some smaller value, for ease of interpretation.
- We can also use the PCA results to reduce the number of X variables, as not all variables will have high impact (correlation) on the new Z variables.

These principal components can be used:
- As indices in further analyses, like factor analysis.
- As indices to replace a number of variables; e.g., PC1 may represent soil productivity, as measured by many variables. These can then be graphed, used in further analyses, etc., and these new variables will be uncorrelated.
- In graphs: simpler views, and the new Z variables are uncorrelated.
- As a first step in analysis to aid in reducing the number of variables. (More on this later.)
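The singularity claim can be demonstrated with illustrative data (values are made up for this sketch, not from the notes): if x3 = x1 + x2 exactly, the 3×3 covariance matrix has determinant 0, so at least one eigenvalue must be 0.

```python
# Illustrative data where x3 is an exact linear combination of x1 and x2.
x1 = [1.0, 2.0, 4.0, 3.0, 5.0]
x2 = [2.0, 1.0, 3.0, 5.0, 4.0]
x3 = [a + b for a, b in zip(x1, x2)]   # x3 = x1 + x2

cols, n = [x1, x2, x3], len(x1)
means = [sum(c) / n for c in cols]

# Sample covariance matrix C (divisor n - 1).
C = [[sum((cols[i][t] - means[i]) * (cols[j][t] - means[j])
          for t in range(n)) / (n - 1)
      for j in range(3)] for i in range(3)]

# Determinant by cofactor expansion along the first row.
det = (C[0][0] * (C[1][1]*C[2][2] - C[1][2]*C[2][1])
     - C[0][1] * (C[1][0]*C[2][2] - C[1][2]*C[2][0])
     + C[0][2] * (C[1][0]*C[2][1] - C[1][1]*C[2][0]))
```

Since the third row of C is the sum of the first two, det is 0 (up to floating point), confirming that the covariance matrix is not of full rank.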
How do we find the eigenvectors and eigenvalues?

Each eigenvector is a (p × 1) column of multipliers, which we will label v1 (any column of V). Then:
- v1 will be a (p × 1) transformation matrix (a column vector, actually).
- We want to get this eigenvector (multipliers) to rotate the axes (transform the X variables), but we do not want a stretch. Constrain v1'v1 = 1.
- We want this rotation to maximize the variance of the first principal component.

The problem then is to maximize v1' Sx v1 subject to preventing a stretch by using the constraint v1'v1 = 1. This means the eigenvectors will be normal, with a length of 1 (no stretch).

NOTE: Instead of the corrected sums of squares and cross products matrix, Sx, we can use the covariance matrix for X, Σx, or we can use the correlation matrix for X.

We can use Lagrange multipliers to get the solution (the eigenvector):

F(v1) = v1' Sx v1 − λ (v1'v1 − 1)

To maximize this function subject to the constraint, we first take partial derivatives with respect to v1:

∂F/∂v1 = 2 Sx v1 − 2 λ v1

Then, set this equal to a zero matrix (a (p × 1) matrix of zeros), and divide through by 2 to get:

(Sx − λ Ip) v1 = 0
 (p×p)        (p×1)

The solution for this will give the maximum for v1' Sx v1 subject to preventing a stretch by using the constraint v1'v1 = 1.
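The constrained maximization has a simple numerical illustration: over all unit vectors v, the quadratic form v' Sx v peaks at the largest eigenvalue. The sketch below scans unit vectors on a fine angular grid, using the 2×2 Sx from the worked example as reconstructed ([[10, 12], [12, 20]], largest eigenvalue 28).

```python
import math

Sx = [[10.0, 12.0],
      [12.0, 20.0]]

def quad_form(theta):
    # v = (cos theta, sin theta) is automatically a unit vector,
    # so the constraint v'v = 1 is built in.
    v0, v1 = math.cos(theta), math.sin(theta)
    return Sx[0][0]*v0*v0 + 2*Sx[0][1]*v0*v1 + Sx[1][1]*v1*v1

# Scan [0, pi); the maximum of v' Sx v approximates the largest eigenvalue.
best = max(quad_form(math.pi * k / 100000) for k in range(100000))
```

The grid maximum lands on 28 to high accuracy, matching the largest eigenvalue found analytically below.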
The solution could be to let v1 be a column of zeros; not very useful (called a trivial solution). For a better solution, let:

|Sx − λ Ip| = 0

This will give us the characteristic equation for the matrix Sx. Then, we can solve for λ. There will be more than one possible value. These will be the eigenvalues. Then, substitute each eigenvalue into:

(Sx − λ Ip) v = 0

and solve to obtain the eigenvector associated with that eigenvalue.

Example: p = 2 (easy to get the determinant of this size of matrix)

Sx = [ 10  12 ]
     [ 12  20 ]

λI = [ λ  0 ]
     [ 0  λ ]

Sx − λI = [ 10−λ    12  ]
          [  12   20−λ  ]

Then:

|Sx − λI| = (10−λ)(20−λ) − (12)(12)

(since the rule is (ad − bc) for the determinant of a 2×2 matrix), which is:

200 − 10λ − 20λ + λ² − 144 = λ² − 30λ + 56
We can solve this using the formula for any quadratic equation: for ax² + bx + c = 0,

x = ( −b ± √(b² − 4ac) ) / (2a)

For this problem, the x is the λ, with a = 1, b = −30 and c = 56. We then get λ1 = 28 and λ2 = 2:

λ1 = ( −(−30) + √((−30)² − 4(1)(56)) ) / (2(1)) = (30 + 26)/2 = 28
λ2 = ( −(−30) − √((−30)² − 4(1)(56)) ) / (2(1)) = (30 − 26)/2 = 2

We have 2 non-zero eigenvalues, since there were 2 X-variables, and the matrix was full rank, with a non-zero determinant. Proof:

|Sx| = (10)(20) − (12)(12) = 56

The eigenvalue matrix is then:

Λ = [ λ1  0  ] = [ 28  0 ]
    [ 0   λ2 ]   [ 0   2 ]

We need the eigenvectors. Using the first eigenvalue, with H1 = (Sx − λ1 Ip):

(Sx − λ1 Ip) v1 = 0
H1 v1 = 0
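The quadratic-formula step translates directly to code. This is a sketch of the computation just shown, with a = 1, b = −30, c = 56:

```python
import math

# Characteristic equation: lambda^2 - 30*lambda + 56 = 0.
a, b, c = 1.0, -30.0, 56.0
disc = math.sqrt(b*b - 4*a*c)      # sqrt(900 - 224) = sqrt(676) = 26
lam1 = (-b + disc) / (2*a)         # (30 + 26) / 2 = 28
lam2 = (-b - disc) / (2*a)         # (30 - 26) / 2 = 2

# Full-rank check: |Sx| = product of the eigenvalues.
det_Sx = 10*20 - 12*12             # 56
```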
Recall that the inverse of the H matrix can be written as:

H⁻¹ = (cof H)' / |H|

Step 1: Find the cofactor (also called adjoint) matrix for H, using the formula given earlier:

cij = (−1)^(i+j) |Mij|

for each element of the cofactor of H. Then, get the transpose of this. Note: all the columns of (cof H)' will be proportional to each other.

Step 2: Calculate the length of any column of (cof H)', as the length of that vector (that column):

√( Σ_{i=1}^{p} ci² )

This prevents the stretch: the length of the eigenvector will be equal to 1.

Step 3: Divide the selected column of (cof H)' by the length of that vector, and this will be v1, the first eigenvector. Since the columns of (cof H)' are proportional, any column will give the same value.

Step 4: Repeat the steps with each of the eigenvalues.
For the example:

H1 = [ 10−λ1    12  ]
     [  12    20−λ1 ]

For λ1 = 28:

H1 = [ −18   12 ]
     [  12   −8 ]

cof H1 = [ −8   −12 ]
         [ −12  −18 ]

e.g., for i = 1, j = 1: c11 = (−1)^(1+1)(−8) = −8. For a 2×2 matrix, the submatrices Mij are a single value (1×1). Since H1 is symmetric, (cof H1)' is the same matrix.

Then, we can use either column 1 or column 2. Using column 1, the length of this vector is:

√( (−8)² + (−12)² ) = √208 = 14.42

Then, we can calculate the eigenvector as:

v1 = [ −8/14.42  ] = [ −0.5547 ]
     [ −12/14.42 ]   [ −0.8321 ]

Double check this: H1 v1 = 0 (within rounding). So this works!
Try column 2, instead. The length of column 2 is:

√( (−12)² + (−18)² ) = √468 = 21.63

v1 = [ −12/21.63 ] = [ −0.5547 ]
     [ −18/21.63 ]   [ −0.8321 ]

NOTE: Same result; we can use either column.

We must then repeat this for the other eigenvalue:

H2 = [ 10−λ2    12  ]
     [  12    20−λ2 ]

For λ2 = 2:

H2 = [ 8   12 ]
     [ 12  18 ]

Then, solve for v2. [not shown]
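Steps 1 through 3 of the cofactor recipe can be sketched for this 2×2 case (values as reconstructed: Sx = [[10, 12], [12, 20]], λ1 = 28):

```python
import math

Sx = [[10.0, 12.0],
      [12.0, 20.0]]
lam = 28.0

# H = Sx - lambda * I
H = [[Sx[0][0] - lam, Sx[0][1]],
     [Sx[1][0], Sx[1][1] - lam]]     # [[-18, 12], [12, -8]]

# Step 1: cofactor matrix of [[a, b], [c, d]] is [[d, -c], [-b, a]];
# its transpose is the adjugate [[d, -b], [-c, a]].
adj = [[H[1][1], -H[0][1]],
       [-H[1][0], H[0][0]]]          # [[-8, -12], [-12, -18]]

# Columns of adj are proportional; take column 1.
col = [adj[0][0], adj[1][0]]         # [-8, -12]

# Step 2: length of the column; Step 3: divide to get a unit eigenvector.
length = math.hypot(col[0], col[1])  # sqrt(208), about 14.42
v = [x / length for x in col]        # about [-0.5547, -0.8321]

# Double check: Sx v should equal lambda v.
Sxv = [Sx[0][0]*v[0] + Sx[0][1]*v[1],
       Sx[1][0]*v[0] + Sx[1][1]*v[1]]
```

The check works because H times its adjugate equals |H| I = 0 here, so every column of the adjugate lies in the null space of H; normalizing any column gives the unit eigenvector.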
Properties of the PCA:

1. Eigenvectors will be orthonormal: each column has a length of 1, and if we multiply any two columns, we will get 0.
2. If the covariance of the X variables is used, the sum of the eigenvalues will be equal to the sum of the diagonal elements (the trace) of the covariance matrix for the X-variables, Σx, so the variance is preserved.
3. The eigenvalues are the variances for the principal components, and are ordered from largest to smallest, from PC1 to the last principal component.
4. The first axis, defined as principal component 1, will explain the most variation of the original X-variables; the next axis will be the next most variance explained, etc. In graphical form, the first axis is the major axis of the ellipse, etc.
5. The rotation will be a rigid rotation (just a rotation), produced by taking a linear transformation of the X-variables. The transformation matrix used (the eigenvectors) is orthonormal.
6. All principal components are independent of each other. The covariance for any pair of principal component scores (new variables created) will be 0.
7. If you multiply the eigenvalues together, you get the determinant of the matrix for the X variables (the Sx matrix, or Σx, or correlation matrix, depending on which was used in the PCA).
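Properties 1, 2, and 7 can be verified on the worked example (values as reconstructed: Sx = [[10, 12], [12, 20]], eigenvalues 28 and 2, eigenvectors (2, 3)/√13 and (−3, 2)/√13):

```python
import math

s = math.sqrt(13.0)
v1, v2 = (2/s, 3/s), (-3/s, 2/s)   # unit eigenvectors
lam1, lam2 = 28.0, 2.0             # eigenvalues, largest first

trace = 10 + 20                    # sum of diagonal elements of Sx
prod = lam1 * lam2                 # should equal |Sx| = 56
dot = v1[0]*v2[0] + v1[1]*v2[1]    # property 1: orthogonal columns -> 0
len1 = v1[0]**2 + v1[1]**2         # property 1: unit length -> 1
```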
Using the correlation matrix instead of the covariance matrix for the X-variables:

1. The correlation matrix of the X-variables is the same as the covariance matrix for the normalized X-variables (subtract the means and divide by the standard deviations).
2. Therefore, we could use (1) the normalized X-variables and do a PCA using the covariance matrix of the normalized X-variables, OR (2) the correlation matrix of the original X-variables in a PCA; these will give the same result.
3. When a correlation matrix (or a covariance matrix for the standardized X-variables) is used instead of a covariance matrix (for the original X-variables):
   a. The eigenvalues will sum to the number of X-variables, p, since the diagonal values of the correlation matrix are all equal to 1.
   b. Using the correlation matrix removes the differences in measurement scales among the X-variables.
4. Commonly, the correlation matrix is used in PCA.
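Point 1 is easy to confirm numerically: the correlation of two variables equals the covariance of their z-scores. The data values below are illustrative only.

```python
import math

x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

def mean(a):
    return sum(a) / len(a)

def sd(a):
    m = mean(a)
    return math.sqrt(sum((v - m)**2 for v in a) / (len(a) - 1))

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

r = cov(x, y) / (sd(x) * sd(y))           # correlation, computed directly

zx = [(v - mean(x)) / sd(x) for v in x]   # normalized (z-score) variables
zy = [(v - mean(y)) / sd(y) for v in y]
r_from_z = cov(zx, zy)                    # covariance of the z-scores
```

The two quantities agree exactly, which is why a PCA on R and a PCA on the covariance of the standardized variables give the same result.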
Application of PCA

The idea behind the application of PCA to a problem is to transform the original set of variables into another set (principal components, Zj). Principal components are:

1. Linear functions of the original variables.
2. Orthogonal.
3. As many as the original variables (there are p principal components for p original variables).
4. Equal in total variation to that originally present.
5. Ordered: the variation associated with each component decreases in order.

Distinctions between PCA and factor analysis:

1. In factor analysis, the p variables are transformed into m < p factors.
2. Factor analysis has the potential for rotating the axes to an oblique position.
3. In factor analysis, only a portion of the variation is attributable to the m factors (the communalities); the remainder is considered error.
Uses of PCA:

1. Dimension reduction: e.g., represent the original variables by 3 principal components. Use principal components in further analyses. For example, the objective of one study was to predict site index (growing potential) from many soil and vegetation variables. Regression analysis will be used to create a prediction equation. However, before regression analysis, PCA is used on the soil and vegetation variables. The new principal component variables (linear combinations of the original soil and vegetation variables) are used to predict the site index, instead of the original variables. Since the principal component variables are orthogonal (variables are independent), the coefficients of the multiple regression analysis can be interpreted directly. However, interpretation may still be difficult, since the principal components are linear combinations of the original variables. Another objective was to see if underlying factors could be identified for the soil and vegetation data. PCA was used as the first step to reduce the number of principal components, followed by factor analysis (more on this with factor analysis).

2. Variable reduction: Look for redundancies in the data. In further analysis, use the reduced set of variables. For the site index example, the PCA could be used to identify which soil and vegetation variables contribute most to the variance of these variables. Using this knowledge, some of the variables are dropped. The remaining set is used to predict site index. Since the variables are not likely orthogonal (called multicollinearity in regression textbooks), the regression coefficients cannot be interpreted directly. However, interpretation may be easier, since a reduced set of the original variables is used in the analysis.
3. Ordination: Plot principal components against each other to look for trends. Used particularly with multiple discriminant analysis and cluster analysis. The principal components for the soil and vegetation data could be plotted. Using these bivariate plots, groups of similar soil/vegetation data points could be identified.

Steps in PCA:

1. Selection of preliminary variables; these should be interval or ratio scale, otherwise the C or R matrices have no real meaning.
2. Transform the data so that the data are multivariate normal. This step is not strictly required. It is needed in order to use tests of significance of eigenvalues, but may make interpretation difficult.
3. Calculate either the variance/covariance matrix (C) or the correlation matrix (R). Solutions will differ depending on which matrix you select. The variance/covariance matrix is best if units are on the same scale. If variables have units of different scales, the correlation matrix is preferred (the trace of R sums to p). It is easier to interpret the results if the variance/covariance matrix is used. REMEMBER: The correlation matrix (R) is simply the covariance matrix for z-scores (variables are scaled by subtracting the mean and dividing by the standard deviation).
4. Determine the eigenvectors and eigenvalues of the selected matrix.
5. Interpret the derived components.
(1) Relate principal components to observed variables. Calculate component loadings (correlations between the original variables and the principal components).

(2) Decide how many principal components are required (see paper on this), by:
   a. Tests of significance. Only applicable if the data are multivariate normal.
   b. Setting the eigenvalue cutoff to some arbitrary value. If R was used, a value of > 1 is commonly used. If the C matrix was used, set the number as a percent of total variation (i.e., to a cumulative percent such as 95).
   c. Scree test: Plot the eigenvalue versus the principal component number. Look for a trend where additional eigenvalues are small. Problems: there may be no obvious break, or there may be many breaks.

(3) Reducing the number of variables: Choose variables with the highest eigenvector values if the variables are on the same scale. For variables not on the same scale, base selection on the component loadings (simple correlations). There may be more than one variable for each principal component.

[See handout on different methods to select the number of principal components to retain.]
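The steps above can be sketched end to end for p = 2, where the eigen-solution has a closed form. The data values are illustrative only (not from the notes), and the eigenvector formula assumes the off-diagonal covariance is non-zero.

```python
import math

# Illustrative two-variable data.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
n = len(X)

# Step 3: covariance matrix C (after centering at the centroid).
mx = sum(r[0] for r in X) / n
my = sum(r[1] for r in X) / n
cxx = sum((r[0] - mx)**2 for r in X) / (n - 1)
cyy = sum((r[1] - my)**2 for r in X) / (n - 1)
cxy = sum((r[0] - mx) * (r[1] - my) for r in X) / (n - 1)

# Step 4: eigenvalues from the characteristic quadratic
# lambda^2 - (cxx + cyy) lambda + (cxx cyy - cxy^2) = 0.
tr, det = cxx + cyy, cxx * cyy - cxy**2
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2   # largest first

def unit_eigvec(lam):
    # (C - lam I) v = 0  =>  v proportional to (cxy, lam - cxx).
    v = (cxy, lam - cxx)
    ln = math.hypot(*v)
    return (v[0] / ln, v[1] / ln)

v1, v2 = unit_eigvec(lam1), unit_eigvec(lam2)

# Principal component scores: Z = (X centered) V.
scores = [((r[0] - mx) * v1[0] + (r[1] - my) * v1[1],
           (r[0] - mx) * v2[0] + (r[1] - my) * v2[1]) for r in X]

# The scores are uncorrelated, and their variances equal the eigenvalues.
cov_z12 = sum(a * b for a, b in scores) / (n - 1)
var_z1 = sum(a * a for a, _ in scores) / (n - 1)
```

This mirrors the whole development: rotate via the eigenvectors, get uncorrelated scores, and preserve the total variance (lam1 + lam2 equals the trace of C).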
More informationLinear Algebra & Geometry why is linear algebra useful in computer vision?
Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia
More informationGEOG 4110/5100 Advanced Remote Sensing Lecture 15
GEOG 4110/5100 Advanced Remote Sensing Lecture 15 Principal Component Analysis Relevant reading: Richards. Chapters 6.3* http://www.ce.yildiz.edu.tr/personal/songul/file/1097/principal_components.pdf *For
More informationMath Camp Notes: Linear Algebra II
Math Camp Notes: Linear Algebra II Eigenvalues Let A be a square matrix. An eigenvalue is a number λ which when subtracted from the diagonal elements of the matrix A creates a singular matrix. In other
More informationLecture 15 Review of Matrix Theory III. Dr. Radhakant Padhi Asst. Professor Dept. of Aerospace Engineering Indian Institute of Science - Bangalore
Lecture 15 Review of Matrix Theory III Dr. Radhakant Padhi Asst. Professor Dept. of Aerospace Engineering Indian Institute of Science - Bangalore Matrix An m n matrix is a rectangular or square array of
More informationPrincipal Component Analysis-I Geog 210C Introduction to Spatial Data Analysis. Chris Funk. Lecture 17
Principal Component Analysis-I Geog 210C Introduction to Spatial Data Analysis Chris Funk Lecture 17 Outline Filters and Rotations Generating co-varying random fields Translating co-varying fields into
More informationEigenvalues and diagonalization
Eigenvalues and diagonalization Patrick Breheny November 15 Patrick Breheny BST 764: Applied Statistical Modeling 1/20 Introduction The next topic in our course, principal components analysis, revolves
More informationIntroduction to Matrix Algebra
Introduction to Matrix Algebra August 18, 2010 1 Vectors 1.1 Notations A p-dimensional vector is p numbers put together. Written as x 1 x =. x p. When p = 1, this represents a point in the line. When p
More informationVectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =
Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.
More informationDimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas
Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx
More informationFactor Analysis Continued. Psy 524 Ainsworth
Factor Analysis Continued Psy 524 Ainsworth Equations Extraction Principal Axis Factoring Variables Skiers Cost Lift Depth Powder S1 32 64 65 67 S2 61 37 62 65 S3 59 40 45 43 S4 36 62 34 35 S5 62 46 43
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationDATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD
DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary
More informationA Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag
A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data
More informationEigenvalues, Eigenvectors, and an Intro to PCA
Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.
More informationUnsupervised Learning: Dimensionality Reduction
Unsupervised Learning: Dimensionality Reduction CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 3 Outline In this lecture, we set about to solve the problem posed in the previous lecture Given a dataset,
More informationPrincipal Components Theory Notes
Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory
More information235 Final exam review questions
5 Final exam review questions Paul Hacking December 4, 0 () Let A be an n n matrix and T : R n R n, T (x) = Ax the linear transformation with matrix A. What does it mean to say that a vector v R n is an
More informationMath 443 Differential Geometry Spring Handout 3: Bilinear and Quadratic Forms This handout should be read just before Chapter 4 of the textbook.
Math 443 Differential Geometry Spring 2013 Handout 3: Bilinear and Quadratic Forms This handout should be read just before Chapter 4 of the textbook. Endomorphisms of a Vector Space This handout discusses
More informationWhat is Principal Component Analysis?
What is Principal Component Analysis? Principal component analysis (PCA) Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables Retains most
More informationTBP MATH33A Review Sheet. November 24, 2018
TBP MATH33A Review Sheet November 24, 2018 General Transformation Matrices: Function Scaling by k Orthogonal projection onto line L Implementation If we want to scale I 2 by k, we use the following: [
More information22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes
Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationEcon 204 Supplement to Section 3.6 Diagonalization and Quadratic Forms. 1 Diagonalization and Change of Basis
Econ 204 Supplement to Section 3.6 Diagonalization and Quadratic Forms De La Fuente notes that, if an n n matrix has n distinct eigenvalues, it can be diagonalized. In this supplement, we will provide
More informationMath 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008
Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Exam 2 will be held on Tuesday, April 8, 7-8pm in 117 MacMillan What will be covered The exam will cover material from the lectures
More informationReview problems for MA 54, Fall 2004.
Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on
More informationClusters. Unsupervised Learning. Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved
Clusters Unsupervised Learning Luc Anselin http://spatial.uchicago.edu 1 curse of dimensionality principal components multidimensional scaling classical clustering methods 2 Curse of Dimensionality 3 Curse
More informationMATH 583A REVIEW SESSION #1
MATH 583A REVIEW SESSION #1 BOJAN DURICKOVIC 1. Vector Spaces Very quick review of the basic linear algebra concepts (see any linear algebra textbook): (finite dimensional) vector space (or linear space),
More informationMATH 240 Spring, Chapter 1: Linear Equations and Matrices
MATH 240 Spring, 2006 Chapter Summaries for Kolman / Hill, Elementary Linear Algebra, 8th Ed. Sections 1.1 1.6, 2.1 2.2, 3.2 3.8, 4.3 4.5, 5.1 5.3, 5.5, 6.1 6.5, 7.1 7.2, 7.4 DEFINITIONS Chapter 1: Linear
More informationLinear Methods in Data Mining
Why Methods? linear methods are well understood, simple and elegant; algorithms based on linear methods are widespread: data mining, computer vision, graphics, pattern recognition; excellent general software
More informationPrincipal Components Analysis (PCA) and Singular Value Decomposition (SVD) with applications to Microarrays
Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) with applications to Microarrays Prof. Tesler Math 283 Fall 2015 Prof. Tesler Principal Components Analysis Math 283 / Fall 2015
More informationLecture 7: Positive Semidefinite Matrices
Lecture 7: Positive Semidefinite Matrices Rajat Mittal IIT Kanpur The main aim of this lecture note is to prepare your background for semidefinite programming. We have already seen some linear algebra.
More informationTheorems. Least squares regression
Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most
More informationMachine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component
More informationSingular Value Decomposition. 1 Singular Value Decomposition and the Four Fundamental Subspaces
Singular Value Decomposition This handout is a review of some basic concepts in linear algebra For a detailed introduction, consult a linear algebra text Linear lgebra and its pplications by Gilbert Strang
More informationCOMP6237 Data Mining Covariance, EVD, PCA & SVD. Jonathon Hare
COMP6237 Data Mining Covariance, EVD, PCA & SVD Jonathon Hare jsh2@ecs.soton.ac.uk Variance and Covariance Random Variables and Expected Values Mathematicians talk variance (and covariance) in terms of
More informationDIAGONALIZATION OF THE STRESS TENSOR
DIAGONALIZATION OF THE STRESS TENSOR INTRODUCTION By the use of Cauchy s theorem we are able to reduce the number of stress components in the stress tensor to only nine values. An additional simplification
More information3 (Maths) Linear Algebra
3 (Maths) Linear Algebra References: Simon and Blume, chapters 6 to 11, 16 and 23; Pemberton and Rau, chapters 11 to 13 and 25; Sundaram, sections 1.3 and 1.5. The methods and concepts of linear algebra
More informationMath Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88
Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant
More information33AH, WINTER 2018: STUDY GUIDE FOR FINAL EXAM
33AH, WINTER 2018: STUDY GUIDE FOR FINAL EXAM (UPDATED MARCH 17, 2018) The final exam will be cumulative, with a bit more weight on more recent material. This outline covers the what we ve done since the
More informationPrincipal Component Analysis. Applied Multivariate Statistics Spring 2012
Principal Component Analysis Applied Multivariate Statistics Spring 2012 Overview Intuition Four definitions Practical examples Mathematical example Case study 2 PCA: Goals Goal 1: Dimension reduction
More informationA A x i x j i j (i, j) (j, i) Let. Compute the value of for and
7.2 - Quadratic Forms quadratic form on is a function defined on whose value at a vector in can be computed by an expression of the form, where is an symmetric matrix. The matrix R n Q R n x R n Q(x) =
More informationIntroduction to multivariate analysis Outline
Introduction to multivariate analysis Outline Why do a multivariate analysis Ordination, classification, model fitting Principal component analysis Discriminant analysis, quickly Species presence/absence
More informationGI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil
GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection
More informationLecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26
Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1
More informationLinear Algebra & Geometry why is linear algebra useful in computer vision?
Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia
More information