Principal Component Analysis (PCA) Theory, Practice, and Examples

Joan Whitehead
6 years ago
1 Principal Component Analysis (PCA) Theory, Practice, and Examples

2 Data Reduction
- Summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables.
- The n x p data matrix A is reduced to an n x k matrix X.

3 Data Reduction
- Residual variation is information in A that is not retained in X.
- Data reduction is a balancing act between clarity of representation and ease of understanding on one side, and oversimplification (loss of important or relevant information) on the other.

4 Principal Component Analysis (PCA)
- Probably the most widely used and well-known of the standard multivariate methods.
- Invented by Pearson (1901) and Hotelling (1933).
- First applied in ecology by Goodall (1954) under the name "factor analysis" ("principal factor analysis" is a synonym of PCA).

5 Principal Component Analysis (PCA)
- Takes a data matrix of n objects by p variables, which may be correlated, and summarizes it by uncorrelated axes (principal components or principal axes) that are linear combinations of the original p variables.
- The first k components display as much as possible of the variation among objects.

6 Geometric Rationale of PCA
- Objects are represented as a cloud of n points in a multidimensional space with an axis for each of the p variables.
- The centroid of the points is defined by the mean of each variable.
- The variance of each variable is the average squared deviation of its n values around the mean of that variable:
$$V_i = \frac{1}{n-1} \sum_{m=1}^{n} \left( X_{im} - \bar{X}_i \right)^2$$

7 Geometric Rationale of PCA
- The degree to which the variables are linearly correlated is represented by their covariances:
$$C_{ij} = \frac{1}{n-1} \sum_{m=1}^{n} \left( X_{im} - \bar{X}_i \right) \left( X_{jm} - \bar{X}_j \right)$$
- Here $C_{ij}$ is the covariance of variables i and j, the sum runs over all n objects, $X_{im}$ and $X_{jm}$ are the values of variables i and j in object m, and $\bar{X}_i$ and $\bar{X}_j$ are the means of variables i and j.
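
The variance and covariance definitions on slides 6 and 7 can be checked numerically. The following is a minimal sketch in Python with NumPy; the five data points are made up for illustration and do not come from the slides.

```python
import numpy as np

# Hypothetical toy data: n = 5 objects (rows), p = 2 variables (columns).
X = np.array([[2.0, 1.0],
              [4.0, 3.0],
              [6.0, 4.0],
              [8.0, 7.0],
              [10.0, 10.0]])
n = X.shape[0]

means = X.mean(axis=0)      # centroid: the mean of each variable
dev = X - means             # deviations of each value from its variable mean

# Variance of each variable: squared deviations summed, divided by n - 1.
V = (dev ** 2).sum(axis=0) / (n - 1)

# Covariance of variables 1 and 2: products of deviations, divided by n - 1.
C12 = (dev[:, 0] * dev[:, 1]).sum() / (n - 1)

print("V =", V, "C12 =", C12)
```

`np.cov(X, rowvar=False)` returns the same quantities arranged as a 2 x 2 matrix, with the variances on the diagonal.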

8 Geometric Rationale of PCA
- The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) that have the following properties:
- They are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance.
- The covariance among each pair of the principal axes is zero (the principal axes are uncorrelated).

9 2D Example of PCA
- Variables X1 and X2 have positive covariance, and each has a similar variance.
[Figure: scatter plot of Variable X2 against Variable X1, annotated with the variances V1 and V2 and the covariance C1,2]

10 Configuration is Centered
- Each variable is adjusted to a mean of zero (by subtracting the mean from each value).
[Figure: the centered scatter plot of Variable X2 against Variable X1]

11 Principal Components are Computed
- PC 1 has the highest possible variance (9.88).
- PC 2 has a variance of 3.03.
- PC 1 and PC 2 have zero covariance.
[Figure: the points plotted on the PC 1 and PC 2 axes]

12 The Dissimilarity Measure Used in PCA is Euclidean Distance
- PCA uses Euclidean distance, calculated from the p variables, as the measure of dissimilarity among the n objects.
- PCA derives the best possible k-dimensional (k < p) representation of the Euclidean distances among objects.

13 Generalization to p-dimensions
- In practice nobody uses PCA with only 2 variables.
- The algebra for finding principal axes readily generalizes to p variables.
- PC 1 is the direction of maximum variance in the p-dimensional cloud of points.
- PC 2 is in the direction of the next highest variance, subject to the constraint that it has zero covariance with PC 1.

14 Generalization to p-dimensions
- PC 3 is in the direction of the next highest variance, subject to the constraint that it has zero covariance with both PC 1 and PC 2.
- And so on, up to PC p.

15 Each principal axis is a linear combination of the original two variables. Extended to p dimensions:
$$PC_i = a_{i1} X_1 + a_{i2} X_2 + \dots + a_{ip} X_p$$
- The $a_{ij}$ are the coefficients for PC i; each is multiplied by the measured value of variable j.
[Figure: PC 1 and PC 2 axes drawn over the scatter plot of Variable X2 against Variable X1]

16 PC axes are a rigid rotation of the original variables
- PC 1 is simultaneously the direction of maximum variance and a least-squares "line of best fit" (squared distances of points away from PC 1 are minimized).
[Figure: PC 1 and PC 2 axes drawn over the scatter plot of Variable X2 against Variable X1]

17 Generalization to p-dimensions
- If we take the first k principal components, they define the k-dimensional "hyperplane of best fit" to the point cloud.
- Of the total variance of all p variables, PCs 1 to k represent the maximum possible proportion that can be displayed in k dimensions.
- That is, the squared Euclidean distances among points calculated from their coordinates on PCs 1 to k are the best possible representation of their squared Euclidean distances in the full p dimensions.
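
The distance property can be illustrated numerically. The sketch below assumes randomly generated data (not from the slides) and uses NumPy's `eigh` to obtain the principal axes; the helper `pairwise_dist` is a hypothetical name introduced here. It shows that the full set of PCs reproduces the original Euclidean distances exactly, while the first two PCs give distances that never exceed them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: n = 20 objects, p = 4 variables (an assumption).
X = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                      # center each variable

S = np.cov(Xc, rowvar=False)                 # variance-covariance matrix
eigvals, U = np.linalg.eigh(S)               # eigh returns ascending eigenvalues
U = U[:, np.argsort(eigvals)[::-1]]          # reorder axes by decreasing variance
Z = Xc @ U                                   # coordinates on the principal axes

def pairwise_dist(M):
    """Euclidean distance between every pair of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D_full = pairwise_dist(Xc)        # distances in the original p dimensions
D_all_pcs = pairwise_dist(Z)      # distances using all p principal axes
D_2pcs = pairwise_dist(Z[:, :2])  # distances using only PCs 1 and 2

print(np.allclose(D_full, D_all_pcs))
print(np.abs(D_full - D_2pcs).max())
```

A rigid rotation leaves all distances unchanged, so any discrepancy in the reduced representation comes only from the discarded axes.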

18 Covariance vs Correlation
- Using covariances among variables only makes sense if they are measured in the same units.
- Even then, variables with high variances will dominate the principal components.
- These problems are generally avoided by standardizing each variable to unit variance and zero mean:
$$X'_{im} = \frac{X_{im} - \bar{X}_i}{SD_i}$$
- Here $\bar{X}_i$ is the mean of variable i and $SD_i$ is the standard deviation of variable i.

19 Covariance vs Correlation
- Covariances between the standardized variables are correlations.
- After standardization, each variable has a variance of 1.
- Correlations can also be calculated from the variances and covariances:
$$r_{ij} = \frac{C_{ij}}{\sqrt{V_i V_j}}$$
- Here $r_{ij}$ is the correlation between variables i and j, $C_{ij}$ is their covariance, and $V_i$ and $V_j$ are the variances of variables i and j.
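
Both routes to the correlation matrix can be compared directly. This is a minimal sketch on synthetic data with deliberately different scales (the data and scales are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: three variables on very different scales.
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
X[:, 1] += 5.0 * X[:, 0]                 # make variables 1 and 2 correlated

# Route 1: standardize to zero mean and unit variance, then take covariances.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_standardized = np.cov(Xs, rowvar=False)

# Route 2: r_ij = C_ij / sqrt(V_i * V_j) from the variance-covariance matrix.
S = np.cov(X, rowvar=False)
V = np.diag(S)
R_from_cov = S / np.sqrt(np.outer(V, V))

print(np.round(cov_of_standardized, 3))
```

The `ddof=1` argument makes the standard deviation use the same n - 1 denominator as the variance formula.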

20 The Algebra of PCA
- The first step is to calculate the cross-products matrix of variances and covariances (or correlations) among every pair of the p variables.
- This is a square, symmetric matrix.
- The diagonals are the variances; the off-diagonals are the covariances.
[Tables: variance-covariance matrix and correlation matrix of X1 and X2 for the 2D example]

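21 The Algebra of PCA
- In matrix notation, this is computed as
$$S = X^{\top} X$$
where X is the n x p data matrix, with each variable centered (and also standardized by its SD if using correlations). Dividing by n - 1 gives the variances and covariances as defined for $V_i$ and $C_{ij}$.
[Tables: variance-covariance matrix and correlation matrix of X1 and X2 for the 2D example]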
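
The matrix form can be sketched in a few lines of NumPy. The data here are random and purely illustrative; the division by n - 1 is included so that the result matches the variance and covariance definitions of slides 6 and 7.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 4))             # synthetic raw data, n = 30, p = 4
n = A.shape[0]

# Center each variable, then form the cross-products matrix.
Xc = A - A.mean(axis=0)
S = Xc.T @ Xc / (n - 1)                  # variance-covariance matrix (p x p)

# Standardizing by the SD as well gives the correlation matrix instead.
Xs = Xc / A.std(axis=0, ddof=1)
R = Xs.T @ Xs / (n - 1)

print(S.shape, R.shape)
```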

22 Manipulating Matrices
- Transposing: changes the columns to rows and the rows to columns.
[Example: a matrix X and its transpose X']
- Multiplying matrices: the premultiplicand matrix must have the same number of columns as the postmultiplicand matrix has rows.
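
A small NumPy illustration of both operations follows; the matrix values are arbitrary and the flag name `conformable` is introduced here only for the demonstration.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2 rows x 3 columns

Xt = X.T                          # transpose: 3 rows x 2 columns

# (3 x 2) times (2 x 3) is allowed: the inner dimensions agree.
P = Xt @ X                        # result is 3 x 3

# (2 x 3) times (2 x 3) is not allowed: 3 columns versus 2 rows.
try:
    X @ X
    conformable = True
except ValueError:
    conformable = False

print(Xt.shape, P.shape, conformable)
```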

23 The Algebra of PCA
- The sum of the diagonals of the variance-covariance matrix is called the trace.
- It represents the total variance in the data.
- It is the mean squared Euclidean distance between each object and the centroid in p-dimensional space.
[Tables: variance-covariance matrix and correlation matrix of X1 and X2, each with its trace]
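
The link between the trace and distances to the centroid can be checked as follows. This sketch uses synthetic data and assumes the n - 1 denominator used for the variances, so the "mean" squared distance is the sum divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 3)) * np.array([1.0, 2.0, 3.0])   # synthetic data
n = A.shape[0]

S = np.cov(A, rowvar=False)
trace = np.trace(S)                      # sum of the p variances

# Squared Euclidean distance of each object from the centroid.
Xc = A - A.mean(axis=0)
sq_dist = (Xc ** 2).sum(axis=1)
mean_sq_dist = sq_dist.sum() / (n - 1)   # same n - 1 denominator as the variances

print(trace, mean_sq_dist)
```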

24 The Algebra of PCA
- Finding the principal axes involves eigenanalysis of the cross-products matrix (S).
- The eigenvalues (latent roots) of S are the solutions ($\lambda$) to the characteristic equation:
$$\left| S - \lambda I \right| = 0$$
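
For two variables the characteristic equation can be solved by hand. The derivation below is a worked sketch using the symbols of slides 6 and 7, with $V_1$ and $V_2$ the variances and $C_{12}$ the covariance.

```latex
\left| S - \lambda I \right|
  = \begin{vmatrix} V_1 - \lambda & C_{12} \\ C_{12} & V_2 - \lambda \end{vmatrix}
  = (V_1 - \lambda)(V_2 - \lambda) - C_{12}^2 = 0
\;\Longrightarrow\;
\lambda^2 - (V_1 + V_2)\,\lambda + \left( V_1 V_2 - C_{12}^2 \right) = 0
\;\Longrightarrow\;
\lambda_{1,2} = \frac{(V_1 + V_2) \pm \sqrt{(V_1 - V_2)^2 + 4\,C_{12}^2}}{2}
```

Adding the two roots gives $\lambda_1 + \lambda_2 = V_1 + V_2$, the trace of S.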

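25 The Algebra of PCA
- The eigenvalues, $\lambda_1, \lambda_2, \dots, \lambda_p$, are the variances of the coordinates on each principal component axis.
- The sum of all p eigenvalues equals the trace of S (the sum of the variances of the original variables).
[Table: variance-covariance matrix of X1 and X2 with its trace and the two eigenvalues. Note: $\lambda_1 + \lambda_2 = \text{Trace}$]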
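
A quick numerical check of this identity, as a sketch on synthetic data, using `numpy.linalg.eigvalsh` (intended for symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(25, 5))                 # synthetic data, p = 5
S = np.cov(A, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse to get PC order.
eigvals = np.linalg.eigvalsh(S)[::-1]

print(eigvals.sum(), np.trace(S))
```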

26 The Algebra of PCA
- Each eigenvector consists of p values, which represent the contribution of each variable to the principal component axis.
- Eigenvectors are uncorrelated (orthogonal): their cross-products are zero.
[Table: eigenvectors $u_1$ and $u_2$ for X1 and X2, with $u_1^{\top} u_2 = 0$]
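
Orthogonality of the eigenvectors can be verified with a short sketch; the data are synthetic, and `eigh` is used because S is symmetric.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 3))   # synthetic correlated data
S = np.cov(A, rowvar=False)
p = S.shape[0]

eigvals, U = np.linalg.eigh(S)           # columns of U are the eigenvectors
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, U = eigvals[order], U[:, order]

# Cross-products between different eigenvectors are zero; each has unit length.
cross_products = U.T @ U

print(np.round(cross_products, 10))
```

Each column also satisfies the defining relation S u = λ u.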

27 The Algebra of PCA
- Assume there are n data objects, each with p attributes, collected in the data matrix X.
- The coordinates of each object i on the k-th principal axis, known as the scores on PC k, are computed as
$$z_{ki} = u_{1k} x_{1i} + u_{2k} x_{2i} + \dots + u_{pk} x_{pi}$$
- In matrix form, $Z = X U$, where Z is the n x k matrix of PC scores, X is the n x p centered data matrix and U is the p x k matrix of eigenvectors.
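
The score computation can be sketched both in matrix form and element by element. The data are synthetic, and k = 2 is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(10, 3))             # synthetic data: n = 10, p = 3
Xc = A - A.mean(axis=0)                  # centered data matrix X
p = Xc.shape[1]

eigvals, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
U = U[:, np.argsort(eigvals)[::-1]]      # eigenvectors ordered PC 1, PC 2, ...

k = 2
U_k = U[:, :k]                           # p x k matrix of eigenvectors
Z = Xc @ U_k                             # n x k matrix of PC scores

# The same score for object i on PC kk, written out as the weighted sum.
i, kk = 0, 0
z_manual = sum(U_k[j, kk] * Xc[i, j] for j in range(p))

print(Z[i, kk], z_manual)
```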

28 The Algebra of PCA
- The variance of the scores on each PC axis is equal to the corresponding eigenvalue for that axis.
- The eigenvalue represents the variance displayed ("explained" or "extracted") by the k-th axis.
- The sum of the first k eigenvalues is the variance explained by the k-dimensional ordination.
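
As a worked illustration, the proportions of variance can be computed from the two PC variances quoted on slide 11 (9.88 and 3.03). This sketch takes those rounded values as the eigenvalues.

```python
import numpy as np

# PC variances (eigenvalues) of the 2D example, as quoted on slide 11.
eigvals = np.array([9.88, 3.03])

total_variance = eigvals.sum()               # equals the trace of S
explained = eigvals / total_variance         # proportion displayed by each axis
cumulative = np.cumsum(explained)            # proportion displayed by PCs 1..k

for axis, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC {axis}: {100 * e:.1f}% (cumulative {100 * c:.1f}%)")
```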

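29 $\lambda_1 = 9.88$, $\lambda_2 = 3.03$, Trace = 12.91
- PC 1 displays ("explains") 9.88 / 12.91 = 76.5% of the total variance.
[Figure: the points plotted on the PC 1 and PC 2 axes]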

30 The Algebra of PCA
- The cross-products matrix computed among the p principal axes has a simple form:
- All off-diagonal values are zero (the principal axes are uncorrelated).
- The diagonal values are the eigenvalues.
[Table: variance-covariance matrix of the PC axes, PC 1 and PC 2]
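
This diagonal form can be confirmed numerically. A minimal sketch on synthetic data, computing the variance-covariance matrix of the scores on all p axes:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))   # synthetic correlated data
Xc = A - A.mean(axis=0)

eigvals, U = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

Z = Xc @ U                               # scores on all p principal axes
S_pc = np.cov(Z, rowvar=False)           # variance-covariance matrix of the PC axes

print(np.round(S_pc, 6))
print(np.round(eigvals, 6))
```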

31 A more challenging example
- Data from research on habitat definition in the endangered Baw Baw frog (Philoria frosti).
- 16 environmental and structural variables measured at each of 124 sites.
- Correlation matrix used because the variables have different units.

32 Eigenvalues
[Table with columns: Axis, Eigenvalue, % of Variance, Cumulative % of Variance]

33 How many axes are needed?
- Does the (k+1)-th principal axis represent more variance than would be expected by chance?
- Several tests and rules have been proposed.
- A common rule of thumb when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting.
- In our example, 4 eigenvectors fit this criterion (we shall keep 3 for simplicity).
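
The eigenvalue > 1 rule of thumb is easy to apply in code. The sketch below uses synthetic data with 124 rows to echo the example's sample size; it is not the Baw Baw frog data set, and the two-factor structure is an assumption made only to give the rule something to find.

```python
import numpy as np

rng = np.random.default_rng(8)
# Synthetic stand-in data: 6 variables driven by 2 latent factors plus noise.
latent = rng.normal(size=(124, 2))
A = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(124, 6))

R = np.corrcoef(A, rowvar=False)             # PCA on correlations
eigvals = np.linalg.eigvalsh(R)[::-1]        # descending eigenvalues

# Rule of thumb: keep axes whose eigenvalue exceeds 1, the variance
# of a single standardized variable.
k = int((eigvals > 1.0).sum())

print(np.round(eigvals, 3), "axes kept:", k)
```

With a correlation matrix the eigenvalues sum to p, so their average is 1; the rule keeps the axes that display more variance than one original variable.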

34 Baw Baw Frog - PCA of 16 Habitat Variables
[Figure: scree plot of Eigenvalue against PC Axis Number]

35 Interpreting Eigenvectors
- Correlations between variables and the principal axes are known as loadings.
- Each element of the eigenvectors represents the contribution of a given variable to a component.
- The loadings of variables on the first three PCs are shown here.
[Table: loadings on PC 1, PC 2 and PC 3 for the 16 variables Altitude, pH, Cond, TempSurf, Relief, maxerht, averht, %ER, %VEG, %LIT, %LOG, %W, H1Moss, DistSWH, DistSW, DistMF]

36 Significance of Variables
- We can compute the significance of each variable as the sum of its squared loadings on the most significant eigenvectors we selected (3 in our example).
- The next slide shows the table from the previous slide, expanded with these sums of squared loadings.
- We can then sort the table by the sum of squared loadings and make a scree plot.
- The most significant variables are those above some chosen cutoff, for example 0.4 (marked in yellow in the table).
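
The procedure can be sketched as follows. The data are synthetic (not the frog data), and the sketch assumes that loadings are taken as the correlations between variables and PC scores, which for a correlation-based PCA equal the eigenvector elements scaled by the square root of the eigenvalue. If the raw eigenvector elements are used as loadings instead, the numbers and a suitable cutoff will differ.

```python
import numpy as np

rng = np.random.default_rng(9)
# Synthetic stand-in data: 124 sites, 8 variables with some shared structure.
A = rng.normal(size=(124, 3)) @ rng.normal(size=(3, 8)) + rng.normal(size=(124, 8))
Xs = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)    # standardized variables

R = np.corrcoef(A, rowvar=False)
eigvals, U = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Loadings as variable-to-axis correlations (assumption stated above).
loadings = U * np.sqrt(eigvals)

k = 3                                                 # number of retained axes
significance = (loadings[:, :k] ** 2).sum(axis=1)     # sum of squared loadings

cutoff = 0.4                                          # example cutoff from the slide
ranking = np.argsort(significance)[::-1]              # variables, most significant first
selected = np.flatnonzero(significance > cutoff)

print(np.round(significance[ranking], 3))
print("variables above cutoff:", selected)
```

Plotting `significance[ranking]` against rank gives the scree plot of variables described on the slide.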

37 Significance of Variables
[Table: loadings on PC 1, PC 2 and PC 3 and the sum of squared loadings for the 16 variables Altitude, pH, Cond, TempSurf, Relief, maxerht, averht, %ER, %VEG, %LIT, %LOG, %W, H1Moss, DistSWH, DistSW, DistMF]

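38 Scree plot: Significance of Variables
[Figure: scree plot of the variables sorted by sum of squared loadings, with the chosen significance threshold and the variables considered significant marked; alternative thresholds are labeled "more aggressive reduction of variables" and "only eliminate very weak variables"]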
