
1 Principal Components Analysis
Edps/Soc 584 and Psych 594, Applied Multivariate Statistics.
Carolyn J. Anderson, Department of Educational Psychology, University of Illinois at Urbana-Champaign.
(c) Board of Trustees, University of Illinois.

2 Outline
History and overview
Geometry
Algebra
Principal components obtained from standardized variables
Graphing principal components
Distinctions between PCA and factor analysis
Reading: Johnson & Wichern; good supplemental references: Jolliffe (1986), Krzanowski (1988), Flury (1988).

3 History
First introduced by Karl Pearson (1901) in Philosophical Magazine as a procedure for finding lines and planes which best fit a set of points in p-dimensional space. The focus was on geometric optimization.
Harold Hotelling (1933) published a paper on PCA in the Journal of Educational Psychology, which dealt with an algebraic optimization. He re-invented the method, but from a different perspective. His motivation was to find a smaller fundamental set of independent variables that determines the values of the original set of p variables. This is a factor-analytic type of idea, but PCA is not factor analysis (except in a very special and unrealistic case).
Hotelling chose components (linear combinations of the p variables) so as to maximize their successive contribution to the total variance.

4 History continued
Not much was done with respect to applications until the early 1960s, the advent of the computer age. There was then an explosion of applications and developments of the technique. Theory for sampling distributions (which leads to statistical inference) was developed.
There are many extensions of PCA (e.g., PCA for sets of matrices). SAS/IML macros (by me) and MATLAB code (by Mark de Rooij) are available; the algorithm is based on work by Kiers (1990).

5 Basic Idea
Reduce the dimensionality of a data set in which there is a large number of inter-related variables, while retaining as much as possible of the variation in the original set of variables. The reduction is achieved by transforming the original variables to a new set of variables, the principal components, which are uncorrelated and ordered so that the first few retain most of the variation present in the data.
Goals & Objectives:
Reduction and summary (data reduction).
Study the structure of Sigma (or S or R).
Interpretation.

6 Applications
Interpretation (study structure).
Create a new set of variables (a smaller number that are uncorrelated); these can be used in other procedures (e.g., multiple regression).
Select a subset of the original variables to be used in other multivariate procedures.
Detect outliers or clusters of observations.
Check the multivariate normality assumption (before analyzing the data with procedures that assume multivariate normality).

7 Population Principal Components
All your observations (measurements) are made on the members of the population. For example, the European countries in one study could be considered the population, and you have data for each of them (the variables are percentages of people employed in different industries). The psychological test data consist of measurements on 64 subjects; these subjects are a sample from some population, and if we repeated the study we would most likely have different individuals.
In population principal components, we can compute Sigma, and the principal components (PCs) are derived from Sigma.

8 Two Approaches
Algebraically: PCs are linear combinations of the p original variables X_1, X_2, ..., X_p such that
the first PC has as large a variance as possible,
the second PC has as large a variance as possible and is orthogonal to the first, etc.
Geometrically (at least three views):
Rotation to a new coordinate system.
Best-fit hyper-plane.
See the appendix of the text for the n-space interpretation.

9 Geometry of PCA: p-space
PCs represent a selection of a new coordinate system obtained by rotating the original axes to a set of new axes (to provide a simpler structure).
The first principal component represents the direction of maximum variability. The second principal component represents the direction of maximum variability that is orthogonal to the first. And so on, until the last PC, which represents the direction of minimum variability and is orthogonal to all of the others.
Best fit is defined as minimizing the sum of squared distances between the points that represent cases and the space defined by the principal components:
The first principal component defines a line; the sum of squared distances (i.e., sum_j d_j^2) between the points and this line is minimized.
The first two principal components define a plane; the sum of squared distances between the points and this plane is minimized. Etc.

10 Axis Rotation & Best-Fit Line
[Figure: the original axes x_1 (height) and x_2 (weight) are rotated through an angle theta to the principal axes y_1 (size) and y_2 (shape); the first axis is the line for which the sum of squared distances from the points is minimized, and var(y_2) = 0.236.]

11 Further Notes regarding PCs
They are variance preserving. For example, var(x_1) + var(x_2) = 2.015 = var(y_1) + var(y_2).
If you rotate PCs, you no longer have PCs.
PCs depend only on Sigma (or R if you are using standardized variables).
PCs do not require any assumptions about the distribution of the variables (e.g., multivariate normality). If the variables do come from a multivariate normal population, then PCs can be interpreted in terms of constant density ellipsoids, and you can make inferences about the population from a sample. However, right now we are considering population PCs, so we don't have a sample and hence no inference is required.

12 The Algebra of Population PCA
We want to transform the p variables X' (1 x p) = (X_1, X_2, ..., X_p) to q orthogonal linear combinations Y' (1 x q) = (Y_1, Y_2, ..., Y_q), where generally q << p.
There are p possible components:
Y_1 = a_1'X = a_11 X_1 + a_12 X_2 + ... + a_1p X_p
Y_2 = a_2'X = a_21 X_1 + a_22 X_2 + ... + a_2p X_p
...
Y_p = a_p'X = a_p1 X_1 + a_p2 X_2 + ... + a_pp X_p,
or, in matrix form, Y = A X.
Given the covariance matrix Sigma_X of the X's, we know
var(Y_i) = a_i' Sigma_X a_i and cov(Y_i, Y_k) = a_i' Sigma_X a_k.
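To make the algebra concrete, here is a minimal numpy sketch (the covariance matrix and weight vectors below are hypothetical, chosen only for illustration) showing that var(a'X) = a' Sigma a and cov(a'X, b'X) = a' Sigma b.

```python
import numpy as np

# Hypothetical 3x3 covariance matrix (not from the slides)
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

a = np.array([0.6, 0.5, 0.6])   # weights for one linear combination Y1 = a'X
b = np.array([0.7, -0.7, 0.1])  # weights for a second combination Y2 = b'X

var_Y1 = a @ Sigma @ a          # var(a'X) = a' Sigma a
cov_Y1_Y2 = a @ Sigma @ b       # cov(a'X, b'X) = a' Sigma b
print(var_Y1, cov_Y1_Y2)
```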

13 More Formal Definition of PCs
PCs are the uncorrelated linear combinations, cov(Y_i, Y_k) = 0 for all i != k, with variances as large as possible. In particular,
var(Y_1) is the maximum: find a_1 such that a_1' Sigma_X a_1 = max over a of a' Sigma_X a;
var(Y_2) is the maximum subject to being uncorrelated with Y_1: find a_2 such that a_2' Sigma_X a_2 = max over a of a' Sigma_X a with a_1' Sigma_X a_2 = 0.
At each step, select a_i such that a_i'X has maximum variance subject to being uncorrelated with all of the previous linear combinations.
Usually (but not always), we only use Y_1, Y_2, ..., Y_q where q is much less than p (the primary goal is data reduction). There are p possible components, and all of Y_1, Y_2, ..., Y_p are needed to completely reproduce (represent) Sigma_X. So if q < p, we don't reproduce Sigma_X exactly (unless the rank of Sigma_X equals q).

14 Maximizing the Criterion
The criterion to be maximized is max over a of a' Sigma_X a.
We can always multiply Y_1 = a'X by a constant c > 1, which will increase the variance: var(cY_1) = var(c a'X) = c^2 var(a'X). Therefore we normalize the coefficient vector so that a'a = 1.
Our problem is to find the a_1 that maximizes the variance subject to this constraint:
max over a of (a' Sigma_X a / a'a) = var(Y_1).
Using the results on maximization from the More Linear Algebra notes,
max over a of (a' Sigma_X a / a'a) = lambda_1,
which is attained when a = e_1, where lambda_1 and e_1 are the first eigenvalue and eigenvector of Sigma_X.
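A quick numerical check of this result (using the same hypothetical Sigma_X as above): among unit-length coefficient vectors, no a' Sigma_X a value exceeds lambda_1.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma_X = np.array([[4.0, 1.5, 0.5],
                    [1.5, 3.0, 1.0],
                    [0.5, 1.0, 2.0]])

lam1 = np.linalg.eigvalsh(Sigma_X)[-1]          # largest eigenvalue of Sigma_X
A = rng.normal(size=(100000, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # random unit vectors a with a'a = 1
quad = np.einsum('ij,jk,ik->i', A, Sigma_X, A)  # a' Sigma_X a for each random a
print(quad.max() <= lam1 + 1e-12, lam1)         # the maximum over unit vectors is lambda_1
```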

15 Proof that this is the Maximum
Showing is better than just believing...
a' Sigma_X a / a'a = var(Y_1)
a' Sigma_X a = var(Y_1) a'a
a' Sigma_X a - var(Y_1) a'a = 0
a' (Sigma_X a - var(Y_1) a) = 0
Sigma_X a - var(Y_1) a = 0   (since a != 0)
Sigma_X a = var(Y_1) a,
with Sigma_X (p x p), a (p x 1), and var(Y_1) a scalar. This is just the equation that eigenvalues and eigenvectors solve. So Y_1 = e_1'X, where e_1 is the 1st eigenvector of Sigma_X, and var(Y_1) = lambda_1.

16 Population PC: Result 1
Let Sigma be the covariance matrix associated with the vector X' = (X_1, X_2, ..., X_p), and let Sigma have the eigenvalue-eigenvector pairs (lambda_1, e_1), (lambda_2, e_2), ..., (lambda_p, e_p) where lambda_1 >= lambda_2 >= ... >= lambda_p >= 0. Then the i-th PC is given by
Y_i = e_i'X = e_i1 X_1 + e_i2 X_2 + ... + e_ip X_p, for i = 1, 2, ..., p.
Given this,
var(Y_i) = e_i' Sigma e_i = e_i'(lambda_i e_i) = lambda_i e_i'e_i = lambda_i,
and for i != k,
cov(Y_i, Y_k) = e_i' Sigma e_k = e_i'(lambda_k e_k) = lambda_k e_i'e_k = 0.
If some of the lambda_i are equal, then the choice of the corresponding coefficient vectors e_i (and thus the Y_i) is not unique.

17 Population PC continued
We can write all of this in terms of matrices. Let P = (e_1, e_2, ..., e_p). Then
Y = P'X and cov(Y) = Sigma_Y = P' Sigma_X P.
So
Sigma_X = cov(X) = P Lambda P' and Sigma_Y = cov(Y) = Lambda = P' Sigma_X P = diag(lambda_i).
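A minimal sketch (hypothetical covariance matrix again, not from the slides) verifying these matrix identities with numpy: the eigenvector matrix P diagonalizes Sigma_X, and the total variance is preserved.

```python
import numpy as np

# Hypothetical covariance matrix (for illustration only)
Sigma_X = np.array([[4.0, 1.5, 0.5],
                    [1.5, 3.0, 1.0],
                    [0.5, 1.0, 2.0]])

# np.linalg.eigh returns eigenvalues in ascending order; reverse so lambda_1 >= ... >= lambda_p
eigvals, eigvecs = np.linalg.eigh(Sigma_X)
Lam = np.diag(eigvals[::-1])          # Lambda = diag(lambda_i)
P = eigvecs[:, ::-1]                  # columns are e_1, ..., e_p

# Check Sigma_Y = P' Sigma_X P = Lambda  and  Sigma_X = P Lambda P'
print(np.allclose(P.T @ Sigma_X @ P, Lam))   # True
print(np.allclose(P @ Lam @ P.T, Sigma_X))   # True

# The trace is preserved: total variance = sum of the eigenvalues
print(np.isclose(np.trace(Sigma_X), eigvals.sum()))  # True
```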

18 More Population PC Results
Let X' = (X_1, X_2, ..., X_p) have covariance matrix Sigma_X with eigenvalue-eigenvector pairs (lambda_1, e_1), (lambda_2, e_2), ..., (lambda_p, e_p) where lambda_1 >= lambda_2 >= ... >= lambda_p >= 0, and let Y_1 = e_1'X, Y_2 = e_2'X, ..., Y_p = e_p'X be the PCs. Then
sigma_11 + sigma_22 + ... + sigma_pp = sum_{i=1}^p sigma_ii = lambda_1 + lambda_2 + ... + lambda_p = sum_{i=1}^p lambda_i.
The total population variance is preserved by the transformation.
The proportion of total variance due to the k-th PC is
lambda_k / (lambda_1 + lambda_2 + ... + lambda_p) = lambda_k / sum_{i=1}^p lambda_i, for k = 1, ..., p.

19 Proportion of Variance Accounted For
We often select q PCs such that the proportions for k = 1, ..., q sum to something close to 1 (yet without too large a value of q). The proportion of variance accounted for by the first q PCs equals
sum_{k=1}^q lambda_k / trace(Sigma_X).
We try to balance the percent of variance (information) retained against the number of PCs (simplicity). We may want to replace X by Y.
Often we are interested in interpreting the new variables (i.e., the PCs), so we examine the elements of the e_i's. The size (magnitude) of the elements of e_i is an indicator of a variable's importance to the i-th PC.
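A short sketch (hypothetical eigenvalues, for illustration) of choosing q from the cumulative proportion of variance:

```python
import numpy as np

eigvals = np.array([5.1, 2.6, 1.3])           # hypothetical eigenvalues, sorted descending
prop = eigvals / eigvals.sum()                # proportion of total variance per PC
cum_prop = np.cumsum(prop)                    # cumulative proportion
q = int(np.searchsorted(cum_prop, 0.90) + 1)  # smallest q with cumulative proportion >= 0.90
print(prop, cum_prop, q)
```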

20 Correlation between Y_i and X_k
If Y_1 = e_1'X, Y_2 = e_2'X, ..., Y_p = e_p'X are the PCs obtained from Sigma_X, we can use rho_{Y_i, X_k} to help interpret the contribution of an X_k to Y_i:
rho_{Y_i, X_k} = cov(Y_i, X_k) / (sqrt(lambda_i) sqrt(sigma_kk))
= cov(e_i'X, l_k'X) / (sqrt(lambda_i) sqrt(sigma_kk)), where l_k' (1 x p) = (0, ..., 0, 1, 0, ..., 0) with the 1 in the k-th position,
= l_k' Sigma e_i / (sqrt(lambda_i) sqrt(sigma_kk))
= l_k' (lambda_i e_i) / (sqrt(lambda_i) sqrt(sigma_kk))
= e_ik sqrt(lambda_i) / sqrt(sigma_kk).
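A sketch (same hypothetical Sigma_X as in the earlier blocks) that computes the correlations e_ik * sqrt(lambda_i) / sqrt(sigma_kk) between each PC and each original variable:

```python
import numpy as np

Sigma_X = np.array([[4.0, 1.5, 0.5],
                    [1.5, 3.0, 1.0],
                    [0.5, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma_X)
lam = eigvals[::-1]                 # lambda_1 >= ... >= lambda_p
E = eigvecs[:, ::-1]                # column i is e_i

sd_x = np.sqrt(np.diag(Sigma_X))    # sqrt(sigma_kk)
# rho[k, i] = e_ik * sqrt(lambda_i) / sqrt(sigma_kk)
rho = E * np.sqrt(lam) / sd_x[:, None]
print(np.round(rho, 3))
```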

21 Example: European Employment Data
The data are percentages of people employed in different industries in European countries during 1979 (the cold war era). Data are from Euromonitor (1979), European Marketing Data and Statistics, London: Euromonitor Publications; I got them off of the web.
N = 26 countries. There are 9 industries, but we'll start with just p = 3:
X_1 = percent in manufacturing,
X_2 = percent in the services industry,
X_3 = percent in social and personal services.
mu = ..., Sigma = ..., total variance = trace(Sigma) = ...

22 Example: Eigenvalues and Eigenvectors
[Table: for i = 1, 2, 3, the eigenvalue lambda_i = var(Y_i), the cumulative variance, and the cumulative percent.]
The eigenvectors, which give the weights for the principal components, are
e_1' = (0.580, 0.396, 0.712)
e_2' = (0.811, -0.207, -0.546)
e_3' = (-0.069, 0.894, -0.442)
So the principal components are
Y_1 = 0.580 X_1 + 0.396 X_2 + 0.712 X_3
Y_2 = 0.811 X_1 - 0.207 X_2 - 0.546 X_3
Y_3 = -0.069 X_1 + 0.894 X_2 - 0.442 X_3

23 Example: Interpretation of the Principal Components
We'll look at the correlations between Y_1 and Y_2 and each of the X_k's (rho_{Y_i, X_k} = e_ik sqrt(lambda_i) / sqrt(sigma_kk)):
Original variable            Y_1     Y_2
Manufacturing (X_1)          ...     .75
Service (X_2)                .69     ...
Social & Personal (X_3)      ...     -.52
Y_1: All variables contribute to the first component; it is an overall percent employment across these industries.
Y_2: This contrasts Manufacturing with Service and Social & Personal.

24 Plot of Component Scores
[Figure: plot of the countries' scores on the principal components.]

25 If the Population is Multivariate Normal
We have an additional interpretation if X ~ N_p(mu, Sigma). Recall that the probability density contours (ellipsoids) are
(x - mu)' Sigma^{-1} (x - mu) = c^2.
The center is at mu, and the axes are at mu +/- c sqrt(lambda_i) e_i, where lambda_i and e_i are the i-th eigenvalue and eigenvector of Sigma.
The principal components are Y_1 = e_1'X, Y_2 = e_2'X, ..., Y_p = e_p'X.
The principal components lie in the same directions as the axes of the probability contours (ellipsoids).

26 Probability Contour
[Figure: a probability contour (ellipse) in the (X_1, X_2) plane centered at mu; the marked point (x_j1, x_j2) = mu + c e_1 lies on the first principal axis Y_1 = e_1'X, and Y_2 = e_2'X is the second principal axis.]

27 Center at (0, 0)
[Figure: the same ellipse after centering, with axes X_1 - mu_1 and X_2 - mu_2. In the centered X coordinates the marked point is (x_j1, x_j2) = c e_1; in PC coordinates it is (y_1, 0) = (e_11 x_1 + e_12 x_2, 0), i.e., it lies on the first principal axis Y_1 = e_1'X, with Y_2 = e_2'X the second axis.]

28 Summary example when X ~ N_p(mu, Sigma)
Any point on the i-th axis of the ellipsoid has X coordinates mu + c e_i, i.e., coordinates proportional to e_i' = (e_i1, e_i2, ..., e_ip) in the coordinate system that has its origin at mu and axes parallel to the original X axes.
Subtracting the mean doesn't change anything except moving the origin to (0, 0).
In the coordinate system of the PCs, such a point has principal component coordinates (y_i, 0), because the PCs are obtained by a rigid rotation of the original coordinate axes through an angle theta until they coincide with the axes of the ellipsoid.
All of these results generalize to p > 2.

29 When Variances are Very Different: Principal Components from Standardized Variables
If we use standardized variables (z-scores),
Z_1 = (X_1 - mu_1) / sqrt(sigma_11), Z_2 = (X_2 - mu_2) / sqrt(sigma_22), ..., Z_p = (X_p - mu_p) / sqrt(sigma_pp),
or, in matrix notation,
Z = V^{-1/2} (X - mu) = V^{-1/2} X - V^{-1/2} mu, where V^{-1/2} = diag(1/sqrt(sigma_ii)).
So Z is a linear combination of X, which means...

30 PCs of Standardized Variables
We know that
E(Z) = E(V^{-1/2}(X - mu)) = V^{-1/2} E(X) - V^{-1/2} mu = 0
and
Sigma_Z = V^{-1/2} Sigma_X V^{-1/2} = R,
which is the (population) correlation matrix of the X's.
The i-th PC of the standardized variables Z' = (Z_1, Z_2, ..., Z_p) with Sigma_Z = R is given by
Y~_i = e~_i'Z = e~_i' V^{-1/2} (X - mu), for i = 1, 2, ..., p,
where e~_i is the i-th eigenvector and lambda~_i is the i-th eigenvalue of R. Note that
sum_{i=1}^p var(Z_i) = sum_{i=1}^p lambda~_i = sum_{i=1}^p var(Y~_i) = trace(R) = p.

31 PCs of Standardized versus Non-standardized Variables
Almost always lambda_i != lambda~_i and e_i != e~_i; that is, the PCs from Sigma_X are not the same as the PCs from R.
We'll look at a situation where standardization makes a difference. This will be the case when the scales of the X variables are (substantially or vastly) different and they are not comparable.
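A minimal sketch (simulated data, not the track data) illustrating that the eigenvalues of the covariance matrix S and of the correlation matrix R generally differ, especially when the variables' scales differ greatly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two correlated variables on very different scales
x1 = rng.normal(0, 1, size=200)                # sd about 1
x2 = 100 * (0.5 * x1 + rng.normal(0, 1, size=200))   # sd about 100: a wildly different scale
X = np.column_stack([x1, x2])

S = np.cov(X, rowvar=False)        # sample covariance matrix
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix

vals_S = np.linalg.eigvalsh(S)
vals_R = np.linalg.eigvalsh(R)

# Proportion of total variance for the largest eigenvalue under each analysis
print(vals_S[-1] / vals_S.sum())   # close to 1: x2 dominates the covariance-based PCA
print(vals_R[-1] / vals_R.sum())   # much smaller: both variables contribute after standardizing
```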

32 Men's Track Data
From Johnson & Wichern: the data are from the Track and Field Statistics Handbook for the 1984 Los Angeles Olympics. These are the national record times for men before the 1984 Olympics. The record times for eight races (i.e., p = 8) are listed for 55 countries (i.e., n = 55). The times are recorded for the following races:
1. 100m: record time for the 100m race, in seconds
2. 200m: record time for the 200m race, in seconds
3. 400m: record time for the 400m race, in seconds
4. 800m: record time for the 800m race, in minutes
5. 1500m: record time for the 1500m race, in minutes
6. 5K: record time for the 5000m race, in minutes
7. 10K: record time for the 10000m race, in minutes
8. Marathon: record time for the Marathon (approx. 26 miles), in minutes

33 Summary Statistics
[Table: the mean and standard deviation of each variable (100m, 200m, 400m, 800m, 1500m, 5K, 10K, Marathon), followed by the 8 x 8 sample covariance matrix (truncated values).]

34 Eigenvalues of Sigma
From the SAS PRINCOMP procedure:
[Table: total variance and, for each component, the eigenvalue of the covariance matrix, the difference, the proportion, and the cumulative proportion.]

35 Eigenvectors of Sigma
[Table: the first three eigenvectors (Prin1, Prin2, Prin3) of the covariance matrix, with one row per race.]
The 1st principal component is essentially the marathon, because the marathon has by far the largest variance; the next largest is 3.26 (the 10K). The variance of the 1st component is ...

36 The Correlation Matrix
[Table: the 8 x 8 correlation matrix of the race times (values truncated).]

37 The Eigenvalues of the Correlation Matrix
[Table: for each component, the eigenvalue of the correlation matrix, the difference, the proportion, and the cumulative proportion.]
Total variance = 8. The first 2 principal components account for 93.8% of the total variance.

38 The Eigenvectors of the Correlation Matrix
[Table: the first two eigenvectors (component loadings), with one row per race.]
First component: an overall measure; high values on this component indicate slower runners.
Second component: a contrast of long and short races. Small values indicate a country is faster on short races than on long ones; large values indicate it is slower on short races than on long ones; a value near zero means the country tends to be similar on short and long races (it could be slow, fast, or somewhere in between on all races).

39 Correlations (Variables, Principal Components)
The correlations between the standardized variables and the values on the principal components equal
r_{Z_k, Y~_i} = sqrt(lambda~_i) e~_ki (e.g., sqrt(6.622)(.318) = .82).
[Table: these correlations, with one row per race.]
The interpretation is pretty much the same as for the loadings.

40 Graph of Countries' Component Scores
[Figure: plot of the countries' scores on the first two principal components.]

41 Graph of Countries' Component Scores (continued)
[Figure: another plot of the countries' component scores.]

42 Sample Principal Components
Used to summarize the sample variation by PCs. The algebra is the same as in population principal components.
x_1, x_2, ..., x_n are n independent observations from a population with mean mu and covariance Sigma.
xbar (p x 1) = sample mean vector.
S (p x p) = {s_ik} = sample covariance matrix.
S has eigenvalue-eigenvector pairs (lambda^_1, e^_1), ..., (lambda^_p, e^_p) where lambda^_1 >= lambda^_2 >= ... >= lambda^_p. The hat indicates that these are estimates of population values.
The i-th sample principal component is given by
y^_i = e^_i'x = e^_i1 x_1 + e^_i2 x_2 + ... + e^_ip x_p.
The i-th PC sample variance is var(y^_i) = lambda^_i for i = 1, ..., p. The PC sample covariances are cov(y^_i, y^_k) = 0 for all i != k.

43 Algebra of Sample PCs continued
Total sample variance: trace(S) = tr(S) = sum_{i=1}^p s_ii = sum_{i=1}^p lambda^_i.
Proportion of total sample variance accounted for by the i-th PC: lambda^_i / sum_{k=1}^p lambda^_k.
Correlations between y^_i and x_k: r_{y^_i, x_k} = e^_ik sqrt(lambda^_i) / sqrt(s_kk).
Note: if you use standardized x's, then r_{y^_i, z_k} = e~^_ik sqrt(lambda~^_i).
The sample PCs based on S are not the same as those based on R (I'll use a tilde to denote those based on R). Use S when the observations are in the same units and the variances s_ii are not vastly different.
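A compact sketch (simulated data, not the course data sets) of computing sample principal components directly from a data matrix: center the data, form S, take its eigen decomposition, and compute component scores.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated n x p data matrix (n = 50 observations, p = 4 variables)
X = rng.multivariate_normal(mean=np.zeros(4),
                            cov=np.array([[3.0, 1.0, 0.5, 0.2],
                                          [1.0, 2.0, 0.4, 0.1],
                                          [0.5, 0.4, 1.5, 0.3],
                                          [0.2, 0.1, 0.3, 1.0]]),
                            size=50)

xbar = X.mean(axis=0)                 # sample mean vector
Xc = X - xbar                         # centered data
S = np.cov(X, rowvar=False)           # sample covariance matrix (divides by n - 1)

lam_hat, E_hat = np.linalg.eigh(S)                 # ascending order
lam_hat, E_hat = lam_hat[::-1], E_hat[:, ::-1]     # sort so lambda^_1 >= ... >= lambda^_p

scores = Xc @ E_hat                   # y^_ij = e^_j'(x_i - xbar): component scores

# Sample variances of the scores equal the eigenvalues, and the scores are uncorrelated
print(np.allclose(np.var(scores, axis=0, ddof=1), lam_hat))
print(np.round(np.corrcoef(scores, rowvar=False), 3))
```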

44 Geometry of Sample PCs
PCs based on a sample of n p-dimensional observations are new variables specified by a rigid rotation of the original axes to a new orientation, such that the directions of the axes in the new orientation have maximum variances in the sample. The rotation must be rigid since the new variables must be uncorrelated. The directions of the new axes are based on S (or R).
[Figure: the original axes x_1 and x_2 rotated through an angle theta to the sample principal axes y^_1 = e^_1'x and y^_2 = e^_2'x.]

45 Geometry of Sample PCs continued
The PCs are projections of the observations onto the principal axes of the ellipsoids.
We can re-center the x's, which also centers the y^'s; that is, the mean of (x_i - xbar) is 0, so each y^_i has mean 0. Subtraction of xbar only affects the mean; it does not affect the variances and covariances.
(x_1, x_2) --shift location--> (x_1 - xbar_1, x_2 - xbar_2) --rigid rotation--> (y^_1, y^_2)
[Figure: scatter plot showing the centered axes and the rotated principal axes y^_1 and y^_2.]

46 2nd Geometric Interpretation
The 1st PC y^_1 minimizes the sum of squared deviations (distances) of the points from a line (a least-squares best fit). When you approximate p-dimensional data by r << p PCs, the r PCs minimize the sum of squared distances of the points in p-space from the r-dimensional sub-space.
[Figure: the best-fit line y^_1 and the second axis y^_2 in the (x_1, x_2) plane.]

47 Swiss Bank Notes
These data are from Flury & Riedwyl (1988), Multivariate Statistics: A Practical Approach. The data consist of p = 6 measurements, in millimeters, on n = 100 genuine Swiss bank notes (old ones); a picture is shown on the next slide.
x_1: length of the bank note,
x_2: height of the bank note, measured on the left,
x_3: height of the bank note, measured on the right,
x_4: distance of the inner frame to the lower border,
x_5: distance of the inner frame to the upper border,
x_6: length of the diagonal.

48 Picture of Bank Note
[Figure: a bank note with the six measurements X_1 through X_6 marked on it.]

49 Swiss Bank Notes: Sample Statistics
Sample means: xbar' = ( ..., ..., ..., 8.305, ..., ... )
[Table: the 6 x 6 sample covariance matrix S with rows and columns Length (X_1), Left (X_2), Right (X_3), Bottom (X_4), Top (X_5), Diagonal (X_6).]

50 Eigenvalues of S
The variances of the principal components (i.e., the eigenvalues of S):
[Table: for each PC, the eigenvalue lambda^_i, the proportion of variance, and the cumulative proportion.]

51 Eigenvectors for the Genuine Bank Notes
The principal components (eigenvectors of S):
[Table: the eigenvectors defining Y_1 through Y_6, with one row per variable: Length X_1, Left X_2, Right X_3, Bottom X_4, Top X_5, Diagonal X_6.]

52 Correlations between the Measures and the PCs
The correlations between the original variables and the principal components (i.e., r_{x_k, y_i} = e^_ki sqrt(lambda^_i) / sqrt(s_kk)):
[Table: these correlations for Y_1 through Y_6, with one row per variable.]
Y_1 is a contrast between Bottom & Top. Y_2 is overall size, except for the Diagonal. Y_3 & Y_4: nothing obvious. Y_5 is something like the image. Y_6: measurement error or the slant of the cut.

53 The Latter PCs
We've focused on the first PCs, but the last ones can also be informative. Small values for the smallest eigenvalues from either S or R indicate:
Undetected linear dependencies in the data.
One (or more) of the variables is redundant with the others and could be deleted.
Such PCs can be substantively just as important as the PCs associated with the largest eigenvalues. Alternatively, they could be due to pure error variability (measurement error).
Swiss bank note example: the last PC is basically the contrast X_2 - X_3, i.e., (Left height) - (Right height); typically this difference is > 0. So this last PC could reflect the slant of the cut. If X_2 and X_3 are measuring the same quantity, the only reason that lambda^_6 > 0 is measurement error (error variability).

54 Asymptotic Distributions
If X_1, X_2, ..., X_n is a sample from N_p(mu, Sigma), then the sample principal components y^_i = e^_i'(x - xbar) are observations ("realizations") of the population principal components Y_i = e_i'(X - mu), and since y^_i is a linear combination of x, which comes from N_p(mu, Sigma),
Y^' = (Y^_1, Y^_2, ..., Y^_p) ~ N_p(0, Lambda), where Lambda = diag(lambda_i).

55 Asymptotic Distributions continued
Assume X_j ~ N_p(mu, Sigma) i.i.d. for j = 1, 2, ..., n. Sigma, which is unknown, has eigenvalues lambda_1 > lambda_2 > ... > lambda_p (an assumption) with associated eigenvectors e_1, e_2, ..., e_p. For very large n:
1. lambda^_i is independent of its corresponding e^_i.
2. sqrt(n)(lambda^ - lambda) ~ N_p(0, 2 Lambda^2), or equivalently lambda^ ~ N_p(lambda, (2/n) Lambda^2), where lambda^ are the eigenvalues of S and lambda are the eigenvalues of Sigma. So lambda^_i ~ N_1(lambda_i, (2/n) lambda_i^2) for i = 1, ..., p.
3. sqrt(n)(e^_i - e_i) ~ N_p(0, E_i), where E_i = lambda_i sum_{k != i} [lambda_k / (lambda_k - lambda_i)^2] e_k e_k'.
Note: E_i is not diagonal, and the eigenvectors are not independent.

56 Using the Distribution of the lambda^'s
Since the lambda^_i's are asymptotically (very large n) independent and normal with mean lambda_i and variance (2/n) lambda_i^2, a (1 - alpha)100% confidence interval for lambda_i is
lambda^_i / (1 + z_{alpha/2} sqrt(2/n)) <= lambda_i <= lambda^_i / (1 - z_{alpha/2} sqrt(2/n)),
where z_{alpha/2} is the upper (alpha/2)-th percentile of N(0, 1).
We can also do a Bonferroni-type procedure and use z_{alpha/(2m)}, where m = the number of intervals you plan to construct.
Swiss bank note example: the 95% confidence interval for lambda_1 is (.5395, .9534); the rest are on the next slide.
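A small sketch of this large-sample interval (the function name is mine; the eigenvalue 0.689 below is a hypothetical value chosen only so the output roughly matches the interval quoted above for n = 100):

```python
import numpy as np
from scipy.stats import norm

def eigenvalue_ci(lam_hat, n, alpha=0.05, m=1):
    """Large-sample (1 - alpha)100% CI for a population eigenvalue.

    lam_hat : sample eigenvalue of S
    n       : sample size
    m       : number of intervals (Bonferroni correction uses alpha/m)
    """
    z = norm.ppf(1 - alpha / (2 * m))   # upper alpha/(2m) quantile of N(0, 1)
    half = z * np.sqrt(2.0 / n)
    return lam_hat / (1 + half), lam_hat / (1 - half)

print(eigenvalue_ci(0.689, n=100))
```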

57 Swiss Bank Notes: CIs for the lambda's
[Table: for each PC, the eigenvalue lambda^_i, the proportion of variance, the cumulative proportion, and the lower and upper limits of the 95% confidence interval.]

58 Using the Distribution of the e^_i's
The e^_i's are approximately normal with mean e_i. The elements of each e^_i are correlated, and these correlations depend on the ratios lambda_k / (lambda_k - lambda_i)^2, that is, on how far lambda_k is from lambda_i.
It can be useful to look at the diagonal elements of (1/n) E^_i; their square roots are the standard errors of the e^_ki's. Recall
E^_i = lambda^_i sum_{k != i} [lambda^_k / (lambda^_k - lambda^_i)^2] e^_k e^_k'.
Notes:
1. The variance of lambda^ increases as lambda increases, so large lambda's can have very wide confidence intervals.
2. These sampling results apply only to S; they do not apply to R.

59 Testing H_0: lambda_i = lambda for i = (r + 1), ..., p
Bartlett (1947) developed a test of the hypothesis that the (p - r) smallest eigenvalues of Sigma are equal, for 0 < r < p - 1. If the data support this hypothesis, then there will probably be little interest in using more than r components.
Bartlett's approximate chi-square statistic has the form
M [ -ln(det(S)) + sum_{i=1}^r ln(lambda^_i) + (p - r) ln(lambda_bar) ],
where
M = n - r - (1/6)(2(p - r) + 1 + 2/(p - r)),
lambda_bar = (1/(p - r)) ( tr(S) - sum_{i=1}^r lambda^_i ),
df = (1/2)(p - r - 1)(p - r + 2).

60 Bartlett's Test continued
Lawley (1956) gave a modification of Bartlett's test. Anderson (1963) discusses a related test of the hypothesis that some k intermediate eigenvalues are equal, i.e., H_0: lambda_{q+1} = ... = lambda_{q+k} (with lambda_1, ..., lambda_q and lambda_{q+k+1}, ..., lambda_p unrestricted).
Swiss bank note example: p = 6, n = 100, r = 3, and H_0: lambda_4 = lambda_5 = lambda_6.
M = n - r - (1/6)(2(p - r) + 1 + 2/(p - r)) = 100 - 3 - (1/6)(2(6 - 3) + 1 + 2/(6 - 3)) = 95.72
lambda_bar = (1/(p - r)) ( tr(S) - sum_{i=1}^r lambda^_i ) = ...

61 Swiss Bank Note Example continued
det(S) = ..., sum_{i=1}^r ln(lambda^_i) = ...
Test statistic = M ( -ln(det(S)) + sum_{i=1}^r ln(lambda^_i) + (p - r) ln(lambda_bar) ) = ...
df = (1/2)(p - r - 1)(p - r + 2) = (1/2)(2)(5) = 5.
Comparing to a chi-square distribution with df = 5, we find p-value = .02.
How about another value for r? (See the SAS module.)
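A sketch of Bartlett's statistic as defined on the previous slides (the eigenvalues below are hypothetical, since the bank note values were not preserved here; the function name is mine):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_last_eigenvalues(eigvals, n, r):
    """Test H0: the last (p - r) eigenvalues are equal.

    eigvals : sample eigenvalues of S, sorted in descending order
    n       : sample size
    r       : number of leading eigenvalues left unrestricted
    """
    eigvals = np.asarray(eigvals, dtype=float)
    p = eigvals.size
    lam_bar = eigvals[r:].mean()                       # average of the last p - r eigenvalues
    M = n - r - (2 * (p - r) + 1 + 2.0 / (p - r)) / 6.0
    # -ln det(S) + sum_{i<=r} ln(lambda_i) = -sum_{i>r} ln(lambda_i)
    stat = M * ((p - r) * np.log(lam_bar) - np.log(eigvals[r:]).sum())
    df = 0.5 * (p - r - 1) * (p - r + 2)
    return stat, df, chi2.sf(stat, df)

# Hypothetical eigenvalues for p = 6 variables, n = 100 observations, r = 3
lam_hat = [3.0, 0.9, 0.4, 0.12, 0.08, 0.05]
print(bartlett_last_eigenvalues(lam_hat, n=100, r=3))
```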

62 Graphing Principal Components
Compute y^_i = e^_i'x and plot these. Such plots can:
Reveal suspect observations (outliers, influential observations).
Check the multivariate normality assumption.
Look for clusters.
Provide insight into the structure of the data.
Suspect Observations
The first PCs can help reveal influential observations: those that contribute more to the variances than other observations, such that if we removed them the results would change quite a bit.
The last PCs can help reveal outliers: observations that are atypical of the data set; they are inconsistent with the rest of the data (and could be mis-coded).

63 Swiss Bank Notes: Outliers?
[Figure: plots of the bank notes' component scores, used to look for outliers.]

64 Why Look at the Last PCs to Find Outliers?
Multivariate outliers may not be extreme on any of the original variables. They can still be outliers in multivariate space because they do not conform to the correlational structure of the rest of the data.
Mathematical explanation: recall that Y^ (p x 1) = P^' X (p x 1), where P^ = (e^_1, e^_2, ..., e^_p). Since P^ P^' = P^' P^ = I,
X = P^ Y^,
i.e., the X's are a linear combination of the principal components (the Y^'s). Consider an observation x_j:
x_j = P^ y^_j = y^_1j e^_1 + y^_2j e^_2 + ... + y^_pj e^_p = (y^_1j e^_1 + ... + y^_{q-1,j} e^_{q-1}) + (y^_qj e^_q + ... + y^_pj e^_p).

65 Outliers & Influential Observations
The size (magnitude) of the last PCs determines how well the first few PCs fit an observation; that is, (y^_1j e^_1 + ... + y^_{q-1,j} e^_{q-1}) differs from x_j by (y^_qj e^_q + ... + y^_pj e^_p). The suspect observations (outliers) are the ones for which at least one of the coordinates y^_qj, ..., y^_pj is large.
Spotting influential observations is also based on the fact that x_j = P^ y^_j. Again consider
x_j = (y^_1j e^_1 + ... + y^_{q-1,j} e^_{q-1}) + (y^_qj e^_q + ... + y^_pj e^_p);
here it is large values among the leading coordinates y^_1j, ..., y^_{q-1,j} that flag influential observations.
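A sketch (simulated data; the cutoff of 3 standard deviations is an arbitrary choice for illustration) of screening observations by their scores on the leading and trailing components:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated data with one planted observation that breaks the correlation structure
X = rng.multivariate_normal(np.zeros(3),
                            [[2.0, 1.2, 0.8],
                             [1.2, 1.5, 0.6],
                             [0.8, 0.6, 1.0]], size=100)
X[0] = [1.0, -2.0, 2.0]   # not extreme on any single variable

Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]
scores = Xc @ E                        # component scores, columns ordered by variance

# Standardize each score by its standard deviation sqrt(lambda_i)
z = scores / np.sqrt(lam)

q = 2                                               # keep the first q components
outlier_flag = np.abs(z[:, q:]).max(axis=1) > 3     # large scores on the last components
influence_flag = np.abs(z[:, :q]).max(axis=1) > 3   # large scores on the first components
print(np.where(outlier_flag)[0], np.where(influence_flag)[0])
```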

66 Potential Influential Observations in the Men's Track Data
[Figure: plot of the countries' component scores highlighting potentially influential observations.]

67 Men's Track Data: Influential Observations?
Western Samoa and the Cook Islands are off the scale in the principal component analysis of the men's track data. When we remove these two countries, the results change:
[Table: eigenvalues, proportions, and cumulative proportions for all the data versus without these two countries.]


Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information

Principal Components Theory Notes

Principal Components Theory Notes Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory

More information

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

Dimension Reduction and Classification Using PCA and Factor. Overview

Dimension Reduction and Classification Using PCA and Factor. Overview Dimension Reduction and Classification Using PCA and - A Short Overview Laboratory for Interdisciplinary Statistical Analysis Department of Statistics Virginia Tech http://www.stat.vt.edu/consult/ March

More information

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2 Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

Matrices and Deformation

Matrices and Deformation ES 111 Mathematical Methods in the Earth Sciences Matrices and Deformation Lecture Outline 13 - Thurs 9th Nov 2017 Strain Ellipse and Eigenvectors One way of thinking about a matrix is that it operates

More information

PRINCIPAL COMPONENT ANALYSIS

PRINCIPAL COMPONENT ANALYSIS PRINCIPAL COMPONENT ANALYSIS 1 INTRODUCTION One of the main problems inherent in statistics with more than two variables is the issue of visualising or interpreting data. Fortunately, quite often the problem

More information

9.1 Orthogonal factor model.

9.1 Orthogonal factor model. 36 Chapter 9 Factor Analysis Factor analysis may be viewed as a refinement of the principal component analysis The objective is, like the PC analysis, to describe the relevant variables in study in terms

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Principal Component Analysis

Principal Component Analysis I.T. Jolliffe Principal Component Analysis Second Edition With 28 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition Acknowledgments List of Figures List of Tables

More information

Quantitative Understanding in Biology Principal Components Analysis

Quantitative Understanding in Biology Principal Components Analysis Quantitative Understanding in Biology Principal Components Analysis Introduction Throughout this course we have seen examples of complex mathematical phenomena being represented as linear combinations

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Comparisons of Several Multivariate Populations

Comparisons of Several Multivariate Populations Comparisons of Several Multivariate Populations Edps/Soc 584, Psych 594 Carolyn J Anderson Department of Educational Psychology I L L I N O I S university of illinois at urbana-champaign c Board of Trustees,

More information

1. Introduction to Multivariate Analysis

1. Introduction to Multivariate Analysis 1. Introduction to Multivariate Analysis Isabel M. Rodrigues 1 / 44 1.1 Overview of multivariate methods and main objectives. WHY MULTIVARIATE ANALYSIS? Multivariate statistical analysis is concerned with

More information

1 Principal Components Analysis

1 Principal Components Analysis Lecture 3 and 4 Sept. 18 and Sept.20-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Principal Components Analysis Principal components analysis (PCA) is a very popular technique for

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis 1 Principal Component Analysis Principal component analysis is a technique used to construct composite variable(s) such that composite variable(s) are weighted combination

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

STAT 501 Assignment 1 Name Spring 2005

STAT 501 Assignment 1 Name Spring 2005 STAT 50 Assignment Name Spring 005 Reading Assignment: Johnson and Wichern, Chapter, Sections.5 and.6, Chapter, and Chapter. Review matrix operations in Chapter and Supplement A. Written Assignment: Due

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

Latent Trait Reliability

Latent Trait Reliability Latent Trait Reliability Lecture #7 ICPSR Item Response Theory Workshop Lecture #7: 1of 66 Lecture Overview Classical Notions of Reliability Reliability with IRT Item and Test Information Functions Concepts

More information

ICS 6N Computational Linear Algebra Symmetric Matrices and Orthogonal Diagonalization

ICS 6N Computational Linear Algebra Symmetric Matrices and Orthogonal Diagonalization ICS 6N Computational Linear Algebra Symmetric Matrices and Orthogonal Diagonalization Xiaohui Xie University of California, Irvine xhx@uci.edu Xiaohui Xie (UCI) ICS 6N 1 / 21 Symmetric matrices An n n

More information

Dot Products, Transposes, and Orthogonal Projections

Dot Products, Transposes, and Orthogonal Projections Dot Products, Transposes, and Orthogonal Projections David Jekel November 13, 2015 Properties of Dot Products Recall that the dot product or standard inner product on R n is given by x y = x 1 y 1 + +

More information

Lecture 7 Spectral methods

Lecture 7 Spectral methods CSE 291: Unsupervised learning Spring 2008 Lecture 7 Spectral methods 7.1 Linear algebra review 7.1.1 Eigenvalues and eigenvectors Definition 1. A d d matrix M has eigenvalue λ if there is a d-dimensional

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Principal Component Analysis (PCA) Our starting point consists of T observations from N variables, which will be arranged in an T N matrix R,

Principal Component Analysis (PCA) Our starting point consists of T observations from N variables, which will be arranged in an T N matrix R, Principal Component Analysis (PCA) PCA is a widely used statistical tool for dimension reduction. The objective of PCA is to find common factors, the so called principal components, in form of linear combinations

More information

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables.

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables. Random vectors Recall that a random vector X = X X 2 is made up of, say, k random variables X k A random vector has a joint distribution, eg a density f(x), that gives probabilities P(X A) = f(x)dx Just

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

STATISTICAL LEARNING SYSTEMS

STATISTICAL LEARNING SYSTEMS STATISTICAL LEARNING SYSTEMS LECTURE 8: UNSUPERVISED LEARNING: FINDING STRUCTURE IN DATA Institute of Computer Science, Polish Academy of Sciences Ph. D. Program 2013/2014 Principal Component Analysis

More information

. = V c = V [x]v (5.1) c 1. c k

. = V c = V [x]v (5.1) c 1. c k Chapter 5 Linear Algebra It can be argued that all of linear algebra can be understood using the four fundamental subspaces associated with a matrix Because they form the foundation on which we later work,

More information

Experimental design. Matti Hotokka Department of Physical Chemistry Åbo Akademi University

Experimental design. Matti Hotokka Department of Physical Chemistry Åbo Akademi University Experimental design Matti Hotokka Department of Physical Chemistry Åbo Akademi University Contents Elementary concepts Regression Validation Hypotesis testing ANOVA PCA, PCR, PLS Clusters, SIMCA Design

More information

Principal Component Analysis & Factor Analysis. Psych 818 DeShon

Principal Component Analysis & Factor Analysis. Psych 818 DeShon Principal Component Analysis & Factor Analysis Psych 818 DeShon Purpose Both are used to reduce the dimensionality of correlated measurements Can be used in a purely exploratory fashion to investigate

More information

Principal components

Principal components Principal components Principal components is a general analysis technique that has some application within regression, but has a much wider use as well. Technical Stuff We have yet to define the term covariance,

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

1 A factor can be considered to be an underlying latent variable: (a) on which people differ. (b) that is explained by unknown variables

1 A factor can be considered to be an underlying latent variable: (a) on which people differ. (b) that is explained by unknown variables 1 A factor can be considered to be an underlying latent variable: (a) on which people differ (b) that is explained by unknown variables (c) that cannot be defined (d) that is influenced by observed variables

More information

The Principal Component Analysis

The Principal Component Analysis The Principal Component Analysis Philippe B. Laval KSU Fall 2017 Philippe B. Laval (KSU) PCA Fall 2017 1 / 27 Introduction Every 80 minutes, the two Landsat satellites go around the world, recording images

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Random Vectors, Random Matrices, and Matrix Expected Value

Random Vectors, Random Matrices, and Matrix Expected Value Random Vectors, Random Matrices, and Matrix Expected Value James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 16 Random Vectors,

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

A Introduction to Matrix Algebra and the Multivariate Normal Distribution A Introduction to Matrix Algebra and the Multivariate Normal Distribution PRE 905: Multivariate Analysis Spring 2014 Lecture 6 PRE 905: Lecture 7 Matrix Algebra and the MVN Distribution Today s Class An

More information

Stat 206: Linear algebra

Stat 206: Linear algebra Stat 206: Linear algebra James Johndrow (adapted from Iain Johnstone s notes) 2016-11-02 Vectors We have already been working with vectors, but let s review a few more concepts. The inner product of two

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables /4/04 Structural Equation Modeling and Confirmatory Factor Analysis Advanced Statistics for Researchers Session 3 Dr. Chris Rakes Website: http://csrakes.yolasite.com Email: Rakes@umbc.edu Twitter: @RakesChris

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

Basics of Multivariate Modelling and Data Analysis

Basics of Multivariate Modelling and Data Analysis Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 6. Principal component analysis (PCA) 6.1 Overview 6.2 Essentials of PCA 6.3 Numerical calculation of PCs 6.4 Effects of data preprocessing

More information

5 Inferences about a Mean Vector

5 Inferences about a Mean Vector 5 Inferences about a Mean Vector In this chapter we use the results from Chapter 2 through Chapter 4 to develop techniques for analyzing data. A large part of any analysis is concerned with inference that

More information

Principal Components Analysis

Principal Components Analysis Principal Components Analysis Lecture 9 August 2, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #9-8/2/2011 Slide 1 of 54 Today s Lecture Principal Components Analysis

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

A Peak to the World of Multivariate Statistical Analysis

A Peak to the World of Multivariate Statistical Analysis A Peak to the World of Multivariate Statistical Analysis Real Contents Real Real Real Why is it important to know a bit about the theory behind the methods? Real 5 10 15 20 Real 10 15 20 Figure: Multivariate

More information

Next is material on matrix rank. Please see the handout

Next is material on matrix rank. Please see the handout B90.330 / C.005 NOTES for Wednesday 0.APR.7 Suppose that the model is β + ε, but ε does not have the desired variance matrix. Say that ε is normal, but Var(ε) σ W. The form of W is W w 0 0 0 0 0 0 w 0

More information

Singular Value Decomposition. 1 Singular Value Decomposition and the Four Fundamental Subspaces

Singular Value Decomposition. 1 Singular Value Decomposition and the Four Fundamental Subspaces Singular Value Decomposition This handout is a review of some basic concepts in linear algebra For a detailed introduction, consult a linear algebra text Linear lgebra and its pplications by Gilbert Strang

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Introduction to Factor Analysis

Introduction to Factor Analysis to Factor Analysis Lecture 10 August 2, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #10-8/3/2011 Slide 1 of 55 Today s Lecture Factor Analysis Today s Lecture Exploratory

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information