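Algebra of Principal Component Analysis

Data:
$$\mathbf{Y} = \begin{bmatrix} 2 & 1 \\ 3 & 4 \\ 5 & 0 \\ 7 & 6 \\ 9 & 2 \end{bmatrix}$$

Centre each column on its mean:
$$\mathbf{Y}_c = [y - \bar{y}] = \begin{bmatrix} -3.2 & -1.6 \\ -2.2 & 1.4 \\ -0.2 & -2.6 \\ 1.8 & 3.4 \\ 3.8 & -0.6 \end{bmatrix}$$

Covariance matrix of the variables:
$$\mathbf{S} = \frac{1}{n-1}\,\mathbf{Y}_c'\mathbf{Y}_c = \begin{bmatrix} 8.2 & 1.6 \\ 1.6 & 5.8 \end{bmatrix}$$

Equation for eigenvalues and eigenvectors of S:
$$(\mathbf{S} - \lambda_k \mathbf{I})\,\mathbf{u}_k = \mathbf{0}$$

Eigenvalues: $\lambda_1 = 9$, $\lambda_2 = 5$

Matrix of eigenvalues:
$$\boldsymbol{\Lambda} = \begin{bmatrix} 9 & 0 \\ 0 & 5 \end{bmatrix}$$

Matrix of eigenvectors:
$$\mathbf{U} = \begin{bmatrix} 0.8944 & -0.4472 \\ 0.4472 & 0.8944 \end{bmatrix}$$

Positions of the 5 objects in ordination space:
$$\mathbf{F} = [y - \bar{y}]\,\mathbf{U} = \begin{bmatrix} -3.2 & -1.6 \\ -2.2 & 1.4 \\ -0.2 & -2.6 \\ 1.8 & 3.4 \\ 3.8 & -0.6 \end{bmatrix} \begin{bmatrix} 0.8944 & -0.4472 \\ 0.4472 & 0.8944 \end{bmatrix} = \begin{bmatrix} -3.578 & 0.000 \\ -1.342 & 2.236 \\ -1.342 & -2.236 \\ 3.130 & 2.236 \\ 3.130 & -2.236 \end{bmatrix}$$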
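The algebra above can be retraced in a few lines of base R. This is only a sketch of the steps on the slide, not a reference implementation; the object names (`Yc`, `S`, `lambda`, `F.mat`, `pc.var`) are chosen here for illustration, and the sign of each eigenvector column returned by `eigen()` is arbitrary.

```r
# Two-variable example: 5 objects (rows), 2 descriptors (columns)
Y <- matrix(c(2, 3, 5, 7, 9,
              1, 4, 0, 6, 2), nrow = 5, ncol = 2)

# Centre each column on its mean
Yc <- scale(Y, center = TRUE, scale = FALSE)

# Covariance matrix S = Yc'Yc / (n - 1)
S <- crossprod(Yc) / (nrow(Y) - 1)

# Eigenvalues and eigenvectors of S
eig    <- eigen(S, symmetric = TRUE)
lambda <- eig$values    # eigenvalues, in decreasing order
U      <- eig$vectors   # eigenvectors as columns (signs are arbitrary)

# Positions of the objects in ordination space
F.mat <- Yc %*% U

# The variance of each principal component equals its eigenvalue
pc.var <- apply(F.mat, 2, var)
```

Printing `lambda` and `pc.var` side by side shows that the variance of the object positions along each principal axis equals the corresponding eigenvalue.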
Principal component analysis (PCA)

[Figure: four panels (a) to (d); axes y1, y2, (y1 − ȳ1), (y2 − ȳ2), and principal axes I and II]

Figure 9.2 Numerical example of principal component analysis. (a) Five objects are plotted with respect to descriptors y1 and y2. (b) After centring the data, the objects are now plotted with respect to (y1 − ȳ1) and (y2 − ȳ2), represented by dashed axes. (c) The objects are plotted with reference to principal axes I and II, which are centred with respect to the scatter of points. (d) The two systems of axes (b and c) can be superimposed after a rotation of 26°34'.
[Figure: two panels (a) and (b); principal axes I and II, descriptor axes y1 and y2]

Fig. 9.3 Numerical example from Fig. 9.2. Distance and correlation biplots are discussed in Subsection 9.1.4. (a) Distance biplot. The eigenvectors are scaled to lengths 1. Inset: descriptors (matrix U). Main graph: descriptors (matrix U; arrows) and objects (matrix F; dots). The interpretation of the object-descriptor relationships is not based on their proximity, but on orthogonal projections (dashed lines) of the objects on the descriptor-axes or their extensions. (b) Correlation biplot. Descriptors (matrix $\mathbf{U}\boldsymbol{\Lambda}^{1/2}$; arrows) with a covariance angle of 76°35'. Objects (matrix G; dots). Projecting the objects orthogonally on a descriptor (dashed lines) reconstructs the values of the objects along that descriptor, to within a multiplicative constant.

Use the following matrices to draw biplots:
- Distance biplot (scaling 1): objects = $\mathbf{F}$, variables = $\mathbf{U}$
- Correlation biplot (scaling 2): objects = $\mathbf{G} = \mathbf{F}\boldsymbol{\Lambda}^{-1/2}$, variables = $\mathbf{U}_{sc} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}$

These two projections respect the biplot rule, that the product of the two projected matrices reconstructs the centred data $\mathbf{Y}_c$:
- Distance biplot: $\mathbf{F}\mathbf{U}' = \mathbf{Y}_c$
- Correlation biplot: $\mathbf{G}(\mathbf{U}\boldsymbol{\Lambda}^{1/2})' = \mathbf{Y}_c$
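The biplot rule can be checked numerically on the two-variable example. The sketch below assumes the five objects (2,1), (3,4), (5,0), (7,6), (9,2) of the numerical example; `rec1`, `rec2` and `angle.deg` are names introduced here for illustration only.

```r
# Two-variable example, centred
Y  <- matrix(c(2, 3, 5, 7, 9,
               1, 4, 0, 6, 2), nrow = 5, ncol = 2)
Yc <- scale(Y, center = TRUE, scale = FALSE)

eig <- eigen(cov(Y), symmetric = TRUE)
U   <- eig$vectors

# Scaling 1 (distance biplot): objects F, descriptors U
F.mat <- Yc %*% U

# Scaling 2 (correlation biplot): objects G, descriptors U.sc
G.mat <- F.mat %*% diag(eig$values^(-0.5))
U.sc  <- U %*% diag(eig$values^(0.5))

# Biplot rule: both products give back the centred data
rec1 <- F.mat %*% t(U)
rec2 <- G.mat %*% t(U.sc)

# Angle between the two descriptor arrows (rows of U.sc) in scaling 2
cos.angle <- sum(U.sc[1, ] * U.sc[2, ]) /
  sqrt(sum(U.sc[1, ]^2) * sum(U.sc[2, ]^2))
angle.deg <- acos(cos.angle) * 180 / pi   # about 76.58 degrees, i.e. 76°35'
```

Because the rows of `U.sc` have cross-products equal to the covariances, the cosine of the angle between the two arrows is the correlation between the descriptors, which is why this scaling is called the correlation biplot.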
PCA example, three variables

# Create the data matrix
data.3 <- matrix(c(2,3,5,7,9,10,1,4,0,6,2,5,1,0,-1,-1,0,1),6,3)
data.3
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    3    4    0
[3,]    5    0   -1
[4,]    7    6   -1
[5,]    9    2    0
[6,]   10    5    1

# Centre the variables
data.cent <- scale(data.3, center=TRUE, scale=FALSE)
data.cent
     [,1] [,2] [,3]
[1,]   -4   -2    1
[2,]   -3    1    0
[3,]   -1   -3   -1
[4,]    1    3   -1
[5,]    3   -1    0
[6,]    4    2    1

# Compute the covariance matrix
data.cov <- cov(data.cent)
# or, because the data are not standardized: cov(data.3)
     [,1] [,2] [,3]
[1,] 10.4  3.2  0.0
[2,]  3.2  5.6  0.0
[3,]  0.0  0.0  0.8

# Compute the eigenvalues and eigenvectors
data.eig <- eigen(data.cov)
data.eig
$values
[1] 12.0  4.0  0.8

$vectors
          [,1]       [,2] [,3]
[1,] 0.8944272  0.4472136    0
[2,] 0.4472136 -0.8944272    0
[3,] 0.0000000  0.0000000    1

# Compute the output matrices for scaling 1
U <- data.eig$vectors
F.mat <- data.cent %*% U

# Compute the output matrices for scaling 2
U.sc <- U %*% diag(data.eig$values^(0.5))
G.mat <- F.mat %*% diag(data.eig$values^(-0.5))

# Draw the scaling 1 and 2 biplots
par(mfrow=c(2,2))
biplot(F.mat[,c(1,2)], U[,c(1,2)])
biplot(F.mat[,c(1,3)], U[,c(1,3)])
biplot(G.mat[,c(1,2)], U.sc[,c(1,2)])
biplot(G.mat[,c(1,3)], U.sc[,c(1,3)])

[Figure: scaling 1 biplots, axes (1, 2) and (1, 3); objects 1 to 6 and variables Var 1 to Var 3]

[Figure: scaling 2 biplots, axes (1, 2) and (1, 3); objects 1 to 6 and variables Var 1 to Var 3]
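As a cross-check, the same decomposition can be compared with R's built-in `prcomp()`. This is a sketch rather than part of the original example: the data values are those read from the three-variable example above, and `pca`, `eig` and `F.manual` are names introduced here.

```r
# Three-variable example (6 objects, 3 variables)
data.3 <- matrix(c(2, 3, 5, 7, 9, 10,
                   1, 4, 0, 6, 2, 5,
                   1, 0, -1, -1, 0, 1), nrow = 6, ncol = 3)

# PCA of the covariance matrix using prcomp(): centred, not standardized
pca <- prcomp(data.3, center = TRUE, scale. = FALSE)

# Manual route: eigen-decomposition of the covariance matrix
eig      <- eigen(cov(data.3), symmetric = TRUE)
F.manual <- scale(data.3, center = TRUE, scale = FALSE) %*% eig$vectors

# prcomp() stores standard deviations; squaring them gives the eigenvalues
pca$sdev^2
eig$values

# Object scores agree up to the sign of each axis
max(abs(abs(pca$x) - abs(F.manual)))
```

Signs of whole axes may differ between the two routes because eigenvectors are only defined up to a reflection; this does not change the interpretation of the ordination.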
Data transformations

Transform physical variables (Ecology) or characters (Taxonomy)
- Univariate distributions are not symmetrical → Apply skewness-reduction transformation
- Variables are not in the same physical units → Apply standardization
$$z_i = \frac{y_i - \bar{y}}{s_y}$$
  or ranging
$$y'_i = \frac{y_i - y_{\min}}{y_{\max} - y_{\min}}$$

Transform community composition data (Ecology) (species presence-absence or abundance)
- Reduce asymmetry of distributions → Apply log(y + c) transformation
- Make community composition data suitable for Euclidean-based ordination methods (PCA, RDA) → Use the chord, chi-square, profile, or Hellinger transformations (Legendre & Gallagher 2001)
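The transformations listed above can be written out in base R as follows. This is a minimal sketch: the vector `env` and the small sites-by-species table `spe` are made-up illustration data, and the formulas for the chord, profile, Hellinger and chi-square transformations follow their usual definitions (rows are sites).

```r
# Hypothetical physical variable measured at 5 sites
env <- c(2.1, 3.4, 5.0, 7.7, 9.3)

z <- (env - mean(env)) / sd(env)                # standardization
r <- (env - min(env)) / (max(env) - min(env))   # ranging to [0, 1]

# Hypothetical community table: 3 sites (rows) x 3 species (columns)
spe <- matrix(c(10, 10, 20,
                10, 15, 10,
                15,  5,  5), nrow = 3, byrow = TRUE)

spe.log   <- log(spe + 1)                       # log(y + c) with c = 1
spe.chord <- spe / sqrt(rowSums(spe^2))         # chord: rows of length 1
spe.prof  <- spe / rowSums(spe)                 # profiles: relative abundances
spe.hel   <- sqrt(spe / rowSums(spe))           # Hellinger: sqrt of profiles

# Chi-square transformation: sqrt(f++) * f_ij / (f_i+ * sqrt(f_+j))
spe.chi <- sqrt(sum(spe)) * sweep(spe.prof, 2, sqrt(colSums(spe)), "/")
```

Dividing a matrix by a vector of row sums works here because R recycles the vector down the columns, so element (i, j) is divided by the i-th row sum. For real data sets, the `decostand()` function of the vegan package offers these transformations ready-made.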
Some uses of principal component analysis (PCA)

Two-dimensional ordination of the objects:
- Sampling sites in ecology
- Individuals or taxa in taxonomy

A 2-dimensional ordination diagram is an interesting graphical support for representing other properties of multivariate data, e.g., clusters.

Detect outliers or erroneous data in data tables

Find groups of variables that behave in the same way:
- Species in ecology
- Morphological/behavioural/molecular variables in taxonomy

Simplify (collinear) data; remove noise

Remove an identifiable component of variation, e.g., size factor in log-transformed morphological data
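Algebra of Correspondence Analysis

Frequency data table, with row sums $f_{i+}$, column sums $f_{+j}$ and grand total $f_{++}$:
$$\mathbf{Y} = [f_{ij}] = \begin{bmatrix} 10 & 10 & 20 \\ 10 & 15 & 10 \\ 15 & 5 & 5 \end{bmatrix} \qquad f_{i+} = \begin{bmatrix} 40 \\ 35 \\ 25 \end{bmatrix} \qquad f_{+j} = \begin{bmatrix} 35 & 30 & 35 \end{bmatrix} \qquad f_{++} = 100$$

$$p_{ij} = f_{ij}/f_{++} \qquad p_{i+} = f_{i+}/f_{++} \qquad p_{+j} = f_{+j}/f_{++}$$

Matrix Q:
$$\bar{\mathbf{Q}} = [\bar{q}_{ij}] = \left[\frac{p_{ij} - p_{i+}\,p_{+j}}{\sqrt{p_{i+}\,p_{+j}}}\right] = \left[\frac{(O_{ij} - E_{ij})/\sqrt{E_{ij}}}{\sqrt{f_{++}}}\right]$$

$$\bar{\mathbf{Q}} = \begin{bmatrix} -0.10690 & -0.05774 & 0.16036 \\ -0.06429 & 0.13887 & -0.06429 \\ 0.21129 & -0.09129 & -0.12677 \end{bmatrix}$$

Cross-product matrix:
$$\bar{\mathbf{Q}}'\bar{\mathbf{Q}} = \begin{bmatrix} 0.06020 & -0.02204 & -0.03980 \\ -0.02204 & 0.03095 & -0.00661 \\ -0.03980 & -0.00661 & 0.04592 \end{bmatrix}$$

Compute eigenvalues and eigenvectors of $\bar{\mathbf{Q}}'\bar{\mathbf{Q}}$:
$$(\bar{\mathbf{Q}}'\bar{\mathbf{Q}} - \lambda_k \mathbf{I})\,\mathbf{u}_k = \mathbf{0}$$

Eigenvalues: $\lambda_1 = 0.096$, $\lambda_2 = 0.041$

Matrix of eigenvalues:
$$\boldsymbol{\Lambda} = \begin{bmatrix} 0.096 & 0 \\ 0 & 0.041 \end{bmatrix}$$

There are never more than $k = \min(r - 1, c - 1)$ eigenvalues > 0 in CA.

Matrix of eigenvectors of $\bar{\mathbf{Q}}'\bar{\mathbf{Q}}$ $(c \times c)$ (the sign of each column is arbitrary):
$$\mathbf{U}_{(c \times k)} = \begin{bmatrix} 0.78016 & -0.20336 \\ -0.20383 & 0.81145 \\ -0.59144 & -0.54790 \end{bmatrix}$$

Matrix of eigenvectors of $\bar{\mathbf{Q}}\bar{\mathbf{Q}}'$ $(r \times r)$:
$$\hat{\mathbf{U}}_{(r \times k)} = \bar{\mathbf{Q}}\,\mathbf{U}\,\boldsymbol{\Lambda}^{-1/2} = \begin{bmatrix} -0.53693 & -0.55831 \\ -0.13043 & 0.79561 \\ 0.83349 & -0.23516 \end{bmatrix}$$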
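The steps above can be sketched in base R. The frequency table used here is the 3 × 3 example (rows 10 10 20, 10 15 10, 15 5 5; total 100); the names `P`, `p.row`, `p.col`, `E`, `Q` and `U.hat` are introduced for this illustration only.

```r
# Frequency table: 3 rows (sites) x 3 columns (species)
Y <- matrix(c(10, 10, 20,
              10, 15, 10,
              15,  5,  5), nrow = 3, byrow = TRUE)

P     <- Y / sum(Y)       # relative frequencies p_ij
p.row <- rowSums(P)       # row weights p_i+
p.col <- colSums(P)       # column weights p_+j
E     <- outer(p.row, p.col)   # expected relative frequencies p_i+ * p_+j

# Matrix Q of contributions to chi-square, divided by sqrt(f++)
Q <- (P - E) / sqrt(E)

# Eigen-decomposition of the cross-product matrix Q'Q
eig <- eigen(crossprod(Q), symmetric = TRUE)

# Keep the k = min(r - 1, c - 1) non-null axes
k      <- min(nrow(Y), ncol(Y)) - 1
lambda <- eig$values[1:k]
U      <- eig$vectors[, 1:k, drop = FALSE]

# Eigenvectors of QQ' obtained from those of Q'Q
U.hat <- Q %*% U %*% diag(lambda^(-0.5))
```

The sum of the eigenvalues equals the sum of the squared elements of `Q`, which is the total inertia (the chi-square statistic of the table divided by the grand total).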
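Compute matrices $\mathbf{F}$ and $\mathbf{V}$ for the scaling 1 biplot, and $\hat{\mathbf{V}}$ and $\hat{\mathbf{F}}$ for the scaling 2 biplot.

[Figure: CA biplot scaling type 1 and CA biplot scaling type 2; sites Site_1 to Site_3 and species Sp.1 to Sp.3 plotted on CA axis 1 and CA axis 2]

Calculation details

Compute matrices $\mathbf{V}$, $\hat{\mathbf{V}}$, $\mathbf{F}$, and $\hat{\mathbf{F}}$ used in the ordination biplots:

$$\mathbf{V}_{(c \times k)} = \mathbf{D}(p_{+j})^{-1/2}\,\mathbf{U} \quad \text{where } p_{+j} = f_{+j}/f_{++}$$
$$\hat{\mathbf{V}}_{(r \times k)} = \mathbf{D}(p_{i+})^{-1/2}\,\hat{\mathbf{U}} \quad \text{where } p_{i+} = f_{i+}/f_{++}$$
$$\mathbf{F}_{(r \times k)} = \hat{\mathbf{V}}\,\boldsymbol{\Lambda}^{1/2}$$
$$\hat{\mathbf{F}}_{(c \times k)} = \mathbf{V}\,\boldsymbol{\Lambda}^{1/2}$$

Biplot, scaling type 1: plot $\mathbf{F}$ for sites, $\mathbf{V}$ for species. This projection preserves the chi-square distance among the sites. The sites are at the centroids (barycentres) of the species.

Biplot, scaling type 2: plot $\hat{\mathbf{V}}$ for sites, $\hat{\mathbf{F}}$ for species. This projection preserves the chi-square distance among the species. The species are at the centroids (barycentres) of the sites.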
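The four matrices, and the two properties claimed for scaling type 1, can be checked with a short base R sketch. It uses the same 3 × 3 frequency table as the CA example and obtains U and Û in one step from the singular value decomposition of Q, which is a shortcut chosen here rather than the route shown on the slide; all object names are illustrative.

```r
# Frequency table: 3 sites (rows) x 3 species (columns)
Y <- matrix(c(10, 10, 20,
              10, 15, 10,
              15,  5,  5), nrow = 3, byrow = TRUE)

P     <- Y / sum(Y)
p.row <- rowSums(P)
p.col <- colSums(P)
Q     <- (P - outer(p.row, p.col)) / sqrt(outer(p.row, p.col))

# SVD of Q: columns of v are U, columns of u are U.hat, d^2 are the eigenvalues
k      <- min(nrow(Y), ncol(Y)) - 1
sv     <- svd(Q)
U      <- sv$v[, 1:k, drop = FALSE]
U.hat  <- sv$u[, 1:k, drop = FALSE]
lambda <- sv$d[1:k]^2

# Matrices used in the biplots
V     <- diag(p.col^(-0.5)) %*% U        # species, scaling 1
V.hat <- diag(p.row^(-0.5)) %*% U.hat    # sites,   scaling 2
F.mat <- V.hat %*% diag(sqrt(lambda))    # sites,   scaling 1
F.hat <- V %*% diag(sqrt(lambda))        # species, scaling 2

# Scaling 1: sites are at the centroids of the species scores V,
# weighted by the row profiles p_ij / p_i+
F.centroid <- (P / p.row) %*% V

# Scaling 1: Euclidean distances among rows of F equal the
# chi-square distances among the sites
chi.dist <- dist(sweep(P / p.row, 2, sqrt(p.col), "/"))
F.dist   <- dist(F.mat)
```

Comparing `F.mat` with `F.centroid`, and `F.dist` with `chi.dist`, shows both properties numerically; the symmetric checks for scaling type 2 use the column profiles `t(P) / p.col` with `V.hat`.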
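Principal coordinate analysis (PCoA)

Example: a Euclidean distance matrix computed from the data of the PCA example.

$$\mathbf{D} = \begin{bmatrix} 0.000 & 3.162 & 3.162 & 7.071 & 7.071 \\ 3.162 & 0.000 & 4.472 & 4.472 & 6.325 \\ 3.162 & 4.472 & 0.000 & 6.325 & 4.472 \\ 7.071 & 4.472 & 6.325 & 0.000 & 4.472 \\ 7.071 & 6.325 & 4.472 & 4.472 & 0.000 \end{bmatrix}$$

Transform D to a new matrix $\mathbf{A} = [a_{hi}]$:
$$a_{hi} = -0.5\,D_{hi}^2$$

Centre matrix A to produce matrix Δ with sums of the rows and columns equal to 0:
$$\boldsymbol{\Delta} = [\delta_{hi}] = [a_{hi} - \bar{a}_h - \bar{a}_i + \bar{a}]$$

$$\boldsymbol{\Delta} = \begin{bmatrix} 12.8 & 4.8 & 4.8 & -11.2 & -11.2 \\ 4.8 & 6.8 & -3.2 & 0.8 & -9.2 \\ 4.8 & -3.2 & 6.8 & -9.2 & 0.8 \\ -11.2 & 0.8 & -9.2 & 14.8 & 4.8 \\ -11.2 & -9.2 & 0.8 & 4.8 & 14.8 \end{bmatrix}$$

Compute the eigenvalues and eigenvectors of matrix Δ. Scale the eigenvectors to lengths equal to the square roots of their respective eigenvalues, $\mathbf{u}_k'\mathbf{u}_k = \lambda_k$.

Eigenvalues: $\lambda_1 = 36$, $\lambda_2 = 20$

Objects                      Eigenvector 1    Eigenvector 2
x1                              -3.578            0.000
x2                              -1.342            2.236
x3                              -1.342           -2.236
x4                               3.130            2.236
x5                               3.130           -2.236
Eigenvector length            √36 = 6.000      √20 = 4.472
PCoA eigenvalues / (n − 1)        9                5

Compare the PCoA eigenvalues and eigenvectors to the PCA eigenvalues and matrix F.

PCoA is used for ordination of D matrices produced by functions other than the Euclidean.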
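The PCoA steps can be followed in base R and compared with the PCA of the same data. This is a sketch under the assumption that the five objects are those of the two-variable PCA example; `Delta`, `keep`, `pc` and `F.mat` are names introduced here.

```r
# Data of the two-variable PCA example
Y <- matrix(c(2, 3, 5, 7, 9,
              1, 4, 0, 6, 2), nrow = 5, ncol = 2)
n <- nrow(Y)

# Euclidean distance matrix, then A = -0.5 * D^2
D <- as.matrix(dist(Y))
A <- -0.5 * D^2

# Gower centring: delta_hi = a_hi - rowmean_h - colmean_i + grand mean
Delta <- A - rowMeans(A) -
  matrix(colMeans(A), n, n, byrow = TRUE) + mean(A)

# Eigen-decomposition; keep the axes with eigenvalues clearly above zero
eig  <- eigen(Delta, symmetric = TRUE)
keep <- eig$values > 1e-8
lambda <- eig$values[keep]

# Scale each eigenvector to length sqrt(lambda_k): principal coordinates
pc <- eig$vectors[, keep, drop = FALSE] %*% diag(sqrt(lambda))

# PCA object positions for comparison (axis signs may differ)
F.mat <- scale(Y, center = TRUE, scale = FALSE) %*%
  eigen(cov(Y), symmetric = TRUE)$vectors

lambda / (n - 1)   # equals the PCA eigenvalues
```

With a Euclidean distance matrix, the principal coordinates coincide with the PCA positions up to the sign of each axis. The base R function `cmdscale()` carries out the same analysis, for example `cmdscale(dist(Y), k = 2, eig = TRUE)`.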