EDAMI DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS
Mario Romanazzi

October 29

1 Introduction

An important task in multidimensional data analysis is reduction of complexity. Recalling that data are usually characterized as a set of n objects whose relevant features are described by a set of p variables, complexity reduction can be achieved either by reduction of variables or by reduction of objects. Here we consider the first problem, assuming the variables to be numerical in nature. A first, and obvious, method for simplification of the features is to select a subset of q <= p features able to retain the desired information. A second, more general, method is to look for q <= p transformations of the observed features able to retain the desired information. Principal component analysis (PCA) belongs to this second family of methods and is a typical step when trying to understand the structure of multidimensional numerical data.

2 Preliminaries: redundancy, rank and linear transformations

If we want to preserve the main information given by the observed features, it is clear that complexity reduction is only possible when there is some redundancy in the data. Mathematics offers a useful first notion of redundancy, which is rank. Since p is usually much lower than n, the rank of a data frame can be taken to be the number of linearly independent variables (columns), meaning that, if some variables are linear combinations of the others, then they do not offer new information and can be dropped without losing anything. The rank can be computed as the number of strictly positive singular values of the data frame. Below we examine the rank of the Swiss notes data set.

> bn <- read.table(file=
+   header=TRUE)
> str(bn)
'data.frame': 200 obs. of 7 variables:
 $ Id      : Factor w/ 200 levels "BN1","BN10","BN",..:
 $ Length  : num
 $ Left    : num
 $ Right   : num
 $ Bottom  : num
 $ Top     : num
 $ Diagonal: num
> SVD <- svd(bn[, -1])
> str(SVD)
List of 3
 $ d: num [1:6]
 $ u: num [1:200, 1:6]
 $ v: num [1:6, 1:6]
> # Singular values
> SVD$d
[1]

Here the rank is 6 and equals the number of variables. Let us consider two more variables, that is, the perimeter and the area of each note.

> peri <- 2*bn$Length + bn$Left + bn$Right
> area <- bn$Length * (bn$Left + bn$Right)/2
> bn1 <- data.frame(bn[, -1], peri, area)
> names(bn1) <- c(names(bn[, -1]), "Perimeter", "Area")
> SVD1 <- svd(bn1)
> # Singular values
> SVD1$d
[1]  e  e  e  e  e+00
[6]  e  e  e-13

Now the last singular value is zero, up to precision tolerance, reflecting the fact that the perimeter is a linear combination of the side lengths of the notes. Hence the rank is 7 = p - 1. Note that the addition of the area, being a non-linear transformation of the side lengths, does not contribute to redundancy.

Another useful notion of redundancy is offered by the correlation matrix R = R(X) of a data frame X. An observed p x p correlation matrix R(X) varies between two extreme correlation matrices: the identity matrix I_p = diag(1, ..., 1) and the all-ones matrix J_p = 1_p 1_p^T. The identity matrix corresponds to the situation where the variables are linearly independent; hence there is no redundancy and it is not possible to reduce complexity. The J-matrix corresponds to the seemingly opposite situation where just one variable carries real information, all the others being perfect linear transformations of it. The ranks of I_p and J_p are p and 1, respectively.

Underlying the previous notions of redundancy and rank there is a particular class of data transformations, called linear combinations. We introduce a useful notation to represent a general linear combination and recall some properties. Let X be the n x p numerical matrix of the data and let a = (a_1, ..., a_p)^T be a general p-vector. The linear combination z = (z_1, z_2, ..., z_n)^T associated to the vector a is the transformation

    z = z(a) = Xa.                                                  (1)
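The rank behaviour described above can be reproduced on synthetic data (the values below are simulated for illustration, not the banknote measurements): appending a linear combination of existing columns does not raise the rank, while a non-linear transformation adds a genuinely new direction.

```r
set.seed(4)
X <- matrix(rnorm(20 * 3), ncol = 3)        # 20 objects, 3 independent variables
X1 <- cbind(X,
            lin    = X[, 1] + X[, 2],       # linear combination: redundant
            nonlin = X[, 1] * X[, 2])       # non-linear transformation: not redundant
d <- svd(X1)$d                              # singular values, largest first
r <- sum(d > max(d) * 1e-8)                 # rank = number of non-negligible singular values
r                                           # 4, not 5: only the linear column is redundant
```

The tolerance max(d) * 1e-8 plays the role of the "precision tolerance" mentioned above: the fifth singular value is not exactly zero in floating point, only negligibly small.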
As z_i = a^T x_i = sum_{j=1}^p a_j x_ij, the values of the linear combination can be interpreted as generalized means, the generalization being that the a_j are arbitrary real numbers, whereas for means they must be non-negative numbers summing to one. As an example, the perimeter of the Swiss notes is the linear combination associated to the vector a = (2, 1, 1, 0, 0, 0)^T, and the perimeter of the generic note is z_i = 2 x_i1 + x_i2 + x_i3, i = 1, ..., n. The properties of a linear combination depend on the transformation vector a and on the reference data frame. In particular, the mean and the variance are
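As a small check of the notation z = Xa, the two hypothetical notes below (invented measurements, not the actual banknote data) show the perimeter arising as a matrix-vector product:

```r
# Two hypothetical notes with Length, Left and Right measurements (mm);
# the values are invented for illustration.
X <- matrix(c(215, 130, 130,
              214, 129, 131),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("Length", "Left", "Right")))
a <- c(2, 1, 1)          # coefficients of the perimeter combination
z <- X %*% a             # z_i = 2 x_i1 + x_i2 + x_i3
drop(z)                  # 690 688
```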
    \bar{z} = a^T \bar{x} = \sum_{j=1}^p a_j \bar{x}_j,             (2)

    s_z^2 = a^T S a = \sum_{i,j=1}^p s_{ij} a_i a_j.                (3)

The expression of the variance is particularly important. We regard the variance as the information content (proportional to the L2 norm of the errors about the mean) of the corresponding variable. From (3), the variance of a linear combination is a quadratic form depending on the underlying data through the variances and pairwise covariances of the observed variables. Hence, the variance of a linear combination incorporates the information about spread as well as linear interdependence, filtered by the coefficients a_j. To avoid variance explosion, normalized linear combinations are often considered, where the coefficients a_j satisfy the constraint sum_{j=1}^p a_j^2 = 1. In this case, the range of variation of a is the boundary of the unit-radius hypersphere centered at the origin, instead of the entire Euclidean space. Another property of linear combinations is that they preserve normality, when the data are normally distributed.

3 Principal components

In the previous examples, the vector a of a linear combination was given. But in data analysis the vector a is often determined so as to achieve specific goals. In these situations it is typically a function of the observed data. This is exactly the case of principal components. The principal components of a numerical data frame X are p uncorrelated normalized linear combinations Z_1, Z_2, ..., Z_p, ordered according to an information criterion. The first principal component is the normalized linear combination with maximum variance, the second principal component is the normalized linear combination with maximum variance subject to the constraint of being uncorrelated with the first one, and so on. The last principal component can also be characterized as the normalized linear combination with minimum variance.
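The maximum-variance property of the first principal component can be checked numerically on simulated data: equation (3) gives the variance of any normalized combination a as a^T S a, and no random point on the unit sphere beats the first eigenvector.

```r
set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3)        # simulated data, for illustration only
S <- cov(X)
E <- eigen(S)
a1 <- E$vectors[, 1]                        # first principal axis
v1 <- drop(t(a1) %*% S %*% a1)              # its variance = largest eigenvalue
# Compare with 1000 random normalized combinations
vmax <- max(replicate(1000, {
  a <- rnorm(3); a <- a / sqrt(sum(a^2))    # random point on the unit sphere
  drop(t(a) %*% S %*% a)                    # variance of the combination, a' S a
}))
c(v1, vmax)                                 # v1 is never exceeded
```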
The computation of principal components is simple because (a) the vectors a_1, a_2, ..., a_p of the optimal linear combinations are known to be the orthonormal eigenvectors of the covariance matrix of X, and (b) their variances are the corresponding eigenvalues l_1 >= l_2 >= ... >= l_p >= 0. A geometrical interpretation is also available. The principal component transformation is a rotation of p-dimensional space to new axes, called principal axes, corresponding to the directions of maximum variation of the data. For normal data, the principal axes are the axes of the ellipsoids of concentration, that is, the contours of the normal density function. In typical applications, PCA includes three steps:

1. preliminary data transformations, always including column centering of the data frame and sometimes column standardization to unit variance,

2. rotation to principal axes and computation of principal component scores,

3. selection of an optimal subset of q <= p principal components.

The optimality criterion used in the last step is an information criterion. Recalling that principal components are ordered according to decreasing variance, we have to retain enough pc's that their cumulated variance approximates the total variance of the original variables. In practice, we consider the information ratio

    R^2(q) = \sum_{j=1}^q l_j / \sum_{j=1}^p l_j,                   (4)

and choose q so that R^2(q) is sufficiently high. Values in the range 70-80% are considered satisfactory.
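The three steps above can be sketched in a few lines of R; the data are simulated and the 80% threshold is an arbitrary choice for illustration.

```r
set.seed(2)
X  <- matrix(rnorm(100 * 4), ncol = 4)            # simulated 4-variable data
Xc <- scale(X, center = TRUE, scale = FALSE)      # step 1: column centering
E  <- eigen(cov(Xc))                              # step 2: principal axes ...
Z  <- Xc %*% E$vectors                            # ... and pc scores
l  <- E$values                                    # l_1 >= ... >= l_p >= 0
R2 <- cumsum(l) / sum(l)                          # information ratio R^2(q)
q  <- which(R2 >= 0.8)[1]                         # step 3: smallest q with R^2(q) >= 80%
```

By construction the scores in Z are uncorrelated and their variances are the eigenvalues, so R^2(p) = 1 always holds.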
Figure 1: Swiss banknotes. Ordered variances of principal components.

3.1 Worked example: PCA of centered Swiss notes data

Below, PCA of the Swiss notes is performed. We consider first the analysis of column-centered data.

> class <- rep(c(0,1), c(, ))
> col <- rep(c("black", "red"), c(, ))
> pc <- princomp(bn[, -1], cor=FALSE)
> str(pc)
List of 7
 $ sdev    : Named num [1:6]
  ..- attr(*, "names")= chr [1:6] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
 $ loadings: loadings [1:6, 1:6]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "Length" "Left" "Right" "Bottom" ...
  .. ..$ : chr [1:6] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
 $ center  : Named num [1:6]
  ..- attr(*, "names")= chr [1:6] "Length" "Left" "Right" "Bottom" ...
 $ scale   : Named num [1:6]
  ..- attr(*, "names")= chr [1:6] "Length" "Left" "Right" "Bottom" ...
 $ n.obs   : int 200
 $ scores  : num [1:200, 1:6]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:6] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
 $ call    : language princomp(x = bn[, -1], cor = FALSE)
 - attr(*, "class")= chr "princomp"
> # pc$loadings: p x p orthogonal matrix of ordered eigenvectors,
> # whose columns are the optimal normalized linear combinations
> # pc$sdev: standard deviations of pc scores (coincident with square roots
> # of the eigenvalues of the covariance matrix)
> # pc$scores: n x p matrix of pc scores or coordinates (linear combination values)
> summary(pc)
Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation
Proportion of Variance
Cumulative Proportion
                       Comp.6
Standard deviation
Proportion of Variance
Cumulative Proportion
> plot(pc)
> # Statistical summaries of pc scores
> round(colMeans(pc$scores), 2)
> round(cov(pc$scores), 2)
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Comp.1
Comp.2
Comp.3
Comp.4
Comp.5
Comp.6
> round(cor(pc$scores), 2)
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Comp.1
Comp.2
Comp.3
Comp.4
Comp.5
Comp.6
> round(cor(bn[,-1], pc$scores[, 1:6]), 2)
Length
Left
Right
Bottom
Top
Diagonal
> plot(pc$scores[,1:2], pch=20, col=col,
+   xlab="PC1 (66.8%)", ylab="PC2 (20.8%)", main="PCA of Swiss Banknotes Data")
> abline(h=0, v=0, lty="dotted", col="grey")
> text(pc$scores[,1:2], labels=c(1:, 1:), cex=0.6, pos=3)
> pairs(pc$scores, pch=20, col=col)
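A detail worth knowing when reading this output: princomp uses the divisor n (not n - 1) for variances, so pc$sdev^2 equals the eigenvalues of the sample covariance matrix only up to the factor (n - 1)/n. A sketch on simulated data:

```r
set.seed(5)
X <- matrix(rnorm(50 * 4), ncol = 4)              # simulated data
n <- nrow(X)
pc <- princomp(X, cor = FALSE)
l <- eigen(cov(X))$values                         # eigenvalues, divisor n - 1
# princomp variances use divisor n, so rescaling recovers the eigenvalues:
round(unname(pc$sdev^2) * n / (n - 1) - l, 10)    # all (near) zero
```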
Figure 2: Swiss banknotes. Scatter plot of the first two principal components of centered data (black/red: genuine/forged bills).

A discussion of the results is given below.

1. The first two pc's provide a very good approximation of the 6-dimensional data: 66.8% of the total variance is absorbed by the first pc and an additional 20.8% by the second one, corresponding to a cumulative value of 87.6%. Therefore the visualization of the sample on the cartesian plane of the first two pc's is a reliable picture of the original 6-dimensional configuration.

2. The scatter plot of the first two pc's shows some remarkable features. The two classes appear as separate swarms of points, which is important because the information about class composition was NOT explicitly included in the principal component transformation. Moreover, the elongated shape of both clusters (more accentuated for forged bills) suggests within-class negative correlation of the pc scores (recall that, by definition, pc scores are uncorrelated). Finally, outliers are clearly displayed (e.g., observations no. 5 and 70 from the class of genuine bills).

3. The correlation matrix of the principal components with the observed variables is the main tool to obtain an interpretation of the transformation. In the present case the first pc is correlated mainly with Bottom and Diagonal, whereas the second pc is mainly correlated with Top. While this gives a clue about the interpretation of the principal axes, it also suggests that the estimation of the principal axes may be distorted by unbalanced variances of the original variables. This can be the case here, because these three variables have the highest variances. This is why it is generally recommended to apply PCA after data standardization.

3.2 Worked example: PCA of standardized Swiss notes data

For the sake of completeness, we also study the pc transformation of the standardized Swiss notes data.
Note that in this case we look for the stationary points of the function b^T R b = \sum_{i,j=1}^p r_{ij} b_i b_j, where R = (r_{ij}) is the correlation matrix of the data, subject to the constraint b^T b = 1. The solution is given by the orthonormal eigenvectors of R and the corresponding eigenvalues.
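This equivalence between the correlation-matrix eigenproblem and princomp(..., cor=TRUE) can be verified on simulated data; the divisor used for the covariance cancels when passing to correlations, so the variances of the standardized pc's match the eigenvalues of R directly.

```r
set.seed(3)
X <- matrix(rnorm(60 * 3), ncol = 3)     # simulated data
R <- cor(X)
E <- eigen(R)                            # eigenvectors b: stationary points of b' R b, b'b = 1
b1 <- E$vectors[, 1]
drop(t(b1) %*% R %*% b1)                 # equals the largest eigenvalue
pc <- princomp(X, cor = TRUE)
unname(pc$sdev^2)                        # same values as E$values
sum(E$values)                            # eigenvalues of R sum to p = 3
```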
Figure 3: Swiss banknotes. Scatter plot of the first two principal components of standardized data (black/red: genuine/forged bills).

> pc1 <- princomp(bn[, -1], cor=TRUE)
> summary(pc1)
Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation
Proportion of Variance
Cumulative Proportion
                       Comp.6
Standard deviation
Proportion of Variance
Cumulative Proportion
> plot(pc1)
> # Statistical summaries of pc scores
> round(colMeans(pc1$scores), 2)
> round(cov(pc1$scores), 2)
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Comp.1
Comp.2
Comp.3
Comp.4
Comp.5
Comp.6
> round(cor(pc1$scores), 2)
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Comp.1
Comp.2
Comp.3
Comp.4
Comp.5
Comp.6
> round(cor(bn[,-1], pc1$scores[, 1:6]), 2)
Length
Left
Right
Bottom
Top
Diagonal
> plot(pc1$scores[,1:2], pch=20, col=col,
+   xlab="PC1 (49.1%)", ylab="PC2 (21.3%)", main="PCA of Swiss Banknotes Data",
+   sub="Standardized Data")
> abline(h=0, v=0, lty="dotted", col="grey")
> text(pc1$scores[,1:2], labels=c(1:, 1:), cex=0.6, pos=3)
> pairs(pc1$scores, pch=20, col=col)

For standardized data, we observe a drop in the variance explained by the first two pc's: 70.4% against the 87.6% obtained on centered data. The interpretation is also different. The first pc depends heavily on all the observed variables except Length. The correlations are positive, except for Diagonal. The second pc depends almost only on Length. Again, class discrimination is good and there is evidence of positive within-class correlation of the pc scores.

3.3 Worked example: PCA of Swiss notes data augmented with perimeter and area of bills

As a final application, we study the effect on the pc transformation of the inclusion of linear and non-linear transformations of the variables. Here we consider the addition of the perimeter and the area of the notes.

> apply(bn1, 2, sd)
   Length      Left     Right    Bottom       Top  Diagonal Perimeter      Area

> pc2 <- princomp(bn1, cor=TRUE)
> summary(pc2)
Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation
Proportion of Variance
Cumulative Proportion
                       Comp.6 Comp.7 Comp.8
Standard deviation             e-03  0
Proportion of Variance         e-07  0
Cumulative Proportion          e+00  1
> round(cor(bn1, pc2$scores[, 1:8]), 2)
          Comp.7 Comp.8
Length
Left
Right
Bottom
Top
Diagonal
Perimeter
Area
> plot(pc2$scores[,1:2], pch=20, col=col,
+   xlab="PC1 (52.8%)", ylab="PC2 (24.9%)", main="PCA of Swiss Banknotes Data",
+   sub="Data set augmented with Perimeter and Area")
> abline(h=0, v=0, lty="dotted", col="grey")
> text(pc2$scores[,1:2], labels=c(1:, 1:), cex=0.6, pos=3)
> pairs(pc2$scores, pch=20, col=col)

The results show remarkable differences with respect to the previous versions.

1. Here it is necessary to apply the pc transformation to standardized data, because the variance of the Area variable is clearly dominant.

2. The last eigenvalue is zero because the rank of the augmented data matrix is 7, not 8, as Perimeter is a linear transformation of a subset of the observed variables.

3. The cumulated variance explained by the first two pc's is 77.7%, an intermediate value between the previous results, and the cumulated variance explained by the first three pc's is 88.6%, a very good value.

4. Let us try to interpret the first three pc's, using the correlations with the observed variables. The first, and most important, pc mainly depends on Left, Right, Perimeter and Area (absolute correlations all higher than 0.8, the highest value corresponding to Area), the correlations with the remaining variables being non-negligible but clearly of minor importance. The second pc mainly depends on Length (correlation equal to 0.81), Diagonal (correlation equal to 0.68), Bottom, Top and Perimeter. The third pc can be interpreted as a contrast between Bottom (correlation equal to 0.55) and Top (correlation equal to 0.73).

5. Class separation remains good. Observe that the genuine bills generally lie above the line PC2 = PC1, that is, the bisector of the first and third quadrants.
4 Beyond principal components

Taking linear combinations of the observed variables is a powerful method to explore multidimensional space. Principal components are characterized by the maximum variance property, but different solutions can be obtained by changing the function to be optimized. Two more classical examples arise in multiple linear regression and in discriminant analysis.
Figure 4: Swiss banknotes. Scatter plot of the first two principal components from data augmented with area and perimeter of bills (black/red: genuine/forged bills).

In multiple linear regression we are given p explanatory variables X_1, ..., X_p and a dependent variable Y, and we look for the optimal linear predictor of Y based on X_1, ..., X_p. It turns out that the well-known least squares solution is the linear combination of the (centered) X_1, ..., X_p with maximum squared correlation with the (centered) Y.

In discriminant analysis we are given a partition of the n objects into G classes and we look for the linear combination of the observed features X_1, ..., X_p producing the best separation of the classes. A criterion suggested in 1936 by R. A. Fisher is to maximize the ratio of the between-group variance to the within-group variance. Recall that in the scalar case the within-group variance s^2_W is the weighted mean of the class variances, and the between-group variance s^2_B is the variance of the weighted class means about the overall mean. An important result is that the overall variance is identically equal to the sum of the between-group and the within-group components. The resulting optimally separating linear combinations, called canonical variates, are related to linear discriminant analysis. We illustrate the canonical variate method using the Swiss banknotes data.

> library(MASS)
> ld <- lda(scale(bn[, -1]), grouping=class)
> str(ld)
List of 8
 $ prior  : Named num [1:2]
  ..- attr(*, "names")= chr [1:2] "0" "1"
 $ counts : Named int [1:2]
  ..- attr(*, "names")= chr [1:2] "0" "1"
 $ means  : num [1:2, 1:6]
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "0" "1"
  .. ..$ : chr [1:6] "Length" "Left" "Right" "Bottom" ...
 $ scaling: num [1:6, 1]
  ..- attr(*, "dimnames")=List of 2
Figure 5: Swiss banknotes. Scatter plot of the first principal component (horizontal axis) and the first canonical variate (vertical axis) obtained from standardized data (black/red: genuine/forged bills).

  .. ..$ : chr [1:6] "Length" "Left" "Right" "Bottom" ...
  .. ..$ : chr "LD1"
 $ lev    : chr [1:2] "0" "1"
 $ svd    : num 49.1
 $ N      : int 200
 $ call   : language lda(x = scale(bn[, -1]), grouping = class)
 - attr(*, "class")= chr "lda"
> # $scaling: optimal linear combination(s)
> cv <- scale(bn[, -1]) %*% ld$scaling
> plot(pc1$scores[,1], cv, pch=20, col=col,
+   xlab="PC1", ylab="CV1", main="PCA and CVA of Swiss Banknotes Data",
+   sub="Standardized Data")
> abline(h=0, v=0, lty="dotted", col="grey")
> text(pc1$scores[,1], cv, labels=c(1:, 1:), cex=0.6, pos=3)
> round(cor(bn[, -1], cv), 2)
          LD1
Length
Left     0.52
Right    0.61
Bottom   0.80
Top      0.63
Diagonal
> cor(pc1$scores[,1], cv)
     LD1
[1,]
Some remarks are given below.

1. It is clear that the canonical variate achieves optimal separation, with genuine bills assuming negative scores and forged bills assuming positive scores.

2. The interpretation is again obtainable from the correlations with the observed variables. The maximum absolute correlations of the canonical variate are with Diagonal (0.94) and Bottom (0.80).

3. In this case PCA and CVA produce similar results, as shown by the strong linear relation between the first principal component and the canonical variate. But in general it need not be so.
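The scalar variance decomposition underlying Fisher's criterion (overall variance equal to between-group plus within-group, with divisor n throughout) is easy to verify on a toy example with invented numbers:

```r
x <- c(1, 2, 3, 10, 11, 12)                 # invented scores, two well-separated classes
g <- rep(c(0, 1), each = 3)                 # class labels
n <- length(x); m <- mean(x)
ng <- tapply(x, g, length)                  # class sizes
mg <- tapply(x, g, mean)                    # class means
s2  <- mean((x - m)^2)                      # overall variance (divisor n)
s2W <- sum(ng * tapply(x, g, function(v) mean((v - mean(v))^2))) / n   # within-group
s2B <- sum(ng * (mg - m)^2) / n                                        # between-group
c(s2, s2W + s2B)                            # identical: the decomposition holds
```

A large ratio s2B/s2W, as here, is exactly what Fisher's criterion rewards when choosing the canonical variate.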
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More information5. Discriminant analysis
5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density
More informationDiscriminant analysis and supervised classification
Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationYORK UNIVERSITY. Faculty of Science Department of Mathematics and Statistics MATH M Test #1. July 11, 2013 Solutions
YORK UNIVERSITY Faculty of Science Department of Mathematics and Statistics MATH 222 3. M Test # July, 23 Solutions. For each statement indicate whether it is always TRUE or sometimes FALSE. Note: For
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationMATH 1553 PRACTICE FINAL EXAMINATION
MATH 553 PRACTICE FINAL EXAMINATION Name Section 2 3 4 5 6 7 8 9 0 Total Please read all instructions carefully before beginning. The final exam is cumulative, covering all sections and topics on the master
More informationMachine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More informationPrincipal Component Analysis (PCA) Principal Component Analysis (PCA)
Recall: Eigenvectors of the Covariance Matrix Covariance matrices are symmetric. Eigenvectors are orthogonal Eigenvectors are ordered by the magnitude of eigenvalues: λ 1 λ 2 λ p {v 1, v 2,..., v n } Recall:
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More informationData Mining Lecture 4: Covariance, EVD, PCA & SVD
Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28 Variance and Covariance - Expectation A random variable takes on different values due to chance The
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems
More informationCollinearity: Impact and Possible Remedies
Collinearity: Impact and Possible Remedies Deepayan Sarkar What is collinearity? Exact dependence between columns of X make coefficients non-estimable Collinearity refers to the situation where some columns
More information(v, w) = arccos( < v, w >
MA322 Sathaye Notes on Inner Products Notes on Chapter 6 Inner product. Given a real vector space V, an inner product is defined to be a bilinear map F : V V R such that the following holds: For all v
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions
More informationLecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26
Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1
More informationLinear Algebra in Actuarial Science: Slides to the lecture
Linear Algebra in Actuarial Science: Slides to the lecture Fall Semester 2010/2011 Linear Algebra is a Tool-Box Linear Equation Systems Discretization of differential equations: solving linear equations
More informationEigenvalues, Eigenvectors, and an Intro to PCA
Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.
More informationSTAT 730 Chapter 1 Background
STAT 730 Chapter 1 Background Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 27 Logistics Course notes hopefully posted evening before lecture,
More informationBasics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 6. Principal component analysis (PCA) 6.1 Overview 6.2 Essentials of PCA 6.3 Numerical calculation of PCs 6.4 Effects of data preprocessing
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Linear Discriminant Analysis Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki Principal
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationShort Answer Questions: Answer on your separate blank paper. Points are given in parentheses.
ISQS 6348 Final exam solutions. Name: Open book and notes, but no electronic devices. Answer short answer questions on separate blank paper. Answer multiple choice on this exam sheet. Put your name on
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationNeuroscience Introduction
Neuroscience Introduction The brain As humans, we can identify galaxies light years away, we can study particles smaller than an atom. But we still haven t unlocked the mystery of the three pounds of matter
More informationLecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides
Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture
More informationMachine Learning - MT & 14. PCA and MDS
Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)
More informationR in Linguistic Analysis. Wassink 2012 University of Washington Week 6
R in Linguistic Analysis Wassink 2012 University of Washington Week 6 Overview R for phoneticians and lab phonologists Johnson 3 Reading Qs Equivalence of means (t-tests) Multiple Regression Principal
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationLecture Notes 2: Matrices
Optimization-based data analysis Fall 2017 Lecture Notes 2: Matrices Matrices are rectangular arrays of numbers, which are extremely useful for data analysis. They can be interpreted as vectors in a vector
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationNumerical Methods I Singular Value Decomposition
Numerical Methods I Singular Value Decomposition Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 9th, 2014 A. Donev (Courant Institute)
More informationLinear Algebra & Geometry why is linear algebra useful in computer vision?
Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia
More informationCOMP6237 Data Mining Covariance, EVD, PCA & SVD. Jonathon Hare
COMP6237 Data Mining Covariance, EVD, PCA & SVD Jonathon Hare jsh2@ecs.soton.ac.uk Variance and Covariance Random Variables and Expected Values Mathematicians talk variance (and covariance) in terms of
More informationCOMP 558 lecture 18 Nov. 15, 2010
Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to
More informationFACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING
FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT
More informationReview problems for MA 54, Fall 2004.
Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on
More informationExercises * on Principal Component Analysis
Exercises * on Principal Component Analysis Laurenz Wiskott Institut für Neuroinformatik Ruhr-Universität Bochum, Germany, EU 4 February 207 Contents Intuition 3. Problem statement..........................................
More informationj=1 u 1jv 1j. 1/ 2 Lemma 1. An orthogonal set of vectors must be linearly independent.
Lecture Notes: Orthogonal and Symmetric Matrices Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk Orthogonal Matrix Definition. Let u = [u
More informationLecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016
Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen
More informationThe Singular Value Decomposition
The Singular Value Decomposition Philippe B. Laval KSU Fall 2015 Philippe B. Laval (KSU) SVD Fall 2015 1 / 13 Review of Key Concepts We review some key definitions and results about matrices that will
More informationVector Space Models. wine_spectral.r
Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components
More informationMultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A
MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 2017-2018 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI
More informationLinear Algebra. Session 12
Linear Algebra. Session 12 Dr. Marco A Roque Sol 08/01/2017 Example 12.1 Find the constant function that is the least squares fit to the following data x 0 1 2 3 f(x) 1 0 1 2 Solution c = 1 c = 0 f (x)
More informationLinear Algebra & Geometry why is linear algebra useful in computer vision?
Linear Algebra & Geometry why is linear algebra useful in computer vision? References: -Any book on linear algebra! -[HZ] chapters 2, 4 Some of the slides in this lecture are courtesy to Prof. Octavia
More informationPrincipal Component Analysis
I.T. Jolliffe Principal Component Analysis Second Edition With 28 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition Acknowledgments List of Figures List of Tables
More informationx. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).
.8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics
More informationPrincipal Components Theory Notes
Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory
More informationEECS 275 Matrix Computation
EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 6 1 / 22 Overview
More information1 9/5 Matrices, vectors, and their applications
1 9/5 Matrices, vectors, and their applications Algebra: study of objects and operations on them. Linear algebra: object: matrices and vectors. operations: addition, multiplication etc. Algorithms/Geometric
More information