Canonical Correlation & Principal Components Analysis


Aaron French

Canonical Correlation

Canonical Correlation is used to analyze the correlation between two sets of variables when there is one set of IVs (independent variables) and one set of DVs (dependent variables). It is a descriptive rather than a hypothesis-testing procedure, and there are several ways data may be combined with this procedure.

Several types of questions can be answered with Canonical Correlation:

1. How many reliable variate pairs are there in the data set?
2. How strong is the correlation between variates in a pair?
3. How are the dimensions that relate the variables to be interpreted?

Canonical Correlation is subject to several limitations. First, it is mathematically elegant but difficult to interpret because solutions are not unique. Second, the relationship between variables must be linear; if the data are correlated in nonlinear ways, another analysis would be more appropriate. Finally, small changes in which variables are included in the analysis can cause large differences in the results, and this can further confound the interpretation.

Normality is not required to perform Canonical Correlation, but it does increase the power of the test. As mentioned above, linear relationships between variables are essential. Additionally, homoscedasticity (relatively equal variances) increases the power of the test. Canonical Correlation is very sensitive to missing data in the analyzed matrix and to outliers. Both must be tested for and resolved before performing a Canonical Correlation.

The general equations for performing a canonical correlation are relatively simple. First, a correlation matrix (R) is formed. It is composed of correlations between DVs ($R_{yy}$), correlations between IVs ($R_{xx}$), and correlations between DVs and IVs ($R_{xy}$, with $R_{yx}$ its transpose):

$$R = R_{yy}^{-1} R_{yx} R_{xx}^{-1} R_{xy}$$
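As a rough illustration of the matrix product above, the sketch below partitions a small correlation matrix and computes the eigenvalues of the product; each eigenvalue is a squared canonical correlation, so the square roots are the canonical correlations. The correlation values, the function name `canonical_eigenvalues`, and the convention that DVs are listed before IVs are all assumptions made for this example and do not come from the handout.

```python
import numpy as np

def canonical_eigenvalues(R_full, k_y):
    """Eigenvalues of Ryy^-1 Ryx Rxx^-1 Rxy, largest first.

    R_full is assumed to list the k_y DVs first, followed by the IVs.
    """
    R_yy = R_full[:k_y, :k_y]   # correlations among DVs
    R_yx = R_full[:k_y, k_y:]   # DVs (rows) with IVs (columns)
    R_xy = R_full[k_y:, :k_y]   # IVs (rows) with DVs (columns)
    R_xx = R_full[k_y:, k_y:]   # correlations among IVs
    # solve(A, B) computes A^-1 B without forming the inverse explicitly.
    product = np.linalg.solve(R_yy, R_yx) @ np.linalg.solve(R_xx, R_xy)
    # The product is not symmetric, so the general eigenvalue routine is used.
    # Its eigenvalues are real in exact arithmetic; .real drops round-off noise.
    eigenvalues = np.linalg.eigvals(product).real
    return np.sort(eigenvalues)[::-1]

# Hypothetical correlations: two DVs (rows/columns 0-1), two IVs (2-3).
R_example = np.array([
    [1.0, 0.3, 0.5, 0.2],
    [0.3, 1.0, 0.1, 0.4],
    [0.5, 0.1, 1.0, 0.2],
    [0.2, 0.4, 0.2, 1.0],
])

eigenvalues = canonical_eigenvalues(R_example, k_y=2)
canonical_correlations = np.sqrt(np.clip(eigenvalues, 0.0, 1.0))
print("eigenvalues:", eigenvalues)
print("canonical correlations:", canonical_correlations)
```

With a single DV and a single IV the product reduces to the squared ordinary correlation, which is a convenient sanity check on the partitioning.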

For canonical analysis, find the eigenvalues and eigenvectors of the matrix R. Eigenvalues consolidate the variance of the matrix, redistributing the original variance into a few composite variates. Eigenvectors, transformed into coefficients, are used to combine the original variables into these composites. The eigenvalues are related to the canonical correlations by the following equation:

$$\lambda_i = r_{ci}^2$$

That is, each eigenvalue equals the squared canonical correlation for the corresponding pair of variates.

Significance tests use the following formula, which conforms to a chi-squared distribution:

$$\chi^2 = -\left[N - 1 - \frac{k_x + k_y + 1}{2}\right] \ln \Lambda_m$$

$$\Lambda_m = \prod_{i=1}^{m} (1 - \lambda_i)$$

with
N = number of cases
k_x = number of variables in the IV set
k_y = number of variables in the DV set
DF = (k_x)(k_y)
m = number of canonical correlations

Significant results indicate that the overlap between the variables in the two sets is significant; this is evidence of significance in the first canonical correlation. The next step is to remove the first canonical correlate and repeat the calculations with the remaining ones.

Two sets of canonical coefficients are required for each canonical correlation: one to combine the DVs and one to combine the IVs. For the DVs the equation is:

$$B_y = \left(R_{yy}^{-1/2}\right)' \hat{B}_y$$

where $\hat{B}_y$ is the normalized matrix of eigenvectors.
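The chi-squared statistic described above is simple arithmetic once the eigenvalues are known. The following is a minimal sketch, assuming two hypothetical eigenvalues (0.84 and 0.58), 8 cases, and two variables in each set; none of these numbers come from the handout, and the function name `canonical_chi_square` is invented for the example.

```python
import math

def canonical_chi_square(eigenvalues, n_cases, k_x, k_y):
    """Return (chi-square, degrees of freedom, Lambda) for all canonical correlations."""
    # Lambda_m is the product of (1 - lambda_i) over the m eigenvalues.
    wilks_lambda = math.prod(1.0 - lam for lam in eigenvalues)
    # Multiplier from the formula: N - 1 - (k_x + k_y + 1) / 2
    multiplier = n_cases - 1 - (k_x + k_y + 1) / 2
    chi_square = -multiplier * math.log(wilks_lambda)
    degrees_of_freedom = k_x * k_y
    return chi_square, degrees_of_freedom, wilks_lambda

# Hypothetical inputs, chosen only to illustrate the arithmetic.
chi_square, df, wilks_lambda = canonical_chi_square(
    [0.84, 0.58], n_cases=8, k_x=2, k_y=2
)
print(f"Lambda = {wilks_lambda:.4f}, chi-square = {chi_square:.2f}, df = {df}")
```

The statistic would then be compared against a chi-squared table with the stated degrees of freedom; that lookup is left out here to keep the sketch free of extra dependencies.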

For the IVs, the canonical coefficients are:

$$B_x = R_{xx}^{-1} R_{xy} B_y^*$$

where
$\hat{B}_y$ = normalized matrix of eigenvectors (used in the equation for the DVs)
$B_y^*$ = the matrix $B_y$ with each column divided by its canonical correlation
R = matrix of correlations

The two matrices of canonical coefficients are used to estimate scores on the canonical variates:

$$X = Z_x B_x$$
$$Y = Z_y B_y$$

Scores on the canonical variates (X, Y) are the product of the standardized scores on the original variables ($Z_x$, $Z_y$) and the canonical coefficients used to weight them. The sum of canonical scores for each variate is equal to zero.

Loading matrices (A) are created by multiplying the matrix of correlations between variables by the matrix of canonical coefficients. These A matrices are used to interpret the canonical variates.

$$A_x = R_{xx} B_x$$
$$A_y = R_{yy} B_y$$

How much variance does each canonical variate explain? The proportion of variance extracted from the IVs and from the DVs is:

$$pv_{xc} = \sum_{i=1}^{k_x} \frac{a_{ixc}^2}{k_x} \qquad pv_{yc} = \sum_{i=1}^{k_y} \frac{a_{iyc}^2}{k_y}$$

a = loading correlations
k = number of variables in the set

How much variance does the canonical variate from the IVs extract from the DVs, and vice versa?

Redundancy = proportion of variance * canonical correlation squared

$$rd = (pv)(r_c^2)$$
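The proportion-of-variance and redundancy formulas above reduce to a mean of squared loadings followed by one multiplication. A minimal sketch follows, assuming hypothetical loadings of 0.6 and 0.8 for a two-variable set and a hypothetical canonical correlation of 0.9; the function names are invented for illustration.

```python
def proportion_of_variance(loadings):
    """pv: sum of squared loadings on one canonical variate, divided by k."""
    return sum(a * a for a in loadings) / len(loadings)

def redundancy(loadings, canonical_correlation):
    """rd = pv * r_c^2: variance the opposite set's variate extracts from this set."""
    return proportion_of_variance(loadings) * canonical_correlation ** 2

# Hypothetical loadings of two variables on the first canonical variate.
example_loadings = [0.6, 0.8]
example_r_c = 0.9  # hypothetical first canonical correlation

pv = proportion_of_variance(example_loadings)
rd = redundancy(example_loadings, example_r_c)
print(f"pv = {pv:.3f}, rd = {rd:.3f}")
```

Because rd multiplies pv by a squared correlation, it can never exceed pv; a variate can summarize its own set well and still say little about the other set.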

Principal Components Analysis

PCA is a type of factor analysis in which a new factor is created for each variable in the data set. It is used to determine the relationships among variables in situations where it is not appropriate to make a priori grouping decisions, i.e. the data are not split into dependent and independent groups. Groups are created by forming composite axes which maximize the overall distance between the data points. In other words, it determines the net effect of each variable on the total variance of the data set, and then extracts the maximum variance possible from the data.

PCA is used in situations where you would like to simplify the description of a set of many related variables. It can also be useful as a preliminary step in a complicated regression analysis. In this case, first run a PCA to decrease the number of important variables, and then perform a regression on the principal components.

Another useful feature of PCA is that the data do not need to be normally distributed before performing the analysis. In fact, PCA can sometimes be a useful tool for assessing normality. If the data set is large and complex, it can be easier to graph and assess the principal components than the original data; if the principal components are normally distributed, then so are the original variables. The same procedure can be used to look for outliers: a histogram of the principal components can identify unusually large or small values in a much simplified fashion. If outliers are found, the data must be transformed or the outliers removed.

There are several situations that are not appropriate for a PCA. PCA cannot be used where results are pooled across several samples, nor for a repeated measures design. This is because the underlying structure of the data set may shift across samples or across time, and principal components analysis does not allow for this.
Additionally, PCA is very sensitive to the sizes of correlations between variables, and to any non-linearity between pairs of variables. Missing data are also not acceptable. Relatively large sample sizes are also needed. For data sets with about 4-6 variables, a sample size of at least 300 data points is needed.

There are several potential problems with performing a PCA. Most importantly, there are no criteria against which to test the output. Therefore, there is no way of testing the integrity of the groupings. This can create apparent order from real chaos in the data. There is no guarantee that the results will yield any biologically significant information.

To perform a PCA, the data are arranged in a correlation matrix (R), which is then diagonalized. The diagonalized matrix (L) has numbers on the main (positive) diagonal and zeros everywhere else. The correlation matrix (R) is diagonalized with the equation:

$$L = V' R V$$
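The diagonalization above can be checked numerically. The sketch below is one way to do it with NumPy, assuming a small hypothetical 3 x 3 correlation matrix; the values are invented, and `numpy.linalg.eigh` is simply one convenient routine for symmetric matrices, not necessarily what the author used.

```python
import numpy as np

# Hypothetical correlation matrix for three variables.
R = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.2],
    [0.3, 0.2, 1.0],
])

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, with the matching eigenvectors as columns of V.
eigenvalues, V = np.linalg.eigh(R)

# Reorder so the component with the largest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
V = V[:, order]

# L = V' R V should be diagonal, with the eigenvalues on the diagonal.
L = V.T @ R @ V
print(np.round(L, 6))
```

The eigenvalues of a correlation matrix sum to the number of variables, which is why an "average" component has an eigenvalue of 1.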

The diagonalized matrix (L), also known as the eigenvalue matrix, is created by premultiplying the R matrix by the transpose of the eigenvector matrix (V') and postmultiplying it by the eigenvector matrix (V). Calculations of eigenvectors and eigenvalues are the same as for canonical correlation.

Eigenvalues are related to the variance in the matrix. In an example where there are 10 factors, on average each factor will have an eigenvalue of 1 and will explain 10% of the variation in the data. A factor with an eigenvalue of 2 explains twice the variance of an average variable, or 20% in the example.

The square root is then taken of the eigenvalue matrix (L). The product of the eigenvector matrix and the square root of the eigenvalue matrix is called the factor loading matrix (A).

$$R = (V L^{1/2})(L^{1/2} V') = AA'$$

The correlation matrix (R) is the product of the factor loading matrix (A) and its transpose (A'), each a combination of eigenvectors and square roots of eigenvalues. The factor loading matrix represents the correlation between each factor and each variable, and is one of the most commonly cited outputs of the analysis.

Factor score coefficients, which are similar to the regression coefficients calculated in a multiple regression, are equal to the product of the inverse of the correlation matrix and the factor loading matrix.

$$B = R^{-1} A$$

After extraction of the principal components, those components which explain a low amount of variance in the data set may be discarded. There are no set rules as to how many principal components should be retained. One rule of thumb is to retain only the components that together explain over 80% of the variance. Another, more generous, possibility is to accept all factors which explain a non-random percentage of the variance (i.e. over 5%).

A Scree Test can be useful in determining the amount of variance explained by each component. A scree test is simply a graph with the principal components on the x-axis and eigenvalues on the y-axis.
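The loading, score-coefficient, and retention calculations above can be sketched in a few lines. This example assumes the same kind of hypothetical 3 x 3 correlation matrix (values invented for illustration) and prints the eigenvalue, percentage of variance, and cumulative percentage for each component, which are the numbers a scree plot or a retention rule would be based on.

```python
import numpy as np

# Hypothetical correlation matrix for three variables.
R = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.2],
    [0.3, 0.2, 1.0],
])

# Eigenvalues and eigenvectors, largest eigenvalue first.
eigenvalues, V = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# Factor loading matrix A = V L^(1/2): each column of V is scaled by the
# square root of its eigenvalue.
A = V @ np.diag(np.sqrt(eigenvalues))

# Factor score coefficients B = R^-1 A, computed with solve() rather than
# an explicit inverse.
B = np.linalg.solve(R, A)

# Share of total variance per component, and the running total.
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

for i, (lam, p, c) in enumerate(zip(eigenvalues, proportion, cumulative), start=1):
    print(f"PC{i}: eigenvalue = {lam:.3f}, variance = {p:.1%}, cumulative = {c:.1%}")
```

Plotting `eigenvalues` against the component number would give the scree graph; the cumulative column is what a rule such as "explain over 80% of the variance" would be checked against.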
Principal components are included by drawing a line from the first component and looking for the place where the eigenvalue points change slope away from that line.

After principal component factors are extracted from the data, the factor loading (A) matrices are rotated to maximize high correlations in the data while simultaneously minimizing low correlations. There are many types of rotation, but one of the most commonly recommended is Varimax Rotation, a variance-maximizing procedure. Rotation, then, is simply a way of clarifying the patterns which are already present in the data. In varimax rotation, the unrotated loading matrix is multiplied by a transformation matrix:

$$A_{unrotated} \Lambda = A_{rotated}$$

The transformation matrix has a spatial interpretation:

$$\Lambda = \begin{bmatrix} \cos\Psi & -\sin\Psi \\ \sin\Psi & \cos\Psi \end{bmatrix}$$

The only remaining question now is: what do the various principal component factors mean? This can be a little ambiguous with a complex data set, but often it is possible to equate each principal component factor with a combination of the several original variables to which it relates the most. The factors can also remain unnamed and simply be referred to by their factor names (PC1, PC2, etc.). This is advisable in situations where confusion might arise from a misclassification of their associated variables.

Reference Literature

For a more in-depth look at Canonical Correlation, the best reference book that I am aware of is:

Tabachnick, B. G., and L. S. Fidell. (1996) Using Multivariate Statistics. 3rd Edition. HarperCollins College Publishers.

Several other books are good secondary references, including:

Afifi, A. A., and V. Clark. (1996) Computer-aided Multivariate Analysis. 3rd Edition. Chapman and Hall.

Cooley, W. W., and D. R. Lohnes. Multivariate Data Analysis. John Wiley & Sons, Inc.

A very good ecological paper describing the uses of Canonical Correlation is:

Calhoon, R. E., and D. L. Jameson. (1970) Canonical correlation between variation in weather and variation in size in the Pacific Tree Frog, Hyla regilla, in Southern California. Copeia 1:

For a paper with a good ecological application of PCA, see:

Alisauskas, R. T. (1998) Winter range expansion and relationships between landscape and morphometrics of midcontinent lesser snow geese. Auk 115:
