Canonical Correlation & Principal Components Analysis

Aaron French

Canonical Correlation

Canonical Correlation is used to analyze the correlation between two sets of variables when there is one set of IVs (independent variables) and one set of DVs (dependent variables). It is a descriptive rather than a hypothesis-testing procedure, and there are several ways data may be combined within it.

There are several types of questions that can be answered with Canonical Correlation:

1. How many reliable variate pairs are there in the data set?
2. How strong is the correlation between the variates in a pair?
3. How are the dimensions that relate the variables to be interpreted?

Canonical Correlation is subject to several limitations. First, it is mathematically elegant but difficult to interpret, because solutions are not unique. Second, the relationships between variables must be linear; if the data are correlated in nonlinear ways, then other analyses would be more appropriate. And finally, small changes in which variables are included in the analysis can cause large differences in the results, which can further confound interpretation.

Normality is not required to perform Canonical Correlation, but it does increase the power of the test. As mentioned above, linear relationships between variables are essential. Additionally, homoscedasticity (relatively equal variances) increases the power of the test. Canonical Correlation is very sensitive to missing data in the analyzed matrix and to outliers; these must be tested for and resolved before performing a Canonical Correlation.

The general equations for performing a canonical correlation are relatively simple. First, a correlation matrix (R) is formed. This is composed of: correlations between DVs ($R_{yy}$), correlations between IVs ($R_{xx}$), and correlations between DVs and IVs ($R_{xy}$):

$R = R_{yy}^{-1} R_{yx} R_{xx}^{-1} R_{xy}$
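As a minimal NumPy sketch of forming this matrix (the function name and the arrays X and Y are illustrative, not from the handout; X holds the IV scores and Y the DV scores):

```python
import numpy as np

def cca_matrix(X, Y):
    """Build R = Ryy^-1 Ryx Rxx^-1 Rxy from an (n, kx) IV array X
    and an (n, ky) DV array Y."""
    kx = X.shape[1]
    full = np.corrcoef(np.column_stack([X, Y]), rowvar=False)
    Rxx = full[:kx, :kx]          # correlations among IVs
    Ryy = full[kx:, kx:]          # correlations among DVs
    Rxy = full[:kx, kx:]          # correlations between IVs and DVs
    Ryx = Rxy.T
    R = np.linalg.inv(Ryy) @ Ryx @ np.linalg.inv(Rxx) @ Rxy
    return R, Rxx, Ryy, Rxy
```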

To perform the canonical analysis, solve this matrix R for its eigenvalues and eigenvectors. Eigenvalues consolidate the variance of the matrix, redistributing the original variance into a few composite variates; eigenvectors, transformed into coefficients, are used to combine the original variables into these composites. The eigenvalues are related to the canonical correlations by:

$\lambda_i = r_{c_i}^2$

That is, each eigenvalue equals the squared canonical correlation for the corresponding pair of variates.

Significance tests follow the formula below and conform to a chi-squared distribution:

$\chi^2 = -\left[ N - 1 - \frac{k_x + k_y + 1}{2} \right] \ln \Lambda_m, \qquad \Lambda_m = \prod_{i=1}^{m} (1 - \lambda_i)$

with:

N = number of cases
$k_x$ = number of variables in the IV set
$k_y$ = number of variables in the DV set
DF = $(k_x)(k_y)$
m = number of canonical correlations

Significant results indicate that the overlap between the variables in each set is significant; this is evidence of significance for the first canonical correlation. The next step is to remove canonical correlates and repeat the calculations.
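Continuing the hypothetical sketch above (with N cases, kx IVs, ky DVs, and R from cca_matrix), the eigenvalues of R give the squared canonical correlations and feed the chi-squared test:

```python
import numpy as np
from scipy.stats import chi2

# Eigen-decompose R; each eigenvalue lambda_i is a squared
# canonical correlation r_ci^2.
eigvals, eigvecs = np.linalg.eig(R)
lam = np.real(eigvals)
order = np.argsort(lam)[::-1]          # largest first
lam = lam[order]
r_c = np.sqrt(lam)                     # canonical correlations

# Chi-squared test of the overlap between the two sets,
# following the formula above.
m = len(lam)
Lambda_m = np.prod(1.0 - lam[:m])
chi_sq = -(N - 1 - (kx + ky + 1) / 2) * np.log(Lambda_m)
df = kx * ky
p_value = chi2.sf(chi_sq, df)
```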

Two sets of canonical coefficients are required for each canonical correlation: one to combine the DVs and one to combine the IVs. For the DVs:

$B_y = R_{yy}^{-1/2} \hat{B}_y$

For the IVs:

$B_x = R_{xx}^{-1} R_{xy} B_y$

where $\hat{B}_y$ is the normalized matrix of eigenvectors and R is the matrix of correlations.

The two matrices of canonical coefficients are used to estimate scores on the canonical variates:

$X = Z_x B_x \qquad Y = Z_y B_y$

Scores on the canonical variates (X, Y) are the product of the standardized scores on the original variables (Z) and the canonical coefficients used to weight them. The sum of the canonical scores for each variate is equal to zero.

Loading matrices (A) are created by multiplying the matrix of correlations between variables by the matrix of canonical coefficients; these A matrices are used to interpret the canonical variates:

$A_x = R_{xx} B_x \qquad A_y = R_{yy} B_y$

How much variance does each canonical variate explain? The proportions of variance for the IVs and DVs are:

$pv_{x_c} = \sum_{i=1}^{k_x} \frac{a_{ix_c}^2}{k_x} \qquad pv_{y_c} = \sum_{i=1}^{k_y} \frac{a_{iy_c}^2}{k_y}$

where a = loading correlations and k = number of variables in the set.

How much variance does the canonical variate from the IVs extract from the DVs, and vice versa? Redundancy is the proportion of variance times the squared canonical correlation:

$rd = (pv)(r_c^2)$
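Continuing the same hypothetical sketch, the coefficients, scores, loadings, proportions of variance, and redundancies follow directly from these equations (scipy's fractional_matrix_power supplies the $R_{yy}^{-1/2}$ term):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

# Canonical coefficients: By from the normalized eigenvectors of R,
# Bx derived from By.
B_hat = np.real(eigvecs)[:, order]                  # normalized eigenvectors
By = fractional_matrix_power(Ryy, -0.5) @ B_hat
Bx = np.linalg.inv(Rxx) @ Rxy @ By

# Scores on the canonical variates from standardized data.
Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
X_scores = Zx @ Bx
Y_scores = Zy @ By

# Loadings, proportion of variance per variate, and redundancy.
Ax = Rxx @ Bx
Ay = Ryy @ By
pv_x = (Ax ** 2).sum(axis=0) / kx
pv_y = (Ay ** 2).sum(axis=0) / ky
rd_x = pv_x * lam                                   # rd = pv * r_c^2
rd_y = pv_y * lam
```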

Principal Components Analysis

PCA is a type of factor analysis in which a new factor is created for each variable in the data set. It is used to determine the relationships among variables in situations where it is not appropriate to make a priori grouping decisions, i.e. where the data are not split into dependent and independent groups. Groups are created by forming composite axes that maximize the variance captured from the data. In other words, PCA determines the net effect of each variable on the total variance of the data set, and then extracts the maximum variance possible from the data.

PCA is used in situations where you would like to simplify the description of a set of many related variables. It can also be useful as a preliminary step in a complicated regression analysis: first run a PCA to reduce the number of important variables, then perform the regression on the principal components.

Another useful feature of PCA is that the data do not need to be normalized before performing the analysis. In fact, PCA can sometimes be a useful tool for assessing normality. If the data set is large and complex, it can be easier to graph and assess the principal components than the original data; if the principal components are normally distributed, then so are the original variables. The same procedure can be used to look for outliers: a histogram of the principal components can identify unusually large or small values in a much simplified fashion. If outliers are found, the data must be transformed or the outliers removed.

There are several situations that are not appropriate for PCA. PCA cannot be used where results are pooled across several samples, nor for a repeated measures design, because the underlying structure of the data set may shift across samples or across time, and principal components analysis does not allow for this. Additionally, PCA is very sensitive to the sizes of the correlations between variables and to any nonlinearity between pairs of variables. Missing data are also not acceptable, and relatively large sample sizes are needed: for data sets with about 4-6 variables, a sample size of at least 300 data points is needed.

There are several potential problems with performing a PCA. Most importantly, there are no criteria against which to test the output, so there is no way of testing the integrity of the groupings. This can create apparent order from real chaos in the data, and there is no guarantee that the results will yield any biologically significant information.

To perform a PCA, the data are arranged in a correlation matrix (R), which is then diagonalized. The diagonalized matrix (L) has numbers on the positive diagonal and zeros everywhere else. The correlation matrix (R) is diagonalized with the equation:

$L = V' R V$

Here the diagonalized matrix (L), also known as the eigenvalue matrix, is created by pre- and post-multiplying the R matrix by the eigenvector matrix (V) and its transpose. Calculations of eigenvectors and eigenvalues are the same as for canonical correlation.

Eigenvalues are related to the variance in the matrix. In an example where there are 10 factors, on average each factor will have an eigenvalue of 1 and will explain 10% of the variation in the data; a factor with an eigenvalue of 2 explains twice the variance of an average factor, or 20% in this example.

The square root is then taken of the eigenvalue matrix (L). The product of the eigenvector matrix and the square root of the eigenvalue matrix is called the factor loading matrix (A):

$R = (V L^{1/2})(L^{1/2} V') = A A'$

The correlation matrix (R) is the product of the factor loading matrix (A) and its transpose (A'), each a combination of eigenvectors and square roots of eigenvalues. The factor loading matrix represents the correlation between each factor and each variable, and it is one of the most commonly cited and interpreted results.

Factor score coefficients, which are similar to the regression coefficients calculated in a multiple regression, are equal to the product of the inverse of the correlation matrix and the factor loading matrix:

$B = R^{-1} A$

After extraction of the principal components, those components that explain a low amount of the variance in the data set may be discarded. There are no set rules as to how many principal components should be retained. One rule of thumb is to retain only enough components to explain over 80% of the variance; another, more generous possibility is to accept all components that explain a non-random percentage of the variance (i.e. over 5%). A scree test can be useful in judging the amount of variance explained by each component. A scree test is simply a graph with the principal components on the x-axis and eigenvalues on the y-axis; components are included by drawing a line from the first component and looking for the place where the eigenvalue points change slope away from that line.
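A minimal NumPy sketch of the procedure just described, using a hypothetical (n, k) data array (names are illustrative, not from the handout):

```python
import numpy as np

def pca_loadings(data):
    """Diagonalize the correlation matrix R (L = V'RV), then form the
    factor loading matrix A = V L^(1/2) and the factor score
    coefficients B = R^-1 A."""
    R = np.corrcoef(data, rowvar=False)
    L, V = np.linalg.eigh(R)              # eigenvalues/vectors of symmetric R
    order = np.argsort(L)[::-1]           # largest component first
    L, V = L[order], V[:, order]
    A = V @ np.diag(np.sqrt(L))           # loadings; A @ A.T reproduces R
    B = np.linalg.inv(R) @ A              # factor score coefficients
    return L, A, B

# Retention screens matching the rules of thumb above:
# L, A, B = pca_loadings(data)
# pct = L / L.sum()                       # proportion of variance each
# keep_80 = np.cumsum(pct) <= 0.80        # up to 80% cumulative variance
# keep_5 = pct > 0.05                     # all components over 5% each
```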

After the principal components are extracted from the data, the factor loading matrix A is rotated to maximize the high correlations in the data while simultaneously minimizing the low correlations. There are many types of rotation, but one of the most commonly recommended is Varimax rotation, a variance-maximizing procedure. Rotation, then, is simply a way of clarifying the patterns that are already present in the data. In varimax rotation, the unrotated loading matrix is multiplied by a transformation matrix:

$A_{\text{unrotated}} \Lambda = A_{\text{rotated}}$

The transformation matrix has a spatial interpretation:

$\Lambda = \begin{bmatrix} \cos\Psi & -\sin\Psi \\ \sin\Psi & \cos\Psi \end{bmatrix}$

The only remaining question now is: what do the various principal components mean? This can be a little ambiguous with a complex data set, but often it is possible to equate each principal component with a combination of the several original variables to which it relates most strongly. The components can also remain unnamed and simply be referred to by their factor names (PC1, PC2, etc.); this is advisable in situations where confusion might arise from a misclassification of their associated variables.
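As a sketch of the rotation step (this applies the transformation matrix to one pair of components at a fixed angle; a full varimax routine would iterate over pairs of components, choosing each angle to maximize the variance of the squared loadings):

```python
import numpy as np

def rotate_pair(A, psi):
    """Rotate a two-column loading matrix: A_rotated = A_unrotated @ Lambda,
    with Lambda the transformation matrix given above (psi in radians)."""
    Lam = np.array([[np.cos(psi), -np.sin(psi)],
                    [np.sin(psi),  np.cos(psi)]])
    return A @ Lam
```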

Reference Literature

For a more in-depth look at Canonical Correlation, the best reference book that I am aware of is:

Tabachnick, B. G., and L. S. Fidell. (1996) Using Multivariate Statistics. 3rd Edition. HarperCollins College Publishers.

Several other books are good secondary references, including:

Afifi, A. A., and V. Clark. (1996) Computer-aided Multivariate Analysis. 3rd Edition. Chapman and Hall.

Cooley, W. W., and P. R. Lohnes. (1971) Multivariate Data Analysis. John Wiley & Sons, Inc.

A very good ecological paper describing the uses of Canonical Correlation is:

Calhoon, R. E., and D. L. Jameson. (1970) Canonical correlation between variation in weather and variation in size in the Pacific Tree Frog, Hyla regilla, in Southern California. Copeia 1: 14-144.

For a paper with a good ecological application of PCA, see:

Alisauskas, R. T. (1998) Winter range expansion and relationships between landscape and morphometrics of midcontinent lesser snow geese. Auk 115: 851-86.