Dimension Reduction and Classification Using PCA and Factor Analysis - A Short Overview
Laboratory for Interdisciplinary Statistical Analysis, Department of Statistics, Virginia Tech
http://www.stat.vt.edu/consult/
March 2, 2009
Outline
- PCA
- Factor Analysis
- Discussion: the difference between PCA and Factor Analysis
The Problem
What do you do when you have too many predictors in a model? For example: you have expression-level data for 1000 genes; or you have hundreds of customer attributes and want to build a predictive model from them; or you have second-by-second stock market data over a trading day for many stocks; or you have survey data where multiple questions capture the same kind of information (and are highly correlated).
The Cars
A researcher wants to build a model to find out which variables are most significant in predicting the demand for cars, but believes that many of the variables are highly correlated and that the study can be done effectively with a small number of variables without losing much information.
The Problem
We are given a data set with N observations on p variables, X = (X_1, ..., X_p), for a very large p.
Figure: Data with 11 possible predictors
The Problem
How do we reduce the number of columns in X without throwing away too much information?
The Problem
In JMP: Analyze > Multivariate Methods > Principal Components; for the correlations, Multivariate (tab) > Scatterplot Matrix.
The Problem
Notice the highly correlated variables! We will attempt to explain most of the variability in the data using only a small number of principal components (parsimony), if possible.
The Geometric Interpretation
We intend to come up with rotations and projections in p dimensions that capture most of the variability.
Figure: Plot of the data in three dimensions
The Geometric Interpretation - Eigenvalues and Eigenvectors
We can write the principal components as:
Y_1 = a_1' X
...
Y_p = a_p' X
such that the Y's are uncorrelated and the variance of each Y is as large as possible. We find the eigenvalues λ_1, ..., λ_p of the covariance (or correlation) matrix of the data and rank them by size. The a's are the corresponding eigenvectors, and the eigenvalues are the variances of the corresponding components. Since
Total population variance = λ_1 + ... + λ_p,
the proportion of variance explained by the k-th principal component = λ_k / (λ_1 + ... + λ_p).
The Geometric Interpretation - Summary
Principal components are determined by our predictors. There is a principal component for every eigenvalue. The value of the eigenvalue gives a measure of how much variation the corresponding principal component explains.
The Geometric Interpretation - Summary
By choosing the first few principal components (and hence the largest eigenvalues) we might be able to explain a lot of the variation among the predictors (not all!). Hence we throw away some information, but hopefully not much.
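The eigenvalue computation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the slides' JMP workflow; the toy data and all variable names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for correlated predictors: 100 observations, 4 variables
# that all track one common signal plus a little noise.
base = rng.normal(size=(100, 1))
data = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(4)])

# Work with the correlation matrix so variables on different scales are comparable.
R = np.corrcoef(data, rowvar=False)

# Eigenvalues are the variances of the principal components;
# eigenvectors give the coefficients a_k in Y_k = a_k' X.
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # rank by size, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of variance explained by each component: lambda_k / sum(lambda).
explained = eigvals / eigvals.sum()

# Scores: project the standardized data onto the eigenvectors.
Z = (data - data.mean(axis=0)) / data.std(axis=0)
scores = Z @ eigvecs
```

Because the four toy variables are highly correlated, the first component explains almost all of the variance, and the resulting score columns are uncorrelated, as the slide requires.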
The Cars
We have data about 387 cars with the following variables:
- Suggested Retail Price
- Invoice Price
- Engine Size (liters)
- Number of Cylinders (= -1 if rotary engine)
- Horsepower
- City Miles Per Gallon
- Highway Miles Per Gallon
- Weight (pounds)
- Wheel Base (inches)
- Length (inches)
- Width (inches)
The Cars Again
A researcher wants to build a model to find out which variables are most significant in predicting the demand for cars, but believes that many of the variables are highly correlated and that the study can be done effectively with a small number of variables without losing much information. But how do we choose a smaller number of predictors? Principal Components Analysis!
The Cars
Use JMP: Analyze > Multivariate Methods > Principal Components.
The Cars
Let us first look at the correlations between the variables.
Figure: Correlations
The Cars
What about the principal components? Can we interpret them?
Figure: Principal components
The Cars
How many principal components do we need? How much of the variation is explained?
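A common rule of thumb is to keep the smallest number of components whose cumulative proportion of explained variance crosses a threshold such as 90%. A minimal sketch, assuming the eigenvalues have already been computed; the numbers here are illustrative, not the actual cars output.

```python
import numpy as np

# Illustrative eigenvalues, ranked largest first (not the real cars values).
eigvals = np.array([6.2, 2.1, 1.0, 0.4, 0.2, 0.1])

# Proportion and cumulative proportion of variance explained.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Smallest k whose components together explain at least 90% of the variance.
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(k, cumulative[k - 1])  # → 3 0.93
```

With these eigenvalues the first three components explain 0.62 + 0.21 + 0.10 = 93% of the total variance, so k = 3.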
Key Points Principal components are functions of the predictors The first few principal components can give us almost all the information in terms of the variability in the data
PCA - Discussion
When is PCA useful?
- To reduce the number of predictors
- As a first step for a predictive model, where we would like to remove correlated variables
- General dimension reduction - expecting a low-dimensional structure where the higher dimensions are basically noise
Factor Analysis - The Problem
Sometimes the inherent structure of the data motivates the researcher to group the data based on some unseen underlying factors. This inherent structure can be identified through the correlation matrix of X.
Factor Analysis - The Subject Scores Problem
Consider examination scores in 6 subjects for 220 male students. The 6 subjects are Latin, English, History, Arithmetic, Algebra, and Geometry. Consider the correlation matrix for the scores:

            Latin  English  History  Arithmetic  Algebra  Geometry
Latin       1.000
English      .439   1.000
History      .410    .351    1.000
Arithmetic   .288    .354     .164     1.000
Algebra      .329    .320     .190      .595     1.000
Geometry     .248    .329     .181      .470      .464    1.000
Factor Analysis - The Problem
The researcher believes that the subject scores will be correlated amongst themselves in groups. A possible hypothesis: there are two underlying factors behind the students' scores - one that captures the liberal arts scores and another that captures the science scores. But how do we verify such a hypothesis?
Factor Analysis - Factor Loadings
For our problem the researcher thinks there are two underlying factors. The underlying factors correspond to two different sets of loadings on the 6 subjects:
Latin    = L_11 F_1 + L_12 F_2 + ε_1
English  = L_21 F_1 + L_22 F_2 + ε_2
...
Geometry = L_61 F_1 + L_62 F_2 + ε_6
The loadings L_ij will hopefully help us interpret the factors.
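One simple way to estimate the loadings is the principal component method: scale the leading eigenvectors of the correlation matrix by the square roots of their eigenvalues. This is only a sketch of one common estimation approach; other methods (e.g. maximum likelihood) give somewhat different numbers, so this will not reproduce the slides' table exactly.

```python
import numpy as np

# Correlation matrix for Latin, English, History, Arithmetic, Algebra, Geometry.
R = np.array([
    [1.000,  .439,  .410,  .288,  .329,  .248],
    [ .439, 1.000,  .351,  .354,  .320,  .329],
    [ .410,  .351, 1.000,  .164,  .190,  .181],
    [ .288,  .354,  .164, 1.000,  .595,  .470],
    [ .329,  .320,  .190,  .595, 1.000,  .464],
    [ .248,  .329,  .181,  .470,  .464, 1.000],
])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Two-factor loadings: column j of L is sqrt(lambda_j) * e_j.
m = 2
loadings = eigvecs[:, :m] * np.sqrt(eigvals[:m])

# Communality of each variable: variance explained by the two factors.
communalities = (loadings ** 2).sum(axis=1)
```

Each communality is the row sum of squared loadings, matching the "Communalities" column in the tables that follow.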
Factor Analysis - The Approach
- The data has underlying factors
- The researcher determines the number of factors
- Factor loadings are obtained through the covariance (or correlation) matrix
- The researcher interprets the factors based on the loadings
Factor Analysis - Factor Loadings for the Subject Scores

Variable     F1     F2    Communality
Latin       .553   .429      .490
English     .568   .288      .406
History     .392   .450      .356
Arithmetic  .740  -.273      .623
Algebra     .724  -.211      .569
Geometry    .595  -.132      .372

The factor loadings do not give us any immediately identifiable groups or factor interpretation. Or DO they? Communalities give a measure of how much of the variance of each variable is explained by the factor structure.
Factor Analysis - Factor Loadings Plot
Figure: Plot of factor loadings with two factors for the subject scores example
Factor Analysis - The Factor Rotation
The factors are not immediately identifiable. What do we do now? The factor structure, in terms of variance explained, remains unchanged if we rotate the factors. Let's rotate and see if the factor loadings become interpretable.
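The most common rotation is varimax; the slides do not name the rotation used, so treat this as a sketch of the standard iterative SVD formulation of varimax. The key property illustrated is the one the slide states: rotation leaves each variable's communality (row sum of squared loadings) unchanged.

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Rotate a loading matrix L (variables x factors) by the varimax criterion."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Update of the rotation via SVD of the varimax criterion's gradient.
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        )
        R = u @ vt
        if s.sum() < d * (1 + tol):   # stop when the criterion stabilizes
            break
        d = s.sum()
    return L @ R

# Unrotated loadings from the subject scores example.
L = np.array([
    [.553,  .429],
    [.568,  .288],
    [.392,  .450],
    [.740, -.273],
    [.724, -.211],
    [.595, -.132],
])
rotated = varimax(L)
```

The rotated matrix should resemble the rotated loadings table (up to column order and sign), while the communalities are exactly preserved.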
Factor Analysis - Rotated Factor Loadings for the Subject Scores

Variable     F1     F2    Communality
Latin       .369   .594      .490
English     .433   .467      .406
History     .211   .558      .356
Arithmetic  .789  -.001      .623
Algebra     .752  -.054      .569
Geometry    .604  -.083      .372

Rotation makes the two factors immediately identifiable: the second loads on the liberal arts subjects and the first on the mathematics subjects.
Factor Analysis - Rotated Factor Loadings Plot
Figure: Plot of rotated factor loadings with two factors for the subject scores example
Factor Analysis - Approach Summary
- Decide on the number of factors
- Obtain factor loadings for the variables
- Interpret the factors
- If the interpretation is not obvious, rotate the factors and check the loadings again
Factor Analysis - Applications
- Psychometrics, psychology, human factors: identify factors that explain a variety of results on different tests
- Marketing: identify the salient attributes consumers use to evaluate products in a category
- Physical sciences: geochemistry, ecology, and hydrochemistry
Differences between PCA and Factor Analysis
Principal components capture most of the variability in the data using fewer dimensions than those in which the data exist. Hence the principal components lie in the same space as the data. Factor analysis, conceptually, searches for underlying but unobserved factors that explain the correlation in the data. Hence the factors lie in a different space than the data.
Reference: Richard Johnson, Dean Wichern - Applied Multivariate Statistical Analysis, 5th edition.