Principal Component Analysis


L. Graesser

Introduction

Principal Component Analysis (PCA) is a popular technique for reducing the size of a dataset. Assume the dataset is structured as an m x n matrix in which each row represents a data sample and each column represents a feature. The size of the dataset is reduced by first linearly transforming the dataset, retaining the original dimensions, and then selecting the first k (k < n) columns of the resulting matrix.

PCA is useful for three reasons. First, it reduces the amount of memory a dataset takes up with minimal loss of information, and so can significantly speed up the runtime of learning algorithms. Second, it reveals the underlying structure of the data, and in particular the relationships between features. It shows how many uncorrelated relationships there are in a dataset; this is the rank of the dataset, the maximum number of linearly independent column vectors. More formally, PCA re-expresses the features of a dataset in an orthogonal basis of the same dimension as the original features. This has the nice property that the new features are uncorrelated; that is, the new features form an orthogonal set of vectors. Finally, PCA can help to prevent a learning algorithm from overfitting the training data.

The objective of this paper is to explain what PCA is and to explore when it is and is not useful for data analysis. Section 1 explains the mathematics of PCA. It starts with the desirable properties of the transformed dataset and works through the mathematics which guarantees these properties. It is intended to be understood by a reader with a basic understanding of linear algebra, but can be skipped by readers who wish only to see the application of PCA to datasets. Section 2 uses two datasets to demonstrate what happens to the principal components and to the accuracy of a simple principal component regression when the variance of the features changes. Section 3 explores the trade-off between dimensionality reduction using PCA and the performance of a linear model. It compares the performance of linear regression, ridge regression (regularized linear regression) and principal component regression in predicting the median household income of US counties. The objective of this section is to show how the accuracy of principal component regression changes as the number of principal components (dimensions) is reduced, and to assess how effective PCA is in preventing overfitting when compared to ridge regression. The paper ends with a summary of the results and a brief discussion of possible extensions. A separate appendix contains the detailed results of the median income analysis, and Matlab code for a number of functions that I found useful to write during this process is available on GitHub.
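Before working through the mathematics, the whole procedure can be summarised in a few lines of Matlab. This is a minimal sketch on randomly generated data, not the code from the appendix; the variable names (X, Xc, Y, k) are chosen here purely for convenience.

% Minimal PCA sketch: mean-normalize, transform with the SVD, keep the first k columns.
X = randn(200, 6);                      % example dataset: 200 samples, 6 features
m = size(X, 1);
Xc = X - repmat(mean(X, 1), m, 1);      % subtract the column means (mean deviation form)

[U, S, V] = svd(Xc, 'econ');            % Xc = U * S * V'
Y = Xc * V;                             % principal components, ordered by variance

k = 2;                                  % keep the first k columns
Yk = Y(:, 1:k);                         % reduced dataset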

Section 1: The Mathematics of Principal Component Analysis

Let X be an m x n matrix. Each row represents an individual data sample, so there are m samples, and each column represents a feature, so there are n features. Let X = [x_1, ..., x_n], where each x_i is a column vector in R^m. Principal Component Analysis is the process of transforming the dataset X into a new dataset Y = [y_1, ..., y_n], y_i in R^m. The objective is to find a transformation matrix P such that Y = XP and Y has the following properties. Throughout this derivation the resulting matrix Y is assumed to be in mean deviation form: the mean of each y_i has been subtracted from each element of y_i, which guarantees that the mean of each y_i is 0.

(A) Each feature y_i is uncorrelated with all other features. That is, the covariance (a measure of how much changes in one variable are associated with changes in another variable) between features is 0. Let µ_i = (1/m) Σ_{j=1}^{m} y_{ji} be the mean of the vector y_i. Then the covariance between two features y_i and y_j is

    Cov(y_i, y_j) = (y_i - µ_i)^T (y_j - µ_j) / (m - 1) = y_i^T y_j / (m - 1)

Note: the rightmost equality holds if and only if y_i and y_j are mean normalized, which by assumption they are.

(B) The features are ordered from highest to lowest variance, left to right: y_1 has the highest variance, y_n the lowest. This will turn out to be essential when reducing the dimensionality of the dataset Y by choosing the first k < n columns of Y.

(C) The total variance of the original dataset is unchanged; that is, X and Y have the same total variance.

The columns of Y are then called the principal components of X.

Mathematically, Principal Component Analysis is very elegant. At its heart is the property that every symmetric matrix (a matrix with A = A^T) is orthogonally diagonalizable (the Spectral Theorem for Symmetric Matrices, Lay). An orthogonal matrix is a matrix in which each column has norm (length) 1 and every column is orthogonal to every other column: if E = [e_1, ..., e_n] is orthogonal, then e_i^T e_j = 0 for i ≠ j and e_i^T e_i = 1. The theorem states that for any symmetric n x n matrix A there exists an orthogonal matrix E = [e_1, ..., e_n] and a diagonal matrix D such that

    A = E D E^{-1} = E D E^T, since E is orthogonal and square and so E^T E = I.
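To make property (A) concrete, the covariance computation above can be written directly in Matlab. This is a small sketch, reusing the mean-normalized matrix Xc and the sample count m from the sketch in the introduction; the off-diagonal entries of C are the pairwise covariances.

% Sample covariance matrix of the mean-normalized data:
% entry (i, j) is x_i^T x_j / (m - 1), the covariance of features i and j.
C = (Xc' * Xc) / (m - 1);
disp(C)

% PCA looks for a transformation Y = Xc * P whose covariance matrix is diagonal,
% i.e. all off-diagonal entries of (Y' * Y) / (m - 1) are zero.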

The transformation matrix P will be derived in two ways. First, it is derived using the fact that all symmetric matrices are orthogonally diagonalizable. Second, it is derived using the Singular Value Decomposition (SVD). Looking at both derivations is interesting because it highlights the relationship between the eigenvalues of X^T X, the variance of the feature vectors, and the Singular Value Decomposition, one of the most widely used matrix decompositions. In practice, it is often easiest to find P and the resulting Y using the SVD.

Derivation 1: Let Y = XP, and suppose that Y is mean-normalized. To satisfy property (A), the covariance of the columns of Y needs to be zero, which is equivalent to requiring that Y^T Y be a diagonal matrix. X^T X is a symmetric matrix (since (X^T X)^T = X^T (X^T)^T = X^T X), so there exists an orthogonal matrix E and a diagonal matrix D such that X^T X = E D E^T. So we have

    Y^T Y = (XP)^T (XP) = P^T X^T X P = P^T E D E^T P

Let P = E. Then

    Y^T Y = E^T E D E^T E = D                                        (1)

From this it is clear that property (A) is satisfied. Let (d_1, ..., d_n) be the diagonal entries of D. Then we have

    Y^T Y = [ y_1^T y_1   ...   y_1^T y_n ]     [ d_1          ]
            [    ...      ...      ...    ]  =  [      ...     ]  =  D     (2)
            [ y_n^T y_1   ...   y_n^T y_n ]     [          d_n ]

The covariance of features y_i and y_j, i ≠ j, is zero, which is what we were looking for. The fact that P is an orthogonal matrix also guarantees that property (C) is satisfied, since an orthogonal change of variables does not change the total variance of the data (Lay, Linear Algebra and its Applications).
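Derivation 1 can be carried out numerically with Matlab's eig. The following is a sketch, again reusing Xc from the earlier sketches; sorting the eigenvalues is not needed for property (A) but anticipates property (B).

% Derivation 1: use the eigenvectors of Xc' * Xc as the transformation matrix P.
A = Xc' * Xc;                          % symmetric n x n matrix
[E, D] = eig(A);                       % A = E * D * E', E orthogonal, D diagonal

[~, order] = sort(diag(D), 'descend'); % order eigenvalues from largest to smallest
E = E(:, order);

P = E;
Y1 = Xc * P;                           % principal components via Derivation 1

% Property (A): the off-diagonal entries of Y1' * Y1 are (numerically) zero.
G = Y1' * Y1;
disp(max(max(abs(G - diag(diag(G))))))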

At this point we know that a matrix P exists such that Y = XP satisfies properties (A) and (C). But we do not yet know whether property (B) is satisfied. The second derivation, using the Singular Value Decomposition, will make this clear. In what follows it is useful to keep in mind the properties of the matrices E and D from this derivation. The diagonal entries of D are the eigenvalues of X^T X (an eigenvalue is a scalar λ such that Ax = λx for some non-zero x), and the columns of E are the corresponding eigenvectors (an eigenvector x is a non-zero vector such that Ax = λx, where A is an n x n matrix and λ is a scalar; eigenvectors are said to correspond to eigenvalues). The columns of E form an orthogonal set, since any two eigenvectors corresponding to distinct eigenvalues (distinct eigenspaces) of a symmetric matrix are orthogonal (the Spectral Theorem for Symmetric Matrices).

Derivation 2: The following theorem (Lay, Linear Algebra and its Applications) will be crucial to guaranteeing that property (B) is satisfied, as it is closely related to the variance of the new feature vectors in Y. Let A be any matrix, let x range over unit vectors of appropriate dimension, and let λ_1 be the greatest eigenvalue of A^T A. Then the maximum length that Ax can have is

    max ||Ax|| = max ((Ax)^T (Ax))^{1/2} = max (x^T A^T A x)^{1/2} = λ_1^{1/2}

The Singular Value Decomposition is an extremely useful matrix decomposition. It states that any m x n matrix A can be decomposed as

    A = U Σ V^T                                                      (3)

where U is an orthogonal m x m matrix, V is an orthogonal n x n matrix, and Σ is an m x n matrix of the form

    Σ = [ D  0 ]
        [ 0  0 ]

(note: if Σ is square, the bottom right zero block is excluded). Here D is a diagonal r x r matrix, r = column rank(A) = rank(A), and the diagonal entries of D are the first r singular values of A ordered from largest to smallest, left to right: σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, where each σ_i is a non-zero diagonal entry of D.
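The correspondence between the squared singular values of X and the eigenvalues of X^T X is easy to check numerically. A sketch, reusing Xc and the SVD factors S computed in the sketch in the introduction:

% The singular values of Xc are the diagonal of S, largest to smallest.
sigma = diag(S);

% Their squares equal the eigenvalues of Xc' * Xc (sorted to match).
lambda = sort(eig(Xc' * Xc), 'descend');
disp([sigma.^2, lambda])            % the two columns agree up to rounding error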

Substituting X = U Σ V^T into Y^T Y = P^T X^T X P, we have

    Y^T Y = P^T X^T X P = P^T (U Σ V^T)^T (U Σ V^T) P = P^T V Σ^T U^T U Σ V^T P = P^T V Σ^T Σ V^T P

Let P = V. Then

    Y^T Y = V^T V Σ^T Σ V^T V = Σ^T Σ                                (4)

From this, we know that

    squared singular values of X: Σ^T Σ = D = eigenvalues of X^T X (ordered largest to smallest, left to right)
    right singular vectors of X:  V     = E = orthogonal set of eigenvectors of X^T X (ordered to correspond to the ordered eigenvalues in D)    (5)

Now we can prove that this transformation P does in fact guarantee that property (B) is satisfied. We know that X = U Σ V^T, so

    Y = XP = XV = U Σ V^T V = U Σ = [σ_1 u_1, ..., σ_r u_r, 0, ..., 0]    (6)

    y_i = σ_i u_i

Now consider the variance of the first feature of Y, y_1:

    Var(y_1) = Σ_{i=1}^{m} (y_{i1} - µ_1)^2 / (m - 1)
             = y_1^T y_1 / (m - 1)            since Y is mean normalized
             = (σ_1 u_1)^T (σ_1 u_1) / (m - 1)                            (7)
             = σ_1^2 u_1^T u_1 / (m - 1)      since σ_1 is a scalar
             = σ_1^2 / (m - 1)                since U is orthogonal, so u_1 is a unit vector
             = λ_1 / (m - 1)

where λ_1 is the largest eigenvalue of X^T X. The factor 1/(m - 1) is common to every feature, so the variance of y_1 is proportional to λ_1.
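Numerically, the two routes to Y agree, and the column variances come out in decreasing order. A sketch reusing Xc, U, S, V, sigma and m from the earlier sketches:

% The principal components can be computed as Xc * V or, equivalently, as U * S.
Y = Xc * V;
disp(max(max(abs(Y - U * S))))            % the two are equal up to rounding error

% The column variances are the squared singular values divided by (m - 1),
% and they are ordered from largest to smallest, as property (B) requires.
disp([var(Y, 0, 1)', sigma.^2 / (m - 1)])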

Recall that the maximum value that ||Xa||^2 can take, for an appropriately dimensioned unit vector a, is λ_1, the largest eigenvalue of X^T X. So it is not possible for the variance of any of the y-vectors to be larger than that of y_1. And since the transformation matrix V preserves the total variance of X, we know that y_1 is the vector representing the direction of maximal variance of the features (columns) of the matrix X. From an extension of the same theorem (Lay, Linear Algebra and its Applications), it can be established that y_2 is the vector representing the direction of the second highest variance of the matrix X, subject to the constraint that y_1 and y_2 are orthogonal. More generally, y_i is the vector representing the direction of maximum variance of X subject to y_i being orthogonal to y_j for all j < i. So the feature vectors (columns) of Y are ordered from the direction of highest variance in X to the direction of lowest variance, left to right.

It is now clear that our transformed dataset Y = XV, where V is the matrix of right singular vectors of the Singular Value Decomposition of X, has all of the properties that we were looking for. Having obtained our matrix of principal components, Y, we can reduce its dimensionality simply by selecting the leftmost k < n columns, knowing that these features contain the maximum possible information (variance) about the original dataset X for any k < n. The proportion of the variance of the original data captured in these k columns is the sum of the first k squared singular values of X divided by the sum of all of the squared singular values of X. How many principal components to choose depends on the application and the dataset. We will see an example of this in practice in Section 3.

Section 2: PCA applied: changing variable variance in principal component regression

The following simple examples demonstrate how the principal components of a dataset, and the performance of a principal component regression, change as the variance of the features changes. (Principal component regression is simply linear regression using the first k ≤ n columns of the transformed dataset Y = XV to predict a target variable.) To show this I created two sets of data, each containing five datasets. Each dataset contained two features, x_1 and x_2, and one target variable, y. All features were mean normalized. In every dataset x_1 is a strong predictor of y, and a linear model y = β_0 + β_1 x_1 has an R^2 of over 97% for each dataset. (R^2 is the percentage of the variance of the target variable explained by the model. It is calculated as follows: total sum of squares TSS = Σ_i (y_i - ȳ)^2, explained sum of squares ESS = Σ_i (f_i - ȳ)^2, residual sum of squares RSS = Σ_i (y_i - f_i)^2, and R^2 = 1 - RSS/TSS.) In the first set of five datasets, x_2 is also correlated with y, but the feature contains more noise, modeled by increasing the variance of x_2 in a way that is uncorrelated with y, and so it is a poorer predictor of y. The relationship between x_1, x_2, and y is displayed in the charts below.

[Figure: for each of the five datasets in the first set, scatter plots of y vs. x_2, y vs. x_1, and x_1 vs. x_2.]

As the variance of x_2 increases, the accuracy (measured by R^2) of a principal component regression using just the first principal component falls, despite the fact that y is best described as a function of one variable. Whilst principal component regression does not perform as poorly as the linear model y = β_0 + β_1 x_2 (see Table 2 below), it is significantly worse than the linear model y = β_0 + β_1 x_1. This is because the maximal direction of variance, the first principal component, is in each dataset a linear combination of both x_1 and x_2, with the relative weight of x_2 increasing as the variance of x_2 increases (see Table 1 below). So the presence of the high-variance but poorer predictor, x_2, makes the model worse.
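The kind of experiment described in this section can be reproduced with a short script. The sketch below is not the author's code; it generates one synthetic dataset of the first type (x_1 predictive, x_2 correlated with y but with extra noise whose scale is set by the assumed parameter noiseStd) and compares a one-component principal component regression with the two single-feature linear models.

% One synthetic dataset: x_1 predicts y well, x_2 is correlated with y but noisier.
m = 1000;
x1 = randn(m, 1);
y  = 2 * x1 + 0.2 * randn(m, 1);            % y is essentially a function of x_1
noiseStd = 3;                                % increase this to mimic datasets 2-5
x2 = x1 + noiseStd * randn(m, 1);            % correlated with y, but high variance

X  = [x1, x2];
Xc = X - repmat(mean(X, 1), m, 1);
yc = y - mean(y);

[~, ~, V] = svd(Xc, 'econ');
pc1 = Xc * V(:, 1);                          % first principal component

r2 = @(target, fitted) 1 - sum((target - fitted).^2) / sum((target - mean(target)).^2);

bPCR = pc1 \ yc;                             % principal component regression, 1 component
bX1  = Xc(:, 1) \ yc;                        % y = b * x_1
bX2  = Xc(:, 2) \ yc;                        % y = b * x_2

disp([r2(yc, pc1 * bPCR), r2(yc, Xc(:, 1) * bX1), r2(yc, Xc(:, 2) * bX2)])

With noiseStd small, the one-component regression tracks the x_1 model closely; as noiseStd grows, the first principal component tilts towards x_2 and the one-component R^2 falls, which is the pattern reported in Tables 1 and 2.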

[Table 1: coefficients of x_1 and x_2 in the first principal component for each of the five datasets in the first set, as the variance of x_2 increases.]

[Table 2: R^2 results for each of the five datasets in the first set (x_1 and x_2 both correlated with y, variance of x_2 increased), for four models: PCR with 2 components, PCR with 1 component, y = β_0 + β_1 x_1, and y = β_0 + β_1 x_2.]

In the second set of five datasets, x_2 is not correlated with y, and the variance of x_2 is gradually increased. The relationship between x_1, x_2, and y is displayed in the charts below.

[Figure: for each of the five datasets in the second set, scatter plots of y vs. x_2, y vs. x_1, and x_1 vs. x_2.]

This second example is much starker. The accuracy of the principal component regression with one component is barely affected until the variance of x_2 is close to that of x_1; once the variance of x_2 exceeds that of x_1, the model ceases to be useful, since the principal component regression stops being able to explain any of the variance of the target variable, y (see Table 4). This is to be expected, since the first principal component is almost entirely composed of the feature with the larger variance (see Table 3), which works only so long as that feature is the predictive feature, x_1.

[Table 3: coefficients of x_1 and x_2 in the first principal component for each of the five datasets in the second set, as the variance of x_2 increases.]

[Table 4: R^2 results for each of the five datasets in the second set (x_1 correlated with y, x_2 uncorrelated, variance of x_2 increased), for PCR with 2 components and PCR with 1 component.]

Both examples highlight the importance of ensuring that all features have the same variance before carrying out PCA, since the composition of the principal components is highly sensitive to the variance of the features. However, this may not completely solve the problem encountered in the first set of datasets. If features are correlated, then even if the variance of each feature is standardized to one, a linear combination of multiple features may have a variance greater than one. So if the best predictor of the target variable is a single feature, that feature has a lower variance than a linear combination of correlated features, and the first principal component will not be as predictive as that feature alone. In other words, the direction of highest variance in the data is not the most meaningful one, which is contrary to a fundamental assumption of PCA. Standardizing the variance of x_1 and x_2 and re-running the analysis on the first set of five datasets (see Table 5) makes this issue clear. Whilst the accuracy of the principal component regression with one principal component improves compared to when the feature variance was not standardized, it is still significantly worse than the linear model y = β_0 + β_1 x_1.

[Table 5: R^2 results for each of the five datasets in the first set after standardizing the variance of x_1 and x_2 to one, for four models: PCR with 2 components, PCR with 1 component, y = β_0 + β_1 x_1, and y = β_0 + β_1 x_2.]

Section 3: Predicting Median Household Income in US Counties

This section examines the effect of changing the number of principal components on the performance of principal component regression. The performance of principal component regression is then compared with linear regression and ridge regression, with the objective of evaluating whether PCA is a good approach to preventing overfitting.

The dataset used throughout this section is a set of 15 features for US counties (source: census.gov; see the references for a link to the data), and this information is used to predict median household income by county. The data is structured into a 3,143 x 15 matrix, X, in which each row represents the information for a single county and each column represents a single feature. The 15 features are as follows:

(1) Population, absolute
(2) Population measured in April, absolute
(3) Population measured at the end of the year, absolute
(4) Number of veterans, absolute
(5) Number of housing units, absolute
(6) Private non-farm employment, absolute
(7) Total number of firms, absolute
(8) Manufacturing shipments, $k
(9) Retail sales, value, $k
(10) Accommodation and food services sales, $k
(11) Foreign born persons, %
(12) High school graduates or higher, %
(13) Bachelor's degree or higher, %
(14) Persons below the poverty line, %
(15) Women owned firms, %

The two figures below plot each feature against median household income. The first figure plots the raw data. In the second figure the data has been mean-normalized and the features scaled by dividing each element of each feature vector by the corresponding standard deviation for that feature. Examining the plots, a first observation is that the range of values varies significantly from feature to feature. From the discussion in Section 2, it is clear that PCA is highly sensitive to the absolute range of the features, since it affects their variance. If the data were analyzed as is, the principal components would be dominated by the variance of the features with the largest absolute values.

This is not necessarily of interest, since it might be the case that variability in features with a smaller range of absolute values, for example the percentage of the population with a BA degree or higher, is a better predictor of median income than, say, the size of a county. So, in order to capture informative differences in the variation of each feature, all of the analysis is performed on the scaled dataset.

It also appears that features 1-10 are highly correlated and are essentially different measures of the size of a county. This is confirmed by examining the covariance matrix of the features, the columns of X (see the Appendix). Feature 11, the percent of foreign born persons, is also somewhat correlated with features 1-10 and not with any others. Feature 12, the percent of high school graduates or higher, is not correlated with features 1-11, but it is reasonably strongly correlated with the percent holding a bachelor's degree or higher (feature 13) and, inversely, with the percent of persons below the poverty line (feature 14). Feature 13, the percent holding a bachelor's degree or higher, is most strongly correlated with feature 12, and somewhat, inversely, with feature 14. Feature 15, the percent of women owned firms, is not strongly correlated with any other feature.
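The scaling step and the correlation check described above are straightforward in Matlab. A sketch, where X is assumed to be the county feature matrix already loaded from the census.gov file listed in the references (the loading code is not shown here):

% Assume X is the m x 15 county feature matrix.
[m, n] = size(X);

% Mean-normalize and scale each feature by its standard deviation.
Xs = (X - repmat(mean(X, 1), m, 1)) ./ repmat(std(X, 0, 1), m, 1);

% Covariance matrix of the scaled features: for standardized columns this is
% also the correlation matrix, and it shows the block of correlated size features.
R = (Xs' * Xs) / (m - 1);
disp(R)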

[Figure: each of the 15 features plotted against median household income, raw data, one panel per feature.]

[Figure: each of the 15 features plotted against median household income, mean-normalized and scaled data, one panel per feature.]

Given the high degree of correlation between features, it should be possible to reduce the number of principal components used without having a significant impact on the performance of a principal component regression model. The coefficient matrix (see the Appendix) helps us to learn more about the structure of the data. Component 1 most heavily weights features 1-10 of the original dataset. This supports the hypothesis that features 1-10 are related variations of one measure, the size of a county. Component 2 is mostly composed of three further features, and a linear combination of these three features is the next highest direction of variance in the data. Components 3 and 4 are mostly a combination of features 11 and 15, whilst component 5 is mostly a combination of features 13 and 14. The largest direction of variance in the data is county size, and the next four highest directions of variance are mostly combinations of the percentage features. However, it may still have been useful to include all of features 1-10, since the first 5 principal components explain the majority of the variance in the data, but not all of it (see the tables below).

[Tables: percentage of variance explained by the first k principal components, for k = 1 to 15.]

Next, I compared the performance of principal component regression with linear regression and ridge regression. (Here principal component regression is simply linear regression using k ≤ n columns of the transformed dataset X_PCA = XV to predict the target variable, y.) What follows is a brief summary of the methodology and results; see the Appendix for further detail.

The dataset was split into a randomly selected training set of counties, with the remainder kept as a holdout set for testing how well the models generalized to unseen data. For both principal component regression and ridge regression, 10-fold cross-validation was used to select the number of principal components and the value of the regularization parameter, k. (The training data was further split into 10 different sets of training and validation data. The training data in each set contained 90% of the total training set samples, randomly selected, and the remaining 10% was held out as a validation set to test the model on unseen data.) Whilst the value of k that achieved the highest performance on the validation set was k = 0.1, I chose a more conservative k = 3, which also achieved a good result on the validation set.

As the number of principal components is reduced from the full set, the performance of a principal component regression on the training data initially stays stable, with an R^2 of roughly 77% (see the graph below), after which it starts to fall, dropping precipitously with fewer than 5 components. This suggests that there are a minimum of 5 dimensions that are critical to predicting median county income. The performance of the model on unseen test data varies significantly when more than 5 components are used, suggesting some overfitting of the data, but it follows the same pattern as the training data with fewer than 5 components. Five principal components therefore appears to be the optimal number to use.

[Figure: principal component regression R^2 on the training and test sets as a function of the number of principal components.]
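The two quantities used to choose the number of components, the cumulative percentage of variance explained and the R^2 of a principal component regression on held-out data, can be computed as in the sketch below. This is a simplified stand-in for the 10-fold cross-validation described above (a single random training/validation split is used here), and it assumes Xs is the scaled feature matrix from the earlier sketch and y is the median household income vector.

% Cumulative percentage of variance explained by the first k components.
[m, n] = size(Xs);
[U, S, V] = svd(Xs, 'econ');
sigma2 = diag(S).^2;
pctExplained = 100 * cumsum(sigma2) / sum(sigma2);
disp(pctExplained')

% R^2 of a principal component regression on a held-out validation set,
% for every possible number of components k = 1..n.
idx = randperm(m)';
nTrain = round(0.9 * m);
train = idx(1:nTrain);
test  = idx(nTrain+1:end);

rsqPCR = zeros(n, 1);
for k = 1:n
    Z = Xs * V(:, 1:k);                                  % first k principal components
    b = [ones(nTrain, 1), Z(train, :)] \ y(train);       % linear regression with intercept
    pred = [ones(m - nTrain, 1), Z(test, :)] * b;
    rsqPCR(k) = 1 - sum((y(test) - pred).^2) / sum((y(test) - mean(y(test))).^2);
end
disp(rsqPCR')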

Once the number of principal components and the size of k had been fixed (at 5 principal components and k = 3), each model was tested on the original test dataset using the results from the training data. Finally, six further training and holdout datasets were generated, the data split randomly between the two with the same training:test ratio. Taking inspiration from Janecek, Gansterer, Demel and Ecker's paper, On the Relationship between Feature Selection and Classification Accuracy (JMLR: Workshop and Conference Proceedings 4: 90-105), I also created two combined models to test on these additional datasets. For the first combined model, the first 5 principal components were appended to the original matrix, after which a normal linear regression process was followed. The second model used the same combined dataset but followed a ridge regression process. For each set, each model was trained on the training data and tested on the holdout dataset. The performance of all the models (R^2) on the unseen holdout sets is summarized below.

[Table: R^2 of each model (PCR, LR, RR, LR + PCR, RR + PCR) on each of the seven holdout test sets, together with each model's average (bold font = best model per test).]
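The two combined models and the ridge regression baseline can be expressed compactly. The sketch below illustrates the procedure described above rather than reproducing the author's code; it assumes Xs, y, V, train and test from the earlier sketches, uses the closed-form ridge solution with the regularization parameter called k as in the text, and leaves the intercept unpenalized, which is one common convention.

% Combined design matrix: the scaled features plus the first 5 principal components.
Z5 = Xs * V(:, 1:5);
Xcomb = [Xs, Z5];

% Linear regression on the combined matrix (LR + PCR).
% (Matlab will warn that A is rank deficient; the fitted values are unaffected.)
A = [ones(numel(train), 1), Xcomb(train, :)];
bLR = A \ y(train);

% Ridge regression on the combined matrix (RR + PCR): minimize ||A*b - y||^2 + k*||b||^2.
k = 3;
penalty = k * eye(size(A, 2));
penalty(1, 1) = 0;                          % do not shrink the intercept
bRR = (A' * A + penalty) \ (A' * y(train));

% Compare holdout R^2 of the two combined models.
At = [ones(numel(test), 1), Xcomb(test, :)];
r2 = @(t, f) 1 - sum((t - f).^2) / sum((t - mean(t)).^2);
disp([r2(y(test), At * bLR), r2(y(test), At * bRR)])

As the paper notes, appending principal components to the full feature matrix adds no new information for ordinary least squares, so the LR + PCR predictions match plain linear regression; the ridge variant can differ because the penalty treats the duplicated directions differently.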

Overall, ridge regression generalized best to unseen data, whilst principal component regression performed worst. The strong performance of the simple linear regression provides some insight as to why. Since the linear model generalized well to unseen data, this suggests that although there was high correlation between the first 10 features, they still contained information that was useful for predicting median income. That is, the informative number of dimensions in the dataset was closer to 15 than to 5, the number of dimensions chosen for the principal component regression.

The combined models were not a success. Linear regression plus PCA generated exactly the same results as linear regression. This is to be expected, given that the principal components are linear combinations of the columns of the original dataset, and so adding principal components simply duplicates information. Ridge regression plus PCA was more interesting, since on two of six occasions this model achieved the best result. However, in these cases the performance of the plain ridge regression was close, and overall ridge regression outperformed the combined model.

Examining the difference in performance between the training and the holdout sets (displayed in the tables below) shows that principal component regression does generalize relatively better than linear regression, but not as well as ridge regression. Linear regression performed best on the training data but also overfit the most, with a 3.3 percentage point drop in performance between the training and test data. In contrast, principal component regression had a smaller drop, and ridge regression overfit the least, with a drop of just over one percentage point.

[Table: R^2 of each model (PCR, LR, RR, LR + PCR, RR + PCR) on each of the seven training sets, together with each model's average (bold font = best model per test). Note: linear regression and linear regression + PCA lead to the same results, so there is one less result for LR + PCA.]

Comparing average performance: training set vs. test set

[Table: for each model (PCR, LR, RR, LR + PCR, RR + PCR), and averaged across models, the average training R^2, the average test R^2, and the delta (test minus training).]

Based on this analysis, it appears that ridge regression is the best model to use if overfitting is a concern. It has the benefit of not discarding any information which could potentially be useful, while penalizing large coefficients, leading to better performance on unseen data. It is, however, interesting to see that principal component regression achieves performance within a few percentage points of ridge regression despite using a dataset one third of the size.

Section 4: Conclusion

Principal Component Analysis reveals useful information about the nature and structure of a set of features, and so is a valuable process to follow during initial data exploration. Additionally, it provides an elegant way to reduce the size of a dataset with minimal loss of information, thus minimizing the fall in the performance of a linear model as the size of the dataset is reduced. However, the performance of principal component regression suggests that if data compression is not necessary, a better approach to preventing overfitting of linear models is to use ridge regression.

It would be interesting to extend this analysis by comparing the performance of principal component regression and ridge regression on a dataset for which a simple linear model significantly overfits, i.e. does not generalize well to unseen data. Additionally, it would be informative to explore the effect of applying principal component analysis to reduce the size of a dataset used with more complex models such as neural networks or k-nearest-neighbors classification. I would be interested to understand under what circumstances, if any, reducing the size of a dataset using PCA improves the performance of a predictive model on unseen data when compared with training data.

Acknowledgements

Thank you to Professor Margaret Wright for giving me the idea for a project on principal component regression in the first place, and for pointing me towards the excellent tutorial on the subject by Jonathon Shlens; to Clinical Assistant Professor Sophie Marques for discussing the mathematics of orthogonal transformations and for sparking the idea of systematically changing the variance of x_2; and to Professor David Rosenberg for suggesting that I compare principal component regression to ridge regression and that I read Elements of Statistical Learning.

References

(1) Linear Algebra and its Applications, David Lay, Chapter 7
(2) Elements of Statistical Learning, Hastie, Tibshirani, Friedman, Chapter 3
(3) A Tutorial on Principal Component Analysis, Jonathon Shlens
(4) On the Relationship between Feature Selection and Classification Accuracy, Janecek, Gansterer, Demel, Ecker, JMLR: Workshop and Conference Proceedings
(5) Coursera, Machine Learning, Andrew Ng, week 7 (K-means clustering and PCA)
(6) PCA or Polluting your Clever Analysis (this gave me the idea for examining the performance of principal component regression as the variance of the features and their relationship to the predictor variable changed): http://www.r-bloggers.com/pca-or-polluting-your-clever-analysis/
(7) Mathworks: tutorials on principal component analysis and ridge regression
(8) Modern Regression: Ridge Regression, Ryan Tibshirani, March 2013
(9) Lecture 17: Principal Component Analysis, Shippensburg University of Pennsylvania
(10) PCA, SVD, and LDA, Shanshan Ding
(11) http://yatani.jp/teaching/doku.php?id=hcistats:PCA
(12) https://en.wikipedia.org/wiki/Principal_component_analysis
(13) http://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
(14) US County Data: census.gov, http://quickfacts.census.gov/qfd/download_data.html


More information

Principal Component Analysis (PCA) Principal Component Analysis (PCA)

Principal Component Analysis (PCA) Principal Component Analysis (PCA) Recall: Eigenvectors of the Covariance Matrix Covariance matrices are symmetric. Eigenvectors are orthogonal Eigenvectors are ordered by the magnitude of eigenvalues: λ 1 λ 2 λ p {v 1, v 2,..., v n } Recall:

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision) CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

https://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) 2 count = 0 3 for x in self.x_test_ridge: 4 5 prediction = np.matmul(self.w_ridge,x) 6 ###ADD THE COMPUTED MEAN BACK TO THE PREDICTED VECTOR### 7 prediction = self.ss_y.inverse_transform(prediction) 8

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson PCA Review! Supervised learning! Fisher linear discriminant analysis! Nonlinear discriminant analysis! Research example! Multiple Classes! Unsupervised

More information

MATH 829: Introduction to Data Mining and Analysis Principal component analysis

MATH 829: Introduction to Data Mining and Analysis Principal component analysis 1/11 MATH 829: Introduction to Data Mining and Analysis Principal component analysis Dominique Guillot Departments of Mathematical Sciences University of Delaware April 4, 2016 Motivation 2/11 High-dimensional

More information

CSE 554 Lecture 7: Alignment

CSE 554 Lecture 7: Alignment CSE 554 Lecture 7: Alignment Fall 2012 CSE554 Alignment Slide 1 Review Fairing (smoothing) Relocating vertices to achieve a smoother appearance Method: centroid averaging Simplification Reducing vertex

More information

Kernel PCA, clustering and canonical correlation analysis

Kernel PCA, clustering and canonical correlation analysis ernel PCA, clustering and canonical correlation analsis Le Song Machine Learning II: Advanced opics CSE 8803ML, Spring 2012 Support Vector Machines (SVM) 1 min w 2 w w + C j ξ j s. t. w j + b j 1 ξ j,

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

Classification 2: Linear discriminant analysis (continued); logistic regression

Classification 2: Linear discriminant analysis (continued); logistic regression Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:

More information

The Mathematics of Facial Recognition

The Mathematics of Facial Recognition William Dean Gowin Graduate Student Appalachian State University July 26, 2007 Outline EigenFaces Deconstruct a known face into an N-dimensional facespace where N is the number of faces in our data set.

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Linear Algebra Review. Fei-Fei Li

Linear Algebra Review. Fei-Fei Li Linear Algebra Review Fei-Fei Li 1 / 37 Vectors Vectors and matrices are just collections of ordered numbers that represent something: movements in space, scaling factors, pixel brightnesses, etc. A vector

More information

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T. Notes on singular value decomposition for Math 54 Recall that if A is a symmetric n n matrix, then A has real eigenvalues λ 1,, λ n (possibly repeated), and R n has an orthonormal basis v 1,, v n, where

More information