Principal Component Analysis


L. Graesser

Introduction

Principal Component Analysis (PCA) is a popular technique for reducing the size of a dataset. Assume the dataset is structured as an m x n matrix in which each row represents a data sample and each column represents a feature. The size of the dataset is reduced by first linearly transforming the dataset, retaining the original dimensions, and then selecting the first k (k < n) columns of the resulting matrix.

PCA is useful for three reasons. First, it reduces the amount of memory a dataset takes up with minimal loss of information, and so can significantly speed up the runtime of learning algorithms. Second, it reveals the underlying structure of the data, and in particular the relationships between features. It shows how many uncorrelated relationships there are in a dataset; this is the rank of the dataset, the maximum number of linearly independent column vectors. More formally, PCA re-expresses the features of a dataset in an orthogonal basis of the same dimension as the original features. This has the nice property that the new features are uncorrelated; that is, the new features form an orthogonal set of vectors. Finally, PCA can help to prevent a learning algorithm from overfitting the training data.

The objective of this paper is to explain what PCA is and to explore when it is and is not useful for data analysis. Section 1 explains the mathematics of PCA. It starts with the desirable properties of the transformed dataset and works through the mathematics which guarantees these properties. It is intended to be understood by a reader with a basic understanding of linear algebra, but can be skipped by readers who wish only to see the application of PCA to datasets. Section 2 uses two datasets to demonstrate what happens to the principal components and to the accuracy of a simple principal component regression when the variance of the features changes. Section 3 explores the trade-off between dimensionality reduction using PCA and the performance of a linear model. It compares the performance of linear regression, ridge regression (regularized linear regression) and principal component regression in predicting the median household income of US counties. The objective of this section is to show how the accuracy of principal component regression changes as the number of principal components (dimensions) is reduced, and to assess how effective PCA is in preventing overfitting when compared to ridge regression. The paper ends with a summary of the results and a brief discussion of possible extensions. A separate appendix contains the detailed results of the median income analysis, and Matlab code for a number of functions that I found useful to write during this process is available on GitHub.
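Before working through the mathematics, the whole procedure can be summarised in a few lines of Matlab. This is a minimal sketch on randomly generated data, not the code from the appendix; the variable names (X, Xc, Y, k) are chosen here purely for convenience.

% Minimal PCA sketch: mean-normalize, transform with the SVD, keep the first k columns.
X = randn(200, 6);                      % example dataset: 200 samples, 6 features
m = size(X, 1);
Xc = X - repmat(mean(X, 1), m, 1);      % subtract the column means (mean deviation form)

[U, S, V] = svd(Xc, 'econ');            % Xc = U * S * V'
Y = Xc * V;                             % principal components, ordered by variance

k = 2;                                  % keep the first k columns
Yk = Y(:, 1:k);                         % reduced dataset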

Section 1: The Mathematics of Principal Component Analysis

Let X be an m x n matrix. Each row represents an individual data sample, so there are m samples, and each column represents a feature, so there are n features. Let X = [x_1, ..., x_n], where each x_i is a column vector in R^m. Principal Component Analysis is the process of transforming the dataset X into a new dataset Y = [y_1, ..., y_n], y_i in R^m. The objective is to find a transformation matrix P such that Y = XP and Y has the following properties. Throughout this derivation the resulting matrix Y is assumed to be in mean deviation form: the mean of each y_i has been subtracted from each element of y_i, which guarantees that the mean of each y_i is 0.

(A) Each feature y_i is uncorrelated with all other features. That is, the covariance (a measure of how much changes in one variable are associated with changes in another variable) between features is 0. Let µ_i = (1/m) Σ_{j=1}^{m} y_{ji} be the mean of the vector y_i. Then the covariance between two features y_i and y_j is

    Cov(y_i, y_j) = (y_i - µ_i)^T (y_j - µ_j) / (m - 1) = y_i^T y_j / (m - 1)

Note: the rightmost equality holds if and only if y_i and y_j are mean normalized, which by assumption they are.

(B) The features are ordered from highest to lowest variance, left to right: y_1 has the highest variance, y_n the lowest. This will turn out to be essential when reducing the dimensionality of the dataset Y by choosing the first k < n columns of Y.

(C) The total variance of the original dataset is unchanged; that is, X and Y have the same total variance.

The columns of Y are then called the principal components of X.

Mathematically, Principal Component Analysis is very elegant. At its heart is the property that every symmetric matrix (a matrix with A = A^T) is orthogonally diagonalizable (the Spectral Theorem for Symmetric Matrices, Lay). An orthogonal matrix is a matrix in which each column has norm (length) 1 and every column is orthogonal to every other column: if E = [e_1, ..., e_n] is orthogonal, then e_i^T e_j = 0 for i ≠ j and e_i^T e_i = 1. The theorem states that for any symmetric n x n matrix A there exists an orthogonal matrix E = [e_1, ..., e_n] and a diagonal matrix D such that

    A = E D E^{-1} = E D E^T, since E is orthogonal and square and so E^T E = I.
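To make property (A) concrete, the covariance computation above can be written directly in Matlab. This is a small sketch, reusing the mean-normalized matrix Xc and the sample count m from the sketch in the introduction; the off-diagonal entries of C are the pairwise covariances.

% Sample covariance matrix of the mean-normalized data:
% entry (i, j) is x_i^T x_j / (m - 1), the covariance of features i and j.
C = (Xc' * Xc) / (m - 1);
disp(C)

% PCA looks for a transformation Y = Xc * P whose covariance matrix is diagonal,
% i.e. all off-diagonal entries of (Y' * Y) / (m - 1) are zero.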

The transformation matrix P will be derived in two ways. First, it is derived using the fact that all symmetric matrices are orthogonally diagonalizable. Second, it is derived using the Singular Value Decomposition (SVD). Looking at both derivations is interesting because it highlights the relationship between the eigenvalues of X^T X, the variance of the feature vectors, and the Singular Value Decomposition, one of the most widely used matrix decompositions. In practice, it is often easiest to find P and the resulting Y using the SVD.

Derivation 1: Let Y = XP, and suppose that Y is mean-normalized. To satisfy property (A), the covariance of the columns of Y needs to be zero, which is equivalent to requiring that Y^T Y be a diagonal matrix. X^T X is a symmetric matrix (since (X^T X)^T = X^T (X^T)^T = X^T X), so there exists an orthogonal matrix E and a diagonal matrix D such that X^T X = E D E^T. So we have

    Y^T Y = (XP)^T (XP) = P^T X^T X P = P^T E D E^T P

Let P = E. Then

    Y^T Y = E^T E D E^T E = D                                        (1)

From this it is clear that property (A) is satisfied. Let (d_1, ..., d_n) be the diagonal entries of D. Then we have

    Y^T Y = [ y_1^T y_1   ...   y_1^T y_n ]     [ d_1          ]
            [    ...      ...      ...    ]  =  [      ...     ]  =  D     (2)
            [ y_n^T y_1   ...   y_n^T y_n ]     [          d_n ]

The covariance of features y_i and y_j, i ≠ j, is zero, which is what we were looking for. The fact that P is an orthogonal matrix also guarantees that property (C) is satisfied, since an orthogonal change of variables does not change the total variance of the data (Lay, Linear Algebra and its Applications).
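Derivation 1 can be carried out numerically with Matlab's eig. The following is a sketch, again reusing Xc from the earlier sketches; sorting the eigenvalues is not needed for property (A) but anticipates property (B).

% Derivation 1: use the eigenvectors of Xc' * Xc as the transformation matrix P.
A = Xc' * Xc;                          % symmetric n x n matrix
[E, D] = eig(A);                       % A = E * D * E', E orthogonal, D diagonal

[~, order] = sort(diag(D), 'descend'); % order eigenvalues from largest to smallest
E = E(:, order);

P = E;
Y1 = Xc * P;                           % principal components via Derivation 1

% Property (A): the off-diagonal entries of Y1' * Y1 are (numerically) zero.
G = Y1' * Y1;
disp(max(max(abs(G - diag(diag(G))))))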

At this point we know that a matrix P exists such that Y = XP satisfies properties (A) and (C). But we do not yet know whether property (B) is satisfied. The second derivation, using the Singular Value Decomposition, will make this clear. In what follows it is useful to keep in mind the properties of the matrices E and D from this derivation. The diagonal entries of D are the eigenvalues of X^T X (an eigenvalue is a scalar λ such that Ax = λx for some non-zero x), and the columns of E are the corresponding eigenvectors (an eigenvector x is a non-zero vector such that Ax = λx, where A is an n x n matrix and λ is a scalar; eigenvectors are said to correspond to eigenvalues). The columns of E form an orthogonal set, since any two eigenvectors corresponding to distinct eigenvalues (distinct eigenspaces) of a symmetric matrix are orthogonal (the Spectral Theorem for Symmetric Matrices).

Derivation 2: The following theorem (Lay, Linear Algebra and its Applications) will be crucial to guaranteeing that property (B) is satisfied, as it is closely related to the variance of the new feature vectors in Y. Let A be any matrix, let x range over unit vectors of appropriate dimension, and let λ_1 be the greatest eigenvalue of A^T A. Then the maximum length that Ax can have is

    max ||Ax|| = max ((Ax)^T (Ax))^{1/2} = max (x^T A^T A x)^{1/2} = λ_1^{1/2}

The Singular Value Decomposition is an extremely useful matrix decomposition. It states that any m x n matrix A can be decomposed as

    A = U Σ V^T                                                      (3)

where U is an orthogonal m x m matrix, V is an orthogonal n x n matrix, and Σ is an m x n matrix of the form

    Σ = [ D  0 ]
        [ 0  0 ]

(note: if Σ is square, the bottom right zero block is excluded). Here D is a diagonal r x r matrix, r = column rank(A) = rank(A), and the diagonal entries of D are the first r singular values of A ordered from largest to smallest, left to right: σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, where each σ_i is a non-zero diagonal entry of D.
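The correspondence between the squared singular values of X and the eigenvalues of X^T X is easy to check numerically. A sketch, reusing Xc and the SVD factors S computed in the sketch in the introduction:

% The singular values of Xc are the diagonal of S, largest to smallest.
sigma = diag(S);

% Their squares equal the eigenvalues of Xc' * Xc (sorted to match).
lambda = sort(eig(Xc' * Xc), 'descend');
disp([sigma.^2, lambda])            % the two columns agree up to rounding error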

Substituting X = U Σ V^T into Y^T Y = P^T X^T X P, we have

    Y^T Y = P^T X^T X P = P^T (U Σ V^T)^T (U Σ V^T) P = P^T V Σ^T U^T U Σ V^T P = P^T V Σ^T Σ V^T P

Let P = V. Then

    Y^T Y = V^T V Σ^T Σ V^T V = Σ^T Σ                                (4)

From this, we know that

    squared singular values of X: Σ^T Σ = D = eigenvalues of X^T X (ordered largest to smallest, left to right)
    right singular vectors of X:  V     = E = orthogonal set of eigenvectors of X^T X (ordered to correspond to the ordered eigenvalues in D)    (5)

Now we can prove that this transformation P does in fact guarantee that property (B) is satisfied. We know that X = U Σ V^T, so

    Y = XP = XV = U Σ V^T V = U Σ = [σ_1 u_1, ..., σ_r u_r, 0, ..., 0]    (6)

    y_i = σ_i u_i

Now consider the variance of the first feature of Y, y_1:

    Var(y_1) = Σ_{i=1}^{m} (y_{i1} - µ_1)^2 / (m - 1)
             = y_1^T y_1 / (m - 1)            since Y is mean normalized
             = (σ_1 u_1)^T (σ_1 u_1) / (m - 1)                            (7)
             = σ_1^2 u_1^T u_1 / (m - 1)      since σ_1 is a scalar
             = σ_1^2 / (m - 1)                since U is orthogonal, so u_1 is a unit vector
             = λ_1 / (m - 1)

where λ_1 is the largest eigenvalue of X^T X. The factor 1/(m - 1) is common to every feature, so the variance of y_1 is proportional to λ_1.
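Numerically, the two routes to Y agree, and the column variances come out in decreasing order. A sketch reusing Xc, U, S, V, sigma and m from the earlier sketches:

% The principal components can be computed as Xc * V or, equivalently, as U * S.
Y = Xc * V;
disp(max(max(abs(Y - U * S))))            % the two are equal up to rounding error

% The column variances are the squared singular values divided by (m - 1),
% and they are ordered from largest to smallest, as property (B) requires.
disp([var(Y, 0, 1)', sigma.^2 / (m - 1)])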

Recall that the maximum value that ||Xa||^2 can take, for an appropriately dimensioned unit vector a, is λ_1, the largest eigenvalue of X^T X. So it is not possible for the variance of any of the y-vectors to be larger than that of y_1. And since the transformation matrix V preserves the total variance of X, we know that y_1 is the vector representing the direction of maximal variance of the features (columns) of the matrix X. From an extension of the same theorem (Lay, Linear Algebra and its Applications), it can be established that y_2 is the vector representing the direction of the second highest variance of the matrix X, subject to the constraint that y_1 and y_2 are orthogonal. More generally, y_i is the vector representing the direction of maximum variance of X subject to y_i being orthogonal to y_j for all j < i. So the feature vectors (columns) of Y are ordered from the direction of highest variance in X to the direction of lowest variance, left to right.

It is now clear that our transformed dataset Y = XV, where V is the matrix of right singular vectors of the Singular Value Decomposition of X, has all of the properties that we were looking for. Having obtained our matrix of principal components, Y, we can reduce its dimensionality simply by selecting the leftmost k < n columns, knowing that these features contain the maximum possible information (variance) about the original dataset X for any k < n. The proportion of the variance of the original data captured in these k columns is the sum of the first k squared singular values of X divided by the sum of all of the squared singular values of X. How many principal components to choose depends on the application and the dataset. We will see an example of this in practice in Section 3.

Section 2: PCA applied: changing variable variance in principal component regression

The following simple examples demonstrate how the principal components of a dataset, and the performance of a principal component regression, change as the variance of the features changes. (Principal component regression is simply linear regression using the first k ≤ n columns of the transformed dataset Y = XV to predict a target variable.) To show this I created two sets of data, each containing five datasets. Each dataset contained two features, x_1 and x_2, and one target variable, y. All features were mean normalized. In every dataset x_1 is a strong predictor of y, and a linear model y = β_0 + β_1 x_1 has an R^2 of over 97% for each dataset. (R^2 is the percentage of the variance of the target variable explained by the model. It is calculated as follows: total sum of squares TSS = Σ_i (y_i - ȳ)^2, explained sum of squares ESS = Σ_i (f_i - ȳ)^2, residual sum of squares RSS = Σ_i (y_i - f_i)^2, and R^2 = 1 - RSS/TSS.) In the first set of five datasets, x_2 is also correlated with y, but the feature contains more noise, modeled by increasing the variance of x_2 in a way that is uncorrelated with y, and so it is a poorer predictor of y. The relationship between x_1, x_2, and y is displayed in the charts below.

[Figure: for each of the five datasets in the first set, scatter plots of y vs. x_2, y vs. x_1, and x_1 vs. x_2.]

As the variance of x_2 increases, the accuracy (measured by R^2) of a principal component regression using just the first principal component falls, despite the fact that y is best described as a function of one variable. Whilst principal component regression does not perform as poorly as the linear model y = β_0 + β_1 x_2 (see Table 2 below), it is significantly worse than the linear model y = β_0 + β_1 x_1. This is because the maximal direction of variance, the first principal component, is in each dataset a linear combination of both x_1 and x_2, with the relative weight of x_2 increasing as the variance of x_2 increases (see Table 1 below). So the presence of the high-variance but poorer predictor, x_2, makes the model worse.
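The kind of experiment described in this section can be reproduced with a short script. The sketch below is not the author's code; it generates one synthetic dataset of the first type (x_1 predictive, x_2 correlated with y but with extra noise whose scale is set by the assumed parameter noiseStd) and compares a one-component principal component regression with the two single-feature linear models.

% One synthetic dataset: x_1 predicts y well, x_2 is correlated with y but noisier.
m = 1000;
x1 = randn(m, 1);
y  = 2 * x1 + 0.2 * randn(m, 1);            % y is essentially a function of x_1
noiseStd = 3;                                % increase this to mimic datasets 2-5
x2 = x1 + noiseStd * randn(m, 1);            % correlated with y, but high variance

X  = [x1, x2];
Xc = X - repmat(mean(X, 1), m, 1);
yc = y - mean(y);

[~, ~, V] = svd(Xc, 'econ');
pc1 = Xc * V(:, 1);                          % first principal component

r2 = @(target, fitted) 1 - sum((target - fitted).^2) / sum((target - mean(target)).^2);

bPCR = pc1 \ yc;                             % principal component regression, 1 component
bX1  = Xc(:, 1) \ yc;                        % y = b * x_1
bX2  = Xc(:, 2) \ yc;                        % y = b * x_2

disp([r2(yc, pc1 * bPCR), r2(yc, Xc(:, 1) * bX1), r2(yc, Xc(:, 2) * bX2)])

With noiseStd small, the one-component regression tracks the x_1 model closely; as noiseStd grows, the first principal component tilts towards x_2 and the one-component R^2 falls, which is the pattern reported in Tables 1 and 2.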

[Table 1: coefficients of x_1 and x_2 in the first principal component for each of the five datasets in the first set, as the variance of x_2 increases.]

[Table 2: R^2 results for each of the five datasets in the first set (x_1 and x_2 both correlated with y, variance of x_2 increased), for four models: PCR with 2 components, PCR with 1 component, y = β_0 + β_1 x_1, and y = β_0 + β_1 x_2.]

In the second set of five datasets, x_2 is not correlated with y, and the variance of x_2 is gradually increased. The relationship between x_1, x_2, and y is displayed in the charts below.

[Figure: for each of the five datasets in the second set, scatter plots of y vs. x_2, y vs. x_1, and x_1 vs. x_2.]

This second example is much starker. The accuracy of the principal component regression with one component is barely affected until the variance of x_2 is close to that of x_1; once the variance of x_2 exceeds that of x_1, the model ceases to be useful, since the principal component regression stops being able to explain any of the variance of the target variable, y (see Table 4). This is to be expected, since the first principal component is almost entirely composed of the feature with the larger variance (see Table 3), which works only so long as that feature is the predictive feature, x_1.

[Table 3: coefficients of x_1 and x_2 in the first principal component for each of the five datasets in the second set, as the variance of x_2 increases.]

[Table 4: R^2 results for each of the five datasets in the second set (x_1 correlated with y, x_2 uncorrelated, variance of x_2 increased), for PCR with 2 components and PCR with 1 component.]

Both examples highlight the importance of ensuring that all features have the same variance before carrying out PCA, since the composition of the principal components is highly sensitive to the variance of the features. However, this may not completely solve the problem encountered in the first set of datasets. If features are correlated, then even if the variance of each feature is standardized to one, a linear combination of multiple features may have a variance greater than one. So if the best predictor of the target variable is a single feature, that feature has a lower variance than a linear combination of correlated features, and the first principal component will not be as predictive as that feature alone. In other words, the direction of highest variance in the data is not the most meaningful one, which is contrary to a fundamental assumption of PCA. Standardizing the variance of x_1 and x_2 and re-running the analysis on the first set of five datasets (see Table 5) makes this issue clear. Whilst the accuracy of the principal component regression with one principal component improves compared to when the feature variance was not standardized, it is still significantly worse than the linear model y = β_0 + β_1 x_1.

[Table 5: R^2 results for each of the five datasets in the first set after standardizing the variance of x_1 and x_2 to one, for four models: PCR with 2 components, PCR with 1 component, y = β_0 + β_1 x_1, and y = β_0 + β_1 x_2.]

Section 3: Predicting Median Household Income in US Counties

This section examines the effect of changing the number of principal components on the performance of principal component regression. The performance of principal component regression is then compared with linear regression and ridge regression, with the objective of evaluating whether PCA is a good approach to preventing overfitting.

The dataset used throughout this section is a set of 15 features for US counties (source: census.gov; see the references for a link to the data), and this information is used to predict median household income by county. The data is structured into a 3,143 x 15 matrix, X, in which each row represents the information for a single county and each column represents a single feature. The 15 features are as follows:

(1) Population, absolute
(2) Population measured in April, absolute
(3) Population measured at the end of the year, absolute
(4) Number of veterans, absolute
(5) Number of housing units, absolute
(6) Private non-farm employment, absolute
(7) Total number of firms, absolute
(8) Manufacturing shipments, $k
(9) Retail sales, value, $k
(10) Accommodation and food services sales, $k
(11) Foreign born persons, %
(12) High school graduates or higher, %
(13) Bachelor's degree or higher, %
(14) Persons below the poverty line, %
(15) Women owned firms, %

The two figures below plot each feature against median household income. The first figure plots the raw data. In the second figure the data has been mean-normalized and the features scaled by dividing each element of each feature vector by the corresponding standard deviation for that feature. Examining the plots, a first observation is that the range of values varies significantly from feature to feature. From the discussion in Section 2, it is clear that PCA is highly sensitive to the absolute range of the features, since it affects their variance. If the data were analyzed as is, the principal components would be dominated by the variance of the features with the largest absolute values.

This is not necessarily of interest, since it might be the case that variability in features with a smaller range of absolute values, for example the percentage of the population with a BA degree or higher, is a better predictor of median income than, say, the size of a county. So, in order to capture informative differences in the variation of each feature, all of the analysis is performed on the scaled dataset.

It also appears that features 1-10 are highly correlated and are essentially different measures of the size of a county. This is confirmed by examining the covariance matrix of the features, the columns of X (see the Appendix). Feature 11, the percent of foreign born persons, is also somewhat correlated with features 1-10 and not with any others. Feature 12, the percent of high school graduates or higher, is not correlated with features 1-11, but it is reasonably strongly correlated with the percent holding a bachelor's degree or higher (feature 13) and, inversely, with the percent of persons below the poverty line (feature 14). Feature 13, the percent holding a bachelor's degree or higher, is most strongly correlated with feature 12, and somewhat, inversely, with feature 14. Feature 15, the percent of women owned firms, is not strongly correlated with any other feature.
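The scaling step and the correlation check described above are straightforward in Matlab. A sketch, where X is assumed to be the county feature matrix already loaded from the census.gov file listed in the references (the loading code is not shown here):

% Assume X is the m x 15 county feature matrix.
[m, n] = size(X);

% Mean-normalize and scale each feature by its standard deviation.
Xs = (X - repmat(mean(X, 1), m, 1)) ./ repmat(std(X, 0, 1), m, 1);

% Covariance matrix of the scaled features: for standardized columns this is
% also the correlation matrix, and it shows the block of correlated size features.
R = (Xs' * Xs) / (m - 1);
disp(R)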

[Figure: each of the 15 features plotted against median household income, raw data, one panel per feature.]

[Figure: each of the 15 features plotted against median household income, mean-normalized and scaled data, one panel per feature.]

Given the high degree of correlation between features, it should be possible to reduce the number of principal components used without having a significant impact on the performance of a principal component regression model. The coefficient matrix (see the Appendix) helps us to learn more about the structure of the data. Component 1 most heavily weights features 1-10 of the original dataset. This supports the hypothesis that features 1-10 are related variations of one measure, the size of a county. Component 2 is mostly composed of three further features, and a linear combination of these three features is the next highest direction of variance in the data. Components 3 and 4 are mostly a combination of features 11 and 15, whilst component 5 is mostly a combination of features 13 and 14. The largest direction of variance in the data is county size, and the next four highest directions of variance are mostly combinations of the percentage features. However, it may still have been useful to include all of features 1-10, since the first 5 principal components explain the majority of the variance in the data, but not all of it (see the tables below).

[Tables: percentage of variance explained by the first k principal components, for k = 1 to 15.]

Next, I compared the performance of principal component regression with linear regression and ridge regression. (Here principal component regression is simply linear regression using k ≤ n columns of the transformed dataset X_PCA = XV to predict the target variable, y.) What follows is a brief summary of the methodology and results; see the Appendix for further detail.

The dataset was split into a randomly selected training set of counties, with the remainder kept as a holdout set for testing how well the models generalized to unseen data. For both principal component regression and ridge regression, 10-fold cross-validation was used to select the number of principal components and the value of the regularization parameter, k. (The training data was further split into 10 different sets of training and validation data. The training data in each set contained 90% of the total training set samples, randomly selected, and the remaining 10% was held out as a validation set to test the model on unseen data.) Whilst the value of k that achieved the highest performance on the validation set was k = 0.1, I chose a more conservative k = 3, which also achieved a good result on the validation set.

As the number of principal components is reduced from the full set, the performance of a principal component regression on the training data initially stays stable, with an R^2 of roughly 77% (see the graph below), after which it starts to fall, dropping precipitously with fewer than 5 components. This suggests that there are a minimum of 5 dimensions that are critical to predicting median county income. The performance of the model on unseen test data varies significantly when more than 5 components are used, suggesting some overfitting of the data, but it follows the same pattern as the training data with fewer than 5 components. Five principal components therefore appears to be the optimal number to use.

[Figure: principal component regression R^2 on the training and test sets as a function of the number of principal components.]
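The two quantities used to choose the number of components, the cumulative percentage of variance explained and the R^2 of a principal component regression on held-out data, can be computed as in the sketch below. This is a simplified stand-in for the 10-fold cross-validation described above (a single random training/validation split is used here), and it assumes Xs is the scaled feature matrix from the earlier sketch and y is the median household income vector.

% Cumulative percentage of variance explained by the first k components.
[m, n] = size(Xs);
[U, S, V] = svd(Xs, 'econ');
sigma2 = diag(S).^2;
pctExplained = 100 * cumsum(sigma2) / sum(sigma2);
disp(pctExplained')

% R^2 of a principal component regression on a held-out validation set,
% for every possible number of components k = 1..n.
idx = randperm(m)';
nTrain = round(0.9 * m);
train = idx(1:nTrain);
test  = idx(nTrain+1:end);

rsqPCR = zeros(n, 1);
for k = 1:n
    Z = Xs * V(:, 1:k);                                  % first k principal components
    b = [ones(nTrain, 1), Z(train, :)] \ y(train);       % linear regression with intercept
    pred = [ones(m - nTrain, 1), Z(test, :)] * b;
    rsqPCR(k) = 1 - sum((y(test) - pred).^2) / sum((y(test) - mean(y(test))).^2);
end
disp(rsqPCR')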

Once the number of principal components and the size of k had been fixed (at 5 principal components and k = 3), each model was tested on the original test dataset using the results from the training data. Finally, six further training and holdout datasets were generated, the data split randomly between the two with the same training:test ratio. Taking inspiration from Janecek, Gansterer, Demel and Ecker's paper, On the Relationship between Feature Selection and Classification Accuracy (JMLR: Workshop and Conference Proceedings 4: 90-105), I also created two combined models to test on these additional datasets. For the first combined model, the first 5 principal components were appended to the original matrix, after which a normal linear regression process was followed. The second model used the same combined dataset but followed a ridge regression process. For each set, each model was trained on the training data and tested on the holdout dataset. The performance of all the models (R^2) on the unseen holdout sets is summarized below.

[Table: R^2 of each model (PCR, LR, RR, LR + PCR, RR + PCR) on each of the seven holdout test sets, together with each model's average (bold font = best model per test).]
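The two combined models and the ridge regression baseline can be expressed compactly. The sketch below illustrates the procedure described above rather than reproducing the author's code; it assumes Xs, y, V, train and test from the earlier sketches, uses the closed-form ridge solution with the regularization parameter called k as in the text, and leaves the intercept unpenalized, which is one common convention.

% Combined design matrix: the scaled features plus the first 5 principal components.
Z5 = Xs * V(:, 1:5);
Xcomb = [Xs, Z5];

% Linear regression on the combined matrix (LR + PCR).
% (Matlab will warn that A is rank deficient; the fitted values are unaffected.)
A = [ones(numel(train), 1), Xcomb(train, :)];
bLR = A \ y(train);

% Ridge regression on the combined matrix (RR + PCR): minimize ||A*b - y||^2 + k*||b||^2.
k = 3;
penalty = k * eye(size(A, 2));
penalty(1, 1) = 0;                          % do not shrink the intercept
bRR = (A' * A + penalty) \ (A' * y(train));

% Compare holdout R^2 of the two combined models.
At = [ones(numel(test), 1), Xcomb(test, :)];
r2 = @(t, f) 1 - sum((t - f).^2) / sum((t - mean(t)).^2);
disp([r2(y(test), At * bLR), r2(y(test), At * bRR)])

As the paper notes, appending principal components to the full feature matrix adds no new information for ordinary least squares, so the LR + PCR predictions match plain linear regression; the ridge variant can differ because the penalty treats the duplicated directions differently.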

Overall, ridge regression generalized best to unseen data, whilst principal component regression performed worst. The strong performance of the simple linear regression provides some insight as to why. Since the linear model generalized well to unseen data, this suggests that although there was high correlation between the first 10 features, they still contained information that was useful for predicting median income. That is, the informative number of dimensions in the dataset was closer to 15 than to 5, the number of dimensions chosen for the principal component regression.

The combined models were not a success. Linear regression plus PCA generated exactly the same results as linear regression. This is to be expected, given that the principal components are linear combinations of the columns of the original dataset, and so adding principal components simply duplicates information. Ridge regression plus PCA was more interesting, since on two of six occasions this model achieved the best result. However, in these cases the performance of the plain ridge regression was close, and overall ridge regression outperformed the combined model.

Examining the difference in performance between the training and the holdout sets (displayed in the tables below) shows that principal component regression does generalize relatively better than linear regression, but not as well as ridge regression. Linear regression performed best on the training data but also overfit the most, with a 3.3 percentage point drop in performance between the training and test data. In contrast, principal component regression had a smaller drop, and ridge regression overfit the least, with a drop of just over one percentage point.

[Table: R^2 of each model (PCR, LR, RR, LR + PCR, RR + PCR) on each of the seven training sets, together with each model's average (bold font = best model per test). Note: linear regression and linear regression + PCA lead to the same results, so there is one less result for LR + PCA.]

Comparing average performance: training set vs. test set

[Table: for each model (PCR, LR, RR, LR + PCR, RR + PCR), and averaged across models, the average training R^2, the average test R^2, and the delta (test minus training).]

Based on this analysis, it appears that ridge regression is the best model to use if overfitting is a concern. It has the benefit of not discarding any information which could potentially be useful, while penalizing large coefficients, leading to better performance on unseen data. It is, however, interesting to see that principal component regression achieves performance within a few percentage points of ridge regression despite using a dataset one third of the size.

Section 4: Conclusion

Principal Component Analysis reveals useful information about the nature and structure of a set of features, and so is a valuable process to follow during initial data exploration. Additionally, it provides an elegant way to reduce the size of a dataset with minimal loss of information, thus minimizing the fall in the performance of a linear model as the size of the dataset is reduced. However, the performance of principal component regression suggests that if data compression is not necessary, a better approach to preventing overfitting of linear models is to use ridge regression.

It would be interesting to extend this analysis by comparing the performance of principal component regression and ridge regression on a dataset for which a simple linear model significantly overfits, i.e. does not generalize well to unseen data. Additionally, it would be informative to explore the effect of applying principal component analysis to reduce the size of a dataset used with more complex models such as neural networks or k-nearest-neighbors classification. I would be interested to understand under what circumstances, if any, reducing the size of a dataset using PCA improves the performance of a predictive model on unseen data when compared with training data.

Acknowledgements

Thank you to Professor Margaret Wright for giving me the idea for a project on principal component regression in the first place, and for pointing me towards the excellent tutorial on the subject by Jonathon Shlens; to Clinical Assistant Professor Sophie Marques for discussing the mathematics of orthogonal transformations and for sparking the idea of systematically changing the variance of x_2; and to Professor David Rosenberg for suggesting that I compare principal component regression to ridge regression and that I read Elements of Statistical Learning.

References

(1) Linear Algebra and its Applications, David Lay, Chapter 7
(2) Elements of Statistical Learning, Hastie, Tibshirani, Friedman, Chapter 3
(3) A Tutorial on Principal Component Analysis, Jonathon Shlens
(4) On the Relationship between Feature Selection and Classification Accuracy, Janecek, Gansterer, Demel, Ecker, JMLR: Workshop and Conference Proceedings
(5) Coursera, Machine Learning, Andrew Ng, week 7 (K-means clustering and PCA)
(6) PCA or Polluting your Clever Analysis (this gave me the idea for examining the performance of principal component regression as the variance of the features and their relationship to the predictor variable changed): http://www.r-bloggers.com/pca-or-polluting-your-clever-analysis/
(7) Mathworks: tutorials on principal component analysis and ridge regression
(8) Modern Regression: Ridge Regression, Ryan Tibshirani, March 2013
(9) Lecture 17: Principal Component Analysis, Shippensburg University of Pennsylvania
(10) PCA, SVD, and LDA, Shanshan Ding
(11) http://yatani.jp/teaching/doku.php?id=hcistats:PCA
(12) https://en.wikipedia.org/wiki/Principal_component_analysis
(13) http://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
(14) US County Data: census.gov, http://quickfacts.census.gov/qfd/download_data.html


More information

Principal Component Analysis (PCA) Principal Component Analysis (PCA)

Principal Component Analysis (PCA) Principal Component Analysis (PCA) Recall: Eigenvectors of the Covariance Matrix Covariance matrices are symmetric. Eigenvectors are orthogonal Eigenvectors are ordered by the magnitude of eigenvalues: λ 1 λ 2 λ p {v 1, v 2,..., v n } Recall:

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision) CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

https://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) 2 count = 0 3 for x in self.x_test_ridge: 4 5 prediction = np.matmul(self.w_ridge,x) 6 ###ADD THE COMPUTED MEAN BACK TO THE PREDICTED VECTOR### 7 prediction = self.ss_y.inverse_transform(prediction) 8

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson PCA Review! Supervised learning! Fisher linear discriminant analysis! Nonlinear discriminant analysis! Research example! Multiple Classes! Unsupervised

More information

MATH 829: Introduction to Data Mining and Analysis Principal component analysis

MATH 829: Introduction to Data Mining and Analysis Principal component analysis 1/11 MATH 829: Introduction to Data Mining and Analysis Principal component analysis Dominique Guillot Departments of Mathematical Sciences University of Delaware April 4, 2016 Motivation 2/11 High-dimensional

More information

CSE 554 Lecture 7: Alignment

CSE 554 Lecture 7: Alignment CSE 554 Lecture 7: Alignment Fall 2012 CSE554 Alignment Slide 1 Review Fairing (smoothing) Relocating vertices to achieve a smoother appearance Method: centroid averaging Simplification Reducing vertex

More information

Kernel PCA, clustering and canonical correlation analysis

Kernel PCA, clustering and canonical correlation analysis ernel PCA, clustering and canonical correlation analsis Le Song Machine Learning II: Advanced opics CSE 8803ML, Spring 2012 Support Vector Machines (SVM) 1 min w 2 w w + C j ξ j s. t. w j + b j 1 ξ j,

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

Classification 2: Linear discriminant analysis (continued); logistic regression

Classification 2: Linear discriminant analysis (continued); logistic regression Classification 2: Linear discriminant analysis (continued); logistic regression Ryan Tibshirani Data Mining: 36-462/36-662 April 4 2013 Optional reading: ISL 4.4, ESL 4.3; ISL 4.3, ESL 4.4 1 Reminder:

More information

The Mathematics of Facial Recognition

The Mathematics of Facial Recognition William Dean Gowin Graduate Student Appalachian State University July 26, 2007 Outline EigenFaces Deconstruct a known face into an N-dimensional facespace where N is the number of faces in our data set.

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Linear Algebra Review. Fei-Fei Li

Linear Algebra Review. Fei-Fei Li Linear Algebra Review Fei-Fei Li 1 / 37 Vectors Vectors and matrices are just collections of ordered numbers that represent something: movements in space, scaling factors, pixel brightnesses, etc. A vector

More information

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T. Notes on singular value decomposition for Math 54 Recall that if A is a symmetric n n matrix, then A has real eigenvalues λ 1,, λ n (possibly repeated), and R n has an orthonormal basis v 1,, v n, where

More information