MATH 829: Introduction to Data Mining and Analysis
Principal component analysis

Dominique Guillot, Department of Mathematical Sciences, University of Delaware. April 4, 2016.

Motivation

High-dimensional data often has a low-rank structure: most of the action may occur in a subspace of $\mathbb{R}^p$.

Problem: How can we discover low-dimensional structures in data?

Principal component analysis: construct projections of the data that capture most of the variability in the data. Provides a low-rank approximation to the data. Can lead to a significant dimensionality reduction.

Principal component analysis (PCA)

Let $X \in \mathbb{R}^{n \times p}$ with rows $x_1, \dots, x_n \in \mathbb{R}^p$. We think of $X$ as $n$ observations of a random vector $(X_1, \dots, X_p) \in \mathbb{R}^p$.

Suppose each column has mean 0, i.e., $\sum_{i=1}^n x_i = 0_{1 \times p}$.

We want to find a linear combination $w_1 X_1 + \cdots + w_p X_p$ with maximum variance. (Intuition: we look for a direction in $\mathbb{R}^p$ where the data varies the most.)

We solve:
$$w = \operatorname*{argmax}_{\|w\|_2 = 1} \sum_{i=1}^n (x_i^T w)^2.$$
(Note: $\sum_{i=1}^n (x_i^T w)^2$ is proportional to the sample variance of the data since we assume each column of $X$ has mean 0.)

Equivalently, we solve:
$$w = \operatorname*{argmax}_{\|w\|_2 = 1} (Xw)^T (Xw) = \operatorname*{argmax}_{\|w\|_2 = 1} w^T X^T X w.$$

Claim: $w$ is an eigenvector associated to the largest eigenvalue of $X^T X$.
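
As a quick numerical check of the equivalence above (not part of the original slides), the following sketch uses a random centered matrix in place of real data and verifies that $\sum_{i=1}^n (x_i^T w)^2 = (Xw)^T(Xw) = w^T X^T X w$ for an arbitrary unit vector $w$:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)          # center each column, as assumed on the slide

w = rng.normal(size=p)
w = w / np.linalg.norm(w)       # arbitrary unit vector

obj_rows = sum((X[i] @ w) ** 2 for i in range(n))   # sum_i (x_i^T w)^2
obj_quad = w @ X.T @ X @ w                          # w^T X^T X w
print(np.isclose(obj_rows, obj_quad))               # True: the two objectives agree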

Proof of claim: Rayleigh quotients

Let $A \in \mathbb{R}^{p \times p}$ be a symmetric (or Hermitian) matrix. The Rayleigh quotient is defined by
$$R(A, x) = \frac{x^T A x}{x^T x} = \frac{\langle Ax, x\rangle}{\langle x, x\rangle}, \qquad (x \in \mathbb{R}^p,\ x \neq 0_{p \times 1}).$$

Observations:
1. If $Ax = \lambda x$ with $\|x\|_2 = 1$, then $R(A, x) = \lambda$. Thus, $\sup_{x \neq 0} R(A, x) \geq \lambda_{\max}(A)$.
2. Let $\{\lambda_1, \dots, \lambda_p\}$ denote the eigenvalues of $A$, and let $\{v_1, \dots, v_p\} \subset \mathbb{R}^p$ be an orthonormal basis of eigenvectors of $A$. If $x = \sum_{i=1}^p \theta_i v_i$, then
$$R(A, x) = \frac{\sum_{i=1}^p \lambda_i \theta_i^2}{\sum_{i=1}^p \theta_i^2}.$$
It follows that $\sup_{x \neq 0} R(A, x) \leq \lambda_{\max}(A)$.

Thus, $\sup_{x \neq 0} R(A, x) = \sup_{\|x\|_2 = 1} x^T A x = \lambda_{\max}(A)$, and by observation 1 the supremum is attained at an eigenvector associated to $\lambda_{\max}(A)$.
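
As a numerical illustration of these two observations (a sketch on a random symmetric matrix, not part of the original slides), the Rayleigh quotient at the top eigenvector equals the largest eigenvalue, and random unit vectors never exceed it:

import numpy as np

rng = np.random.default_rng(1)
p = 6
B = rng.normal(size=(p, p))
A = (B + B.T) / 2                        # random symmetric matrix

evals, evecs = np.linalg.eigh(A)         # eigenvalues in increasing order
lam_max, v_max = evals[-1], evecs[:, -1]

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

print(np.isclose(rayleigh(A, v_max), lam_max))   # R(A, v_max) = lambda_max

xs = rng.normal(size=(1000, p))                  # random (nonzero) directions
print(max(rayleigh(A, x) for x in xs) <= lam_max + 1e-12)   # never above lambda_max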

Back to PCA

The previous argument shows that
$$w^{(1)} = \operatorname*{argmax}_{\|w\|_2 = 1} \sum_{i=1}^n (x_i^T w)^2 = \operatorname*{argmax}_{\|w\|_2 = 1} w^T X^T X w$$
is an eigenvector associated to the largest eigenvalue of $X^T X$.

First principal component: The linear combination $\sum_{i=1}^p w_i^{(1)} X_i$ is the first principal component of $(X_1, \dots, X_p)$. Alternatively, we say that $Xw^{(1)}$ is the first (sample) principal component of $X$. It is the linear combination of the columns of $X$ having the most variance.

Second principal component: We look for a new linear combination of the $X_i$'s that
1. is orthogonal to the first principal component, and
2. maximizes the variance.

Back to PCA (cont.)

In other words:
$$w^{(2)} := \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}}} \sum_{i=1}^n (x_i^T w)^2 = \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}}} w^T X^T X w.$$

Using a similar argument as before with Rayleigh quotients, we conclude that $w^{(2)}$ is an eigenvector associated to the second largest eigenvalue of $X^T X$.

Similarly, given $w^{(1)}, \dots, w^{(k)}$, we define
$$w^{(k+1)} := \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}, \dots, w^{(k)}}} \sum_{i=1}^n (x_i^T w)^2 = \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}, \dots, w^{(k)}}} w^T X^T X w.$$
As before, the vector $w^{(k+1)}$ is an eigenvector associated to the $(k+1)$-th largest eigenvalue of $X^T X$.

PCA: summary

In summary, suppose $X^T X = U \Lambda U^T$, where $U \in \mathbb{R}^{p \times p}$ is an orthogonal matrix and $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal (eigendecomposition of $X^T X$).

Recall that the columns of $U$ are the eigenvectors of $X^T X$ and the diagonal of $\Lambda$ contains the eigenvalues of $X^T X$ (i.e., the squared singular values of $X$), which we take to be sorted in decreasing order.

Then the principal components of $X$ are the columns of $XU$.

Write $U = (u_1, \dots, u_p)$. Then the variance of the $i$-th principal component is
$$(Xu_i)^T (Xu_i) = u_i^T X^T X u_i = (U^T X^T X U)_{ii} = \Lambda_{ii}.$$

Conclusion: The variance of the $i$-th principal component is the $i$-th eigenvalue of $X^T X$.

We say that the first $k$ PCs explain $\left(\sum_{i=1}^k \Lambda_{ii}\right) / \left(\sum_{i=1}^p \Lambda_{ii}\right) \times 100$ percent of the variance.
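
To make this summary concrete, here is a minimal numpy sketch (on a synthetic centered matrix rather than the course data) that computes the principal components from the eigendecomposition of $X^T X$ and checks that the variance of the $i$-th component equals the $i$-th eigenvalue:

import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated columns
X = X - X.mean(axis=0)                                   # center the columns

evals, U = np.linalg.eigh(X.T @ X)     # X^T X = U Lambda U^T
order = np.argsort(evals)[::-1]        # sort eigenvalues in decreasing order
evals, U = evals[order], U[:, order]

PCs = X @ U                            # principal components = columns of XU

print(np.allclose(np.sum(PCs**2, axis=0), evals))   # variance of i-th PC = Lambda_ii

k = 2
print(evals[:k].sum() / evals.sum())   # fraction of variance explained by first k PCs

In practice one usually works with the SVD of $X$ directly rather than forming $X^T X$; this is numerically more stable and is what library implementations such as sklearn's PCA do.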

Example: zip dataset

Recall the zip dataset:
1. 9298 images of digits 0-9.
2. Each image is in black/white with $16 \times 16 = 256$ pixels.

We use PCA to project the data onto a 2-dimensional subspace of $\mathbb{R}^{256}$.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pc = PCA(n_components=10)
pc.fit(x_train)                       # x_train: zip training images, one flattened image per row
print(pc.explained_variance_ratio_)   # fraction of variance explained by each component
plt.plot(range(1, 11), np.cumsum(pc.explained_variance_ratio_))   # cumulative explained variance

Example: zip dataset (cont.)

Projecting the data on the first two principal components:

Xt = pc.fit_transform(x_train)

Note: 27% of the variance is explained by the first two PCs; 90% of the variance is explained by the first 55 components.
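
A scatter plot of this two-dimensional projection could be produced along the following lines; the label array y_train used to color the points is an assumption for illustration, not something defined in these slides:

import matplotlib.pyplot as plt

# Xt = pc.fit_transform(x_train) as above; its first two columns are the first two PCs.
plt.scatter(Xt[:, 0], Xt[:, 1], c=y_train, cmap='tab10', s=5)   # y_train: assumed digit labels
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.colorbar(label='digit')
plt.show()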

Principal component regression

Principal components can be used directly in a regression context.

Principal component regression: $y \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times p}$.
1. Center $y$ and each column of $X$ (i.e., subtract the mean from the columns).
2. Compute the eigendecomposition of $X^T X$: $X^T X = U \Lambda U^T$.
3. Compute $k \geq 1$ principal components: $W_k := (Xu_1, \dots, Xu_k) = XU_k$, where $U = (u_1, \dots, u_p)$ and $U_k = (u_1, \dots, u_k) \in \mathbb{R}^{p \times k}$.
4. Regress $y$ on the principal components: $\hat{\gamma}_k := (W_k^T W_k)^{-1} W_k^T y$.
5. The PCR estimator is $\hat{\beta}_k := U_k \hat{\gamma}_k$, with fitted values $\hat{y}^{(k)} := X\hat{\beta}_k = XU_k\hat{\gamma}_k$.

Note: $k$ is a parameter that needs to be chosen (using CV or another method). Typically, one picks $k$ to be significantly smaller than $p$.
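
A minimal numpy sketch of steps 1 to 5 above, on synthetic data (the data, variable names, and the choice k = 3 are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(3)
n, p, k = 100, 10, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# 1. Center y and each column of X.
y = y - y.mean()
X = X - X.mean(axis=0)

# 2. Eigendecomposition of X^T X, eigenvectors sorted by decreasing eigenvalue.
evals, U = np.linalg.eigh(X.T @ X)
U = U[:, np.argsort(evals)[::-1]]

# 3. First k principal components.
U_k = U[:, :k]
W_k = X @ U_k

# 4. Regress y on the principal components.
gamma_k = np.linalg.solve(W_k.T @ W_k, W_k.T @ y)

# 5. PCR estimator and fitted values.
beta_k = U_k @ gamma_k
y_hat = X @ beta_k

Since the columns of $W_k$ are orthogonal ($W_k^T W_k = \operatorname{diag}(\Lambda_{11}, \dots, \Lambda_{kk})$), the regression in step 4 reduces to $k$ univariate regressions.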

Projection pursuit

PCA looks for subspaces with the most variance. One can also optimize other criteria.

Projection pursuit (PP):
1. Set up a projection index to judge the merit of a particular one- or two-dimensional projection of a given set of multivariate data.
2. Use an optimization algorithm to find the global and local extrema of that projection index over all 1/2-dimensional projections of the data.

Example (Izenman, 2013): The absolute value of the kurtosis, $|\kappa_4(Y)|$, of the one-dimensional projection $Y = w^T X$ has been widely used as a measure of the non-Gaussianity of $Y$.

Recall: The marginals of the multivariate Gaussian distribution are Gaussian.

One can maximize/minimize the kurtosis to find subspaces where the data looks non-Gaussian/Gaussian (e.g., to detect outliers).
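
As a rough illustration of a kurtosis-based projection index (a crude random search over directions, not the optimization algorithm a real projection pursuit implementation would use), one might write something like the sketch below; scipy.stats.kurtosis computes the excess kurtosis, which is 0 for Gaussian data:

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(4)
n, p = 500, 5
X = rng.normal(size=(n, p))
X[:10, 0] += 8                      # a few outliers along the first coordinate
X = X - X.mean(axis=0)

def projection_index(w):
    """Absolute excess kurtosis of the 1-d projection Xw (large = non-Gaussian)."""
    return abs(kurtosis(X @ w))

# Crude search: evaluate the index over many random unit directions.
best_w, best_val = None, -np.inf
for _ in range(2000):
    w = rng.normal(size=p)
    w /= np.linalg.norm(w)
    val = projection_index(w)
    if val > best_val:
        best_w, best_val = w, val

print(best_val, np.round(best_w, 2))   # the best direction typically points roughly along the outliers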