MATH 829: Introduction to Data Mining and Analysis
Principal component analysis
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
April 4, 2016
Motivation

High-dimensional data often has a low-rank structure: most of the action may occur in a subspace of $\mathbb{R}^p$.

Problem: How can we discover low-dimensional structures in data?

Principal component analysis: construct projections of the data that capture most of the variability in the data. It provides a low-rank approximation to the data and can lead to a significant dimensionality reduction.
Principal component analysis (PCA)

Let $X \in \mathbb{R}^{n \times p}$ with rows $x_1, \ldots, x_n \in \mathbb{R}^p$. We think of $X$ as $n$ observations of a random vector $(X_1, \ldots, X_p) \in \mathbb{R}^p$.

Suppose each column has mean 0, i.e., $\sum_{i=1}^n x_i = 0_{1 \times p}$.

We want to find a linear combination $w_1 X_1 + \cdots + w_p X_p$ with maximum variance. (Intuition: we look for a direction in $\mathbb{R}^p$ where the data varies the most.)

We solve:
$$w = \operatorname*{argmax}_{\|w\|_2 = 1} \sum_{i=1}^n (x_i^T w)^2.$$
(Note: $\sum_{i=1}^n (x_i^T w)^2$ is proportional to the sample variance of the data since we assume each column of $X$ has mean 0.)

Equivalently, we solve:
$$w = \operatorname*{argmax}_{\|w\|_2 = 1} (Xw)^T (Xw) = \operatorname*{argmax}_{\|w\|_2 = 1} w^T X^T X w.$$

Claim: $w$ is an eigenvector associated to the largest eigenvalue of $X^T X$.
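As a quick illustration (not part of the original slides), here is a minimal numpy sketch that computes the maximizer via an eigendecomposition of $X^T X$ and checks the claim against random unit vectors; the data matrix X is a made-up example:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)               # center each column

# Eigendecomposition of X^T X; np.linalg.eigh returns eigenvalues in
# ascending order, so the last column is the leading eigenvector w.
evals, evecs = np.linalg.eigh(X.T @ X)
w1 = evecs[:, -1]

# Sanity check: no random unit vector should beat w1.
best = w1 @ X.T @ X @ w1
for _ in range(1000):
    w = rng.standard_normal(5)
    w /= np.linalg.norm(w)
    assert w @ X.T @ X @ w <= best + 1e-9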
Proof of claim: Rayleigh quotients

Let $A \in \mathbb{R}^{p \times p}$ be a symmetric (or Hermitian) matrix. The Rayleigh quotient is defined by
$$R(A, x) = \frac{x^T A x}{x^T x} = \frac{\langle Ax, x \rangle}{\langle x, x \rangle} \qquad (x \in \mathbb{R}^p,\ x \neq 0_{p \times 1}).$$

Observations:
1. If $Ax = \lambda x$ with $\|x\|_2 = 1$, then $R(A, x) = \lambda$. Thus,
$$\sup_{x \neq 0} R(A, x) \geq \lambda_{\max}(A).$$
2. Let $\{\lambda_1, \ldots, \lambda_p\}$ denote the eigenvalues of $A$, and let $\{v_1, \ldots, v_p\} \subset \mathbb{R}^p$ be an orthonormal basis of eigenvectors of $A$. If $x = \sum_{i=1}^p \theta_i v_i$, then
$$R(A, x) = \frac{\sum_{i=1}^p \lambda_i \theta_i^2}{\sum_{i=1}^p \theta_i^2} \leq \lambda_{\max}(A)\, \frac{\sum_{i=1}^p \theta_i^2}{\sum_{i=1}^p \theta_i^2} = \lambda_{\max}(A).$$

It follows that $\sup_{x \neq 0} R(A, x) \leq \lambda_{\max}(A)$. Thus,
$$\sup_{x \neq 0} R(A, x) = \sup_{\|x\|_2 = 1} x^T A x = \lambda_{\max}(A).$$
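A small numeric sanity check of the two observations (an illustrative sketch, assuming only numpy; the matrix A is random):

import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                    # random symmetric matrix

lam_max = np.linalg.eigvalsh(A)[-1]  # largest eigenvalue

def rayleigh(A, x):
    # R(A, x) = x^T A x / x^T x
    return (x @ A @ x) / (x @ x)

# Observation 2: R(A, x) <= lambda_max for any nonzero x ...
xs = rng.standard_normal((1000, 6))
assert all(rayleigh(A, x) <= lam_max + 1e-9 for x in xs)

# ... and the bound is attained at the leading eigenvector (Observation 1).
v_max = np.linalg.eigh(A)[1][:, -1]
assert np.isclose(rayleigh(A, v_max), lam_max)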
Back to PCA

The previous argument shows that
$$w^{(1)} = \operatorname*{argmax}_{\|w\|_2 = 1} \sum_{i=1}^n (x_i^T w)^2 = \operatorname*{argmax}_{\|w\|_2 = 1} w^T X^T X w$$
is an eigenvector associated to the largest eigenvalue of $X^T X$.

First principal component: The linear combination $\sum_{i=1}^p w_i^{(1)} X_i$ is the first principal component of $(X_1, \ldots, X_p)$. Alternatively, we say that $Xw^{(1)}$ is the first (sample) principal component of $X$. It is the linear combination of the columns of $X$ having the most variance.

Second principal component: We look for a new linear combination of the $X_i$'s that
1. is orthogonal to the first principal component, and
2. maximizes the variance.
Back to PCA (cont.)

In other words:
$$w^{(2)} := \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}}} \sum_{i=1}^n (x_i^T w)^2 = \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}}} w^T X^T X w.$$

Using a similar argument as before with Rayleigh quotients, we conclude that $w^{(2)}$ is an eigenvector associated to the second largest eigenvalue of $X^T X$.

Similarly, given $w^{(1)}, \ldots, w^{(k)}$, we define
$$w^{(k+1)} := \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}, \ldots, w^{(k)}}} \sum_{i=1}^n (x_i^T w)^2 = \operatorname*{argmax}_{\substack{\|w\|_2 = 1 \\ w \perp w^{(1)}, \ldots, w^{(k)}}} w^T X^T X w.$$
As before, the vector $w^{(k+1)}$ is an eigenvector associated to the $(k+1)$-th largest eigenvalue of $X^T X$.
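A hedged numpy sketch of this successive-maximization view: after extracting $w^{(1)}$, deflate (remove that direction) and maximize again; the result agrees with the second eigenvector of $X^T X$. The data and the deflation step are illustrative, not from the slides:

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
X -= X.mean(axis=0)

S = X.T @ X
evals, evecs = np.linalg.eigh(S)          # ascending eigenvalues
w1 = evecs[:, -1]

# Deflation: zero out the w1 direction; the leading eigenvector of the
# deflated matrix is the maximizer over unit vectors orthogonal to w1.
S_defl = S - (w1 @ S @ w1) * np.outer(w1, w1)
w2 = np.linalg.eigh(S_defl)[1][:, -1]

# It agrees (up to sign) with the second eigenvector of S.
assert np.isclose(abs(w2 @ evecs[:, -2]), 1.0)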
PCA: summary

In summary, suppose $X^T X = U \Lambda U^T$, where $U \in \mathbb{R}^{p \times p}$ is an orthogonal matrix and $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal. (Eigendecomposition of $X^T X$.)

Recall that the columns of $U$ are the eigenvectors of $X^T X$ and the diagonal of $\Lambda$ contains the eigenvalues of $X^T X$ (i.e., the squared singular values of $X$).

Then the principal components of $X$ are the columns of $XU$.

Write $U = (u_1, \ldots, u_p)$. Then the variance of the $i$-th principal component is
$$(Xu_i)^T (Xu_i) = u_i^T X^T X u_i = (U^T X^T X U)_{ii} = \Lambda_{ii}.$$

Conclusion: The variance of the $i$-th principal component is the $i$-th eigenvalue of $X^T X$.

We say that the first $k$ PCs explain $\left(\sum_{i=1}^k \Lambda_{ii}\right) / \left(\sum_{i=1}^p \Lambda_{ii}\right) \times 100$ percent of the variance.
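To connect this summary to the sklearn call on the next slide, here is a small sketch (on synthetic data, assuming only numpy and scikit-learn) checking that the eigenvalues of $X^T X$ reproduce explained_variance_ratio_:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 6)) @ rng.standard_normal((6, 6))
X -= X.mean(axis=0)

# Eigenvalues of X^T X, sorted in decreasing order.
lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

pca = PCA().fit(X)

# The variance ratios agree with Lambda_ii / sum_j Lambda_jj.
assert np.allclose(pca.explained_variance_ratio_, lam / lam.sum())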
Example: zip dataset

Recall the zip dataset:
1. 9298 images of digits 0-9.
2. Each image is in black/white with $16 \times 16 = 256$ pixels.

We use PCA to project the data onto a 2-dimensional subspace of $\mathbb{R}^{256}$.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pc = PCA(n_components=10)            # keep the first 10 principal components
pc.fit(x_train)                      # x_train: the 9298 x 256 data matrix
print(pc.explained_variance_ratio_)

# Cumulative fraction of variance explained by the first 1, ..., 10 PCs.
plt.plot(range(1, 11), np.cumsum(pc.explained_variance_ratio_))
plt.show()
Example: zip dataset (cont.)

Projecting the data on the first two principal components:

Xt = pc.fit_transform(x_train)

Note: 27% of the variance is explained by the first two PCs; 90% of the variance is explained by the first 55 components.
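A minimal sketch of the 2-D projection plot, assuming x_train and the digit labels y_train are already loaded (the variable names follow the slide; the plotting details are an assumption, not the original figure):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pc = PCA(n_components=2)
Xt = pc.fit_transform(x_train)       # n x 2 matrix of PC scores

# Scatter the projected digits, colored by their label y_train.
plt.scatter(Xt[:, 0], Xt[:, 1], c=y_train, s=5, cmap="tab10")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.colorbar(label="digit")
plt.show()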
Principal component regression

Principal components can be used directly in a regression context.

Principal component regression: $y \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times p}$.
1. Center $y$ and each column of $X$ (i.e., subtract the mean from the columns).
2. Compute the eigendecomposition of $X^T X$: $X^T X = U \Lambda U^T$.
3. Compute $k \geq 1$ principal components: $W_k := (Xu_1, \ldots, Xu_k) = XU_k$, where $U = (u_1, \ldots, u_p)$ and $U_k = (u_1, \ldots, u_k) \in \mathbb{R}^{p \times k}$.
4. Regress $y$ on the principal components: $\hat{\gamma}_k := (W_k^T W_k)^{-1} W_k^T y$.
5. The PCR estimator is: $\hat{\beta}_k := U_k \hat{\gamma}_k$, $\hat{y}^{(k)} := X \hat{\beta}_k = X U_k \hat{\gamma}_k$.

Note: $k$ is a parameter that needs to be chosen (using CV or another method). Typically, one picks $k$ significantly smaller than $p$. (A sketch of these steps in code follows.)
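A direct transcription of steps 1-5 into numpy (an illustrative sketch; the least-squares solve replaces the explicit normal-equations formula in step 4 for numerical stability, and the data at the end is synthetic):

import numpy as np

def pcr(X, y, k):
    # 1. Center y and each column of X.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # 2. Eigendecomposition of X^T X (eigh: ascending eigenvalues).
    evals, U = np.linalg.eigh(Xc.T @ Xc)
    U = U[:, ::-1]                   # reorder to decreasing eigenvalues

    # 3. First k principal components.
    Uk = U[:, :k]
    Wk = Xc @ Uk

    # 4. Regress y on the principal components.
    gamma_k, *_ = np.linalg.lstsq(Wk, yc, rcond=None)

    # 5. PCR estimator in the original coordinates.
    beta_k = Uk @ gamma_k
    return beta_k, Xc @ beta_k       # coefficients and fitted values

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(100)
beta_3, y_hat = pcr(X, y, k=3)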
Projection pursuit

PCA looks for subspaces with the most variance. One can also optimize other criteria.

Projection pursuit (PP):
1. Set up a projection index to judge the merit of a particular one- or two-dimensional projection of a given set of multivariate data.
2. Use an optimization algorithm to find the global and local extrema of that projection index over all one- or two-dimensional projections of the data.

Example (Izenman, 2013): The absolute value of the kurtosis, $|\kappa_4(Y)|$, of the one-dimensional projection $Y = w^T X$ has been widely used as a measure of non-Gaussianity of $Y$.

Recall: The marginals of the multivariate Gaussian distribution are Gaussian.

One can maximize/minimize the kurtosis to find subspaces where the data looks Gaussian/non-Gaussian (e.g., to detect outliers). A small sketch follows.
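A hedged sketch of one-dimensional projection pursuit with the kurtosis index, assuming only numpy/scipy; the optimizer, restart scheme, and data are illustrative choices, not the slides' method:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

rng = np.random.default_rng(5)
# Mostly Gaussian data with a few outliers along a hidden direction.
X = rng.standard_normal((500, 4))
X[:20] += 6 * np.array([1.0, -1.0, 0.0, 0.0])
X -= X.mean(axis=0)

def neg_abs_kurtosis(w):
    # Projection index: -|kappa_4| of the projection X w / ||w||.
    w = w / np.linalg.norm(w)
    return -abs(kurtosis(X @ w))     # excess kurtosis; 0 for Gaussian data

# Restart from several random directions and keep the best local optimum.
best = min(
    (minimize(neg_abs_kurtosis, rng.standard_normal(4)) for _ in range(10)),
    key=lambda res: res.fun,
)
w_pp = best.x / np.linalg.norm(best.x)
print("most non-Gaussian direction:", np.round(w_pp, 2))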