Principal Component Analysis


Christian Atkinson

Motivation

Principal Component Analysis (PCA) is a multivariate statistical technique that is often useful in reducing the dimensionality of a collection of unstructured random variables for analysis and interpretation.

Problem Statement

Let $X$ be a $D$-dimensional random vector with covariance matrix $C$. The problem is to consecutively find the unit vectors $a_1, a_2, \ldots, a_D$ such that $y_i = x^t a_i$ with $Y_i = X^t a_i$ satisfies

1. $\operatorname{var}(Y_1)$ is the maximum.
2. $\operatorname{var}(Y_2)$ is the maximum subject to $\operatorname{cov}(Y_2, Y_1) = 0$.
3. $\operatorname{var}(Y_k)$ is the maximum subject to $\operatorname{cov}(Y_k, Y_i) = 0$, where $k = 3, 4, \ldots, D$ and $k > i$.

$Y_i$ is called the $i$-th principal component. Feature extraction by PCA is called PCP.

The Solution

Let $(\lambda_i, u_i)$ be the pairs of eigenvalues and eigenvectors of $C$ such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D$ and $\|u_i\|_2 = 1$, $1 \le i \le D$. Then $a_i = u_i$ and $\operatorname{var}(Y_i) = \lambda_i$ for $1 \le i \le D$.
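A brief sketch of why the first direction is the leading eigenvector of $C$. The Lagrange multiplier $\mu$ and the function $\mathcal{L}$ are symbols introduced here for illustration; they do not appear in the slides.

```latex
\begin{aligned}
\operatorname{var}(Y_1) &= \operatorname{var}(X^t a_1) = a_1^t C a_1, \\
\mathcal{L}(a_1, \mu) &= a_1^t C a_1 - \mu\,(a_1^t a_1 - 1), \\
\frac{\partial \mathcal{L}}{\partial a_1} &= 2 C a_1 - 2 \mu a_1 = 0
  \;\Longrightarrow\; C a_1 = \mu a_1, \\
\operatorname{var}(Y_1) &= a_1^t C a_1 = \mu\, a_1^t a_1 = \mu .
\end{aligned}
```

Every stationary point is therefore an eigenvector of $C$, and the variance attained there is the matching eigenvalue, so the maximum is $\lambda_1$ at $a_1 = u_1$. For the later directions, $\operatorname{cov}(Y_k, Y_i) = a_k^t C u_i = \lambda_i\, a_k^t u_i$, so when $\lambda_i > 0$ the zero-covariance constraint amounts to orthogonality to the eigenvectors already chosen.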

Methodology of Practical PCA

Given observations $x_1, x_2, \ldots, x_n$.

1. Compute $m = \frac{1}{n} \sum_{i=1}^{n} x_i$ by MLE.
2. Compute the covariance matrix $C = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^t$ by MLE.
3. Compute the eigenvalue/eigenvector pairs $(\lambda_i, u_i)$ of $C$.
4. Compute the first $d$ principal components $y_i^{(j)} = x_i^t u_j$ for each observation $x_i$, $1 \le i \le n$, along the direction $u_j$, $j = 1, 2, \ldots, d$.
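The four steps can be sketched in MATLAB roughly as follows. This is a minimal illustration, not the author's script: the data matrix X, the seed, and the names Xc, Lambda, lambda and order are assumptions made up for the example. It uses the 1/n normalization from step 2, whereas the built-in cov(X) normalizes by 1/(n-1).

```matlab
% Sketch of the four steps; X holds one observation per row (n-by-D).
rng(0);                          % fixed seed, for repeatability only
X = randn(100, 3) * [2 0 0; 0 1 0; 0 0 0.5];   % hypothetical data
d = 2;                           % number of components to keep

[n, D] = size(X);
m  = sum(X, 1) / n;              % step 1: mean vector (1-by-D)
Xc = X - repmat(m, n, 1);        % centered observations
C  = (Xc' * Xc) / n;             % step 2: 1/n (MLE) covariance matrix

[U, Lambda] = eig(C);            % step 3: eigenpairs of C
[lambda, order] = sort(diag(Lambda), 'descend');
U = U(:, order);                 % columns ordered so lambda(1) >= lambda(2) >= ...

Y = X * U(:, 1:d);               % step 4: y_i^(j) = x_i' * u_j, one row per observation

% The 1/n variance of each projected column matches its eigenvalue
disp([var(Y, 1); lambda(1:d)']);
```

The two rows printed at the end should agree up to rounding, which is the sample counterpart of $\operatorname{var}(Y_i) = \lambda_i$.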

Images of Characters (figure not reproduced)

Figure 1: The First Two PCP of Data

Figure 2: The 3rd and 4th PCP of Data

Figure 3: The First Two PCP of iris Data

Figure 4: The 3rd and 4th PCP of iris Data

Generate Two Sets of Points in Elongated Regions

    % Script File: pcano.m
    % Generate a set of long shaped data in two categories and
    % show that PCA does not pick up the desired direction
    n=20; d=2; r=5;
    X1=random('Uniform',-r,r,n,1);
    Y1=random('Uniform',0,1,n,1);
    X2=random('Uniform',-r,r,n,1);
    Y2=random('Uniform',-1,0,n,1);
    % 21 points along the x-axis; the original spacing is not
    % recoverable from the slide, so even spacing over [-r, r] is assumed
    Xh=linspace(-r,r,21);
    Yh=zeros(1,21);
    for i=1:n
      X(i,1)=X2(i,1);   X(i,2)=Y2(i,1);
      X(i+n,1)=X1(i,1); X(i+n,2)=Y1(i,1);
    end
    % Find the first principal direction
    [n,d]=size(X);
    C=cov(X);
    [U D]=eig(C);
    L=diag(D);
    L
    U                        % principal component directions
    plot(X1,Y1,'b^',X2,Y2,'ro',Xh,Yh,'g-');
    axis([-r, r, -2 2]); grid;
    legend('The 1st principal direction is [-1.0, 0.0]');
    title('An Example of PCA fails')

Script File: Compute the First K Principal Components

    % Script file: PCA.m
    % Compute the first K Principal Components
    function Y=PCA(X,K)
    [n,d]=size(X);
    C=cov(X);
    [U D]=eig(C);
    L=diag(D);
    [sorted index]=sort(L);            % eigenvalues in ascending order
    proj=zeros(K,d);                   % initiate a projection matrix
    for i=1:K
      proj(i,:)=U(:,index(d-i+1))';    % i-th largest eigenvalue's eigenvector
    end
    Y=(proj*X')';                      % first K principal components
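One possible way to call the function, assuming it is saved as PCA.m on the MATLAB path and that the Statistics Toolbox sample data set fisheriris (which provides meas and species) is installed. The slides do not say how the iris figures were produced, so this is only an illustration.

```matlab
load fisheriris                  % meas: 150-by-4 measurements, species: class labels
Y = PCA(meas, 2);                % 150-by-2 matrix of the first two principal components
gscatter(Y(:,1), Y(:,2), species);
xlabel('1st principal component');
ylabel('2nd principal component');
title('The First Two PCP of iris Data');
```

Each row of Y is one flower projected onto the two eigenvectors of cov(meas) with the largest eigenvalues.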

Figure 5: An Example that PCA Fails (plot title: "An Example of PCA fails"; legend: "The 1st principal direction is [-1.0, 0.0]")

Generate Two Sets of Points from Gaussian Distributions

    % Script File: pcayes.m
    % Generate a set of elliptical-shaped data in two categories and
    % show that PCA really picks up the desired direction
    n=20; d=2;
    X1=random('Normal',2.0,1,n,1);
    Y1=random('Normal',2.0,1,n,1);
    X2=random('Normal',-2.0,1,n,1);
    Y2=random('Normal',-2.0,1,n,1);
    Xh=-4:0.5:4; Yh=-4:0.5:4;
    for i=1:n
      X(i,1)=X2(i,1);   X(i,2)=Y2(i,1);
      X(i+n,1)=X1(i,1); X(i+n,2)=Y1(i,1);
    end
    % Find the first principal direction
    [n,d]=size(X);
    C=cov(X);
    [U D]=eig(C);
    L=diag(D);
    L
    U                        % principal component directions
    plot(X1,Y1,'b^',X2,Y2,'ro',Xh,Yh,'g-');
    axis([-4,4, -4,4]); grid;
    legend('The 1st principal direction is [1.0, 1.0]');
    title('An Example of PCA works')

Figure 6: An Example that PCA Works (plot title: "An Example of PCA works"; legend: "The 1st principal direction is [1.0, 1.0]")

Fundamentals of Linear Discriminant Analysis

Given the training patterns $x_1, x_2, \ldots, x_n$ from $K$ categories, where $n_1 + n_2 + \cdots + n_K = n$. Let the between-class scatter matrix $B$, the within-class scatter matrix $W$, and the total scatter matrix $T$ be defined below.

$$B = \sum_{i=1}^{K} n_i (u_i - u)(u_i - u)^t,$$

where $u_i$ is the mean of the $i$th category and $u = \frac{1}{n} \sum_{i=1}^{n} x_i$,

$$W = \sum_{i=1}^{K} \sum_{x \in \omega_i} (x - u_i)(x - u_i)^t,$$

$$T = \sum_{i=1}^{n} (x_i - u)(x_i - u)^t.$$

Show that $B + W = T$.

Linear discriminant analysis for a dichotomous problem attempts to find an optimal direction $w$ for projection which maximizes Fisher's discriminant ratio

$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{w^t B w}{w^t W w} \qquad (1)$$

where

$y_i = w^t x_i$, $1 \le i \le n_1$; $y_j = w^t x_j$, $n_1 < j \le n$; $n_1 + n_2 = n$

$m_k = w^t u_k$, $k = 1, 2$

$s_1^2 = \frac{1}{n_1} \sum_{i=1}^{n_1} (y_i - m_1)^2$

$s_2^2 = \frac{1}{n_2} \sum_{j=1+n_1}^{n} (y_j - m_2)^2$

The problem could be reduced to solving the generalized eigenvalue problem $Bw = \lambda W w$, where $\lambda = J(w)$.
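The reduction to a generalized eigenvalue problem can be sketched by setting the gradient of the ratio $w^t B w / w^t W w$ to zero. The step below relies on $B$ and $W$ being symmetric and on $w^t W w \neq 0$; it is a short outline rather than a full proof.

```latex
\begin{aligned}
J(w) &= \frac{w^t B w}{w^t W w}, \\
\nabla_w J(w) &= \frac{2\,B w\,(w^t W w) - 2\,W w\,(w^t B w)}{(w^t W w)^2} = 0 \\
&\Longrightarrow\; B w = \frac{w^t B w}{w^t W w}\, W w = J(w)\, W w .
\end{aligned}
```

A maximizer of $J$ is thus a generalized eigenvector of the pair $(B, W)$, and the value of $J$ there is the corresponding eigenvalue, so the best direction belongs to the largest $\lambda$.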

Linear Discriminant Analysis

(1) Let $p(x \mid \omega_i)$ be arbitrary densities with mean vectors $u_i$ and covariance matrices $C_i$ (not necessarily normal) for $i = 1, 2$. Let $y = w^t x$ be a projection, and let the induced densities $p(y \mid \omega_i)$ have means and variances $\mu_i$ and $\sigma_i^2$, respectively.

(a) Show that the criterion function $J_1(w) = (\mu_1 - \mu_2)^2 / (\sigma_1^2 + \sigma_2^2)$ is maximized by $w = (C_1 + C_2)^{-1}(u_1 - u_2)$.
Hint: $E(w^t x \mid \omega_i) = \mu_i$ and $\operatorname{Var}(w^t x \mid \omega_i) = \sigma_i^2$ for $i = 1, 2$.

(b) If the prior probability for $\omega_i$ is denoted by $p_i = P(\omega_i)$, show that $J_2(w) = (\mu_1 - \mu_2)^2 / (p_1 \sigma_1^2 + p_2 \sigma_2^2)$ is maximized by $w = [p_1 C_1 + p_2 C_2]^{-1}(u_1 - u_2)$.

(c) The Fisher linear discriminant function employs a linear function $w^t x$ for which the criterion function $J$ is maximized, where

$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2},$$

$y_j = x_j^t w$ if $x_j \in \omega_1$ and $z_k = x_k^t w$ if $x_k \in \omega_2$, and

$m_1 = \frac{1}{n_1} \sum_{j=1}^{n_1} y_j$, $m_2 = \frac{1}{n_2} \sum_{k=1}^{n_2} z_k$,

$s_1^2 = \frac{1}{n_1} \sum_{j=1}^{n_1} (y_j - m_1)^2$, $s_2^2 = \frac{1}{n_2} \sum_{k=1}^{n_2} (z_k - m_2)^2$.

(d) To which of these criterion functions, $J_1$ or $J_2$, is $J(w)$ more closely related?

(2) Let $A \in R^{n \times n}$ be a positive definite matrix and $x, b \in R^n$, $\beta \in R$. Find the criterion such that $g(x) = \frac{1}{2} x^t A x - b^t x - \beta$ is minimized. What is the minimum value of $g(x)$, $x \in R^n$?

Discriminant Analysis

The objective of this method is to find the optimal set of discriminant vectors in order to separate the predefined classes of objects or events. The material is based on the paper [Duchene and Leclercq, pp.97 93, IEEE Trans. PAMI 19].

Let the between-class scatter matrix $B$, the within-class scatter matrix $W$, and the total scatter matrix $T$ be defined below.

$$B = \sum_{i=1}^{K} n_i (u_i - u)(u_i - u)^t,$$

where $u_i$ is the mean of the $i$th category and $u = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean,

$$W = \sum_{i=1}^{K} \sum_{x \in \omega_i} (x - u_i)(x - u_i)^t,$$

$$T = \sum_{i=1}^{n} (x_i - u)(x_i - u)^t, \quad \text{where } T = B + W.$$

Define the criteria

$$C_1 = \frac{v^t B v}{v^t T v}, \qquad C_2 = \frac{v^t B v}{v^t W v}.$$

The classical discriminant analysis finds an optimal set of discriminant vectors by the following steps.

(1) Look for a unit vector $u_1$ which maximizes $C_2$, where $u_1$ could be the eigenvector corresponding to the largest eigenvalue of $W^{-1} B$.
(2) Look for a unit vector $u_2$ which maximizes $C_2$ subject to $u_2^t W u_1 = 0$.
(3) Look for a unit vector $u_k$ which maximizes $C_2$ subject to $u_k^t W u_j = 0$ for $k \ge 3$, $1 \le j < k$.

$\{u_j\}$ is an optimal set of vectors which best discriminates the patterns. Note that the $u_j$ may not be orthogonal vectors. Duchene and Leclercq suggest that $u_k^t u_j = 0$ and $u_k^t W u_k = 1$ be used for step (3), and showed by experiments that their proposed method improved on the traditional one.
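A minimal MATLAB sketch of the classical procedure, assuming $W$ is nonsingular. The data X, the labels vector, the seed, and the names Xi, Zi, V and order are all made up for illustration. It relies on the fact that generalized eigenvectors of a symmetric pair (B, W) with distinct eigenvalues satisfy $u_k^t W u_j = 0$ for $k \ne j$, which is the constraint in steps (2) and (3). It does not implement the orthogonal variant proposed by Duchene and Leclercq.

```matlab
% Sketch: scatter matrices and classical discriminant vectors.
% X (n-by-D, one pattern per row) and labels (n-by-1, values 1..K) are
% hypothetical inputs made up for this illustration.
rng(0);
X = [randn(30, 3) + 2; randn(40, 3); randn(30, 3) - 2];
labels = [ones(30, 1); 2 * ones(40, 1); 3 * ones(30, 1)];

[n, D] = size(X);
K = max(labels);
u = mean(X, 1);                          % overall sample mean
B = zeros(D);  W = zeros(D);
for i = 1:K
    Xi = X(labels == i, :);              % patterns of category i
    ni = size(Xi, 1);
    ui = mean(Xi, 1);                    % category mean
    B  = B + ni * (ui - u)' * (ui - u);
    Zi = Xi - repmat(ui, ni, 1);
    W  = W + Zi' * Zi;
end
Z = X - repmat(u, n, 1);
T = Z' * Z;
disp(norm(B + W - T));                   % close to zero, up to rounding

% Generalized eigenproblem B*v = lambda*W*v; sort by decreasing lambda
B = (B + B') / 2;  W = (W + W') / 2;     % enforce exact symmetry
[V, L] = eig(B, W);
[lambda, order] = sort(diag(L), 'descend');
V = V(:, order);
u1 = V(:, 1) / norm(V(:, 1));            % unit vector maximizing C2
disp(V(:, 1)' * W * V(:, 2));            % W-orthogonality, close to zero
```

With K = 3 categories B has rank at most 2, so only the first two columns of V carry discriminant information here.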
