Dimensionality Reduction Algorithms: Lecture 15


High-dimensional Spaces: Counter-intuitive properties
Volume of a d-dimensional sphere of radius r:
$V_d(r) = \frac{2\pi^{d/2} r^d}{d\,\Gamma(d/2)}$
where $\Gamma(\cdot)$ is the gamma function.
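As a quick sanity check (not on the original slide), the formula recovers the familiar low-dimensional volumes, using $\Gamma(1) = 1$ and $\Gamma(3/2) = \sqrt{\pi}/2$:

$V_2(r) = \frac{2\pi r^2}{2\,\Gamma(1)} = \pi r^2, \qquad V_3(r) = \frac{2\pi^{3/2} r^3}{3\,\Gamma(3/2)} = \frac{2\pi^{3/2} r^3}{3\cdot\frac{\sqrt{\pi}}{2}} = \frac{4}{3}\pi r^3$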

High-dimensional Spaces: Counter-intuitive properties
(Figure: sphere of radius r with an outermost shell of thickness ε; the inner sphere has radius r − ε.)
Consider the outermost shell of thickness ε. What is the fraction of volume contained in this shell?

High-dimensional Spaces: Counter-intuitive properties
Consider the outermost shell of thickness ε:
$f_d = \frac{V_d(r) - V_d(r-\varepsilon)}{V_d(r)} = \frac{r^d - (r-\varepsilon)^d}{r^d} = 1 - \left(1 - \frac{\varepsilon}{r}\right)^d$

High-dimensional Spaces: Counter-intuitive properties
(Plot: fraction of volume in the outermost shell versus the number of dimensions, for ε/r = 0.02, 0.05, 0.10, and 0.20.)

High-dimensional Spaces: Counter-intuitive properties
Since $f_d = 1 - (1 - \varepsilon/r)^d \to 1$ as d grows, in high-dimensional spaces most of the volume is concentrated around the surface (within ε) of the hypersphere, and the center is essentially void.
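To make the effect concrete, here is a minimal MATLAB sketch (not from the slides) that evaluates $f_d$ for a few dimensions; the shell thickness ratio of 0.05 is an illustrative choice:

    % Fraction of a d-dimensional hypersphere's volume lying in the
    % outermost shell of relative thickness eps/r = 0.05 (hypothetical)
    d = [1 2 5 10 20 50 100];
    eps_over_r = 0.05;
    f = 1 - (1 - eps_over_r).^d;   % f_d = 1 - (1 - eps/r)^d
    disp([d(:) f(:)]);             % by d = 100, f is already ~0.994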

Curse of Dimensionality
- The curse of dimensionality: a term coined by Bellman in 1961. It refers to the problems associated with multivariate data analysis as the dimensionality increases. We will illustrate these problems with a simple example.
- Consider a 3-class pattern recognition problem. A simple approach would be to:
  - Divide the feature space into uniform bins
  - Compute the ratio of examples for each class at each bin, and
  - For a new example, find its bin and choose the predominant class in that bin
- In our toy problem we decide to start with one single feature and divide the real line into 3 segments.
- After doing this, we notice that there is too much overlap among the classes, so we decide to incorporate a second feature to try and improve separability.
(Notes borrowed from Gutierrez-Osuna)

Curse of Dimensionality
- We decide to preserve the granularity of each axis, which raises the number of bins from 3 (in 1D) to 3² = 9 (in 2D).
- At this point we need to make a decision: do we maintain the density of examples per bin, or do we keep the number of examples we had for the one-dimensional case?
  - Choosing to maintain the density increases the number of examples from 9 (in 1D) to 27 (in 2D)
  - Choosing to maintain the number of examples results in a 2D scatter plot that is very sparse
- Moving to three features makes the problem worse (see the sketch below):
  - The number of bins grows to 3³ = 27
  - For the same density of examples the number of needed examples becomes 81
  - For the same number of examples, the 3D scatter plot is almost empty
(Notes borrowed from Gutierrez-Osuna)
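A tiny MATLAB sketch (not from the slides) of how the counts grow; the granularity of 3 bins per axis and the density of 3 examples per bin follow the toy problem above:

    bins_per_axis = 3;
    examples_per_bin = 3;              % keeps the 1-D case at 9 examples
    for d = 1:3
        bins = bins_per_axis^d;
        fprintf('d = %d: %4d bins, %4d examples at fixed density\n', ...
                d, bins, bins * examples_per_bin);
    end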

Curse of Dimensionality
- Obviously, our approach of dividing the sample space into equally spaced bins was quite inefficient. There are other approaches that are much less susceptible to the curse of dimensionality, but the problem still exists.
- How do we beat the curse of dimensionality?
  - By incorporating prior knowledge
  - By providing increasing smoothness of the target function
  - By reducing the dimensionality
- In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve.
- In most cases, the information that is lost by discarding some features is (more than) compensated by a more accurate mapping in the lower-dimensional space.
(Notes borrowed from Gutierrez-Osuna)

Dimensionality Reduction
- Two approaches are available to perform dimensionality reduction:
  - Feature selection: choosing a subset of all the features (the most informative ones):
    $(x_1, x_2, \ldots, x_N) \xrightarrow{\text{feature selection}} (x_{i_1}, x_{i_2}, \ldots, x_{i_M})$
  - Feature extraction: creating a subset of new features by combinations of the existing features:
    $(x_1, x_2, \ldots, x_N) \xrightarrow{\text{feature extraction}} (y_1, y_2, \ldots, y_M) = f(x_1, x_2, \ldots, x_N)$
- The problem of feature extraction can be stated as: given a feature space $x_i \in \mathbb{R}^N$, find a mapping $y = f(x): \mathbb{R}^N \to \mathbb{R}^M$ with $M < N$ such that the transformed feature vector $y_i \in \mathbb{R}^M$ preserves (most of) the information or structure in $\mathbb{R}^N$.
(Notes borrowed from Gutierrez-Osuna)

Dimensionality Reduction
- In general, the optimal mapping y = f(x) will be a non-linear function. However, there is no systematic way to generate non-linear transforms; the selection of a particular subset of transforms is problem dependent.
- For this reason, feature extraction is commonly limited to linear transforms, y = Wx; that is, y is a linear projection of x:
  $\begin{pmatrix} y_1 \\ \vdots \\ y_M \end{pmatrix} = \begin{pmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & \ddots & \vdots \\ w_{M1} & \cdots & w_{MN} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}$
- NOTE: when the mapping is a non-linear function, the reduced space is called a manifold.
- We will focus on linear feature extraction for now, and revisit non-linear techniques if we have time.
(Notes borrowed from Gutierrez-Osuna)

Dimensionality Reduction
- The selection of the feature extraction mapping y = f(x) is guided by an objective function that we seek to maximize (or minimize). Depending on the criterion used by the objective function, feature extraction techniques are grouped into two categories:
  - Signal representation: the goal of the feature extraction mapping is to represent the samples accurately in a lower-dimensional space
  - Classification: the goal of the feature extraction mapping is to enhance the class-discriminatory information in the lower-dimensional space
- Within the realm of linear feature extraction, two techniques are commonly used:
  - Principal Components Analysis (PCA) uses a signal representation criterion
  - Linear Discriminant Analysis (LDA) uses a signal classification criterion
(Notes borrowed from Gutierrez-Osuna)

PCA: Let's build some intuition
Example: data drawn with mean $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and covariance $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$.
Eigenvectors of the covariance matrix:
$v_1 \approx \begin{pmatrix} 0.6154 \\ 0.7882 \end{pmatrix}, \qquad v_2 \approx \begin{pmatrix} -0.7882 \\ 0.6154 \end{pmatrix}$
(Figure: scatter plot of the sampled points in the $(x_1, x_2)$ plane with the two eigenvectors overlaid.)

PCA Recap
- The objective of PCA is to perform dimensionality reduction while preserving as much of the variance in the high-dimensional space as possible.
- Let x be a D-dimensional random vector, represented as a linear combination of orthonormal basis vectors $[\varphi_1\ \varphi_2\ \ldots\ \varphi_D]$.
- To optimally represent x with minimum sum-squared error, we choose the eigenvectors $\varphi_i$ of the covariance matrix corresponding to the largest eigenvalues $\lambda_i$:
  $\Sigma_x \varphi_i = \lambda_i \varphi_i$
  where $\Sigma_x$ is the covariance matrix.
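A minimal MATLAB sketch of this recipe (not from the slides; X, N, D, and M are hypothetical names for the data matrix and sizes):

    % PCA by eigendecomposition of the sample covariance
    % X: N x D data matrix (rows are samples); M: target dimensionality
    Xc = X - mean(X, 1);                    % center the data
    Sigma = (Xc' * Xc) / (size(X,1) - 1);   % D x D sample covariance
    [Phi, Lambda] = eig(Sigma);
    [~, idx] = sort(diag(Lambda), 'descend');
    Phi = Phi(:, idx);                      % largest eigenvalue first
    Y = Xc * Phi(:, 1:M);                   % N x M projection onto top-M components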

PCA Trick: Snapshot PCA
- In some applications (e.g., image processing) the number of dimensions (# of pixels) is far greater than the number of data points (# of images).
- A D-dimensional space and N data points, where N < D, define a linear subspace whose dimensionality is at most N − 1.
- Regular PCA:
  - This implies there is no point applying PCA for values of M > N − 1; D − N + 1 of the eigenvalues will be zero.
  - Dataset X is a D-by-N matrix (i.e., data points as column vectors).
  - The computational cost of finding the D × D eigenvectors is on the order of O(D³):
    $\frac{1}{N-1} X X^T \varphi_i = \lambda_i \varphi_i$

PCA Trick: Snapshot PCA
- (Setting as above: D-by-N dataset X with N < D, at most N − 1 informative components, O(D³) cost for the D × D eigenproblem.)
- Trick (Snapshot PCA): work with the small N × N problem instead:
  $\frac{1}{N-1} X^T X\, v_i = \lambda_i v_i$
- Pre-multiply both sides by X:
  $\frac{1}{N-1} X X^T (X v_i) = \lambda_i (X v_i)$
  which is the original eigenvalue problem.

PCA Trick: Snapshot PCA
- Thus the eigenvectors of the covariance matrix are recovered as
  $\frac{1}{N-1} X X^T \varphi_i = \lambda_i \varphi_i, \qquad \varphi_i = X v_i$
- Note: $X^T X$ is an N × N matrix, so the computational cost is O(N³) < O(D³).

PCA Trick: Snapshot PCA
- Equivalently, if the dataset X is stored as an N-by-D matrix (data points as rows), we solve the small N × N problem $\frac{1}{N-1} X X^T v_i = \lambda_i v_i$ and pre-multiply both sides by $X^T$:
  $\frac{1}{N-1}\, X^T X \,(X^T v_i) = \lambda_i (X^T v_i)$
- $X^T v_i$ is an eigenvector of the original D × D covariance matrix (rescale to obtain unit vectors).
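A short MATLAB sketch of the snapshot trick (not from the slides; X here follows the D-by-N, columns-as-images convention, and the names are illustrative):

    % Snapshot PCA: X is D x N with N << D, columns already mean-centered
    N = size(X, 2);
    [V, Lambda] = eig((X' * X) / (N - 1));   % small N x N eigenproblem, O(N^3)
    Phi = X * V;                             % map back: eigenvectors of XX'/(N-1)
    Phi = Phi ./ vecnorm(Phi);               % rescale columns to unit length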

Singular Value Decomposition
Any m × n matrix A can be written as:
$A_{m\times n} = U_{m\times m}\, W_{m\times n}\, V^T_{n\times n}$
with
$U U^T = I_{m\times m}, \qquad V V^T = I_{n\times n}$
where W is an m × n diagonal matrix carrying the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ on its diagonal and zeros elsewhere.

Singular Value Decomposition
From $A = U W V^T$:
$A A^T = (U W V^T)(U W V^T)^T = U W V^T V W^T U^T = U W W^T U^T$
Post-multiply both sides by U (using $U^T U = I$):
$A A^T U = U\, W W^T$
so the columns of U are the eigenvectors of $A A^T$, with eigenvalues given by the diagonal of $W W^T$ (i.e., $\sigma_i^2$).

Singular Value Decomposition
Similarly:
$A^T A = (U W V^T)^T (U W V^T) = V W^T U^T U W V^T = V W^T W V^T$
Post-multiply both sides by V:
$A^T A\, V = V\, W^T W$
so the columns of V are the eigenvectors of $A^T A$, again with eigenvalues $\sigma_i^2$.
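A quick MATLAB check of these two identities on a random matrix (not from the slides):

    % Verify AA' U = U (W W') and A'A V = V (W' W) numerically
    A = randn(5, 3);
    [U, W, V] = svd(A);
    norm(A*A'*U - U*(W*W'))    % ~0 up to rounding error
    norm(A'*A*V - V*(W'*W))    % ~0 up to rounding error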

Singular Value Decomposition
(Figures, built up over four slides: Gilbert Strang's diagram representing the four fundamental subspaces of a matrix: the column space, the row space, the null space, and the left null space.)

Singular Value Decomposition
(Figures, shown over two slides: intuition for Gilbert Strang's diagram.)

SVD Demo
MATLAB commands:

    load mandrill                                        % image matrix X and colormap map
    [U, W, V] = svd(X);
    figure; imagesc(X); colormap(map);
    Xcap = U(:,1:100) * W(1:100,1:100) * V(:,1:100)';    % rank-100 reconstruction
    figure; imagesc(Xcap); colormap(map);
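As a usage note (not on the slide): the rank-100 approximation needs only 100(m + n + 1) numbers (100 columns of U and of V plus 100 singular values) instead of the m·n pixels of the full image, which is why truncating the SVD compresses the image.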

Linear Discriminant Analysis
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible.
- Assume we have a set of D-dimensional samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $N_1$ of which belong to class $C_1$ and $N_2$ to class $C_2$.
- We seek to obtain a scalar y by projecting the samples x onto a line: $y = w^T x$.
- Of all the possible lines, we would like to select the one that maximizes the separability of the scalars.
- This is illustrated for the two-dimensional case in the following figures.
(Notes borrowed from Gutierrez-Osuna)

LDA: Some intuition first
Samples from two classes, along with the histogram resulting from projection onto the line joining the class means. What's the problem here? (Figure from Bishop, PRML.)

LDA: Some intuition first
Samples from two classes, along with the histogram resulting from the LDA projection. What's the difference? (Figure from Bishop, PRML.)

Linear Discriminant Analysis
- In order to find a good projection vector, we need to define a measure of separation between the projections.
- The mean vector of each class in x-space and y-space is
  $\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x, \qquad \tilde\mu_i = \frac{1}{N_i} \sum_{y \in C_i} y = \frac{1}{N_i} \sum_{x \in C_i} w^T x = w^T \mu_i$
- We could then choose the distance between the projected means as our objective function:
  $J(w) = |\tilde\mu_1 - \tilde\mu_2| = |w^T (\mu_1 - \mu_2)|$
- However, the distance between the projected means is not a very good measure, since it does not take into account the standard deviation within the classes.
(Notes borrowed from Gutierrez-Osuna)

LDA
- The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class scatter.
- For each class we define the scatter, an equivalent of the variance, as
  $\tilde s_i^{\,2} = \sum_{y \in C_i} (y - \tilde\mu_i)^2$
  The quantity $\tilde s_1^{\,2} + \tilde s_2^{\,2}$ is called the within-class scatter of the projected examples.
- The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function
  $J(w) = \frac{(\tilde\mu_1 - \tilde\mu_2)^2}{\tilde s_1^{\,2} + \tilde s_2^{\,2}}$
- Therefore, we are looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.
(Notes borrowed from Gutierrez-Osuna)

LDA
- In order to find the optimum projection w*, we need to express J(w) as an explicit function of w.
- We define measures of the scatter in the multivariate feature space x, the scatter matrices:
  $S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$
  where $S_W$ is called the within-class scatter matrix.
- The scatter of the projection y can then be expressed as a function of the scatter matrix in feature space x:
  $\tilde s_i^{\,2} = \sum_{y \in C_i} (y - \tilde\mu_i)^2 = \sum_{x \in C_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in C_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$
  so $\tilde s_1^{\,2} + \tilde s_2^{\,2} = w^T S_W w$.
- Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:
  $(\tilde\mu_1 - \tilde\mu_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
  The matrix $S_B$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one.
- We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as
  $J(w) = \frac{w^T S_B w}{w^T S_W w}$
(Notes borrowed from Gutierrez-Osuna)

LDA
- To find the maximum of J(w), take the derivative with respect to w and set it to zero:
  $\frac{dJ(w)}{dw} = \frac{(w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w}{(w^T S_W w)^2} = 0 \;\Rightarrow\; (w^T S_W w)\, S_B w = (w^T S_B w)\, S_W w$
- Dividing both sides by the scalar $w^T S_W w$:
  $S_B w = \frac{w^T S_B w}{w^T S_W w}\, S_W w = J(w)\, S_W w \;\Rightarrow\; S_W^{-1} S_B\, w = J(w)\, w$
  which is an eigenvalue problem.
- Solving the generalized eigenvalue problem $S_W^{-1} S_B w = J w$ yields
  $w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} (\mu_1 - \mu_2)$
- This is known as Fisher's Linear Discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension.
(Notes borrowed from Gutierrez-Osuna)
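A minimal MATLAB sketch of the closed-form two-class solution (not from the slides; X1 and X2 are hypothetical N1-by-D and N2-by-D class matrices):

    % Fisher LDA direction for two classes
    mu1 = mean(X1, 1);  mu2 = mean(X2, 1);       % class means, 1 x D
    S1 = (X1 - mu1)' * (X1 - mu1);               % class scatter matrices
    S2 = (X2 - mu2)' * (X2 - mu2);
    Sw = S1 + S2;                                % within-class scatter
    w  = Sw \ (mu1 - mu2)';                      % w* = Sw^{-1} (mu1 - mu2)
    w  = w / norm(w);
    y1 = X1 * w;  y2 = X2 * w;                   % 1-D projections per class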

LDA: Some intuition first (revisited)
Samples from two classes, along with the histogram resulting from the LDA projection. What's the difference? (Figure from Bishop, PRML.)

Multi-class LDA
- Fisher's LDA generalizes very gracefully to C-class problems.
- Instead of one projection y, we now seek (C − 1) projections $[y_1, y_2, \ldots, y_{C-1}]$ by means of (C − 1) projection vectors $w_i$, which can be arranged by columns into a projection matrix $W = [w_1\ w_2\ \cdots\ w_{C-1}]$:
  $y_i = w_i^T x \quad\Leftrightarrow\quad y = W^T x$
- The generalization of the within-class scatter is
  $S_W = \sum_{i=1}^{C} S_i, \qquad S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T, \qquad \mu_i = \frac{1}{N_i} \sum_{x \in C_i} x$
- The generalization for the between-class scatter is
  $S_B = \sum_{i=1}^{C} N_i\, (\mu_i - \mu)(\mu_i - \mu)^T, \qquad \mu = \frac{1}{N} \sum_{x} x = \frac{1}{N} \sum_{i} N_i\, \mu_i$
  where $S_T = S_B + S_W$ is called the total scatter matrix.
(Notes borrowed from Gutierrez-Osuna)

Multi-class LDA
- Similarly, we define the mean vectors and scatter matrices for the projected samples as
  $\tilde\mu_i = \frac{1}{N_i} \sum_{y \in C_i} y, \qquad \tilde\mu = \frac{1}{N} \sum_{y} y$
  $\tilde S_W = \sum_{i=1}^{C} \sum_{y \in C_i} (y - \tilde\mu_i)(y - \tilde\mu_i)^T, \qquad \tilde S_B = \sum_{i=1}^{C} N_i\, (\tilde\mu_i - \tilde\mu)(\tilde\mu_i - \tilde\mu)^T$
- From our derivation for the two-class problem, we can write
  $\tilde S_W = W^T S_W W, \qquad \tilde S_B = W^T S_B W$
- Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C − 1 dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function:
  $J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$
- We will seek the projection matrix W* that maximizes this ratio.

Multi-class LDA
- It can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem (a numerical sketch follows below):
  $W^* = [w_1^*\ w_2^*\ \cdots\ w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \quad\Rightarrow\quad (S_B - \lambda_i S_W)\, w_i^* = 0$
- NOTE: $S_B$ is the sum of C matrices of rank one or less, and the mean vectors are constrained by $\mu = \frac{1}{C} \sum_{i=1}^{C} \mu_i$.
  - Therefore, $S_B$ will be of rank (C − 1) or less.
  - This means that only (C − 1) of the eigenvalues $\lambda_i$ will be non-zero.
- The projections with maximum class-separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$.
(Notes borrowed from Gutierrez-Osuna)
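A short MATLAB sketch of this eigenproblem (not from the slides; Sw and Sb are assumed to be the D × D scatter matrices defined above, C the number of classes, and X an N-by-D data matrix):

    % Multi-class LDA: top C-1 generalized eigenvectors of Sb w = lambda Sw w
    [V, Lambda] = eig(Sb, Sw);                % generalized eigendecomposition
    [~, idx] = sort(diag(Lambda), 'descend');
    Wstar = V(:, idx(1:C-1));                 % at most C-1 non-zero eigenvalues
    Y = X * Wstar;                            % project N x D data to N x (C-1)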

LDA Multiclass Example
(Figure: data projected onto eigenvectors 1 and 2 of $S_W^{-1} S_B$.)

LDA Limitations
- LDA produces at most C − 1 feature projections. If the classification-error estimates establish that more features are needed, some other method must be employed to provide those additional features.
- LDA is a parametric method, since it assumes unimodal Gaussian likelihoods. If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure in the data, which may be needed for classification.
- LDA will fail when the discriminatory information is not in the mean but rather in the variance of the data.
(Notes borrowed from Gutierrez-Osuna)

Trial-by-Trial Variability
How does the odor representation change over trials? (Figure: extracellular recording over 10 trials.)

Dataset
(Figure: the recorded data form a tensor, which is unfolded into a 2-D matrix.)
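As an illustration of unfolding (not from the slides; the tensor dimensions are hypothetical):

    % Unfold a 3-way tensor (e.g., neurons x time x trials) into a 2-D matrix
    T = randn(10, 50, 8);            % hypothetical 10 x 50 x 8 tensor
    X = reshape(T, size(T,1), []);   % 10 x 400 matrix: time/trial modes merged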

PCA vs LDA
(Figure comparing the two projections.)

Canonical Correlation Analysis
- Suppose we have two datasets X (N × D1) and Y (N × D2), and we would like to find basis vectors $w_x$ and $w_y$ such that the correlation between the projections of the variables onto these basis vectors is mutually maximized. Writing the projections as
  $x' = w_x^T x, \qquad y' = w_y^T y$
  the correlation is
  $\rho = \frac{E[x'y']}{\sqrt{E[x'^2]\, E[y'^2]}} = \frac{E[w_x^T x\, y^T w_y]}{\sqrt{E[w_x^T x\, x^T w_x]\; E[w_y^T y\, y^T w_y]}} = \frac{w_x^T C_{xy} w_y}{\sqrt{(w_x^T C_{xx} w_x)(w_y^T C_{yy} w_y)}}$

Canonical Correlation Analysis
CCA: since ρ is invariant to the scaling of $w_x$ and $w_y$, we can equivalently maximize the numerator under unit-variance constraints:
$\arg\max_{w_x, w_y}\ w_x^T C_{xy} w_y$
subject to the constraints
$w_x^T C_{xx} w_x = 1, \qquad w_y^T C_{yy} w_y = 1$

Canonical Correlation Analysis
Form the Lagrangian and take partial derivatives w.r.t. $w_x$ and $w_y$:
$J(w_x, w_y) = w_x^T C_{xy} w_y + \frac{\lambda_x}{2}\left(1 - w_x^T C_{xx} w_x\right) + \frac{\lambda_y}{2}\left(1 - w_y^T C_{yy} w_y\right)$
$\frac{\partial J}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0 \;\Rightarrow\; C_{xy} w_y = \lambda_x\, C_{xx} w_x$
Pre-multiplying by $w_x^T$ and using the constraint:
$w_x^T C_{xy} w_y = \lambda_x\, w_x^T C_{xx} w_x = \lambda_x$
$\frac{\partial J}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0 \;\Rightarrow\; C_{yx} w_x = \lambda_y\, C_{yy} w_y$
Pre-multiplying by $w_y^T$:
$w_y^T C_{yx} w_x = \lambda_y\, w_y^T C_{yy} w_y = \lambda_y$
(Note: $C_{xy}^T = C_{yx}$.)

Canonical Correlation Analysis
- Note that $w_x^T C_{xy} w_y = (w_y^T C_{yx} w_x)^T$, hence $\lambda_x = \lambda_y = \lambda$.
- Also, from $C_{yx} w_x = \lambda\, C_{yy} w_y$:
  $w_y = \frac{1}{\lambda}\, C_{yy}^{-1} C_{yx} w_x$
- Substituting $w_y$ back into $C_{xy} w_y = \lambda\, C_{xx} w_x$:
  $\frac{1}{\lambda}\, C_{xy} C_{yy}^{-1} C_{yx}\, w_x = \lambda\, C_{xx} w_x \;\Rightarrow\; C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\, w_x = \lambda^2 w_x$
  which is an eigenvalue problem.

Canonical Correlation Analysis
- The canonical correlations between x and y can be found by solving the eigenvalue equations (a numerical sketch follows below):
  $C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\, w_x = \rho^2 w_x, \qquad C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}\, w_y = \rho^2 w_y$
- When CCA is performed on a training set X and a dummy (indicator) matrix Y representing group membership, the CCA directions are just the LDA directions.
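A compact MATLAB sketch of the first eigenvalue equation (not from the slides; X and Y are hypothetical N × D1 and N × D2 data matrices):

    % CCA via the eigenproblem Cxx^{-1} Cxy Cyy^{-1} Cyx wx = rho^2 wx
    Xc = X - mean(X, 1);  Yc = Y - mean(Y, 1);
    N = size(X, 1);
    Cxx = (Xc' * Xc) / (N - 1);  Cyy = (Yc' * Yc) / (N - 1);
    Cxy = (Xc' * Yc) / (N - 1);  Cyx = Cxy';
    [Wx, R2] = eig((Cxx \ Cxy) * (Cyy \ Cyx));
    [rho2, idx] = sort(real(diag(R2)), 'descend');
    wx = Wx(:, idx(1));                 % first canonical direction for X
    wy = Cyy \ (Cyx * wx);              % matching direction for Y (up to scale)
    rho = sqrt(rho2(1));                % first canonical correlation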

Canonical Correlation Analysis
(Figure: data projected onto eigenvectors 1 and 2 of $C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}$.)

Relation to other methods
Instead of the two eigenvalue problems, we can formulate the problem as one single eigenvalue equation. (Borga)
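The equation itself is not legible in this transcription; for reference, the standard combined formulation in Borga's treatment couples the two problems into a single generalized eigenvalue equation:

$\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \rho \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}$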

Relation to other methods
(Figure from Borga; not legible in this transcription.)
