Feature Selection: Linear Transformations (y_new = M x_old); Constraint Optimization (insertion)


Feature Selection: Linear Transformations: y_new = M x_old

Constraint Optimization (insertion) (slide 3)
Problem: Given an objective function f(x) to be optimized, let the constraints be given by h_k(x) = c_k. Moving the constants to the left: h_k(x) - c_k = g_k(x). f(x) and g_k(x) must have continuous first partial derivatives.
A solution: Lagrangian multipliers, ∇_x f(x) + Σ_k λ_k ∇_x g_k(x) = 0, or, starting with the Lagrangian L(x, λ) = f(x) + Σ_k λ_k g_k(x), solve ∇_x L(x, λ) = 0.
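As a quick numerical illustration of the recipe above (a minimal sketch with a made-up objective and constraint, solved symbolically with sympy; not part of the original slides):

    # Lagrange-multiplier sketch for a toy problem: maximize f = x + y s.t. x^2 + y^2 = 1.
    # Any f, g with continuous first partial derivatives works the same way.
    import sympy as sp

    x, y, lam = sp.symbols('x y lambda', real=True)
    f = x + y                      # objective
    g = x**2 + y**2 - 1            # constraint g(x, y) = 0

    L = f + lam * g                # Lagrangian L(x, lambda) = f + lambda * g
    stationary = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
    print(stationary)              # two candidates: (1/sqrt(2), 1/sqrt(2)) and (-1/sqrt(2), -1/sqrt(2))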

The Covariance Matrix (insertion) (slide 4)
Definition: Let x = (x_1, ..., x_N)^T ∈ R^N be a real-valued random variable (data vectors), with expectation value (mean) E[x] = μ. We define the covariance matrix Σ_x of a random variable x as Σ_x := E[(x - μ)(x - μ)^T], with matrix elements Σ_ij = E[(x_i - μ_i)(x_j - μ_j)].
Application: estimating E[x] and E[(x - E[x])(x - E[x])^T] from data. We assume m samples of the random variable x ∈ R^N, that is, a set of vectors {x_1, ..., x_m}, or, put into a data matrix, X ∈ R^{N×m}. The maximum-likelihood estimators for μ and Σ_x are
μ_ML = (1/m) Σ_k x_k and Σ_ML = (1/m) Σ_k (x_k - μ_ML)(x_k - μ_ML)^T = (1/m) X X^T for mean-free data.

KLT/PCA Motivation (slide 5)
- Find meaningful directions in correlated data
- Linear dimensionality reduction
- Visualization of higher-dimensional data
- Compression / noise reduction
- PDF estimation
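A minimal numpy sketch of these estimators (toy data and variable names of my own choosing; each column of X is one sample x_k):

    # Maximum-likelihood estimates of mean and covariance from a data matrix
    # X of shape (N, m): each column is one sample x_k in R^N.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 500))                  # toy data, N = 3, m = 500

    mu_ml = X.mean(axis=1, keepdims=True)          # mu_ML = (1/m) * sum_k x_k
    Xc = X - mu_ml                                 # mean-free data
    sigma_ml = (Xc @ Xc.T) / X.shape[1]            # Sigma_ML = (1/m) * X X^T (mean-free X)

    print(mu_ml.ravel())
    print(sigma_ml)                                # compare: np.cov(X, bias=True)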

Karhunen-Loève Transform: 1st Derivation (slide 7)
Problem: Let x = (x_1, ..., x_N)^T ∈ R^N be a feature vector of zero-mean, real-valued random variables. We seek the direction a_1 of maximum variance: y_1 = a_1^T x, for which a_1 is such that E[y_1^2] is maximal, with the constraint a_1^T a_1 = 1. This is a constrained optimization, so we use the Lagrangian:
L(a_1, λ) = E[a_1^T x x^T a_1] - λ (a_1^T a_1 - 1) = a_1^T Σ_x a_1 - λ (a_1^T a_1 - 1), with Lagrange multiplier λ.

Karhunen-Loève Transform (slide 8)
L(a_1, λ) = a_1^T Σ_x a_1 - λ (a_1^T a_1 - 1). For E[y_1^2] to be maximal: ∂L(a_1, λ)/∂a_1 = 0 => Σ_x a_1 - λ a_1 = 0 => a_1 must be an eigenvector of Σ_x with eigenvalue λ. Then E[y_1^2] = a_1^T Σ_x a_1 = λ, so for E[y_1^2] to be maximal, λ must be the largest eigenvalue.
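A small numerical check of this result (toy data of my own making): the projection onto the top eigenvector of Σ_x attains variance λ_max, and no other unit direction exceeds it.

    # The direction of maximum variance is the eigenvector of Sigma_x with the
    # largest eigenvalue, and the variance achieved equals that eigenvalue.
    import numpy as np

    rng = np.random.default_rng(1)
    A_mix = rng.normal(size=(2, 2))
    X = A_mix @ rng.normal(size=(2, 2000))          # (approximately) zero-mean, correlated samples

    sigma_x = (X @ X.T) / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(sigma_x)       # eigenvalues in ascending order
    a1 = eigvecs[:, -1]                              # eigenvector for the largest eigenvalue

    y1 = a1 @ X                                      # y_1 = a_1^T x for every sample
    print(np.mean(y1**2), eigvals[-1])               # projected variance equals lambda_max

    random_dirs = rng.normal(size=(2, 100))
    random_dirs /= np.linalg.norm(random_dirs, axis=0)
    print((random_dirs.T @ sigma_x @ random_dirs).diagonal().max() <= eigvals[-1] + 1e-12)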

Karhunen-Loève Transform (slide 9)
Now let us search for a second direction a_2 such that y_2 = a_2^T x, E[y_2^2] is maximal, a_2^T a_1 = 0 and a_2^T a_2 = 1. A similar derivation, L(a_2, λ) = a_2^T Σ_x a_2 - λ (a_2^T a_2 - 1) with a_2^T a_1 = 0, shows that a_2 must be the eigenvector of Σ_x associated with the second largest eigenvalue λ_2. We can derive N orthonormal directions that maximize the variance: A = [a_1, a_2, ..., a_N] and y = A^T x. The resulting matrix A is known as the Principal Component Analysis (PCA) or Karhunen-Loève transform (KLT): y = A^T x.

Karhunen-Loève Transform: 2nd Derivation (slide 10)
Problem: Let x = (x_1, ..., x_N)^T ∈ R^N be a feature vector of zero-mean, real-valued random variables. We seek a transformation A of x that results in a new set of variables y = A^T x (feature vectors) which are uncorrelated (i.e. E[y_i y_j] = 0 for i ≠ j). Let y = A^T x; then, by definition of the correlation matrix, R_y = E[y y^T] = E[A^T x x^T A] = A^T R_x A. R_x is symmetric, so its eigenvectors are mutually orthogonal.
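A sketch checking the decorrelation property on toy data (my own example, not from the slides): with A's columns the orthonormal eigenvectors of R_x, the transformed features are uncorrelated.

    # Decorrelation property: y = A^T x has a diagonal correlation matrix when the
    # columns of A are orthonormal eigenvectors of R_x.
    import numpy as np

    rng = np.random.default_rng(2)
    C_true = [[4.0, 1.5, 0.0], [1.5, 2.0, 0.5], [0.0, 0.5, 1.0]]
    X = np.linalg.cholesky(C_true) @ rng.normal(size=(3, 5000))   # correlated toy samples

    R_x = (X @ X.T) / X.shape[1]
    _, A = np.linalg.eigh(R_x)            # columns of A: orthonormal eigenvectors of R_x
    Y = A.T @ X                           # y = A^T x for every sample

    R_y = (Y @ Y.T) / Y.shape[1]          # = A^T R_x A
    print(np.round(R_y, 6))               # off-diagonal entries are (numerically) zero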

Karhunen-Loève Transform (slide 11)
I.e. if we choose A such that its columns a_i are the orthonormal eigenvectors of R_x, we get R_y = A^T R_x A = diag(λ_1, λ_2, ..., λ_N). If we further assume R_x to be positive definite, the eigenvalues will be positive. The resulting matrix A is known as the Karhunen-Loève transform (KLT): y = A^T x.

Karhunen-Loève Transform (slide 12)
The Karhunen-Loève transform (KLT): y = A^T x. For mean-free vectors (e.g. replace x by x - E[x]) this process diagonalizes the covariance matrix Σ_x.

KLT Properties: MSE Approximation (slide 13)
We define a new vector x̂ in an m-dimensional subspace (m < N), using only m basis vectors: x̂ = Σ_{i=1..m} y_i a_i, the projection of x onto the subspace spanned by the m used (orthonormal) eigenvectors. Now, what is the expected mean square error between x and its projection x̂?
E[ ||x - x̂||^2 ] = E[ || Σ_{i=m+1..N} y_i a_i ||^2 ] = E[ Σ_i Σ_j (y_i a_i)^T (y_j a_j) ].

KLT Properties: MSE Approximation (slide 14)
E[ ||x - x̂||^2 ] = ... = Σ_{i=m+1..N} E[ y_i^2 ] = Σ_{i=m+1..N} λ_i.
The error is minimized if we choose as basis those eigenvectors corresponding to the m largest eigenvalues of the correlation matrix. Amongst all other possible orthogonal transforms, the KLT is the one leading to minimum MSE. This form of the KLT (applied to mean-free data) is also referred to as Principal Component Analysis (PCA). The principal components are the eigenvectors ordered (descending) by their respective eigenvalue magnitudes.
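A sketch verifying the MSE statement empirically (toy covariance of my own choosing): the mean squared reconstruction error of the rank-m projection equals the sum of the discarded eigenvalues.

    # Reconstruction error of an m-dimensional PCA/KLT projection equals the sum of
    # the eigenvalues that were left out (measured here on the empirical correlation matrix).
    import numpy as np

    rng = np.random.default_rng(3)
    C_true = np.diag([5.0, 2.0, 0.5, 0.1])
    X = np.linalg.cholesky(C_true) @ rng.normal(size=(4, 20000))   # zero-mean toy samples

    R_x = (X @ X.T) / X.shape[1]
    eigvals, A = np.linalg.eigh(R_x)
    order = np.argsort(eigvals)[::-1]                # sort descending
    eigvals, A = eigvals[order], A[:, order]

    m = 2
    A_m = A[:, :m]                                   # keep the top-m eigenvectors
    X_hat = A_m @ (A_m.T @ X)                        # projection onto the subspace
    mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))  # E[ ||x - x_hat||^2 ]
    print(mse, eigvals[m:].sum())                    # the two numbers agree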

KLT Properties (slide 15)
Total variance: let, w.l.o.g., E[x] = 0 and y = A^T x be the KLT (PCA) of x. From the previous definitions we get E[y_i^2] = σ_{y_i}^2 = λ_i, i.e. the eigenvalues of the input covariance matrix are equal to the variances of the transformed coordinates. Selecting the features corresponding to the largest eigenvalues retains the maximal possible total variance (sum of component variances) associated with the original random variables x.

KLT Properties: Entropy (slide 16)
For a random vector y the entropy H_y = -E[ ln p_y(y) ] is a measure of the randomness of the underlying process. Example: for a zero-mean (μ = 0) m-dimensional Gaussian,
H_y = -E[ ln( (2π)^{-m/2} |Σ_y|^{-1/2} exp( -y^T Σ_y^{-1} y / 2 ) ) ]
    = (1/2) E[ y^T Σ_y^{-1} y ] + (1/2) ln |Σ_y| + (m/2) ln 2π
    = (1/2) E[ trace( Σ_y^{-1} y y^T ) ] + (1/2) ln |Σ_y| + (m/2) ln 2π
    = (1/2) trace( I_m ) + (1/2) ln |Σ_y| + (m/2) ln 2π.
Selecting the features corresponding to the largest eigenvalues maximizes the entropy in the remaining features. No wonder: variance and randomness are directly related!
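A numerical check of the entropy formula (toy covariance of my own choosing): the closed form (m/2)(1 + ln 2π) + (1/2) ln|Σ_y| matches a Monte-Carlo estimate of -E[ln p_y(y)].

    # Differential entropy of a zero-mean Gaussian:
    # H = m/2 * (1 + ln(2*pi)) + 0.5 * ln det(Sigma).
    import numpy as np

    rng = np.random.default_rng(4)
    sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
    m = sigma.shape[0]

    h_closed = 0.5 * m * (1.0 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(sigma)[1]

    Y = np.linalg.cholesky(sigma) @ rng.normal(size=(m, 200000))
    inv = np.linalg.inv(sigma)
    log_p = -0.5 * np.sum(Y * (inv @ Y), axis=0) \
            - 0.5 * (m * np.log(2 * np.pi) + np.linalg.slogdet(sigma)[1])
    print(h_closed, -log_p.mean())                   # the two values agree closely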

Computing a PCA (slide 17)
Problem: Given mean-free data X, a set of n feature vectors x_i ∈ R^m, compute the orthonormal eigenvectors a_i of the correlation matrix R_x. There are many algorithms that can compute the eigenvectors of a matrix very efficiently. However, most of these methods can be very unstable in certain special cases. Here we present SVD, a method that is in general not the most efficient one. However, the method can be made numerically stable very easily!

Computing a PCA (slide 18)
Singular Value Decomposition: an excursus into linear algebra (without proofs).

Singular Value Decomposition (slide 19)
SVD (reduced version): For matrices A ∈ R^{m×n} with m ≥ n, there exist matrices U ∈ R^{m×n} with orthonormal columns (U^T U = I), V ∈ R^{n×n} orthogonal (V^T V = I), and Σ ∈ R^{n×n} diagonal, such that A = U Σ V^T. The diagonal values of Σ (σ_1, σ_2, ..., σ_n) are called the singular values. It is customary to sort them: σ_1 ≥ σ_2 ≥ ... ≥ σ_n.

SVD Applications (slide 20)
SVD is an all-rounder! Once you have U, Σ, V, you can use them to:
- Solve linear systems A x = b:
  a) compute the matrix inverse, if A^{-1} exists
  b) handle fewer equations than unknowns
  c) handle more equations than unknowns
  d) if there is no solution: compute the x for which ||A x - b|| = min
  e) compute the rank (numerical rank) of a matrix
- Compute a PCA / KLT
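A quick numpy sketch of the reduced SVD as defined above (toy matrix of my own choosing):

    # Reduced SVD with numpy: A (m x n, m >= n) factors as U (m x n, orthonormal
    # columns) * diag(singular values) * V^T (n x n orthogonal).
    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.normal(size=(6, 3))                     # m = 6, n = 3

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(U.shape, s, Vt.shape)                     # s is sorted: s[0] >= s[1] >= s[2]
    print(np.allclose(U.T @ U, np.eye(3)))          # orthonormal columns of U
    print(np.allclose(Vt @ Vt.T, np.eye(3)))        # V orthogonal
    print(np.allclose(U @ np.diag(s) @ Vt, A))      # A = U Sigma V^T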

SVD: Matrix Inverse A^{-1} (slide 21)
A x = b, with A = U Σ V^T (U, Σ, V exist for all A). If A is square (n×n) and not singular, then A^{-1} exists:
A^{-1} = (U Σ V^T)^{-1} = V Σ^{-1} U^T, with Σ^{-1} = diag(1/σ_1, ..., 1/σ_n).
Computing A^{-1} for a singular A!? Since U, Σ, V always exist, the only problem can originate if some σ_i = 0 or numerically close to zero --> the singular values indicate whether A is singular or not!!

SVD: Rank of a Matrix (slide 22)
- The rank of A is the number of non-zero singular values.
- If there are very small singular values, then A is close to being singular. We can set a threshold t and set σ_i = 0 if σ_i ≤ t; then numeric_rank(A) = #{ σ_i > t }.
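A sketch of both uses (toy matrices of my own choosing): the inverse of a non-singular square A via V Σ^{-1} U^T, and the numerical rank via a singular-value threshold.

    # SVD gives the matrix inverse (for non-singular A) and the numerical rank.
    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.normal(size=(4, 4))
    U, s, Vt = np.linalg.svd(A)
    A_inv = Vt.T @ np.diag(1.0 / s) @ U.T            # A^-1 = V Sigma^-1 U^T
    print(np.allclose(A_inv, np.linalg.inv(A)))

    B = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # rank-1 matrix
    sb = np.linalg.svd(B, compute_uv=False)
    t = max(B.shape) * np.finfo(float).eps * sb[0]   # one common threshold choice
    print(sb, int((sb > t).sum()))                   # numeric_rank(B) = 1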

SVD: Rank of a Matrix (2) (slide 23)
- numeric_rank(A) = #{ σ_i > t }; the rank of A equals dim(Img(A)), and n = dim(Img(A)) + dim(Ker(A)).
- The columns of U corresponding to σ_i ≠ 0 span the range of A.
- The columns of V corresponding to σ_i = 0 span the nullspace of A.

Remember linear mappings A x = b (slide 24)
1) Case A^{-1} exists: A maps R^n one-to-one onto R^n, and A x = b has a unique solution.
2) A is singular: dim(Ker(A)) ≠ 0.

SVD: Solving A x = b (slide 25)
2) A is singular: dim(Ker(A)) ≠ 0. There are an infinite number of different x that solve A x = b!!?? Which one should we choose?? E.g. we can choose the x with ||x|| = min; then we have to search in the space orthogonal to the nullspace.

SVD: Solving ||A x - c|| = min (slide 26)
3) c is not in the range of A:
1) Projecting c onto the range of A results in c*.
2) From all the solutions of A x = c* we choose the x with ||x|| = min.

SVD: Solving ||A x - c|| = min (slide 27)
A x = c: U Σ V^T x = c => x = V Σ^{-1} U^T c. For any A there exist U, Σ, V with A = U Σ V^T and σ_1 ≥ ... ≥ σ_n. Computing Σ^{-1} for a singular A!? --> What to do in Σ^{-1} with 1/0 = ???? (some σ_i set to 0 if σ_i ≤ t). Remember what we need ---->

SVD: Solving ||A x - c|| = min (slide 28)
We need to:
1) project c onto the range of A to obtain c*;
2) from all the solutions of A x = c* choose the x with ||x|| = min, that is, the x in the space orthogonal to the nullspace.
x = V Σ_0^{-1} U^T c, where Σ_0^{-1} contains 1/σ_i for σ_i > t and 0 otherwise.
- The columns of U corresponding to σ_i ≠ 0 span the range of A.
- The columns of V corresponding to σ_i = 0 span the nullspace of A.
Basically, all rows or columns multiplied by 1/0 are irrelevant!! --> so even setting 1/0 := 0 will lead to the correct result.
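A sketch of the thresholded pseudoinverse described here (a made-up singular system): reciprocals of near-zero singular values are set to 0, which projects c onto the range of A and picks the minimum-norm solution; numpy's lstsq agrees.

    # Thresholded pseudoinverse: x = V Sigma_0^-1 U^T c, with 1/sigma_i := 0 when
    # sigma_i <= t. This yields the minimum-norm least-squares solution.
    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],      # singular: row 2 = 2 * row 1
                  [0.0, 1.0, 1.0]])
    c = np.array([1.0, 3.0, 1.0])       # not in the range of A

    U, s, Vt = np.linalg.svd(A)
    t = max(A.shape) * np.finfo(float).eps * s[0]
    s_inv = np.where(s > t, 1.0 / s, 0.0)            # "1/0 := 0"
    x = Vt.T @ (s_inv * (U.T @ c))                   # x = V Sigma_0^-1 U^T c

    x_ref, *_ = np.linalg.lstsq(A, c, rcond=None)    # numpy's min-norm least squares
    print(x, x_ref)                                  # the two solutions agree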

SVD at Work (slide 29)
For linear systems A x = b:
- Case fewer equations than unknowns: fill rows of A with zeros so that m = n.
- Perform SVD on A (with m ≥ n): compute U, Σ, V with A = U Σ V^T.
- Compute a threshold t; in Σ set σ_i = 0 for all σ_i ≤ t, and in Σ^{-1} set 1/σ_i = 0 for all σ_i ≤ t.
- For linear systems: compute the pseudoinverse A^+ = V Σ^{-1} U^T and compute x = A^+ b.

Application: Compute PCA via SVD (slide 30)
Problem: Given mean-free data X, a set of n feature vectors x_i ∈ R^m, compute the orthonormal eigenvectors a_i of the correlation matrix R_x. Now we use SVD:
1. Move the center of mass to the origin: x_i = x_i - μ.
2. Build the data matrix from the mean-free data: X = U Σ V^T.
3. The principal axes are the eigenvectors of the covariance matrix C = (1/n) X X^T, since X X^T = U Σ^2 U^T = U Λ_d U^T.
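A compact sketch of this recipe on toy data (my own example): the SVD of the mean-free data matrix gives the principal axes in U, and λ_i = σ_i^2 / n matches the eigendecomposition of C.

    # PCA via SVD of the mean-free data matrix X (features x samples):
    # principal axes = columns of U, eigenvalues lambda_i = sigma_i^2 / n.
    import numpy as np

    rng = np.random.default_rng(7)
    C_true = [[6.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 1.0]]
    X = rng.normal(size=(3, 1)) + np.linalg.cholesky(C_true) @ rng.normal(size=(3, 1000))

    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)             # 1. move center of mass to origin
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # 2. SVD of the data matrix
    lam_svd = s**2 / n                                 # 3. eigenvalues of C = (1/n) X X^T

    C = (Xc @ Xc.T) / n
    lam_eig = np.sort(np.linalg.eigvalsh(C))[::-1]
    print(np.allclose(lam_svd, lam_eig))               # same spectrum, same principal axes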

Application: Compute PCA via SVD (2) (slide 31)
With the SVD: X X^T = U Σ V^T (U Σ V^T)^T = U Σ V^T (V Σ U^T) = U Σ^2 U^T. Since C = (1/n) X X^T, the eigenvalues compute to λ_i = (1/n) σ_i^2, with λ_i the variance of y_i (= E[y_i^2]) and σ_i from the SVD.

Example: PCA on Images (slide 32)
Assume we have a set of k images (of size N×N). Each image can be seen as an N^2-dimensional point p_i (lexicographically ordered); the whole set can be stored as the matrix X = [p_1, p_2, ..., p_k]. Computing PCA the naïve way: build the correlation matrix X X^T (N^4 elements) and compute the eigenvectors of this matrix: O((N^2)^3). Already for small images (e.g. N = 100) this is far too expensive.

PCA on Images (slide 33)
Now we use SVD:
1. Move the center of mass to the origin: p_i = p_i - μ.
2. Build the data matrix from the mean-free data: X = [p_1, p_2, ..., p_n].
3. The principal axes are the eigenvectors of X X^T = U Σ^2 U^T = U Λ_d U^T.

PCA on Images (slide 34)
[Figure: mean face, example faces, and eigenfaces.] Principal components can be visualized by adding to the mean vector an eigenvector multiplied by a factor (e.g. λ_i).
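A sketch of the same recipe on synthetic "image" data (random images, purely illustrative), showing why the SVD route never has to form the N^2 × N^2 matrix X X^T:

    # Eigen-image sketch on synthetic data: k images of size N x N, stacked as
    # columns of X (N^2 x k). The SVD of X gives the eigen-images directly,
    # without ever building the N^2 x N^2 matrix X X^T.
    import numpy as np

    rng = np.random.default_rng(8)
    N, k = 32, 40                                    # tiny synthetic "faces"
    images = rng.normal(size=(k, N, N))

    X = images.reshape(k, N * N).T                   # lexicographic ordering, one column per image
    mean_image = X.mean(axis=1, keepdims=True)
    Xc = X - mean_image                              # mean-free data

    U, s, _ = np.linalg.svd(Xc, full_matrices=False) # U: N^2 x k, at most k eigen-images
    eigen_images = U.T.reshape(-1, N, N)             # each row of U^T is one principal axis
    print(eigen_images.shape, (s**2 / k)[:5])        # top eigenvalues (component variances)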

PCA Applied to Face Images (slide 35)
Here the faces were normalized in eye distance and eye position. [Figure: mean face, eigenfaces, eigenvalue spectrum as a function of r.] Choosing the subspace dimension r: look at the decay of the eigenvalues as a function of r. A larger r means a lower expected error in the subspace approximation of the data.

Eigenfaces for Face Recognition (slide 36)
In the 90s, the best performing face recognition system! Turk, M. and Pentland, A. (1991). Face recognition using eigenfaces. In Proceedings of Computer Vision and Pattern Recognition, pages 586-591. IEEE.
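Relating back to the choice of the subspace dimension r on slide 35, a small sketch (toy eigenvalue spectrum of my own choosing): pick the smallest r whose cumulative variance fraction reaches a target.

    # Choosing the subspace dimension r from the eigenvalue spectrum:
    # smallest r that retains a given fraction of the total variance.
    import numpy as np

    eigvals = np.array([12.0, 6.0, 2.5, 0.8, 0.4, 0.2, 0.1])   # toy spectrum, sorted descending
    explained = np.cumsum(eigvals) / eigvals.sum()
    target = 0.95
    r = int(np.searchsorted(explained, target) + 1)
    print(explained, r)                                         # r = 4 for this spectrum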

PCA for Face Recognition (slide 37)
[Figure.]

PCA & Discrimination (slide 38)
PCA/KLT does not use any class labels in the construction of the transform. The resulting features may obscure the existence of separate groups.

PCA Summary (slide 39)
- Unsupervised: no assumption about the existence or nature of groupings within the data.
- PCA is similar to learning a Gaussian distribution for the data.
- Optimal basis for compression (if measured via MSE).
- As far as dimensionality reduction is concerned, this process is distribution-free, i.e. it is a mathematical method without an underlying statistical model.
- Extracted features (PCs) often lack intuition.

PCA and Neural Networks (slide 40)
A three-layer NN with linear hidden units, trained as an auto-encoder, develops an internal representation that corresponds to the principal components of the full data set. The transformation F is a linear projection onto a k-dimensional subspace (Duda, Hart and Stork: chapter 10.13.1).
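A small numpy sketch illustrating this statement (my own toy setup with plain gradient descent, not a particular textbook recipe): a linear auto-encoder with a k-unit hidden layer, trained to reconstruct its input, ends up spanning the same subspace as the top-k principal components.

    # Linear auto-encoder vs. PCA: after training, the decoder columns span the
    # same k-dimensional subspace as the top-k eigenvectors of the data covariance.
    import numpy as np

    rng = np.random.default_rng(9)
    C_true = np.diag([9.0, 4.0, 1.0, 0.25])
    X = np.linalg.cholesky(C_true) @ rng.normal(size=(4, 2000))   # zero-mean toy data
    n, k = X.shape[1], 2

    W1 = 0.1 * rng.normal(size=(k, 4))             # encoder
    W2 = 0.1 * rng.normal(size=(4, k))             # decoder
    lr = 0.01
    for _ in range(3000):                          # gradient descent on ||X - W2 W1 X||^2 / n
        R = X - W2 @ (W1 @ X)                      # reconstruction residual
        W2 += lr * (R @ (W1 @ X).T) / n
        W1 += lr * (W2.T @ R @ X.T) / n

    # Compare subspaces via principal angles: cosines close to 1 mean the same subspace.
    pcs = np.linalg.eigh((X @ X.T) / n)[1][:, -k:]            # top-k principal axes
    Q, _ = np.linalg.qr(W2)                                   # orthonormal basis of the decoder span
    print(np.linalg.svd(pcs.T @ Q, compute_uv=False))         # close to [1. 1.]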