PCA SVD LDA MDS, LLE, CCA. Data mining. Dimensionality reduction. University of Szeged. Data mining

Size: px

Start display at page:

Download "PCA SVD LDA MDS, LLE, CCA. Data mining. Dimensionality reduction. University of Szeged. Data mining"

Solomon Cannon
6 years ago
Views:

1 Dimesioality reductio Uiversity of Szeged

2 The role of dimesioality reductio We ca spare computatioal costs (or simply fit etire datasets ito mai memory) if we represet data i fewer dimesios Visualizatio of datasets (i 2 or 3 dimesios) Elimiatio of oise from data, feature selectio Key idea: try to represet data poits i lower dimesios Depedig our ojective fuctio with respect the lower dimesioal represetatio,,,...

3 ricipal Compoet Aalysis Trasform multidimesioal data ito lower dimesios i such a way that we lose as little proportio of the origial variatio of the data as possile Assumptio: data poits of the origial m-dimesioal space lie at (or at least very close to) a m -dimesioal suspace we shall express data poits with respect this suspace What that m -dimesioal suspace might e? We would like to miimize the recostructio error X k (xi xi ) k2 i=1, where xi is a approximatio for poit xi

4 Covariace Remider Quatifies how much radom variales Y ad Z chage together cov (Y, Z ) = E[(Y µy )(Z µz )] µy = 1 i=1 yi ad µz = 1 zi i=1 Colums i, j of data matrix X (i.e. X:,i, X:,j ) ca e regarded as oservatios from two radom variales

5 Scatter ad covariace matrix Scatter matrix: S = (xi µ)(xi µ) k=1 (U)iased covariace matrix: Σ = 1 S (Σ = 1 1 S) cov (X:,1, X:,1 ) cov (X:,1, X:,2 )... cov (X:,1, X:,m ) cov (X:,2, X:,1 ) cov (X:,2, X:,2 )... cov (X:,2, X:,m ) Σ= cov (X, X ). :,i :,j cov (X:,m, X:,1 ) cov (X:,m, X:,m ) Σi,j is the covariace of variales i ad j (cov (X:,i, X:,j )) What values are icluded i the mai diagoal?

6 Characteristics of scatter ad covariace matrices Claim: matrices S ad Σ are symmetric ad positive defiite Bizoyítás. S= (xi µ)(xi µ) =! (xi µ)(xi µ) = S k=1 k=1 a Sa = k=1 (a (xi µ))((xi µ) a) = (a (xi µ))2 k=1 Cosequece: the eigevalues of S ad C are λ1 λ2... λm The m -dimesioal projectio which preserves most of the variatio of the data ca e otaied y projectig data poits usig the eigevectors elogig to the m highest eigevalues of either matrix S (or C ) (proof: see tale)

7 Lagrage multipliers rovides a schema for solvig (o-)liear optimizatio prolems f (x) mi/max such that gi (x) = i {1,..., } Lagrage fuctio: L(x, λ) = f(x) X λi gi (x) i=1 Karush-Kuh-Tucker (KKT) coditios: ecessity coditios for a optimum L(x, λ) = (1) λi gi (x) = i {1,..., } (2) λi (3)

8 ractical issues Its worth hadlig all the features o similar scales mi-max ormalizatio: xi,j = stadardizatio: xi,j = xi,j mi(x,j ) max(x,j ) mi(x,j ) xi,j µj σj How to choose the reduced dimesioality (m )? m m si2 λi = Hit : i=1 i=1 k i=1 λi m = arg mi m t threshold 1 k m i=1 λi m 1 X λi > λj 1 i m m m = arg max j=1 m = arg max (λi λi+1 ) 1 i m 1

9 Summarizig Sutract the mea vector from data X ad also ormalize it somehow Calculate the scatter/covariace matrix of the ormalized data Calculate its eigevalues Form projectio matrix from the eigevectors correspodig to the m largest eigevalues X = X gives the trasformed data X 1 gives a approximatio o the origial positios of the data poits A useful tutorial o

10 Sigular Value Decompositio X = U x Sigma x V t rak(x ) σi ui vi i=1 s s rak(x m ) 2 xij2 = kx kf = σi X = UΣV = i=1 j=1 i=1 Low(er) rak approximatio of X is X = U Σ V We rely o the top m < m largest sigular value of Σ upo recostructig X This is the est possile m -dimesioal approximatio of X if we look for the approximatio which miimizes kx X kfroeius

11 Sigular Value Decompositio ad Eigedecompositio Remider Ay symmetric matrix A is decomposale as A = X ΛX 1, where X = [x1 x2... xm ] comprises of the orthogoal eigevectors of A ad Λ = diag ([λ1 λ2... λm ]) cotaiig the correspodig eigevalues i its mai diagoal. Why? Ay m matrix X ca e uiquely decomposed ito the product of three matrices of the form UΣV where U = [u1 u2... u ] is the orthoormal matrix cosistig of the eigevectors of XX Σ m = diag ( λ1, λ2,..., λm ) Vm m = [v1 v2... vm ] is the orthoormal matrix cosistig of the eigevectors of X X Why? Orthogoal matrices: M M = I (a trasformatio which preserves distace i the trasformed space as well) Why?

12 Relatio etwee ad Froeius-orm Suppose M = Q R, i.e. mij = k 58 M = = pik qkl rlj l 3 m32 = pik qkl rlj The kmk2f = (mij )2 = i j i j k l 2 Also pik qkl rlj = pik qkl rlj pi qm rmj k l k l m From where kmk2f = pik qkl rlj pi qm rmj i j k l m Give tha matrices, Q, R origiate from a decompositio, kmk2f = pik qkk rkj pi q rj = qkk rkj qkk rkj = (qkk )2. i,j,k, j,k Why? The error of the approximatig Xy X = U Σ V is kx X k2 = ku(σ Σ )V k2 = Data (σ miig σ )2 k

13 CUR The drawack of is that a typically sparse matrix is decomposed ito a products of dese matrices (i.e. U ad V ) Oe alterative is to use CUR decompositio This time oly matrix U happes to e dese Matrices C ad R are composed of the rows ad colums of the matrix X, thus they preserve the sparsity of X is uique ulike CUR

14 CUR decompositio producig C ad R Choose k colums (rows) from the data matrix with replacemet A colum (row) ca e selected more tha oce ito C (R) The proaility of selectig a colum (row) should e proportioal to the sum of squared elemets i it Scale the elemets of the selected colum (row) y 1/ kpi

15 CUR decompositio producig U Create W Rkxk from the itersectio of the k selected rows ad colums erform o W such that W = X ΣY Let U = Y (Σ+ )2 X where Σ+ is the pseudoiverse of Σ Σ is diagoal, so calculatig its pseudoiverse is easy Just take the reciprocal of the o-zero elemets i the diagoal Zeros ca e left itact

16 Alie Matrix Casalaca Titaic CUR decompositio example aul Sam Martha Bo Sarah C = U= 7.1 R=

17 Liear Discrimiat Aalysis Trasform data poits ito lower dimesios i such a way that poits of the same class have as little dispersio as possile whereas poits of differet classes mix as little as possile How should we choose w, i.e. the directio of the projectio? µ ~i = w µi µ~1 µ~2 = w (µ1 µ2 ) X s i = (w x ~ µi )2 x with lael i w = arg max J(w) = arg max w w ~ µ1 ~ µ2 2 2 s 1 + s 22 (1)

18 Withi ad outer scatter matrices The withi-scatter matrix of poits with class lael i: X Si = (x µi )(x µi ) x with lael i Aggregated withi scatter matrix: SW = S1 + S2 Scatter of the poits with class lael i: X X s i 2 = (w x w µi )2 = w (x µi )(x µi ) w = w Si w x with lael i Scatter matrix of the poits etwee differet classes: SB = (µ1 µ2 )(µ1 µ2 ) Scatter of the poits etwee differet classes: (µ 1 µ 2 )2 = (w µ1 w µ2 )2 = w SB w

19 A equivalet ojective with Eq. (1) is w = arg maxw J(w) = arg maxw ww SSWB ww w SB w w SW w is the so-called geeralized Rayleigh-coefficiet J(w) is maximal ww SSWB ww = SB w = λsw w 1 1 SW SB w = λw w = SW (µ1 µ2 ) Remider f (x) (x)g (x) = f (x)g (x) f g (x) g 2 (x) x x Ax = 2Ax, give that A = A xx y = xi yi x (i.e. a vector poitig i=1 of x) i the directio

20 vs

21 Multi-Dimesioal Scalig (MDS) Goal: give cotaiig pair-wise cost/distaces of poits fid the positios of the poits for which k xi xj k δij Trasforms multidimesioal poits ito lower dimesios such that the pairwise distaces get preserved as much as possile

22 Locally Liear Emeddig (LLE), ad all assume liear relatioship etwee variales No-liear dimesioality reductio techique Idea: defie the earest eighors for all poits ad defie them as their liear comiatio Wij = 1, és Wij xj k2, such that kxi J(W ) = i=1 j=1 j=1 wij > xj eighors(xi )

23 Caoical Correlatio Aalysis (CCA) Our data poits have two distict represetatios (coordiate systems) Goal: fid a commo coordiate system (with reduced dimesioality) such that the correlatio etwee the trasformed poits get maximized E [xy ] = E [x 2 ] E [y 2 ] wx Cxy w y wx Cxx wx wy Cyy wy ρ= E[wx xy wy ] E[wx xx wx ] E [wy yy wy ] = arg max ρ is idepedet from the legth of wx ad wy arg max ρ = arg max wx Cxy wy " # " # Σxx Σxy x x Σ= =E y y Σyx Σyy

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm