T-61.5060 Algorithmic methods for data mining
Slide set 6: dimensionality reduction

reading assignment

LRU book: sections 11.1-11.3
PCA tutorial in MyCourses (optional)

optional papers:
- An Elementary Proof of a Theorem of Johnson and Lindenstrauss, Dasgupta and Gupta
- Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Achlioptas
- Random projection in dimensionality reduction: Applications to image and text data, Bingham and Mannila

the curse of dimensionality

the efficiency of many algorithms depends on the number of dimensions d:
- distance / similarity computations are at least linear in the number of dimensions
- index structures fail as the dimensionality of the data increases
- data in high dimensions is difficult to visualize

what if we were able to...
...reduce the dimensionality of the data, while maintaining the meaningfulness of the data?

dimensionality reduction

consider a dataset X consisting of n points in a d-dimensional space
each data point x in X is a vector in R^d
the data can be seen as an n x d matrix

    X = ( x_11 ... x_1d )
        ( ...       ... )
        ( x_n1 ... x_nd )

dimensionality-reduction methods:
- dimension selection: choose a subset of the existing dimensions
- dimension composition: create new dimensions by combining existing ones

dimensionality reduction

dimensionality-reduction methods:
- dimension selection: choose a subset of the existing dimensions
- dimension composition: create new dimensions by combining existing ones

both methodologies map each vector x in R^d to a vector y in R^k
mapping: A : R^d -> R^k
for the idea to be useful we want k << d

linear dimensionality reduction

dimensionality-reduction mapping: A : R^d -> R^k
assume that A is a linear mapping; it can be seen as a (d x k) matrix
y = x A, so Y = X A
objective: Y should be as close as possible to X

closeness: pairwise distances

Johnson-Lindenstrauss lemma: consider a dataset X of n points in R^d, and ε > 0;
then there exists k = O(ε^-2 log n) and a linear mapping A : R^d -> R^k such that for all x and z in X

    (1-ε) ||x-z||^2 <= (d/k) ||xA-zA||^2 <= (1+ε) ||x-z||^2

what is the intuitive interpretation of this statement?

Johnson-Lindenstrauss lemma: intuition

each vector x in X is projected onto a k-dimensional vector y = xA
the dimension of the projected space is k = O(ε^-2 log n)
the squared distance ||x-z||^2 is approximated by (d/k) ||xA-zA||^2

intuition: the expected squared norm of the projection of a unit vector onto a random k-dimensional subspace is k/d,
and the probability that it deviates from its expectation is very small

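to make this intuition concrete, here is a small numpy sketch (an illustration added here, not part of the original slides) that projects a fixed unit vector onto many random k-dimensional subspaces and checks that the average squared norm of the projection is close to k/d:

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, trials = 100, 10, 2000

    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                 # a unit vector in R^d

    sq_norms = []
    for _ in range(trials):
        # orthonormal basis of a random k-dimensional subspace of R^d,
        # obtained via QR factorization of a random Gaussian matrix
        Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
        sq_norms.append(np.sum((Q.T @ x) ** 2))   # squared norm of the projection

    print(np.mean(sq_norms))               # close to k/d = 0.1
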
the random projections

each vector x in X is projected onto a k-dimensional vector y = xA
random projections are represented by a linear transformation matrix A: y = x A
what is the matrix A?

the random projections

the elements A(i,j) of A can be drawn from the normal distribution N(0,1);
the resulting rows of A define random directions in R^d

another way to define A is [Achlioptas 2003]:

    A(i,j) = sqrt(3) * ( +1 with prob. 1/6
                          0 with prob. 2/3
                         -1 with prob. 1/6 )

why is this useful?

all zero-mean, unit-variance distributions for A(i,j) would give a mapping that satisfies the Johnson-Lindenstrauss lemma

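a rough sketch of such a random projection with the sparse Achlioptas matrix (added here for illustration; the extra scaling by 1/sqrt(d) is an assumption made so that the d/k factor in the lemma as stated above comes out right):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, k = 50, 1000, 100

    X = rng.standard_normal((n, d))        # n data points in R^d

    # sparse Achlioptas entries: +1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6,
    # scaled by sqrt(3/d) so that (d/k)||xA - zA||^2 estimates ||x - z||^2
    A = np.sqrt(3.0 / d) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])

    Y = X @ A                              # projected points in R^k

    # check the distortion of one pairwise squared distance
    orig = np.sum((X[0] - X[1]) ** 2)
    proj = (d / k) * np.sum((Y[0] - Y[1]) ** 2)
    print(orig, proj)                      # close to each other for k this large
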
datasets as matrices

consider a dataset in the form of an n x d matrix X:
n objects as rows, d dimensions as features
X(i,j) represents the importance of feature j for object i

goals:
- understand the structure of the data, e.g., the underlying process that generates the data
- reduce the number of features representing the data

motivating examples

- find a subset of products that characterize customers
- find a subset of groups that characterize users of a social network
- find a subset of terms that accurately clusters documents

principal component analysis

idea: look for a direction such that the data projected onto it has maximal variance
when it is found, continue by seeking the next direction, which is orthogonal to it (i.e., uncorrelated) and which explains as much of the remaining variance in the data as possible
thus, we are seeking linear combinations of the original variables
if we are lucky, we can find a few such linear combinations, or directions, or (principal) components, which describe the data accurately
the aim is to capture the intrinsic variability in the data

principal component analysis

[figure: a point cloud with its 1st and 2nd principal components drawn as directions]

principal component analysis

consider X to be the n x d data matrix
assume that X is zero-centered (each column sums to 0)
let w define the projection we are looking for (a d x 1 vector; we require w^T w = 1)
we want the projection of the data on w to maximize the variance
the projection of a data point x on w is x·w
the projection of the data X on w is Xw

zero-centered data

[figure: the data shifted so that its mean lies at the origin 0]

principal component analysis

the projection of the data X on w is Xw
variance: Var(w) = (Xw)^T (Xw) = w^T X^T X w = w^T C w,
where C = X^T X is the covariance matrix of the data

maximize w^T C w subject to the constraint w^T w = 1
maximize f = w^T C w - λ (w^T w - 1), where λ is the Lagrange multiplier

principal component analysis

optimization problem: maximize f = w^T C w - λ (w^T w - 1)
differentiating with respect to w gives 2Cw - 2λw = 0,
i.e., the eigenvalue equation Cw = λw, where C = X^T X
but the eigenvalues of C are the squared singular values of X

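as a quick numerical check (an added illustration, not from the slides), one can solve the eigenvalue equation with numpy and verify that the top eigenvector of C indeed maximizes the projected variance w^T C w:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 5))
    X = X - X.mean(axis=0)                 # zero-center each column

    C = X.T @ X                            # covariance matrix (up to scaling)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
    w = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

    print(w @ C @ w, eigvals[-1])          # equal: the variance on w is lambda_max

    # any other unit vector yields a smaller variance
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    print(v @ C @ v <= eigvals[-1])        # True
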
recall: singular value decomposition (SVD)

every n x d matrix X can be decomposed in the form X = U Σ V^T, where
- U is an orthogonal matrix containing the left singular vectors of X
- V is an orthogonal matrix containing the right singular vectors of X
- Σ is a diagonal matrix containing the singular values of X (σ_1 >= σ_2 >= ...)

an extremely useful tool for analyzing data

singular value decomposition

[figure: X = U Σ V^T drawn as blocks, with the leading dimensions/objects marked "significant" and the remaining ones marked "noise"]

X_k = U_k Σ_k V_k^T is the best rank-k approximation of X

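a minimal numpy sketch (added here, not part of the original slides) of the truncated SVD X_k = U_k Σ_k V_k^T:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.standard_normal((100, 40))

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt

    k = 10
    Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # best rank-k approximation

    # the Frobenius error of the best rank-k approximation equals the
    # square root of the sum of the discarded squared singular values
    print(np.linalg.norm(X - Xk, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
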
principal component analysis

we showed that the principal components are obtained from the SVD of X; in particular:
- the i-th principal component of X is the i-th right singular vector of X
- the variance on the i-th principal component is exactly the i-th singular value squared (σ_i^2)

rule of thumb: consider k principal components so that you capture about 85% of the variance of the original data (can be estimated using the singular values)

principal component analysis

what we saw so far: PCA is SVD on centered data

how not to compute PCA:
center the data to get X, form C = X^T X, and solve the eigenproblem
why? (forming X^T X squares the condition number of X, so small singular values are lost to floating-point round-off)

how to compute PCA:
center the data to get X, do SVD on X

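putting the recipe together, here is a minimal PCA sketch (an added illustration, assuming a plain dense numpy array): center the data, run SVD, and keep the smallest k that captures about 85% of the variance, following the rule of thumb above:

    import numpy as np

    def pca(X, var_threshold=0.85):
        """PCA via SVD on the centered data matrix."""
        Xc = X - X.mean(axis=0)                       # center the data
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

        # variance captured by component i is proportional to s[i]**2
        explained = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(explained, var_threshold)) + 1

        components = Vt[:k]                           # rows: principal components
        Y = Xc @ components.T                         # data projected on k components
        return Y, components, explained[k - 1]

    rng = np.random.default_rng(4)
    X = rng.standard_normal((300, 20)) @ rng.standard_normal((20, 20))
    Y, comps, captured = pca(X)
    print(Y.shape, captured)                          # captured >= 0.85
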
example of PCA

PCA is used a lot for data visualization
example: spatial data analysis
data: 9000 dialect words, 500 counties in Finland
word-county matrix X with
X(i,j) = 1 if word i appears in county j, and 0 otherwise
apply PCA on X

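as a toy version of this setup (invented data, just to show the mechanics; pca() refers to the sketch defined earlier, not to the course's actual analysis):

    import numpy as np

    rng = np.random.default_rng(5)
    n_words, n_counties = 200, 30

    # toy binary word-county matrix: X[i, j] = 1 if word i appears in county j
    X = (rng.random((n_words, n_counties)) < 0.2).astype(float)

    Y, comps, captured = pca(X)            # pca() from the earlier sketch
    print(Y.shape, captured)
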
example of PCA

data points: words; variables: counties
each principal component tells which counties explain the most significant part of the variation left in the data
the first principal component is essentially just the number of words in each county!
after this, the geographical structure of the principal components is apparent
note: PCA knows nothing of the geography of the counties

applications of PCA

- data visualization and exploration
- data compression
- outlier detection
- ...

random projections vs. PCA

different objectives:
- random projections preserve distances
- PCA finds the directions of maximum variance in the data

PCA involves SVD, which is very inefficient for large data
random projections can be implemented very efficiently, especially the sparse variants

random projections vs. PCA

[Figure 1: the error produced by RP, SRP, PCA, and DCT on image data, with 95% confidence intervals over 100 pairs of data vectors, as a function of the reduced dimension]
[Figure 2: the number of Matlab floating-point operations (flops) needed when reducing the dimensionality of image data using RP, SRP, PCA, and DCT, on a logarithmic scale, as a function of the reduced dimension]
[Bingham and Mannila 2001]

thanks: the slides on PCA were adapted from slides by Saara Hyvönen