T Algorithmic methods for data mining. Slide set 6: dimensionality reduction
1 T Algorithmic methods for data mining. Slide set 6: dimensionality reduction
2 reading assignment. LRU book; PCA tutorial in mycourses (optional). optional: An Elementary Proof of a Theorem of Johnson and Lindenstrauss, Dasgupta and Gupta; Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Achlioptas; Random projection in dimensionality reduction: Applications to image and text data, Bingham and Mannila. T slide set 6: dimensionality reduction 2
3 the curse of dimensionality. the efficiency of many algorithms depends on the number of dimensions d; distance / similarity computations are at least linear in the number of dimensions; index structures fail as the dimensionality of the data increases; data in large dimensions is difficult to visualize
4 what if we were able to... reduce the dimensionality of the data, while maintaining the meaningfulness of the data?
5 dimensionality reduction. consider a dataset X consisting of n points in a d-dimensional space; a data point x in X is a vector in R^d; the data can be seen as an n x d matrix X = [ x_11 ... x_1d ; ... ; x_n1 ... x_nd ]. dimensionality-reduction methods: dimension selection: choose a subset of the existing dimensions; dimension composition: create new dimensions by combining existing ones
6 dimensionality reduction. dimensionality-reduction methods: dimension selection: choose a subset of the existing dimensions; dimension composition: create new dimensions by combining existing ones. both methodologies map each vector x in R^d to a vector y in R^k; mapping: A : R^d → R^k; for the idea to be useful we want k << d
7 linear dimensionality reduction. dimensionality-reduction mapping: A : R^d → R^k; assume that A is a linear mapping: it can be seen as a (d x k) matrix; y = x A, so Y = X A. objective: Y should be as close as possible to X
8 closeness: pairwise distances. Johnson-Lindenstrauss lemma: consider a dataset X of n points in R^d, and ε > 0; then there exists k = O(ε⁻² log n) and a linear mapping A : R^d → R^k such that, for all x and z in X, (1−ε) ‖x−z‖² ≤ (d/k) ‖xA−zA‖² ≤ (1+ε) ‖x−z‖²
9 closeness: pairwise distances. Johnson-Lindenstrauss lemma: consider a dataset X of n points in R^d, and ε > 0; then there exists k = O(ε⁻² log n) and a linear mapping A : R^d → R^k such that, for all x and z in X, (1−ε) ‖x−z‖² ≤ (d/k) ‖xA−zA‖² ≤ (1+ε) ‖x−z‖². what is the intuitive interpretation of this statement?
10 Johnson-Lindenstrauss lemma: intuition. each vector x in X is projected onto a k-dimensional vector y = xA; the dimension of the projected space is k = O(ε⁻² log n); the sq. distance ‖x−z‖² is approximated by (d/k) ‖xA−zA‖². intuition: the expected sq. norm of the projection of a unit vector onto a random subspace is k/d, and the probability that it deviates from its expectation is very small
11 the random projections. each vector x in X is projected onto a k-dimensional vector y = xA; random projections are represented by a linear transformation matrix A: y = x A. what is the matrix A?
12 the random projections. the elements A(i,j) of A can be drawn from the normal distribution N(0,1); the resulting columns of A define random directions in R^d. another way to define A is ([Achlioptas 2003]): A(i,j) = √3 × { +1 with prob. 1/6; 0 with prob. 2/3; −1 with prob. 1/6 }. why is this useful? all zero-mean, unit-variance distributions for A(i,j) would give a mapping that satisfies the Johnson-Lindenstrauss lemma
13 datasets as matrices. consider a dataset in the form of an n x d matrix X: n objects as rows, d dimensions as features; X(i,j) represents the importance of feature j for object i. goal: understand the structure of the data, e.g., the underlying process that generates the data; reduce the number of features representing the data
14 motivating examples. find a subset of products that characterize customers; find a subset of groups that characterize users of a social network; find a subset of terms that accurately clusters documents
15 principal component analysis. idea: look for a direction such that the data projected onto it has maximal variance; when found, continue by seeking the next direction, orthogonal to this one (i.e., uncorrelated), which explains as much of the remaining variance in the data as possible. thus, we are seeking linear combinations of the original variables; if we are lucky, we can find a few such linear combinations, or directions, or (principal) components, which describe the data accurately. the aim is to capture the intrinsic variability in the data
16 principal component analysis (illustration)
17 principal component analysis: 1st principal component (illustration)
18 principal component analysis: 1st and 2nd principal components (illustration)
19 principal component analysis. consider X to be the n x d data matrix; assume that X is zero centered (each column sums to 0). let w define the projection we are looking for (a d x 1 vector; we require wᵀw = 1): the projection of the data on w should maximize the variance. the projection of a data point x on w is x·w; the projection of the data X on w is Xw
20-23 zero-centered data (illustrations)
24 principal component analysis. the projection of the data X on w is Xw. variance: Var(w) = (Xw)ᵀ(Xw) = wᵀXᵀXw = wᵀCw, where C = XᵀX is the covariance matrix of the data. maximize wᵀCw subject to the constraint wᵀw = 1: maximize f = wᵀCw − λ(wᵀw − 1), where λ is the Lagrange multiplier
25 principal component analysis. optimization problem: maximize f = wᵀCw − λ(wᵀw − 1). differentiating with respect to w gives 2Cw − 2λw = 0, i.e., the eigenvalue equation Cw = λw, where C = XᵀX. but the eigenvalues of C are the squared singular values of X
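To make the conclusion concrete, here is a small NumPy check (a sketch on made-up data, not part of the slides): the eigenvector of C with the largest eigenvalue attains at least as much projected variance as any other unit vector.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))
Xc = X - X.mean(axis=0)            # zero-center each column
C = Xc.T @ Xc                      # C = X^T X, as on the slide

def var_along(w):
    """Variance of the projection Xw for a unit vector w."""
    w = w / np.linalg.norm(w)      # enforce the constraint w^T w = 1
    return float(w @ C @ w)

# the maximizer is the eigenvector of the largest eigenvalue of C
eigvals, eigvecs = np.linalg.eigh(C)
w_star = eigvecs[:, -1]            # eigh returns eigenvalues in ascending order

# no randomly chosen direction beats the top eigenvector
best_random = max(var_along(rng.standard_normal(5)) for _ in range(1000))
print(var_along(w_star) >= best_random)
```

The variance attained by `w_star` equals the largest eigenvalue of C, exactly as the Lagrange-multiplier argument predicts.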
26 recall: singular value decomposition (SVD). every n x d matrix X can be decomposed in the form X = U Σ Vᵀ, where U is an orthogonal matrix containing the left singular vectors of X; V is an orthogonal matrix containing the right singular vectors of X; Σ is a diagonal matrix containing the singular values of X (σ₁ ≥ σ₂ ≥ ...). an extremely useful tool for analyzing data
27 singular value decomposition X = U Σ Vᵀ (block picture: a significant part, spanned by the top singular vectors of the objects and dimensions, plus noise). X_k = U_k Σ_k V_kᵀ is the best rank-k approximation of X
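A quick NumPy sketch of this picture (synthetic data, sizes chosen arbitrarily): a noisy low-rank matrix, its SVD, and the best rank-k approximation X_k = U_k Σ_k V_kᵀ obtained by keeping only the k largest singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 60, 40, 5

# rank-k "signal" plus small "noise", as in the block picture
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T

# best rank-k approximation: keep only the k largest singular values
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

print(np.linalg.norm(X - X_k))   # small: essentially only the noise is dropped
print(s[: k + 2])                # sharp drop after the k-th singular value
```

The gap between the k-th and (k+1)-th singular values separates the significant part from the noise, which is what the 85%-of-variance rule of thumb on the next slide exploits.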
28 principal component analysis. we showed that the principal components are given by the SVD of X; in particular, the i-th principal component of X is the i-th right singular vector of X, and the variance on the i-th principal component is exactly the i-th singular value squared (σᵢ²). rule of thumb: consider k principal components so that you capture about 85% of the variance of the original data (can be estimated using the singular values)
29-31 principal component analysis. what we saw so far: PCA is SVD on centered data. how not to compute PCA: center the data to get X, form C = XᵀX, and solve the eigen-problem. why? how to compute PCA: center the data to get X, do SVD on X
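A minimal sketch of this recipe in NumPy (synthetic data; the variable names are my own), also checking that the squared singular values match the eigenvalues of C = XᵀX:

```python
import numpy as np

rng = np.random.default_rng(3)
X_raw = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))

# step 1: center the data
X = X_raw - X_raw.mean(axis=0)

# step 2: SVD on X -- the right singular vectors are the principal components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt                   # i-th row: i-th principal component
variances = s ** 2                # variance along each component

# sanity check against the eigen-route
eig = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
print(np.allclose(eig, variances))

# fraction of variance captured by the first k components (85% rule of thumb)
k = 3
print(variances[:k].sum() / variances.sum())
```

As for the "why?" on the slide: forming C = XᵀX explicitly squares the condition number of X, so the small principal components are computed less accurately than with an SVD on X directly.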
32 example of PCA. PCA is used a lot for data visualization. example: spatial data analysis. data: 9000 dialect words, 500 counties in Finland; word-county matrix X with X(i,j) = 1 if word i appears in county j, 0 otherwise; apply PCA on X
33 example f PCA data pints: wrds; variables: cunties each principal cmpnent tells which cunties explain the mst significant part f the variatin left in the data the first principal cmpnent is essentially just the number f wrds in each cunty! after this, gegraphical structure f principal cmpnents is apparent nte: PCA knws nthing f the gegraphy f the cunties T slide set 6: dimensinality reductin 25
36 applications of PCA: data visualization and exploration; data compression; outlier detection; ...
37 random projections vs. PCA. different objectives: random projections preserve distances; PCA finds directions of maximum variance in the data. PCA involves the SVD, which is very inefficient for large data; random projections can be implemented very efficiently, especially the sparse variants
38 random projections vs. PCA [Bingham and Mannila 2001]. Figure 1: the error produced by RP, SRP, PCA and DCT on image data, with 95% confidence intervals over 100 pairs of data vectors, as a function of the reduced dimension. Figure 2: the number of Matlab floating-point operations needed when reducing the dimensionality of image data using RP, SRP, PCA and DCT, on a logarithmic scale
39 thanks: slides on PCA adapted from slides by Saara Hyvönen
More informationEngineering Decision Methods
GSOE9210 vicj@cse.unsw.edu.au www.cse.unsw.edu.au/~gs9210 Maximin and minimax regret 1 2 Indifference; equal preference 3 Graphing decisin prblems 4 Dminance The Maximin principle Maximin and minimax Regret
More informationComputational modeling techniques
Cmputatinal mdeling techniques Lecture 11: Mdeling with systems f ODEs In Petre Department f IT, Ab Akademi http://www.users.ab.fi/ipetre/cmpmd/ Mdeling with differential equatins Mdeling strategy Fcus
More informationEquilibrium of Stress
Equilibrium f Stress Cnsider tw perpendicular planes passing thrugh a pint p. The stress cmpnents acting n these planes are as shwn in ig. 3.4.1a. These stresses are usuall shwn tgether acting n a small
More informationA New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation
III-l III. A New Evaluatin Measure J. Jiner and L. Werner Abstract The prblems f evaluatin and the needed criteria f evaluatin measures in the SMART system f infrmatin retrieval are reviewed and discussed.
More informationCMSC 425: Lecture 9 Basics of Skeletal Animation and Kinematics
CMSC 425: Lecture 9 Basics f Skeletal Animatin and Kinematics Reading: Chapt f Gregr, Game Engine Architecture. The material n kinematics is a simplificatin f similar cncepts develped in the field f rbtics,
More informationMargin Distribution and Learning Algorithms
ICML 03 Margin Distributin and Learning Algrithms Ashutsh Garg IBM Almaden Research Center, San Jse, CA 9513 USA Dan Rth Department f Cmputer Science, University f Illinis, Urbana, IL 61801 USA ASHUTOSH@US.IBM.COM
More informationFloating Point Method for Solving Transportation. Problems with Additional Constraints
Internatinal Mathematical Frum, Vl. 6, 20, n. 40, 983-992 Flating Pint Methd fr Slving Transprtatin Prblems with Additinal Cnstraints P. Pandian and D. Anuradha Department f Mathematics, Schl f Advanced
More informationSurface and Contact Stress
Surface and Cntact Stress The cncept f the frce is fundamental t mechanics and many imprtant prblems can be cast in terms f frces nly, fr example the prblems cnsidered in Chapter. Hwever, mre sphisticated
More informationAssessment Primer: Writing Instructional Objectives
Assessment Primer: Writing Instructinal Objectives (Based n Preparing Instructinal Objectives by Mager 1962 and Preparing Instructinal Objectives: A critical tl in the develpment f effective instructin
More informationA Correlation of. to the. South Carolina Academic Standards for Mathematics Precalculus
A Crrelatin f Suth Carlina Academic Standards fr Mathematics Precalculus INTRODUCTION This dcument demnstrates hw Precalculus (Blitzer), 4 th Editin 010, meets the indicatrs f the. Crrelatin page references
More informationMultiple Source Multiple. using Network Coding
Multiple Surce Multiple Destinatin Tplgy Inference using Netwrk Cding Pegah Sattari EECS, UC Irvine Jint wrk with Athina Markpulu, at UCI, Christina Fraguli, at EPFL, Lausanne Outline Netwrk Tmgraphy Gal,
More informationmaking triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y=
Intrductin t Vectrs I 21 Intrductin t Vectrs I 22 I. Determine the hrizntal and vertical cmpnents f the resultant vectr by cunting n the grid. X= y= J. Draw a mangle with hrizntal and vertical cmpnents
More informationTechnical Bulletin. Generation Interconnection Procedures. Revisions to Cluster 4, Phase 1 Study Methodology
Technical Bulletin Generatin Intercnnectin Prcedures Revisins t Cluster 4, Phase 1 Study Methdlgy Release Date: Octber 20, 2011 (Finalizatin f the Draft Technical Bulletin released n September 19, 2011)
More informationExample 1. A robot has a mass of 60 kg. How much does that robot weigh sitting on the earth at sea level? Given: m. Find: Relationships: W
Eample 1 rbt has a mass f 60 kg. Hw much des that rbt weigh sitting n the earth at sea level? Given: m Rbt = 60 kg ind: Rbt Relatinships: Slutin: Rbt =589 N = mg, g = 9.81 m/s Rbt = mrbt g = 60 9. 81 =
More information