Advanced data analysis
Akisato Kimura (木村昭悟)
NTT Communication Science Laboratories
E-mail: akisato@ieee.org
Advanced data analysis
1. Introduction (Aug 20)
2. Dimensionality reduction (Aug 20, 21): PCA, LPP, FDA, CCA, PLS
3. Non-linear methods (Aug 27): kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering (Aug 28): k-means, spectral clustering
5. Generalization (Sep 3)
Class web page
http://www.brl.ntt.co.jp/people/akisato/titech/class.html
Slides and data will be uploaded on this page.
Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: k-means, spectral clustering
5. Generalization
Curse of dimensionality
Data samples: $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $d \gg 1$.
If your data samples are high-dimensional, they are often too complex to analyze directly. The usual geometric intuition applies only to low-dimensional problems, and such intuition can even be misleading in high-dimensional problems.
Curse of dimensionality (cont.)
When the dimensionality $d$ increases:
The volume of the unit hyper-cube $V_c$ is always 1.
The volume of the inscribed hyper-sphere $V_s$ goes to 0.
So the relative size of the hyper-sphere gets small: $V_s / V_c \to 0$, in contradiction to our geometric intuition.

  d   |  V_s
  1   |  1
  2   |  $\pi \cdot 0.5^2 \approx 0.79$
  3   |  $(4\pi/3) \cdot 0.5^3 \approx 0.52$
  d   |  $\pi^{d/2} / \Gamma(d/2 + 1) \cdot 0.5^d \to 0$
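As a quick numerical check (a minimal Python sketch using only the standard library; not part of the original slides), the closed-form volume above can be evaluated directly:

```python
# Volume of the hyper-sphere inscribed in the unit hyper-cube:
# V_s = pi^(d/2) / Gamma(d/2 + 1) * 0.5^d, while V_c = 1.
import math

for d in [1, 2, 3, 5, 10, 20]:
    v_s = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d
    print(f"d = {d:2d}: V_s / V_c = {v_s:.8f}")  # shrinks rapidly toward 0
```

Already at $d = 20$ the ratio is on the order of $10^{-8}$.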
Curse of dimensionality (cont.)
Grid sampling requires an exponentially large number of points:
$d = 1$: $n = 5$; $d = 2$: $n = 5^2$; $d = 3$: $n = 5^3$; in general, $n = 5^d$.
Unless you have an exponentially large number of samples, your high-dimensional samples are never dense.
Dimensionality reduction
We want to reduce the dimensionality of the data while preserving the intrinsic information in the data. Dimensionality reduction is also called embedding. When the dimensionality is reduced to 3 or fewer, it is also called data visualization.
The basic assumption (or belief) behind dimensionality reduction: your high-dimensional data is redundant in some sense.
Notation: linear embedding
Data samples: $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $d \gg 1$
Embedding matrix: $B \in \mathbb{R}^{m \times d}$, $1 \le m \ll d$
Embedded data samples: $\{z_i\}_{i=1}^n$, $z_i = B x_i \in \mathbb{R}^m$
(Diagram: $x_i \in \mathbb{R}^d \xrightarrow{B} z_i \in \mathbb{R}^m$.)
Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: k-means, spectral clustering
5. Generalization
Principal component analysis (PCA)
Idea: we want to get rid of redundant dimensions of the data samples. For example, the samples $(10, 0), (20, 0.1), (30, 0.1)$ are essentially one-dimensional and can be represented as $10, 20, 30$.
This can be achieved by minimizing the distance between the embedded samples and the original samples.
Data centering
We center the data samples:
$\bar{x}_i = x_i - \frac{1}{n} \sum_{j=1}^n x_j$, so that $\sum_i \bar{x}_i = 0$.
In matrix form, $\bar{X} = X H$, where
$X = (x_1\ x_2\ \cdots\ x_n)$, $\bar{X} = (\bar{x}_1\ \bar{x}_2\ \cdots\ \bar{x}_n)$, $H = I_n - \frac{1}{n} 1_{n \times n}$,
$I_n$: the $n$-dimensional identity matrix; $1_{n \times n}$: the $n \times n$ matrix of all ones.
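A minimal numpy sketch of this centering step (the variable names are ours, not from the slides):

```python
import numpy as np

n, d = 100, 5
X = np.random.randn(d, n) + 3.0           # d x n data matrix with nonzero mean
H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I_n - (1/n) 1_{nxn}
X_bar = X @ H                             # subtracts the mean from every column
print(np.allclose(X_bar.sum(axis=1), 0))  # True: centered samples sum to zero
```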
Orthogonal projection
$\{b_i\}_{i=1}^m$, $b_i \in \mathbb{R}^d$: an orthonormal basis of the $m$-dimensional embedding subspace:
$\langle b_i, b_j \rangle = b_i^\top b_j = \delta_{i,j} = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases}$
In matrix form, $B = (b_1\ b_2\ \cdots\ b_m)^\top$ and $B B^\top = I_m$.
The orthogonal projection of $x_i$ onto the subspace is
$\sum_{j=1}^m \langle b_j, x_i \rangle\, b_j \;(= B^\top B x_i)$.
PCA criterion
Minimize the sum of squared distances between the projections and the (centered) samples:
$\sum_{i=1}^n \| B^\top B \bar{x}_i - \bar{x}_i \|^2 = -\mathrm{tr}(B C B^\top) + \mathrm{tr}(C)$,
where $C = \sum_{i=1}^n \bar{x}_i \bar{x}_i^\top = \bar{X} \bar{X}^\top$ and $\mathrm{tr}(B C B^\top) = \sum_{i=1}^m b_i^\top C b_i$.
PCA criterion:
$B_{\mathrm{PCA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B C B^\top)$ subject to $B B^\top = I_m$.
PCA: Summary
A PCA solution: $B_{\mathrm{PCA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$,
$\{\lambda_i, \psi_i\}_{i=1}^d$: the sorted eigenvalues and normalized eigenvectors of $C \psi = \lambda \psi$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and $\langle \psi_i, \psi_j \rangle = \delta_{i,j}$.
PCA embedding of a sample $x$: $z = B_{\mathrm{PCA}} \left( x - \frac{1}{n} X 1_n \right)$ (data centering),
where $1_n$ is the $n$-dimensional vector of all ones.
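The summary above translates almost line by line into numpy. The following is a sketch under the stated conventions (columns of X are samples; the function name pca is ours), not the lecturer's reference code:

```python
import numpy as np

def pca(X, m):
    """PCA embedding. X: d x n matrix whose columns are samples."""
    mu = X.mean(axis=1, keepdims=True)
    X_bar = X - mu                        # data centering
    C = X_bar @ X_bar.T                   # C = X_bar X_bar^T
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    B = eigvecs[:, ::-1][:, :m].T         # m leading eigenvectors as rows
    Z = B @ X_bar                         # embedded samples z_i = B x_bar_i
    return B, Z, mu

# The near-1-dimensional example from the PCA slide:
X = np.array([[10.0, 20.0, 30.0],
              [ 0.0,  0.1,  0.1]])
B, Z, mu = pca(X, m=1)
print(Z)  # approximately +/-(-10, 0, 10): the redundant dimension is dropped
```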
Proof
We first derive necessary conditions for a solution.
Lagrangian: $L(B, \Delta) = \mathrm{tr}(B C B^\top) - \mathrm{tr}\left( (B B^\top - I_m) \Delta \right)$,
$\Delta$: Lagrange multipliers (a symmetric matrix).
Necessary conditions:
$\frac{\partial L}{\partial B} = 2 B C - 2 \Delta B = 0 \;\Rightarrow\; C B^\top = B^\top \Delta$ (1)
$\frac{\partial L}{\partial \Delta} = B B^\top - I_m = 0 \;\Rightarrow\; B B^\top = I_m$ (2)
Proof (cont.)
Eigendecomposition of $\Delta$: $\Delta = T \Gamma T^\top$ (3),
$T$: an orthogonal matrix ($T^{-1} = T^\top$); $\Gamma$: a diagonal matrix.
Substituting (3) into (1): $C B^\top = B^\top T \Gamma T^\top$ (4), hence $C B^\top T = B^\top T \Gamma$ (5).
This is an eigensystem of $C$:
$\Gamma = \mathrm{diag}(\lambda_{k_1}, \lambda_{k_2}, \ldots, \lambda_{k_m})$ (6),
$B^\top T = (\psi_{k_1}\ \psi_{k_2}\ \cdots\ \psi_{k_m})$ with $k_i \in \{1, 2, \ldots, d\}$,
so $B = T (\psi_{k_1}\ \psi_{k_2}\ \cdots\ \psi_{k_m})^\top$ (7).
Proof (cont.)
$B B^\top = I_m \;\Rightarrow\; \mathrm{rank}(B) = m \;\Rightarrow\;$ all $\{k_i\}_{i=1}^m$ are distinct.
Summary of the necessary conditions:
(3) $\Delta = T \Gamma T^\top$
(6) $\Gamma = \mathrm{diag}(\lambda_{k_1}, \lambda_{k_2}, \ldots, \lambda_{k_m})$
(7) $B = T (\psi_{k_1}\ \psi_{k_2}\ \cdots\ \psi_{k_m})^\top$
with all $\{k_i\}_{i=1}^m$ distinct.
Proof (cont.)
Now we choose the $\{k_i\}_{i=1}^m$ that maximize the objective function $\mathrm{tr}(B C B^\top)$.
From (2), (4), and (6):
$\mathrm{tr}(B C B^\top) = \mathrm{tr}(B B^\top T \Gamma T^\top) = \mathrm{tr}(T \Gamma T^\top) = \mathrm{tr}(\Gamma T^\top T) = \sum_{i=1}^m \lambda_{k_i}$,
using (2) in the second equality and the orthogonality of $T$ in the last.
Since $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, the choice $k_i = i$ maximizes the objective.
Taking $T = I_m$ gives $B = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$.
Pearson correlation
The correlation coefficient for $\{(s_i, t_i)\}_{i=1}^n$:
$\rho = \frac{\sum_{i=1}^n (s_i - \bar{s})(t_i - \bar{t})}{\sqrt{\sum_{i=1}^n (s_i - \bar{s})^2} \sqrt{\sum_{i=1}^n (t_i - \bar{t})^2}}$,
where $\bar{s} = \sum_i s_i / n$ and $\bar{t} = \sum_i t_i / n$.
Positively correlated: $\rho > 0$; uncorrelated: $\rho \approx 0$; negatively correlated: $\rho < 0$.
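Written out in numpy (a sketch; np.corrcoef computes the same quantity):

```python
import numpy as np

def pearson(s, t):
    """Correlation coefficient of paired samples {(s_i, t_i)}."""
    s0, t0 = s - s.mean(), t - t.mean()
    return (s0 * t0).sum() / (np.sqrt((s0**2).sum()) * np.sqrt((t0**2).sum()))

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
t = 2.0 * s + 0.5 * rng.standard_normal(1000)  # positively correlated with s
print(pearson(s, t), np.corrcoef(s, t)[0, 1])  # the two values agree
```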
PCA uncorrelates data
$B_{\mathrm{PCA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$
The covariance matrix of the PCA-embedded samples is diagonal (homework):
$\frac{1}{n} \sum_{i=1}^n z_i z_i^\top = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$
The elements of $z$ are uncorrelated!
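A quick numerical check of this claim, reusing the pca() sketch above (the random test data is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
X = A @ rng.standard_normal((5, 500))  # 5-dimensional, strongly correlated samples
B, Z, mu = pca(X, m=3)                 # pca() as sketched in the PCA summary
cov = (Z @ Z.T) / Z.shape[1]           # covariance matrix of the embedded z_i
print(np.round(cov, 6))                # off-diagonal entries are (numerically) zero
```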
Examples
[Figure: PCA embeddings of artificial datasets.] The data is well described.
PCA is intuitive, easy to implement, analytically computable, and fast.
Examples (cont.)
Iris data (4-d → 2-d); letter data (16-d → 2-d). The embedded samples seem informative.
Examples (cont.)
However, PCA does not necessarily preserve interesting information such as clusters.
Homework
1. Implement PCA and reproduce the 2-dimensional examples shown in the class. Datasets 1 and 2 are available from http://www.ms.k.u-tokyo.ac.jp/sugi/data/dataanalysis/
(Optional) Test PCA on your own (artificial or real) data and analyze the characteristics of PCA.
Homework (cont.)
2. Prove that PCA uncorrelates samples. More specifically, prove that the covariance matrix of the PCA-embedded samples is the following diagonal matrix:
$\frac{1}{n} \sum_{i=1}^n z_i z_i^\top = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$,
where $B_{\mathrm{PCA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$ and $z_i = B_{\mathrm{PCA}} \bar{x}_i$.
Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: k-means, spectral clustering
5. Generalization
Locality preserving projection (LPP)
PCA finds a subspace that describes the data well. However, PCA can miss interesting structures such as clusters.
Another idea: find a subspace that preserves local structures in the data well.
Similarity matrix
Similarity matrix $W$: the more similar $x_i$ and $x_j$ are, the larger $W_{i,j}$ is.
Assumptions on $W$:
Symmetric: $W_{i,j} = W_{j,i}$
Normalized: $0 \le W_{i,j} \le 1$
$W$ is also called the affinity matrix.
Examples of similarity matrix
Distance-based: $W_{i,j} = \exp(-\|x_i - x_j\|^2 / \gamma^2)$, $\gamma > 0$.
Nearest-neighbor-based: $W_{i,j} = 1$ if $x_i$ is a $k$-nearest neighbor of $x_j$ or $x_j$ is a $k$-nearest neighbor of $x_i$; otherwise $W_{i,j} = 0$.
Combinations of these two are also possible.
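Both constructions are easy to write down in numpy. The following sketches are ours (a d × n data matrix is assumed, and γ and k are left as design parameters):

```python
import numpy as np

def distance_similarity(X, gamma):
    """W_ij = exp(-||x_i - x_j||^2 / gamma^2). X: d x n data matrix."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    return np.exp(-sq / gamma ** 2)

def knn_similarity(X, k):
    """W_ij = 1 iff x_i is a k-NN of x_j or x_j is a k-NN of x_i."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    np.fill_diagonal(sq, np.inf)        # a point is not its own neighbor
    nn = np.argsort(sq, axis=0)[:k, :]  # k nearest neighbors of each column
    W = np.zeros_like(sq)
    for j in range(W.shape[1]):
        W[nn[:, j], j] = 1.0
    return np.maximum(W, W.T)           # symmetrize: "or" of the two conditions
```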
LPP criterion
Idea: embed close points so that they stay close, i.e., minimize
$\sum_{i,j=1}^n W_{i,j} \| B x_i - B x_j \|^2 \;(\ge 0)$.
This is expressed as $2\,\mathrm{tr}(B X L X^\top B^\top)$ (homework), where
$X = (x_1\ x_2\ \cdots\ x_n)$, $L = D - W$, $D = \mathrm{diag}\left( \sum_{j=1}^n W_{1,j}, \ldots, \sum_{j=1}^n W_{n,j} \right)$.
Since $B = 0$ gives a meaningless solution, we impose $B X D X^\top B^\top = I_m$.
LPP: Summary
LPP criterion:
$B_{\mathrm{LPP}} = \arg\min_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B X L X^\top B^\top)$ subject to $B X D X^\top B^\top = I_m$.
Solution (homework): $B_{\mathrm{LPP}} = (\psi_d\ \psi_{d-1}\ \cdots\ \psi_{d-m+1})^\top$,
$\{\lambda_i, \psi_i\}_{i=1}^d$: the sorted generalized eigenvalues and normalized eigenvectors of $X L X^\top \psi = \lambda X D X^\top \psi$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and $\langle X D X^\top \psi_i, \psi_j \rangle = \delta_{i,j}$.
LPP embedding of a sample $x$: $z = B_{\mathrm{LPP}} x$.
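A sketch of the whole procedure, assuming scipy is available for the generalized symmetric eigenproblem (scipy.linalg.eigh(A, C) returns eigenvalues in ascending order with C-orthonormal eigenvectors, so the m smallest are taken directly). This is our own illustration of the summary, and it assumes $X D X^\top$ is nonsingular:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, m):
    """LPP embedding matrix. X: d x n (columns are samples); W: n x n similarity."""
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # graph Laplacian
    A = X @ L @ X.T                # X L X^T
    C = X @ D @ X.T                # X D X^T (assumed positive definite)
    eigvals, eigvecs = eigh(A, C)  # generalized eigenproblem, ascending eigenvalues
    return eigvecs[:, :m].T        # m smallest generalized eigenvectors as rows

# Usage: B = lpp(X, W, m); embed a sample x with z = B @ x.
```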
Generalized eigenvalue problem
$A \psi = \lambda C \psi$, $C$: a positive definite symmetric matrix.
Then there exists a positive definite symmetric matrix $C^{1/2}$ such that $(C^{1/2})^2 = C$:
from the eigenvalue decomposition $C = \sum_i \gamma_i \varphi_i \varphi_i^\top$ with $\gamma_i > 0$,
$C^{1/2} = \sum_i \sqrt{\gamma_i}\, \varphi_i \varphi_i^\top$.
Generalized eigenvalue problem (cont.)
$A \psi = \lambda C \psi$. Letting $\varphi = C^{1/2} \psi$, we obtain
$C^{-1/2} A C^{-1/2} \varphi = \lambda \varphi$.
This is an ordinary eigenproblem. Ordinary eigenvectors are orthonormal:
$\langle \varphi_i, \varphi_j \rangle = \delta_{i,j} = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases}$
Generalized eigenvectors are $C$-orthonormal: $\langle C \psi_i, \psi_j \rangle = \delta_{i,j}$.
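A numerical sketch of this reduction (the random symmetric $A$ and positive definite $C$ are of our choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d)); A = (A + A.T) / 2            # symmetric
M = rng.standard_normal((d, d)); C = M @ M.T + d * np.eye(d)  # positive definite

g, Phi = np.linalg.eigh(C)                            # C = Phi diag(g) Phi^T
C_inv_half = Phi @ np.diag(g ** -0.5) @ Phi.T         # C^{-1/2}
lam, F = np.linalg.eigh(C_inv_half @ A @ C_inv_half)  # ordinary eigenproblem
Psi = C_inv_half @ F                                  # psi = C^{-1/2} phi

print(np.allclose(A @ Psi, C @ Psi @ np.diag(lam)))  # A psi_i = lambda_i C psi_i
print(np.allclose(Psi.T @ C @ Psi, np.eye(d)))       # psi_i are C-orthonormal
```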
Examples
Blue: PCA; green: LPP. (The similarity matrix is defined by the nearest-neighbor-based method with 50 nearest neighbors.)
LPP describes the data well, and it also preserves cluster structure. LPP is intuitive, easy to implement, analytically computable, and fast.
Examples (cont.)
Embedding handwritten numerals from '3' to '8' into a 2-dimensional subspace. Each image consists of 16×16 pixels.
Examples (cont.)
LPP finds (slightly) clearer clusters than PCA.
Drawbacks of LPP
The obtained results depend strongly on the similarity matrix $W$. Appropriately designing the similarity matrix (e.g., choosing $k$ or $\gamma$) is not always easy.
Local scaling of samples
The density of samples may differ locally (dense regions vs. sparse regions). Using the same $\gamma$ globally in the similarity matrix
$W_{i,j} = \exp(-\|x_i - x_j\|^2 / \gamma^2)$, $\gamma > 0$,
may not be appropriate.
Local scaling heuristic
$\gamma_i$: the scale around sample $x_i$:
$\gamma_i = \| x_i - x_i^{(k)} \|$, where $x_i^{(k)}$ is the $k$-th nearest neighbor of $x_i$.
Local-scaling-based similarity matrix:
$W_{i,j} = \exp\left( -\|x_i - x_j\|^2 / (\gamma_i \gamma_j) \right)$
A heuristic choice is $k = 7$.
L. Zelnik-Manor & P. Perona, Self-tuning spectral clustering, Advances in Neural Information Processing Systems 17, 1601-1608, MIT Press, 2005.
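A sketch of this heuristic in numpy (our implementation of the formula above, with the cited default k = 7):

```python
import numpy as np

def local_scaling_similarity(X, k=7):
    """W_ij = exp(-||x_i - x_j||^2 / (gamma_i gamma_j)). X: d x n data matrix."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    dist = np.sqrt(sq)
    np.fill_diagonal(dist, np.inf)           # exclude the sample itself
    gamma = np.sort(dist, axis=1)[:, k - 1]  # gamma_i = ||x_i - x_i^(k)||
    return np.exp(-sq / np.outer(gamma, gamma))
```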
Graph theory
Graph: a set of vertices and edges.
Adjacency matrix $W$: $W_{i,j}$ is the number of edges from the $i$-th to the $j$-th vertex.
Vertex degree $d_i$: the number of edges connected to the $i$-th vertex.
Spectral graph theory
Spectral graph theory studies relationships between the properties of a graph and its adjacency matrix.
Graph Laplacian $L$:
$L_{i,j} = \begin{cases} d_i & (i = j) \\ -1 & (i \neq j \text{ and } W_{i,j} > 0) \\ 0 & (\text{otherwise}) \end{cases}$
Relation to spectral graph theory
Suppose our similarity matrix $W$ is defined based on nearest neighbors, and consider the graph in which each vertex corresponds to a sample $x_i$ and an edge exists iff $W_{i,j} > 0$. Then:
$W$ is the adjacency matrix;
$D$ is the diagonal matrix of vertex degrees;
$L = D - W$ is the graph Laplacian.
A small numerical illustration is sketched below.
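The illustration (the toy graph is ours):

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # adjacency matrix of a 4-vertex graph
D = np.diag(W.sum(axis=1))                 # vertex degrees on the diagonal
L = D - W                                  # graph Laplacian
print(L)
print(np.allclose(L.sum(axis=1), 0))       # each row of L sums to zero
```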
Homework
1. Prove that
$\sum_{i,j=1}^n W_{i,j} \| B x_i - B x_j \|^2 = 2\,\mathrm{tr}(B X L X^\top B^\top)$,
where $X = (x_1\ x_2\ \cdots\ x_n)$, $L = D - W$, and $D = \mathrm{diag}\left( \sum_{j=1}^n W_{1,j}, \ldots, \sum_{j=1}^n W_{n,j} \right)$.
Homework (cont.)
2. Let $B$ be an $m \times d$ matrix ($1 \le m \le d$), and let $C, D$ be $d \times d$ positive definite symmetric matrices. Let $\{\lambda_i, \psi_i\}_{i=1}^d$ be the sorted generalized eigenvalues and normalized eigenvectors of $C \psi = \lambda D \psi$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and $\langle D \psi_i, \psi_j \rangle = \delta_{i,j}$.
Prove that a solution of
$B_{\min} = \arg\min_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B C B^\top)$ subject to $B D B^\top = I_m$
is given by $B_{\min} = (\psi_d\ \psi_{d-1}\ \cdots\ \psi_{d-m+1})^\top$.
Homework (cont.)
3. (Optional) Implement LPP and reproduce the 2-dimensional examples shown in the class. Datasets 1 and 2 are available from http://www.ms.k.u-tokyo.ac.jp/sugi/data/dataanalysis/
Test LPP on your own (artificial or real) data and analyze the characteristics of LPP.
Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: k-means, spectral clustering
5. Generalization
Supervised dimensionality reduction
The best embedding is unknown in general. But if every sample has a class label, the best embedding is the one in which samples in different classes are well separated.
(Figure: two candidate embeddings, one better for representing large variances and one better for representing local structures; which is the best?)
Supervised dimensionality reduction (cont.)
Samples $\{x_i\}_{i=1}^n$ now have class labels $\{y_i\}_{i=1}^n$:
$\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \ldots, c\}$.
We want to obtain an embedding in which samples in different classes are well separated from each other.
Within-class scatter matrix
The sum of scatters within each class:
$S^{(w)} = \sum_{y=1}^c \sum_{i : y_i = y} (x_i - \mu_y)(x_i - \mu_y)^\top$,
where $\mu_y = \frac{1}{n_y} \sum_{i : y_i = y} x_i$ is the mean of the samples in class $y$ and $n_y$ is the number of samples in class $y$.
Between-class scatter matrix
The sum of scatters between classes:
$S^{(b)} = \sum_{y=1}^c n_y (\mu_y - \mu)(\mu_y - \mu)^\top$,
where $\mu_y = \frac{1}{n_y} \sum_{i : y_i = y} x_i$ is the mean of the samples in class $y$ and $\mu = \frac{1}{n} \sum_i x_i$ is the mean of all samples.
Fisher discriminant analysis (FDA)
Idea: minimize the within-class scatter and maximize the between-class scatter by maximizing
$\mathrm{tr}\left( (B S^{(w)} B^\top)^{-1} B S^{(b)} B^\top \right)$.
To disable arbitrary scaling, we impose $B S^{(w)} B^\top = I_m$.
FDA criterion:
$B_{\mathrm{FDA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B S^{(b)} B^\top)$ subject to $B S^{(w)} B^\top = I_m$.
FDA: Summary
FDA criterion:
$B_{\mathrm{FDA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B S^{(b)} B^\top)$ subject to $B S^{(w)} B^\top = I_m$.
Solution: $B_{\mathrm{FDA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$,
$\{\lambda_i, \psi_i\}_{i=1}^d$: the sorted generalized eigenvalues and normalized eigenvectors of $S^{(b)} \psi = \lambda S^{(w)} \psi$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and $\langle S^{(w)} \psi_i, \psi_j \rangle = \delta_{i,j}$.
FDA embedding of a sample $x$: $z = B_{\mathrm{FDA}} x$.
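A sketch assembling the two scatter matrices and solving the generalized eigenproblem (our own illustration; scipy is assumed available and $S^{(w)}$ is assumed nonsingular):

```python
import numpy as np
from scipy.linalg import eigh

def fda(X, y, m):
    """FDA embedding matrix. X: d x n (columns are samples); y: length-n labels."""
    d = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)         # mean of all samples
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)  # class mean
        S_w += (Xc - mu_c) @ (Xc - mu_c).T     # within-class scatter
        S_b += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T  # between-class scatter
    eigvals, eigvecs = eigh(S_b, S_w)          # ascending generalized eigenvalues
    return eigvecs[:, ::-1][:, :m].T           # m largest, as rows of B

# Usage: B = fda(X, y, m); z = B @ x.
```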
Examples of FDA
FDA can find an appropriate subspace.
Examples of FDA (cont.)
However, FDA does not work well if the samples in a class are multi-modal.
Dimensionality of the embedding space
It holds that $\mathrm{rank}(S^{(b)}) \le c - 1$. This means that the eigenvalues $\{\lambda_i\}_{i=c}^d$ (with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$) are always zero (homework). Due to the multiplicity of eigenvalues, the eigenvectors $\{\psi_i\}_{i=c}^d$ can be arbitrarily rotated in the null space of $S^{(b)}$.
Thus FDA essentially requires $m \le c - 1$. When $c = 2$, $m$ cannot be larger than 1!
Local Fisher discriminant analysis (LFDA)
Idea: take the locality of the data into account.
1. Nearby samples in the same class are made close.
2. Samples in different classes are made apart.
3. Far-apart samples in the same class can be ignored.
M. Sugiyama: Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, JMLR, 8(May), 2007.
Pairwise expressions of scatters (homework)
$S^{(w)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(w)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$, with
$Q^{(w)}_{i,j} = \begin{cases} 1/n_y & (y_i = y_j = y) \\ 0 & (y_i \neq y_j) \end{cases}$
(samples in the same class are made close);
$S^{(b)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(b)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$, with
$Q^{(b)}_{i,j} = \begin{cases} 1/n - 1/n_y & (y_i = y_j = y) \\ 1/n & (y_i \neq y_j) \end{cases}$
(samples in different classes are made apart).
Locality-aware scatters
$S^{(lw)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(lw)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$, with
$Q^{(lw)}_{i,j} = \begin{cases} W_{i,j}/n_y & (y_i = y_j = y) \\ 0 & (y_i \neq y_j) \end{cases}$
(nearby samples in the same class are made close);
$S^{(lb)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(lb)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$, with
$Q^{(lb)}_{i,j} = \begin{cases} W_{i,j} (1/n - 1/n_y) & (y_i = y_j = y) \\ 1/n & (y_i \neq y_j) \end{cases}$
(samples in different classes are made apart). $W_{i,j}$: a similarity matrix.
LFDA: Summary
LFDA criterion:
$B_{\mathrm{LFDA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B S^{(lb)} B^\top)$ subject to $B S^{(lw)} B^\top = I_m$.
Solution: $B_{\mathrm{LFDA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$,
$\{\lambda_i, \psi_i\}_{i=1}^d$: the sorted generalized eigenvalues and normalized eigenvectors of $S^{(lb)} \psi = \lambda S^{(lw)} \psi$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and $\langle S^{(lw)} \psi_i, \psi_j \rangle = \delta_{i,j}$.
LFDA embedding of a sample $x$: $z = B_{\mathrm{LFDA}} x$.
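A sketch combining the pairwise weights above with the scatter identity $\frac{1}{2}\sum_{i,j} Q_{i,j}(x_i - x_j)(x_i - x_j)^\top = X(D_Q - Q)X^\top$, where $D_Q$ is the diagonal matrix of the row sums of $Q$ (the matrix analogue of the LPP homework identity). This is our own illustration, assuming scipy and a nonsingular $S^{(lw)}$:

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, y, W, m):
    """LFDA embedding matrix. X: d x n; y: length-n labels; W: n x n similarity."""
    n = X.shape[1]
    Q_lw = np.zeros((n, n))
    Q_lb = np.full((n, n), 1.0 / n)      # pairs in different classes
    for c in np.unique(y):
        same = np.outer(y == c, y == c)  # boolean mask of same-class pairs
        n_c = np.sum(y == c)
        Q_lw[same] = W[same] / n_c
        Q_lb[same] = W[same] * (1.0 / n - 1.0 / n_c)

    def scatter(Q):                      # (1/2) sum_ij Q_ij (x_i-x_j)(x_i-x_j)^T
        return X @ (np.diag(Q.sum(axis=1)) - Q) @ X.T

    S_lw, S_lb = scatter(Q_lw), scatter(Q_lb)
    eigvals, eigvecs = eigh(S_lb, S_lw)  # ascending generalized eigenvalues
    return eigvecs[:, ::-1][:, :m].T     # m largest, as rows of B
```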
Examples of LFDA
Similarity matrix: the nearest-neighbor-based method with 50 nearest neighbors.
LFDA works well even for samples with within-class multi-modality.
Since $\mathrm{rank}(S^{(lb)})$ is generally larger than $c - 1$, $m$ can be larger in LFDA than in FDA.
Examples of FDA/LFDA
Thyroid disease data (5-dimensional), representing several statistics obtained from blood tests. Label: healthy or sick. 'Sick' can be caused by hyper-functioning of the thyroid (working too much) or hypo-functioning of the thyroid (working too little).
Projected samples onto a 1-dimensional space
FDA: sick and healthy are not separated; hyper- and hypo-functioning are completely mixed.
LFDA: sick and healthy are nicely separated; hyper- and hypo-functioning are also well separated.