Dimension reduction methods: Algorithms and Applications Yousef Saad Department of Computer Science and Engineering University of Minnesota


1 Dimension reduction methods: Algorithms and Applications Yousef Saad Department of Computer Science and Engineering University of Minnesota NASCA 2018, Kalamata July 6, 2018

2 Introduction, background, and motivation Common goal of data mining methods: to extract meaningful information or patterns from data. Very broad area includes: data analysis, machine learning, pattern recognition, information retrieval,... Main tools used: linear algebra; graph theory; approximation theory; optimization;... In this talk: emphasis on dimension reduction techniques and the interrelations between techniques Kalamata 07/06/18 p. 2

3 Introduction: a few factoids
We live in an era increasingly shaped by DATA
[...] bytes of data created in [...]
[...] % of data on internet created since [...]
[...] Billion internet users in [...]
[...] Million Google searches worldwide / minute (5.2 B/day)
15.2 Million text messages worldwide / minute
Mixed blessing: opportunities & big challenges.
Trend is re-shaping & energizing many research areas, including numerical linear algebra

4 A huge potential: Health sciences

5 Recommending books or movies: recommender systems

6 A few sample (classes of) problems:
Classification: Benign-Malignant, Dangerous-Safe, face recognition, digit recognition
Matrix completion: recommender systems
Projection-type methods: PCA, LSI, Clustering, Eigenmaps, LLE, Isomap, ...
Problems for graphs/networks: PageRank, analysis of graphs (node centrality, ...)
Problems from computational statistics [Trace(inv(cov)), log det(A), log-likelihood, ...]

7 A common tool: Dimension reduction
[Figure: picture captioned "Dimension Reduction PLEASE!". Picture modified from a source that is missing from the text.]

8 Major tool of Data Mining: Dimension reduction
Goal is not so much to reduce size (& cost) but to:
Reduce noise and redundancy in data before performing a task [e.g., classification as in digit/face recognition]
Discover important features or parameters
The problem: Given $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$, find a low-dimensional representation $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d \times n}$ of $X$
Achieved by a mapping $\Phi : x \in \mathbb{R}^m \to y \in \mathbb{R}^d$ so that $\Phi(x_i) = y_i$, $i = 1, \ldots, n$

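9 [Figure: the $m \times n$ data matrix $X$ with columns $x_i$ is mapped to the $d \times n$ matrix $Y$ with columns $y_i$]
$\Phi$ may be linear: $y_j = W^T x_j$ for all $j$, or, $Y = W^T X$ ...
... or nonlinear (implicit).
Mapping $\Phi$ required to: Preserve proximity? Maximize variance? Preserve a certain graph?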

10 Basics: Principal Component Analysis (PCA)
In Principal Component Analysis, $W$ is computed to maximize the variance of the projected data:
$$\max_{W \in \mathbb{R}^{m \times d},\; W^T W = I} \; \sum_{i=1}^{n} \Big\| y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \Big\|_2^2, \qquad y_i = W^T x_i.$$
Leads to maximizing $\mathrm{Tr}\left[ W^T (X - \mu e^T)(X - \mu e^T)^T W \right]$, with $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Solution: $W$ = {dominant eigenvectors} of the covariance matrix, i.e., the set of dominant left singular vectors of $\bar{X} = X - \mu e^T$

11 SVD: $\bar{X} = U \Sigma V^T$, $U^T U = I$, $V^T V = I$, $\Sigma$ = Diag
Optimal $W = U_d$, the matrix of the first $d$ columns of $U$
Solution $W$ also minimizes the reconstruction error:
$$\sum_i \| x_i - W W^T x_i \|^2 = \sum_i \| x_i - W y_i \|^2$$
In some methods recentering to zero is not done, i.e., $\bar{X}$ is replaced by $X$.
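The PCA recipe on the two slides above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the speaker's code: the function name `pca`, the random test data, and the use of a dense `numpy.linalg.svd` are assumptions made for the example. The reconstruction error is measured here on the centered data.

```python
import numpy as np

def pca(X, d):
    """PCA of the columns of X (m x n): returns W (m x d), Y (d x n), mean, singular values."""
    mu = X.mean(axis=1, keepdims=True)           # mean vector mu (m x 1)
    Xbar = X - mu                                 # centered data, X - mu e^T
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    W = U[:, :d]                                  # first d left singular vectors
    Y = W.T @ Xbar                                # reduced data, y_i = W^T (x_i - mu)
    return W, Y, mu, s

# Hypothetical data: 200 points in R^5, reduced to d = 2
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
W, Y, mu, s = pca(X, d=2)

# Reconstruction error of the centered data
recon_err = np.sum((X - mu - W @ Y) ** 2)
```

With this choice of $W$, the captured variance equals the sum of the squares of the first $d$ singular values, and the reconstruction error equals the sum of the squares of the remaining ones.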

12 Unsupervised learning
Unsupervised learning: methods do not exploit labeled data
Example of digits: perform a 2-D projection
Images of same digit tend to cluster (more or less)
Such 2-D representations are popular for visualization
Can also try to find natural clusters in data, e.g., in materials
Basic clustering technique: K-means
[Figures: 2-D PCA projection of digits; clusters of materials labeled Superconductors, Multiferroics, Superhard, Catalytic, Photovoltaic, Ferromagnetic, Thermoelectric]

13 Example: Digit images (a random sample of 30)

14 2-D reductions: [Figure: four panels of 2-D projections of the digit images, titled "PCA digits", "LLE digits", "K-PCA digits", and "ONPP digits"]

15 DIMENSION REDUCTION EXAMPLE: INFORMATION RETRIEVAL

16 Application: Information Retrieval
Given: collection of documents (columns of a matrix $A$) and a query vector $q$.
Representation: $m \times n$ term-by-document matrix $A$
A query $q$ is a (sparse) vector in $\mathbb{R}^m$ ("pseudo-document")
Problem: find a column of $A$ that best matches $q$
Vector space model: use $\cos(A(:, j), q)$, $j = 1 : n$
Requires the computation of $A^T q$
Literal matching is ineffective

17 Common approach: Dimension reduction (SVD)
LSI: replace $A$ by a low-rank approximation [from SVD]: $A = U \Sigma V^T \;\to\; A_k = U_k \Sigma_k V_k^T$
Replace similarity vector $s = A^T q$ by $s_k = A_k^T q$
Main issues: 1) computational cost 2) updates
Idea: Replace $A_k$ by $A \phi(A^T A)$, where $\phi$ is a filter function
Consider the step function (Heaviside):
$$\phi(x) = \begin{cases} 0, & 0 \le x < \sigma_k^2 \\ 1, & \sigma_k^2 \le x \le \sigma_1^2 \end{cases}$$
Would yield the same result as TSVD, but is not practical
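The claim that the step-function filter reproduces the truncated-SVD scores can be checked numerically on a small example. The sketch below is illustrative only: the function names, the random matrix, and the dense eigendecomposition used to form $\phi(A^T A)$ are assumptions, and forming $\phi(A^T A)$ explicitly is exactly the step the slide calls impractical.

```python
import numpy as np

def lsi_scores(A, q, k):
    """s_k = A_k^T q with A_k = U_k Sigma_k V_k^T, computed without forming A_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T @ (s[:k] * (U[:, :k].T @ q))

def step_filter_scores(A, q, k):
    """s = phi(A^T A) A^T q, with phi the step function that keeps sigma_1^2 .. sigma_k^2."""
    lam, V = np.linalg.eigh(A.T @ A)     # eigenvalues in ascending order
    Vk = V[:, -k:]                       # eigenvectors of the k largest eigenvalues
    phi = Vk @ Vk.T                      # phi(A^T A): spectral projector (symmetric)
    return phi @ (A.T @ q)

# Hypothetical 12-term, 8-document collection and query
rng = np.random.default_rng(0)
A = rng.random((12, 8))
q = rng.random(12)
s_tsvd = lsi_scores(A, q, k=3)
s_filter = step_filter_scores(A, q, k=3)
```

Both routes give the same similarity vector, because $A \phi(A^T A) = U \Sigma \phi(\Sigma^2) V^T = A_k$ when $\phi$ is the step function.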

18 Use of polynomial filters
Solution: use a polynomial approximation to $\phi$
Note: $s^T = q^T A \phi(A^T A)$ requires only Mat-Vecs
Ideal for situations where data must be explored once or a small number of times only
Details skipped. See: E. Kokiopoulou and YS, Polynomial Filtering in Latent Semantic Indexing for Information Retrieval, ACM-SIGIR,

19 IR: Use of the Lanczos algorithm (J. Chen, YS 09)
Lanczos algorithm = projection method on the Krylov subspace $\mathrm{Span}\{v, Av, \ldots, A^{m-1} v\}$
Can get singular vectors with Lanczos, & use them in LSI
Better: use the Lanczos vectors directly for the projection
K. Blom and A. Ruhe [SIMAX, vol. 26, 2005] perform a Lanczos run for each query [expensive].
Proposed: one Lanczos run with a random initial vector. Then use Lanczos vectors in place of singular vectors.
In summary: results comparable to those of SVD at a much lower cost.
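One way to picture "Lanczos vectors in place of singular vectors" is sketched below. Several choices here are assumptions, not taken from the slide: the Lanczos run is applied to the symmetric matrix $A A^T$, full reorthogonalization is used for simplicity, and the similarity vector is approximated by $A^T Q Q^T q$, where the columns of $Q$ are the Lanczos vectors. The name `lanczos_vectors` and the random data are hypothetical.

```python
import numpy as np

def lanczos_vectors(matvec, v0, steps):
    """Orthonormal Lanczos basis of span{v0, M v0, ..., M^(steps-1) v0} for symmetric M."""
    Q = np.zeros((v0.shape[0], steps))
    q = v0 / np.linalg.norm(v0)
    for j in range(steps):
        Q[:, j] = q
        w = matvec(q)
        # Full reorthogonalization against all previous vectors (applied twice for safety)
        for _ in range(2):
            w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        beta = np.linalg.norm(w)
        if beta < 1e-12:                 # invariant subspace reached: stop early
            return Q[:, :j + 1]
        q = w / beta
    return Q

# Hypothetical 8-term, 20-document collection and query
rng = np.random.default_rng(0)
A = rng.random((8, 20))
query = rng.random(8)

# One Lanczos run on A A^T from a random start vector, using only mat-vecs
Q = lanczos_vectors(lambda x: A @ (A.T @ x), rng.standard_normal(8), steps=5)
scores = A.T @ (Q @ (Q.T @ query))       # approximate similarity vector
```

Only products with $A$ and $A^T$ are needed, and the same $Q$ can be reused for every query.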

20 Supervised learning
We now have data that is labeled
Example: (health sciences) malignant vs. non-malignant
Example: (materials) photovoltaic, hard, conductor, ...
Example: (digit recognition) digits 0, 1, ..., 9
[Figure: labeled data points a, b, c, d, e, f, g]

21 Supervised learning (continued) [Figure: the labeled points a to g, now shown together with unlabeled points marked "??"]

22 Supervised learning: classification
Best illustration: written digit recognition example
Given: a set of labeled samples (training set), and an (unlabeled) test image.
Problem: find label of test image
[Figure: training data labeled Digit 0, Digit 1, Digit 2, ..., Digit 9 and test data labeled "Digit ??", both passed through dimension reduction]
Roughly speaking: we seek dimension reduction so that recognition is more effective in low-dim. space

23 Basic method: K-nearest neighbors (KNN) classification
Idea of a voting system: get distances between test sample and training samples
Get the k nearest neighbors (here k = 8)
Predominant class among these k items is assigned to the test sample
[Figure: a test sample marked "?" among labeled training samples]
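The voting scheme just described can be sketched as follows. The function name `knn_classify`, the Euclidean distance, and the two synthetic clusters are assumptions made for illustration; ties are broken by whichever class `Counter` lists first.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels, x_test, k=8):
    """Assign to x_test the predominant label among its k nearest training samples.

    X_train holds one training sample per column; labels[i] is the label of column i.
    """
    dists = np.linalg.norm(X_train - x_test[:, None], axis=0)   # distance to every sample
    nearest = np.argsort(dists)[:k]                              # indices of the k closest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: class 0 near the origin, class 1 near (10, 10)
rng = np.random.default_rng(0)
X_train = np.hstack([rng.standard_normal((2, 20)),
                     rng.standard_normal((2, 20)) + 10.0])
labels = [0] * 20 + [1] * 20

label_a = knn_classify(X_train, labels, np.array([9.5, 10.0]))
label_b = knn_classify(X_train, labels, np.array([0.2, -0.1]))
```

In the setting of the talk, the distances would be computed after dimension reduction, i.e., between the reduced vectors $y_i$ rather than the original $x_i$.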

24 Supervised learning: Linear classification
Linear classifiers: find a hyperplane which best separates the data in classes A and B.
Example of application: distinguish between SPAM and non-spam e-mails
[Figure: linear classifier]
Note: The world is non-linear. Often this is combined with kernels, which amounts to changing the inner product

25 A harder case: [Figure: data split by Spectral Bisection (PDDP)] Use kernels to transform

26 [Figure: Projection with Kernels; transformed data with a Gaussian kernel. The value of $\sigma^2$ is missing from the text.]

27 GRAPH-BASED TECHNIQUES

28 Graph-based methods
Start with a graph of data, e.g., the graph of k nearest neighbors (k-NN graph)
Want: perform a projection which preserves the graph in some sense
Define a graph Laplacean: $L = D - W$
e.g., $w_{ij} = 1$ if $j \in \mathrm{Adj}(i)$, 0 otherwise; $D = \mathrm{diag}(d_{ii})$ with $d_{ii} = \sum_{j \ne i} w_{ij}$
with $\mathrm{Adj}(i)$ = neighborhood of $i$ (excluding $i$)

29 A side note: Graph partitioning
If $x$ is a vector of signs ($\pm 1$) then $x^T L x = 4 \times$ (number of edge cuts), where an edge-cut = pair $(i, j)$ with $x_i \ne x_j$
Consequence: can be used for partitioning graphs, or clustering [take $p = \mathrm{sign}(u_2)$, where $u_2$ = eigenvector associated with the 2nd smallest eigenvalue]

30 A few properties of graph Laplacean matrices
Let $L$ = any matrix s.t. $L = D - W$, with $D = \mathrm{diag}(d_i)$, $w_{ij} \ge 0$, $d_i = \sum_{j \ne i} w_{ij}$
Property 1: for any $x \in \mathbb{R}^n$: $x^T L x = \frac{1}{2} \sum_{i,j} w_{ij} |x_i - x_j|^2$
Property 2 (generalization): for any $Y \in \mathbb{R}^{d \times n}$: $\mathrm{Tr}[Y L Y^T] = \frac{1}{2} \sum_{i,j} w_{ij} \|y_i - y_j\|^2$

31 Property 3: For the particular $L = I - \frac{1}{n} \mathbf{1} \mathbf{1}^T$: $X L X^T = \bar{X} \bar{X}^T = n \times$ [sample covariance matrix]
[Proof: 1) $L$ is a projector: $L^T L = L^2 = L$, and 2) $X L = \bar{X}$]
PCA is equivalent to maximizing $\sum_{ij} \|y_i - y_j\|^2$
Property 4 (graph partitioning): if $x$ is a vector of signs ($\pm 1$) and an edge-cut = pair $(i, j)$ with $x_i \ne x_j$, then $x^T L x = 4 \times$ (number of edge cuts)
Can be used for partitioning graphs, or for clustering [take $p = \mathrm{sign}(u_2)$, where $u_2$ = eigenvector associated with the 2nd smallest eigenvalue]
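The four properties can be checked numerically on a small random graph. The sketch below assumes a symmetric 0/1 weight matrix with zero diagonal, so that $W = W^T$. The graph, the vectors, and all variable names are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Random symmetric 0/1 adjacency matrix with zero diagonal (illustration only)
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
W = (upper + upper.T).astype(float)
D = np.diag(W.sum(axis=1))
L = D - W                                         # graph Laplacean

# Property 1: x^T L x = 1/2 sum_ij w_ij |x_i - x_j|^2
x = rng.standard_normal(n)
quad = x @ L @ x
pair_sum = 0.5 * np.sum(W * (x[:, None] - x[None, :]) ** 2)

# Property 2: Tr[Y L Y^T] = 1/2 sum_ij w_ij ||y_i - y_j||^2
Y = rng.standard_normal((3, n))
trace_form = np.trace(Y @ L @ Y.T)
diffs = Y[:, :, None] - Y[:, None, :]             # diffs[:, i, j] = y_i - y_j
pair_sum_Y = 0.5 * np.sum(W * np.sum(diffs ** 2, axis=0))

# Property 3: with Lc = I - (1/n) 1 1^T, X Lc X^T = Xbar Xbar^T
X = rng.standard_normal((4, n))
Lc = np.eye(n) - np.ones((n, n)) / n
Xbar = X - X.mean(axis=1, keepdims=True)

# Property 4: for a sign vector, x^T L x = 4 * (number of edge cuts)
signs = rng.choice([-1.0, 1.0], size=n)
cuts = sum(1 for i in range(n) for j in range(i + 1, n)
           if W[i, j] > 0 and signs[i] != signs[j])
cut_form = signs @ L @ signs
```

Property 4 follows from Property 1: each cut edge contributes $|x_i - x_j|^2 = 4$ and is counted twice in the double sum, which the factor $\frac{1}{2}$ compensates.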

32 Example: The Laplacean eigenmaps approach
Laplacean Eigenmaps [Belkin-Niyogi 01] *minimizes*
$$F(Y) = \sum_{i,j=1}^{n} w_{ij} \|y_i - y_j\|^2 \quad \text{subject to} \quad Y D Y^T = I$$
Motivation: if $\|x_i - x_j\|$ is small (orig. data), we want $\|y_i - y_j\|$ to be also small (low-dim. data)
Original data used indirectly through its graph
Leads to an $n \times n$ sparse eigenvalue problem [in sample space]
[Figure: neighboring points $x_i$, $x_j$ mapped to $y_i$, $y_j$]

33 Problem translates to:
$$\min_{Y \in \mathbb{R}^{d \times n},\; Y D Y^T = I} \mathrm{Tr}\left[ Y (D - W) Y^T \right].$$
Solution (sort eigenvalues increasingly): $(D - W) u_i = \lambda_i D u_i$; $y_i = u_i^T$; $i = 1, \ldots, d$ (here $y_i$ denotes the $i$-th row of $Y$)
Note: can assume $D = I$. Amounts to rescaling data. Problem becomes $(I - W) u_i = \lambda_i u_i$; $y_i = u_i^T$; $i = 1, \ldots, d$
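A dense, small-scale sketch of this generalized eigenvalue problem is given below. It assumes a symmetric weight matrix with no isolated nodes and reduces the problem to a standard symmetric one through $D^{-1/2}$. The function name and the random weights are hypothetical. The sketch follows the slide literally and keeps the $d$ smallest eigenpairs, including the one with $\lambda = 0$, which implementations often discard.

```python
import numpy as np

def laplacian_eigenmaps(W, d):
    """Solve (D - W) u = lambda D u for the d smallest eigenvalues; rows of Y are u_i^T."""
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    s = 1.0 / np.sqrt(deg)                       # diagonal of D^{-1/2} (needs deg > 0)
    Lsym = (s[:, None] * L) * s[None, :]         # D^{-1/2} L D^{-1/2}
    lam, Z = np.linalg.eigh(Lsym)                # eigenvalues in ascending order
    U = s[:, None] * Z[:, :d]                    # u_i = D^{-1/2} z_i
    return U.T, lam[:d]

# Hypothetical dense symmetric weight matrix on 10 nodes
rng = np.random.default_rng(0)
B = rng.random((10, 10))
W = (B + B.T) / 2
np.fill_diagonal(W, 0.0)

Y, lam = laplacian_eigenmaps(W, d=2)
```

For the large sparse graphs targeted in the talk, a sparse eigensolver would replace the dense `eigh` call.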

34 Why smallest eigenvalues vs largest for PCA?
Intuition: Graph Laplacean and unit Laplacean are very different: one involves a sparse graph (more like a discretized differential operator). The other involves a dense graph (more like a discretized integral operator). They should be treated as the inverses of each other.
Viewpoint confirmed by what we learn from the kernel approach

35 Locally Linear Embedding (Roweis-Saul-00)
LLE is very similar to Eigenmaps. Main differences: 1) the graph Laplacean matrix is replaced by an affinity graph; 2) the objective function is changed.
1. Graph: each $x_i$ is written as a convex combination of its $k$ nearest neighbors:
$$x_i \approx \sum_{j \in N_i} w_{ij} x_j, \qquad \sum_{j \in N_i} w_{ij} = 1$$
Optimal weights computed ("local calculation") by minimizing $\| x_i - \sum_j w_{ij} x_j \|$ for $i = 1, \ldots, n$

36 2. Mapping: the $y_i$'s should obey the same affinity as the $x_i$'s
Minimize: $\sum_i \| y_i - \sum_j w_{ij} y_j \|^2$ subject to: $Y \mathbf{1} = 0$, $Y Y^T = I$
Solution: $(I - W^T)(I - W) u_i = \lambda_i u_i$; $y_i = u_i^T$.
$(I - W^T)(I - W)$ replaces the graph Laplacean of eigenmaps
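The two LLE steps can be sketched as follows. Assumptions made for the example: Euclidean brute-force neighbors, weights constrained only to sum to one (non-negativity is not enforced), a small regularization of the local Gram matrix (the parameter `reg` is not from the slides), and a dense eigensolver. All names and the random data are hypothetical.

```python
import numpy as np

def lle(X, k, d, reg=1e-3):
    """Sketch of LLE for the columns of X (m x n): returns weights W (n x n) and Y (d x n)."""
    m, n = X.shape
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # squared distances
    np.fill_diagonal(D2, np.inf)                                # exclude self-neighbors
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:k]                 # k nearest neighbors of x_i
        Z = X[:, nbrs] - X[:, [i]]                   # neighbors shifted by x_i
        G = Z.T @ Z                                  # local Gram matrix (k x k)
        G = G + reg * np.trace(G) * np.eye(k)        # regularization (assumption)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                     # enforce sum_j w_ij = 1
    IW = np.eye(n) - W
    M = IW.T @ IW                                    # (I - W^T)(I - W)
    lam, U = np.linalg.eigh(M)                       # eigenvalues in ascending order
    # Skip the first eigenvector, which is the constant vector when the zero
    # eigenvalue is simple, so that Y 1 = 0 holds
    Y = U[:, 1:d + 1].T
    return W, Y

# Hypothetical data: 40 points in R^6, embedded in d = 2 dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 40))
W, Y = lle(X, k=4, d=2)
```

Since each row of $W$ sums to one, the constant vector always lies in the null space of $(I - W^T)(I - W)$, which is why one eigenvector is discarded.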

37 Locally Preserving Projections (He-Niyogi-03)
LPP is a linear dimensionality reduction technique
Recall the setting: want $V \in \mathbb{R}^{m \times d}$; $Y = V^T X$
[Figure: the $m \times n$ matrix $X$ (columns $x_i$) is mapped by $V^T$ ($d \times m$) to the $d \times n$ matrix $Y$ (columns $y_i$)]
Starts with the same neighborhood graph as Eigenmaps: $L \equiv D - W$ = graph Laplacean, with $D \equiv \mathrm{diag}(\{\sum_j w_{ij}\})$.

38 Optimization problem is to solve
$$\min_{Y \in \mathbb{R}^{d \times n},\; Y D Y^T = I} \; \sum_{i,j} w_{ij} \|y_i - y_j\|^2, \qquad Y = V^T X.$$
Difference with eigenmaps: $Y$ is a projection of the $X$ data
Solution (sort eigenvalues increasingly): $X L X^T v_i = \lambda_i X D X^T v_i$, $y_{i,:} = v_i^T X$
Note: essentially the same method in [Koren-Carmel 04], called weighted PCA [viewed from the angle of improving PCA]
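The $m \times m$ generalized eigenproblem of LPP can be sketched with NumPy alone by reducing it to a standard symmetric problem through a Cholesky factor of $X D X^T$. This assumes $X$ has full row rank and $W$ is symmetric with positive degrees. The function name and the random inputs are hypothetical.

```python
import numpy as np

def lpp(X, W, d):
    """Solve X L X^T v = lambda X D X^T v for the d smallest eigenvalues."""
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    A = X @ L @ X.T                          # X L X^T  (m x m)
    B = X @ (deg[:, None] * X.T)             # X D X^T  (m x m)
    R = np.linalg.cholesky(B)                # B = R R^T (needs full row rank of X)
    Rinv = np.linalg.inv(R)
    C = Rinv @ A @ Rinv.T                    # equivalent standard symmetric problem
    C = (C + C.T) / 2                        # remove rounding asymmetry
    lam, Z = np.linalg.eigh(C)               # eigenvalues in ascending order
    V = Rinv.T @ Z[:, :d]                    # v_i = R^{-T} z_i
    return V, lam[:d]

# Hypothetical data (4 x 30) and symmetric weights
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 30))
S = rng.random((30, 30))
W = (S + S.T) / 2
np.fill_diagonal(W, 0.0)

V, lam = lpp(X, W, d=2)
Y = V.T @ X                                  # explicit linear mapping
```

Unlike Eigenmaps, the result is an explicit matrix $V$, so a new point $x$ is mapped simply by $V^T x$.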

39 ONPP (Kokiopoulou and YS 05)
Orthogonal Neighborhood Preserving Projections
A linear (orthogonal) version of LLE obtained by writing $Y$ in the form $Y = V^T X$
Same graph as LLE. Objective: preserve the affinity graph (as in LLE) *but* with the constraint $Y = V^T X$
Problem solved to obtain mapping:
$$\min_{V} \; \mathrm{Tr}\left[ V^T X (I - W^T)(I - W) X^T V \right] \quad \text{s.t.} \quad V^T V = I$$
In LLE replace $V^T X$ by $Y$
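Because of the constraint $V^T V = I$, the ONPP problem reduces to a standard symmetric eigenproblem for the $m \times m$ matrix $X (I - W^T)(I - W) X^T$. The sketch below takes the affinity matrix $W$ as given. The row-normalized random $W$ used here is only a stand-in for LLE weights, and the function name is hypothetical.

```python
import numpy as np

def onpp(X, W, d):
    """Minimize Tr[V^T X (I - W^T)(I - W) X^T V] subject to V^T V = I."""
    n = X.shape[1]
    IW = np.eye(n) - W
    G = X @ (IW.T @ IW) @ X.T                # m x m symmetric matrix
    G = (G + G.T) / 2                        # remove rounding asymmetry
    lam, vecs = np.linalg.eigh(G)            # eigenvalues in ascending order
    return vecs[:, :d], lam[:d]              # eigenvectors of the d smallest eigenvalues

# Hypothetical data and a stand-in affinity matrix with unit row sums
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 30))
W = rng.random((30, 30))
np.fill_diagonal(W, 0.0)
W = W / W.sum(axis=1, keepdims=True)

V, lam = onpp(X, W, d=2)
Y = V.T @ X                                  # orthogonal projection of the data
```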

40 Implicit vs explicit mappings
In PCA the mapping $\Phi$ from high-dimensional space ($\mathbb{R}^m$) to low-dimensional space ($\mathbb{R}^d$) is explicitly known: $y = \Phi(x) \equiv V^T x$
In Eigenmaps and LLE we only know $y_i = \phi(x_i)$, $i = 1, \ldots, n$
Mapping $\phi$ is complex, i.e., it is difficult to get $\phi(x)$ for an arbitrary $x$ not in the sample.
Inconvenient for classification
The "out-of-sample extension" problem

41 DEMO

42 K-nearest neighbor graphs
Nearest neighbor graphs needed in: data mining, manifold learning, robot motion planning, computer graphics, ...
Given: a set of $n$ data points $X = \{x_1, \ldots, x_n\}$ (the vertices)
Given: a proximity measure between two data points $x_i$ and $x_j$, as measured by a quantity $\rho(x_i, x_j)$
Want: for each point $x_i$, a list of the nearest neighbors of $x_i$ (edges between $x_i$ and these nodes). (We will make the definition less vague shortly.)

43 Building a nearest neighbor graph
Problem: build a nearest-neighbor graph from given data
[Figure: Data → Graph]
Will demonstrate the power of the Lanczos algorithm combined with a divide and conquer approach.

44 Two types of nearest neighbor graph often used:
ε-graph: edges consist of pairs $(x_i, x_j)$ such that $\rho(x_i, x_j) \le \epsilon$
kNN graph: nodes adjacent to $x_i$ are those nodes $x_l$ with the $k$ smallest distances $\rho(x_i, x_l)$.
ε-graph is undirected and is geometrically motivated. Issues: 1) may result in disconnected components 2) what ε?
kNN graphs are directed in general (can be trivially fixed).
kNN graphs especially useful in practice.
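Both graph types can be built by brute force for small data sets, as in the sketch below. The Euclidean distance plays the role of $\rho$, and the function names are hypothetical. The symmetrization $A \lor A^T$ is one possible reading of "can be trivially fixed", not necessarily the one the speaker has in mind.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distances between all pairs of columns of X."""
    diff = X[:, :, None] - X[:, None, :]
    return np.sqrt(np.sum(diff ** 2, axis=0))

def epsilon_graph(X, eps):
    """Boolean adjacency: edge (i, j) whenever rho(x_i, x_j) <= eps, with i != j."""
    A = pairwise_dist(X) <= eps
    np.fill_diagonal(A, False)
    return A

def knn_graph(X, k, symmetric=False):
    """Boolean adjacency: row i marks the k nearest neighbors of x_i."""
    D = pairwise_dist(X)
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbor
    n = D.shape[0]
    nbrs = np.argsort(D, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=bool)
    A[np.arange(n)[:, None], nbrs] = True
    return (A | A.T) if symmetric else A          # optional symmetrization

# Hypothetical data: 50 points in R^3
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))
G_eps = epsilon_graph(X, eps=1.0)
G_knn = knn_graph(X, k=3)
G_sym = knn_graph(X, k=3, symmetric=True)
```

The brute-force cost is quadratic in $n$, which is what the divide and conquer methods of the following slides aim to avoid.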

45 Divide and conquer KNN: key ingredient
Key ingredient is spectral bisection
Let the data matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$. Each column == a data point.
Center the data: $\hat{X} = [\hat{x}_1, \ldots, \hat{x}_n] = X - c e^T$, where $c$ == centroid; e = ones(n, 1) (matlab)
Goal: split $\hat{X}$ into halves using a hyperplane.
Method: Principal Direction Divisive Partitioning, D. Boley, 98.
Idea: use the $(\sigma, u, v)$ = largest singular triplet of $\hat{X}$, with $u^T \hat{X} = \sigma v^T$.

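46 Hyperplane is defined as $\langle u, x \rangle = 0$, i.e., it splits the set of data points into two subsets:
$$X_+ = \{x_i \mid u^T \hat{x}_i \ge 0\} \quad \text{and} \quad X_- = \{x_i \mid u^T \hat{x}_i < 0\}.$$
[Figure: hyperplane with normal vector $u$, separating the + side from the - side]
Note that $u^T \hat{x}_i = u^T \hat{X} e_i = \sigma v^T e_i$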

47 $X_+ = \{x_i \mid v_i \ge 0\}$ and $X_- = \{x_i \mid v_i < 0\}$, where $v_i$ is the $i$-th entry of $v$.
In practice: replace the above criterion by
$$X_+ = \{x_i \mid v_i \ge \mathrm{med}(v)\} \quad \& \quad X_- = \{x_i \mid v_i < \mathrm{med}(v)\}$$
where med(v) == median of the entries of $v$.
For the largest singular triplet $(\sigma, u, v)$ of $\hat{X}$: use the Golub-Kahan-Lanczos algorithm, or Lanczos applied to $\hat{X} \hat{X}^T$ or $\hat{X}^T \hat{X}$
Cost (assuming $s$ Lanczos steps): $O(n \times d \times s)$; usually: $d$ very small
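The bisection step can be sketched as below. For brevity the largest singular triplet is taken from a dense `numpy.linalg.svd` rather than from the Golub-Kahan-Lanczos algorithm mentioned on the slide. The function name and the random data are assumptions.

```python
import numpy as np

def pddp_split(X):
    """Split the columns of X (d x n) by the median of the dominant right singular vector."""
    c = X.mean(axis=1, keepdims=True)            # centroid
    Xhat = X - c                                 # centered data, X - c e^T
    U, s, Vt = np.linalg.svd(Xhat, full_matrices=False)
    sigma, u, v = s[0], U[:, 0], Vt[0]           # largest singular triplet
    med = np.median(v)
    plus = np.where(v >= med)[0]                 # indices of X_+
    minus = np.where(v < med)[0]                 # indices of X_-
    return Xhat, sigma, u, v, plus, minus

# Hypothetical data: 100 points in R^4
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100))
Xhat, sigma, u, v, plus, minus = pddp_split(X)
```

The relation $u^T \hat{X} = \sigma v^T$ means the side of the hyperplane on which $\hat{x}_i$ falls is read off from the sign of $v_i$, so no extra products with the data are needed once $v$ is known.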

48 Two divide and conquer algorithms
Overlap method: divide current set into two overlapping subsets $X_1$, $X_2$
Glue method: divide current set into two disjoint subsets $X_1$, $X_2$ plus a third set $X_3$ called the gluing set.
[Figure: left, a hyperplane with overlapping subsets $X_1$, $X_2$; right, a hyperplane with disjoint subsets $X_1$, $X_2$ and gluing set $X_3$]

49 The Overlap Method
Divide current set $X$ into two overlapping subsets:
$$X_1 = \{x_i \mid v_i \ge -h_\alpha(S_v)\} \quad \text{and} \quad X_2 = \{x_i \mid v_i < h_\alpha(S_v)\},$$
where $S_v = \{|v_i| \mid i = 1, 2, \ldots, n\}$ and $h_\alpha(\cdot)$ is a function that returns an element larger than $(100\alpha)\%$ of those in $S_v$.
Rationale: to ensure that the two subsets overlap $(100\alpha)\%$ of the data, i.e., $|X_1 \cap X_2| = \lceil \alpha |X| \rceil$.

50 The Glue Method
Divide the set $X$ into two disjoint subsets $X_1$ and $X_2$ with a gluing subset $X_3$:
$$X_1 \cup X_2 = X, \quad X_1 \cap X_2 = \emptyset, \quad X_1 \cap X_3 \ne \emptyset, \quad X_2 \cap X_3 \ne \emptyset.$$
Criterion used for splitting:
$$X_1 = \{x_i \mid v_i \ge 0\}, \quad X_2 = \{x_i \mid v_i < 0\}, \quad X_3 = \{x_i \mid -h_\alpha(S_v) \le v_i < h_\alpha(S_v)\}.$$
Note: gluing subset $X_3$ here is just the intersection of the sets $X_1$, $X_2$ of the overlap method.
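Given the singular vector $v$, both splitting criteria amount to simple thresholding. In the sketch below, the way $h_\alpha$ is picked (the entry of the sorted $|v_i|$ at position $\lceil \alpha n \rceil$) is an assumption about how "an element larger than $(100\alpha)\%$ of those in $S_v$" might be implemented. The function names are hypothetical.

```python
import numpy as np

def h_alpha(v, alpha):
    """An element of S_v = {|v_i|} that is larger than about (100*alpha)% of the others."""
    s = np.sort(np.abs(v))
    idx = min(int(np.ceil(alpha * len(v))), len(v) - 1)
    return s[idx]

def divide_overlap(v, alpha):
    """Index sets of the two overlapping subsets X_1, X_2."""
    h = h_alpha(v, alpha)
    return np.where(v >= -h)[0], np.where(v < h)[0]

def divide_glue(v, alpha):
    """Index sets of the disjoint subsets X_1, X_2 and of the gluing subset X_3."""
    h = h_alpha(v, alpha)
    return (np.where(v >= 0)[0],
            np.where(v < 0)[0],
            np.where((v >= -h) & (v < h))[0])

# Hypothetical singular vector entries for n = 100 points
rng = np.random.default_rng(0)
v = rng.standard_normal(100)
O1, O2 = divide_overlap(v, alpha=0.2)
X1, X2, X3 = divide_glue(v, alpha=0.2)
```

With distinct entries, the overlap contains $\lceil \alpha n \rceil$ points, plus possibly the single point whose entry equals $-h_\alpha(S_v)$.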

51 Approximate kNN Graph Construction: The Overlap Method
function G = kNN-Overlap(X, k, α)
  if |X| < n_k then
    G ← kNN-BruteForce(X, k)
  else
    (X_1, X_2) ← DIVIDE-OVERLAP(X, α)
    G_1 ← kNN-Overlap(X_1, k, α)
    G_2 ← kNN-Overlap(X_2, k, α)
    G ← CONQUER(G_1, G_2)
    REFINE(G)
  end if
end function
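A compact Python sketch of the overlap recursion follows. The slides do not spell out CONQUER and REFINE, so the choices here are assumptions: `conquer` merges the two neighbor lists of each point and keeps the $k$ closest, REFINE is omitted, the divide step uses a dense SVD, and a fallback to brute force is added for degenerate splits. It assumes $k < n_k$ and more than $k$ input points.

```python
import numpy as np

def knn_bruteforce(X, idx, k):
    """Exact kNN among the columns idx of X: {point: {neighbor: squared distance}}."""
    P = X[:, idx]
    D2 = np.sum((P[:, :, None] - P[:, None, :]) ** 2, axis=0)
    np.fill_diagonal(D2, np.inf)
    graph = {}
    for a, i in enumerate(idx):
        order = np.argsort(D2[a])[:k]
        graph[int(i)] = {int(idx[b]): float(D2[a, b]) for b in order}
    return graph

def divide_overlap(X, idx, alpha):
    """Overlapping split by the dominant right singular vector of the centered subset."""
    P = X[:, idx]
    v = np.linalg.svd(P - P.mean(axis=1, keepdims=True), full_matrices=False)[2][0]
    s = np.sort(np.abs(v))
    h = s[min(int(np.ceil(alpha * len(v))), len(v) - 1)]
    return idx[v >= -h], idx[v < h]

def conquer(G1, G2, k):
    """Merge two partial graphs; points present in both keep their k closest candidates."""
    G = {i: dict(nb) for i, nb in G1.items()}
    for i, nb in G2.items():
        merged = {**G.get(i, {}), **nb}
        G[i] = dict(sorted(merged.items(), key=lambda t: t[1])[:k])
    return G

def knn_overlap(X, idx, k, alpha, n_k):
    if len(idx) < n_k:
        return knn_bruteforce(X, idx, k)
    X1, X2 = divide_overlap(X, idx, alpha)
    # Fallback (assumption): avoid subsets that are too small or that do not shrink
    if min(len(X1), len(X2)) <= k or max(len(X1), len(X2)) == len(idx):
        return knn_bruteforce(X, idx, k)
    return conquer(knn_overlap(X, X1, k, alpha, n_k),
                   knn_overlap(X, X2, k, alpha, n_k), k)

# Hypothetical data: 200 points in R^5
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
G = knn_overlap(X, np.arange(200), k=5, alpha=0.2, n_k=40)
```

The result is approximate: a true neighbor is missed whenever it never shares a leaf subset with the point, which is what a larger overlap $\alpha$ or a refinement pass is meant to mitigate.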

52 Approximate kNN Graph Construction: The Glue Method
function G = kNN-Glue(X, k, α)
  if |X| < n_k then
    G ← kNN-BruteForce(X, k)
  else
    (X_1, X_2, X_3) ← DIVIDE-GLUE(X, α)
    G_1 ← kNN-Glue(X_1, k, α)
    G_2 ← kNN-Glue(X_2, k, α)
    G_3 ← kNN-Glue(X_3, k, α)
    G ← CONQUER(G_1, G_2, G_3)
    REFINE(G)
  end if
end function

53 Theorem: The time complexity for the overlap method is
$$T_o(n) = \Theta(d n^{t_o}), \qquad (1)$$
where
$$t_o = \log_{2/(1+\alpha)} 2 = \frac{1}{1 - \log_2(1 + \alpha)}. \qquad (2)$$
Theorem: The time complexity for the glue method is
$$T_g(n) = \Theta(d n^{t_g} / \alpha), \qquad (3)$$
where $t_g$ is the solution to the equation
$$\frac{2}{2^t} + \alpha^t = 1.$$
Example: When $\alpha = 0.1$, then $t_o = 1.16$ while $t_g$ = [value missing in the source text]
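Both exponents are easy to evaluate numerically. The sketch below uses formula (2) for $t_o$ and plain bisection on $2/2^t + \alpha^t = 1$ for $t_g$. The function names and the search interval $[1, 2]$ are assumptions, and the interval brackets the root only for $\alpha$ below about 0.7.

```python
import math

def overlap_exponent(alpha):
    """t_o = 1 / (1 - log2(1 + alpha)), formula (2)."""
    return 1.0 / (1.0 - math.log2(1.0 + alpha))

def glue_exponent(alpha, lo=1.0, hi=2.0, tol=1e-12):
    """Solve 2 / 2^t + alpha^t = 1 for t by bisection on [lo, hi]."""
    f = lambda t: 2.0 / 2.0 ** t + alpha ** t - 1.0
    # f(1) = alpha > 0, f is decreasing in t, and f(2) = alpha^2 - 1/2 < 0 for alpha < 0.707
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_o = overlap_exponent(0.1)
t_g = glue_exponent(0.1)
```

For $\alpha = 0.1$ the computed $t_o$ agrees with the value 1.16 quoted on the slide, and $t_g$ comes out smaller than $t_o$.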

54 Multilevel techniques in brief
Divide and conquer paradigms as well as multilevel methods in the sense of domain decomposition
Main principle: very costly to do an SVD [or Lanczos] on the whole set. Why not find a smaller set on which to do the analysis without too much loss?
Tools used: graph coarsening, divide and conquer
For text data we use hypergraphs

55 Multilevel Dimension Reduction
Main idea: coarsen for a few levels. Use the resulting data set $\hat{X}$ to find a projector $P$ from $\mathbb{R}^m$ to $\mathbb{R}^d$. $P$ can be used to project original data or new data
[Figure: the graph of the $m \times n$ data $X$ is coarsened over several levels down to the last-level $m \times n_r$ data $X_r$; projection gives the $d \times n_r$ matrix $Y_r$ and, correspondingly, the $d \times n$ matrix $Y$]
Gain: dimension reduction is done with a much smaller set. Hope: not much loss compared to using whole data

56 Making it work: Use of Hypergraphs for sparse data
[Figure: left, a sparse matrix $A$ with nonzero pattern marked by * and rows labeled a to e; right, the corresponding hypergraph, with nets such as net d and net e]
Left: a (sparse) data set of $n$ entries in $\mathbb{R}^m$ represented by a matrix $A \in \mathbb{R}^{m \times n}$
Right: corresponding hypergraph $H = (V, E)$ with vertex set $V$ representing the columns of $A$.

57 Hypergraph Coarsening uses column matching, similar to a common one used in graph partitioning
Compute the non-zero inner product $\langle a^{(i)}, a^{(j)} \rangle$ between two vertices $i$ and $j$, i.e., the $i$th and $j$th columns of $A$.
Note: $\langle a^{(i)}, a^{(j)} \rangle = \|a^{(i)}\| \|a^{(j)}\| \cos \theta_{ij}$.
Modif. 1: Parameter: $0 < \epsilon < 1$. Match two vertices, i.e., columns, only if the angle between the vertices satisfies: $\tan \theta_{ij} \le \epsilon$
Modif. 2: Scale coarsened columns. If $i$ and $j$ are matched and if $\|a^{(i)}\|_0 \ge \|a^{(j)}\|_0$, replace $a^{(i)}$ and $a^{(j)}$ by
$$c^{(l)} = \left( \sqrt{1 + \cos^2 \theta_{ij}} \right) a^{(i)}$$

58 Call $C$ the coarsened matrix obtained from $A$ using the approach just described
Lemma: Let $C \in \mathbb{R}^{m \times c}$ be the coarsened matrix of $A$ obtained by one level of coarsening of $A \in \mathbb{R}^{m \times n}$, with columns $a^{(i)}$ and $a^{(j)}$ matched if $\tan \theta_i \le \epsilon$. Then
$$| x^T A A^T x - x^T C C^T x | \le 3 \epsilon \|A\|_F^2,$$
for any $x \in \mathbb{R}^m$ with $\|x\|_2 = 1$.
Very simple bound for Rayleigh quotients for any $x$.
Implies some bounds on singular values and norms - skipped.

59 Tests: Comparing singular values
[Figure: three plots titled "Singular values 20 to 50 and approximations", each with curves labeled Exact, power it1, coarsen, coars+zs, rand]
Results for the datasets CRANFIELD (left), MEDLINE (middle), and TIME (right).

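60 Low rank approximation: Coarsening, random sampling, and rand+coarsening.
Err1 = $\|A - H_k H_k^T A\|_F$; Err2 = $\frac{1}{k} \sum_{i=1}^{k} \frac{|\hat{\sigma}_i - \sigma_i|}{\sigma_i}$
[Table: columns Dataset, n, k, c, then Err1 and Err2 for Coarsen and for Rand Sampl; rows Kohonen, aft, FA, chipcool, brainpc, scfxm1-2b, thermomechtc, Webbase-1M. The numerical entries are missing from the text.]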

61 Conclusion
*Many* interesting new matrix problems in areas that involve the effective mining of data
Among the most pressing issues is that of reducing computational cost - [SVD, SDP, ..., too costly]
Many online resources available
Huge potential in areas like materials science, though inertia has to be overcome
To a researcher in computational linear algebra: Tsunami of change on types of problems, algorithms, frameworks, culture, ...
But change should be welcome

62 When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us. Alexander Graham Bell ( )
In the words of Lao Tzu: If you do not change directions, you may end up where you are heading
Thank you! Visit my web-site at
