Efficient Linear Algebra methods for Data Mining Yousef Saad Department of Computer Science and Engineering University of Minnesota


1 Efficient Linear Algebra methods for Data Mining Yousef Saad Department of Computer Science and Engineering University of Minnesota NUMAN-08 Kalamata, 08

2 Team members involved in this work. Past: Efi Kokiopoulou [now at the U. of Lausanne]. Current: Jie Chen [grad student], Sofia Sakellaridi [grad student], Haw-Ren Fang [post-doc]. Support: National Science Foundation. Kalamata, p. 2

3 Introduction, background, and motivation Common goal of data mining methods: to extract meaningful information or patterns from data. Very broad area includes: data analysis, machine learning, pattern recognition, information retrieval,... Main tools used: linear algebra; graph theory; approximation theory; optimization;... In this talk: emphasis on dimension reduction techniques and the interrelations between techniques Kalamata, p. 3

4 Focus on two main problems: information retrieval and face recognition; and 3 types of dimension reduction methods: standard subspace methods [SVD, Lanczos], graph-based methods, and multilevel methods. Kalamata, p. 4

5 The problem: given d ≪ m, find a mapping Φ : x ∈ R^m → y ∈ R^d. The mapping may be explicit (e.g., y = V^T x) or implicit (nonlinear). Practically: given X ∈ R^{m×n}, we want to find a low-dimensional representation Y ∈ R^{d×n} of X. Two classes of methods: (1) projection techniques and (2) nonlinear implicit methods. Kalamata, p. 5

6 Example 1: The Swiss-Roll (points in 3-D). [Figure: original data in 3-D.] Kalamata, p. 6

7 2-D reductions: [Figures: PCA, LPP, Eigenmaps, and ONPP embeddings of the Swiss-Roll.] Kalamata, p. 7

8 Example 2: Digit images (a sample of 30) Kalamata, p. 8

9 2-D reductions: [Figures: PCA, LLE, Kernel PCA, and ONPP embeddings of the digit images.] Kalamata, p. 9

10 Projection-based Dimensionality Reduction. Given: a data set X = [x_1, x_2, ..., x_n] and d, the dimension of the desired reduced space Y. Want: a linear transformation from X to Y: X ∈ R^{m×n}, V ∈ R^{m×d}, Y = V^T X ∈ R^{d×n}. The m-dimensional objects (x_i) are flattened to the d-dimensional space (y_i). Constraint: the y_i's must satisfy certain properties → an optimization problem. Kalamata, p.

11 Linear Dimensionality Reduction: PCA. In PCA the projected data must have maximum variance, i.e., we need to maximize over all orthogonal m × d matrices V: Σ_i || y_i - (1/n) Σ_j y_j ||²_2 = Tr [ V^T X̄ X̄^T V ], where X̄ = X(I - (1/n) 1 1^T) is the origin-recentered version of X. Solution: V = { dominant eigenvectors of the covariance matrix } = set of left singular vectors of X̄. The solution V also minimizes the reconstruction error Σ_i || x_i - V V^T x_i ||² = Σ_i || x_i - V y_i ||², and it also maximizes Σ_{i,j} || y_i - y_j ||² [Koren and Carmel 04]. Kalamata, p. 11
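As an illustration of the PCA construction above, here is a minimal numpy sketch (toy data and function names of my own; not the talk's code): it recenters X, takes the d dominant left singular vectors as V, and forms Y = V^T X̄.

```python
import numpy as np

def pca_project(X, d):
    """PCA sketch: X is m x n (one data point per column), d = target dimension.
    Returns V (m x d dominant left singular vectors of the recentered data)
    and the reduced representation Y = V^T Xbar (d x n)."""
    Xbar = X - X.mean(axis=1, keepdims=True)   # recenter: X(I - (1/n) 1 1^T)
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    V = U[:, :d]                               # dominant left singular vectors
    return V, V.T @ Xbar

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 200))         # 50-dimensional points, 200 samples
    V, Y = pca_project(X, d=2)
    print(V.shape, Y.shape)                    # (50, 2) (2, 200)
```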

12 Information Retrieval: the Vector Space Model. Given: 1) a set of documents (the columns of a matrix A); 2) a query vector q. Entry a_ij of A = frequency of term i in document j, plus weighting. [Figure: terms × documents matrix A and the query ("pseudo-document") q.] Queries are represented similarly to columns. Problem: find the columns of A that best match q. Kalamata, p. 12

13 Vector Space Model and the Truncated SVD. Similarity metric: the angle between column A_{:,j} and the query q. Use cosines: q^T A_{:,j} / ( ||A_{:,j}||_2 ||q||_2 ). To rank all documents, compute the similarity vector s = A^T q. Not very effective. Problems: polysemy, synonymy, ... LSI: replace the matrix A by a low-rank approximation: A = U Σ V^T, A_k = U_k Σ_k V_k^T, s_k = A_k^T q. U_k: term space, V_k: document space. Called TSVD. Expensive, hard to update, ... Kalamata, p. 13

14 New similarity vector: s_k = A_k^T q = V_k Σ_k U_k^T q. Issues: How to select k? Computational cost (memory + computation). Problem with updates. Kalamata, p. 14
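To make the TSVD-based scoring concrete, here is a small dense numpy sketch (toy random data of my own; a real term-document matrix would be sparse and TF-IDF scaled): it forms s_k = V_k Σ_k U_k^T q directly from the truncated factors.

```python
import numpy as np

def lsi_scores(A, q, k):
    """LSI sketch: A is the m x n term-by-document matrix, q a length-m query,
    k the truncation rank.  Returns s_k = A_k^T q = V_k Sigma_k U_k^T q."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T      # rank-k factors
    return Vk @ (sk * (Uk.T @ q))                  # one score per document

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.random((300, 40))                      # 300 terms, 40 documents (toy)
    q = rng.random(300)
    s = lsi_scores(A, q, k=10)
    print(np.argsort(s)[::-1][:5])                 # 5 best-matching documents
```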

15 Alternative: SDD; less memory, but cost is still an issue. Alternative: polynomial approximation, s_k ≈ φ_k(A^T A) A^T q, where φ_k is a polynomial of degree k. Yet another alternative: use Lanczos vectors instead of singular vectors [Blom and Ruhe, 05]. Kalamata, p. 15

16 IR: Use of the Lanczos algorithm. * Joint work with Jie Chen. Lanczos is good at catching large (and small) eigenvalues: one can compute singular vectors with Lanczos and use them in LSI. Can do better: use the Lanczos vectors directly for the projection. First advocated by K. Blom and A. Ruhe [SIMAX, vol. 26, 05], who use Lanczos bidiagonalization. We use a similar approach, but directly with A A^T or A^T A. Kalamata, p. 16

17 IR: Use of the Lanczos algorithm (1). Let A ∈ R^{m×n}. Apply the Lanczos procedure to M = A A^T. Result: Q_k^T A A^T Q_k = T_k, with Q_k orthogonal and T_k tridiagonal. Define s_i = the orthogonal projection of A b onto the subspace span{q_1, ..., q_i}: s_i := Q_i Q_i^T A b. s_i can be easily updated from s_{i-1}: s_i = s_{i-1} + q_i q_i^T A b. Kalamata, p. 17
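Below is a minimal numpy sketch of this idea (not the talk's code): Lanczos with full reorthogonalization is run on M = A A^T through mat-vecs only, using a random starting vector (an assumed choice, since the slide does not fix q_1); the filtered product s_k = Q_k Q_k^T A b is then accumulated term by term as on the slide. Note that Q_k depends only on A, so it can be computed once in a preprocessing phase and reused for every query b.

```python
import numpy as np

def lanczos_basis(A, k, seed=0):
    """k steps of Lanczos on M = A A^T (via mat-vecs only), with full
    reorthogonalization and a random starting vector (an assumed choice).
    Returns Q_k with orthonormal columns; Q_k^T A A^T Q_k is tridiagonal."""
    m = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(m)
    q /= np.linalg.norm(q)
    Q = np.zeros((m, k))
    q_prev = np.zeros(m)
    beta = 0.0
    for i in range(k):
        Q[:, i] = q
        w = A @ (A.T @ q) - beta * q_prev            # mat-vec with M = A A^T
        alpha = q @ w
        w -= alpha * q
        w -= Q[:, :i + 1] @ (Q[:, :i + 1].T @ w)     # full reorthogonalization
        beta = np.linalg.norm(w)
        if beta < 1e-12:                             # invariant subspace found
            return Q[:, :i + 1]
        q_prev, q = q, w / beta
    return Q

def filtered_matvec(Q, A, b):
    """s_k = Q_k Q_k^T A b, accumulated column by column as on the slide."""
    Ab = A @ b
    s = np.zeros_like(Ab)
    for i in range(Q.shape[1]):
        s += Q[:, i] * (Q[:, i] @ Ab)                # s_i = s_{i-1} + q_i q_i^T A b
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.random((500, 120))        # toy nonnegative matrix (strong dominant direction)
    b = rng.random(120)
    Q = lanczos_basis(A, k=30)
    s = filtered_matvec(Q, A, b)
    print(np.linalg.norm(A @ b - s) / np.linalg.norm(A @ b))   # filtering error
```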

18 IR: Use of the Lanczos algorithm (2). If n < m it may be more economical to apply Lanczos to M = A^T A, which is n × n. Result: Q_k^T A^T A Q_k = T_k. Define t_i := A Q_i Q_i^T b, i.e., project b first, then apply A to the result. Kalamata, p. 18

19 Why does this work? First, recall a result on the Lanczos algorithm [YS 83]. Let {λ_j, u_j} be the j-th eigenpair of M (eigenvalues labeled decreasingly). Then ||(I - Q_k Q_k^T) u_j|| / ||Q_k Q_k^T u_j|| ≤ K_j / T_{k-j}(γ_j), where γ_j = 1 + 2 (λ_j - λ_{j+1}) / (λ_{j+1} - λ_n), K_1 = ||(I - Q_1 Q_1^T) u_1|| / ||Q_1 Q_1^T u_1||, K_j = ( ||(I - Q_1 Q_1^T) u_j|| / ||Q_1 Q_1^T u_j|| ) Π_{i=1}^{j-1} (λ_i - λ_n)/(λ_i - λ_j) for j > 1, and T_l(x) is the Chebyshev polynomial of the first kind of degree l. This has the form ||(I - Q_k Q_k^T) u_j|| ≤ c_j / T_{k-j}(γ_j), where c_j is a constant independent of k. Kalamata, p. 19

20 Result: the distance between a unit eigenvector u_j and the Krylov subspace span(Q_k) decays fast (for small j). Consider the component of the difference A b - s_k along the left singular directions of A. If A = U Σ V^T, then the u_j's (columns of U) are eigenvectors of M = A A^T. So: |⟨A b - s_k, u_j⟩| = |⟨(I - Q_k Q_k^T) A b, u_j⟩| = |⟨(I - Q_k Q_k^T) u_j, A b⟩| ≤ ||(I - Q_k Q_k^T) u_j|| ||A b|| ≤ c_j ||A b|| / T_{k-j}(γ_j). {s_i} converges rapidly to A b in the directions of the major left singular vectors of A. Kalamata, p.

21 A similar result holds for the left projection sequence t_j. [Figure: a typical distribution of the eigenvalues of M in the complex plane, with the leading eigenvalues λ_1, ..., λ_5 marked.] Convergence toward the first few singular vectors is very fast. Kalamata, p. 21

22 Advantages of Lanczos over polynomial filters: (1) no need for eigenvalue estimates; (2) mat-vecs are performed only in the preprocessing. Disadvantages: (1) need to store the Lanczos vectors; (2) the preprocessing must be redone when A changes; (3) need for reorthogonalization, expensive for large k. Kalamata, p. 22

23 Tests: IR. [Table: information retrieval datasets; # terms, # docs, # queries, and sparsity for MED (7,014 terms) and CRAN (3,763 terms).] [Figures: preprocessing times of lanczos vs. tsvd as a function of the number of iterations, for the Med and Cran datasets.] Kalamata, p. 23

24 Average query times. [Figures: average query time versus iterations for the Med and Cran datasets; curves: lanczos, tsvd, lanczos-rf, tsvd-rf.] Kalamata, p. 24

25 Average retrieval precision. [Figures: average precision versus iterations for the Med and Cran datasets; curves: lanczos, tsvd, lanczos-rf, tsvd-rf.] Retrieval precision comparisons. Kalamata, p. 25

26 In summary: results comparable to those of the SVD, at a much lower cost [however, not for the same dimension k]. Thanks: helpful tools and datasets are widely available. We used TMG [developed at the U. of Patras (D. Zeimpekis, E. Gallopoulos)]. Kalamata, p. 26

27 Problem 2: Face Recognition, background. Problem: we are given a database of images [arrays of pixel values] and a test (new) image. Question: does this new image correspond to one of those in the database? Kalamata, p. 27

28 Difficulty: different positions, expressions, lighting, ..., situations. Common approach: the eigenfaces technique, i.e., Principal Component Analysis. Kalamata, p. 28

29 Example: occlusion. See the recent paper by John Wright et al. [Figure, right: 50% of the pixels randomly changed.] See also a recent real-life example: an international man-hunt. Kalamata, p. 29

30 Eigenfaces. Consider each picture as a one-dimensional column of all its pixels. Put the columns together into an array A of size (# pixels) × (# images). Do an SVD of A and perform the comparison with any test image in the low-dimensional space. Similar to LSI in spirit, but the data is not sparse. Idea: replace the SVD by Lanczos vectors (same as for IR). Kalamata, p. 30

31 Tests: Face Recognition. Tests with 2 well-known data sets. ORL: 40 subjects, several sample images each; total # images: 400. AR set: 126 subjects, 4 facial expressions selected for each [natural, smiling, angry, screaming]; total # images: 504. Kalamata, p. 31

32 Tests: Face Recognition. Recognition accuracy of the Lanczos approximation vs. the SVD. [Figures: ORL and AR datasets; curves: lanczos, tsvd. The vertical axis shows the average error rate; the horizontal axis is the subspace dimension.] Kalamata, p. 32

33 GRAPH-BASED TECHNIQUES

34 Laplacean Eigenmaps (Belkin-Niyogi '02). Not a linear (projection) method but a nonlinear method. Starts with a k-nearest-neighbors graph. Defines the graph Laplacean L = D - W. Simplest choice: w_ij = 1 if j ∈ N_i and 0 otherwise, with N_i = the neighborhood of i (excluding i), and D = diag(deg(i)), deg(i) = |N_i|. Kalamata, p. 34
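As a concrete illustration, here is a small numpy sketch (toy data, my own function name) that builds this simplest graph Laplacean from a k-nearest-neighbors graph, with 0/1 weights and a symmetrized W:

```python
import numpy as np

def knn_graph_laplacian(X, k):
    """Build L = D - W for the k-nearest-neighbors graph of the columns of X,
    with 0/1 weights; W is symmetrized (w_ij = 1 if i is a neighbor of j or
    vice versa) and D = diag of the row sums of W."""
    n = X.shape[1]
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(D2, np.inf)                       # exclude i from N_i
    W = np.zeros((n, n))
    nbrs = np.argsort(D2, axis=1)[:, :k]               # k nearest neighbors of each node
    for i in range(n):
        W[i, nbrs[i]] = 1.0
    W = np.maximum(W, W.T)                             # symmetrize
    D = np.diag(W.sum(axis=1))
    return D - W, W

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.standard_normal((3, 100))                  # 100 points in R^3
    L, W = knn_graph_laplacian(X, k=6)
    print(np.allclose(L.sum(axis=1), 0.0))             # Laplacean rows sum to zero
```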

35 A few properties of graph Laplacean matrices. Let L be any matrix such that L = D - W, with D = diag(d_i), w_ij ≥ 0, and d_i = Σ_{j≠i} w_ij. Property 1: for any x ∈ R^n: x^T L x = (1/2) Σ_{i,j} w_ij (x_i - x_j)². Property 2 (generalization): for any Y ∈ R^{d×n}: Tr [Y L Y^T] = (1/2) Σ_{i,j} w_ij ||y_i - y_j||²_2. Kalamata, p. 35

36 Property 3: for the particular choice L = I - (1/n) 1 1^T: X L X^T = X̄ X̄^T = n × (covariance matrix). [Proof: 1) L is a projector: L^T L = L² = L, and 2) X̄ = X L.] Consequence 1: PCA is equivalent to maximizing Σ_{i,j} ||y_i - y_j||². Consequence 2: what about replacing this trivial L with something else? [Viewpoint in Koren-Carmel 04.] Kalamata, p. 36

37 Property 4 (graph partitioning): if x is a vector of signs (±1), then x^T L x = 4 × (number of edge cuts), where an edge cut is a pair (i, j) with x_i ≠ x_j. Consequence: this can be used for partitioning graphs, or for clustering [take p = sign(u_2), where u_2 = the eigenvector associated with the 2nd smallest eigenvalue]. Kalamata, p. 37
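This consequence can be demonstrated in a few lines of numpy (a toy example of my own, two cliques joined by one edge; not the talk's code): the sign pattern of u_2 recovers the two natural groups.

```python
import numpy as np

def spectral_bisection(W):
    """Partition the graph with weight matrix W using the sign of the eigenvector
    u_2 associated with the 2nd smallest eigenvalue of L = D - W."""
    D = np.diag(W.sum(axis=1))
    vals, vecs = np.linalg.eigh(D - W)   # symmetric eigenproblem, ascending eigenvalues
    p = np.sign(vecs[:, 1])              # u_2 ("Fiedler vector")
    p[p == 0] = 1.0
    return p

if __name__ == "__main__":
    # two 4-node cliques joined by a single edge
    W = np.zeros((8, 8))
    W[:4, :4] = 1.0
    W[4:, 4:] = 1.0
    np.fill_diagonal(W, 0.0)
    W[3, 4] = W[4, 3] = 1.0
    p = spectral_bisection(W)
    print(p)                              # one sign for nodes 0-3, the other for 4-7
    # sanity check of Property 4: p^T L p = 4 * (number of cut edges)
    L = np.diag(W.sum(axis=1)) - W
    print(p @ L @ p)                      # the partition cuts exactly one edge -> 4
```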

38 Return to the Laplacean eigenmaps approach. Laplacean Eigenmaps *minimizes* F_EM(Y) = Σ_{i,j=1}^n w_ij ||y_i - y_j||² subject to Y D Y^T = I. Notes: 1. Motivation: if ||x_i - x_j|| is small (original data), we want ||y_i - y_j|| to also be small (low-dimensional data). 2. Note: Min instead of Max as in PCA [counter-intuitive]. 3. The above problem uses the original data only indirectly, through its graph. Kalamata, p. 38

39 The problem translates to: min Tr [ Y (D - W) Y^T ] over Y ∈ R^{d×n} with Y D Y^T = I. Solution (sort eigenvalues increasingly): (D - W) u_i = λ_i D u_i; y_{i,:} = u_i^T; i = 1, ..., d. Note: this is an n × n sparse eigenvalue problem [in sample space]. Note: one can assume D = I; this amounts to rescaling the data. The problem then becomes (I - W) u_i = λ_i u_i; y_{i,:} = u_i^T; i = 1, ..., d. Kalamata, p. 39
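A dense, small-n sketch of this computation (using scipy's generalized symmetric eigensolver; the trivial constant eigenvector at λ = 0 is discarded, which is standard practice although the slide does not spell it out):

```python
import numpy as np
from scipy.linalg import eigh

def laplacean_eigenmaps(W, d):
    """Solve (D - W) u_i = lambda_i D u_i and use the eigenvectors of the
    smallest nonzero eigenvalues as the rows of the d x n embedding Y."""
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)        # generalized symmetric eigenproblem, ascending
    return vecs[:, 1:d + 1].T          # skip the constant eigenvector at lambda = 0

if __name__ == "__main__":
    # toy affinity graph: a cycle on 20 nodes
    n = 20
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
    Y = laplacean_eigenmaps(W, d=2)
    print(Y.shape)                      # (2, 20)
```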

40 Why the smallest eigenvalues here vs. the largest for PCA? Intuition: the graph Laplacean and the unit Laplacean are very different: one involves a sparse graph (more like a discrete differential operator), the other a dense graph (more like a discrete integral operator). They should be treated as the inverses of each other. This viewpoint is confirmed by what we learn from the kernel approach. Kalamata, p. 40

41 Locally Linear Embedding (Roweis-Saul '00). LLE is very similar to Eigenmaps. Main differences: 1) the graph Laplacean matrix is replaced by an affinity graph; 2) the objective function is changed: we want to preserve the graph. 1. Graph: each x_i is written as a convex combination of its k nearest neighbors: x_i ≈ Σ_j w_ij x_j, with Σ_{j ∈ N_i} w_ij = 1. The optimal weights are computed ("local calculation") by minimizing ||x_i - Σ_j w_ij x_j|| for i = 1, ..., n. Kalamata, p. 41

42 2. Mapping: the y_i's should obey the same affinity as the x_i's. Minimize Σ_i ||y_i - Σ_j w_ij y_j||² subject to Y 1 = 0, Y Y^T = I. Solution: (I - W^T)(I - W) u_i = λ_i u_i; y_{i,:} = u_i^T. The matrix (I - W^T)(I - W) replaces the graph Laplacean of eigenmaps. Kalamata, p. 42
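Here is a compact (dense, unoptimized) numpy sketch of both LLE steps, under the usual practical conventions of a small regularization of the local Gram matrix and dropping the constant bottom eigenvector, which the slides do not spell out:

```python
import numpy as np

def lle_embed(X, k, d, reg=1e-3):
    """LLE sketch: X is m x n (points as columns).  Step 1 computes the local
    reconstruction weights w_ij over the k nearest neighbors of each x_i
    (regularized Gram system, weights summing to 1).  Step 2 reads the d x n
    embedding off the bottom eigenvectors of M = (I - W)^T (I - W)."""
    m, n = X.shape
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(D2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:k]
        Z = X[:, nbrs] - X[:, [i]]              # neighbors shifted to x_i
        G = Z.T @ Z
        G += reg * np.trace(G) * np.eye(k)      # regularization for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                # enforce sum_j w_ij = 1
    E = np.eye(n) - W
    vals, vecs = np.linalg.eigh(E.T @ E)        # M = (I - W)^T (I - W)
    return vecs[:, 1:d + 1].T                   # skip the constant eigenvector

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = rng.standard_normal((3, 80))
    print(lle_embed(X, k=8, d=2).shape)         # (2, 80)
```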

43 Locality Preserving Projections (He-Niyogi '03). LPP is a linear dimensionality reduction technique. Recall the setting: we want V ∈ R^{m×d}, Y = V^T X. Starts with the same neighborhood graph as Eigenmaps: L ≡ D - W = the graph Laplacean, with D ≡ diag( Σ_j w_ij ). Kalamata, p. 43

44 The optimization problem is to solve min Σ_{i,j} w_ij ||y_i - y_j||² over Y ∈ R^{d×n} with Y D Y^T = I and Y = V^T X. Difference with eigenmaps: Y is a projection of the data X. Solution (sort eigenvalues increasingly): X L X^T v_i = λ_i X D X^T v_i; y_{i,:} = v_i^T X. Note: essentially the same method appears in [Koren-Carmel 04], called weighted PCA [viewed from the angle of improving PCA]. Kalamata, p. 44
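A short sketch of the LPP eigenproblem itself (the graph-building step is the same as in the kNN sketch after slide 34). A tiny ridge term is added to X D X^T so the dense generalized solver stays well posed when that matrix is nearly singular; this is a practical safeguard of mine, not part of the slide.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, d, eps=1e-8):
    """LPP sketch: solve X L X^T v_i = lambda_i X D X^T v_i and keep the
    eigenvectors of the d smallest eigenvalues.  X is m x n, W is the n x n
    affinity matrix of the neighborhood graph.  Returns V (m x d) and Y = V^T X."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T + eps * np.eye(X.shape[0])   # small ridge for numerical safety
    vals, vecs = eigh(A, B)                      # ascending generalized eigenvalues
    V = vecs[:, :d]
    return V, V.T @ X
```

Calling lpp(X, W, d=2) with W built from X as in the earlier kNN sketch returns the reduced data Y together with the explicit projector V, which can then also be applied to new data points.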

45 ONPP (Kokiopoulou and YS 05): Orthogonal Neighborhood Preserving Projections. Can be viewed as a linear version of LLE. Uses the same graph as LLE. Objective: preserve the affinity graph (as in LLE), *but* by means of an orthogonal projection. Objective function: Φ(Y) = Σ_i ||y_i - Σ_j w_ij y_j||². Constraint: Y = V^T X, V^T V = I. Notice that Φ(Y) = ||Y - Y W^T||²_F = Tr [ V^T X (I - W^T)(I - W) X^T V ]. Kalamata, p. 45

46 Resulting problem: min Tr [ V^T X (I - W^T)(I - W) X^T V ] over V ∈ R^{m×d} with V^T V = I, where M = X (I - W^T)(I - W) X^T. Solution: the columns of V are the eigenvectors of M associated with its smallest d eigenvalues. They can be computed as the d lowest left singular vectors of X (I - W^T). Kalamata, p. 46
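The corresponding computation is only a few lines (a sketch, assuming m ≤ n so that the thin SVD delivers all m left singular vectors; otherwise a preliminary PCA step is commonly used, which the slide does not discuss):

```python
import numpy as np

def onpp(X, W, d):
    """ONPP sketch: V = eigenvectors of M = X (I - W^T)(I - W) X^T for the d
    smallest eigenvalues, computed as the d lowest left singular vectors of
    E = X (I - W^T).  W holds LLE-style weights (rows summing to 1)."""
    n = X.shape[1]
    E = X @ (np.eye(n) - W.T)                # M = E E^T
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    V = U[:, -d:]                            # lowest d left singular vectors
    return V, V.T @ X                        # orthogonal projector and reduced data
```

Because V^T V = I, the projector applies directly to new (out-of-sample) points, unlike the purely nonlinear LLE mapping.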

47 A unified view.

Method     Objective (min)                              Constraint
PCA/MDS    Tr [ V^T X (-I + e e^T) X^T V ]              V^T V = I
LLE        Tr [ Y (I - W^T)(I - W) Y^T ]                Y Y^T = I
Eigenmaps  Tr [ Y (I - W) Y^T ]                         Y Y^T = I
LPP        Tr [ V^T X (I - W) X^T V ]                   V^T X X^T V = I
ONPP       Tr [ V^T X (I - W^T)(I - W) X^T V ]          V^T V = I
LDA        Tr [ V^T X (I - H) X^T V ]                   V^T X X^T V = I

Kalamata, p. 47

48 Let M = I - W (a Laplacean matrix; -I + e e^T for PCA/MDS), or the LLE matrix (I - W^T)(I - W), or the geodesic distance matrix (ISOMAP). All techniques lead to one of two types of problems. The first type is: min Tr [ Y M Y^T ] over Y ∈ R^{d×n} with Y Y^T = I. Y is obtained from solving an eigenvalue problem: LLE, Eigenmaps (normalized), ... Kalamata, p. 48

49 And the second type is: min Tr [ V^T X M X^T V ] over V ∈ R^{m×d} with V^T G V = I. G is either the identity matrix, X D X^T, or X X^T. Low-dimensional data: Y = V^T X. Important observation: the 2nd type is just a projected version of the 1st, i.e., approximate eigenvectors are sought in span{X} [Rayleigh-Ritz procedure]. The problem is of dimension m (the dimension of the data), not n (the # of samples). This difference can be mitigated by resorting to kernels. Kalamata, p. 49

50 Graph-based methods in a supervised setting. The subjects of the training set are known (labeled). Q: given a test image (say), find its label. [Figure: labeled training images grouped by subject (e.g. R. S. Varga, S. R. Smith, J. R. Doe) and an unlabeled test image.] Question: find the label (best match) for the test image. Kalamata, p. 50

51 The methods can be adapted to the supervised mode by building the graph so as to take class labels into account. The idea is simple: build G so that nodes in the same class are neighbors. If c = # classes, G will consist of c cliques. The matrix W will be block-diagonal: W = diag(W_1, W_2, ..., W_c). It is easy to see that rank(W) = n - c. Can be used for LPP, ONPP, etc. Kalamata, p. 51
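One simple way to build such a block-diagonal W from a label vector (a sketch using uniform 0/1 within-class weights and no self-loops; other weightings, e.g. LLE-style normalized weights, fit the same pattern):

```python
import numpy as np

def supervised_affinity(labels):
    """Class-based affinity: nodes with the same label are all connected, so
    (after grouping the samples by class) W is block diagonal,
    W = diag(W_1, ..., W_c), one clique per class."""
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)       # no self-loops
    return W

if __name__ == "__main__":
    print(supervised_affinity([0, 0, 1, 1, 1, 2]))
```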

52 TIME FOR A MATLAB DEMO

53 Supervised learning experiments: digit recognition. Set of 390 images of digits. Each picture has 16 ... pixels. Select any one of the digits and try to recognize it among the remaining images. Methods: KNN, PCA, LPP, ONPP. Kalamata, p. 53

54 One word about KNN classifiers. Simple voting idea: get the k nearest neighbors of the test point; the assigned label is the most common label among these neighbors. Kalamata, p. 54
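A plain numpy version of this vote (toy 2-D example of my own; in the experiments above it would be applied in the reduced space produced by PCA, LPP, or ONPP):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k):
    """kNN vote: return the most common label among the k training points
    (columns of X_train) closest to x."""
    d2 = np.sum((X_train - x[:, None])**2, axis=0)    # squared distances to x
    nbrs = np.argsort(d2)[:k]
    labels, counts = np.unique(y_train[nbrs], return_counts=True)
    return labels[np.argmax(counts)]

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    X0 = rng.normal(0.0, 1.0, size=(2, 50))           # class 0 near the origin
    X1 = rng.normal(4.0, 1.0, size=(2, 50))           # class 1 shifted away
    X_train = np.hstack([X0, X1])
    y_train = np.array([0] * 50 + [1] * 50)
    print(knn_classify(X_train, y_train, np.array([3.8, 4.1]), k=5))   # -> 1
```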

55 MULTILEVEL METHODS

56 Multilevel techniques. Divide-and-conquer paradigms, as well as multilevel methods in the sense of domain decomposition. Main principle: it is very costly to do an SVD [or Lanczos] on the whole set. Why not find a smaller set on which to do the analysis, without too much loss? Tools used: graph coarsening, divide and conquer. Kalamata, p. 56

57 Multilevel techniques: hypergraphs to the rescue. General idea: given X = [x_1, ..., x_n] ∈ R^{m×n}, find another set (the "coarsened" set) ˆX = [ˆx_1, ..., ˆx_k] ∈ R^{m×k}. ˆX should be a good representative of X; then find the projector from R^m to R^d based on this subset. Main tool used: graph coarsening. We will describe hypergraph-based techniques. Kalamata, p. 57

58 Hypergraphs. A hypergraph H = (V, E) is a generalization of a graph: V = set of vertices, E = set of hyperedges; each e ∈ E is a nonempty subset of V. Standard undirected graphs: each e consists of two vertices. Degree of e = |e|; degree of a vertex v = number of hyperedges e such that v ∈ e. Two vertices are neighbors if there is a hyperedge connecting them. Kalamata, p. 58

59 Example of a hypergraph. [Figure: a small hypergraph with hyperedges ("nets") a, ..., e over a few numbered vertices, and its Boolean matrix representation A, one row per hyperedge and one column per vertex.] This is the canonical hypergraph representation for sparse data (e.g., text). Kalamata, p. 59

60 Hypergraph Coarsening. Coarsening a hypergraph H = (V, E) means finding a coarse approximation Ĥ = (V̂, Ê) to H with |V̂| < |V|, which tries to retain as much as possible of the structure of the original hypergraph. Idea: repeat the coarsening recursively. Result: a succession of smaller hypergraphs which approximate the original graph. Several methods exist. We use matching, which successively merges pairs of vertices; it is used in most graph partitioning methods (hMETIS, PaToH, Zoltan, ...). The algorithm successively selects pairs of vertices to merge based on a measure of similarity of the vertices; a simplified sketch is given below. Kalamata, p. 60
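The sketch below illustrates coarsening by matching in its simplest form, applied to the data columns directly with plain cosine similarity as the pairing criterion; this is an illustrative simplification of mine, not the hypergraph-based criterion used in the talk.

```python
import numpy as np

def coarsen_once(X):
    """One level of coarsening by matching: greedily pair each unmatched column
    with its most similar unmatched column (cosine similarity) and replace the
    pair by its average; an unmatched leftover column is kept as it is."""
    m, n = X.shape
    Xn = X / (np.linalg.norm(X, axis=0) + 1e-12)
    C = Xn.T @ Xn                                # cosine similarities
    np.fill_diagonal(C, -np.inf)
    matched = np.zeros(n, dtype=bool)
    reps = []
    for i in np.argsort(-C.max(axis=1)):         # visit vertices, best matches first
        if matched[i]:
            continue
        cand = np.where(~matched)[0]
        cand = cand[cand != i]
        if cand.size == 0:                       # odd vertex left over
            reps.append(X[:, i])
            matched[i] = True
            continue
        j = cand[np.argmax(C[i, cand])]          # most similar unmatched vertex
        reps.append(0.5 * (X[:, i] + X[:, j]))   # merge the pair
        matched[i] = matched[j] = True
    return np.column_stack(reps)

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    X = rng.random((100, 64))
    Xc = coarsen_once(coarsen_once(X))           # two coarsening levels
    print(X.shape, Xc.shape)                     # (100, 64) (100, 16)
```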

61 Application: Multilevel Dimension Reduction. Main idea: coarsen to a certain level, then use the resulting data set ˆX to find the projector from R^m to R^d. This projector can be used to project the original data or any new data. [Figure: X (m × n) is coarsened recursively down to the last level X_r (m × n_r); a projection is computed from X_r, giving Y_r (d × n_r), and the same projector maps X to Y (d × n).] Main gain: the dimension reduction is done with a much smaller set. Hope: not much loss compared to using the whole data. Kalamata, p. 61

62 Application to text mining. Recall the common approach: 1. Scale the data [e.g., TF-IDF scaling]. 2. Perform a (partial) SVD on the resulting matrix: X ≈ U_d Σ_d V_d^T. 3. Process the query by the same scaling (e.g., TF-IDF). 4. Compute similarities in the d-dimensional space: s_i = ⟨ˆq, ˆx_i⟩ / ( ||ˆq|| ||ˆx_i|| ), where [ˆx_1, ˆx_2, ..., ˆx_n] = V_d^T ∈ R^{d×n} and ˆq = Σ_d^{-1} U_d^T q ∈ R^d. Multilevel approach: replace the SVD (or any other dimension reduction) by a dimension reduction on the coarse set. Only difference: the TF-IDF scaling is done on the coarse set, not the original set. Kalamata, p. 62

63 Tests. Three public data sets were used for the experiments: Medline, Cran and NPL (cs.cornell.edu). Coarsening to a maximum of 4 levels. [Table: # documents, # terms, sparsity, # queries, and avg. # relevant/query for each data set; sparsity is 0.74% for Medline, 1.41% for Cran, and 0.27% for NPL.] Kalamata, p. 63

64 Results with NPL. [Table: statistics per coarsening level: coarsening time, # of documents, optimal # of dimensions, and optimal average precision; level #1 (no coarsening) has N/A coarsening time.] Kalamata, p. 64

65 Precision-Recall curves. [Figure: precision vs. recall on NPL for LSI and for multilevel LSI with r = 2, 3, 4 coarsening levels.] Kalamata, p. 65

66 CPU times for preprocessing (dimension reduction part). [Figure: CPU time for the truncated SVD (secs) vs. # dimensions on NPL, for LSI and for multilevel LSI with r = 2, 3, 4.] Kalamata, p. 66

67 Conclusion. So how is this related to the initial title of "efficient algorithms in data mining"? Answer: all these eigenvalue problems are not cheap to solve... and the cost issue does not seem to bother practitioners too much for now. Ingredients that will become mandatory: 1. Avoid the SVD. 2. Fast algorithms that do not sacrifice quality. 3. In particular: multilevel approaches. 4. Multilinear algebra [tensors]. Kalamata, p. 67
