Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

1 Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org. Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

2 Assumption: Data lies on or near a low $d$-dimensional subspace. The axes of this subspace are an effective representation of the data. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2

3 Compress / reduce dimensionality: $10^6$ rows; $10^3$ columns; no updates. Random access to any cell(s); small error: OK. The above matrix is really "2-dimensional". All rows can be reconstructed by scaling [ ] or [ ]

4 Q: What is the rank of a matrix $A$? A: The number of linearly independent columns of $A$. For example, the matrix $A$ with rows [1 2 1], [-2 -3 1], [3 5 0] has rank $r = 2$. Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third), so the rank must be less than 3. Why do we care about low rank? We can write $A$ using two "basis" vectors, [1 2 1] and [-2 -3 1], and the new coordinates of the rows: [1 0], [0 1], [1 -1].

5 Cloud of points in 3D space: think of the point positions as a matrix with 1 row per point (A, B, C). We can rewrite the coordinates more efficiently! Old basis vectors: [1 0 0], [0 1 0], [0 0 1]. New basis vectors: [1 2 1], [-2 -3 1]. Then A has new coordinates [1 0], B: [0 1], C: [1 -1]. Notice: we reduced the number of coordinates!
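The rank and change-of-basis argument can be checked numerically. The sketch below assumes the third row is [3, 5, 0], which is what "the first row is the sum of the second and third" implies once the first two rows are taken as the basis vectors; NumPy is an assumed tool, not something the slides prescribe.

```python
import numpy as np

# Assumed example matrix: rows 1 and 2 are the basis vectors from the slide,
# row 3 is chosen so that row 1 = row 2 + row 3.
A = np.array([[1, 2, 1],
              [-2, -3, 1],
              [3, 5, 0]], dtype=float)

rank = np.linalg.matrix_rank(A)

# New basis: the two linearly independent rows.
basis = A[:2]                                   # shape (2, 3)

# Coordinates of every row in the new basis: solve coords @ basis = A
# in the least-squares sense (exact here, because rank(A) = 2).
coords = np.linalg.lstsq(basis.T, A.T, rcond=None)[0].T

print("rank:", rank)
print("new coordinates:\n", np.round(coords, 6))
```

Each 3-coordinate row is now described by 2 coordinates, and `coords @ basis` recovers `A`.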

6 The goal of dimensionality reduction is to discover the axes of the data! Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line). By doing this we incur a bit of error, as the points do not lie exactly on the line.

7 Why reduce dimensions? Discover hidden correlations/topics (words that commonly occur together). Remove redundant and noisy features (not all words are useful). Interpretation and visualization. Easier storage and processing of the data.

8 $A_{[m \times n]} = U_{[m \times r]} \, \Sigma_{[r \times r]} \, (V_{[n \times r]})^T$. A: input data matrix, $m \times n$ (e.g., $m$ documents, $n$ terms). U: left singular vectors, $m \times r$ matrix ($m$ documents, $r$ concepts). Σ: singular values, $r \times r$ diagonal matrix (strength of each "concept"), where $r$ is the rank of the matrix $A$. V: right singular vectors, $n \times r$ matrix ($n$ terms, $r$ concepts).

9 [Figure: the $m \times n$ matrix $A$ drawn as the product $U \Sigma V^T$.]

10 [Figure: $A \approx \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$, where each $\sigma_i$ is a scalar and each $u_i$, $v_i$ is a vector.]

11 It is always possible to decompose a real matrix $A$ into $A = U \Sigma V^T$, where $U$, $\Sigma$, $V$ are unique (the singular values always; the singular vectors up to sign when the singular values are distinct). $U$, $V$: column orthonormal, $U^T U = I$ and $V^T V = I$ ($I$: identity matrix), i.e. the columns are orthogonal unit vectors. $\Sigma$: diagonal, with entries (singular values) that are positive and sorted in decreasing order ($\sigma_1 \ge \sigma_2 \ge \dots \ge 0$). Nice proof of uniqueness:
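These properties are easy to verify on a concrete matrix. The sketch below uses NumPy's `np.linalg.svd` on an arbitrary random matrix (both the library and the test matrix are assumptions for illustration). Note that NumPy returns min(m, n) singular values rather than exactly rank-many, and returns $V^T$ rather than $V$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                 # arbitrary real 6 x 4 matrix

# Thin SVD: U is 6 x 4, s holds the 4 singular values, Vt is V^T (4 x 4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

I = np.eye(len(s))
cols_orthonormal = bool(np.allclose(U.T @ U, I) and np.allclose(V.T @ V, I))
sorted_desc = bool(np.all(np.diff(s) <= 0))          # sigma_1 >= sigma_2 >= ...
reconstructs = bool(np.allclose(U @ np.diag(s) @ Vt, A))

print(cols_orthonormal, sorted_desc, reconstructs)
```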

12 $A = U \Sigma V^T$ example: users-to-movies rating matrix. Columns: Matrix, Alien, Serenity (SciFi); Casablanca, Amelie (Romance). The "concepts" are also known as latent dimensions or latent factors.

13-14 In the factorization of the users-to-movies matrix, one concept is the SciFi-concept and another is the Romance-concept.

15 $U$ is the "user-to-concept" similarity matrix.

16 The corresponding diagonal entry of $\Sigma$ is the "strength" of the SciFi-concept.

17 $V$ is the "movie-to-concept" similarity matrix.

18 "Movies", "users" and "concepts": $U$ is the user-to-concept similarity matrix, $V$ is the movie-to-concept similarity matrix, and the diagonal elements of $\Sigma$ give the "strength" of each concept.
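To make the "concept" reading concrete, the sketch below factors a small users-to-movies matrix. The ratings are hypothetical toy values chosen so that the matrix has exactly two concepts (the deck's own numbers are not reproduced in this transcript); NumPy is an assumed tool.

```python
import numpy as np

movies = ["Matrix", "Alien", "Serenity", "Casablanca", "Amelie"]

# Hypothetical ratings: 4 SciFi fans followed by 3 Romance fans.
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r = rank(A) concepts with non-zero strength.
r = int(np.linalg.matrix_rank(A))
U, s, Vt = U[:, :r], s[:r], Vt[:r]

for i in range(r):
    # Signs of singular vectors are arbitrary, so report magnitudes.
    weights = {m: round(float(abs(w)), 2) for m, w in zip(movies, Vt[i])}
    print(f"concept {i}: strength {s[i]:.2f}, movie weights {weights}")
```

With this toy data the stronger concept loads only on the three SciFi movies and the weaker one only on the two Romance movies.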

20 [Figure: users plotted by Movie 1 rating vs. Movie 2 rating, with the first right singular vector $v_1$ drawn through the points.] Instead of using two coordinates to describe point locations, let's use only one coordinate. A point's position is its location along the vector $v_1$. How to choose $v_1$? Minimize the reconstruction error.

21 Goal: minimize the sum of reconstruction errors, $\sum_i \sum_j \lVert x_{ij} - z_{ij} \rVert^2$, where $x_{ij}$ are the "old" and $z_{ij}$ are the "new" coordinates. SVD gives the "best" axis to project on, where "best" means minimizing the reconstruction error.

22 $A = U \Sigma V^T$ example: $V$ is the "movie-to-concept" matrix and $U$ is the "user-to-concept" matrix; $v_1$ is the first right singular vector.

23 $A = U \Sigma V^T$ example: variance ("spread") on the $v_1$ axis.

24 $A = U \Sigma V^T$ example: $U \Sigma$ gives the coordinates of the points on the projection axes, e.g. the projection of the users on the "Sci-Fi" axis.
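The claim that $U \Sigma$ holds the projected coordinates follows from $A V = U \Sigma V^T V = U \Sigma$. The sketch below checks this on a hypothetical toy ratings matrix (the values are assumptions; NumPy is an assumed tool).

```python
import numpy as np

# Hypothetical users-to-movies ratings (columns: 3 SciFi, then 2 Romance).
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

coords_by_projection = A @ V     # inner product of each user with each v_i
coords_by_factors = U * s        # same as U @ diag(s)

# First column: position of every user along the first axis v_1.
first_axis = coords_by_projection[:, 0]
print(np.round(np.abs(first_axis), 3))
```

For this toy data the first axis is the SciFi axis, so the three Romance-only users project to 0 on it.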

25-29 More details. Q: How exactly is dimensionality reduction done? A: Set the smallest singular values to zero; the corresponding columns of $U$ and $V$ can then be dropped.

30 The resulting matrix $B$ is close to $A$ in Frobenius norm: with $\lVert M \rVert_F = \sqrt{\sum_{ij} M_{ij}^2}$, the difference $\lVert A - B \rVert_F = \sqrt{\sum_{ij} (A_{ij} - B_{ij})^2}$ is "small".

31 $A = U \Sigma V^T$; $B = U S V^T$, where $S$ is $\Sigma$ with the smallest singular values set to zero, is the best approximation of $A$.

32 Theorem: Let $A = U \Sigma V^T$ and $B = U S V^T$, where $S$ is the diagonal $r \times r$ matrix with $s_i = \sigma_i$ for $i = 1, \dots, k$ and $s_i = 0$ otherwise. Then $B$ is a best rank-$k$ approximation to $A$. What do we mean by "best"? $B$ is a solution to $\min_B \lVert A - B \rVert_F$ subject to $\mathrm{rank}(B) = k$.

33 Details! Assume $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ and $\mathrm{rank}(A) = r$. We will need 2 facts: (1) $\lVert M \rVert_F^2 = \sum_i q_{ii}^2$, where $M = P Q R$ is the SVD of $M$; (2) $U \Sigma V^T - U S V^T = U (\Sigma - S) V^T$.

34 Details! Fact (1) holds because $P$ is column orthonormal, $R$ is row orthonormal, and $Q$ is diagonal: multiplying by a matrix with orthonormal columns or rows leaves the Frobenius norm unchanged.

35 Details! Why is $B$ a solution? Using fact (2) and then fact (1):

$$\min_{B,\ \mathrm{rank}(B)=k} \lVert A - B \rVert_F^2 = \min_{S} \lVert \Sigma - S \rVert_F^2 = \min_{s_i} \sum_{i=1}^{r} (\sigma_i - s_i)^2 .$$

We want to choose the $s_i$ to minimize this sum when only $k$ of them may be non-zero:

$$\min_{s_i} \left[ \sum_{i=1}^{k} (\sigma_i - s_i)^2 + \sum_{i=k+1}^{r} \sigma_i^2 \right] = \sum_{i=k+1}^{r} \sigma_i^2 .$$

The solution is to set $s_i = \sigma_i$ for $i = 1, \dots, k$ and all other $s_i = 0$.
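The best rank-$k$ result can be checked numerically: zero all but the $k$ largest singular values and compare the squared Frobenius error with the sum of the discarded $\sigma_i^2$. This is a minimal sketch on a random matrix; the helper name `best_rank_k` and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def best_rank_k(A, k):
    """Return (B, s): B = U S V^T with all but the k largest sigma zeroed."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                       # s_i = sigma_i for i <= k, else 0
    return (U * s_trunc) @ Vt, s

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 5))
k = 2

B, s = best_rank_k(A, k)
err_sq = np.linalg.norm(A - B, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)                # sum of discarded sigma_i^2

# Any other rank-k matrix should do no better; try one random candidate.
other = rng.normal(size=(8, k)) @ rng.normal(size=(k, 5))
other_err_sq = np.linalg.norm(A - other, "fro") ** 2

print(err_sq, tail_sq, other_err_sq)
```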

36 Equivalent: the "spectral decomposition" of the matrix: $A = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} \sigma_1 & \\ & \sigma_2 \end{bmatrix} \begin{bmatrix} v_1 & v_2 \end{bmatrix}^T$.

37 Equivalent: the "spectral decomposition" of the matrix: $A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$ ($k$ terms; for an $m \times n$ matrix $A$, each $u_i$ is $m \times 1$ and each $v_i^T$ is $1 \times n$). Assume $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \dots \ge 0$. Why is setting small $\sigma_i$ to 0 the right thing to do? The vectors $u_i$ and $v_i$ are unit length, so $\sigma_i$ scales them. So zeroing the small $\sigma_i$ introduces less error.

38 Q: How many $\sigma$s to keep? A: Rule of thumb: keep 80-90% of the "energy" $\sum_i \sigma_i^2$.
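The energy rule of thumb can be sketched as a small helper that returns the fewest leading singular values whose squared sum reaches a target fraction of the total. The function name `choose_k`, the 0.9 default, and the example singular values are illustrative assumptions.

```python
import numpy as np

def choose_k(singular_values, energy=0.9):
    """Smallest k whose leading singular values retain `energy` of sum(sigma^2)."""
    s = np.asarray(singular_values, dtype=float)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))                   # guard against round-off at energy=1

s = [12.4, 9.5, 1.3]                        # illustrative singular values
k = choose_k(s, energy=0.9)
print(k)
```

Here the first value alone retains about 63% of the energy and the first two about 99%, so a 90% target keeps two of the three.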

39 To compute the SVD: $O(nm^2)$ or $O(n^2 m)$ (whichever is less). But: less work if we just want the singular values, or if we want the first $k$ singular vectors, or if the matrix is sparse. Implemented in linear algebra packages like LINPACK, Matlab, SPlus, Mathematica...

40 SVD: $A = U \Sigma V^T$: unique. $U$: user-to-concept similarities. $V$: movie-to-concept similarities. $\Sigma$: strength of each concept. Dimensionality reduction: keep the few largest singular values (80-90% of the "energy"). SVD picks up linear correlations.

41 SVD gives us $A = U \Sigma V^T$. Eigen-decomposition: $A = X \Lambda X^T$, where $A$ is symmetric. $U$, $V$, $X$ are orthonormal ($U^T U = I$); $\Lambda$, $\Sigma$ are diagonal. Now let's calculate: $A A^T = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma V^T (V \Sigma^T U^T) = U \Sigma \Sigma^T U^T$ and $A^T A = V \Sigma^T U^T (U \Sigma V^T) = V \Sigma \Sigma^T V^T$.

42 Both products have the form $X \Lambda^2 X^T$: the columns of $U$ are eigenvectors of $A A^T$, the columns of $V$ are eigenvectors of $A^T A$, and the squared singular values are the eigenvalues. This shows how to compute the SVD using an eigenvalue decomposition!
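The relationship between the SVD and the eigen-decomposition can be sketched as follows: eigen-decompose $A^T A$ to get $V$ and $\sigma_i^2$, then recover each $u_i = A v_i / \sigma_i$. This assumes all singular values are non-zero (true for the random full-rank matrix used here) and uses NumPy's `eigh` as an assumed tool; it illustrates the identity and is not a recommended way to compute an SVD in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))                 # full column rank with probability 1

# Reference singular values from a direct SVD.
s = np.linalg.svd(A, compute_uv=False)

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]           # re-sort to descending
eigvals, V_eig = eigvals[order], eigvecs[:, order]

sigma_from_eig = np.sqrt(np.clip(eigvals, 0.0, None))
U_eig = (A @ V_eig) / sigma_from_eig        # u_i = A v_i / sigma_i, column by column

A_rebuilt = (U_eig * sigma_from_eig) @ V_eig.T
print(np.allclose(sigma_from_eig, s), np.allclose(A_rebuilt, A))
```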

43 $A A^T = U \Sigma^2 U^T$ and $A^T A = V \Sigma^2 V^T$, so $(A^T A)^k = V \Sigma^{2k} V^T$. E.g.: $(A^T A)^2 = V \Sigma^2 V^T V \Sigma^2 V^T = V \Sigma^4 V^T$. For $k \gg 1$: $(A^T A)^k \approx v_1 \sigma_1^{2k} v_1^T$.
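Because high powers of $A^T A$ are dominated by $v_1$, repeatedly multiplying a vector by $A^T A$ and renormalizing converges to the first right singular vector (power iteration). The sketch below is one way to do this; the helper name, the all-ones start vector, the iteration count, and the toy matrix are assumptions. It would fail if the start vector happened to be orthogonal to $v_1$.

```python
import numpy as np

def top_right_singular_vector(A, iters=100):
    """Power iteration on A^T A; returns (v_1 estimate, sigma_1 estimate)."""
    M = A.T @ A
    v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])   # assumed start vector
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)                      # keep unit length
    sigma = np.sqrt(v @ M @ v)                      # v^T (A^T A) v = sigma_1^2
    return v, sigma

# Hypothetical users-to-movies ratings (columns: 3 SciFi, then 2 Romance).
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

v1, sigma1 = top_right_singular_vector(A)
print(np.round(v1, 4), round(float(sigma1), 4))
```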

45 Q: Find users that like "Matrix". A: Map the query into the "concept space". How?

46-47 The query $q$ is a row of ratings over the movies (Matrix, Alien, Serenity, Casablanca, Amelie). Project it into concept space by taking the inner product with each "concept" vector $v_i$; for example, $q \cdot v_1$ is its coordinate on the first concept.

48 Compactly, we have $q_{concept} = q V$, where $V$ holds the movie-to-concept similarities.

49 How would the user $d$ that rated ("Alien", "Serenity") be handled? In the same way: $d_{concept} = d V$.

50 Observation: user $d$ that rated ("Alien", "Serenity") will be similar to user $q$ that rated ("Matrix"), although $d$ and $q$ have zero ratings in common! In the original movie space their similarity is 0; in concept space both load on the SciFi-concept, so their similarity is non-zero.
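The observation about users $q$ and $d$ can be reproduced with a few lines. In the sketch below the ratings matrix and the two query vectors are hypothetical toy values, and cosine similarity is an assumed choice of similarity measure.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings; columns: Matrix, Alien, Serenity, Casablanca, Amelie.
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                                 # keep the two non-trivial concepts

q = np.array([5.0, 0, 0, 0, 0])              # rated only "Matrix"
d = np.array([0.0, 4, 5, 0, 0])              # rated only "Alien" and "Serenity"

q_concept = q @ V                            # q_concept = q V
d_concept = d @ V                            # d_concept = d V

sim_movie_space = cosine(q, d)
sim_concept_space = cosine(q_concept, d_concept)
print(sim_movie_space, round(sim_concept_space, 6))
```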

51 SVD: (+) Optimal low-rank approximation in terms of Frobenius norm. (-) Interpretability problem: a singular vector specifies a linear combination of all input columns or rows. (-) Lack of sparsity: singular vectors are dense!

53-54 Frobenius norm: $\lVert X \rVert_F = \sqrt{\sum_{ij} X_{ij}^2}$. Goal: express $A$ as a product of matrices $C$, $U$, $R$ such that $\lVert A - C U R \rVert_F$ is small. "Constraints" on $C$ and $R$: $C$ is a set of actual columns of $A$ and $R$ is a set of actual rows of $A$. $U$ is the pseudo-inverse of the intersection of $C$ and $R$.

55 Let $A_k$ be the "best" rank-$k$ approximation to $A$ (that is, $A_k$ is the rank-$k$ truncated SVD of $A$). Theorem [Drineas et al.]: CUR in $O(m \cdot n)$ time achieves $\lVert A - C U R \rVert_F \le \lVert A - A_k \rVert_F + \varepsilon \lVert A \rVert_F$ with probability at least $1 - \delta$, by picking $O(k \log(1/\delta) / \varepsilon^2)$ columns and $O(k^2 \log^3(1/\delta) / \varepsilon^6)$ rows. In practice: pick $4k$ columns/rows.

56 Sampling columns (similarly for rows): note that this is a randomized algorithm, so the same column can be sampled more than once.
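The sampling procedure itself is not reproduced in this transcript. The sketch below assumes the usual norm-based scheme: pick each column with probability proportional to its squared norm, with replacement, and rescale each pick by $1/\sqrt{c \cdot P(j)}$. The function name `column_select`, the toy matrix, and the seed are illustrative assumptions.

```python
import numpy as np

def column_select(A, c, rng):
    """Sample c columns of A with replacement, P(j) ~ squared column norm."""
    col_sq_norms = np.sum(A ** 2, axis=0)
    probs = col_sq_norms / col_sq_norms.sum()
    picked = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    # Rescale so that every sampled column carries an equal share of ||A||_F^2.
    C = A[:, picked] / np.sqrt(c * probs[picked])
    return C, picked

# Hypothetical users-to-movies ratings.
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

rng = np.random.default_rng(0)
C, picked_cols = column_select(A, c=4, rng=rng)

# Rows are sampled the same way by applying the routine to A^T.
R_t, picked_rows = column_select(A.T, c=4, rng=rng)
R = R_t.T

print(picked_cols, picked_rows)
```

Because sampling is with replacement, `picked_cols` and `picked_rows` may contain repeated indices.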

57 Let $W$ be the "intersection" of the sampled columns $C$ and rows $R$, and let the SVD of $W$ be $W = X Z Y^T$. Then $U = W^+ = Y Z^+ X^T$, where $Z^+$ holds the reciprocals of the non-zero singular values, $Z^+_{ii} = 1 / Z_{ii}$, and $W^+$ is the "pseudoinverse" of $W$. Why does the pseudoinverse work? If $W = X Z Y^T$ is nonsingular, then $W^{-1} = (Y^T)^{-1} Z^{-1} X^{-1}$. Due to orthonormality, $X^{-1} = X^T$ and $(Y^T)^{-1} = Y$; since $Z$ is diagonal, $(Z^{-1})_{ii} = 1 / Z_{ii}$. So $W^{-1} = Y Z^{-1} X^T$: if $W$ is nonsingular, the pseudoinverse is the true inverse.
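The construction of $U$ from the intersection $W$ can be sketched directly from its SVD. In the example below the rows and columns are hand-picked rather than sampled, and left unscaled, so that the result is easy to inspect; the helper name `pinv_via_svd`, the tolerance, and the toy matrix are assumptions.

```python
import numpy as np

def pinv_via_svd(W, tol=1e-10):
    """W^+ = Y Z^+ X^T, where W = X Z Y^T and Z^+ inverts the non-zero sigma."""
    X, z, Yt = np.linalg.svd(W, full_matrices=False)
    z_plus = np.zeros_like(z)
    nonzero = z > tol
    z_plus[nonzero] = 1.0 / z[nonzero]
    return (Yt.T * z_plus) @ X.T

# Hypothetical rank-2 users-to-movies ratings.
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

cols = [0, 3]                    # hand-picked: one SciFi and one Romance movie
rows = [1, 5]                    # hand-picked: one SciFi and one Romance fan

C = A[:, cols]                   # actual columns of A
R = A[rows, :]                   # actual rows of A
W = A[np.ix_(rows, cols)]        # intersection of the chosen rows and columns
U = pinv_via_svd(W)

print(np.allclose(C @ U @ R, A))
```

Since this toy matrix has rank 2 and $W$ is a nonsingular 2 x 2 block, $C U R$ reproduces $A$ exactly; with randomly sampled rows and columns of a real matrix, the product would only approximate it.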

58 For example: select columns of $A$ using the ColumnSelect algorithm, select rows of $A$ using the ColumnSelect algorithm, and set $U = W^+$. Then, with probability 98%, the CUR error is bounded in terms of the SVD error. In practice: pick $4k$ columns/rows for a "rank-k" approximation.

59 CUR: (+) Easy interpretation, since the basis vectors are actual columns and rows. (+) Sparse basis, since the basis vectors are actual columns and rows. (-) Duplicate columns and rows: columns of large norms will be sampled many times. [Figure: an actual column vs. a singular vector.]

60 If we want to get rid of the duplicates: throw them away, and scale (multiply) the columns/rows by the square root of the number of duplicates ($C_d \to C_s$, $R_d \to R_s$). Then construct a small $U$.

61 SVD: $A = U \Sigma V^T$, where $A$ is huge but sparse, $U$ and $V^T$ are big and dense, and $\Sigma$ is sparse and small. CUR: $A = C U R$, where $A$ is huge but sparse, $C$ and $R$ are big but sparse, and $U$ is dense but small.

62 DBLP bibliographic data: author-to-conference big sparse matrix, where $A_{ij}$ is the number of papers published by author $i$ at conference $j$. 428K authors (rows), 3659 conferences (columns). Very sparse. Want to reduce dimensionality: How much time does it take? What is the reconstruction error? How much space do we need?

63 [Figure: SVD vs. CUR vs. CMD (CUR with no duplicates) on the DBLP data; panels plot space ratio against accuracy and CPU time (sec) against accuracy.] Accuracy: 1 − relative sum of squared errors. Space ratio: #output matrix entries / #input matrix entries. Source: Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM '07.

64 SVD is limited to linear projections: a lower-dimensional linear projection that preserves Euclidean distances. Non-linear methods: Isomap. The data lies on a nonlinear low-dimensional curve, aka manifold; use the distance as measured along the manifold. How? Build an adjacency graph; geodesic distance is graph distance; apply SVD/PCA to the graph pairwise distance matrix.

65 References: Drineas et al.: Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing. J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM 2007. P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas: Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), 2007. M. W. Mahoney, M. Maggioni, and P. Drineas: Tensor-CUR Decompositions For Tensor-Based Data, Proc. 12th Annual SIGKDD, 2006.
