CMU SCS Talk 3: Graph Mining Tools Tensors, communities, parallelism

1 Talk 3: Graph Mining Tools Tensors, communities, parallelism Christos Faloutsos CMU

2 Overall Outline Introduction Motivation Talk#1: Patterns in graphs; generators Talk#2: Tools (Ranking, proximity) Talk#3: Tools (Tensors, scalability) Conclusions Lipari 2010 (C) 2010, C. Faloutsos 2

3 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 3

4 Thanks to Tamara Kolda (Sandia) for the foils on tensor definitions and on TOPHITS

5 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining Lipari 2010 (C) 2010, C. Faloutsos 5

6 Examples of Matrices: Authors and terms John Peter Mary Nick... data mining classif. tree... Lipari 2010 (C) 2010, C. Faloutsos 6

7 Motivation: Why tensors? Q: what is a tensor? Lipari 2010 (C) 2010, C. Faloutsos 7

8 Motivation: Why tensors? A: N-D generalization of matrix: KDD 09 John Peter Mary Nick... data mining classif. tree... Lipari 2010 (C) 2010, C. Faloutsos 8

9 Motivation: Why tensors? A: N-D generalization of matrix: KDD 07 KDD 08 KDD 09 data mining classif. tree... John Peter Mary Nick... Lipari 2010 (C) 2010, C. Faloutsos 9

10 Tensors are useful for 3 or more modes Terminology: mode (or aspect ): Mode#3 data mining classif. tree... Mode#2 Mode (== aspect) #1 Lipari 2010 (C) 2010, C. Faloutsos 10

11 Notice: the 3rd mode does not need to be time, and we can have more than 3 modes. E.g., IP source x IP destination x dest. port

12 E.g., fMRI: x, y, z, time, person-id, task-id (from DENLAB, Temple U., Prof. V. Megalooikonomou +)

13 Motivating applications: why are tensors useful? Web mining (TOPHITS); environmental sensors; intrusion detection (src, dst, time, dest-port); social networks (src, dst, time, type-of-contact); face recognition; etc.

14 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining Lipari 2010 (C) 2010, C. Faloutsos 14

15 Tensor basics Multi-mode extensions of SVD recall that: Lipari 2010 (C) 2010, C. Faloutsos 15

16 Reminder: SVD. $A \approx U \Sigma V^T$, where $A$ is $n \times m$. Best rank-k approximation in L2

17 Reminder: SVD. $A \approx \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$ Best rank-k approximation in L2

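18 Goal: extension to >= 3 modes. An $I \times J \times K$ tensor is approximated by a sum of rank-one terms, with factor matrices $A$ ($I \times R$) and $B$ ($J \times R$), one such matrix per mode, and an $R \times R \times R$ core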

19 Main points: there are 2 major types of tensor decompositions, PARAFAC and Tucker; both can be solved with "alternating least squares" (ALS)

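20 Specially structured tensors (our notation). Tucker tensor: $\mathcal{X} = \mathcal{G} \times_1 U \times_2 V \times_3 W$, with $\mathcal{X}$ of size $I \times J \times K$, core $\mathcal{G}$ of size $R \times S \times T$, $U$ of size $I \times R$ and $V$ of size $J \times S$. Kruskal tensor: $\mathcal{X} = \sum_{r=1}^{R} u_r \circ v_r \circ w_r$, with $U$ of size $I \times R$, $V$ of size $J \times R$, and an $R \times R \times R$ core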
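The Kruskal (PARAFAC) form can be made concrete with a few lines of Python. This is only a sketch: the function name `kruskal_to_tensor`, the per-component weights `lam`, and the toy factor matrices are assumptions for illustration, not part of the talk or of the Tensor Toolbox. It builds a small dense tensor from its rank-one components.

```python
def kruskal_to_tensor(lam, U, V, W):
    """Return X with X[i][j][k] = sum_r lam[r] * U[i][r] * V[j][r] * W[k][r].

    lam holds one (hypothetical) weight per component; use all ones to get
    the plain sum of outer products u_r o v_r o w_r.
    """
    R = len(lam)
    return [[[sum(lam[r] * u[r] * v[r] * w[r] for r in range(R))
              for w in W]
             for v in V]
            for u in U]


# Toy example: 2 authors x 3 keywords x 2 conferences, R = 2 groups.
U = [[1, 0], [0, 1]]            # author x group
V = [[1, 0], [1, 0], [0, 1]]    # keyword x group
W = [[1, 0], [0, 1]]            # conference x group
X = kruskal_to_tensor([2, 3], U, V, W)
print(X[0][0][0], X[1][2][1], X[0][2][1])
```

Each nonzero entry of `X` belongs to exactly one group here, which is the sense in which the factor matrices describe author-, keyword- and conference-groups.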

21 Tucker decomposition: intuition. An author x keyword x conference tensor ($I \times J \times K$) is approximated using A ($I \times R$): author x author-group; B ($J \times S$): keyword x keyword-group; C: conf. x conf-group; G ($R \times S \times T$): how groups relate to each other

22 Intuition behind core tensor 2-d case: co-clustering [Dhillon et al. Information-Theoretic Coclustering, KDD 03] Lipari 2010 (C) 2010, C. Faloutsos 22

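23 [Figure: a matrix (e.g., terms x documents) with dimensions labeled m and n, approximated by the product of an m x k, a k x l, and an l x n matrix]

24 [Figure: the same decomposition for medical and CS documents: a term x term-group matrix, a term-group x doc-group matrix, and a doc x doc-group matrix; term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc]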

25 Tensor tools: summary. Two main tools: PARAFAC and Tucker. Both find row-, column-, tube-groups, but in PARAFAC the three groups are identical. (To solve: alternating least squares.)

26 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining Lipari 2010 (C) 2010, C. Faloutsos 26

27 Web graph mining: how to order the importance of web pages? Kleinberg's algorithm (HITS); PageRank; tensor extension of HITS (TOPHITS)

28 Kleinberg's Hubs and Authorities (the HITS method) [Kleinberg, JACM, 1999]. Take the sparse adjacency matrix (from x to) and its SVD: the left singular vectors give the hub scores for the 1st topic, the 2nd topic, etc.; the right singular vectors give the authority scores for the 1st topic, the 2nd topic, etc.

29 HITS authorities on sample data. We started our crawl from [URL not preserved in the transcript] and crawled 4700 pages, resulting in 560 cross-linked hosts. [Figure: hub and authority scores for the 1st and 2nd topics]
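As a rough illustration of how the first-topic hub and authority scores can be obtained without a full SVD, here is a minimal power-iteration sketch. The function name `hits`, the edge-list input, and the toy graph are assumptions for illustration; the iteration converges to the leading singular vectors only when the top singular value is well separated.

```python
import math


def hits(edges, n, iters=50):
    """Power iteration for HITS on a directed graph with nodes 0..n-1.

    edges is a list of (src, dst) pairs. Returns (hub, auth), each
    normalized to unit L2 length.
    """
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority update: a = A^T h (sum hub scores of pages pointing here).
        auth = [0.0] * n
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update: h = A a (sum authority scores of pages pointed to).
        hub = [0.0] * n
        for src, dst in edges:
            hub[src] += auth[dst]
        a_norm = math.sqrt(sum(x * x for x in auth)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub)) or 1.0
        auth = [x / a_norm for x in auth]
        hub = [x / h_norm for x in hub]
    return hub, auth


hub, auth = hits([(0, 2), (1, 2), (0, 3)], n=4)
print("best hub:", max(range(4), key=lambda i: hub[i]))
print("best authority:", max(range(4), key=lambda i: auth[i]))
```

In the toy graph, node 0 links to both targets and node 2 receives the most links, so they come out as the top hub and top authority respectively.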

30 Three-Dimensional View of the Web Observe that this tensor is very sparse! Lipari 2010 (C) 2010, C. Faloutsos 30 Kolda, Bader, Kenny, ICDM05


33 Topical HITS (TOPHITS). Main idea: extend the idea behind the HITS model to incorporate term (i.e., topical) information. Each topic now has three score vectors: hub scores (from), authority scores (to), and term scores. [Figure: from x to x term tensor with the scores for the 1st and 2nd topics]

35 TOPHITS terms & authorities on sample data. TOPHITS uses 3D analysis (tensor PARAFAC) to find the dominant groupings of web pages and terms. $w_k$ = # unique links using term $k$. [Figure: hub, authority, and term scores for the 1st and 2nd topics]

36 Conclusions Real data are often in high dimensions with multiple aspects (modes) Tensors provide elegant theory and algorithms PARAFAC and Tucker: discover groups Lipari 2010 (C) 2010, C. Faloutsos 36

37 References T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, Pages , November Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on Highdimensional and Multi-aspect Streams, Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006 Lipari 2010 (C) 2010, C. Faloutsos 37

38 Resources See tutorial on tensors, KDD 07 (w/ Tamara Kolda and Jimeng Sun): Lipari 2010 (C) 2010, C. Faloutsos 38

39 Tensor tools: resources. Toolbox from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/tensortoolbox. T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009. csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/tensorreview-preprint.pdf. T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)

40 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 40

41 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Observations Lipari 2010 (C) 2010, C. Faloutsos 41

42 Problem: given a graph and k, break it into k (disjoint) communities, e.g., k = 2

44 Solution #1: METIS Arguably, the best algorithm Open source, at and *many* related papers, at same url Main idea: coarsen the graph; partition; un-coarsen Lipari 2010 (C) 2010, C. Faloutsos 44

45 Solution #1: METIS G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, <and many extensions> Lipari 2010 (C) 2010, C. Faloutsos 45

46 Solution #2 (problem: hard clustering, k pieces). Spectral partitioning: consider the eigenvector of the 2nd smallest eigenvalue of the (normalized) Laplacian
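A small sketch of spectral bisection follows. It makes several simplifying assumptions that are not from the talk: it uses the unnormalized Laplacian L = D - A rather than the normalized one, it finds the eigenvector by shifted power iteration with the constant vector projected out, and it splits nodes by the sign of their entry. The function name `fiedler_partition` and the toy graph are hypothetical.

```python
import math


def fiedler_partition(edges, n, iters=500):
    """Split an undirected, connected graph (nodes 0..n-1) into two parts
    using the sign of the eigenvector of the 2nd smallest eigenvalue of
    the unnormalized Laplacian L = D - A."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(nb) for nb in adj]
    shift = 2.0 * max(deg)          # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        # Remove the component along the all-ones vector (eigenvalue 0 of L).
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # y = (shift * I - L) x, so the smallest remaining eigenvalue of L dominates.
        y = [shift * x[i] - deg[i] * x[i] + sum(x[j] for j in adj[i])
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    left = {i for i in range(n) if x[i] < 0}
    return left, set(range(n)) - left


# Two triangles joined by a single edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(fiedler_partition(edges, 6))
```

On the toy graph the split falls on the bridge between the two triangles, which is also the minimum cut.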

47 Many more ideas (solutions #3, ...): clustering on $A^2$ (square of the adjacency matrix) [Zhou, Woodruff, PODS 04]; minimum cut / maximum flow [Flake+, KDD 00]

48 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Soft clustering matrix decompositions Observations Lipari 2010 (C) 2010, C. Faloutsos 48

49 Problem definition: given a bi-partite graph, and k, l, divide it into k row groups and l column groups. (Also applicable to uni-partite graphs)

50 Co-clustering Given data matrix and the number of row and column groups k and l Simultaneously Cluster rows into k disjoint groups Cluster columns into l disjoint groups Lipari 2010 (C) 2010, C. Faloutsos 50

51 Co-clustering. Let X and Y be discrete random variables; X and Y take values in {1, 2, ..., m} and {1, 2, ..., n}; p(X, Y) denotes the joint probability distribution (if not known, it is often estimated based on co-occurrence data). Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key obstacles in clustering contingency tables: high dimensionality, sparsity, noise; need for robust and scalable algorithms. Reference: Dhillon et al. Information-Theoretic Co-clustering, KDD 03

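52 [Figure: a matrix (e.g., terms x documents) with dimensions labeled m and n, approximated by the product of an m x k, a k x l, and an l x n matrix]

53 [Figure: the same decomposition for medical and CS documents: a term x term-group matrix, a term-group x doc-group matrix, and a doc x doc-group matrix; term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc]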

54 Observations Co-clustering uses KL divergence, instead of L2 the middle matrix is not diagonal we saw that earlier in the Tucker tensor decomposition s/w at: Lipari 2010 (C) 2010, C. Faloutsos 54

55 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Soft clustering matrix decompositions Observations Lipari 2010 (C) 2010, C. Faloutsos 55

56 Problem with Information-Theoretic Co-clustering: the number of row and column groups must be specified. Desiderata: simultaneously discover row and column groups; fully automatic: no "magic numbers"; scalable to large graphs

57 Cross-association Desiderata: Simultaneously discover row and column groups Fully Automatic: No magic numbers Scalable to large matrices Reference: 1. Chakrabarti et al. Fully Automatic Cross-Associations, KDD 04 Lipari 2010 (C) 2010, C. Faloutsos 57

58 What makes a cross-association good? [Figure: two alternative arrangements of the same matrix into row groups and column groups.] Why is one of them better? It is simpler; easier to describe; easier to compress!

60 What makes a cross-association good? Problem definition: given an encoding scheme decide on the # of col. and row groups k and l and reorder rows and columns, to achieve best compression Lipari 2010 (C) 2010, C. Faloutsos 60

61 Main idea: good compression means better clustering. $\text{Total encoding cost} = \sum_i \text{size}_i \cdot H(x_i) + \text{cost of describing cross-associations}$, where the first term is the code cost and the second is the description cost. Minimize the total cost (# bits) for lossless compression
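The code-cost term can be sketched for a binary matrix as follows. This is a simplified illustration under stated assumptions: each block (row group x column group) is charged size x binary entropy of its density, and the description cost of the grouping itself is left out. The names `binary_entropy` and `code_cost` are hypothetical and are not taken from the CrossAssociations code.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 for p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of size_i * H(x_i), where x_i is the density of 1s.

    row_groups[i] / col_groups[j] give the group label of row i / column j.
    The description cost of the grouping is NOT included in this sketch.
    """
    size, ones = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            block = (row_groups[i], col_groups[j])
            size[block] = size.get(block, 0) + 1
            ones[block] = ones.get(block, 0) + value
    return sum(size[b] * binary_entropy(ones[b] / size[b]) for b in size)


M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0]))  # one big block, half full
print(code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1]))  # four homogeneous blocks
```

With a single group the half-full matrix costs one bit per cell; with the 2 x 2 grouping every block is all-0 or all-1 and the code cost drops to zero, which is why homogeneous blocks compress better.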

62 Algorithm: the number of column groups and row groups is increased in alternation: k=1, l=2 -> k=2, l=2 -> k=2, l=3 -> k=3, l=3 -> k=3, l=4 -> k=4, l=4 -> k=4, l=5 -> final result: k = 5 row groups, l = 5 col groups

63 Experiments CLASSIC Documents 3,893 documents 4,303 words 176,347 dots Words Combination of 3 sources: MEDLINE (medical) CISI (info. retrieval) CRANFIELD (aerodynamics) Lipari 2010 (C) 2010, C. Faloutsos 63

64 Experiments Documents Words CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 64

65 Experiments insipidus, alveolar, aortic, death, prognosis, intravenous blood, disease, clinical, cell, tissue, patient MEDLINE (medical) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 65

66 Experiments providing, studying, records, development, students, rules abstract, notation, works, construct, bibliographies MEDLINE (medical) CISI (Information Retrieval) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 66

67 Experiments shape, nasa, leading, assumed, thin MEDLINE (medical) CISI (Information Retrieval) CRANFIELD (aerodynamics) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 67

68 Experiments paint, examination, fall, raise, leave, based MEDLINE (medical) CISI (Information Retrieval) CRANFIELD (aerodynamics) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 68

69 Algorithm Code for cross-associations (matlab): CrossAssociations tgz! Variations and extensions: Autopart [Chakrabarti, PKDD 04] Lipari 2010 (C) 2010, C. Faloutsos 69

70 Algorithm Hadoop implementation [ICDM 08] Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map- Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008: Lipari 2010 (C) 2010, C. Faloutsos 70

71 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Observations Lipari 2010 (C) 2010, C. Faloutsos 71

72 Observation #1 Skewed degree distributions there are nodes with huge degree (>O(10^4), in facebook/linkedin popularity contests!) Lipari 2010 (C) 2010, C. Faloutsos 72

73 Observation #2: maybe there are no good cuts: "jellyfish" shape [Tauro+ 01], [Siganos+, 06]; strange behavior of cuts [Chakrabarti+ 04], [Leskovec+, 08]

75 Jellyfish model [Tauro+]. A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001. Jellyfish: A Conceptual Model for the AS Internet Topology, G. Siganos, Sudhir L Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, pp , Sept

76 Strange behavior of min cuts: "negative dimensionality" (!). NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy. Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.

77 Min-cut plot: do min-cuts recursively, and plot log(mincut-size / #edges) versus log(#edges). Example: for a 2-d grid with N nodes, the mincut size is sqrt(N), so the slope is -0.5

80 Min-cut plot: for a d-dimensional grid, the slope is -1/d; for a random graph, the slope is 0. [Figure: two panels, log(mincut-size / #edges) versus log(#edges)]
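The -1/d slope for grids can be checked with a short back-of-the-envelope derivation. It assumes a d-dimensional grid with side length s (so n = s^d nodes), a bisecting cut through the middle, and roughly d edges per node; constants are ignored.

```latex
% d-dimensional grid, side length s, n = s^d nodes
\text{mincut} \approx s^{d-1} = n^{(d-1)/d},
\qquad
\#\text{edges} \approx d\,n

\frac{\text{mincut}}{\#\text{edges}} \approx \frac{1}{d}\, n^{-1/d}
\;\Longrightarrow\;
\log\frac{\text{mincut}}{\#\text{edges}}
\approx -\frac{1}{d}\,\log(\#\text{edges}) + \text{const}
```

For d = 2 this gives a mincut of sqrt(n) and a slope of -0.5, matching the 2-d grid example.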

81 Min-cut plot What does it look like for a real-world graph? log (mincut-size / #edges)? log (# edges) Lipari 2010 (C) 2010, C. Faloutsos 81

82 Experiments: datasets. Google Web Graph: 916,428 nodes and 5,105,039 edges. Lucent Router Graph: undirected graph of network routers, 112,969 nodes and 181,639 edges. User Website Clickstream Graph: 222,704 nodes and 952,580 edges. Source: NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM Workshop on Link Analysis, Counter-terrorism and Privacy

83 Experiments. Used the METIS algorithm [Karypis, Kumar, 1995]. Google Web graph: plot of log(mincut-size / #edges) versus log(#edges); values along the y-axis are averaged. We observe a "lip" for large numbers of edges. The slope of about -0.4 corresponds to a 2.5-dimensional grid!

84 Google graph log (mincut-size / #edges) Log(#edges) All min-cuts Log(#edges) averaged Lipari 2010 (C) 2010, C. Faloutsos 84

85 Experiments Same results for other graphs too Lucent Router graph Clickstream graph Lipari 2010 (C) 2010, C. Faloutsos 85

86 Conclusions Practitioner s guide Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Observations METIS Co-clustering Cross-associations jellyfish : Maybe, there are no good cuts Lipari 2010 (C) 2010, C. Faloutsos 86

87 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 87

88 Detailed outline: Epidemic threshold (problem definition, analysis, experiments); Fraud detection in e-bay

89 Virus propagation How do viruses/rumors propagate? Blog influence? Will a flu-like virus linger, or will it become extinct soon? Lipari 2010 (C) 2010, C. Faloutsos 89

90 The model: SIS. "Flu"-like: Susceptible-Infected-Susceptible. An infected node attacks each neighbor with prob. β and heals with prob. δ. Virus strength s = β/δ. [Figure: node N with neighbors N1, N2, N3, some healthy and some infected]

91 Epidemic threshold τ of a graph: the value of τ, such that if strength s = β / δ < τ an epidemic can not happen Thus, given a graph compute its epidemic threshold Lipari 2010 (C) 2010, C. Faloutsos 91

92 Epidemic threshold τ What should τ depend on? avg. degree? and/or highest degree? and/or variance of degree? and/or third moment of degree? and/or diameter? Lipari 2010 (C) 2010, C. Faloutsos 92

93 Epidemic threshold [Theorem 1]: we have no epidemic (*) if $\beta/\delta < \tau = 1/\lambda_{1,A}$, where β is the attack prob., δ is the recovery prob., τ is the epidemic threshold, and $\lambda_{1,A}$ is the largest eigenvalue of the adjacency matrix A. Proof: [Wang+03]. (*) under mild, conditional-independence assumptions
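Computing the threshold of a given graph then reduces to finding the largest eigenvalue of its adjacency matrix. The sketch below is one simple way to do that, under assumptions not from the talk: the graph is undirected and connected, it is stored as adjacency lists, and the function names `largest_eigenvalue` and `epidemic_threshold` are hypothetical. Power iteration is run on A + I rather than A so that it also converges on bipartite graphs such as stars.

```python
import math


def largest_eigenvalue(adj, iters=1000):
    """Largest eigenvalue of the adjacency matrix of an undirected,
    connected graph given as adjacency lists (adj[i] = neighbors of i)."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x for the unit vector x.
    ax = [sum(x[j] for j in adj[i]) for i in range(n)]
    return sum(x[i] * ax[i] for i in range(n))


def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)


# Star graph: center 0 with d = 4 leaves, so lambda_1 = sqrt(4) = 2.
star = [[1, 2, 3, 4], [0], [0], [0], [0]]
print(epidemic_threshold(star))
```

For the star with d = 4 leaves the largest eigenvalue is sqrt(d) = 2, so a virus with strength β/δ below 1/2 should die out.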

95 Beginning of proof. Healthy at t+1: (healthy or healed at t) and not attacked at t. Let p(i, t) = Prob(node i is sick at time t). Then $1 - p(i, t+1) = \big(1 - p(i, t) + \delta\, p(i, t)\big) \prod_j \big(1 - \beta\, a_{ji}\, p(j, t)\big)$. We are below the threshold if the above non-linear dynamical system is stable (eigenvalue of Hessian < 1)
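The recursion can be iterated numerically to see both regimes. This sketch assumes the update is 1 - p(i, t+1) = (1 - p(i, t) + δ p(i, t)) Π_j (1 - β a_ji p(j, t)) on an undirected graph stored as adjacency lists; the function names `sis_step` and `simulate`, the initial probability `p0`, and the parameter values are illustrative assumptions.

```python
def sis_step(p, adj, beta, delta):
    """One step of the SIS probability recursion.

    p[i] is the probability that node i is infected at time t; adj[i]
    lists the neighbors of i (a_ji = 1 exactly for j in adj[i]).
    """
    new_p = []
    for i in range(len(p)):
        not_attacked = 1.0
        for j in adj[i]:
            not_attacked *= 1.0 - beta * p[j]
        # Healthy at t+1: (healthy, or infected and healed) and not attacked.
        healthy_next = (1.0 - p[i] + delta * p[i]) * not_attacked
        new_p.append(1.0 - healthy_next)
    return new_p


def simulate(adj, beta, delta, steps, p0=0.5):
    """Iterate the recursion from p(i, 0) = p0 and return the final p."""
    p = [p0] * len(adj)
    for _ in range(steps):
        p = sis_step(p, adj, beta, delta)
    return p


# Star with 4 leaves: lambda_1 = 2, so the threshold is tau = 0.5.
star = [[1, 2, 3, 4], [0], [0], [0], [0]]
print(sum(simulate(star, beta=0.1, delta=0.5, steps=200)))  # strength 0.2 < tau
print(sum(simulate(star, beta=0.4, delta=0.2, steps=200)))  # strength 2.0 > tau
```

Below the threshold the expected number of infected nodes shrinks towards zero; above it the iteration settles at a clearly positive level.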

96 Epidemic threshold for various networks. The formula includes older results as special cases. Homogeneous networks [Kephart+White]: $\lambda_{1,A} = \langle k \rangle$; $\tau = 1/\langle k \rangle$ ($\langle k \rangle$: avg degree). Star networks (d = degree of center): $\lambda_{1,A} = \sqrt{d}$; $\tau = 1/\sqrt{d}$. Infinite power-law networks: $\lambda_{1,A} = \infty$; $\tau = 0$ [Barabasi]

97 Epidemic threshold [Theorem 2] Below the epidemic threshold, the epidemic dies out exponentially Lipari 2010 (C) 2010, C. Faloutsos 97

98 Epidemic threshold Problem definition Analysis Experiments Detailed outline Lipari 2010 (C) 2010, C. Faloutsos 98

99 Experiments (Oregon) β/δ > τ (above threshold) β/δ = τ (at the threshold) β/δ < τ (below threshold) Lipari 2010 (C) 2010, C. Faloutsos 99

100 SIS simulation: # infected nodes vs time, for β/δ above, at, and below the threshold. [Figure, log-lin: #inf. (log scale) versus time (linear scale); annotation: exponential decay]

102 SIS simulation: # infected nodes vs time, for β/δ above, at, and below the threshold. [Figure, log-log: #inf. (log scale) versus time (log scale); annotation: power-law decay (!)]

104 Conclusions. $\lambda_{1,A}$: the largest eigenvalue of the adjacency matrix determines the survival of a flu-like virus. It gives a measure of how well connected the graph is (~ # paths). May guide immunization policies

105 References D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos, Epidemic Thresholds in Real Networks, in ACM TISSEC, 10(4), 2008 Ganesh, A., Massoulie, L., and Towsley, D., The effect of network topology on the spread of epidemics. In INFOCOM. Lipari 2010 (C) 2010, C. Faloutsos 105

106 References (cont d) Hethcote, H. W The mathematics of infectious diseases. SIAM Review 42, Hethcote, H. W. AND Yorke, J. A Gonorrhea Transmission Dynamics and Control. Vol. 56. Springer. Lecture Notes in Biomathematics. Lipari 2010 (C) 2010, C. Faloutsos 106

107 References (cont d) Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, in SRDS 2003 (pages 25-34), Florence, Italy Lipari 2010 (C) 2010, C. Faloutsos 107

108 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 108

109 Scalability. How about if the graph/tensor does not fit in core? ["MET": Kolda, Sun, ICDM 08, best paper award] How about handling huge graphs?

111 Scalability. Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro 2003]. Yahoo: 5Pb of data [Fayyad, KDD 07]. Problem: machine failures, on a daily basis. How to parallelize data mining tasks, then? A: map/reduce; hadoop (open-source clone)

113 2' intro to hadoop. Master-slave architecture; n-way replication (default n=3). "group by" of SQL (in a parallel, fault-tolerant way). E.g., find the histogram of word frequency: compute local histograms (map), then merge into a global histogram (reduce). select course-id, count(*) from ENROLLMENT group by course-id
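The word-histogram example can be mimicked in plain Python to show the shape of the two phases. This sketch runs sequentially in one process and does not use Hadoop; the function names `map_phase`, `reduce_phase` and `word_histogram` are hypothetical, and each string in `splits` stands in for one input split.

```python
from collections import Counter
from functools import reduce


def map_phase(split):
    """Map: compute the local word histogram of one input split."""
    return Counter(split.split())


def reduce_phase(left, right):
    """Reduce: merge two histograms by adding the counts per word."""
    return left + right


def word_histogram(splits):
    """Apply the map to every split, then fold the local results together."""
    return reduce(reduce_phase, map(map_phase, splits), Counter())


splits = ["graph mining tools", "graph tensors", "mining graph"]
print(word_histogram(splits))
```

Because merging histograms is associative and commutative, the local results can be combined in any order, which is what lets a real cluster rerun a failed mapper without affecting the outcome.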

115 [Figure: MapReduce data flow. The user program forks the master and the workers; the master assigns map and reduce tasks; mappers read the input data splits (Split 0, Split 1, Split 2) from HDFS and write locally; reducers do a remote read, sort, and write Output File 0 and Output File 1.] By default: 3-way replication; late/dead machines: ignored, transparently (!)

116 D.I.S.C. Data Intensive Scientific Computing [R. Bryant, CMU] big data Lipari 2010 (C) 2010, C. Faloutsos P8-116

117 Analysis of a large graph ~200Gb (Yahoo crawl) - Degree Distribution: in 12 minutes with 50 machines Many (link spams?) at out-degree 1200 Lipari 2010 (C) 2010, C. Faloutsos P8-117

118 Outline Algorithms & results Centralized Hadoop/ PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles Visualization DONE STARTED Lipari 2010 (C) 2010, C. Faloutsos 118

119 HADI for diameter estimation Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM 10 Naively: diameter needs O(N**2) space and up to O(N**3) time prohibitive (N~1B) Our HADI: linear on E (~10B) Near-linear scalability wrt # machines Several optimizations -> 5x faster Lipari 2010 (C) 2010, C. Faloutsos 119

120 [Figure: radius plot (count versus radius) of the YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges), the largest publicly available graph ever studied. Annotations: 14 (dir.), ~7 (undir.), 19+? [Barabasi+]]

122 YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges): effective diameter: surprisingly small. Multi-modality: probably mixture of cores.

124 Radius Plot of GCC of YahooWeb. Lipari 2010 (C) 2010, C. Faloutsos 124

125 Details: running time on Kronecker and Erdos-Renyi graphs with billions of edges.

126 Outline Algorithms & results Centralized Hadoop/ PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles Visualization DONE STARTED Lipari 2010 (C) 2010, C. Faloutsos 126

127 Generalized Iterated Matrix Vector Multiplication (GIMV) PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up). Lipari 2010 (C) 2010, C. Faloutsos 127

128 Generalized Iterated Matrix-Vector Multiplication (GIM-V), details: PageRank, proximity (RWR), diameter, connected components (also eigenvectors, belief propagation, ...) are all instances of iterated matrix-vector multiplication
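To illustrate how connected components fit the iterated matrix-vector pattern, here is a single-machine sketch. The three operator names (`combine2`, `combine_all`, `assign`) and the function names are assumptions for illustration, and nothing here runs on Hadoop. Each node starts with its own id as label; one generalized multiplication replaces every label by the minimum over the node and its neighbors, and the loop stops at a fixed point.

```python
def gimv_step(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector product.

    edges: list of (i, j, m_ij) nonzero matrix entries.
    combine2(m_ij, v_j) plays the role of the multiplication,
    combine_all(list) plays the role of the row sum,
    assign(v_i, combined) decides the new value of v_i.
    """
    partial = [[] for _ in v]
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) if partial[i] else v[i]
            for i in range(len(v))]


def connected_components(n, undirected_edges):
    """Label each node 0..n-1 with the smallest node id in its component."""
    edges = [(u, w, 1) for u, w in undirected_edges]
    edges += [(w, u, 1) for u, w in undirected_edges]
    labels = list(range(n))
    while True:
        new_labels = gimv_step(edges, labels,
                               combine2=lambda m_ij, v_j: v_j,
                               combine_all=min,
                               assign=min)
        if new_labels == labels:
            return labels
        labels = new_labels


print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```

Swapping in multiplication, summation and a damping step for the three operators would turn the same loop into a PageRank-style iteration, which is the point of the generalization.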

129 Example: GIM-V At Work Connected Components Count Lipari 2010 Size (C) 2010, C. Faloutsos 129

130 Example: GIM-V At Work Connected Components Count ~0.7B singleton nodes Lipari 2010 Size (C) 2010, C. Faloutsos 130

131 Example: GIM-V At Work Connected Components Count Lipari 2010 Size (C) 2010, C. Faloutsos 131

132 Example: GIM-V At Work Connected Components Count 300-size cmpt X size cmpt Why? X 65. Why? Lipari 2010 Size (C) 2010, C. Faloutsos 132

133 Example: GIM-V At Work Connected Components Count suspicious financial-advice sites (not existing now) Lipari 2010 Size (C) 2010, C. Faloutsos 133

134 GIM-V At Work Connected Components over Time LinkedIn: 7.5M nodes and 58M edges Stable tail slope after the gelling point Lipari 2010 (C) 2010, C. Faloutsos 134

135 Conclusions Hadoop: promising architecture for Tera/ Peta scale graph mining Resources: Higher-level language for data processing Lipari 2010 (C) 2010, C. Faloutsos P8-135

136 References Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD 2008: Lipari 2010 (C) 2010, C. Faloutsos P8-136

137 Overall Conclusions Real graphs exhibit surprising patterns (power laws, shrinking diameter etc) SVD: a powerful tool (HITS, PageRank) Several other tools: tensors, METIS, Scalability: hadoop/parallelism Lipari 2010 (C) 2010, C. Faloutsos 137

138 Our goal: Open source system for mining huge graphs: PEGASUS project (PEta GrAph mining System) code and papers Lipari 2010 (C) 2010, C. Faloutsos 138

139 Project info. Chau, Polo; McGlohon, Mary; Tsourakakis, Babis; Akoglu, Leman; Kang, U; Prakash, Aditya; Tong, Hanghang. Thanks to: NSF IIS , IIS , CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, INTEL, HP

140 Extra material E-bay fraud detection Outlier detection Lipari 2010 (C) 2010, C. Faloutsos 140

141 Detailed outline Fraud detection in e-bay Anomaly detection Lipari 2010 (C) 2010, C. Faloutsos 141

142 E-bay fraud detection, w/ Polo Chau & Shashank Pandit, CMU. NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos (WWW'07), pp

143 E-bay Fraud detection lines: positive feedbacks would you buy from him/her? Lipari 2010 (C) 2010, C. Faloutsos 143

144 E-bay Fraud detection lines: positive feedbacks would you buy from him/her? or him/her? Lipari 2010 (C) 2010, C. Faloutsos 144

145 E-bay Fraud detection - NetProbe Belief Propagation gives: Lipari 2010 (C) 2010, C. Faloutsos 145

146 Extra material E-bay fraud detection Outlier detection Lipari 2010 (C) 2010, C. Faloutsos 146

147 OddBall: Spotting Anomalies in Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science PAKDD 2010, Hyderabad, India

148 Main idea: for each node, extract the ego-net (= 1-step-away neighbors); extract features (#edges, total weight, etc.); compare with the rest of the population

149 What is an egonet? ego egonet Lipari 2010 (C) 2010, C. Faloutsos 149

150 Selected features. $N_i$: number of neighbors (degree) of ego i. $E_i$: number of edges in egonet i. $W_i$: total weight of egonet i. $\lambda_{w,i}$: principal eigenvalue of the weighted adjacency matrix of egonet i
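The first three features are straightforward to extract. The sketch below assumes an undirected weighted graph stored as a dict of dicts with integer node ids; the function name `egonet_features` is hypothetical and this is not the OddBall code. The principal eigenvalue is left out to keep the example short.

```python
def egonet_features(adj, ego):
    """Return (N_i, E_i, W_i) for the egonet of `ego`.

    adj[u][v] is the weight of the undirected edge u-v (stored in both
    directions). The egonet contains the ego, its neighbors, and all
    edges among them.
    """
    members = set(adj[ego]) | {ego}
    n_i = len(adj[ego])
    e_i = 0
    w_i = 0.0
    for u in members:
        for v, weight in adj[u].items():
            if v in members and u < v:   # count each undirected edge once
                e_i += 1
                w_i += weight
    return n_i, e_i, w_i


# Triangle 0-1-2 plus a pendant edge 2-3.
adj = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 2.5},
    2: {0: 1.0, 1: 2.5, 3: 1.0},
    3: {2: 1.0},
}
print(egonet_features(adj, 0))
print(egonet_features(adj, 3))
```

Comparing E_i against N_i across all nodes is what separates near-cliques (many edges for their size) from near-stars (few edges for their size).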

151 Near-Clique/Star Lipari 2010 (C) 2010, C. Faloutsos 151


153 END Arrivederci Lipari 2010 (C) 2010, C. Faloutsos 153


More information

Data mining in large graphs

Data mining in large graphs Data mining in large graphs Christos Faloutsos University www.cs.cmu.edu/~christos ALLADIN 2003 C. Faloutsos 1 Outline Introduction - motivation Patterns & Power laws Scalability & Fast algorithms Fractals,

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 2. Basic Linear Algebra continued Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki

More information

Anomaly Detection. Davide Mottin, Konstantina Lazaridou. HassoPlattner Institute. Graph Mining course Winter Semester 2016

Anomaly Detection. Davide Mottin, Konstantina Lazaridou. HassoPlattner Institute. Graph Mining course Winter Semester 2016 Anomaly Detection Davide Mottin, Konstantina Lazaridou HassoPlattner Institute Graph Mining course Winter Semester 2016 and Next week February 7, third round presentations Slides are due by February 6,

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 Web pages are not equally important www.joe-schmoe.com

More information

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity CS322 Project Writeup Semih Salihoglu Stanford University 353 Serra Street Stanford, CA semih@stanford.edu

More information

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to ge

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to ge Graph Mining: Laws, Generators and Tools Christos Faloutsos CMU Thank you! Sharad Mehrotra UCI 2007 C. Faloutsos 2 Outline Problem definition / Motivation Static & dynamic laws; generators Tools: CenterPiece

More information

Toward Understanding Spatial Dependence on Epidemic Thresholds in Networks

Toward Understanding Spatial Dependence on Epidemic Thresholds in Networks Toward Understanding Spatial Dependence on Epidemic Thresholds in Networks Zesheng Chen Department of Computer Science Indiana University - Purdue University Fort Wayne, Indiana 4685 Email: chenz@ipfw.edu

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering

More information

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/30/17 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu What is the structure of the Web? How is it organized? 2/7/2011 Jure Leskovec, Stanford C246: Mining Massive

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

CS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #8: Epidemics: Thresholds

CS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #8: Epidemics: Thresholds CS 6604: Data Mining Large Networks and Time-series B. Aditya Prakash Lecture #8: Epidemics: Thresholds A fundamental ques@on Strong Virus Epidemic? 2 example (sta@c graph) Weak Virus Epidemic? 3 Problem

More information

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS LINK ANALYSIS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link analysis Models

More information

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search 6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search Daron Acemoglu and Asu Ozdaglar MIT September 30, 2009 1 Networks: Lecture 7 Outline Navigation (or decentralized search)

More information

CMU SCS Mining Large Graphs: Patterns, Anomalies, and Fraud Detection

CMU SCS Mining Large Graphs: Patterns, Anomalies, and Fraud Detection Mining Large Graphs: Patterns, Anomalies, and Fraud Detection Christos Faloutsos CMU Thank you! Prof. Julian McAuley Nicholas Urioste UCSD'16 (c) 2016, C. Faloutsos 2 Roadmap Introduction Motivation Why

More information

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information

More information

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University Announcements: Please fill HW Survey Weekend Office Hours starting this weekend (Hangout only) Proposal: Can use 1 late period CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize/navigate it? First try: Human curated Web directories Yahoo, DMOZ, LookSmart

More information

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University. Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org #1: C4.5 Decision Tree - Classification (61 votes) #2: K-Means - Clustering

More information

Analytically tractable processes on networks

Analytically tractable processes on networks University of California San Diego CERTH, 25 May 2011 Outline Motivation 1 Motivation Networks Random walk and Consensus Epidemic models Spreading processes on networks 2 Networks Motivation Networks Random

More information

CS224W: Methods of Parallelized Kronecker Graph Generation

CS224W: Methods of Parallelized Kronecker Graph Generation CS224W: Methods of Parallelized Kronecker Graph Generation Sean Choi, Group 35 December 10th, 2012 1 Introduction The question of generating realistic graphs has always been a topic of huge interests.

More information

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 16, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Large Graph Mining: Power Tools and a Practitioner s guide

Large Graph Mining: Power Tools and a Practitioner s guide Large Graph Mining: Power Tools and a Practitioner s guide Task 3: Recommendations & proximity Faloutsos, Miller & Tsourakakis CMU KDD'09 Faloutsos, Miller, Tsourakakis P3-1 Outline Introduction Motivation

More information

Talk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU

Talk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU Talk 2: Graph Mining Tools - SVD, ranking, proximity Christos Faloutsos CMU Outline Introduction Motivation Task 1: Node importance Task 2: Recommendations Task 3: Connection sub-graphs Conclusions Lipari

More information

Link Analysis Ranking

Link Analysis Ranking Link Analysis Ranking How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it? Naïve ranking of query results Given query

More information

Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach

Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach Proceedings 1st International Workshop on Advanced Analytics and Learning on Temporal Data AALTD 2015 Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach Anna

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu Task: Find coalitions in signed networks Incentives: European

More information

High Performance Parallel Tucker Decomposition of Sparse Tensors

High Performance Parallel Tucker Decomposition of Sparse Tensors High Performance Parallel Tucker Decomposition of Sparse Tensors Oguz Kaya INRIA and LIP, ENS Lyon, France SIAM PP 16, April 14, 2016, Paris, France Joint work with: Bora Uçar, CNRS and LIP, ENS Lyon,

More information

A new centrality measure for probabilistic diffusion in network

A new centrality measure for probabilistic diffusion in network ACSIJ Advances in Computer Science: an International Journal, Vol. 3, Issue 5, No., September 204 ISSN : 2322-557 A new centrality measure for probabilistic diffusion in network Kiyotaka Ide, Akira Namatame,

More information

Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

More information

RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs

RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs Submitted for Blind Review Abstract How do real, weighted graphs change over time? What patterns, if any, do they obey? Earlier studies

More information

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities Miguel Araujo CMU/University of Porto maraujo@cs.cmu.edu Christos Faloutsos* Carnegie Mellon University christos@cs.cmu.edu Spiros Papadimitriou

More information

Rainfall data analysis and storm prediction system

Rainfall data analysis and storm prediction system Rainfall data analysis and storm prediction system SHABARIRAM, M. E. Available from Sheffield Hallam University Research Archive (SHURA) at: http://shura.shu.ac.uk/15778/ This document is the author deposited

More information

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds MAE 298, Lecture 8 Feb 4, 2008 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in a file-sharing

More information

Graph Analysis Using Map/Reduce

Graph Analysis Using Map/Reduce Seminar: Massive-Scale Graph Analysis Summer Semester 2015 Graph Analysis Using Map/Reduce Laurent Linden s9lalind@stud.uni-saarland.de May 14, 2015 Introduction Peta-Scale Graph + Graph mining + MapReduce

More information

Lecture VI Introduction to complex networks. Santo Fortunato

Lecture VI Introduction to complex networks. Santo Fortunato Lecture VI Introduction to complex networks Santo Fortunato Plan of the course I. Networks: definitions, characteristics, basic concepts in graph theory II. III. IV. Real world networks: basic properties

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data

Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data Arvind Ramanathan ramanathana@ornl.gov Computational Data Analytics Group, Computer Science & Engineering

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Graph and Network Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Methods Learnt Classification Clustering Vector Data Text Data Recommender System Decision Tree; Naïve

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

CMU SCS : Multimedia Databases and Data Mining CMU SCS Must-read Material CMU SCS Must-read Material (cont d)

CMU SCS : Multimedia Databases and Data Mining CMU SCS Must-read Material CMU SCS Must-read Material (cont d) 15-826: Multimedia Databases and Data Mining Lecture #26: Graph mining - patterns Christos Faloutsos Must-read Material Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, On Power-Law Relationships

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Self Similar (Scale Free, Power Law) Networks (I)

Self Similar (Scale Free, Power Law) Networks (I) Self Similar (Scale Free, Power Law) Networks (I) E6083: lecture 4 Prof. Predrag R. Jelenković Dept. of Electrical Engineering Columbia University, NY 10027, USA {predrag}@ee.columbia.edu February 7, 2007

More information

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities Miguel Araujo 1,5, Spiros Papadimitriou 2, Stephan Günnemann 1, Christos Faloutsos 1, Prithwish Basu 3, Ananthram Swami 4, Evangelos E.

More information

HEigen: Spectral Analysis for Billion-Scale Graphs

HEigen: Spectral Analysis for Billion-Scale Graphs IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, MONTH 2012 1 HEigen: Spectral Analysis for Billion-Scale Graphs U Kang, Brendan Meeder, Evangelos E. Papalexakis, and Christos Faloutsos

More information

Fitting a Tensor Decomposition is a Nonlinear Optimization Problem

Fitting a Tensor Decomposition is a Nonlinear Optimization Problem Fitting a Tensor Decomposition is a Nonlinear Optimization Problem Evrim Acar, Daniel M. Dunlavy, and Tamara G. Kolda* Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia

More information

Problem (INFORMAL). Given a dynamic graph, find a set of possibly overlapping temporal subgraphs to concisely describe the given dynamic graph in a

Problem (INFORMAL). Given a dynamic graph, find a set of possibly overlapping temporal subgraphs to concisely describe the given dynamic graph in a Outlines TimeCrunch: Interpretable Dynamic Graph Summarization by Neil Shah et. al. (KDD 2015) From Micro to Macro: Uncovering and Predicting Information Cascading Process with Behavioral Dynamics by Linyun

More information

CS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce

CS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 11: MapReduce Motivation Distribution makes simple computations complex Communication Load balancing Fault tolerance Not all applications

More information

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search Xutao Li 1 Michael Ng 2 Yunming Ye 1 1 Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology,

More information

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Link Analysis Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 The Web as a Directed Graph Page A Anchor hyperlink Page B Assumption 1: A hyperlink between pages

More information

Big Tensor Data Reduction

Big Tensor Data Reduction Big Tensor Data Reduction Nikos Sidiropoulos Dept. ECE University of Minnesota NSF/ECCS Big Data, 3/21/2013 Nikos Sidiropoulos Dept. ECE University of Minnesota () Big Tensor Data Reduction NSF/ECCS Big

More information

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor

More information

Using R for Iterative and Incremental Processing

Using R for Iterative and Incremental Processing Using R for Iterative and Incremental Processing Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber UC Berkeley and HP Labs UC BERKELEY Big Data, Complex Algorithms PageRank (Dominant

More information

CS60021: Scalable Data Mining. Dimensionality Reduction

CS60021: Scalable Data Mining. Dimensionality Reduction J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a

More information

Outline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C.

Outline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C. Outline 15-826: Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos Goal: Find similar / interesting things Intro to DB Indexing - similarity search Points Text Time sequences; images

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices Communities Via Laplacian Matrices Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices The Laplacian Approach As with betweenness approach, we want to divide a social graph into

More information

1 Searching the World Wide Web

1 Searching the World Wide Web Hubs and Authorities in a Hyperlinked Environment 1 Searching the World Wide Web Because diverse users each modify the link structure of the WWW within a relatively small scope by creating web-pages on

More information

Compressing Tabular Data via Pairwise Dependencies

Compressing Tabular Data via Pairwise Dependencies Compressing Tabular Data via Pairwise Dependencies Amir Ingber, Yahoo! Research TCE Conference, June 22, 2017 Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford) Huge datasets: everywhere - Internet

More information

Lecture 7 Mathematics behind Internet Search

Lecture 7 Mathematics behind Internet Search CCST907 Hidden Order in Daily Life: A Mathematical Perspective Lecture 7 Mathematics behind Internet Search Dr. S. P. Yung (907A) Dr. Z. Hua (907B) Department of Mathematics, HKU Outline Google is the

More information

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia Network Feature s Decompositions for Machine Learning 1 1 Department of Computer Science University of British Columbia UBC Machine Learning Group, June 15 2016 1/30 Contact information Network Feature

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University TheFind.com Large set of products (~6GB compressed) For each product A=ributes Related products Craigslist About 3 weeks of data

More information

Tensor Analysis. Topics in Data Mining Fall Bruno Ribeiro

Tensor Analysis. Topics in Data Mining Fall Bruno Ribeiro Tensor Analysis Topics in Data Mining Fall 2015 Bruno Ribeiro Tensor Basics But First 2 Mining Matrices 3 Singular Value Decomposition (SVD) } X(i,j) = value of user i for property j i 2 j 5 X(Alice, cholesterol)

More information

Lecture: Local Spectral Methods (1 of 4)

Lecture: Local Spectral Methods (1 of 4) Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Lecture 12: Link Analysis for Web Retrieval

Lecture 12: Link Analysis for Web Retrieval Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities

More information

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation Graphs and Networks Page 1 Lecture 2, Ranking 1 Tuesday, September 12, 2006 1:14 PM I. II. I. How search engines work: a. Crawl the web, creating a database b. Answer query somehow, e.g. grep. (ex. Funk

More information

to be more efficient on enormous scale, in a stream, or in distributed settings.

to be more efficient on enormous scale, in a stream, or in distributed settings. 16 Matrix Sketching The singular value decomposition (SVD) can be interpreted as finding the most dominant directions in an (n d) matrix A (or n points in R d ). Typically n > d. It is typically easy to

More information

A New Space for Comparing Graphs

A New Space for Comparing Graphs A New Space for Comparing Graphs Anshumali Shrivastava and Ping Li Cornell University and Rutgers University August 18th 2014 Anshumali Shrivastava and Ping Li ASONAM 2014 August 18th 2014 1 / 38 Main

More information

Algebraic Representation of Networks

Algebraic Representation of Networks Algebraic Representation of Networks 0 1 2 1 1 0 0 1 2 0 0 1 1 1 1 1 Hiroki Sayama sayama@binghamton.edu Describing networks with matrices (1) Adjacency matrix A matrix with rows and columns labeled by

More information

Spectral Graph Theory Tools. Analysis of Complex Networks

Spectral Graph Theory Tools. Analysis of Complex Networks Spectral Graph Theory Tools for the Department of Mathematics and Computer Science Emory University Atlanta, GA 30322, USA Acknowledgments Christine Klymko (Emory) Ernesto Estrada (Strathclyde, UK) Support:

More information

Epidemic Thresholds in Real Networks

Epidemic Thresholds in Real Networks IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING 1 Epidemic Thresholds in Real Networks Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jurij Leskovec and Christos Faloutsos deepay@cs.cmu.edu, yangwang@andrew.cmu.edu,

More information

Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies

Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies July 12, 212 Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies Morteza Mardani Dept. of ECE, University of Minnesota, Minneapolis, MN 55455 Acknowledgments:

More information
