CMU SCS Talk 3: Graph Mining Tools Tensors, communities, parallelism
|
|
- Julius Rogers
- 6 years ago
- Views:
Transcription
1 Talk 3: Graph Mining Tools Tensors, communities, parallelism Christos Faloutsos CMU
2 Overall Outline Introduction Motivation Talk#1: Patterns in graphs; generators Talk#2: Tools (Ranking, proximity) Talk#3: Tools (Tensors, scalability) Conclusions Lipari 2010 (C) 2010, C. Faloutsos 2
3 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 3
4 Tamara Kolda (Sandia) Thanks to for the foils on tensor definitions, and on TOPHITS Lipari 2010 (C) 2010, C. Faloutsos 4
5 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining Lipari 2010 (C) 2010, C. Faloutsos 5
6 Examples of Matrices: Authors and terms John Peter Mary Nick... data mining classif. tree... Lipari 2010 (C) 2010, C. Faloutsos 6
7 Motivation: Why tensors? Q: what is a tensor? Lipari 2010 (C) 2010, C. Faloutsos 7
8 Motivation: Why tensors? A: N-D generalization of matrix: KDD 09 John Peter Mary Nick... data mining classif. tree... Lipari 2010 (C) 2010, C. Faloutsos 8
9 Motivation: Why tensors? A: N-D generalization of matrix: KDD 07 KDD 08 KDD 09 data mining classif. tree... John Peter Mary Nick... Lipari 2010 (C) 2010, C. Faloutsos 9
10 Tensors are useful for 3 or more modes Terminology: mode (or aspect ): Mode#3 data mining classif. tree... Mode#2 Mode (== aspect) #1 Lipari 2010 (C) 2010, C. Faloutsos 10
11 Notice 3 rd mode does not need to be time we can have more than 3 modes Dest. port IP source IP destination Lipari 2010 (C) 2010, C. Faloutsos 11
12 Notice 3 rd mode does not need to be time we can have more than 3 modes Eg, ffmri: x,y,z, time, person-id, task-id From DENLAB, Temple U. (Prof. V. Megalooikonomou +) Lipari 2010 (C) 2010, C. Faloutsos 12
13 Motivating Applications Why tensors are useful? web mining (TOPHITS) environmental sensors Intrusion detection (src, dst, time, dest-port) Social networks (src, dst, time, type-of-contact) face recognition etc Lipari 2010 (C) 2010, C. Faloutsos 13
14 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining Lipari 2010 (C) 2010, C. Faloutsos 14
15 Tensor basics Multi-mode extensions of SVD recall that: Lipari 2010 (C) 2010, C. Faloutsos 15
16 Reminder: SVD n n m A m Σ V T U Best rank-k approximation in L2 Lipari 2010 (C) 2010, C. Faloutsos 16
17 Reminder: SVD n σ 1 u 1 v 1 σ 2 u 2 v 2 m A + Best rank-k approximation in L2 Lipari 2010 (C) 2010, C. Faloutsos 17
18 Goal: extension to >=3 modes I x J x K I x R J x R ~ A R x R x R B = + + Lipari 2010 (C) 2010, C. Faloutsos 18
19 Main points: 2 major types of tensor decompositions: PARAFAC and Tucker both can be solved with ``alternating least squares (ALS) Lipari 2010 (C) 2010, C. Faloutsos 19
20 Specially Structured Tensors Tucker Tensor Kruskal Tensor Our Notation Our Notation core I x J x K I x R J x S I x J x K w 1 I x R w R J x R = U R x S x T V = = u 1 v 1 U + + V R x R x R u R v R Lipari 2010 (C) 2010, C. Faloutsos 20
21 Tucker Decomposition - intuition I x J x K I x R J x S ~ A R x S x T B author x keyword x conference A: author x author-group B: keyword x keyword-group C: conf. x conf-group G: how groups relate to each other Lipari 2010 (C) 2010, C. Faloutsos 21
22 Intuition behind core tensor 2-d case: co-clustering [Dhillon et al. Information-Theoretic Coclustering, KDD 03] Lipari 2010 (C) 2010, C. Faloutsos 22
23 n m eg, terms x documents m k k l l n Lipari 2010 (C) 2010, C. Faloutsos 23
24 med. doc cs doc med. terms term group x doc. group cs terms common terms term x term-group doc x doc group Lipari 2010 (C) 2010, C. Faloutsos 24
25 Two main tools PARAFAC Tucker Tensor tools - summary Both find row-, column-, tube-groups but in PARAFAC the three groups are identical ( To solve: Alternating Least Squares ) Lipari 2010 (C) 2010, C. Faloutsos 25
26 Detailed outline Motivation Definitions: PARAFAC and Tucker Case study: web mining Lipari 2010 (C) 2010, C. Faloutsos 26
27 Web graph mining How to order the importance of web pages? Kleinberg s algorithm HITS PageRank Tensor extension on HITS (TOPHITS) Lipari 2010 (C) 2010, C. Faloutsos 27
28 Kleinberg s Hubs and Authorities (the HITS method) Sparse adjacency matrix and its SVD: authority scores for 1 st topic authority scores for 2 nd topic to from hub scores for 1 st topic hub scores for 2 nd topic Lipari 2010 (C) 2010, C. Faloutsos 28 Kleinberg, JACM, 1999
29 HITS Authorities on Sample Data We started our crawl from and crawled 4700 pages, resulting in 560 cross-linked hosts. authority scores for 1 st topic authority scores for 2 nd topic to from hub scores Lipari 2010 hub scores (C) 2010, C. Faloutsos for 1 st topic 29 for 2 nd topic
30 Three-Dimensional View of the Web Observe that this tensor is very sparse! Lipari 2010 (C) 2010, C. Faloutsos 30 Kolda, Bader, Kenny, ICDM05
31 Three-Dimensional View of the Web Observe that this tensor is very sparse! Lipari 2010 (C) 2010, C. Faloutsos 31 Kolda, Bader, Kenny, ICDM05
32 Three-Dimensional View of the Web Observe that this tensor is very sparse! Lipari 2010 (C) 2010, C. Faloutsos 32 Kolda, Bader, Kenny, ICDM05
33 Topical HITS (TOPHITS) Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information. term scores for 1 st topic term scores for 2 nd topic to from hub scores for 1 st topic authority scores for 1 st topic authority scores for 2 nd topic hub scores for 2 nd topic Lipari 2010 (C) 2010, C. Faloutsos 33
34 Topical HITS (TOPHITS) Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information. term scores for 1 st topic term scores for 2 nd topic to from hub scores for 1 st topic authority scores for 1 st topic authority scores for 2 nd topic hub scores for 2 nd topic Lipari 2010 (C) 2010, C. Faloutsos 34
35 TOPHITS Terms & Authorities on Sample Data TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms. w k = # unique links using term k Tensor PARAFAC term scores for 1 st topic term scores for 2 nd topic to from authority scores for 1 st topic authority scores for 2 nd topic hub scores for 2 Lipari 2010 hub scores nd topic (C) 2010, C. Faloutsos for 1 35 st topic
36 Conclusions Real data are often in high dimensions with multiple aspects (modes) Tensors provide elegant theory and algorithms PARAFAC and Tucker: discover groups Lipari 2010 (C) 2010, C. Faloutsos 36
37 References T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, Pages , November Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on Highdimensional and Multi-aspect Streams, Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006 Lipari 2010 (C) 2010, C. Faloutsos 37
38 Resources See tutorial on tensors, KDD 07 (w/ Tamara Kolda and Jimeng Sun): Lipari 2010 (C) 2010, C. Faloutsos 38
39 Tensor tools - resources Toolbox: from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/tensortoolbox T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009 csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/tensorreview-preprint.pdf T. Kolda Lipari ICDE and J. Sun: Scalable Copyright: (C) Tensor 2010, Faloutsos, C. Faloutsos Tong Decomposition (2009) for Multi-Aspect 2-39 Data Mining (ICDM 2008)
40 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 40
41 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Observations Lipari 2010 (C) 2010, C. Faloutsos 41
42 Given a graph, and k Problem Break it into k (disjoint) communities Lipari 2010 (C) 2010, C. Faloutsos 42
43 Given a graph, and k Problem Break it into k (disjoint) communities k = 2 Lipari 2010 (C) 2010, C. Faloutsos 43
44 Solution #1: METIS Arguably, the best algorithm Open source, at and *many* related papers, at same url Main idea: coarsen the graph; partition; un-coarsen Lipari 2010 (C) 2010, C. Faloutsos 44
45 Solution #1: METIS G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, <and many extensions> Lipari 2010 (C) 2010, C. Faloutsos 45
46 Solution #2 (problem: hard clustering, k pieces) Spectral partitioning: Consider the 2 nd smallest eigenvector of the (normalized) Laplacian Lipari 2010 (C) 2010, C. Faloutsos 46
47 Many more ideas: Solutions #3, Clustering on the A 2 (square of adjacency matrix) [Zhou, Woodruff, PODS 04] Minimum cut / maximum flow [Flake+, KDD 00] Lipari 2010 (C) 2010, C. Faloutsos 47
48 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Soft clustering matrix decompositions Observations Lipari 2010 (C) 2010, C. Faloutsos 48
49 Problem definition Given a bi-partite graph, and k, l Divide it into k row groups and l row groups (Also applicable to uni-partite graph) Lipari 2010 (C) 2010, C. Faloutsos 49
50 Co-clustering Given data matrix and the number of row and column groups k and l Simultaneously Cluster rows into k disjoint groups Cluster columns into l disjoint groups Lipari 2010 (C) 2010, C. Faloutsos 50
51 Co-clustering Let X and Y be discrete random variables X and Y take values in {1, 2,, m} and {1, 2,, n} p(x, Y) denotes the joint probability distribution if not known, it is often estimated based on co-occurrence data Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key Obstacles in Clustering Contingency Tables High Dimensionality, Sparsity, Noise Need for robust and scalable algorithms Reference: 1. Dhillon et al. Information-Theoretic Co-clustering, KDD 03 Lipari 2010 Copyright: (C) 2010, Faloutsos, C. Faloutsos Tong (2009) 2-51
52 n m eg, terms x documents m k k l l n Lipari 2010 (C) 2010, C. Faloutsos 52
53 med. doc cs doc med. terms term group x doc. group cs terms common terms doc x doc group term x term-group Lipari 2010 (C) 2010, C. Faloutsos 53
54 Observations Co-clustering uses KL divergence, instead of L2 the middle matrix is not diagonal we saw that earlier in the Tucker tensor decomposition s/w at: Lipari 2010 (C) 2010, C. Faloutsos 54
55 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Soft clustering matrix decompositions Observations Lipari 2010 (C) 2010, C. Faloutsos 55
56 Problem with Information Theoretic Co-clustering Number of row and column groups must be specified Desiderata: Simultaneously discover row and column groups " Fully Automatic: No magic numbers Scalable to large graphs Lipari 2010 (C) 2010, C. Faloutsos 56
57 Cross-association Desiderata: Simultaneously discover row and column groups Fully Automatic: No magic numbers Scalable to large matrices Reference: 1. Chakrabarti et al. Fully Automatic Cross-Associations, KDD 04 Lipari 2010 (C) 2010, C. Faloutsos 57
58 What makes a cross-association good? Row groups versus Row groups Why is this better? Column groups Column groups Lipari 2010 (C) 2010, C. Faloutsos 58
59 What makes a cross-association good? Row groups versus Row groups Why is this better? Column groups Column groups simpler; easier to describe easier to compress! Lipari 2010 (C) 2010, C. Faloutsos 59
60 What makes a cross-association good? Problem definition: given an encoding scheme decide on the # of col. and row groups k and l and reorder rows and columns, to achieve best compression Lipari 2010 (C) 2010, C. Faloutsos 60
61 Main Idea Good Compression Better Clustering Total Encoding Cost = Σ i size i * H(x i ) + Cost of describing cross-associations Code Cost Description Cost Minimize the total cost (# bits) for lossless compression Lipari 2010 (C) 2010, C. Faloutsos 61
62 Algorithm l = 5 col groups k = 5 row groups k=1, l=2 k=2, l=2 k=2, l=3 k=3, l=3 k=3, l=4 k=4, l=4 k=4, l=5 Lipari 2010 (C) 2010, C. Faloutsos 62
63 Experiments CLASSIC Documents 3,893 documents 4,303 words 176,347 dots Words Combination of 3 sources: MEDLINE (medical) CISI (info. retrieval) CRANFIELD (aerodynamics) Lipari 2010 (C) 2010, C. Faloutsos 63
64 Experiments Documents Words CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 64
65 Experiments insipidus, alveolar, aortic, death, prognosis, intravenous blood, disease, clinical, cell, tissue, patient MEDLINE (medical) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 65
66 Experiments providing, studying, records, development, students, rules abstract, notation, works, construct, bibliographies MEDLINE (medical) CISI (Information Retrieval) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 66
67 Experiments shape, nasa, leading, assumed, thin MEDLINE (medical) CISI (Information Retrieval) CRANFIELD (aerodynamics) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 67
68 Experiments paint, examination, fall, raise, leave, based MEDLINE (medical) CISI (Information Retrieval) CRANFIELD (aerodynamics) CLASSIC graph of documents & words: k=15, l=19 Lipari 2010 (C) 2010, C. Faloutsos 68
69 Algorithm Code for cross-associations (matlab): CrossAssociations tgz! Variations and extensions: Autopart [Chakrabarti, PKDD 04] Lipari 2010 (C) 2010, C. Faloutsos 69
70 Algorithm Hadoop implementation [ICDM 08] Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map- Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008: Lipari 2010 (C) 2010, C. Faloutsos 70
71 Detailed outline Motivation Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Observations Lipari 2010 (C) 2010, C. Faloutsos 71
72 Observation #1 Skewed degree distributions there are nodes with huge degree (>O(10^4), in facebook/linkedin popularity contests!) Lipari 2010 (C) 2010, C. Faloutsos 72
73 Observation #2 Maybe there are no good cuts: ``jellyfish shape [Tauro+ 01], [Siganos+, 06], strange behavior of cuts [Chakrabarti+ 04], [Leskovec+, 08] Lipari 2010 (C) 2010, C. Faloutsos 73
74 Observation #2 Maybe there are no good cuts: ``jellyfish shape [Tauro+ 01], [Siganos+, 06], strange behavior of cuts [Chakrabarti+, 04], [Leskovec+, 08]?? Lipari 2010 (C) 2010, C. Faloutsos 74
75 Jellyfish model [Tauro+] A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001 Jellyfish: A Conceptual Model for the AS Internet Topology G. Siganos, Sudhir L Tauro, Lipari M. Faloutsos, 2010 J. of Communications (C) 2010, C. Faloutsos and Networks, Vol. 8, No. 3, 75 pp , Sept
76 Strange behavior of min cuts negative dimensionality (!) NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy Statistical Properties of Community Structure in Large Social and Information Lipari 2010 Networks, J. Leskovec, (C) 2010, C. K. Faloutsos Lang, A. Dasgupta, M. Mahoney. 76 WWW 2008.
77 Min-cut plot Do min-cuts recursively. log (mincut-size / #edges) Mincut size = sqrt(n) log (# edges) N nodes Lipari 2010 (C) 2010, C. Faloutsos 77
78 Min-cut plot Do min-cuts recursively. New min-cut log (mincut-size / #edges) log (# edges) N nodes Lipari 2010 (C) 2010, C. Faloutsos 78
79 Min-cut plot Do min-cuts recursively. New min-cut log (mincut-size / #edges) Slope = -0.5 N nodes log (# edges) For a d-dimensional grid, the slope is -1/d Lipari 2010 (C) 2010, C. Faloutsos 79
80 log (mincut-size / #edges) Min-cut plot log (mincut-size / #edges) Slope = -1/d log (# edges) For a d-dimensional grid, the slope is -1/d log (# edges) For a random graph, the slope is 0 Lipari 2010 (C) 2010, C. Faloutsos 80
81 Min-cut plot What does it look like for a real-world graph? log (mincut-size / #edges)? log (# edges) Lipari 2010 (C) 2010, C. Faloutsos 81
82 Datasets: Experiments Google Web Graph: 916,428 nodes and 5,105,039 edges Lucent Router Graph: Undirected graph of network routers from 112,969 nodes and 181,639 edges User Website Clickstream Graph: 222,704 nodes and 952,580 edges NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, Lipari D Blandford, C. Faloutsos (C) 2010, C. and Faloutsos G. Blelloch, in the SDM Workshop on Link Analysis, Counter-terrorism and Privacy
83 Experiments Used the METIS algorithm [Karypis, Kumar, 1995] log (mincut-size / #edges) Slope~ -0.4 log (# edges) Google Web graph Values along the y- axis are averaged We observe a lip for large edges Slope of -0.4, corresponds to a 2.5- dimensional grid! Lipari 2010 (C) 2010, C. Faloutsos 83
84 Google graph log (mincut-size / #edges) Log(#edges) All min-cuts Log(#edges) averaged Lipari 2010 (C) 2010, C. Faloutsos 84
85 Experiments Same results for other graphs too Lucent Router graph Clickstream graph Lipari 2010 (C) 2010, C. Faloutsos 85
86 Conclusions Practitioner s guide Hard clustering k pieces Hard co-clustering (k,l) pieces Hard clustering optimal # pieces Observations METIS Co-clustering Cross-associations jellyfish : Maybe, there are no good cuts Lipari 2010 (C) 2010, C. Faloutsos 86
87 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 87
88 Epidemic threshold Problem definition Analysis Experiments Detailed outline Fraud detection in e-bay Lipari 2010 (C) 2010, C. Faloutsos 88
89 Virus propagation How do viruses/rumors propagate? Blog influence? Will a flu-like virus linger, or will it become extinct soon? Lipari 2010 (C) 2010, C. Faloutsos 89
90 The model: SIS Flu like: Susceptible-Infected-Susceptible Virus strength s= β/δ Prob. β Prob. δ N2 Healthy N1 N Infected N3 Lipari 2010 (C) 2010, C. Faloutsos 90
91 Epidemic threshold τ of a graph: the value of τ, such that if strength s = β / δ < τ an epidemic can not happen Thus, given a graph compute its epidemic threshold Lipari 2010 (C) 2010, C. Faloutsos 91
92 Epidemic threshold τ What should τ depend on? avg. degree? and/or highest degree? and/or variance of degree? and/or third moment of degree? and/or diameter? Lipari 2010 (C) 2010, C. Faloutsos 92
93 Epidemic threshold [Theorem 1] We have no epidemic, if β/δ <τ = 1/ λ 1,A Lipari 2010 (C) 2010, C. Faloutsos 93
94 Epidemic threshold [Theorem 1] We have no epidemic (*), if recovery prob. attack prob. Proof: [Wang+03] epidemic threshold β/δ <τ = 1/ λ 1,A largest eigenvalue of adj. matrix A (*) under mild, conditional-independence assumptions Lipari 2010 (C) 2010, C. Faloutsos 94
95 t+1: - ( healthy or healed ) - and not t Beginning of proof Let: p(i, t) = Prob node i is t p(i, t+1 ) = (1 p(i, t) + p(i, t) * δ ) * Π j (1 β aji * p(j, t) ) Below threshold, if the above non-linear dynamical system above is stable (eigenvalue of Hessian < 1 ) Lipari 2010 (C) 2010, C. Faloutsos 95
96 Epidemic threshold for various networks Formula includes older results as special cases: Homogeneous networks [Kephart+White] λ 1,A = <k>; τ = 1/<k> (<k> : avg degree) Star networks (d = degree of center) λ 1,A = sqrt(d); τ = 1/ sqrt(d) Infinite power-law networks λ 1,A = ; τ = 0 ; [Barabasi] Lipari 2010 (C) 2010, C. Faloutsos 96
97 Epidemic threshold [Theorem 2] Below the epidemic threshold, the epidemic dies out exponentially Lipari 2010 (C) 2010, C. Faloutsos 97
98 Epidemic threshold Problem definition Analysis Experiments Detailed outline Lipari 2010 (C) 2010, C. Faloutsos 98
99 Experiments (Oregon) β/δ > τ (above threshold) β/δ = τ (at the threshold) β/δ < τ (below threshold) Lipari 2010 (C) 2010, C. Faloutsos 99
100 SIS simulation - # infected nodes vs time Log - Lin #inf. (log scale) above at below Time (linear scale) Lipari 2010 (C) 2010, C. Faloutsos 100
101 SIS simulation - # infected nodes vs time Log - Lin #inf. (log scale) above at below Exponential decay Time (linear scale) Lipari 2010 (C) 2010, C. Faloutsos 101
102 SIS simulation - # infected nodes vs time Log - Log #inf. (log scale) above at below Time (log scale) Lipari 2010 (C) 2010, C. Faloutsos 102
103 SIS simulation - # infected nodes vs time Log - Log #inf. (log scale) Power-law Decay (!) above at below Time (log scale) Lipari 2010 (C) 2010, C. Faloutsos 103
104 Conclusions λ 1,A : Eigenvalue of adjacency matrix determines the survival of a flu-like virus It gives a measure of how well connected is the graph (~ # paths) May guide immunization policies Lipari 2010 (C) 2010, C. Faloutsos 104
105 References D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos, Epidemic Thresholds in Real Networks, in ACM TISSEC, 10(4), 2008 Ganesh, A., Massoulie, L., and Towsley, D., The effect of network topology on the spread of epidemics. In INFOCOM. Lipari 2010 (C) 2010, C. Faloutsos 105
106 References (cont d) Hethcote, H. W The mathematics of infectious diseases. SIAM Review 42, Hethcote, H. W. AND Yorke, J. A Gonorrhea Transmission Dynamics and Control. Vol. 56. Springer. Lecture Notes in Biomathematics. Lipari 2010 (C) 2010, C. Faloutsos 106
107 References (cont d) Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, in SRDS 2003 (pages 25-34), Florence, Italy Lipari 2010 (C) 2010, C. Faloutsos 107
108 Outline Task 4: time-evolving graphs tensors Task 5: community detection Task 6: virus propagation Task 7: scalability, parallelism and hadoop Conclusions Lipari 2010 (C) 2010, C. Faloutsos 108
109 Scalability How about if graph/tensor does not fit in core? How about handling huge graphs? Lipari 2010 (C) 2010, C. Faloutsos P8-109
110 Scalability How about if graph/tensor does not fit in core? [ MET : Kolda, Sun, ICMD 08, best paper award] How about handling huge graphs? Lipari 2010 (C) 2010, C. Faloutsos P8-110
111 Scalability Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, Web Search for a Planet: The Google Cluster Architecture IEEE Micro 2003] Yahoo: 5Pb of data [Fayyad, KDD 07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? Lipari 2010 (C) 2010, C. Faloutsos P8-111
112 Scalability Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, Web Search for a Planet: The Google Cluster Architecture IEEE Micro 2003] Yahoo: 5Pb of data [Fayyad, KDD 07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? A: map/reduce hadoop (open-source clone) Lipari 2010 (C) 2010, C. Faloutsos P8-112
113 2 intro to hadoop master-slave architecture; n-way replication (default n=3) group by of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency compute local histograms then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id Lipari 2010 (C) 2010, C. Faloutsos P8-113
114 2 intro to hadoop master-slave architecture; n-way replication (default n=3) group by of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency compute local histograms then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id reduce map Lipari 2010 (C) 2010, C. Faloutsos P8-114
115 User Program fork fork fork Input Data (on HDFS) Split 0 Split 1 Split 2 read Mapper Mapper Mapper assign map local write Master assign reduce remote read, sort Reducer Reducer write Output File 0 Output File 1 By default: 3-way replication; Late/dead machines: ignored, transparently (!) Lipari 2010 (C) 2010, C. Faloutsos P8-115
116 D.I.S.C. Data Intensive Scientific Computing [R. Bryant, CMU] big data Lipari 2010 (C) 2010, C. Faloutsos P8-116
117 Analysis of a large graph ~200Gb (Yahoo crawl) - Degree Distribution: in 12 minutes with 50 machines Many (link spams?) at out-degree 1200 Lipari 2010 (C) 2010, C. Faloutsos P8-117
118 Outline Algorithms & results Centralized Hadoop/ PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles Visualization DONE STARTED Lipari 2010 (C) 2010, C. Faloutsos 118
119 HADI for diameter estimation Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM 10 Naively: diameter needs O(N**2) space and up to O(N**3) time prohibitive (N~1B) Our HADI: linear on E (~10B) Near-linear scalability wrt # machines Several optimizations -> 5x faster Lipari 2010 (C) 2010, C. Faloutsos 119
120 Count?????? 19+? [Barabasi+] Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Largest publicly available graph ever studied. Lipari 2010 (C) 2010, C. Faloutsos 120
121 Count 14 (dir.)???? ~7 (undir.) 19+? [Barabasi+] Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Largest publicly available graph ever studied. Lipari 2010 (C) 2010, C. Faloutsos 121
122 YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) effective diameter: surprisingly small. Multi-modality: probably mixture of cores. Lipari 2010 (C) 2010, C. Faloutsos 122
123 YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) effective diameter: surprisingly small. Multi-modality: probably mixture of cores. Lipari 2010 (C) 2010, C. Faloutsos 123
124 Radius Plot of GCC of YahooWeb. Lipari 2010 (C) 2010, C. Faloutsos 124
125 details Running time - Kronecker and Erdos-Renyi Graphs with billions edges.
126 Outline Algorithms & results Centralized Hadoop/ PEGASUS Degree Distr. old old Pagerank old old Diameter/ANF old DONE Conn. Comp old DONE Triangles Visualization DONE STARTED Lipari 2010 (C) 2010, C. Faloutsos 126
127 Generalized Iterated Matrix Vector Multiplication (GIMV) PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up). Lipari 2010 (C) 2010, C. Faloutsos 127
128 Generalized Iterated Matrix Vector Multiplication (GIMV) details PageRank proximity (RWR) Diameter Connected components (eigenvectors, Belief Prop. ) Matrix vector Multiplication (iterated) Lipari 2010 (C) 2010, C. Faloutsos 128
129 Example: GIM-V At Work Connected Components Count Lipari 2010 Size (C) 2010, C. Faloutsos 129
130 Example: GIM-V At Work Connected Components Count ~0.7B singleton nodes Lipari 2010 Size (C) 2010, C. Faloutsos 130
131 Example: GIM-V At Work Connected Components Count Lipari 2010 Size (C) 2010, C. Faloutsos 131
132 Example: GIM-V At Work Connected Components Count 300-size cmpt X size cmpt Why? X 65. Why? Lipari 2010 Size (C) 2010, C. Faloutsos 132
133 Example: GIM-V At Work Connected Components Count suspicious financial-advice sites (not existing now) Lipari 2010 Size (C) 2010, C. Faloutsos 133
134 GIM-V At Work Connected Components over Time LinkedIn: 7.5M nodes and 58M edges Stable tail slope after the gelling point Lipari 2010 (C) 2010, C. Faloutsos 134
135 Conclusions Hadoop: promising architecture for Tera/ Peta scale graph mining Resources: Higher-level language for data processing Lipari 2010 (C) 2010, C. Faloutsos P8-135
136 References Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD 2008: Lipari 2010 (C) 2010, C. Faloutsos P8-136
137 Overall Conclusions Real graphs exhibit surprising patterns (power laws, shrinking diameter etc) SVD: a powerful tool (HITS, PageRank) Several other tools: tensors, METIS, Scalability: hadoop/parallelism Lipari 2010 (C) 2010, C. Faloutsos 137
138 Our goal: Open source system for mining huge graphs: PEGASUS project (PEta GrAph mining System) code and papers Lipari 2010 (C) 2010, C. Faloutsos 138
139 Project info Chau, Polo McGlohon, Mary Tsourakakis, Babis Akoglu, Leman Kang, U Prakash, Aditya Tong, Hanghang Thanks to: NSF IIS , IIS , CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Lipari 2010 INTEL, HP (C) 2010, C. Faloutsos 139
140 Extra material E-bay fraud detection Outlier detection Lipari 2010 (C) 2010, C. Faloutsos 140
141 Detailed outline Fraud detection in e-bay Anomaly detection Lipari 2010 (C) 2010, C. Faloutsos 141
142 E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, S. Pandit, D. H. Chau, S. Wang, and C. Lipari 2010 (C) 2010, C. Faloutsos 142 Faloutsos (WWW'07), pp
143 E-bay Fraud detection lines: positive feedbacks would you buy from him/her? Lipari 2010 (C) 2010, C. Faloutsos 143
144 E-bay Fraud detection lines: positive feedbacks would you buy from him/her? or him/her? Lipari 2010 (C) 2010, C. Faloutsos 144
145 E-bay Fraud detection - NetProbe Belief Propagation gives: Lipari 2010 (C) 2010, C. Faloutsos 145
146 Extra material E-bay fraud detection Outlier detection Lipari 2010 (C) 2010, C. Faloutsos 146
147 OddBall: Spotting Anomalies in Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science PAKDD 2010, Hyderabad, India
148 For each node, Main idea extract ego-net (=1-step-away neighbors) Extract features (#edges, total weight, etc etc) Compare with the rest of the population Lipari 2010 (C) 2010, C. Faloutsos 148
149 What is an egonet? ego egonet Lipari 2010 (C) 2010, C. Faloutsos 149
150 Selected Features N i : number of neighbors (degree) of ego i E i : number of edges in egonet i W i : total weight of egonet i λ w,i : principal eigenvalue of the weighted adjacency matrix of egonet I Lipari 2010 (C) 2010, C. Faloutsos 150
151 Near-Clique/Star Lipari 2010 (C) 2010, C. Faloutsos 151
152 Near-Clique/Star Lipari 2010 (C) 2010, C. Faloutsos 152
153 END Arrivederci Lipari 2010 (C) 2010, C. Faloutsos 153
Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos
Must-read Material 15-826: Multimedia Databases and Data Mining Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories,
More informationMust-read material (1 of 2) : Multimedia Databases and Data Mining. Must-read material (2 of 2) Main outline
Must-read material (1 of 2) 15-826: Multimedia Databases and Data Mining Lecture #29: Graph mining - Generators & tools Fully Automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha
More informationLarge Graph Mining: Power Tools and a Practitioner s guide
Large Graph Mining: Power Tools and a Practitioner s guide Task 5: Graphs over time & tensors Faloutsos, Miller, Tsourakakis CMU KDD '9 Faloutsos, Miller, Tsourakakis P5-1 Outline Introduction Motivation
More informationFaloutsos, Tong ICDE, 2009
Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity
More informationCMU SCS Mining Billion-node Graphs: Patterns, Generators and Tools
Mining Billion-node Graphs: Patterns, Generators and Tools Christos Faloutsos CMU (on sabbatical at google) Tamara Kolda Thank you! C. Faloutsos (CMU + Google) 2 Our goal: Open source system for mining
More informationDATA MINING LECTURE 9. Minimum Description Length Information Theory Co-Clustering
DATA MINING LECTURE 9 Minimum Description Length Information Theory Co-Clustering MINIMUM DESCRIPTION LENGTH Occam s razor Most data mining tasks can be described as creating a model for the data E.g.,
More informationCMU SCS. Large Graph Mining. Christos Faloutsos CMU
Large Graph Mining Christos Faloutsos CMU Thank you! Hillol Kargupta NGDM 2007 C. Faloutsos 2 Outline Problem definition / Motivation Static & dynamic laws; generators Tools: CenterPiece graphs; fraud
More informationWindow-based Tensor Analysis on High-dimensional and Multi-aspect Streams
Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams Jimeng Sun Spiros Papadimitriou Philip S. Yu Carnegie Mellon University Pittsburgh, PA, USA IBM T.J. Watson Research Center Hawthorne,
More informationThanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides
Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank
More informationThanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides
Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank
More informationRTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs
RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs Leman Akoglu Mary McGlohon Christos Faloutsos Carnegie Mellon University, School of Computer Science {lakoglu, mmcgloho, christos}@cs.cmu.edu
More informationData mining in large graphs
Data mining in large graphs Christos Faloutsos University www.cs.cmu.edu/~christos ALLADIN 2003 C. Faloutsos 1 Outline Introduction - motivation Patterns & Power laws Scalability & Fast algorithms Fractals,
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 2. Basic Linear Algebra continued Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki
More informationAnomaly Detection. Davide Mottin, Konstantina Lazaridou. HassoPlattner Institute. Graph Mining course Winter Semester 2016
Anomaly Detection Davide Mottin, Konstantina Lazaridou HassoPlattner Institute Graph Mining course Winter Semester 2016 and Next week February 7, third round presentations Slides are due by February 6,
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 Web pages are not equally important www.joe-schmoe.com
More informationNetworks as vectors of their motif frequencies and 2-norm distance as a measure of similarity
Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity CS322 Project Writeup Semih Salihoglu Stanford University 353 Serra Street Stanford, CA semih@stanford.edu
More informationMotivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to ge
Graph Mining: Laws, Generators and Tools Christos Faloutsos CMU Thank you! Sharad Mehrotra UCI 2007 C. Faloutsos 2 Outline Problem definition / Motivation Static & dynamic laws; generators Tools: CenterPiece
More informationToward Understanding Spatial Dependence on Epidemic Thresholds in Networks
Toward Understanding Spatial Dependence on Epidemic Thresholds in Networks Zesheng Chen Department of Computer Science Indiana University - Purdue University Fort Wayne, Indiana 4685 Email: chenz@ipfw.edu
More informationSlides based on those in:
Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering
More informationCS224W: Analysis of Networks Jure Leskovec, Stanford University
CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/30/17 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University.
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu What is the structure of the Web? How is it organized? 2/7/2011 Jure Leskovec, Stanford C246: Mining Massive
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision
More informationCS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #8: Epidemics: Thresholds
CS 6604: Data Mining Large Networks and Time-series B. Aditya Prakash Lecture #8: Epidemics: Thresholds A fundamental ques@on Strong Virus Epidemic? 2 example (sta@c graph) Weak Virus Epidemic? 3 Problem
More informationOnline Social Networks and Media. Link Analysis and Web Search
Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationLINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
LINK ANALYSIS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link analysis Models
More information6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search
6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search Daron Acemoglu and Asu Ozdaglar MIT September 30, 2009 1 Networks: Lecture 7 Outline Navigation (or decentralized search)
More informationCMU SCS Mining Large Graphs: Patterns, Anomalies, and Fraud Detection
Mining Large Graphs: Patterns, Anomalies, and Fraud Detection Christos Faloutsos CMU Thank you! Prof. Julian McAuley Nicholas Urioste UCSD'16 (c) 2016, C. Faloutsos 2 Roadmap Introduction Motivation Why
More informationDATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS
DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information
More informationCS224W: Analysis of Networks Jure Leskovec, Stanford University
Announcements: Please fill HW Survey Weekend Office Hours starting this weekend (Hangout only) Proposal: Can use 1 late period CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize/navigate it? First try: Human curated Web directories Yahoo, DMOZ, LookSmart
More informationSlide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.
Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org #1: C4.5 Decision Tree - Classification (61 votes) #2: K-Means - Clustering
More informationAnalytically tractable processes on networks
University of California San Diego CERTH, 25 May 2011 Outline Motivation 1 Motivation Networks Random walk and Consensus Epidemic models Spreading processes on networks 2 Networks Motivation Networks Random
More informationCS224W: Methods of Parallelized Kronecker Graph Generation
CS224W: Methods of Parallelized Kronecker Graph Generation Sean Choi, Group 35 December 10th, 2012 1 Introduction The question of generating realistic graphs has always been a topic of huge interests.
More informationOnline Social Networks and Media. Link Analysis and Web Search
Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 16, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision
More informationLarge Graph Mining: Power Tools and a Practitioner s guide
Large Graph Mining: Power Tools and a Practitioner s guide Task 3: Recommendations & proximity Faloutsos, Miller & Tsourakakis CMU KDD'09 Faloutsos, Miller, Tsourakakis P3-1 Outline Introduction Motivation
More informationTalk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU
Talk 2: Graph Mining Tools - SVD, ranking, proximity Christos Faloutsos CMU Outline Introduction Motivation Task 1: Node importance Task 2: Recommendations Task 3: Connection sub-graphs Conclusions Lipari
More informationLink Analysis Ranking
Link Analysis Ranking How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it? Naïve ranking of query results Given query
More informationAnomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach
Proceedings 1st International Workshop on Advanced Analytics and Learning on Temporal Data AALTD 2015 Anomaly Detection in Temporal Graph Data: An Iterative Tensor Decomposition and Masking Approach Anna
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu Task: Find coalitions in signed networks Incentives: European
More informationHigh Performance Parallel Tucker Decomposition of Sparse Tensors
High Performance Parallel Tucker Decomposition of Sparse Tensors Oguz Kaya INRIA and LIP, ENS Lyon, France SIAM PP 16, April 14, 2016, Paris, France Joint work with: Bora Uçar, CNRS and LIP, ENS Lyon,
More informationA new centrality measure for probabilistic diffusion in network
ACSIJ Advances in Computer Science: an International Journal, Vol. 3, Issue 5, No., September 204 ISSN : 2322-557 A new centrality measure for probabilistic diffusion in network Kiyotaka Ide, Akira Namatame,
More informationUnsupervised Anomaly Detection for High Dimensional Data
Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation
More informationRTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs
RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs Submitted for Blind Review Abstract How do real, weighted graphs change over time? What patterns, if any, do they obey? Earlier studies
More informationCom2: Fast Automatic Discovery of Temporal ( Comet ) Communities
Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities Miguel Araujo CMU/University of Porto maraujo@cs.cmu.edu Christos Faloutsos* Carnegie Mellon University christos@cs.cmu.edu Spiros Papadimitriou
More informationRainfall data analysis and storm prediction system
Rainfall data analysis and storm prediction system SHABARIRAM, M. E. Available from Sheffield Hallam University Research Archive (SHURA) at: http://shura.shu.ac.uk/15778/ This document is the author deposited
More informationMAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds
MAE 298, Lecture 8 Feb 4, 2008 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in a file-sharing
More informationGraph Analysis Using Map/Reduce
Seminar: Massive-Scale Graph Analysis Summer Semester 2015 Graph Analysis Using Map/Reduce Laurent Linden s9lalind@stud.uni-saarland.de May 14, 2015 Introduction Peta-Scale Graph + Graph mining + MapReduce
More informationLecture VI Introduction to complex networks. Santo Fortunato
Lecture VI Introduction to complex networks Santo Fortunato Plan of the course I. Networks: definitions, characteristics, basic concepts in graph theory II. III. IV. Real world networks: basic properties
More informationCS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine
CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More information1 Matrix notation and preliminaries from spectral graph theory
Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.
More informationCloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data
Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data Arvind Ramanathan ramanathana@ornl.gov Computational Data Analytics Group, Computer Science & Engineering
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Graph and Network Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Methods Learnt Classification Clustering Vector Data Text Data Recommender System Decision Tree; Naïve
More informationCS246 Final Exam, Winter 2011
CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including
More informationCMU SCS : Multimedia Databases and Data Mining CMU SCS Must-read Material CMU SCS Must-read Material (cont d)
15-826: Multimedia Databases and Data Mining Lecture #26: Graph mining - patterns Christos Faloutsos Must-read Material Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, On Power-Law Relationships
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationSelf Similar (Scale Free, Power Law) Networks (I)
Self Similar (Scale Free, Power Law) Networks (I) E6083: lecture 4 Prof. Predrag R. Jelenković Dept. of Electrical Engineering Columbia University, NY 10027, USA {predrag}@ee.columbia.edu February 7, 2007
More informationCom2: Fast Automatic Discovery of Temporal ( Comet ) Communities
Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities Miguel Araujo 1,5, Spiros Papadimitriou 2, Stephan Günnemann 1, Christos Faloutsos 1, Prithwish Basu 3, Ananthram Swami 4, Evangelos E.
More informationHEigen: Spectral Analysis for Billion-Scale Graphs
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, MONTH 2012 1 HEigen: Spectral Analysis for Billion-Scale Graphs U Kang, Brendan Meeder, Evangelos E. Papalexakis, and Christos Faloutsos
More informationFitting a Tensor Decomposition is a Nonlinear Optimization Problem
Fitting a Tensor Decomposition is a Nonlinear Optimization Problem Evrim Acar, Daniel M. Dunlavy, and Tamara G. Kolda* Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia
More informationProblem (INFORMAL). Given a dynamic graph, find a set of possibly overlapping temporal subgraphs to concisely describe the given dynamic graph in a
Outlines TimeCrunch: Interpretable Dynamic Graph Summarization by Neil Shah et. al. (KDD 2015) From Micro to Macro: Uncovering and Predicting Information Cascading Process with Behavioral Dynamics by Linyun
More informationCS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 11: MapReduce Motivation Distribution makes simple computations complex Communication Load balancing Fault tolerance Not all applications
More informationHub, Authority and Relevance Scores in Multi-Relational Data for Query Search
Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search Xutao Li 1 Michael Ng 2 Yunming Ye 1 1 Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology,
More informationLink Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Link Analysis Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 The Web as a Directed Graph Page A Anchor hyperlink Page B Assumption 1: A hyperlink between pages
More informationBig Tensor Data Reduction
Big Tensor Data Reduction Nikos Sidiropoulos Dept. ECE University of Minnesota NSF/ECCS Big Data, 3/21/2013 Nikos Sidiropoulos Dept. ECE University of Minnesota () Big Tensor Data Reduction NSF/ECCS Big
More informationMining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationLink Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci
Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor
More informationUsing R for Iterative and Incremental Processing
Using R for Iterative and Incremental Processing Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber UC Berkeley and HP Labs UC BERKELEY Big Data, Complex Algorithms PageRank (Dominant
More informationCS60021: Scalable Data Mining. Dimensionality Reduction
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a
More informationOutline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C.
Outline 15-826: Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos Goal: Find similar / interesting things Intro to DB Indexing - similarity search Points Text Time sequences; images
More informationSVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)
Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular
More informationData Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
More informationCommunities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices
Communities Via Laplacian Matrices Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices The Laplacian Approach As with betweenness approach, we want to divide a social graph into
More information1 Searching the World Wide Web
Hubs and Authorities in a Hyperlinked Environment 1 Searching the World Wide Web Because diverse users each modify the link structure of the WWW within a relatively small scope by creating web-pages on
More informationCompressing Tabular Data via Pairwise Dependencies
Compressing Tabular Data via Pairwise Dependencies Amir Ingber, Yahoo! Research TCE Conference, June 22, 2017 Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford) Huge datasets: everywhere - Internet
More informationLecture 7 Mathematics behind Internet Search
CCST907 Hidden Order in Daily Life: A Mathematical Perspective Lecture 7 Mathematics behind Internet Search Dr. S. P. Yung (907A) Dr. Z. Hua (907B) Department of Mathematics, HKU Outline Google is the
More informationTensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia
Network Feature s Decompositions for Machine Learning 1 1 Department of Computer Science University of British Columbia UBC Machine Learning Group, June 15 2016 1/30 Contact information Network Feature
More informationA Randomized Approach for Crowdsourcing in the Presence of Multiple Views
A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion
More informationCS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University
CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University TheFind.com Large set of products (~6GB compressed) For each product A=ributes Related products Craigslist About 3 weeks of data
More informationTensor Analysis. Topics in Data Mining Fall Bruno Ribeiro
Tensor Analysis Topics in Data Mining Fall 2015 Bruno Ribeiro Tensor Basics But First 2 Mining Matrices 3 Singular Value Decomposition (SVD) } X(i,j) = value of user i for property j i 2 j 5 X(Alice, cholesterol)
More informationLecture: Local Spectral Methods (1 of 4)
Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide
More informationLecture 12: Link Analysis for Web Retrieval
Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities
More informationAs it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation
Graphs and Networks Page 1 Lecture 2, Ranking 1 Tuesday, September 12, 2006 1:14 PM I. II. I. How search engines work: a. Crawl the web, creating a database b. Answer query somehow, e.g. grep. (ex. Funk
More informationto be more efficient on enormous scale, in a stream, or in distributed settings.
16 Matrix Sketching The singular value decomposition (SVD) can be interpreted as finding the most dominant directions in an (n d) matrix A (or n points in R d ). Typically n > d. It is typically easy to
More informationA New Space for Comparing Graphs
A New Space for Comparing Graphs Anshumali Shrivastava and Ping Li Cornell University and Rutgers University August 18th 2014 Anshumali Shrivastava and Ping Li ASONAM 2014 August 18th 2014 1 / 38 Main
More informationAlgebraic Representation of Networks
Algebraic Representation of Networks 0 1 2 1 1 0 0 1 2 0 0 1 1 1 1 1 Hiroki Sayama sayama@binghamton.edu Describing networks with matrices (1) Adjacency matrix A matrix with rows and columns labeled by
More informationSpectral Graph Theory Tools. Analysis of Complex Networks
Spectral Graph Theory Tools for the Department of Mathematics and Computer Science Emory University Atlanta, GA 30322, USA Acknowledgments Christine Klymko (Emory) Ernesto Estrada (Strathclyde, UK) Support:
More informationEpidemic Thresholds in Real Networks
IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING 1 Epidemic Thresholds in Real Networks Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jurij Leskovec and Christos Faloutsos deepay@cs.cmu.edu, yangwang@andrew.cmu.edu,
More informationRecovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies
July 12, 212 Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies Morteza Mardani Dept. of ECE, University of Minnesota, Minneapolis, MN 55455 Acknowledgments:
More informationTwo heads better than one: pattern discovery in time-evolving multi-aspect data
Data Min Knowl Disc (2008) 17:111 128 DOI 10.1007/s10618-008-0112-3 Two heads better than one: pattern discovery in time-evolving multi-aspect data Jimeng Sun Charalampos E. Tsourakakis Evan Hoke Christos
More informationPROBABILISTIC LATENT SEMANTIC ANALYSIS
PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications
More informationDynamics of Real-world Networks
Dynamics of Real-world Networks Thesis proposal Jurij Leskovec Machine Learning Department Carnegie Mellon University May 2, 2007 Thesis committee: Christos Faloutsos, CMU Avrim Blum, CMU John Lafferty,
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents
More informationSpectral Clustering - Survey, Ideas and Propositions
Spectral Clustering - Survey, Ideas and Propositions Manu Bansal, CSE, IITK Rahul Bajaj, CSE, IITB under the guidance of Dr. Anil Vullikanti Networks Dynamics and Simulation Sciences Laboratory Virginia
More informationThe Ties that Bind Characterizing Classes by Attributes and Social Ties
The Ties that Bind WWW April, 2017, Bryan Perozzi*, Leman Akoglu Stony Brook University *Now at Google. Introduction Outline Our problem: Characterizing Community Differences Proposed Method Experimental
More information