CMU SCS Talk 3: Graph Mining Tools Tensors, communities, parallelism


1 Talk 3: Graph Mining Tools: Tensors, communities, parallelism. Christos Faloutsos, CMU

2 Overall Outline Introduction Motivation Talk#1: Patterns in graphs; generators Talk#2: Tools (Ranking, proximity) Talk#3: Tools (Tensors, scalability) Conclusions Lipari 2010 (C) 2010, C. Faloutsos 2

12 Notice: the 3rd mode does not need to be time, and we can have more than 3 modes. E.g., fMRI: x, y, z, time, person-id, task-id. From DENLAB, Temple U. (Prof. V. Megalooikonomou +)

13 Motivating Applications. Why are tensors useful? Web mining (TOPHITS); environmental sensors; intrusion detection (src, dst, time, dest-port); social networks (src, dst, time, type-of-contact); face recognition; etc.

Goal: extension to >=3 modes. [Figure: an I x J x K tensor approximated (~) by factor matrices such as A (I x R) and a J x R factor.]

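20 Specially Structured Tensors. Tucker tensor: an I x J x K tensor $\mathcal{X} = \mathcal{G} \times_1 U \times_2 V \times_3 W$, with core $\mathcal{G}$ of size R x S x T and factors U (I x R), V (J x S), W (K x T). Kruskal tensor: $\mathcal{X} = \sum_{r=1}^{R} u_r \circ v_r \circ w_r$, with factors U (I x R), V (J x R), W (K x R); equivalently, a Tucker tensor whose R x R x R core is superdiagonal.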

21 Tucker Decomposition - intuition. An I x J x K tensor (author x keyword x conference) is approximated (~) by a core G (R x S x T) and factors A (I x R), B (J x S), C (K x T). A: author x author-group. B: keyword x keyword-group. C: conf. x conf-group. G: how groups relate to each other.
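The Tucker intuition above can be made concrete with a small numerical sketch. It assumes NumPy and uses random data; the function names `tucker_reconstruct` and `kruskal_reconstruct` are illustrative and do not come from the slides. It rebuilds a tensor from a core and three factor matrices, and checks that a Kruskal (PARAFAC) tensor is the special case of a superdiagonal core.

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """X[i,j,k] = sum over r,s,t of G[r,s,t] * A[i,r] * B[j,s] * C[k,t]."""
    return np.einsum("rst,ir,js,kt->ijk", G, A, B, C)

def kruskal_reconstruct(w, U, V, W):
    """X[i,j,k] = sum over r of w[r] * U[i,r] * V[j,r] * W[k,r]."""
    return np.einsum("r,ir,jr,kr->ijk", w, U, V, W)

rng = np.random.default_rng(0)
I, J, K, R, S, T = 4, 5, 6, 2, 3, 2          # illustrative sizes
G = rng.random((R, S, T))                    # core: how groups relate
A, B, C = rng.random((I, R)), rng.random((J, S)), rng.random((K, T))
X = tucker_reconstruct(G, A, B, C)           # shape (I, J, K)

# Kruskal tensor = Tucker tensor with a superdiagonal R x R x R core.
w = rng.random(R)
U, V, W = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
diag_core = np.zeros((R, R, R))
idx = np.arange(R)
diag_core[idx, idx, idx] = w
X_kruskal = kruskal_reconstruct(w, U, V, W)
X_as_tucker = tucker_reconstruct(diag_core, U, V, W)
```

In the author x keyword x conference reading, a large entry `G[r, s, t]` would mean that author-group r, keyword-group s, and conference-group t occur together often.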

25 Tensor tools - summary. Two main tools: PARAFAC and Tucker. Both find row-, column-, and tube-groups, but in PARAFAC the three groups are identical. (To solve: Alternating Least Squares.)
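As a rough illustration of the Alternating Least Squares idea for PARAFAC, the sketch below fixes two factor matrices and solves a linear least-squares problem for the third, cycling through the modes. It assumes NumPy, a dense 3-mode tensor, and a random start; the helper names (`khatri_rao`, `unfold`, `cp_als`, `reconstruct`) are illustrative, and production tools such as the Tensor Toolbox add normalization, convergence checks, and sparse support that are omitted here.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (I x R) and V (J x R): (I*J) x R."""
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, U.shape[1])

def unfold(X, mode):
    """Matricize X so that rows index `mode` (remaining modes in C order)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, R, n_iter=100, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((n, R)) + 0.1 for n in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(first, second)
            # Least-squares solution F of: unfold(X, mode) = F @ kr.T
            factors[mode] = unfold(X, mode) @ np.linalg.pinv(kr).T
    return factors

def reconstruct(factors):
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

# Synthetic rank-2 tensor, then a rank-2 fit.
rng = np.random.default_rng(1)
true_factors = [rng.standard_normal((n, 2)) for n in (6, 5, 4)]
X = reconstruct(true_factors)
fit = cp_als(X, R=2)
rel_err = np.linalg.norm(X - reconstruct(fit)) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Each inner update is an exact least-squares solve, so the reconstruction error cannot increase from one update to the next; how quickly it falls depends on the data and the starting point.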

28 Kleinberg's Hubs and Authorities (the HITS method). Sparse adjacency matrix (from x to) and its SVD. [Figure: the matrix as a sum of rank-1 terms; each term pairs hub scores with authority scores for one topic (1st topic, 2nd topic, ...).] Kleinberg, JACM, 1999
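A minimal sketch of the "adjacency matrix and its SVD" view of HITS, assuming NumPy and a tiny hand-made graph (the five pages and their links are invented for illustration). With rows as "from" and columns as "to", the leading left singular vector gives hub scores and the leading right singular vector gives authority scores for the dominant topic.

```python
import numpy as np

def hits_top_topic(A):
    """Hub and authority scores for the dominant topic of adjacency A[from, to]."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # For a non-negative matrix the leading singular vectors can be taken
    # non-negative; abs() removes the arbitrary sign returned by the solver.
    hubs = np.abs(U[:, 0])
    authorities = np.abs(Vt[0, :])
    return hubs, authorities

# Hypothetical graph: pages 0 and 1 link to pages 2 and 3; page 4 links to page 3.
A = np.zeros((5, 5))
for src, dst in [(0, 2), (0, 3), (1, 2), (1, 3), (4, 3)]:
    A[src, dst] = 1.0

hubs, authorities = hits_top_topic(A)
print("best authority:", int(np.argmax(authorities)))
print("hub scores:", np.round(hubs, 3))
```

Only the first singular pair is sign-corrected this way; later pairs (2nd topic and beyond) carry mixed signs and need more careful interpretation.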

29 HITS Authorities on Sample Data. We started our crawl from ... and crawled 4700 pages, resulting in 560 cross-linked hosts. [Figure: from x to matrix with hub scores and authority scores for the 1st and 2nd topic.]

33 Topical HITS (TOPHITS). Main idea: extend the idea behind the HITS model to incorporate term (i.e., topical) information. [Figure: from x to x term data as a sum of rank-1 terms; each term combines hub scores, authority scores, and term scores for one topic (1st topic, 2nd topic, ...).]


35 TOPHITS Terms & Authorities on Sample Data. TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms. w_k = # unique links using term k. Tensor PARAFAC. [Figure: hub scores, authority scores, and term scores for the 1st and 2nd topic.]

36 Conclusions. Real data are often in high dimensions with multiple aspects (modes). Tensors provide elegant theory and algorithms. PARAFAC and Tucker: discover groups.

37 References. T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, November 2005. Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams, Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006.

39 Tensor tools - resources. Toolbox from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/tensortoolbox. T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009. csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/tensorreview-preprint.pdf. T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008).

45 Solution #1: METIS. G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, <and many extensions>

48 Detailed outline. Motivation. Hard clustering: k pieces. Hard co-clustering: (k,l) pieces. Hard clustering: optimal # pieces. Soft clustering: matrix decompositions. Observations.

50 Co-clustering. Given a data matrix and the number of row and column groups, k and l, simultaneously cluster rows into k disjoint groups and cluster columns into l disjoint groups.

51 Co-clustering. Let X and Y be discrete random variables; X and Y take values in {1, 2, ..., m} and {1, 2, ..., n}. p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data. Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. Key obstacles in clustering contingency tables: high dimensionality, sparsity, noise; need for robust and scalable algorithms. Reference: 1. Dhillon et al. Information-Theoretic Co-clustering, KDD 03
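To make the information-theoretic view concrete, the sketch below estimates p(X, Y) from a small co-occurrence table, computes the mutual information I(X; Y), and compares it with the mutual information that survives after rows and columns are merged into groups. It assumes NumPy; the toy counts, the label vectors, and the function names are illustrative assumptions, and the quantity shown (loss in mutual information) is the objective as I understand it from the cited Dhillon et al. paper, not code from the slides.

```python
import numpy as np

def mutual_information(P):
    """I(X;Y) in bits for a joint distribution P (rows: X, columns: Y)."""
    P = np.asarray(P, dtype=float)
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    mask = P > 0                       # 0 * log(0) is treated as 0
    return float((P[mask] * np.log2(P[mask] / (px @ py)[mask])).sum())

def coclustered_joint(P, row_labels, col_labels, k, l):
    """Joint distribution of (row group, column group) after merging."""
    Q = np.zeros((k, l))
    r = np.asarray(row_labels)[:, None]
    c = np.asarray(col_labels)[None, :]
    np.add.at(Q, (r, c), P)            # sum each cell into its (group, group) bin
    return Q

# Toy co-occurrence counts with two obvious blocks.
counts = np.array([[5, 5, 0, 0],
                   [5, 5, 0, 0],
                   [0, 0, 5, 5],
                   [0, 0, 5, 5]], dtype=float)
P = counts / counts.sum()              # estimate p(X, Y) from co-occurrences

good = coclustered_joint(P, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = coclustered_joint(P, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)
loss_good = mutual_information(P) - mutual_information(good)
loss_bad = mutual_information(P) - mutual_information(bad)
print(loss_good, loss_bad)
```

A grouping that respects the block structure keeps all of the mutual information, while one that mixes the blocks loses it.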


56 Problem with Information-Theoretic Co-clustering: the number of row and column groups must be specified. Desiderata: simultaneously discover row and column groups; fully automatic: no magic numbers; scalable to large graphs.


57 Cross-association. Desiderata: simultaneously discover row and column groups; fully automatic: no magic numbers; scalable to large matrices. Reference: 1. Chakrabarti et al. Fully Automatic Cross-Associations, KDD 04

59 What makes a cross-association good? [Figure: two alternative groupings of the same matrix, each with axes row groups x column groups.] Why is this better? Simpler; easier to describe; easier to compress!

60 What makes a cross-association good? Problem definition: given an encoding scheme, decide on the # of col. and row groups, k and l, and reorder rows and columns, to achieve the best compression.

61 Main Idea. Good compression goes with better clustering. $\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \cdot H(x_i)}_{\text{Code Cost}} + \underbrace{\text{Cost of describing cross-associations}}_{\text{Description Cost}}$. Minimize the total cost (# bits) for lossless compression.
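The code-cost term of the encoding cost above can be sketched for a binary matrix as follows. This assumes NumPy and interprets size_i as the number of cells in block i and x_i as the fraction of ones in that block, with H the binary entropy; that interpretation, the function names, and the toy matrix are assumptions for illustration. The description cost (bits needed to describe the grouping itself) is deliberately left out.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 for p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def code_cost(M, row_labels, col_labels):
    """Sum over blocks of (number of cells) * H(density of ones in the block)."""
    M = np.asarray(M)
    r = np.asarray(row_labels)
    c = np.asarray(col_labels)
    total = 0.0
    for i in np.unique(r):
        for j in np.unique(c):
            block = M[np.ix_(r == i, c == j)]
            total += block.size * binary_entropy(block.mean())
    return total

M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

two_groups = code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])   # homogeneous blocks
one_group = code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0])    # one mixed block
print(two_groups, one_group)
```

Homogeneous blocks cost almost nothing to encode, which is why a grouping that compresses well also tends to be a good clustering; the omitted description cost is what keeps the method from splitting into ever more groups.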

63 Experiments: CLASSIC. [Figure: documents x words matrix.] 3,893 documents; 4,303 words; 176,347 dots. Combination of 3 sources: MEDLINE (medical), CISI (info. retrieval), CRANFIELD (aerodynamics).

65 Experiments. CLASSIC graph of documents & words: k=15, l=19. Document group: MEDLINE (medical). Word groups: (insipidus, alveolar, aortic, death, prognosis, intravenous); (blood, disease, clinical, cell, tissue, patient).

66 Experiments. CLASSIC graph of documents & words: k=15, l=19. Document groups: MEDLINE (medical), CISI (Information Retrieval). Word groups: (providing, studying, records, development, students, rules); (abstract, notation, works, construct, bibliographies).

67 Experiments. CLASSIC graph of documents & words: k=15, l=19. Document groups: MEDLINE (medical), CISI (Information Retrieval), CRANFIELD (aerodynamics). Word group: (shape, nasa, leading, assumed, thin).

68 Experiments. CLASSIC graph of documents & words: k=15, l=19. Document groups: MEDLINE (medical), CISI (Information Retrieval), CRANFIELD (aerodynamics). Word group: (paint, examination, fall, raise, leave, based).

70 Algorithm. Hadoop implementation [ICDM 08]: Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008

75 Jellyfish model [Tauro+]. A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001. Jellyfish: A Conceptual Model for the AS Internet Topology, G. Siganos, Sudhir L Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, pp , Sept

76 Strange behavior of min cuts: negative dimensionality (!). NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy. Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.

80 Min-cut plot. [Plot: y-axis log (mincut-size / #edges), x-axis log (# edges).] For a d-dimensional grid, the slope is -1/d. For a random graph, the slope is 0.
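The -1/d slope can be checked numerically for the simplest case, a 2-dimensional n x n grid. The sketch assumes NumPy, counts 2n(n-1) edges per grid, and takes the straight cut through the middle (n severed edges) as the balanced cut; those modelling choices and the helper `implied_dimension` are assumptions made for illustration.

```python
import numpy as np

# n x n grids with n = 64, 128, ..., 4096
sides = np.array([2.0 ** k for k in range(6, 13)])
edges = 2 * sides * (sides - 1)      # horizontal plus vertical edges
cut = sides                          # a straight cut through the middle severs n edges

# Fit a line to log(mincut-size / #edges) versus log(#edges)
slope, intercept = np.polyfit(np.log(edges), np.log(cut / edges), 1)
print(f"fitted slope: {slope:.3f}")  # close to -1/2 for d = 2

def implied_dimension(slope):
    """Invert slope = -1/d to get the grid dimension a slope corresponds to."""
    return -1.0 / slope

print(implied_dimension(-0.4))       # a slope of -0.4 maps to d = 2.5
```

Reading the relation backwards is what lets a measured slope on a real graph be reported as an effective "dimension".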

82 Experiments: Datasets. Google Web Graph: 916,428 nodes and 5,105,039 edges. Lucent Router Graph: undirected graph of network routers from ...; 112,969 nodes and 181,639 edges. User Website Clickstream Graph: 222,704 nodes and 952,580 edges. NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy

83 Experiments. Used the METIS algorithm [Karypis, Kumar, 1995]. Google Web graph. [Plot: log (mincut-size / #edges) vs. log (# edges); slope ~ -0.4.] Values along the y-axis are averaged. We observe a lip for large # edges. A slope of -0.4 corresponds to a 2.5-dimensional grid!

86 Conclusions: Practitioner's guide. Hard clustering, k pieces: METIS. Hard co-clustering, (k,l) pieces: Co-clustering. Hard clustering, optimal # pieces: Cross-associations. Observations: 'jellyfish': maybe there are no good cuts.

91 Epidemic threshold τ of a graph: the value of τ such that, if strength s = β / δ < τ, an epidemic cannot happen. Thus: given a graph, compute its epidemic threshold.

94 Epidemic threshold [Theorem 1]. We have no epidemic (*) if $\beta/\delta < \tau = 1/\lambda_{1,A}$, where β is the attack prob., δ is the recovery prob., τ is the epidemic threshold, and $\lambda_{1,A}$ is the largest eigenvalue of the adj. matrix A. Proof: [Wang+03]. (*) under mild, conditional-independence assumptions.

95 Beginning of proof. Healthy @ t+1: (healthy or healed) and not attacked @ t. Let p(i, t) = Prob node i is sick @ t. Then $1 - p(i, t+1) = \bigl(1 - p(i,t) + p(i,t)\,\delta\bigr) \prod_j \bigl(1 - \beta\, a_{ji}\, p(j,t)\bigr)$. Below threshold, if the non-linear dynamical system above is stable (eigenvalue of Hessian < 1).
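The recurrence from the proof can be iterated directly to see the threshold behavior. This is a sketch assuming NumPy, a star graph with 16 leaves (so the largest eigenvalue is 4 and the threshold is 0.25), and arbitrarily chosen β, δ, and starting probabilities; none of these numbers come from the slides.

```python
import numpy as np

def step(p, A, beta, delta):
    """One step of: 1 - p(i,t+1) = (1 - p(i,t) + p(i,t)*delta) * prod_j (1 - beta*a_ji*p(j,t))."""
    # not_attacked[i] = prod over j of (1 - beta * A[j, i] * p[j])
    not_attacked = np.prod(1.0 - beta * A.T * p[None, :], axis=1)
    return 1.0 - (1.0 - p + p * delta) * not_attacked

def simulate(A, beta, delta, n_steps=500, p0=0.5):
    p = np.full(A.shape[0], p0)
    for _ in range(n_steps):
        p = step(p, A, beta, delta)
    return p

# Star graph: node 0 is the center, nodes 1..16 are leaves.
d = 16
A = np.zeros((d + 1, d + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
tau = 1.0 / np.max(np.linalg.eigvalsh(A))          # 1 / sqrt(16) = 0.25

below = simulate(A, beta=0.05, delta=0.5)          # beta/delta = 0.1 < tau
above = simulate(A, beta=0.40, delta=0.5)          # beta/delta = 0.8 > tau
print(f"tau = {tau:.2f}, max p below = {below.max():.1e}, max p above = {above.max():.2f}")
```

Below the threshold the infection probabilities decay toward zero; above it they settle at a non-zero level.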

96 Epidemic threshold for various networks. Formula includes older results as special cases. Homogeneous networks [Kephart+White]: $\lambda_{1,A} = \langle k \rangle$; $\tau = 1/\langle k \rangle$ ($\langle k \rangle$: avg degree). Star networks (d = degree of center): $\lambda_{1,A} = \sqrt{d}$; $\tau = 1/\sqrt{d}$. Infinite power-law networks: $\lambda_{1,A} = \infty$; $\tau = 0$ [Barabasi].
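Two of the special cases above are easy to verify numerically. The sketch assumes NumPy and undirected graphs (symmetric adjacency matrices); the helper names and the choice of a ring as the homogeneous example are illustrative.

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / (largest eigenvalue of a symmetric adjacency matrix A)."""
    return 1.0 / np.max(np.linalg.eigvalsh(A))

def star(d):
    """Star network: node 0 is the center with degree d."""
    A = np.zeros((d + 1, d + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

def ring(n):
    """Homogeneous network: every node has degree 2."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0
    A[(idx + 1) % n, idx] = 1.0
    return A

print(epidemic_threshold(star(16)))   # 1 / sqrt(16)
print(epidemic_threshold(ring(10)))   # 1 / <k> with <k> = 2
```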

104 Conclusions. $\lambda_{1,A}$: the eigenvalue of the adjacency matrix determines the survival of a flu-like virus. It gives a measure of how well connected the graph is (~ # paths). May guide immunization policies.

105 References. D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos, Epidemic Thresholds in Real Networks, in ACM TISSEC, 10(4), 2008. Ganesh, A., Massoulie, L., and Towsley, D., The effect of network topology on the spread of epidemics. In INFOCOM.

106 References (cont'd). Hethcote, H. W. The mathematics of infectious diseases. SIAM Review 42. Hethcote, H. W. and Yorke, J. A. Gonorrhea Transmission Dynamics and Control. Vol. 56. Springer. Lecture Notes in Biomathematics.

107 References (cont'd). Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, in SRDS 2003 (pages 25-34), Florence, Italy

111 Scalability. Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, Web Search for a Planet: The Google Cluster Architecture, IEEE Micro 2003]. Yahoo: 5Pb of data [Fayyad, KDD 07]. Problem: machine failures, on a daily basis. How to parallelize data mining tasks, then? A: map/reduce; hadoop (open-source clone).

113 2' intro to hadoop. Master-slave architecture; n-way replication (default n=3). Like the 'group by' of SQL (in a parallel, fault-tolerant way). E.g., find the histogram of word frequency: compute local histograms (map), then merge into a global histogram (reduce). select course-id, count(*) from ENROLLMENT group by course-id
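The word-frequency example can be imitated in a single Python process to show the shape of the computation: each split produces a local histogram (the map side), and the local histograms are merged into a global one (the reduce side). This is only a sketch of the pattern using the standard library; it does not use the Hadoop API, and the function names and sample splits are made up for illustration.

```python
from collections import Counter
from functools import reduce

def map_phase(split):
    """Local histogram of word counts for one input split."""
    return Counter(split.split())

def reduce_phase(left, right):
    """Merge two histograms by adding counts per word."""
    return left + right

# Three hypothetical input splits, as if stored on different machines.
splits = [
    "graph mining tools",
    "graph tensors",
    "mining graph patterns",
]

local_histograms = [map_phase(s) for s in splits]       # could run in parallel
global_histogram = reduce(reduce_phase, local_histograms, Counter())
print(global_histogram.most_common())
```

Because merging counts is associative and commutative, the local histograms can be combined in any order, which is what makes it safe to rerun or reassign work after a machine failure.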

115 [Figure: map/reduce data flow. The User Program forks the Master and the workers; the Master assigns map and reduce tasks. Input Data (on HDFS) is divided into Split 0, Split 1, Split 2; each Mapper reads a split and does a local write; each Reducer does a remote read and sort, then writes Output File 0 or Output File 1.] By default: 3-way replication. Late/dead machines: ignored, transparently (!)

D.I.S.C. Data Intensive Scientific Computing [R. Bryant, CMU] big data www.cs.

118 Outline: Algorithms & results (Centralized / Hadoop-PEGASUS). Degree Distr.: old / old. Pagerank: old / old. Diameter/ANF: old / DONE. Conn. Comp: old / DONE. Triangles: DONE. Visualization: STARTED.

119 HADI for diameter estimation. Radius Plots for Mining Tera-byte Scale Graphs, U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM 10. Naively: diameter needs O(N**2) space and up to O(N**3) time: prohibitive (N~1B). Our HADI: linear on E (~10B). Near-linear scalability wrt # machines. Several optimizations -> 5x faster.

Count 14 (dir.)???? ~7 (undir.) 19+? [Barabasi+] Radius YahooWeb graph (120Gb, 1.4B nodes, 6.

125 Running time (details): Kronecker and Erdos-Renyi graphs with billions of edges.


127 Generalized Iterated Matrix Vector Multiplication (GIMV). PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up).

128 Generalized Iterated Matrix Vector Multiplication (GIMV), details. PageRank, proximity (RWR), diameter, connected components (eigenvectors, Belief Prop., ...): all are instances of (iterated) matrix-vector multiplication.
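One way to see how such different tasks share a single primitive is to replace the "multiply" and "sum" of a matrix-vector product with user-supplied functions. The sketch below is plain Python on an edge list; the hook names `combine2`, `combine_all`, and `assign` are assumptions based on my recollection of the PEGASUS paper's terminology, and the tiny graphs, damping factor, and iteration counts are invented for illustration. It is a single-machine sketch, not the Hadoop implementation.

```python
import math

def gimv(entries, v, combine2, combine_all, assign, n_iter):
    """Iterate v <- M (x) v, where entries holds (i, j, m_ij) for non-zero M[i][j]."""
    for _ in range(n_iter):
        partial = {}
        for i, j, m in entries:
            partial.setdefault(i, []).append(combine2(m, v[j]))   # "multiply"
        v = [assign(v[i], combine_all(partial.get(i, [])))        # "sum", then update
             for i in range(len(v))]
    return v

# Connected components: propagate the minimum node id along edges.
undirected = [(0, 1), (1, 2), (3, 4)]
cc_entries = [(a, b, 1) for a, b in undirected] + [(b, a, 1) for a, b in undirected]
components = gimv(
    cc_entries, list(range(5)),
    combine2=lambda m, vj: vj,
    combine_all=lambda xs: min(xs) if xs else math.inf,
    assign=min,
    n_iter=5,
)

# PageRank: M[i][j] = 1/outdeg(j) for each edge j -> i; damping c = 0.85.
c, n = 0.85, 3
pr_entries = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0), (0, 2, 1.0)]
pagerank = gimv(
    pr_entries, [1.0 / n] * n,
    combine2=lambda m, vj: c * m * vj,
    combine_all=sum,
    assign=lambda vi, s: (1 - c) / n + s,
    n_iter=100,
)
print(components, [round(x, 3) for x in pagerank])
```

With ordinary multiplication, summation, and a plain overwrite as the three hooks, the loop reduces to a standard matrix-vector product.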

136 References. Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD 2008

137 Overall Conclusions. Real graphs exhibit surprising patterns (power laws, shrinking diameter, etc.). SVD: a powerful tool (HITS, PageRank). Several other tools: tensors, METIS, ... Scalability: hadoop/parallelism.

139 Project info. Chau, Polo; McGlohon, Mary; Tsourakakis, Babis; Akoglu, Leman; Kang, U; Prakash, Aditya; Tong, Hanghang. Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, INTEL, HP

142 E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU. NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos (WWW'07)

147 OddBall: Spotting Anomalies in Weighted Graphs Leman Akoglu, Mary McGlohon, Christos Faloutsos Carnegie Mellon University School of Computer Science PAKDD 2010, Hyderabad, India

150 Selected Features. N_i: number of neighbors (degree) of ego i. E_i: number of edges in egonet i. W_i: total weight of egonet i. λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i.
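The four egonet features can be computed from a weighted adjacency matrix as in the sketch below. It assumes NumPy, an undirected graph stored as a symmetric matrix with a zero diagonal, and takes the egonet of i to be node i, its neighbors, and all edges among them; the function name and the toy graph are illustrative.

```python
import numpy as np

def egonet_features(W, i):
    """Return (N_i, E_i, W_i, lambda_w_i) for ego i of a symmetric weighted graph W."""
    W = np.asarray(W, dtype=float)
    neighbors = np.flatnonzero(W[i])                 # nodes adjacent to i
    nodes = np.concatenate(([i], neighbors))         # ego plus neighbors
    sub = W[np.ix_(nodes, nodes)]                    # weighted adjacency of the egonet
    upper = np.triu(sub, k=1)                        # count each undirected edge once
    n_i = len(neighbors)
    e_i = int(np.count_nonzero(upper))
    w_i = float(upper.sum())
    lam = float(np.max(np.linalg.eigvalsh(sub)))     # principal eigenvalue
    return n_i, e_i, w_i, lam

# Toy graph: triangle 0-1-2 with weight 2 on every edge, plus node 3 tied to node 1 (weight 5).
W = np.zeros((4, 4))
for a, b, weight in [(0, 1, 2.0), (0, 2, 2.0), (1, 2, 2.0), (1, 3, 5.0)]:
    W[a, b] = W[b, a] = weight

print(egonet_features(W, 0))   # egonet {0, 1, 2}
print(egonet_features(W, 3))   # egonet {3, 1}
```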
