CMU SCS. Large Graph Mining. Christos Faloutsos CMU

Similar documents
Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to ge

Data mining in large graphs

CMU SCS Mining Billion-node Graphs: Patterns, Generators and Tools

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication

CMU SCS : Multimedia Databases and Data Mining CMU SCS Must-read Material CMU SCS Must-read Material (cont d)

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity

Faloutsos, Tong ICDE, 2009

arxiv:physics/ v3 [physics.soc-ph] 28 Jan 2007

Dynamics of Real-world Networks

Must-read material (1 of 2) : Multimedia Databases and Data Mining. Must-read material (2 of 2) Main outline

CMU SCS Mining Large Graphs: Patterns, Anomalies, and Fraud Detection

RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs

Finding patterns in large, real networks

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos


Link Prediction. Eman Badr Mohammed Saquib Akmal Khan

Analysis & Generative Model for Trust Networks

Large Graph Mining: Power Tools and a Practitioner s guide

Subgraph Detection Using Eigenvector L1 Norms

Supporting Statistical Hypothesis Testing Over Graphs

RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs

Talk 2: Graph Mining Tools - SVD, ranking, proximity. Christos Faloutsos CMU

Online Sampling of High Centrality Individuals in Social Networks

arxiv: v2 [stat.ml] 21 Aug 2009

CS224W: Methods of Parallelized Kronecker Graph Generation

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

RaRE: Social Rank Regulated Large-scale Network Embedding

CMU SCS Talk 3: Graph Mining Tools Tensors, communities, parallelism

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search

Time Series Mining and Forecasting

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Introduction to Data Mining

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Time Series. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. Mining and Forecasting

Jure Leskovec Stanford University

Large Graph Mining: Power Tools and a Practitioner s guide

Link Analysis Ranking

Power Laws & Rich Get Richer

Slides based on those in:

Quilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

Problem (INFORMAL). Given a dynamic graph, find a set of possibly overlapping temporal subgraphs to concisely describe the given dynamic graph in a

CS224W: Social and Information Network Analysis

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Data Mining Techniques

IR: Information Retrieval

Analytic Theory of Power Law Graphs

Weighted graphs and disconnected components

Link Analysis. Leonid E. Zhukov

Weighted Graphs and Disconnected Components

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Multimedia Databases - 68A6 Final Term - exercises

CS6220: DATA MINING TECHNIQUES

On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

ON THE NP-COMPLETENESS OF SOME GRAPH CLUSTER MEASURES


Time Series. CSE 6242 / CX 4242 Mar 27, Duen Horng (Polo) Chau Georgia Tech. Nonlinear Forecasting; Visualization; Applications

Network Analysis and Modeling

1 Searching the World Wide Web

Outline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C.

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Scaling Neighbourhood Methods

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Complex Social System, Elections. Introduction to Network Analysis 1

Jure Leskovec Joint work with Jaewon Yang, Julian McAuley

CS60021: Scalable Data Mining. Dimensionality Reduction

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Integrated Anchor and Social Link Predictions across Social Networks

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Data Mining and Matrices

Spectral Analysis of k-balanced Signed Graphs

Social-Affiliation Networks: Patterns and the SOAR Model

Math 304 Handout: Linear algebra, graphs, and networks.

Inf 2B: Ranking Queries on the WWW

Com2: Fast Automatic Discovery of Temporal ( Comet ) Communities

Fast Direction-Aware Proximity for Graph Mining

THE POWER OF CERTAINTY

Unsupervised Anomaly Detection for High Dimensional Data

Anomaly Detection. Davide Mottin, Konstantina Lazaridou. HassoPlattner Institute. Graph Mining course Winter Semester 2016

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Long Term Evolution of Networks CS224W Final Report

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Lecture 7 Mathematics behind Internet Search

A spectral algorithm for computing social balance

CS249: ADVANCED DATA MINING

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Idea: Select and rank nodes w.r.t. their relevance or interestingness in large networks.

Analysis of Large Multi-modal Social Networks: Patterns and a Generator

Transcription:

Large Graph Mining Christos Faloutsos CMU

Thank you! Hillol Kargupta NGDM 2007 C. Faloutsos 2

Outline Problem definition / Motivation Static & dynamic laws; generators Tools: CenterPiece graphs; fraud detection Conclusions NGDM 2007 C. Faloutsos 3

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to generate realistic graphs TOOLS Problem#4: Who is the master-mind? Problem#5: Fraud detection NGDM 2007 C. Faloutsos 4

Problem#1: Joint work with Dr. Deepayan Chakrabarti (CMU/Yahoo R.L.) NGDM 2007 C. Faloutsos 5

Graphs - why should we care? Internet Map [lumeta.com] Food Web [Martinez 91] Friendship Network [Moody 01] Protein Interactions [genomebiology.com] NGDM 2007 C. Faloutsos 6

Graphs - why should we care? IR: bi-partite graphs (doc-terms) D 1...... T 1 web: hyper-text graph D N T M... and more: NGDM 2007 C. Faloutsos 7

Graphs - why should we care? network of companies & board-of-directors members viral marketing web-log ( blog ) news propagation computer network security: email/ip traffic and anomaly detection... NGDM 2007 C. Faloutsos 8

Problem #1 - network and graph mining How does the Internet look like? How does the web look like? What is normal / abnormal? which patterns/laws hold? NGDM 2007 C. Faloutsos 9

Graph mining Are real graphs random? NGDM 2007 C. Faloutsos 10

Laws and patterns Are real graphs random? A: NO!! Diameter in- and out- degree distributions other (surprising) patterns NGDM 2007 C. Faloutsos 11

Solution#1 Power law in the degree distribution [SIGCOMM99] log(degree) ibm.com internet domains att.com -0.82 log(rank) NGDM 2007 C. Faloutsos 12

Solution#1 : Eigen Exponent E Eigenvalue Exponent = slope E = -0.48 May 2001 Rank of decreasing eigenvalue A2: power law in the eigenvalues of the adjacency matrix NGDM 2007 C. Faloutsos 13

But: How about graphs from other domains? NGDM 2007 C. Faloutsos 14

More power laws: web hit counts [w/ A. Montgomery] log(count) Web Site Traffic Zipf ``ebay users sites log(in-degree) NGDM 2007 C. Faloutsos 15

epinions.com count who-trusts-whom [Richardson + Domingos, KDD 2001] trusts-2000-people user (out) degree NGDM 2007 C. Faloutsos 16

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to generate realistic graphs TOOLS Problem#4: Who is the master-mind? Problem#5: Fraud detection NGDM 2007 C. Faloutsos 17

Problem#2: Time evolution with Jure Leskovec (CMU/MLD) and Jon Kleinberg (Cornell sabb. @ CMU) NGDM 2007 C. Faloutsos 18

Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N) diameter ~ O(log log N) What is happening in real data? NGDM 2007 C. Faloutsos 19

Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N) diameter ~ O(log log N) What is happening in real data? Diameter shrinks over time NGDM 2007 C. Faloutsos 20

Diameter ArXiv citation graph Citations among physics papers 1992 2003 One graph per year diameter time [years] NGDM 2007 C. Faloutsos 21

Diameter Autonomous Systems Graph of Internet One graph per day 1997 2000 diameter number of nodes NGDM 2007 C. Faloutsos 22

Diameter Affiliation Network Graph of collaborations in physics authors linked to papers 10 years of data diameter time [years] NGDM 2007 C. Faloutsos 23

Diameter Patents Patent citation network 25 years of data diameter time [years] NGDM 2007 C. Faloutsos 24

Temporal Evolution of the Graphs N(t) nodes at time t E(t) edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) NGDM 2007 C. Faloutsos 25

Temporal Evolution of the Graphs N(t) nodes at time t E(t) edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) A: over-doubled! But obeying the ``Densification Power Law NGDM 2007 C. Faloutsos 26

Densification Physics Citations Citations among physics papers 2003: 29,555 papers, 352,807 citations E(t)?? NGDM 2007 C. Faloutsos 27 N(t)

Densification Physics Citations Citations among physics papers 2003: 29,555 papers, 352,807 citations E(t) 1.69 NGDM 2007 C. Faloutsos 28 N(t)

Densification Physics Citations Citations among physics papers 2003: 29,555 papers, 352,807 citations E(t) 1.69 1: tree NGDM 2007 C. Faloutsos 29 N(t)

Densification Physics Citations Citations among physics papers 2003: 29,555 papers, 352,807 citations E(t) clique: 2 1.69 NGDM 2007 C. Faloutsos 30 N(t)

Densification Patent Citations Citations among patents granted 1999 2.9 million nodes 16.5 million edges Each year is a datapoint E(t) 1.66 N(t) NGDM 2007 C. Faloutsos 31

Densification Autonomous Systems Graph of Internet 2000 6,000 nodes 26,000 edges One graph per day E(t) 1.18 N(t) NGDM 2007 C. Faloutsos 32

Densification Affiliation Authors linked to their publications 2002 60,000 nodes 20,000 authors 38,000 papers 133,000 edges Network E(t) 1.15 N(t) NGDM 2007 C. Faloutsos 33

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to generate realistic graphs TOOLS Problem#4: Who is the master-mind? Problem#5: Fraud detection NGDM 2007 C. Faloutsos 34

Problem#3: Generation Given a growing graph with count of nodes N 1, N 2, Generate a realistic sequence of graphs that will obey all the patterns NGDM 2007 C. Faloutsos 35

Problem Definition Given a growing graph with count of nodes N 1, N 2, Generate a realistic sequence of graphs that will obey all the patterns Static Patterns Power Law Degree Distribution Power Law eigenvalue and eigenvector distribution Small Diameter Dynamic Patterns Growth Power Law Shrinking/Stabilizing Diameters NGDM 2007 C. Faloutsos 36

Problem Definition Given a growing graph with count of nodes N 1, N 2, Generate a realistic sequence of graphs that will obey all the patterns Idea: Self-similarity Leads to power laws Communities within communities NGDM 2007 C. Faloutsos 37

Kronecker Product a Graph Intermediate stage Adjacency matrix Adjacency matrix NGDM 2007 C. Faloutsos 38

Kronecker Product a Graph Continuing multiplying with G 1 we obtain G 4 and so on G 4 adjacency matrix NGDM 2007 C. Faloutsos 39

Kronecker Product a Graph Continuing multiplying with G 1 we obtain G 4 and so on G 4 adjacency matrix NGDM 2007 C. Faloutsos 40

Kronecker Product a Graph Continuing multiplying with G 1 we obtain G 4 and so on G 4 adjacency matrix NGDM 2007 C. Faloutsos 41

We can PROVE that Properties: Degree distribution is multinomial ~ power law Diameter: constant Eigenvalue distribution: multinomial First eigenvector: multinomial See [Leskovec+, PKDD 05] for proofs NGDM 2007 C. Faloutsos 42

Problem Definition Given a growing graph with nodes N 1, N 2, Generate a realistic sequence of graphs that will obey all the patterns Static Patterns Power Law Degree Distribution Power Law eigenvalue and eigenvector distribution Small Diameter Dynamic Patterns Growth Power Law Shrinking/Stabilizing Diameters First and only generator for which we can prove all these properties NGDM 2007 C. Faloutsos 43

(Q: how to fit the parm s?) A: Stochastic version of Kronecker graphs + Max likelihood + Metropolis sampling [Leskovec+, ICML 07] NGDM 2007 C. Faloutsos 44

Experiments on real AS graph Degree distribution Hop plot Adjacency matrix eigen values Network value NGDM 2007 C. Faloutsos 45

Conclusions Kronecker graphs have: All the static properties Heavy tailed degree distributions Small diameter All the temporal properties Densification Power Law Shrinking/Stabilizing Diameters We can formally prove these results Multinomial eigenvalues and eigenvectors NGDM 2007 C. Faloutsos 46

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to generate realistic graphs TOOLS Problem#4: Who is the master-mind? Problem#5: Fraud detection NGDM 2007 C. Faloutsos 47

Problem#4: MasterMind CePS w/ Hanghang Tong, KDD 2006 htong <at> cs.cmu.edu NGDM 2007 C. Faloutsos 48

Center-Piece Subgraph(Ceps) Given Q query nodes Find Center-piece ( b ) App. Social Networks Law Inforcement, B Idea: Proximity -> random walk with restarts NGDM 2007 C. Faloutsos 49 A C

Case Study: AND query R. Agrawal Jiawei Han V. Vapnik M. Jordan NGDM 2007 C. Faloutsos 50

Case Study: AND query 15 H.V. Jagadish 10 Laks V.S. Lakshmanan 13 R. Agrawal Jiawei Han 10 Heikki 1 1 Mannila 2 6 1 Christos Padhraic 1 1 Faloutsos Smyth 1 V. Vapnik 3 M. Jordan 1 4 Corinna 6 Daryl Cortes Pregibon NGDM 2007 C. Faloutsos 51

Case Study: AND query 15 H.V. Jagadish 10 Laks V.S. Lakshmanan 13 R. Agrawal Jiawei Han 10 Heikki 1 1 Mannila 2 6 1 Christos Padhraic 1 1 Faloutsos Smyth 1 V. Vapnik 3 M. Jordan 1 4 Corinna 6 Daryl Cortes Pregibon NGDM 2007 C. Faloutsos 52

B Conclusions Q1:How to measure the importance? A1: RWR+K_SoftAnd Q2:How to do it efficiently? A2:Graph Partition (Fast CePS) ~90% quality 150x speedup (ICDM 06) A C NGDM 2007 C. Faloutsos 53

Motivation Data mining: ~ find patterns (rules, outliers) Problem#1: How do real graphs look like? Problem#2: How do they evolve? Problem#3: How to generate realistic graphs TOOLS Problem#4: Who is the master-mind? Problem#5: Fraud detection NGDM 2007 C. Faloutsos 54

E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU NGDM 2007 C. Faloutsos 55

E-bay Fraud detection lines: positive feedbacks would you buy from him/her? NGDM 2007 C. Faloutsos 56

E-bay Fraud detection lines: positive feedbacks would you buy from him/her? or him/her? NGDM 2007 C. Faloutsos 57

E-bay Fraud detection - NetProbe NGDM 2007 C. Faloutsos 58

OVERALL CONCLUSIONS Graphs pose a wealth of fascinating problems self-similarity and power laws work, when textbook methods fail! New patterns (shrinking diameter!) New generator: Kronecker NGDM 2007 C. Faloutsos 59

Reaching out Promising directions Sociology, epidemiology; physics, ++ Computer networks, security, intrusion det. Num. analysis (tensors) time IP-source IP-destination NGDM 2007 C. Faloutsos 60

Promising directions cont d Scaling up, to Gb/Tb/Pb Storage Systems Parallelism (hadoop/map-reduce) NGDM 2007 C. Faloutsos 61

E.g.: self-* system @ CMU >200 nodes 40 racks of computing equipment 774kw of power. target: 1 PetaByte goal: self-correcting, selfsecuring, self-monitoring, self-... NGDM 2007 C. Faloutsos 62

DM for Tera- and Peta-bytes Two-way street: <- DM can use such infrastructures to find patterns -> DM can help such infrastructures become self-healing, self-adjusting, self-* NGDM 2007 C. Faloutsos 63

References Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan Fast Random Walk with Restart and Its Applications ICDM 2006, Hong Kong. Hanghang Tong, Christos Faloutsos Center-Piece Subgraphs: Problem Definition and Fast Solutions, KDD 2006, Philadelphia, PA NGDM 2007 C. Faloutsos 64

References Jure Leskovec, Jon Kleinberg and Christos Faloutsos Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations KDD 2005, Chicago, IL. ("Best Research Paper" award). Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication (ECML/PKDD 2005), Porto, Portugal, 2005. NGDM 2007 C. Faloutsos 65

References Jure Leskovec and Christos Faloutsos, Scalable Modeling of Real Graphs using Kronecker Multiplication, ICML 2007, Corvallis, OR, USA Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang and Christos Faloutsos NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks WWW 2007, Banff, Alberta, Canada, May 8-12, 2007. Jimeng Sun, Dacheng Tao, Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis, KDD 2006, Philadelphia, PA NGDM 2007 C. Faloutsos 66

References Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM, Minneapolis, Minnesota, Apr 2007. [pdf] NGDM 2007 C. Faloutsos 67

Contact info: www. cs.cmu.edu /~christos (w/ papers, datasets, code, etc) NGDM 2007 C. Faloutsos 68