Graph Analysis Using Map/Reduce

Seminar: Massive-Scale Graph Analysis, Summer Semester 2015
Graph Analysis Using Map/Reduce
Laurent Linden, s9lalind@stud.uni-saarland.de
May 14, 2015

Introduction
Peta-scale graphs + graph mining + MapReduce = PEGASUS

Related Work
- PageRank (Google)
- Random Walk with Restart (Pan et al.)
- Diameter Estimation (Kang et al.)
- Connected Components (Kang et al.)

Problems
- No unified solution
- Individual, per-algorithm optimizations
- Hard to understand

Ideas
- Find a common primitive
- Translate the mining tasks into it
- Solve the generalized problem
- Optimize the solution once

PEGASUS
- Implements the common primitive
- Users specify:
  - a stopping criterion (e.g. $y^{t+1} = y^t$)
  - three operations (combine2, combineAll, assign)
- Provides MapReduce jobs for PageRank (PR), Random Walk with Restart (RWR), Diameter Estimation (DE), and Connected Components (CC)

Generalized Iterative Matrix-Vector Multiplication
- Graph = adjacency matrix:
  $E = \begin{pmatrix} e_{11} & e_{12} & \cdots & e_{1n} \\ e_{21} & & \cdots & e_{2n} \\ \vdots & & & \vdots \\ e_{n1} & e_{n2} & \cdots & e_{nn} \end{pmatrix}$
- Graph mining = generalized matrix-vector multiplication:
  $y_i = \sum_{j=1}^{n} e_{i,j}\, y_j$
- Iterative solution finding:
  $y^{t+1} = E \times_G y^t$

GIM-V
The user supplies three operations:
- combine2$(e_{i,j}, y_j^t)$
- combineAll$_i(\{\ldots\})$
- assign$(y_i^t, y_i^{t+1})$
Ordinary multiplication: $y_i^{t+1} = \sum_{j=1}^{n} e_{i,j}\, y_j^t$
In general:
$y_i^{t+1} = \mathrm{assign}\big(y_i^t,\ \mathrm{combineAll}_i(\{ z_j \mid j = 1..n,\ z_j = \mathrm{combine2}(e_{i,j}, y_j^t) \})\big)$
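To make the primitive concrete, here is a minimal in-memory sketch of GIM-V in Python (function and variable names are illustrative, not from the PEGASUS code base). It applies combine2, combineAll, and assign exactly as in the formula above and iterates until the vector stops changing:

```python
def gim_v_step(edges, v, combine2, combine_all, assign):
    """One GIM-V step: v' = M x_G v.

    edges: dict mapping row id i -> list of (j, m_ij) non-zero entries
    v:     dict mapping node id  -> current vector value
    """
    new_v = {}
    for i in v:
        # z_j = combine2(e_{i,j}, y_j) over the non-zero entries of row i
        partials = [combine2(m_ij, v[j]) for (j, m_ij) in edges.get(i, [])]
        # combineAll may depend on the row index i (e.g. PageRank's teleport term)
        combined = combine_all(i, partials)
        new_v[i] = assign(v[i], combined)
    return new_v


def gim_v(edges, v, combine2, combine_all, assign, max_iter=100, tol=0.0):
    """Iterate y^{t+1} = M x_G y^t until y^{t+1} = y^t (up to tol)."""
    for _ in range(max_iter):
        new_v = gim_v_step(edges, v, combine2, combine_all, assign)
        converged = all(abs(new_v[i] - v[i]) <= tol for i in v)
        v = new_v
        if converged:
            break
    return v
```

The following examples (PageRank, RWR, diameter estimation, connected components) only swap in different triples of operations.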

Example: PageRank
- Rank pages by relevance
- Start out with the uniform distribution
- Compute the next step via the power method
Source: wikipedia.org/wiki/PageRank

Example: PageRank
- E: column-normalized adjacency matrix, $E = (e_{i,j})$
- U: uniform matrix, $U = (\frac{1}{n})_{i,j}$
- c: damping factor, $0 \le c \le 1$
- p: PageRank vector
Satisfies the equation $p = (c\,E + (1-c)\,U)\, p$
Iterative scheme: $p_i^0 = \frac{1}{n}$, $p^{t+1} = M \times_G p^t$

Example: PageRank
$p^{t+1} = M \times_G p^t$ with $M = c\,E + (1-c)\,U$:
- $\mathrm{combine2}(e_{i,j}, p_j^t) = c \cdot e_{i,j} \cdot p_j^t$
- $\mathrm{combineAll}_i(\{x_1, \ldots, x_n\}) = \frac{1-c}{n} + \sum_{j=1}^{n} x_j$
- $\mathrm{assign}(p_i^t, p_i^{t+1}) = p_i^{t+1}$
$p_i^{t+1} = \mathrm{assign}\big(p_i^t,\ \mathrm{combineAll}_i(\{ x_j \mid j = 1..n,\ x_j = \mathrm{combine2}(e_{i,j}, p_j^t) \})\big)$
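A sketch of the PageRank triple under the conventions above, reusing the gim_v helper from the earlier sketch (make_pagerank_ops and the tiny example graph are my own illustration; edges must hold the non-zero entries of the column-normalized matrix, i.e. $e_{i,j} = 1/\mathrm{outdeg}(j)$ for each edge $j \to i$):

```python
def make_pagerank_ops(n, c=0.85):
    """PageRank as GIM-V callbacks; n = number of nodes, c = damping factor."""
    def combine2(e_ij, p_j):
        return c * e_ij * p_j                 # c * e_{i,j} * p_j

    def combine_all(i, partials):
        return (1.0 - c) / n + sum(partials)  # teleport term + summed partials

    def assign(p_old, p_new):
        return p_new                          # simply take the new value

    return combine2, combine_all, assign


# Hypothetical 3-node graph with edges 0->1, 1->0, 1->2, 2->0:
edges = {0: [(1, 0.5), (2, 1.0)], 1: [(0, 1.0)], 2: [(1, 0.5)]}
p0 = {i: 1 / 3 for i in range(3)}
ranks = gim_v(edges, p0, *make_pagerank_ops(n=3), tol=1e-9)
```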

Example: Random Walk with Restart
- Measures proximity of nodes
- Start with an arbitrary distribution
- Refine the distribution until the results converge
Source: Pan et al., "Fast Random Walk with Restart and Its Applications"

Example: Random Walk with Restart
- E: column-normalized adjacency matrix, $E = (e_{i,j})$
- $e_k$: vector with the k-th component = 1, $e_k = (0, \ldots, 0, 1, 0, \ldots, 0)$
- c: restart probability, $0 \le c \le 1$
- $r_k$: proximity vector
Satisfies the equation $r_k = c\,E\,r_k + (1-c)\,e_k$
Iterative scheme: $r_k^0 = \text{initial distribution}$, $r_k^{t+1} = M \times_G r_k^t$

Example: Random Walk with Restart
$r_k^{t+1} = M \times_G r_k^t$ with $M = c\,E$ (the restart term $(1-c)\,e_k$ is folded into combineAll):
- $\mathrm{combine2}(e_{i,j}, r_j^t) = c \cdot e_{i,j} \cdot r_j^t$
- $\mathrm{combineAll}_i(\{x_1, \ldots, x_n\}) = (1-c)\,\mathbb{1}(i=k) + \sum_{j=1}^{n} x_j$
- $\mathrm{assign}(r_i^t, r_i^{t+1}) = r_i^{t+1}$
$r_i^{t+1} = \mathrm{assign}\big(r_i^t,\ \mathrm{combineAll}_i(\{ x_j \mid j = 1..n,\ x_j = \mathrm{combine2}(e_{i,j}, r_j^t) \})\big)$
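The same pattern works for RWR, again as a sketch with illustrative names; the only structural change from PageRank is that the restart mass goes entirely to the restart node k:

```python
def make_rwr_ops(k, c=0.85):
    """Random Walk with Restart as GIM-V callbacks.

    k: the restart node; c: the constant in r_k = c*E*r_k + (1-c)*e_k.
    """
    def combine2(e_ij, r_j):
        return c * e_ij * r_j

    def combine_all(i, partials):
        # (1-c) * 1(i = k) + sum of partials
        return (1.0 - c) * (1.0 if i == k else 0.0) + sum(partials)

    def assign(r_old, r_new):
        return r_new

    return combine2, combine_all, assign
```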

Example: Diameter Estimation
- Estimate radius and diameter of subgraphs
- Radius of a node: number of hops to the farthest-away node
- Diameter of a graph: maximal length of a shortest path, over all pairs of nodes
[Figure: example graphs with Radius(3) = 2 and Diameter = 3]

Example: Diameter Estimation
- M: adjacency matrix, $M = (m_{i,j})$
- $b_i$: probabilistic bitstring per node
Solution satisfies the equation $b_i^{h+1} = b_i^h$ (for all i)
Iterative scheme:
$b_i^0 = 0 \ldots 0\,1\,0 \ldots 0$ (only bit i set)
$b_i^{h+1} = b_i^h \;\text{BITWISE-OR}\; \{ b_k^h \mid (i,k) \in E \}$

Example: Diameter Estimation
$b^{h+1} = M \times_G b^h$:
- $\mathrm{combine2}(m_{i,j}, b_j^h) = m_{i,j} \cdot b_j^h$
- $\mathrm{combineAll}_i(\{x_1, \ldots, x_n\}) = \text{BITWISE-OR}\{ x_j \mid j = 1..n \}$
- $\mathrm{assign}(b_i^h, b_i^{h+1}) = \text{BITWISE-OR}(b_i^h, b_i^{h+1})$
$b_i^{h+1} = \mathrm{assign}\big(b_i^h,\ \mathrm{combineAll}_i(\{ x_j \mid j = 1..n,\ x_j = \mathrm{combine2}(m_{i,j}, b_j^h) \})\big)$
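A sketch of the diameter-estimation triple, with Python ints standing in for the bitstrings. In the real algorithm these are probabilistic Flajolet-Martin sketches; plain one-hot bitstrings are assumed here just to make the OR semantics visible:

```python
def make_diameter_ops():
    """Diameter estimation as GIM-V callbacks over integer bitstrings."""
    def combine2(m_ij, b_j):
        return b_j if m_ij else 0          # adjacency entries are 0/1

    def combine_all(i, partials):
        acc = 0
        for x in partials:
            acc |= x                       # BITWISE-OR over the neighbors
        return acc

    def assign(b_old, b_new):
        return b_old | b_new               # BITWISE-OR(b_i^h, b_i^{h+1})

    return combine2, combine_all, assign


# Initial state: only bit i is set in b_i^0, e.g. b0 = {i: 1 << i for i in nodes}.
# After h iterations, the popcount of b_i^h counts the nodes within h hops of i.
```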

Example: Connected Components
- Find connected components
- Assign a weight to each node
- The smallest weight gets propagated to the neighbors
- Terminate when the result does not change anymore
[Figure: example graph in which each node ends up labeled with the smallest node id in its component]

Example: Connected Components
- M: adjacency matrix, $M = (m_{i,j})$
- c: minimum-node vector
Solution satisfies the equation $c_i^{h+1} = c_i^h$ (for all i)
Iterative scheme: $c_i^0 = i$, $c_i^{h+1} = \mathrm{MIN}\{ c_j^h \mid (i,j) \in E \}$

Example: Connected Components
$c^{h+1} = M \times_G c^h$:
- $\mathrm{combine2}(m_{i,j}, c_j^h) = m_{i,j} \cdot c_j^h$
- $\mathrm{combineAll}_i(\{x_1, \ldots, x_n\}) = \mathrm{MIN}\{ x_j \mid j = 1..n \}$
- $\mathrm{assign}(c_i^h, c_i^{h+1}) = \mathrm{MIN}(c_i^h, c_i^{h+1})$
$c_i^{h+1} = \mathrm{assign}\big(c_i^h,\ \mathrm{combineAll}_i(\{ x_j \mid j = 1..n,\ x_j = \mathrm{combine2}(m_{i,j}, c_j^h) \})\big)$
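And the connected-components triple, a direct sketch of MIN-label propagation (names illustrative):

```python
def make_cc_ops():
    """Connected components as GIM-V callbacks (MIN-label propagation)."""
    INF = float("inf")

    def combine2(m_ij, c_j):
        return c_j if m_ij else INF        # only neighbors contribute

    def combine_all(i, partials):
        return min(partials, default=INF)  # smallest neighbor label

    def assign(c_old, c_new):
        return min(c_old, c_new)           # keep the smaller label

    return combine2, combine_all, assign


# Initial state c_i^0 = i; for undirected graphs, store both edge directions:
# labels = gim_v(edges, {i: i for i in nodes}, *make_cc_ops())
```

After convergence, two nodes belong to the same component exactly when they carry the same label.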

GIM-V BASE
- Input: edge file E(id_src, id_dst, mval); vector file V(id, vval)
- Output: vector file V(id, vval)
- Two stages (MapReduce jobs):
  - Stage 1 applies combine2
  - Stage 2 applies combineAll and assign

GIM-V BASE (Stage 1)
Input: M = {(id_src, (id_dst, mval))}, V = {(id, vval)}
Output: V' = {(id_src, combine2(mval, vval))}

Mapper(k, v):
  if (k, v) is of type V:            # (id, vval)
    emit (id, vval)
  if (k, v) is of type M:            # (id_src, (id_dst, mval))
    emit (id_dst, (id_src, mval))

Reducer(k, v[1..m]):
  extract vval from v; emit (k, ("self", vval))
  extract the (id_src, mval) pairs from v and store them in a list A
  foreach (id_src, mval) in A:
    x = combine2(mval, vval)
    emit (id_src, ("others", x))

GIM-V BASE (Stage 2)
Input: V' = {(id_src, vval)}
Output: V = {(id_src, vval)}

Mapper(k, v):                        # identity
  emit (k, v)

Reducer(k, v[1..m]):
  extract vval from v                # the ("self", vval) record
  extract the Stage 1 outputs from v and store them in a list A
  c = combineAll_k(A)
  a = assign(vval, c)
  emit (k, a)

GIM-V BASE (Stage 1), example
- Mapper output: vector records pass through, e.g. (0, 0.15), (1, 0.25), ...; matrix records are re-keyed by id_dst, e.g. (0, (1, 0.5)) becomes (1, (0, 0.5)) and (1, (n, 0.7)) becomes (n, (1, 0.7))
- Reducer for id=1 receives (1, [0.25, (0, 0.5), ...]) and emits (1, ("self", 0.25)) and (0, ("others", combine2(0.5, 0.25))), ...

GIM-V BASE (Stage 2), example
- Mapper (identity) passes through (0, ("self", 0.15)), (1, ("self", 0.25)), (1, ("others", 0.3)), ..., (n, ("others", 0.7)), ...
- Reducer for id=1 receives (1, [("self", 0.25), ("others", 0.3), ...]) and emits (1, assign(0.25, combineAll_1([0.3, ...])))
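To tie the two stages together, here is a single-process simulation of one BASE iteration, with defaultdict standing in for the shuffle phase. The "self"/"others" record tags follow the slides; everything else is an illustrative sketch, not the PEGASUS implementation:

```python
from collections import defaultdict

def gim_v_base_iteration(M, V, combine2, combine_all, assign):
    """One GIM-V BASE iteration as two simulated MapReduce stages.

    M: iterable of (id_src, id_dst, mval) records; V: dict id -> vval.
    """
    # Stage 1 map: vector records keep their key; matrix records are
    # re-keyed by id_dst so they meet the vector value they multiply.
    shuffle1 = defaultdict(list)
    for node_id, vval in V.items():
        shuffle1[node_id].append(("V", vval))
    for id_src, id_dst, mval in M:
        shuffle1[id_dst].append(("M", (id_src, mval)))

    # Stage 1 reduce: emit ("self", vval) plus combine2 partials keyed by id_src.
    shuffle2 = defaultdict(list)
    for node_id, records in shuffle1.items():
        vval = next(val for tag, val in records if tag == "V")
        shuffle2[node_id].append(("self", vval))
        for tag, payload in records:
            if tag == "M":
                id_src, mval = payload
                shuffle2[id_src].append(("others", combine2(mval, vval)))

    # Stage 2 reduce (the map is the identity): combineAll, then assign.
    new_V = {}
    for node_id, records in shuffle2.items():
        vval = next(val for tag, val in records if tag == "self")
        partials = [val for tag, val in records if tag == "others"]
        new_V[node_id] = assign(vval, combine_all(node_id, partials))
    return new_V
```

Running this in a loop with any of the operation triples above reproduces the BASE pipeline.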

GIM-V BASE: Bottlenecks
- Network traffic
- Disk I/O
- Number of iterations

GIM-V BL (Block Multiplication)
- Groups elements into blocks
- M and V can be joined block-wise instead of element-wise
- Guarantees co-location of nearby edges
- Groups of 2 x 2 blocks are saved in a single line:
  (row_block, col_block, row_e1, col_e1, val_e1, ...)
  (id_block, id_e1, vval_e1, ...)
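A minimal sketch of the block grouping (the block size and in-memory layout are simplified versions of the file format above):

```python
from collections import defaultdict

def to_blocks(M, b=2):
    """Group (i, j, mval) matrix records into b x b blocks.

    Returns {(row_block, col_block): [(row_e, col_e, val_e), ...]}; each
    entry corresponds to one line of the block file format, so nearby
    edges are guaranteed to be stored, and shuffled, together.
    """
    blocks = defaultdict(list)
    for i, j, mval in M:
        blocks[(i // b, j // b)].append((i % b, j % b, mval))
    return blocks
```

Vector entries are grouped the same way, keyed by id // b, so a matrix block (row_block, col_block) joins with exactly one vector block, col_block, instead of with individual elements.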

GIM-V CL (Clustered Edges)
- Clusters connected vertices into the same blocks
- Moves connected vertices close to each other
- All-zero blocks no longer need to be serialized
- Preprocessing step (one-time cost)
- Only useful in conjunction with GIM-V BL

GIM-V DI (Diagonal Block Iteration)
- Iterates GIM-V at the block level: repeated block multiplications within a diagonal block replace repeated full GIM-V applications
- A block multiplication is cheaper than a full GIM-V iteration
- Only useful in conjunction with GIM-V BL

GIM-V NR (Node Renumbering)
- Renumber nodes for better convergence
- Find a center node via heuristics (highest degree)
- Renumber the center node to the value of the minimum node
- Preprocessing step (one-time cost)
[Figure: an example graph before and after renumbering]
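A sketch of the renumbering heuristic (illustrative names; degrees are counted over an undirected edge list):

```python
from collections import Counter

def renumber_center(M):
    """GIM-V NR preprocessing sketch: swap the highest-degree node's id
    with the smallest id occurring in the edge list."""
    deg = Counter()
    for i, j, _ in M:
        deg[i] += 1
        deg[j] += 1
    center = max(deg, key=deg.get)        # heuristic center: highest degree
    smallest = min(deg)                   # current minimum node id
    swap = {center: smallest, smallest: center}
    return [(swap.get(i, i), swap.get(j, j), m) for i, j, m in M]
```

For connected components, for example, the minimum label now sits on the hub and reaches most nodes within a few hops, which is where the convergence gain comes from.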

Evaluation
- Evaluation of GIM-V: performance and scalability
- Evaluation of the optimizations, relative to GIM-V BASE

Experimental Setup
Yahoo! M45 cluster:
- 4,000 processors
- 3 TB of memory
- 1.5 PB of disk space
- 27 teraflops (peak performance)
Datasets:
- Kronecker (177 K nodes, 1,977 M edges)
- LinkedIn (7.5 M nodes, 58 M edges)
- Wikipedia (4.4 M nodes, 27 M edges)
- and many more (not shown here)

PR & GIM-V Evaluation
- Dataset: Kronecker
- Notice the impact block multiplication has on performance

CC & GIM-V Evaluation
- Dataset: Kronecker
- Notice the big reduction in the number of iterations

DE & GIM-V Evaluation
- Datasets: LinkedIn, Wikipedia
- Notice the reduction in the number of iterations

Conclusion
- Feasibility of peta-scale graph analysis
- Identification of a common primitive
- Unification of graph-mining algorithms

Q&A