PageRank computation, HPC course project a.y. 2015-16


PageRank computation, HPC course project a.y. 2015-16. Goal: compute an efficient and scalable PageRank using MPI, multithreading, and SSE.

PageRank. PageRank is a link analysis algorithm, introduced by Brin and Page [1] and used by the Google Internet search engine, which assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set [Wikipedia]. [1] Brin, S. and Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh International World Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.

PageRank: the intuitive idea. PageRank relies on the democratic nature of the Web, using its vast link structure as an indicator of an individual page's value or quality. PageRank interprets a hyperlink from page x to page y as a vote, by page x, for page y. However, PageRank looks at more than the sheer number of votes: it also analyzes the page that casts each vote. Votes cast by important pages weigh more heavily and help make the pages they point to more "important". This is exactly the idea of rank prestige in social networks.

More specifically. A hyperlink from one page to another is an implicit conveyance of authority to the target page. The more in-links a page i receives, the more prestige page i has. Pages that point to page i also have their own prestige scores: a page of higher prestige pointing to i counts more than a page of lower prestige pointing to i. In other words, a page is important if it is pointed to by other important pages.

PageRank algorithm. According to rank prestige, the importance of page i (i's PageRank score) is determined by the PageRank scores of all pages that point to i; since a page may point to many other pages, its prestige score is shared among them. Modeling the Web as a directed graph G = (V, E), the PageRank score of page i, denoted P(i), is defined by

P(i) = \sum_{(j,i) \in E} P(j)/O_j

where O_j is the number of out-links of page j.

PageRank algorithm: iterative process
[Figure: a small example graph; each out-edge of a page j is labeled with the transition weight 1/O_j.]

PageRank algorithm: first iteration
[Figure: the scores after one application of the update rule.]

PageRank algorithm: second iteration
[Figure: the scores after two iterations.]

PageRank algorithm: third iteration
[Figure: the scores after three iterations.]

PageRank algorithm: converged scores
[Figure: the converged scores 2/5, 2/5, 1/5.]

At every step, each score is recomputed as P(i) = \sum_{(j,i) \in E} P(j)/O_j, where O_j is the number of out-links of page j, until the score vector stops changing.
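Since the graph in the original figures is not fully recoverable, here is a worked iteration on an assumed three-page graph with edges 1→2, 1→3, 2→3, 3→1 (so O_1 = 2, O_2 = O_3 = 1), starting from the uniform vector P_0 = (1/3, 1/3, 1/3):

P_1(1) = P_0(3)/O_3 = 1/3
P_1(2) = P_0(1)/O_1 = 1/6
P_1(3) = P_0(1)/O_1 + P_0(2)/O_2 = 1/2

Iterating further, the scores converge to P = (2/5, 1/5, 2/5), which can be checked to satisfy P(i) = \sum_{(j,i) \in E} P(j)/O_j exactly.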

Matrix notation. Let n = |V| be the total number of pages. We have a system of n linear equations in n unknowns, which we can represent in matrix form. Let P be the n-dimensional column vector of PageRank values, P = (P(1), P(2), ..., P(n))^T, and let A be the (row-normalized) adjacency matrix of the graph, with

A_{ij} = 1/O_i if (i,j) \in E, and A_{ij} = 0 otherwise.

The n equations P(i) = \sum_{(j,i) \in E} P(j)/O_j can then be written compactly as P = A^T P (the PageRank equation).

Solve the PageRank equation P = A^T P. This is the characteristic equation of an eigensystem: a solution P is an eigenvector of A^T with eigenvalue 1. It turns out that if certain conditions are satisfied, 1 is the largest eigenvalue and the PageRank vector P is the principal eigenvector. A well-known mathematical technique called power iteration can then be used to find P. Problem: the above equation does not quite suffice, because the Web graph does not meet these conditions.

Using Markov chains. To introduce these conditions and the enhanced equation, let us derive the same equation from a Markov-chain perspective. In the Markov chain, each Web page (node of the Web graph) is regarded as a state; a hyperlink is a transition, which leads from one state to another with a certain probability. This framework models Web surfing as a stochastic process: a random walk. It models a Web surfer who randomly surfs the Web as a sequence of state transitions.

Random surfing. Recall that O_i denotes the number of out-links of node i. Each transition probability is 1/O_i if we assume that the Web surfer clicks the hyperlinks on page i uniformly at random, that the browser's back button is not used, and that the surfer never types in a URL directly.

Transition probability matrix. Let A be the state transition probability matrix:

A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{pmatrix}

A_{ij} represents the probability that a surfer in state i (page i) moves to state j (page j). Can A be the adjacency matrix previously discussed?

Let us start. Given an initial probability distribution vector p_0 = (p_0(1), p_0(2), ..., p_0(n))^T (a column vector, giving the probability that the surfer starts at each state/page) and an n × n transition probability matrix A, we require

\sum_{i=1}^{n} p_0(i) = 1 \quad and \quad \sum_{j=1}^{n} A_{ij} = 1   (1)

If the matrix A satisfies Equation (1), we say that A is the stochastic matrix of a Markov chain.

Back to the Markov chain. In a Markov chain, a question of common interest is: what is the probability that, after m steps/transitions (with m → ∞), a random walker reaches state j, independently of the initial state of the walk? We determine the probability that the random surfer arrives at state/page j after one step (one transition) as follows:

p_1(j) = \sum_{i=1}^{n} A_{ij} \, p_0(i)

where A_{ij} is the probability of going from i to j in one step. At the beginning, p_0(i) = 1/n for all i.

State transition. We can write this in matrix form: p_1 = A^T p_0. In general, the probability distribution after k steps/transitions is p_k = A^T p_{k-1}, and hence p_k = (A^T)^k p_0.

Stationary probability distribution. By the ergodic theorem for Markov chains, a finite Markov chain defined by a stochastic matrix A has a unique stationary probability distribution if A is irreducible and aperiodic. Stationarity means that, after a series of transitions, p_k converges to a steady-state probability vector π:

\lim_{k \to \infty} p_k = \pi

PageRank again. When we reach the steady state we have p_k = p_{k+1} = π, and thus π = A^T π: π is the principal eigenvector of A^T (the one corresponding to the eigenvalue of largest magnitude), with eigenvalue 1. In PageRank, π is used as the PageRank vector P: P = A^T P.

Is P = π justified? Using the stationary probability distribution π as the PageRank vector is reasonable and quite intuitive, because it reflects the long-run probability that a random surfer visits each page: a page has high prestige if the probability of visiting it is high.

Back to the Web graph. Let us now return to the real Web context and check whether the above conditions are satisfied, i.e., whether A is a stochastic matrix and whether it is irreducible and aperiodic. None of them is satisfied, so we need to extend the ideal model to produce the actual PageRank model.

A is not a stochastic matrix. A is the transition matrix of the Web graph, with

A_{ij} = 1/O_i if (i,j) \in E, and A_{ij} = 0 otherwise.

It does not satisfy the equation \sum_{j=1}^{n} A_{ij} = 1, because many Web pages have no out-links (dangling pages). This is reflected in the transition matrix A by rows of all zeros.

An example Web hyperlink graph:

A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}

Node 5 is a dangling page: its row is all zeros.

Fixing the problem: two possible ways. 1. Remove pages with no out-links during the PageRank computation; these pages do not directly affect the ranking of any other page. 2. Add a complete set of outgoing links from each such page to all pages on the Web. Using the second method, row 5 becomes uniform:

A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}

A is not irreducible. Irreducible means that the Web graph G is strongly connected. Definition: a directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v \in V, there is a directed path from u to v. A general Web graph represented by A is not irreducible, because for some pairs of nodes u and v there is no path from u to v. In our example, there is no directed path from node 3 to node 4.

A is not aperiodic. A state i of a Markov chain is periodic if every walk from i back to i is forced to follow cycles whose lengths share a common divisor greater than 1. Definition: a state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k. A Markov chain is aperiodic if all its states are aperiodic.

An example: periodicity. [Figure: a three-state cycle 1 → 2 → 3 → 1.] This is a periodic Markov chain with k = 3. If we start from state 1, the only way to return to state 1 is to traverse the path 1-2-3-1 some number of times, say h; thus any return to state 1 takes k·h = 3h transitions.

Dealing with reducibility and periodicity. Both problems can be handled with a single strategy: add a link from each page to every page, and give each of these links a small transition probability controlled by a parameter d. The augmented transition matrix then becomes irreducible and aperiodic: irreducible because the graph is now strongly connected, and aperiodic because there are now paths of every possible length from state i back to state i.

Improved PageRank. After this augmentation, at each page the random surfer has two options: with probability d (0 < d < 1) she randomly chooses an out-link to follow; with the remaining probability 1-d she stops clicking and jumps to a random page. The following equation models the improved model:

P = \left( (1-d)\frac{E}{n} + d A^T \right) P

where E is an n × n matrix of all 1's. The division by n is important, since the matrix has to remain stochastic.

Following our example with d = 0.9. The matrix made stochastic, which is still periodic (see state 3) and reducible (no path from 3 to 4), is:

A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}

The transposed, augmented matrix is:

(1-d)\frac{E}{n} + d A^T = \begin{pmatrix} 1/60 & 7/15 & 1/60 & 1/60 & 1/6 & 1/60 \\ 7/15 & 1/60 & 11/12 & 1/60 & 1/6 & 1/60 \\ 7/15 & 7/15 & 1/60 & 19/60 & 1/6 & 1/60 \\ 1/60 & 1/60 & 1/60 & 1/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 1/60 \end{pmatrix}

The final PageRank algorithm. (1-d)E/n + dA^T is a stochastic (transposed), irreducible, and aperiodic matrix. Note that E = e e^T, where e is a column vector of all 1's, and that e^T P = 1, since P is the stationary probability vector π. We can therefore simplify:

P = \left( (1-d)\frac{E}{n} + d A^T \right) P = (1-d)\frac{1}{n} e \, e^T P + d A^T P = (1-d)\frac{1}{n} e + d A^T P

If we scale the equation by multiplying both sides by n, we have e^T P = n and thus:

P = (1-d) e + d A^T P

The final PageRank algorithm (cont.). Given P = (1-d)e + dA^T P, the PageRank of each page i is

P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji} P(j)

which, since

A_{ji} = 1/O_j if (j,i) \in E, and A_{ji} = 0 otherwise,

is equivalent to the formula given in the PageRank paper [BP98]:

P(i) = (1-d) + d \sum_{(j,i) \in E} P(j)/O_j

The parameter d is called the damping factor and can be set between 0 and 1; d = 0.85 was used in the PageRank paper. [BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW International Conference, 1998.

Computing PageRank. Use the power iteration method: initialize P_0(i) = 1/n for all i; then repeat P_{k+1} = (1-d)e/n + dA^T P_k (or the scaled variant P_{k+1} = (1-d)e + dA^T P_k) until the L1 norm of the difference, ||P_{k+1} - P_k||_1, is less than 10^{-6}.

PageRank again. Without scaling the equation (i.e., without multiplying by n) we have e^T P = 1 (the sum of all PageRank values is one), and thus:

P(i) = \frac{1-d}{n} + d \sum_{(j,i) \in E} P(j)/O_j

Important pages are cited/pointed to by other important ones. In the example, the most important page is ID = 1, with P(ID=1) = 0.304. Page 1 distributes its rank among its 5 outgoing links (ID = 2, 3, 4, 5, 7), each receiving 0.304 / 5 = 0.061.

PageRank again (cont.). The stationary probability P(ID=1) is obtained as:

P(ID=1) = (1-d)/n + d (0.023 + 0.166 + 0.071 + 0.045) = 0.15/7 + 0.85 (0.023 + 0.166 + 0.071 + 0.045) = 0.304

1st step. Write a sequential program (C or C++) that implements PageRank. Compile the code with the -O3 option, and measure the execution times (with the time command) for some inputs. Input graphs: http://snap.stanford.edu/data. Test example: a three-node graph (nodes 0, 1, 2) with expected output P[2] = 0.474412, P[1] = 0.341171, P[0] = 0.184417.
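As a starting point, here is a minimal sequential sketch of the damped power iteration in C, assuming the graph has already been loaded into a dense row-normalized matrix a[] with dangling rows filled with 1/n (the function and variable names are illustrative, not part of the assignment):

#include <stdlib.h>
#include <math.h>

#define D   0.85   /* damping factor, as in [BP98] */
#define EPS 1e-6   /* L1 stopping threshold */

/* Power iteration: p <- (1-d)/n + d * A^T * p, repeated until convergence.
   a[i*n + j] holds 1/O_i if edge (i,j) exists, 0 otherwise. */
void pagerank(const double *a, double *p, int n) {
    double *p_new = malloc(n * sizeof(double));
    double norm;
    for (int j = 0; j < n; j++) p[j] = 1.0 / n;   /* uniform start */
    do {
        for (int j = 0; j < n; j++) p_new[j] = (1.0 - D) / n;
        for (int i = 0; i < n; i++)               /* row-major traversal of A */
            for (int j = 0; j < n; j++)
                p_new[j] += D * a[i * n + j] * p[i];
        norm = 0.0;
        for (int j = 0; j < n; j++) {             /* L1 distance, then copy */
            norm += fabs(p_new[j] - p[j]);
            p[j] = p_new[j];
        }
    } while (norm > EPS);
    free(p_new);
}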

2nd step. Given the original incidence matrix A[][], if we know which nodes are dangling we can avoid filling their zero rows with the value 1/n. The dense iteration splits into a sparse part plus a correction for the dangling columns:

\left( A^T + \frac{1}{n} e \, d^T \right) p_k = p_{k+1}

where e is the all-ones column vector and d is the 0/1 indicator vector of the dangling nodes: A^T keeps zero columns for the dangling nodes, while (1/n) e d^T holds the value 1/n exactly in those columns.

2nd step (cont.). The correction term reduces to a scalar added to every component:

A^T p_k + \left( \sum_{i \in danglings} \frac{p_k[i]}{n} \right) e = p_{k+1}

so each iteration needs only a sparse matrix-vector product plus one scalar, \sum_{i \in danglings} p_k[i]/n, added to all entries.
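A sketch of this correction in C, assuming a hypothetical index array dang[] listing the ndang dangling nodes (with damping, the mass is additionally multiplied by d):

/* Gather the rank mass of the dangling pages and spread it uniformly. */
double mass = 0.0;
for (int k = 0; k < ndang; k++)
    mass += p[dang[k]];
mass /= n;
for (int j = 0; j < n; j++)
    p_new[j] += mass;   /* same constant added to every component */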

2nd step (cont.). Avoid transposing the matrix A[][]: we can compute A^T p while still traversing A[][] in row-major order, accumulating into the output vector:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        p_new[j] = p_new[j] + a[i][j] * p[i];

Then store the matrix A[][] in sparse compressed form: Compressed Sparse Row (CSR or CRS).

2nd step (cont.). Compressed Sparse Row (CSR or CRS) is used for traversing the matrix in row-major order. Three arrays are stored:

val:     10 -2  3  9  3  7  8  7  3  8  7  5  8  9  9 13  4  2 -1
col_ind:  0  4  0  1  5  1  2  3  0  2  3  4  1  3  4  5  1  4  5
row_ptr:  0  2  5  8 12 16 19

row_ptr[i] gives the position in val/col_ind where the i-th row starts (one extra final entry marks the end of the last row). Since the matrix is sparse, a row can be completely zero; in that case row_ptr[i] = row_ptr[i+1].
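With this layout, the accumulation loop of the previous slide becomes a sparse kernel; a minimal sketch, assuming double values and the three arrays named as above:

/* y += A^T p over the stored entries, traversing A row by row via CSR. */
void csr_atxp(const double *val, const int *col_ind, const int *row_ptr,
              const double *p, double *y, int n) {
    for (int i = 0; i < n; i++)                 /* for each row i of A */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[col_ind[k]] += val[k] * p[i];     /* entry a[i][col_ind[k]] */
}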

2nd step (cont.). Store big data such as A[][] in a file. Once we map the file to a memory region, we can access it via pointers, just as we would access ordinary variables and objects. You can mmap a specific section/partition of the file, and share the file among several threads.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int main() {
    int i;
    float val;
    float *mmap_region;
    FILE *fstream;
    int fd;

2 nd step /* create the file */ fstream = fopen("./mmapped_file", "w+"); for (i=0; i<10; i++) { val = i + 100.0; /* write a stream of binary floats */ fwrite(&val, sizeof(float), 1, fstream); } fclose(fstream); /* map a file to the pages starting at a given address for a given length */ fd = open("./mmapped_file", O_RDONLY); mmap_region = (float *) mmap(0, 10*sizeof(float), PROT_READ, MAP_SHARED, fd, 0); if (mmap_region == MAP_FAILED) { close(fd); printf("error mmapping the file"); exit(1); } close(fd); Starting offset address in the file 44

2 nd step /* Print the data mmapped */ for (i=0; i<10; i++) printf("%f ", mmap_region[i]); printf("\n"); } /* free the mmapped memory */ if (munmap(mmap_region, 10*sizeof(float)) == -1) { printf("error un-mmapping the file"); exit(1); } 45

3rd step. The goal of this assignment is to parallelize the optimized code of the 2nd assignment. You can use SIMD (SSE), shared-memory, or message-passing parallelization (also hybrid). Measure speedup and efficiency as a function of the processors/cores exploited (for a couple of data sets). Point out the effects of Amdahl's law, i.e., the sections that remain serial (e.g., the input/output phases, if you are not able to parallelize them). Measure how the execution time changes when the problem size increases while the number of processors/cores is kept fixed; this requires considering subsets of the nodes and edges of a given input graph.

3rd step (cont.). The issues to solve concern decisions such as the right decomposition, the right granularity, and a (static/dynamic) strategy of task assignment. One point worth stressing: if we do not transpose the matrix A and we decompose the problem over the input (blocks of rows of A), every block contributes to the whole output vector, so the partial results must be combined with a reduction:

A_1 * p_k → partial p_{k+1}
A_2 * p_k → partial p_{k+1}    } reduce → p_{k+1}
A_3 * p_k → partial p_{k+1}

as shown in the sketch below.
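One possible shared-memory realization of this row decomposition, sketched with OpenMP on the CSR kernel of the 2nd step; the per-thread buffer followed by a critical-section reduction is just one of several valid strategies:

#include <stdlib.h>
#include <omp.h>

/* Row-decomposed y = A^T p: each thread handles a block of rows of A and
   accumulates into a private vector, then the partial vectors are reduced.
   This avoids concurrent writes to y[col_ind[k]] from different rows. */
void csr_atxp_omp(const double *val, const int *col_ind, const int *row_ptr,
                  const double *p, double *y, int n) {
    #pragma omp parallel
    {
        double *local = calloc(n, sizeof(double)); /* per-thread partial result */
        #pragma omp for schedule(static)           /* static row decomposition */
        for (int i = 0; i < n; i++)
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                local[col_ind[k]] += val[k] * p[i];
        #pragma omp critical                       /* reduce: y += local */
        for (int j = 0; j < n; j++)
            y[j] += local[j];
        free(local);
    }
}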