Uncertainty and Randomization

Similar documents
The PageRank Computation in Google: Randomization and Ergodicity

Randomization and Gossiping in Techno-Social Networks

The PageRank Problem, Multi-Agent. Consensus and Web Aggregation

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

Graph Models The PageRank Algorithm

A Note on Google s PageRank

1998: enter Link Analysis

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Link Mining PageRank. From Stanford C246

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

0.1 Naive formulation of PageRank

No class on Thursday, October 1. No office hours on Tuesday, September 29 and Thursday, October 1.

eigenvalues, markov matrices, and the power method

Link Analysis. Stony Brook University CSE545, Fall 2016

ECEN 689 Special Topics in Data Science for Communications Networks

Today. Next lecture. (Ch 14) Markov chains and hidden Markov models

Calculating Web Page Authority Using the PageRank Algorithm

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

Computing PageRank using Power Extrapolation

Online Social Networks and Media. Link Analysis and Web Search

Slides based on those in:

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Link Analysis Ranking

Application. Stochastic Matrices and PageRank

IR: Information Retrieval

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Link Analysis. Leonid E. Zhukov

Math 304 Handout: Linear algebra, graphs, and networks.

How does Google rank webpages?

Page rank computation HPC course project a.y

Applications. Nonnegative Matrices: Ranking

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

Node Centrality and Ranking on Networks

Mathematical Properties & Analysis of Google s PageRank

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

CS6220: DATA MINING TECHNIQUES

Google Page Rank Project Linear Algebra Summer 2012

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Lecture: Local Spectral Methods (1 of 4)

Monte Carlo methods in PageRank computation: When one iteration is sufficient

Data Mining and Matrices

Data and Algorithms of the Web

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Lesson Plan. AM 121: Introduction to Optimization Models and Methods. Lecture 17: Markov Chains. Yiling Chen SEAS. Stochastic process Markov Chains

Krylov Subspace Methods to Calculate PageRank

MATH36001 Perron Frobenius Theory 2015

Algebraic Representation of Networks

Robust PageRank: Stationary Distribution on a Growing Network Structure

Applications of The Perron-Frobenius Theorem

CPSC 540: Machine Learning

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

Online Social Networks and Media. Link Analysis and Web Search

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Node and Link Analysis

Inf 2B: Ranking Queries on the WWW

The Second Eigenvalue of the Google Matrix

On the mathematical background of Google PageRank algorithm

Analysis of Google s PageRank

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

The Google Markov Chain: convergence speed and eigenvalues

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS6220: DATA MINING TECHNIQUES

Data Mining Recitation Notes Week 3

Updating PageRank. Amy Langville Carl Meyer

Computational Economics and Finance

CPSC 540: Machine Learning

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

A hybrid reordered Arnoldi method to accelerate PageRank computations

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

Lecture 12: Link Analysis for Web Retrieval

CS249: ADVANCED DATA MINING

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Markov Chains, Random Walks on Graphs, and the Laplacian

Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports

Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016

Web Structure Mining Nodes, Links and Influence

CONVERGENCE ANALYSIS OF A PAGERANK UPDATING ALGORITHM BY LANGVILLE AND MEYER

Finding central nodes in large networks

Random Surfing on Multipartite Graphs

arxiv:cond-mat/ v1 3 Sep 2004

Google and Biosequence searches with Markov Chains

A New Method to Find the Eigenvalues of Convex. Matrices with Application in Web Page Rating

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

COMPSCI 514: Algorithms for Data Science

Google Matrix, dynamical attractors and Ulam networks Dima Shepelyansky (CNRS, Toulouse)

STA141C: Big Data & High Performance Statistical Computing

Markov Chains for Biosequences and Google searches

Local properties of PageRank and graph limits. Nelly Litvak University of Twente Eindhoven University of Technology, The Netherlands MIPT 2018

1 Searching the World Wide Web

Lecture 7 Mathematics behind Internet Search

The Push Algorithm for Spectral Ranking

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Analysis and Computation of Google s PageRank

Transcription:

Uncertainty and Randomization The PageRank Computation in Google Roberto Tempo IEIIT-CNR Politecnico di Torino tempo@polito.it

1993: Robustness of Linear Systems

1993: Robustness of Linear Systems

16 Years Later still working on uncertain systems for modern applications (using different methodologies) On the theory side, randomization is now used in the control community as a powerful tool for solving problems otherwise intractable

16 Years Later We are also dealing with very different applications

16 Years Later We are also dealing with very different applications The PageRank Problem in Google Random Surfer Model and Teleportation Matrix A Distributed Randomized Algorithm for PageRank Uncertain Systems, Control and Randomization

The PageRank Problem in Google

Random Surfer Model Web surfer moves along randomly following the hyperlink structure When arriving at a page with several outgoing links, one is chosen at random, then the random surfer hyperlinks to the new one, and so on The time the random surfer spends on a page is a measure of the importance of the page If important pages point to your page, then your page becomes important. Need to rank the pages for facilitating the web search

Graph Representation 1 2 3 5 4 Directed graph with nodes (pages) and links representing the web Graph is not necessarily strongly connected Graph is constructed using crawlers and spiders which move continuously along the web

Hyperlink Matrix 1 2 3 4 For each node we count the number of outgoing links and normalize them to 1 Hyperlink matrix is a nonnegative (column) substochastic matrix 5

Hyperlink Matrix 1 2 3 5 4 0 1 0 0 0 1/3 0 0 1/2 0 A = 1/3 0 0 1/2 0 1/3 0 0 0 0 0 0 1 0 0

PageRank: Bringing Order to the Web [1,2] Need to rank pages in order of importance The PageRank x* is defined as x*=ax* where x* [0,1] n and i x i *=1 x* is a nonnegative unit eigenvector corresponding to the eigenvalue 1 for the hyperlink matrix A The question is when x* exists and it is unique [1] S. Brin, L. Page (1998) [2] S. Brin, L. Page, R. Motwani, T. Winograd (1999)

Issue of Dangling Nodes First issue:wehavedangling nodes Random surfer gets stuck when visiting a pdf file In this case the back button of the browser is used Mathematically, the hyperlink matrix is nonnegative and (column) substochastic Easy fix: Add artificial links to make the matrix stochastic

1 2 Easy Fix Page 5 is a dangling node We add an outgoing link 3 4 A 0 1 0 0 0 1/3 0 0 1/2 0 = 1/3 0 0 1/2 1 5 1/3 0 0 0 0 0 0 1 0 0

But in General the Fix is not so Easy 1 2 Page 5 has two incoming links 3 4 5

But in General the Fix is not so Easy 1 2 We add an outgoing link from 5 to 3 3 4 A 0 1 0 0 0 1/3 0 0 1/3 0 = 1/3 0 0 1/3 1 5 1/3 0 0 0 0 0 0 1 1/3 0

But in General the Fix is not so Easy 1 2 or we add an outgoing link from 5 to 4? 3 4 A 0 1 0 0 0 1/3 0 0 1/3 0 = 1/3 0 0 1/3 0 5 1/3 0 0 0 1 0 0 1 1/3 0

1 2 3 4 Modified Hyperlink Matrix A solution may be to break page 5 into two pages 5a and 5b This artificially changes the number of pages (not only the number of links) and the topology of the network 5a 5b

Modified Hyperlink Matrix 1 2 0 1 0 0 0 0 1/3 0 0 1/3 0 0 1/3 0 0 1/3 1 0 A = 4 1/3 0 0 0 0 1 3 0 0 1 0 0 0 0 0 0 1/3 0 0 5a 5b

Assumption: No Dangling Nodes We assume that there are no dangling nodes This implies that A is a nonnegative stochastic matrix (instead of substochastic) having at least one eigenvalue equal to one This eigenvalue is not necessarily unique

Teleportation Matrix Second issue: The random surfer may get bored after a while, and decides to jump to another page not directly connected to that currently visited Instead of A we consider a matrix M defined as M =(1-m) A + m/n S m (0,1) where S is a matrix with all entries equal to 1 and n is the number of pages The value m = 0.15 is proposed and used at Google [1] [1] S. Brin, L. Page (1998)

Matrix M and Perron-Frobenius Theorem M is positive stochastic (convex combination of two stochastic matrices and m (0,1)) M is irreducible and the corresponding graph is strongly connected (every page is directly connected to every page) M is primitive because M k is positive for some k (k =1) Perron-Frobenius Theorem: For a positive stochastic matrix M there exists a unique positive eigenvector for the eigenvalue 1

PageRank Computation

PageRank Computation PageRank is computed with the power method x(k+1) = Mx(k) Convergence of this recursion is guaranteed by Perron- Frobenius Theorem for any initial condition x(0) because M is a positive stochastic matrix x(k) x* provided that i x i (0) = 1 for k Remark: PageRank computation can be interpreted as finding the stationary point of a Markov Chain

PageRank Computation with Power Method 1 4 2 3 0 0 0 1/3 1 0 1/2 1/3 A = 0 1/2 0 1/3 0 1/2 1/2 0 m = 0.15 0.038 0.037 0.037 0.321 0.887 0.037 0.462 0.321 M = 0.037 0.462 0.037 0.321 0.037 0.462 0.462 0.037 x * = [ 0.12 0.33 0.26 0.29] T

Size of the Web ThesizeofM is 8 billion! The PageRank computation requires 50-100 iterations This takes about a week and it is performed centrally at Google once a month More and more computing power is needed

Columbia River, The Dalles, Oregon

Randomized Decentralized Approach

Randomized Decentralized Approach Main idea: Develop a decentralized approach for computing PageRank (instead of a centralized approach which involves the entire web)

Randomized Decentralized Approach Main idea: Develop a decentralized approach for computing PageRank (instead of a centralized approach which involves the entire web) Approach is randomization-based (Las Vegas type)

Basic Communication Protocol 1 4 Basic communication protocol: at time k the randomly selected page i initiates the PageRank update as follows: 2 3

Basic Communication Protocol 1 4 Basic communication protocol: at time k the randomly selected page i initiates the PageRank update as follows: 2 3

Basic Communication Protocol 1 4 2 3 Basic communication protocol: at time k the randomly selected page i initiates the PageRank update as follows: 1. by sending the value of page i to the outgoing pages that are linked to i

Basic Communication Protocol 1 4 2 3 Basic communication protocol: at time k the randomly selected page i initiates the PageRank update as follows: 1. by sending the value of page i to the outgoing pages that are linked to i 2. by requesting their values from the incoming pages that are linked to page i

Las Vegas Randomized Approach The pages taking action are determined via a random process θ(k) At time k page i initiates PageRank update with uniform probability Prob{θ(k)=i} =1/n

Distributed Randomized Update Scheme We consider the randomized update scheme x(k+1) = A θ(k) x(k) where A θ(k) are the distributed link matrices (example next) Consider the time average y(k) =1/(k+1) i x (i)

Distributed Link Matrices - 1 A 0 0 0 1/3 1 0 1/2 1/3 = 0 1/2 0 1/3 0 1/2 1/2 0 A 4 1/3 1/3 = 1/3 0 1/2 1/2 0

Distributed Link Matrices - 2 A 0 0 0 1/3 1 0 1/2 1/3 = 0 1/2 0 1/3 0 1/2 1/2 0 A 4 0 0 1/3 0 0 1/3 = 0 0 1/3 0 1/2 1/2 0

Distributed Link Matrices - 3 A 0 0 0 1/3 1 0 1/2 1/3 = 0 1/2 0 1/3 0 1/2 1/2 0 A 4 1 0 0 1/3 0 1/2 0 1/3 = 0 0 1/2 1/3 0 1/2 1/2 0

Distributed Link Matrices - 4 A 0 0 0 1/3 1 0 1/2 1/3 = 0 1/2 0 1/3 0 1/2 1/2 0 A 3 1 0 0 0 0 1/2 1/2 0 = 0 1/2 0 1/3 0 0 1/2 2/3 A 4 1 0 0 1/3 0 1/2 0 1/3 = 0 0 1/2 1/3 0 1/2 1/2 0

Distributed Link Matrices - 5 A 0 0 0 1/3 0 0 0 1/3 1 0 1/2 1/3 = 1 1 0 0 0 1/2 0 1/3 A1 = 0 0 1 0 0 1/2 1/2 0 0 0 0 2/3 A 2 0 0 0 0 1 0 1/2 1/3 = 0 1/2 1/2 0 0 1/2 0 2/3

Modified Distributed Update Scheme Recall that we need to work with positive stochastic matrices We consider the modified distributed update scheme x(k+1) = M θ(k) x(k) where M θ(k) are the modified distributed link matrices computed as M i =(1-r) A i + r/n S i=1,2,,n and r (0,1) is a design parameter

Theorem [1] Theorem: Take r =2m/(n mn +2m) (0,1) Using the modified distributed update scheme the PageRank is obtained through the time average y E[ y(k) - x* 2 ] 0 for k provided that i x i (0) = 1 Proof: Based on the theory of ergodic matrices Remark: The algorithm is a LVRA [1] H. Ishii, R. Tempo (2008)

Technical Comments The average y(k) can be computed recursively in terms of y(k-1) Sparsity of the matrix A i can be preserved because x(k+1) = M i x(k) = (1- r) A i x(k) + r/n 1 where 1 is a vector with all entries equal to one Recursion x(k+1) = (1- r) A 1 x(k) can be carried on as Convergence rate is 1/k x 3 (k+1) = (1- r) x 3 (k) z(k+1) = (1- r) Bz(k) z=[x 1, x 2, x 4 ] T

More Deeply into Distributed Randomized Schemes - 1 Different update schemes based only on outgoing links (not incoming): similar convergence results Stopping criteria to compute approximately PageRank Need to improve convergence because power method is exponential ( λ 2 1-m )

More Deeply into Distributed Randomized Schemes - 2 Simultaneous random update of multiple webpages

More Deeply into Distributed Randomized Schemes - 2 Simultaneous random update of multiple webpages

More Deeply into Distributed Randomized Schemes - 3 Webpage aggregation and clustering original web aggregated web

Uncertain Systems, Control, Randomization and Google

PageRank and Consensus For consensus problems (stochastic version) we consider a graph representing a network of agents x i Consensus All agent values become equal Graph is strongly connected Convergence w.p.1 for all x i, x j x i (k) x j (k) 0, k Matrices A i are row stochastic PageRank Page values converge to constant Web is not strongly connected MSE convergence for y E[ y(k)-x* 2 ] 0, k Matrices A i are column stochastic

PageRank and Uncertain Systems PageRank computation in the presence of uncertain, time-varying and broken links (LP solution)

Acknowledgment Acknowledgment: Research on PageRank is joint work with Hideaki Ishii Subsequent work with Er-Wei Bai, Fabrizio Dabbene, Shinji Hara http://staff.polito.it/roberto.tempo/