Google Page Rank Project Linear Algebra Summer 2012

How does an internet search engine, like Google, work? In this project you will discover how the PageRank algorithm works to give the most relevant information as the top hit on a Google search. This document is organized as follows: a brief description of what happens when you query a search engine, a brief description of the PageRank algorithm (with a few tasks for you), and a description of your major tasks for this project.

1 A Google Query

Search engines compile large indexes of the dynamic information on the internet so that it can be searched easily. This means that when you do a Google search, you are not actually searching the internet; instead, you are searching the indexes at Google. When you type a query into Google, the following two steps take place:

1. Query Module: The query module at Google converts your natural language into a language that the search system can understand, and consults the various indexes at Google in order to answer the query. This is done to find the list of relevant pages.

2. Ranking Module: The ranking module takes the set of relevant pages and ranks them according to some criterion. The outcome is an ordered list of webpages such that the pages near the top of the list are most likely to be what you desire from your search. This ranking is the same as assigning a popularity score to each website and then listing the relevant sites by this score.

This project focuses on the linear algebra behind the ranking module developed by Sergey Brin and Larry Page: the PageRank algorithm.

2 The PageRank Algorithm

In simple terms: a webpage is important if it is pointed to by other important pages. The internet can be viewed as a directed graph (look up this term on Wikipedia) where the nodes are the webpages and the edges are the hyperlinks between the pages. The hyperlinks into a page are called inlinks, and the ones pointing out of a page are called outlinks. In essence, a hyperlink from my page to yours is my endorsement of your page. Thus, a page with many recommendations should be more important than a page with only a few. However, the status of the recommending pages also matters. Let us now translate this into mathematical equations.

To help understand this, we first consider the small web of six pages shown in Figure 1. The links between the pages are shown by arrows. An arrow pointing into a node is an inlink and an arrow pointing out of a node is an outlink. In Figure 1, node 3 has three outlinks (to nodes 1, 2, and 5) and one inlink (from node 1).

The PageRank of a page P_i, denoted r(P_i), is the sum of the PageRanks of all pages pointing into P_i, each weighted by how many pages it links to:

    r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|},    (1)

where B_{P_i} is the set of pages pointing into P_i, and |P_j| is the number of outlinks from page P_j.
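As a concrete instance of (1), consider page 2 in the web of Figure 1. Assuming (as the link structure described above and the iterates in Table 1 below indicate) that pages 1 and 3 are the only pages pointing to page 2, and recalling that page 1 has two outlinks while page 3 has three, equation (1) reads

    r(P_2) = \frac{r(P_1)}{2} + \frac{r(P_3)}{3}.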

Figure 1: Sample graph of a web with six pages.

This formula has the problem that to find the PageRank of one page you must first know the PageRanks of all of the other pages. To overcome this problem we iterate (find successive approximations of r(P_i)) using the formula

    r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|P_j|}.    (2)

In this formula, the subscripts describe the iterate. This process is initiated with r_0(P_i) = 1/n for all pages P_i and repeated with the hope that the PageRank scores will eventually converge to some final stable values. Applying equation (2) to the web in Figure 1 gives the values in Table 1 for the PageRanks after a few iterations.

           Iteration 0        Iteration 1        Iteration 2        Iteration 3
    P_1    r_0(P_1) = 1/6     r_1(P_1) = 1/18    r_2(P_1) = 1/36
    P_2    r_0(P_2) = 1/6     r_1(P_2) = 5/36    r_2(P_2) = 1/18
    P_3    r_0(P_3) = 1/6     r_1(P_3) = 1/12    r_2(P_3) = 1/36
    P_4    r_0(P_4) = 1/6     r_1(P_4) = 1/4     r_2(P_4) = 17/72
    P_5    r_0(P_5) = 1/6     r_1(P_5) = 5/36    r_2(P_5) = 11/72
    P_6    r_0(P_6) = 1/6     r_1(P_6) = 1/6     r_2(P_6) = 14/72

    Table 1: First few iterates using equation (2)

These PageRanks yield the following rankings at each iteration (tied pages share a rank):

           Iteration 0   Iteration 1   Iteration 2   Iteration 3
    P_1    1             5             5
    P_2    1             3             4
    P_3    1             4             5
    P_4    1             1             1
    P_5    1             3             3
    P_6    1             2             2

    Table 2: Rankings for the first few iterates of equation (2)

Task #1: Fill in the tables above with the PageRanks and rankings for iteration 3.

The process described here can be simplified using matrices. Define H as an n x n matrix and x as an n x 1 vector. The matrix H is called the hyperlink matrix, with

    H_{ij} = 1/|P_i|   if there is a link from node i to node j,
    H_{ij} = 0         otherwise.

Using this notation, H_11 = 0, H_12 = 1/2, H_13 = 1/2, H_14 = 0, H_15 = 0, and H_16 = 0.
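To see how this matrix form works before tackling Task #2 below, here is a minimal MATLAB sketch for a made-up three-page web (page 1 links to pages 2 and 3, page 2 links to page 3, and page 3 links to page 1). This web is purely illustrative and is not the web of Figure 1; as equation (3) below makes precise, one sweep of equation (2) over all pages is just a matrix-vector product with the transpose of H.

>> H = [0, 1/2, 1/2; 0, 0, 1; 1, 0, 0];   % row i lists page i's outlinks, each weighted by 1/|P_i|
>> x = [1/3; 1/3; 1/3];                   % r_0(P_i) = 1/n with n = 3
>> x = H' * x                             % one pass of equation (2) for every page at once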

Task #2: Fill in the remainder of the entries of the H matrix for the web shown in Figure 1.

    H =  [ 0    1/2   1/2   0    0    0
           __   __    __    __   __   __
           __   __    __    __   __   __
           __   __    __    __   __   __
           __   __    __    __   __   __
           __   __    __    __   __   __ ]

The vector x is a vector that contains all of the PageRanks. Therefore,

    x_0 = ( 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 )^T   and   x_1 = ( 1/18, 5/36, 1/12, 1/4, 5/36, 1/6 )^T.

Task #3: Verify that x_1 = H^T x_0, x_2 = H^T x_1, and x_3 = H^T x_2. (You do not need to show any calculations here. Just do the calculations on the side.)

The general task to find the PageRank is to find successive approximations of r(P_i) using

    x_{k+1} = H^T x_k.    (3)

If this process reaches a steady state, then that steady state is the final PageRank for the web. Technically this process can proceed forever, but in certain instances there is a stationary solution to the iterative process. This means that over time the values in x do not change; in other words, there is a limit.

Task #4: Apply this process several times using MATLAB and approximate the stationary vector for this example.

    x = ________

Given this stationary solution to (3), what are the rankings of the web pages?

Task #5: Plot the behavior of the iterates over the first 20 iterations, and explain the plot that you see. (Uncomment the code here.) (Be sure that your image file is saved in the same folder as your .tex file, and that the file type is PDF.)
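For Task #4, one way to judge numerically that the iterates of (3) have stopped changing is to keep iterating until successive vectors agree to within a small tolerance. The following is only a sketch: it assumes H and a starting vector x0 are already entered in MATLAB, and the tolerance 1e-8 and the cap of 1000 iterations are arbitrary choices.

>> x = x0;
>> for j = 1:1000, xnew = H' * x; if norm(xnew - x) < 1e-8, break, end, x = xnew; end
>> x      % approximate stationary vector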

Big Project Tasks

1. Prove that the sum of each row of the H-matrix for any web of n pages will always be either 0 or 1. Be sure to indicate the special circumstances under which the sum of a row is zero. (You should not simply use the example from Figure 1.)
(Soln) +Your proof gets typed here+

2. Definition: A probability vector is a vector with nonnegative entries that add up to 1.
Definition: A stochastic matrix is a square matrix whose columns are probability vectors.

3. What must be true about a collection of n pages so that H^T is a stochastic matrix?
(Soln) +Your solution gets typed here+

4. Consider the web in the figure below.

    Figure 2: Graph of a web with eight pages.

(a) Write the H matrix, the initial state x_0, and the steady-state PageRank vector.
(Soln) +Your solution gets typed here+

(b) Rank the web pages according to the PageRank vector.
(Soln) +Your solution gets typed here+

(c) Create a graph of the iterates of the PageRank vector for 50 iterations.
(Soln) (uncomment the code here)

(d) Use MATLAB (or something similar) to find the largest eigenvalue of H^T and the eigenvector associated with this eigenvalue. Compare this eigenvector to the PageRank vector from part (a). (Keep in mind that MATLAB will always normalize all vectors.)
(Soln) +Your Solution Gets Typed Here+

5. Theorem: The largest eigenvalue of every stochastic matrix is 1. (You do not need to prove this; a small numerical illustration is sketched after these tasks.)

6. Explain how the fact that H^T is a stochastic matrix relates to the PageRank vector, and use this to explain how Google will rank the search results for a web query. Hint: Google does not do an iterative process.
(Soln) +Your Solution Gets Typed Here+
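The following is not a proof, just a quick numerical illustration of the theorem in item 5 on one small stochastic matrix invented for this check (eig, unlike eigs, simply returns all eigenvalues of a small matrix):

>> A = [0.2, 0.5; 0.8, 0.5];   % each column is a probability vector, so A is stochastic
>> eig(A)                      % returns 1 and -0.3; the largest eigenvalue is 1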

A MATLAB Basics

The following are some basics for running MATLAB on this project.

To type in the matrix

    H = [ 1  2  3 ]
        [ 4  5  6 ]
        [ 7  8  9 ],

type the following and press enter:

>> H = [1, 2, 3 ; 4, 5, 6 ; 7, 8, 9]

Similarly, to type the vector x = (1, 2, 3)^T, type the following and press enter:

>> x = [1 ; 2 ; 3]

To transpose a matrix, use the apostrophe:

>> H'

To iterate equation (3) you can run

>> x = H' * x

over and over again. Each time will produce the result of one iterate.

To do several iterates quickly you can write a for loop as follows. For this example, assume that

    x_0 = [ 0.4 ; 0.3 ; 0.3 ]   and   H = [ 0.1   0     0.9 ]
                                          [ 0     0.3   0.7 ]
                                          [ 0.3   0.2   0.5 ].

>> H = [0.1, 0, 0.9; 0, 0.3, 0.7; 0.3, 0.2, 0.5];
>> x(:,1) = [0.4; 0.3; 0.3];
>> for j=2:20, x(:,j) = H' * x(:,j-1); end

This example will carry out the iteration and save the successive vectors as the columns of x (20 columns here). You can change the 20 to anything that you like.

To plot the behavior of x_k over the iterations:

>> plot(x');
>> legend('node 1', 'node 2', 'node 3')
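Task #5 asks for the figure to be saved as a PDF in the same folder as your .tex file. One way to do this from MATLAB, assuming the plot above is the current figure (the file name here is just a placeholder you should change):

>> saveas(gcf, 'pagerank_iterates.pdf')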

>> eigs(a) To find the eigenvalues and eigenvectors of A, type >> [evec, eval] = eigs(a) The columns of matrix evec are the eigenvectors, and the diagonal entries of eval are the eigenvalues. 6