Graphs and Networks
Lecture 2, Ranking 1
Tuesday, September 12, 2006 1:14 PM

I. How search engines work:
   a. Crawl the web, creating a database.
   b. Answer the query somehow, e.g. grep (ex. Funk search).
   c. A revolution came with two ideas for exploiting link structure (beyond in-degree):
      i. PageRank of Brin-Page, Google
      ii. Hits, Kleinberg's algorithm, hubs and authorities
      iii. Downside of being at IBM.

II. We will start with PageRank, which Google supposedly uses.
   a. Disclaimer: what Google actually does is not public knowledge.
   b. Main idea: give a score to every page on the web.
   c. When we get a query, use old technology to get a list of pages (e.g. grep), and then display them in the order given by their scores.
   d. PageRank is the algorithm for assigning a score/rank.

III. First approximation of PageRank.
   a. This first approximation is not quite what they do, but it is a good approximation. We will fix some details later in the lecture, once we know enough for it to make sense.
   b. Notation: for a node v, let $d^+(v)$ denote the number of edges leaving v, the out-degree of v, and let $d^-(v)$ be the number of edges entering v, the in-degree of v.
   c. We would like PageRank to be a function r from vertices to the non-negative reals such that
      $$r(v) = \sum_{u : u \to v} \frac{r(u)}{d^+(u)},$$
      where the sum is over all nodes u that have an edge to v.
   d. As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation
      $$r(v) = c \sum_{u : u \to v} \frac{r(u)}{d^+(u)}$$
      for some c > 0.
   e. Finally, r is normalized so that $\sum_v r(v) = 1$. Here are some examples, but without the normalization so the numbers are easier to read.
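The first-approximation equation can be checked directly on a small graph. The sketch below is only an illustration: the four-edge graph, the score vector, and the helper name `pagerank_residual` are hypothetical and do not come from the lecture. It measures how far a candidate r is from satisfying r(v) = c * (sum over edges u -> v of r(u) / d+(u)).

```python
from collections import defaultdict

def pagerank_residual(edges, r, c=1.0):
    """Largest violation of r(v) = c * sum_{u -> v} r(u) / d+(u) over all nodes v."""
    out_degree = defaultdict(int)
    for u, _ in edges:
        out_degree[u] += 1
    # incoming[v] accumulates r(u) / d+(u) over every edge u -> v.
    incoming = defaultdict(float)
    for u, v in edges:
        incoming[v] += r[u] / out_degree[u]
    return max(abs(r[v] - c * incoming[v]) for v in r)

# Hypothetical example graph (not the one from the lecture): 0->1, 0->2, 1->2, 2->0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
r = {0: 2.0, 1: 1.0, 2: 2.0}          # unnormalized scores
residual = pagerank_residual(edges, r)  # 0.0: the equation holds with c = 1

# Normalizing so the scores sum to 1 keeps the equation satisfied.
total = sum(r.values())
r_normalized = {v: score / total for v, score in r.items()}
```

For this graph the unnormalized scores (2, 1, 2) satisfy the equation with c = 1, and the normalized scores are (0.4, 0.2, 0.4).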

1. Does a PageRank vector exist? Yes.
   a. Here's one way to compute a PageRank vector. It converges relatively quickly, as can be seen from the example. In fact, the actual vector converges even faster.
   b. Is the PageRank vector unique?
      i. Not necessarily. If the graph has disconnected components, there will be many solutions to the equations. The corrected PageRank algorithm will fix this.

If the graph is strongly connected, that is, if for every u and v there is a directed path from u to v, then r exists, is unique, and has c = 1. We will now prove this fact. But first, we must introduce some matrix notation for dealing with this.

The adjacency matrix of a graph is a matrix A such that
$$A_{u,v} = \begin{cases} 1 & \text{if there is an edge from } u \text{ to } v, \\ 0 & \text{otherwise.} \end{cases}$$
Recall that $A_{u,v}$ is the entry in the uth row and vth column. Now, define the matrix D by
$$D_{u,v} = \begin{cases} d^+(u) & \text{if } u = v, \\ 0 & \text{otherwise,} \end{cases}$$
and the matrix M by
$$M = D^{-1} A.$$
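The matrix notation lends itself to a short computation. The sketch below assumes M is the out-degree-normalized adjacency matrix, M = D^{-1} A, which is consistent with every row of M summing to 1. The procedure the notes use to compute r is not shown in the text, so the repeated update r <- r M (power iteration) is an assumption here, as are the function names and the example graph.

```python
def transition_matrix(n, edges):
    """Build M = D^{-1} A as a list of rows, for nodes numbered 0..n-1."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1.0
    M = []
    for u in range(n):
        d_out = sum(A[u])  # out-degree of u = sum of row u of A
        if d_out == 0:
            raise ValueError(f"node {u} has out-degree 0, so M is undefined")
        M.append([a / d_out for a in A[u]])
    return M

def power_iteration(M, steps=200):
    """Repeat r <- r M, starting from the uniform distribution."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(r[u] * M[u][v] for u in range(n)) for v in range(n)]
    return r

# Hypothetical strongly connected example: 0->1, 0->2, 1->2, 2->0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
M = transition_matrix(3, edges)
r = power_iteration(M)
```

On this graph the iterates approach (0.4, 0.2, 0.4). Plain power iteration is not guaranteed to converge on every strongly connected graph (a directed cycle makes it oscillate), so this is a sketch for well-behaved cases rather than a general method.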

Here's an example:

Note that the sum of the entries in every row of M is 1. For M to be well-defined, we require that $d^+(v) > 0$ for all v. Now, the PageRank equations become
$$r = c \, r M.$$
That is, r is a left-eigenvector of M of eigenvalue 1/c. Now, let's look carefully at this example, and see why the following vector satisfies the PageRank equations:
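As an illustration, here is a worked instance on a hypothetical three-node graph with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0. It is not the example from the lecture, and it takes M = D^{-1} A as an assumption consistent with the row sums being 1.

```latex
A = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix},
\qquad
D = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
M = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.

% Each row of M sums to 1. For the row vector r = (2, 1, 2):
r M = \bigl( 2 \cdot 0 + 1 \cdot 0 + 2 \cdot 1, \;
             2 \cdot \tfrac{1}{2} + 1 \cdot 0 + 2 \cdot 0, \;
             2 \cdot \tfrac{1}{2} + 1 \cdot 1 + 2 \cdot 0 \bigr)
    = (2, 1, 2) = r.
```

So r = (2, 1, 2) satisfies r = c r M with c = 1, and dividing by 5 gives the normalized vector (0.4, 0.2, 0.4).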

Again, let's consider the case in which the graph is strongly connected, and show that 1 is an eigenvalue of M. This is necessary to show that c = 1 is achievable. To show this, we show that the all-1's vector is a right-eigenvector of eigenvalue 1. To see this, note that $A\mathbf{1}$ is a column vector giving the out-degree of every vertex. So, $D^{-1}(A\mathbf{1})$ is again the all-1's vector. Now, as the right- and left-eigenvalues are the same, we know that M has a left-eigenvector of eigenvalue 1.

In a few moments, we will show that the left-eigenvector of eigenvalue 1 is non-negative, and unique. But, for now let's assume that a non-negative vector r exists such that $r = rM$, and show this implies it is unique.

First, we establish that if the graph is strongly connected, $r = rM$, and r is non-negative and has at least one non-zero entry, then r is strictly positive. Assume $r(u_0) > 0$. Then, for every v, there exists a directed path from $u_0$ to $u_1$ to ... to $u_k$ to v. From the PageRank equations, we have
$$r(u_1) = \sum_{w : w \to u_1} \frac{r(w)}{d^+(w)} \ge \frac{r(u_0)}{d^+(u_0)} > 0.$$
Repeating this step along the path gives $r(u_2) > 0$, ..., $r(u_k) > 0$, and finally $r(v) > 0$.

Now, let's observe that if r and s are two vectors that satisfy $r = rM$ and $s = sM$, then for every $\beta$, $(r + \beta s) = (r + \beta s)M$. This follows simply from linearity:
$$(r + \beta s) M = rM + \beta \, sM = r + \beta s.$$

Finally, note that if r is strictly positive, s is non-negative and not identically zero, and r and s satisfy $r = rM$ and $s = sM$, then there is a $\beta$ such that the vector $(r - \beta s)$ is non-negative, but has at least one zero entry. As we will also have $(r - \beta s) = (r - \beta s)M$, this will contradict the point established above unless $r - \beta s$ is the zero vector. So, s must be a multiple of r, and after normalization any such r must be unique.

I was going to show that if the graph is strongly connected and $r = rM$ then r must be non-negative. But, I ran out of time. So, we will move this fact to a problem set.

There are two more things I should say about how PageRank works. The first is how they handle nodes with out-degree zero. Brin and Page say that they ignore these nodes at first, solve the remaining problem, and then put those nodes back in. We could think of many reasonable ways of making this formal. I'm not sure which one they actually do.

To explain the second point, let me first observe that in the algorithm as we have specified it so far, each node makes equal contributions to the ranks of the nodes it points to. It would be easy to imagine that a node could make different contributions to each of the nodes it points to. In fact, Brin and Page make each node make a small contribution to every other node. Formally, the matrix they consider is given by:
$$M_\alpha(u,v) = (1 - \alpha) \frac{A_{u,v}}{d^+(u)} + \frac{\alpha}{n},$$
where n is the number of nodes and $\alpha$ is a constant between 0 and 1. Which can also be written:
$$M_\alpha = (1 - \alpha) D^{-1} A + \frac{\alpha}{n} J,$$
where J is the all-1's matrix.

Now, let me explain the probabilistic approach to understanding PageRank. They imagine a monkey surfing the web. At each time step, the monkey chooses to follow a random link on a page (with probability $1 - \alpha$), or to go to a completely random place on the web (with probability $\alpha$). Imagine that the monkey surfs for a very long time. Eventually, there is one probability distribution giving the chance that the monkey is at each page. The probability of being at page v becomes r(v), where r is the PageRank vector. To see this, consider the probability of being at a page v. It is the sum over all pages u of the probability of being at u, times the probability of jumping from u to v. This is exactly what is captured by the equation
$$r(v) = \sum_u r(u) \, M_\alpha(u,v), \quad \text{that is,} \quad r = r M_\alpha.$$
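The surfing monkey can be simulated at the level of probability distributions. The sketch below assumes the transition matrix is (1 - alpha) * D^{-1} A + (alpha / n) * J, which is what the verbal description implies. The value alpha = 0.15, the function names, and the example graph are illustrative assumptions, not taken from the lecture. Nodes with out-degree zero are rejected rather than handled.

```python
def damped_matrix(n, edges, alpha):
    """Rows of (1 - alpha) * D^{-1} A + (alpha / n) * J for nodes 0..n-1."""
    out_neighbors = [[] for _ in range(n)]
    for u, v in edges:
        out_neighbors[u].append(v)
    rows = []
    for u in range(n):
        if not out_neighbors[u]:
            raise ValueError(f"node {u} has out-degree 0; not handled in this sketch")
        row = [alpha / n] * n  # jump to a uniformly random page
        for v in out_neighbors[u]:
            row[v] += (1.0 - alpha) / len(out_neighbors[u])  # follow a random link
        rows.append(row)
    return rows

def surf(rows, start, steps=500):
    """Push a starting distribution through `steps` time steps of the walk."""
    n = len(rows)
    r = list(start)
    for _ in range(steps):
        r = [sum(r[u] * rows[u][v] for u in range(n)) for v in range(n)]
    return r

# Two disconnected 2-cycles: without random jumps the walk never leaves its component.
edges = [(0, 1), (1, 0), (2, 3), (3, 2)]
G = damped_matrix(4, edges, alpha=0.15)
r_from_0 = surf(G, [1.0, 0.0, 0.0, 0.0])
r_from_3 = surf(G, [0.0, 0.0, 0.0, 1.0])
```

Both starting points end at the same distribution, (0.25, 0.25, 0.25, 0.25) for this symmetric graph. This shows how the random jumps remove the non-uniqueness that disconnected components cause in the first approximation.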

Finally, I also mentioned Kleinberg's algorithm, called Hits. Kleinberg had the idea that some nodes are authorities on certain topics, and others are hubs, which list authorities. For example, my web page points to authorities on many topics. So, he assigned each node a hub-weight, h(v), and an authority-weight, a(v). A page that is pointed to by many good hubs becomes a good authority, and a page that points to many good authorities should be a good hub. So, he asked that the vectors a and h satisfy:
$$a(v) \propto \sum_{u : u \to v} h(u), \qquad h(v) \propto \sum_{w : v \to w} a(w).$$
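The mutual reinforcement between hubs and authorities can be sketched as an alternating update. The normalization (Euclidean length 1 after each round), the fixed number of rounds, the function names, and the example graph are all assumptions made for illustration. The notes do not say how the conditions are solved.

```python
import math

def normalize(x):
    """Scale a vector to Euclidean length 1 (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(value * value for value in x))
    return [value / norm for value in x] if norm > 0 else list(x)

def hits(n, edges, steps=100):
    """Alternate a(v) <- sum of h(u) over u -> v and h(u) <- sum of a(v) over u -> v."""
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(steps):
        new_a = [0.0] * n
        for u, v in edges:
            new_a[v] += h[u]      # good hubs pointing to v make v a good authority
        new_h = [0.0] * n
        for u, v in edges:
            new_h[u] += new_a[v]  # pointing to good authorities makes u a good hub
        a, h = normalize(new_a), normalize(new_h)
    return a, h

# Hypothetical example: pages 0 and 1 act as hubs; pages 2, 3, 4 are link targets.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (1, 4)]
a, h = hits(5, edges)
```

In this example pages 2 and 3 receive equal authority weight, larger than that of page 4, and page 1 receives a larger hub weight than page 0 because it links to one more authority.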