Inf 2B: Ranking Queries on the WWW

Similar documents
Data Mining and Matrices

1 Searching the World Wide Web

Link Analysis. Leonid E. Zhukov

Node and Link Analysis

Lecture 7 Mathematics behind Internet Search

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

Data Mining Recitation Notes Week 3

Node Centrality and Ranking on Networks

Graph Models The PageRank Algorithm

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Math 304 Handout: Linear algebra, graphs, and networks.

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

1998: enter Link Analysis

IR: Information Retrieval

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Link Analysis Ranking

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Link Analysis. Stony Brook University CSE545, Fall 2016

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Google Page Rank Project Linear Algebra Summer 2012

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

How does Google rank webpages?

STA141C: Big Data & High Performance Statistical Computing

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

ECEN 689 Special Topics in Data Science for Communications Networks

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Uncertainty and Randomization

Applications to network analysis: Eigenvector centrality indices Lecture notes

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Applications of The Perron-Frobenius Theorem

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Algebraic Representation of Networks

Online Social Networks and Media. Link Analysis and Web Search

Slides based on those in:

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

googling it: how google ranks search results Courtney R. Gibbons October 17, 2017

0.1 Naive formulation of PageRank

Calculating Web Page Authority Using the PageRank Algorithm

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

CS47300: Web Information Search and Management

The Second Eigenvalue of the Google Matrix

THEORY OF SEARCH ENGINES. CONTENTS 1. INTRODUCTION 1 2. RANKING OF PAGES 2 3. TWO EXAMPLES 4 4. CONCLUSION 5 References 5

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

On the mathematical background of Google PageRank algorithm

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Computing PageRank using Power Extrapolation

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Application. Stochastic Matrices and PageRank

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

arxiv:cond-mat/ v1 3 Sep 2004

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Link Mining PageRank. From Stanford C246

eigenvalues, markov matrices, and the power method

How Does Google?! A journey into the wondrous mathematics behind your favorite websites. David F. Gleich! Computer Science! Purdue University!

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Online Social Networks and Media. Link Analysis and Web Search

Where and how can linear algebra be useful in practice.

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Local properties of PageRank and graph limits. Nelly Litvak University of Twente Eindhoven University of Technology, The Netherlands MIPT 2018

CS6220: DATA MINING TECHNIQUES

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Page rank computation HPC course project a.y

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Eigenvalue Problems Computation and Applications

Finding central nodes in large networks

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Degree Distribution: The case of Citation Networks

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

Multiple Relational Ranking in Tensor: Theory, Algorithms and Applications

A Note on Google s PageRank

Applications. Nonnegative Matrices: Ranking

Google and Biosequence searches with Markov Chains

CS6220: DATA MINING TECHNIQUES

c 2005 Society for Industrial and Applied Mathematics

Announcements: Warm-up Exercise:

Web Structure Mining Nodes, Links and Influence

NewPR-Combining TFIDF with Pagerank

COMPSCI 514: Algorithms for Data Science

Markov Chains for Biosequences and Google searches

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

A New Method to Find the Eigenvalues of Convex. Matrices with Application in Web Page Rating

Data and Algorithms of the Web

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

The Push Algorithm for Spectral Ranking

Introduction to Data Mining

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Detecting Sharp Drops in PageRank and a Simplified Local Partitioning Algorithm

Eigenvalues of Exponentiated Adjacency Matrices

Data science with multilayer networks: Mathematical foundations and applications

Updating PageRank. Amy Langville Carl Meyer

Idea: Select and rank nodes w.r.t. their relevance or interestingness in large networks.

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

The Google Markov Chain: convergence speed and eigenvalues

Transcription:

Inf B: Ranking Queries on the WWW Kyriakos Kalorkoti School of Informatics University of Edinburgh

Queries Suppose we have an Inverted Index for a set of webpages. Disclaimer Not really the scenario of Lecture. Indexing for the web is massive-scale: many distributed networks working in parallel. We search with a term t. Index has many hits for t (say 36,000 for this t). How should we rank them?

A real search

Ranking Queries Inverted Index (probably) stores the frequency of the term t in each document d (e.g., in previous lecture, our index contains f d,t values). Idea Rank answers to queries in order of frequency of t in the various webpages. Problem Some great websites will not even contain the term t. For example, there are not many occurrences of the term University of Edinburgh" on http://www.ed.ac.uk New Idea Use structure of web to rank queries.

Ranking Queries using web structure Principle: Link from one webpage to another confers authority on the target webpage. This is the concept behind: The Hub-Authority model of Kleinberg. PageRank TM ranking system of Google TM. In early 90s, while PhD students at Stanford, Sergey Brin and Larry Page invented PageRank TM (and founded Google TM ).

PageRank TM Webgraph for a particular query: vertices V = [N] where [N] = {,,..., N} corresponding to pages; links are the directed edges of the graph, so E [N] [N]. Let G = (V, E). Recall: Definition Let u denote some page u [N] in the webgraph. In(u) is the set of in-edges to u. The in-degree in(u) is in(u) = In(u). Out(u) is the set of out-edges from u. The out-degree out(u) is out(u) = Out(u).

PageRank TM Could use in-degree to measure ranking directly. But: Want pages of high rank to confer more authority on the pages they link to. A page with few links should transfer more of its authority to its linked pages than one with many links. Assumptions: (for basic PageRank TM ) No dead-end" pages. Every page can hop to every other page via links. Aperiodic.

Principle of PageRank TM Let R(v) denote the rank of v for any webpage v [N]. For every webpage u in our collection, the following equality should hold: R(u) = R(v)/out(v) v In(u) Rank of u is the total amount of Rank given from the incoming links to u.

PageRank TM in matrix form (R, R,..., R N ) = (R, R,..., R N ) p p... p N p p... p N............ p N p N... p NN where p uv = { /out(u), if v Out(u); 0, otherwise.

PageRank TM in matrix form Shorthand version: R T = R T P, () where P = [p uv ] u,v [N] and R is the vector of ranks for [N]. Equivalent to asking for R = P T R, () Looks like condition for R to be an eigenvector of P T with eigenvalue λ =.

PageRank TM Questions and Answers How do we know that is an eigenvalue of the matrix P T? Answer: P T is a stochastic matrix (each column adds to ), so has eigenvalue. If is an eigenvalue of P T, is it guaranteed to be a simple eigenvalue? i.e., any two vectors that satisfy P T R = R are the same up to a non-zero constant multiple (linearly dependent). Answer: Under our assumptions, there is just one linearly independent eigenvector for.

Example z u w v Example webgraph returned by a rare query in ancient times.

Example z u w v Satisfies all the nice conditions for Basic PageRank TM model (no dead-end pages, can move from any vertex x to any other vertex y, aperiodic).

Example z u w v (R u, R v, R w, R z ) = (R u, R v, R w, R z ) 0 0 0 0 0 0. 3 3 3 0

Example (continued) (R u, R v, R w, R z ) = (R u, R v, R w, R z ) 0 0 0 0 0 0. 3 3 3 0 Can read-off" R w = R z /3, and propagate this into matrix: 0 0 (R u, R v, R w, R z ) = (R u, R v, R w, R z ) 0 0 0 0 0 0. 3 + 6 3 + 6 3 0

Example (continued) Now remove R w (keeping R w = R z /3 to side): (R u, R v, R z ) = (R u, R v, R z ) 0 0 0.

Example (continued) (R u, R v, R z ) = (R u, R v, R z ) 0 0 0 0 0 (R u, R v R z, R z ) = (R u, R v, R z ). 0 Middle equation reads R v R z = /(R z R v ), so R v = R z. Final equation says R z = /(R u + R v ), so R z = R u too. Solution: R u = R v = R z, R w = R z /3.

Alternative (Equivalent) Approach) Expand vector-matrix product: R u = R v + R w + 3 R z R v = R u + R w + 3 R z R w = 3 R z R z = R u + R v. Subtract the second equation from the first: R u R v = R v R u It follows that R v = R u. Substituting into the fourth equation: R z = R u. This method is probably preferable for such small examples.

Example (continued) z u w v Solutions are R u = R v = R z, R w = R z /3, i.e., (R u, R v, R w, R z ) = c(,, /3, ) where c is a constant. Not the same as counting in-degree (for this example).

General PageMark TM model Remove all our assumptions (dead-end pages, connectivity). λ cannot be assumed to be. Need to tinker the model. See Lecture Notes.

Further Reading Nothing in [GT] or [CLRS]. Papers on the web: An Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page, 998. Online at: http://www-db.stanford.edu/ backrub/google.html The PageRank Citation Ranking: Bringing Order to the Web, by Page, Brin, Motwani and Winograd, 998. Available online from: http://dbpubs.stanford.edu:8090/pub/999-66 Authoritative Sources in a Hyperlinked Environment, by Jon Kleinberg. Available Online from Jon Kleinberg s webpage: http://www.cs.cornell.edu/home/kleinber/