THEORY OF SEARCH ENGINES. CONTENTS 1. INTRODUCTION 1 2. RANKING OF PAGES 2 3. TWO EXAMPLES 4 4. CONCLUSION 5 References 5

Similar documents
COMPOUND MATRICES AND THREE CELEBRATED THEOREMS

Graph Models The PageRank Algorithm

Link Analysis. Leonid E. Zhukov

INFORMATION-THEORETIC EQUIVALENT OF RIEMANN HYPOTHESIS

VISUALIZATION OF INTUITIVE SET THEORY

Page rank computation HPC course project a.y

Data Mining and Matrices

Data Mining Recitation Notes Week 3

1998: enter Link Analysis

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Calculating Web Page Authority Using the PageRank Algorithm

On the mathematical background of Google PageRank algorithm

Inf 2B: Ranking Queries on the WWW

Applications to network analysis: Eigenvector centrality indices Lecture notes

Application. Stochastic Matrices and PageRank

eigenvalues, markov matrices, and the power method

COMPOUND MATRICES AND THREE CELEBRATED THEOREMS

Justification and Application of Eigenvector Centrality

Google Page Rank Project Linear Algebra Summer 2012

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

How does Google rank webpages?

The Google Markov Chain: convergence speed and eigenvalues

IR: Information Retrieval

Uncertainty and Randomization

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

Math 304 Handout: Linear algebra, graphs, and networks.

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

A Note on Google s PageRank

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

ECEN 689 Special Topics in Data Science for Communications Networks

googling it: how google ranks search results Courtney R. Gibbons October 17, 2017

1 Searching the World Wide Web

Chapter 10. Finite-State Markov Chains. Introductory Example: Googling Markov Chains

The Second Eigenvalue of the Google Matrix

A New Method to Find the Eigenvalues of Convex. Matrices with Application in Web Page Rating

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank

Node Centrality and Ranking on Networks

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Algebraic Representation of Networks

Applications. Nonnegative Matrices: Ranking

Updating PageRank. Amy Langville Carl Meyer

Perron Frobenius Theory

Affine iterations on nonnegative vectors

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

MATH36001 Perron Frobenius Theory 2015

Fiedler s Theorems on Nodal Domains

Section 1.7: Properties of the Leslie Matrix

Definition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states.

Fiedler s Theorems on Nodal Domains

Link Analysis Ranking

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

ATotallyDisconnectedThread: Some Complicated p-adic Fractals

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Matroids and Greedy Algorithms Date: 10/31/16

Lecture 7 Mathematics behind Internet Search

Lecture: Local Spectral Methods (1 of 4)

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

Announcements: Warm-up Exercise:

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Google and Biosequence searches with Markov Chains

arxiv:cond-mat/ v1 3 Sep 2004

Authoritative Sources in a Hyperlinked Enviroment

1 T 1 = where 1 is the all-ones vector. For the upper bound, let v 1 be the eigenvector corresponding. u:(u,v) E v 1(u)

Monte Carlo methods in PageRank computation: When one iteration is sufficient

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Google Matrix, dynamical attractors and Ulam networks Dima Shepelyansky (CNRS, Toulouse)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Markov Chains, Stochastic Processes, and Matrix Decompositions

Link Analysis. Stony Brook University CSE545, Fall 2016

Online Social Networks and Media. Link Analysis and Web Search

Random point patterns and counting processes

Conditioning of the Entries in the Stationary Vector of a Google-Type Matrix. Steve Kirkland University of Regina

On the adjacency matrix of a block graph

Markov Chains for Biosequences and Google searches

Degree Distribution: The case of Citation Networks

Administrivia. Blobs and Graphs. Assignment 2. Prof. Noah Snavely CS1114. First part due tomorrow by 5pm Second part due next Friday by 5pm

A probabilistic proof of Perron s theorem arxiv: v1 [math.pr] 16 Jan 2018

Link Mining PageRank. From Stanford C246

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Computing PageRank using Power Extrapolation

MATH 56A: STOCHASTIC PROCESSES CHAPTER 1

Markov Chains. As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006.

The Giving Game: Google Page Rank

YOU CAN BACK SUBSTITUTE TO ANY OF THE PREVIOUS EQUATIONS

SUPPLEMENTARY MATERIALS TO THE PAPER: ON THE LIMITING BEHAVIOR OF PARAMETER-DEPENDENT NETWORK CENTRALITY MEASURES

A MATRIX ANALYSIS OF FUNCTIONAL CENTRALITY MEASURES

Applications of The Perron-Frobenius Theorem

Link Analysis and Web Search

MATH3200, Lecture 31: Applications of Eigenvectors. Markov Chains and Chemical Reaction Systems

Where and how can linear algebra be useful in practice.

Product distance matrix of a tree with matrix weights

MATH 5524 MATRIX THEORY Pledged Problem Set 2

Root systems and optimal block designs

Web Structure Mining Nodes, Links and Influence

(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) 1, since rank(a) 3. Ax =

Transcription:

THEORY OF SEARCH ENGINES K. K. NAMBIAR ABSTRACT. Four different stochastic matrices, useful for ranking the pages of the Web are defined. The theory is illustrated with examples. Keywords Search engines, Page rank, Web. CONTENTS 1. INTRODUCTION 1 2. RANKING OF PAGES 2 3. TWO EXAMPLES 4 4. CONCLUSION 5 References 5 1. INTRODUCTION Granted that the Web contains an incredible amount of information, it should be admitted that the problem of finding the relevant one of interest to you is by no means easy. Recently, there has been attempts to circumvent this difficulty by ranking the pages of the web according to their importance, and list only those pages of high rank containing the required information. The popular Google employs this method [2] in their search engine, and much of their success is attributed to their method of ranking the pages in the Web. The purpose of this paper is to delineate the theory that can be used in ranking the pages and hyperlinks of the Web. It is obvious that the Web can be considered as a directed graph with pages as nodes and hyperlinks as directed edges. In the Web graph (webgraph), we will say that two nodes are connected, if there are directed paths both ways between the two nodes. Clearly, connectedness is an equivalence relation and hence it splits the set of nodes of the webgraph into disjoint subsets. The subgraph induced by such a subset, we will call a community graph or just a community. If we coalesce all the nodes of a community and consider it as a single node (community node), the webgraph reduces to an acyclic graph, in which no two community nodes can have directed paths both ways. This graph we will call the reduced webgraph and represent the indegree of each community node by e k. We define the exponential Date: January 8, 2001. 1

2 K. K. NAMBIAR mean of {e k } as ν = k e k c k where c k = e k /E and E = k e k. Since the indegree of a community node gives an indication of its importance in the Web, it will not be unreasonable for us to give special importance to a community, whose indegree is not less than the exponential mean ν. Consider a Web surfer starting from an arbitrary node in a community and surfing within the community in all possible ways. The number of different paths the surfer can take can be infinite and at no time will the surfer be faced with the situation when he does not have a hyperlink available to click. We define the freedom in a community as the logarithm (base 2) of the increase in the number of different paths per mouse click. Specifically, we define freedom as log C = lim 2 P N N N where P N is the number of different paths possible with N clicks. We define activity in the community as A = 2 C. It is known that the activity of the community will not depend on the initial starting point of the surfer. 2. RANKING OF PAGES We will show that the community graph can be used to calculate the activity in the community. The graph can be used also to rank the pages and hyperlinks of the community. The crucial concepts we will use to calculate these are the Perron- Frobenius theorem and Shannon s theory of communication [1, 3]. The communities in the Web consist of millions of pages and hyperlinks, yet it is a finite graph and hence we can study their properties by analyzing graphs with a few number of nodes, which is what we will do here. The usual way to analyze a matrix is through its characteristic equation and eigenvectors, here we do the same thing, except that we start off with a slight variation, we call, intrinsic equation. Our first job is to characterize a community by a matrix. An example of our notation is, 3s s 2s N(s) = 3s 5s 6s s s 4s which characterizes a community with three pages. The (i, j) element specifies the number of hyperlinks from page i to page j. For example, the (2, 1) element 3s says that there are three hyperlinks from page 2 to page 1, and similar explanation goes for other elements also. The purpose of the variable s will be clear from the next example. The above notation can be extended slightly to include hyperlinks that go through intermediate nodes. Consider the matrix, [ 0 s N(s) = 2 + s 4 ] s 3 + s 6 s 2 + s 4.

THEORY OF SEARCH ENGINES 3 Here, the (2, 1) element s 3 + s 6 says that there are two ways to reach page 1 from page 2, one by taking three hops, another by taking six hops, through pages of no interest to us. We will use these two matrices as examples to illustrate our method of ranking pages and links. We need some definitions and notations to proceed any further. N(s): The matrix which characterizes the community, as in the two examples above. I: A unit matrix of appropriate dimension. I N(s): The intrinsic matrix of the community. The connection between the usual characteristic matrix and the intrinsic matrix can be recognized, if we put s = λ 1. det{i N(s)} = 0: The intrinsic equation of the community. µ: The smallest positive root of the intrinsic equation. Perron-Frobenius theorem assures us that there will always be a smallest positive root. C: A measure of the freedom in the network, as defined earlier. Shannon s theory of communication tells us that C = log 2 µ. A: A measure of the activity in the network, as defined earlier. Shannon s theory of communication tells us that A = µ 1. p: The intrinsic column vector of N(s). The intrinsic column vector is defined as the solution of N(µ)p = p, with positive elements, the sum of the elements being unity. Perron-Frobenius theorem assures us the existence of such a p. We will call this matrix, customer-ranking matrix or c-matrix, since the magnitude of its elements gives the prominence of the page as a customer. q: The intrinsic row vector of N(s). The intrinsic row vector is defined as the solution of qn(µ) = q, with positive elements, the sum of the elements being unity. Perron-Frobenius theorem assures us the existence of such a q. We will call this matrix, vendor-ranking matrix or v-matrix, since the magnitude of the elements of the matrix gives the prominence of the page as a vendor. r: A row matrix that can be used to rank the pages of the community. It is defined as (qp) 1 [p j q j ]. We will call this matrix, r-matrix. P: A square matrix that can be used to rank pairs of pages in the community. It is defined as [p i q j ]. We will call this matrix, link-ranking matrix or l-matrix, since the magnitude of the elements of the matrix gives the prominence of links between pairs of pages. The diagonal elements here give the ranking of the corresponding pages. e k : The indegree of the k th community node in the Web. E: The total number of edges in the reduced graph, k e k, as defined earlier. c k : The fractional degree of the k th community node, e k /E, as defined earlier.

4 K. K. NAMBIAR 3. TWO EXAMPLES Example 1. 3s s 2s N(s) = 3s 5s 6s s s 4s The required computations here are well-known and hence we omit all details. det{i N(s)} = 0 gives the roots as 0.5, 0.5, and 0.125. µ = 0.125 C = log 2 µ = 3 A = µ 1 = 8 p = 0.2 0.6 0.2 q = [ 0.25 0.25 0.50 ] r = 10 [ ] 0.05 0.15 0.10 3 0.05 0.05 0.10 P = 0.15 0.15 0.30 0.05 0.05 0.10 Example 2. [ 0 s N(s) = 2 + s 4 ] s 3 + s 6 s 2 + s 4. µ = 0.688278 C = log 2 µ = 0.538937 A = µ 1 = 1.4529 [ ] 0.411122 p = 0.588878 q = [ 0.301855 0.698145 ] r = [ 0.231865 0.768135 ] [ ] 0.124099 0.287023 P = 0.177756 0.411122

THEORY OF SEARCH ENGINES 5 4. CONCLUSION Any of the four matrices p, q, r, or P can be used to rank the pages of a community. For example, if we are interested in pages considered as vendors, we will compare the elements of the matrix q and list those pages corresponding to the larger elements of q. The ranking of the vendor pages of the whole webgraph will require the comparison of the elements of the entire set of matrices {c k q k } where q k is the v-matrix of the k th community. From the published literature, it seems that a variation of this method has been used by Google in their search engine. As an aside, we may state that the matrix N(s) in Example 2 can be considered as a model for the telegraph channel, as defined by Shannon [3]. This shows that there is considerable overlap between the theories of search engines and communication channels. For example, the freedom in the community in Example 2 is the same as the capacity of the telegraph channel. REFERENCES 1. R. B. Bapat and T. E. S. Raghavan, Nonnegative Matrices and Applications, Cambridge University Press, Cambrige, UK, 1997. 2. Sergey Brin and Laurence Page, The Anatomy of a Large Scale Hypertextual Web Search Engine, Proceedings of the Seventh International WWW Conference (Brisbane, Australia), April 14-18, 1998. 3. C. E. Shannon, The Mathematical Theory of Communication, University of Illinois Press, Chicago, 1963. FORMERLY, JAWAHARLAL NEHRU UNIVERSITY, NEW DELHI, 110067, INDIA Current address: 1812 Rockybranch Pass, Marietta, Georgia, 30066-8015 E-mail address: nambiar@mediaone.net