Lecture 7 Mathematics behind Internet Search

CCST907 Hidden Order in Daily Life: A Mathematical Perspective
Dr. S. P. Yung (907A), Dr. Z. Hua (907B)
Department of Mathematics, HKU

Outline
- Google is the preferred search engine
- Google's PageRank
- Ranking web pages by importance scores
- Iterative steps for finding importance scores
- Limiting Scores Theorem
- Perron-Frobenius Theorem

Google and other search engines
You probably use Google every day to search the internet. (Try searching for: Lloyd Shapley.)
There are many search engines other than Google, but Google takes the largest share of the search market, with 64.6% of searches. The other engines listed are Yahoo! [powered by Bing], Microsoft/Bing [homepage now similar to Google's], AOL, Ask [powered by HITS], My Web, and others.
Why do people choose Google?
Market share by number of searches in the US, August 2009. Source: Nielsen MegaView Search.

Google's PageRank
The PageRank algorithm:
- is one major ingredient of the Google search engine;
- rates the importance of each webpage on the internet;
- puts important or relevant pages first in response to search requests.
Consequences:
- more people use Google;
- webmasters want to boost the PageRank of their webpages;
- webpages with a high PageRank are traded.

How it begins
Background: web pages are text based (*.html), and hypertext is text with links to other text.
Jon Kleinberg (Cornell University) developed Hyperlink-Induced Topic Search (HITS) in 1998. It introduced a new idea: use the hyperlink structure of the web to improve search engine results.
Before this, most search engines used only textual content to return relevant documents, and the results were not very satisfactory. E.g., 907 pages contain "Lloyd Shapley", and the 900th is the one about the Nobel prize winner. The pages need to be put in some order of relevance.

How it begins
Meanwhile, two PhD students, Sergey Brin and Larry Page (Stanford University), were using similar (but not the same) ideas.
Page got his idea for ranking the importance of webpages from citations in the scientific literature. E.g., the published work of Shapley may be cited by thousands of different papers. Replacing citations by links, the same can be said about the importance of websites: the sites with the most links pointing to them should be considered more important.

How it begins
Page also realized that not all links are created equal: if a site is pointed to by an important site, this should also raise the importance of that site, much as a reference letter from a public figure increases one's standing.
Page began calling his link-rating scheme PageRank, and he, Brin and their PhD advisors worked together to develop an algorithm. With their search algorithm, Brin and Page started a business in their dorm rooms in 1998. It later became the giant Google, which benefited from the elegant mathematics behind PageRank.
The name Google was the result of misspelling "googol", which means $10^{100}$.

Ranking pages
Definition (Importance score): The importance score (or just score) is a quantitative rating of a web page's importance. We'll use a nonnegative number to represent it.
Definition (Backlinks): The inward links to a given page are the backlinks of that page.
Basic idea of PageRank for ranking pages: a page is important if it is pointed to by many other pages, or by other important pages.

Web as a directed graph
Suppose the web of interest has $n$ pages, each page indexed by a number $1, 2, \ldots, n$. Represent each forward link by an arrow.
[Figure: a 4-page web. Page 1 links to pages 2, 3 and 4; page 2 links to pages 3 and 4; page 3 links to page 1; page 4 links to pages 1 and 3.]
For example, page 4 has 2 backlinks.
Denote by $x_i$ the importance score of page $i$. How can we determine $x_1, x_2, x_3, x_4$ for our 4-page web?

Formulating importance scores
Three rules of the importance scores:
R0. Initially, without taking into account the hyperlink structure of the web, all pages are equally important, with the same score 1. For our 4-page web, $x_1 = x_2 = x_3 = x_4 = 1$.

Formulating importance scores
Three rules of the importance scores (continued): after taking into account the hyperlink structure of the web, page scores should be updated as follows.
R1. A page has a higher score if it is pointed to by more pages.
R2. A page has a higher score if it is pointed to by other important (i.e. higher-score) pages.
Note: R2 means that the score $x_i$ of page $i$ is also the influence power of page $i$ on other pages, and R1 means that the score $x_i$ reflects the total influence from all other pages pointing to page $i$.

Formulating importance scores
To update page scores according to R1, we count the number of backlinks to each page (R2 is ignored for now but will be dealt with later). Assuming each backlink carries the same power (weight 1), a page with $m$ backlinks gets the updated score $m$.
Consider page 1 in our 4-page web ($x_1 = ?$) and define
$$b_{1,i} = \begin{cases} 1, & \text{if page 1 has a backlink from page } i; \\ 0, & \text{otherwise.} \end{cases}$$
Then
$$x_1 = \text{total backlinks from pages 2, 3 and 4} = b_{1,2} + b_{1,3} + b_{1,4} = (0) + (1) + (1) = 2,$$
which is the sum of (power of page $i$, i.e. $x_i$) $\times\, b_{1,i}$ for $i = 2, 3, 4$.

Formulating importance scores
Update the other page scores according to R1:
$$x_2 = b_{2,1} + b_{2,3} + b_{2,4} = (1) + (0) + (0) = 1$$
$$x_3 = b_{3,1} + b_{3,2} + b_{3,4} = (1) + (1) + (1) = 3$$
$$x_4 = b_{4,1} + b_{4,2} + b_{4,3} = (1) + (1) + (0) = 2$$
As a result, $x_1, x_2, x_3, x_4$ are all updated simultaneously. However, we need to make better use of $x_i$ and $b_{j,i}$ in each term (power of page $i$, i.e. $x_i$) $\times\, b_{j,i}$.
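The unweighted count used for rule R1 can be sketched in a few lines of Python. This is only an illustration: the dictionary `links` is an assumed reading of the 4-page figure (page 1 links to 2, 3, 4; page 2 to 3, 4; page 3 to 1; page 4 to 1, 3), and the function name `backlink_counts` is made up for this sketch.

```python
# Assumed link structure of the 4-page example web: page -> pages it links to.
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}

def backlink_counts(links):
    """Return {page: number of backlinks}, giving every backlink weight 1."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            counts[target] += 1  # one more arrow pointing at `target`
    return counts

counts = backlink_counts(links)
print(counts)  # {1: 2, 2: 1, 3: 3, 4: 2}
```

Under this count page 3 comes out on top simply because it receives the most arrows, regardless of how many arrows each sender hands out.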

Formulating importance scores
After updating our 4-page web according to R1: $x_1 = 2$, $x_2 = 1$, $x_3 = 3$, $x_4 = 2$.
The way that we assign 1 to $b_{1,i}$ has a defect. The power of each arrow should not be identical: page 4 sends out two arrows (one of them to page 1), whereas page 3 sends out only one arrow (to page 1). So each arrow from page 4 should only be counted as $1/2$, and similarly for all the other pages. We now abolish the old weights and redo the calculations using new weights.

Formulating importance scores
Denote by $N_i$ the number of outward links from page $i$. Then all $N_1$ outward links from page 1 should carry the same weight, say $W_1$, given by
$$W_1 = \frac{1}{\text{No. of outward links from page 1}} = \frac{1}{N_1} = \frac{1}{3}.$$
For the other pages, we likewise set $W_i = 1/N_i$. Therefore we have $W_2 = 1/2$, $W_3 = 1$, $W_4 = 1/2$, and we use them in the updating.

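Formulating importance scores
Recall that the modified backlink weights for our 4-page web are $W_1 = 1/3$, $W_2 = 1/2$, $W_3 = 1$, $W_4 = 1/2$.
Consider again page 1 ($x_1 = ?$). Now $b_{1,i}$ should be set as
$$b_{1,i} = \begin{cases} W_i, & \text{if page 1 has a backlink from page } i; \\ 0, & \text{otherwise.} \end{cases}$$
Then
$$x_1 = b_{1,2} + b_{1,3} + b_{1,4} = (0) + (W_3) + (W_4) = (0) + (1) + \left(\tfrac{1}{2}\right) = \tfrac{3}{2},$$
which is just the sum of $x_i\, b_{1,i}$ for $i = 2, 3, 4$.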

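Formulating importance scores
Update the other page scores accordingly:
$$x_2 = b_{2,1} + b_{2,3} + b_{2,4} = (W_1) + (0) + (0) = \left(\tfrac{1}{3}\right) + (0) + (0) = \tfrac{1}{3}$$
$$x_3 = b_{3,1} + b_{3,2} + b_{3,4} = (W_1) + (W_2) + (W_4) = \left(\tfrac{1}{3}\right) + \left(\tfrac{1}{2}\right) + \left(\tfrac{1}{2}\right) = \tfrac{4}{3}$$
$$x_4 = b_{4,1} + b_{4,2} + b_{4,3} = (W_1) + (W_2) + (0) = \left(\tfrac{1}{3}\right) + \left(\tfrac{1}{2}\right) + (0) = \tfrac{5}{6}$$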

Formulating importance scores
After updating our 4-page web according to R1: $x_1 = \tfrac{3}{2}$, $x_2 = \tfrac{1}{3}$, $x_3 = \tfrac{4}{3}$, $x_4 = \tfrac{5}{6}$.
Note that the newly computed score of each page differs from the old score 1 that was used in the updating. Since each page now has a different score (i.e. power), R2 demands further updating using the newly computed page scores.
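As a rough check of the weighted update, the sketch below recomputes the scores with exact fractions. The `links` dictionary is again an assumed reading of the 4-page figure, and `weighted_update` is a hypothetical helper name. Since every page starts with score 1, multiplying by the sender's score changes nothing on this first pass.

```python
from fractions import Fraction

# Assumed link structure of the 4-page example web: page -> pages it links to.
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}

def weighted_update(scores, links):
    """One update: page i passes scores[i] * W_i to each page it links to."""
    new_scores = {page: Fraction(0) for page in links}
    for source, targets in links.items():
        weight = Fraction(1, len(targets))  # W_i = 1 / N_i
        for target in targets:
            new_scores[target] += scores[source] * weight
    return new_scores

initial = {page: Fraction(1) for page in links}  # rule R0: all scores are 1
first = weighted_update(initial, links)
print({page: str(score) for page, score in first.items()})
# {1: '3/2', 2: '1/3', 3: '4/3', 4: '5/6'}
```

The scores still add up to 4, because each page hands out exactly its own score, split evenly over its outward links.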

Formulating importance scores
For our 4-page web, the initial scores are $x_1 = x_2 = x_3 = x_4 = 1$.
After updating our 4-page web according to R1: $x_1 = \tfrac{3}{2}$, $x_2 = \tfrac{1}{3}$, $x_3 = \tfrac{4}{3}$, $x_4 = \tfrac{5}{6}$.
R2 demands a further update of the scores. E.g., the backlinks of page 1 come from pages 3 and 4, whose scores were initially 1 and 1 but are now $\tfrac{4}{3}$ and $\tfrac{5}{6}$; the score $x_1$ should be further updated to reflect these changes in the importance of each page.

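Formulating importance scores
Recall that the backlink weights for our 4-page web are $W_1 = 1/3$, $W_2 = 1/2$, $W_3 = 1$, $W_4 = 1/2$, and that
$$b_{1,i} = \begin{cases} W_i, & \text{if page 1 has a backlink from page } i; \\ 0, & \text{otherwise,} \end{cases} \qquad x_1 = b_{1,2} + b_{1,3} + b_{1,4}.$$
Consider again page 1 ($x_1 = ?$). The score should now be updated as
$$x_1 = x_2\, b_{1,2} + x_3\, b_{1,3} + x_4\, b_{1,4} = x_2 (0) + x_3 (W_3) + x_4 (W_4) = \tfrac{1}{3}(0) + \tfrac{4}{3}(1) + \tfrac{5}{6}\left(\tfrac{1}{2}\right) = \tfrac{7}{4}.$$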

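Formulating importance scores
Update the other page scores according to R2:
$$x_2 = x_1 b_{2,1} + x_3 b_{2,3} + x_4 b_{2,4} = x_1 (W_1) + x_3 (0) + x_4 (0) = \tfrac{3}{2}\left(\tfrac{1}{3}\right) + \tfrac{4}{3}(0) + \tfrac{5}{6}(0) = \tfrac{1}{2}$$
$$x_3 = x_1 b_{3,1} + x_2 b_{3,2} + x_4 b_{3,4} = x_1 (W_1) + x_2 (W_2) + x_4 (W_4) = \tfrac{3}{2}\left(\tfrac{1}{3}\right) + \tfrac{1}{3}\left(\tfrac{1}{2}\right) + \tfrac{5}{6}\left(\tfrac{1}{2}\right) = \tfrac{13}{12}$$
$$x_4 = x_1 b_{4,1} + x_2 b_{4,2} + x_3 b_{4,3} = x_1 (W_1) + x_2 (W_2) + x_3 (0) = \tfrac{3}{2}\left(\tfrac{1}{3}\right) + \tfrac{1}{3}\left(\tfrac{1}{2}\right) + \tfrac{4}{3}(0) = \tfrac{2}{3}$$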

Formulating importance scores
After further updating our 4-page web as required by R2: $x_1 = \tfrac{7}{4}$, $x_2 = \tfrac{1}{2}$, $x_3 = \tfrac{13}{12}$, $x_4 = \tfrac{2}{3}$.
Each newly computed score is again different from the old score! Is this an endless process?
The answer would be no if $x_1, x_2, x_3, x_4$ remain unchanged after a certain number of updates. Will this happen? An explanation is given below (using matrix-vector notation).

Iterative steps for finding scores
First, for a web of $n$ pages, introduce the notation $e_1, e_2, \ldots, e_n$, each equal to 1, for the initial equally important scores. For our 4-page web, the initial scores are $e_1 = e_2 = e_3 = e_4 = 1$.
We call the scores after the first update the importance scores after 1 iteration. For clarity we denote the scores by $x_1^{[k]}, x_2^{[k]}, \ldots, x_n^{[k]}$, where the superscript $[k]$ indicates that they are the scores after $k$ iterations. By convention, we denote the initial scores $e_1, e_2, \ldots, e_n$ by $x_1^{[0]}, x_2^{[0]}, \ldots, x_n^{[0]}$.

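Iterative steps for finding scores
The importance scores after 1 iteration are
$$x_1^{[1]} = \tfrac{3}{2}, \quad x_2^{[1]} = \tfrac{1}{3}, \quad x_3^{[1]} = \tfrac{4}{3}, \quad x_4^{[1]} = \tfrac{5}{6}.$$
Similarly, the importance scores after 2 iterations are
$$x_1^{[2]} = \tfrac{7}{4}, \quad x_2^{[2]} = \tfrac{1}{2}, \quad x_3^{[2]} = \tfrac{13}{12}, \quad x_4^{[2]} = \tfrac{2}{3}.$$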

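The link matrix
For a web of $n$ pages, all outward links from page $i$ carry the same weight:
$$W_i = \frac{1}{\text{No. of outward links from page } i}.$$
For a fixed page $j$, each backlink is associated with a number:
$$b_{j,i} = \begin{cases} W_i, & \text{if page } j \text{ has a backlink from page } i; \\ 0, & \text{otherwise.} \end{cases}$$
Define the link matrix of an $n$-page web as
$$A = \begin{pmatrix} 0 & b_{1,2} & b_{1,3} & \cdots & b_{1,n} \\ b_{2,1} & 0 & b_{2,3} & \cdots & b_{2,n} \\ b_{3,1} & b_{3,2} & 0 & \cdots & b_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_{n,1} & b_{n,2} & b_{n,3} & \cdots & 0 \end{pmatrix}.$$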

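The link matrix
The backlink weights for our 4-page web are $W_1 = 1/3$, $W_2 = 1/2$, $W_3 = 1$, $W_4 = 1/2$. Then the link matrix is
$$A = \begin{pmatrix} 0 & b_{1,2} & b_{1,3} & b_{1,4} \\ b_{2,1} & 0 & b_{2,3} & b_{2,4} \\ b_{3,1} & b_{3,2} & 0 & b_{3,4} \\ b_{4,1} & b_{4,2} & b_{4,3} & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & W_3 & W_4 \\ W_1 & 0 & 0 & 0 \\ W_1 & W_2 & 0 & W_4 \\ W_1 & W_2 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & 0 & 0 & 0 \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 \end{pmatrix}.$$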

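The link matrix
Interpretation of the link matrix:
$$\begin{pmatrix} 0 & 0 & W_3 & W_4 \\ W_1 & 0 & 0 & 0 \\ W_1 & W_2 & 0 & W_4 \\ W_1 & W_2 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & 0 & 0 & 0 \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 \end{pmatrix}$$
- Zero diagonal entries mean that there is no link from a page to itself.
- Column $i$ (vertical) represents the outward links from page $i$. The sum of each column is 1, which reflects that the initial power of each page is 1.
- Row $j$ (horizontal) represents the backlinks to page $j$: put $W_i$ at the $i$th position of row $j$ if page $i$ links to page $j$; otherwise put a 0 there.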
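The construction of the link matrix can be sketched as follows. This is a minimal illustration under the assumption that the 4-page web has the links listed in `links`; the helper name `link_matrix` is hypothetical, and pages are numbered from 1 while Python lists are indexed from 0.

```python
from fractions import Fraction

def link_matrix(links, n):
    """Build the n x n link matrix: entry (j, i) is W_i = 1/N_i if page i links to page j."""
    A = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            # Row = receiving page, column = sending page (shifted to 0-based indices).
            A[target - 1][source - 1] = Fraction(1, len(targets))
    return A

# Assumed link structure of the 4-page example web: page -> pages it links to.
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}
A = link_matrix(links, 4)
for row in A:
    print([str(entry) for entry in row])
# ['0', '0', '1', '1/2']
# ['1/3', '0', '0', '0']
# ['1/3', '1/2', '0', '1/2']
# ['1/3', '1/2', '0', '0']
```

Reading down column 1 shows page 1 splitting its power into three equal parts; reading across row 3 shows the three backlinks of page 3.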

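Iterative steps for finding scores
We used $e_1 = 1$, $e_2 = 1$, $e_3 = 1$, $e_4 = 1$ in the calculation of the importance scores after 1 iteration:
$$x_1^{[1]} = 1\,(0) + 1\,(1) + 1\left(\tfrac{1}{2}\right) = \tfrac{3}{2}$$
$$x_2^{[1]} = 1\left(\tfrac{1}{3}\right) + 1\,(0) + 1\,(0) = \tfrac{1}{3}$$
$$x_3^{[1]} = 1\left(\tfrac{1}{3}\right) + 1\left(\tfrac{1}{2}\right) + 1\left(\tfrac{1}{2}\right) = \tfrac{4}{3}$$
$$x_4^{[1]} = 1\left(\tfrac{1}{3}\right) + 1\left(\tfrac{1}{2}\right) + 1\,(0) = \tfrac{5}{6}$$
So $x_1^{[1]} = \tfrac{3}{2}$, $x_2^{[1]} = \tfrac{1}{3}$, $x_3^{[1]} = \tfrac{4}{3}$, $x_4^{[1]} = \tfrac{5}{6}$.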

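Iterative steps for finding scores
Then the calculation of the importance scores after 1 iteration,
$$x_1^{[1]} = e_2 (0) + e_3 (1) + e_4 \left(\tfrac{1}{2}\right) = \tfrac{3}{2}$$
$$x_2^{[1]} = e_1 \left(\tfrac{1}{3}\right) + e_3 (0) + e_4 (0) = \tfrac{1}{3}$$
$$x_3^{[1]} = e_1 \left(\tfrac{1}{3}\right) + e_2 \left(\tfrac{1}{2}\right) + e_4 \left(\tfrac{1}{2}\right) = \tfrac{4}{3}$$
$$x_4^{[1]} = e_1 \left(\tfrac{1}{3}\right) + e_2 \left(\tfrac{1}{2}\right) + e_3 (0) = \tfrac{5}{6}$$
can be written as a matrix-vector equation:
$$A \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & 0 & 0 & 0 \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix} = \begin{pmatrix} x_1^{[1]} \\ x_2^{[1]} \\ x_3^{[1]} \\ x_4^{[1]} \end{pmatrix}.$$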

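Iterative steps for finding scores
We used $x_1^{[1]} = \tfrac{3}{2}$, $x_2^{[1]} = \tfrac{1}{3}$, $x_3^{[1]} = \tfrac{4}{3}$, $x_4^{[1]} = \tfrac{5}{6}$ in the calculation of the importance scores after 2 iterations:
$$x_1^{[2]} = \tfrac{1}{3}(0) + \tfrac{4}{3}(1) + \tfrac{5}{6}\left(\tfrac{1}{2}\right) = \tfrac{7}{4}$$
$$x_2^{[2]} = \tfrac{3}{2}\left(\tfrac{1}{3}\right) + \tfrac{4}{3}(0) + \tfrac{5}{6}(0) = \tfrac{1}{2}$$
$$x_3^{[2]} = \tfrac{3}{2}\left(\tfrac{1}{3}\right) + \tfrac{1}{3}\left(\tfrac{1}{2}\right) + \tfrac{5}{6}\left(\tfrac{1}{2}\right) = \tfrac{13}{12}$$
$$x_4^{[2]} = \tfrac{3}{2}\left(\tfrac{1}{3}\right) + \tfrac{1}{3}\left(\tfrac{1}{2}\right) + \tfrac{4}{3}(0) = \tfrac{2}{3}$$
So $x_1^{[2]} = \tfrac{7}{4}$, $x_2^{[2]} = \tfrac{1}{2}$, $x_3^{[2]} = \tfrac{13}{12}$, $x_4^{[2]} = \tfrac{2}{3}$.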

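Iterative steps for finding scores
Then the calculation of the importance scores after 2 iterations,
$$x_1^{[2]} = x_2^{[1]} (0) + x_3^{[1]} (1) + x_4^{[1]} \left(\tfrac{1}{2}\right) = \tfrac{7}{4}$$
$$x_2^{[2]} = x_1^{[1]} \left(\tfrac{1}{3}\right) + x_3^{[1]} (0) + x_4^{[1]} (0) = \tfrac{1}{2}$$
$$x_3^{[2]} = x_1^{[1]} \left(\tfrac{1}{3}\right) + x_2^{[1]} \left(\tfrac{1}{2}\right) + x_4^{[1]} \left(\tfrac{1}{2}\right) = \tfrac{13}{12}$$
$$x_4^{[2]} = x_1^{[1]} \left(\tfrac{1}{3}\right) + x_2^{[1]} \left(\tfrac{1}{2}\right) + x_3^{[1]} (0) = \tfrac{2}{3}$$
can be written as a matrix-vector equation (with the same $A$):
$$A \begin{pmatrix} x_1^{[1]} \\ x_2^{[1]} \\ x_3^{[1]} \\ x_4^{[1]} \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & 0 & 0 & 0 \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 \end{pmatrix} \begin{pmatrix} x_1^{[1]} \\ x_2^{[1]} \\ x_3^{[1]} \\ x_4^{[1]} \end{pmatrix} = \begin{pmatrix} x_1^{[2]} \\ x_2^{[2]} \\ x_3^{[2]} \\ x_4^{[2]} \end{pmatrix}.$$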

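Iterative steps for finding scores
To satisfy R2, we can continue and calculate the importance scores after 3 iterations as
$$A \begin{pmatrix} x_1^{[2]} \\ x_2^{[2]} \\ x_3^{[2]} \\ x_4^{[2]} \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & 0 & 0 & 0 \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 \end{pmatrix} \begin{pmatrix} x_1^{[2]} \\ x_2^{[2]} \\ x_3^{[2]} \\ x_4^{[2]} \end{pmatrix} = \begin{pmatrix} x_1^{[3]} \\ x_2^{[3]} \\ x_3^{[3]} \\ x_4^{[3]} \end{pmatrix} = \begin{pmatrix} 1.41\ldots \\ 0.58\ldots \\ 1.16\ldots \\ 0.83\ldots \end{pmatrix}.$$
In general, the importance scores of our 4-page web after $k$ iterations are given by
$$\begin{pmatrix} x_1^{[k]} \\ x_2^{[k]} \\ x_3^{[k]} \\ x_4^{[k]} \end{pmatrix} = A \begin{pmatrix} x_1^{[k-1]} \\ x_2^{[k-1]} \\ x_3^{[k-1]} \\ x_4^{[k-1]} \end{pmatrix} = A^k \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix}.$$
(Is this an endless process?)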

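Iterative steps for finding scores
The iterative formula for the importance scores of our 4-page web can be generalized to a web of $n$ pages.
Let $A$ be the link matrix of a web of $n$ pages, and let $e = [e_1, e_2, \ldots, e_n]^T$ be the column vector with $n$ entries all equal to 1. The scores $x_1^{[1]}, \ldots, x_n^{[1]}$ after 1 iteration are given by
$$Ae = \begin{pmatrix} 0 & b_{1,2} & b_{1,3} & \cdots & b_{1,n} \\ b_{2,1} & 0 & b_{2,3} & \cdots & b_{2,n} \\ b_{3,1} & b_{3,2} & 0 & \cdots & b_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_{n,1} & b_{n,2} & b_{n,3} & \cdots & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} x_1^{[1]} \\ x_2^{[1]} \\ x_3^{[1]} \\ \vdots \\ x_n^{[1]} \end{pmatrix},$$
or equivalently $x^{[1]} = Ae$, where $x^{[1]}$ is the column vector with entries $x_1^{[1]}, \ldots, x_n^{[1]}$.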

Iterative steps for finding scores
The scores after 2 iterations are given by
$$x^{[2]} = A x^{[1]} = A(Ae) = A^2 e.$$
In general, the scores after $k$ iterations are given by
$$x^{[k]} = A x^{[k-1]} = A^k e. \qquad (*)$$
Note that this iteration process can go on indefinitely, producing newer and newer sets of scores.
Question: Will this iteration process go on and on without ever settling into a fixed set of scores?
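The iteration $x^{[k]} = A x^{[k-1]}$ is easy to try out. The sketch below uses exact fractions and hard-codes the 4-page link matrix as reconstructed from the lecture's weights; the helper names `mat_vec` and `scores_after` are invented for this illustration.

```python
from fractions import Fraction as F

# Link matrix of the 4-page example web (columns = sending page, rows = receiving page).
A = [[F(0), F(0), F(1), F(1, 2)],
     [F(1, 3), F(0), F(0), F(0)],
     [F(1, 3), F(1, 2), F(0), F(1, 2)],
     [F(1, 3), F(1, 2), F(0), F(0)]]

def mat_vec(A, x):
    """Return the matrix-vector product A x as a list."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def scores_after(A, k):
    """Return x^[k] = A^k e, starting from the all-ones vector e (rule R0)."""
    x = [F(1)] * len(A)
    for _ in range(k):
        x = mat_vec(A, x)
    return x

print([str(s) for s in scores_after(A, 2)])  # ['7/4', '1/2', '13/12', '2/3']
print([str(s) for s in scores_after(A, 3)])  # ['17/12', '7/12', '7/6', '5/6']
```

Repeated multiplication by $A$ is used here instead of forming the power $A^k$ explicitly; both give the same vector, but the former needs only matrix-vector products.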

Limiting Scores Theorem
Thanks to nice properties of $A$, here is an answer:
Theorem (Limiting scores): Suppose the web is interconnected in such a way that one can travel from any given page to any other given page through the existing links (in this case we say the link matrix is irreducible). Let $x_1^{[k]}, x_2^{[k]}, \ldots, x_n^{[k]}$ denote the scores of the pages after $k$ iterations. Then, as $k$ becomes bigger and bigger, either
(1) the set of scores will converge to a unique set of limit scores, or,
(2) when (1) does not hold, for each page $i$ the average of its scores $x_i^{[0]}, x_i^{[1]}, \ldots, x_i^{[k]}$ will approach a limit score.

Limiting Scores Theorem
According to the Theorem, if the web is well-connected enough, then one may just compute the scores $x_1^{[k]}, x_2^{[k]}, \ldots, x_n^{[k]}$ by the iterative formula $(*)$, $x^{[k]} = A x^{[k-1]} = A^k e$, and observe whether these sets of scores approach any set of limit scores when $k$ becomes big.
If these sets of scores approach some limit set of scores, which is case (1) of the Theorem, then these limit scores will be the final scores of the pages.
On the other hand, if case (1) of the Theorem does not hold, then case (2) must hold. In this case the average of $x_i^{[0]}, x_i^{[1]}, \ldots, x_i^{[k]}$, when $k$ is big enough, will be taken as the final score of page $i$.
The above establishes the final ranking of the pages.
Q: How do we find these final scores?
A: Solve $Av = v$, with $A$ being the link matrix.
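Case (2) of the Theorem may be easier to picture with a tiny example. The 3-page web below is hypothetical and not from the lecture: page 1 links to pages 2 and 3, and pages 2 and 3 each link back to page 1. Its link matrix is irreducible, yet the scores keep alternating between two vectors, so only the averages settle down. The helper names are made up for this sketch.

```python
from fractions import Fraction as F

# Hypothetical 3-page web: 1 -> 2, 1 -> 3, 2 -> 1, 3 -> 1.
A = [[F(0), F(1), F(1)],
     [F(1, 2), F(0), F(0)],
     [F(1, 2), F(0), F(0)]]

def mat_vec(A, x):
    """Return the matrix-vector product A x as a list."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def score_history(A, k):
    """Return [x^[0], x^[1], ..., x^[k]], starting from all-ones scores."""
    history = [[F(1)] * len(A)]
    for _ in range(k):
        history.append(mat_vec(A, history[-1]))
    return history

def average_scores(history):
    """Average each page's score over all vectors in the history."""
    n = len(history[0])
    return [sum(x[i] for x in history) / len(history) for i in range(n)]

history = score_history(A, 99)
print([str(s) for s in history[0]], [str(s) for s in history[1]])
# ['1', '1', '1'] ['2', '1/2', '1/2']
print(history[98] == history[0], history[99] == history[1])  # True True
averages = average_scores(history)
print([str(s) for s in averages])  # ['3/2', '3/4', '3/4']
```

The averaged vector (3/2, 3/4, 3/4) is left unchanged by $A$, which matches the recipe "solve $Av = v$" for finding the final scores.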

Stochastic matrix and 1-eigenvector
The Limiting Scores Theorem says that as $k$ gets bigger and bigger, either $x^{[k]}$ will approach a vector $v$, or the average of $x^{[0]}, x^{[1]}, \ldots, x^{[k]}$ will approach a vector $v$. In fact, in either case, $v$ will satisfy $Av = v$.
Definition (1-eigenvector): Given an $n \times n$ matrix
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix},$$
we say that a vector $w$ is a 1-eigenvector of $A$ if $w$ is not the zero vector and it satisfies $Aw = w$.

Stochastic matrix and 1-eigenvector
Definition (stochastic matrix): An $n \times n$ matrix $A$ is called a (column) stochastic matrix if all its entries are nonnegative and all its column sums are 1.
It is not difficult to see that if every page of a web has link(s) to some other page(s), then the link matrix of the web,
$$A = \begin{pmatrix} 0 & b_{1,2} & b_{1,3} & \cdots & b_{1,n} \\ b_{2,1} & 0 & b_{2,3} & \cdots & b_{2,n} \\ b_{3,1} & b_{3,2} & 0 & \cdots & b_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_{n,1} & b_{n,2} & b_{n,3} & \cdots & 0 \end{pmatrix},$$
is a stochastic matrix.
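The definition translates directly into a check. In the sketch below, `is_column_stochastic` is a hypothetical helper name; the first matrix is the 4-page link matrix, and the second is an invented 2-page web whose page 2 has no outward links, which shows why the "every page links somewhere" condition matters.

```python
from fractions import Fraction as F

def is_column_stochastic(A):
    """True if all entries are nonnegative and every column sums to exactly 1."""
    n = len(A)
    nonnegative = all(entry >= 0 for row in A for entry in row)
    columns_sum_to_one = all(sum(A[j][i] for j in range(n)) == 1 for i in range(n))
    return nonnegative and columns_sum_to_one

# Link matrix of the 4-page example web.
A = [[F(0), F(0), F(1), F(1, 2)],
     [F(1, 3), F(0), F(0), F(0)],
     [F(1, 3), F(1, 2), F(0), F(1, 2)],
     [F(1, 3), F(1, 2), F(0), F(0)]]

# Hypothetical 2-page web: page 1 links to page 2, but page 2 links nowhere,
# so column 2 is all zeros and its sum is 0 rather than 1.
dangling = [[F(0), F(0)],
            [F(1), F(0)]]

print(is_column_stochastic(A), is_column_stochastic(dangling))  # True False
```

Exact fractions are used so that the comparison with 1 is not disturbed by floating-point rounding.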

Stochastic matrix and 1-eigenvector
To find the 1-eigenvector $v$ of a stochastic matrix $A$, one can compute
$$x^{[0]} = e, \quad x^{[1]} = A x^{[0]}, \quad \ldots, \quad x^{[k]} = A x^{[k-1]}, \quad \ldots$$
When $k$ is large enough, $x^{[k]}$ will approach $v$ (in case (1) of the Limiting Scores Theorem).
Remark: starting with the vector $e$ is required by R0, which states that initially all pages are equally important, with the same score.
In many cases, it does not take many iterations to observe that the largest entries of $x^{[k]}$ already sit in particular positions, which identify the most important pages.

Perron-Frobenius Theorem
In fact, the Limiting Scores Theorem is a consequence of the following.
Theorem (Perron-Frobenius): Any stochastic matrix $A$ has a 1-eigenvector $v$ whose entries are all nonnegative, i.e. $v$ satisfies $Av = v$. Moreover, if $A$ is irreducible, then it has a 1-eigenvector $v$ whose entries are all positive.
For the link matrix of a web, if it is an irreducible stochastic matrix, then its 1-eigenvector $v$ with positive entries determines the ranking of the pages: the bigger the $i$-th entry of $v$, the higher the rank of page $i$.

When A is not irreducible
Suppose a web is not well-connected, so that from some page one cannot reach some other page via a route of existing links. Then the corresponding link matrix $A$ is not irreducible, and the Limiting Scores Theorem cannot be applied.
For simplicity we assume that every page of the web has some links pointing to other pages, so that the link matrix does not have a column of zeros and hence is still a stochastic matrix.
For this stochastic but not irreducible matrix $A$, define the matrix
$$S = (1 - \alpha) A + \alpha E,$$
where $0 < \alpha \le 1$ and $E$ is the $n \times n$ matrix with all entries equal to $\tfrac{1}{n}$. Then all entries of $S$ are positive, and $S$ is an irreducible stochastic matrix. We may choose $\alpha$ to be a very small positive number so that $S$ is very close to $A$.
Using $S$ as the link matrix instead of $A$, we may compute the scores $x_i^{[k]}$ as before. It turns out that, since all entries of $S$ are positive, case (1) of the Limiting Scores Theorem will hold. We may then take the limit scores as the final page scores of the web.
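To see what the modification does, the sketch below builds $S$ for a small web that is not irreducible. The web is hypothetical (two separate pairs of pages, 1 and 2 linking to each other and 3 and 4 linking to each other), the value $\alpha = 1/10$ is an arbitrary choice for illustration, and `damped_matrix` is an invented name.

```python
from fractions import Fraction as F

def damped_matrix(A, alpha):
    """Return S = (1 - alpha) * A + alpha * E, where every entry of E is 1/n."""
    n = len(A)
    return [[(1 - alpha) * A[j][i] + alpha * F(1, n) for i in range(n)]
            for j in range(n)]

# Hypothetical reducible web: pages 1 <-> 2 and pages 3 <-> 4, with no links between the pairs.
A = [[F(0), F(1), F(0), F(0)],
     [F(1), F(0), F(0), F(0)],
     [F(0), F(0), F(0), F(1)],
     [F(0), F(0), F(1), F(0)]]

alpha = F(1, 10)  # assumed value, chosen only for this example
S = damped_matrix(A, alpha)

all_positive = all(entry > 0 for row in S for entry in row)
column_sums = [sum(S[j][i] for j in range(4)) for i in range(4)]
print(all_positive, [str(s) for s in column_sums])  # True ['1', '1', '1', '1']
print(str(S[0][0]), str(S[0][1]))  # 1/40 37/40
```

Every zero of $A$ becomes the small positive number $\alpha/n$, so every page can now reach every other page in one step, while the column sums stay equal to 1.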

Google's PageRank (cont.)
Main idea of Google's PageRank method:
- Form a (huge!) stochastic matrix $A$ which represents the link structure of the WWW.
- Form the matrix $S = (1 - \alpha) A + \alpha E$, where $\alpha$ is a small positive number.
- Compute the 1-eigenvector $v$ of $S$ which has all entries positive.
- The ranking of webpages follows the magnitude of the entries of $v$.
To find the 1-eigenvector $v$ of $S$, one can compute
$$x^{[0]} = e, \quad x^{[1]} = S x^{[0]}, \quad \ldots, \quad x^{[k]} = S x^{[k-1]}, \quad \ldots$$
When $k$ is large enough, $x^{[k]}$ will approach $v$.
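The steps above can be sketched end to end without ever storing $S$, since $(Ex)_j$ is just the average of the current scores. Everything specific here is an assumption made for illustration: the 5-page web, the value $\alpha = 0.15$, the stopping tolerance, and the function name `pagerank`. It is a toy version of the idea, not Google's actual implementation, and it assumes every page has at least one outward link.

```python
def pagerank(links, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate x <- S x with S = (1 - alpha) A + alpha E, starting from all ones."""
    pages = sorted(links)
    n = len(pages)
    x = {page: 1.0 for page in pages}  # rule R0
    for _ in range(max_iter):
        # alpha * (E x)_j is the same for every page j: alpha * (sum of scores) / n.
        base = alpha * sum(x.values()) / n
        new = {page: base for page in pages}
        for source, targets in links.items():
            share = (1 - alpha) * x[source] / len(targets)  # (1 - alpha) * x_i * W_i
            for target in targets:
                new[target] += share
        if max(abs(new[page] - x[page]) for page in pages) < tol:
            return new
        x = new
    return x

# Hypothetical web that is not irreducible: nobody links to page 5,
# and pages 1, 2 cannot be reached from pages 3, 4.
links = {1: [2], 2: [1], 3: [4], 4: [3], 5: [3, 4]}
v = pagerank(links)
print({page: round(score, 4) for page, score in v.items()})
# {1: 1.0, 2: 1.0, 3: 1.425, 4: 1.425, 5: 0.15}
```

Pages 3 and 4 end up on top because they also collect page 5's power, while page 5, which has no backlinks, keeps only the baseline score $\alpha$.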

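Example
For our example:
$$A = \begin{pmatrix} 0 & 0 & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & 0 & 0 & 0 \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 0 & 0 \end{pmatrix}$$
After verifying that $A$ is irreducible, there is no need to define $S$, and we may compute the scores $x^{[k]}$ directly using $A$:
$$x^{[1]} = Ae = \begin{pmatrix} 1.5 \\ 0.333 \\ 1.333 \\ 0.833 \end{pmatrix}, \qquad x^{[2]} = A x^{[1]} = \begin{pmatrix} 1.75 \\ 0.5 \\ 1.083 \\ 0.667 \end{pmatrix},$$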

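Example (cont.)
$$x^{[3]} = A x^{[2]} = \begin{pmatrix} 1.417 \\ 0.583 \\ 1.167 \\ 0.833 \end{pmatrix}, \qquad x^{[4]} = A x^{[3]} = \begin{pmatrix} 1.583 \\ 0.472 \\ 1.181 \\ 0.764 \end{pmatrix},$$
$$x^{[5]} = A x^{[4]} = \begin{pmatrix} 1.563 \\ 0.528 \\ 1.146 \\ 0.764 \end{pmatrix}, \qquad x^{[6]} = A x^{[5]} = \begin{pmatrix} 1.528 \\ 0.521 \\ 1.167 \\ 0.785 \end{pmatrix}.$$
One may pick up the trend from these results and conclude that webpage 1 should rank 1st, webpage 3 ranks 2nd, webpage 4 ranks 3rd, and webpage 2 ranks 4th.
In fact the 1-eigenvector of $A$ is
$$\begin{pmatrix} 1.54839\ldots \\ 0.51613\ldots \\ 1.16129\ldots \\ 0.77419\ldots \end{pmatrix}.$$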
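The limiting vector can be double-checked exactly. The fractions below (48/31, 16/31, 36/31, 24/31) are not stated in the lecture; they are a candidate obtained by solving $Av = v$ with the entries summing to 4, and the sketch merely verifies that they satisfy $Av = v$ and agree with the decimals above. The helper name `mat_vec` is invented.

```python
from fractions import Fraction as F

# Link matrix of the 4-page example web.
A = [[F(0), F(0), F(1), F(1, 2)],
     [F(1, 3), F(0), F(0), F(0)],
     [F(1, 3), F(1, 2), F(0), F(1, 2)],
     [F(1, 3), F(1, 2), F(0), F(0)]]

def mat_vec(A, x):
    """Return the matrix-vector product A x as a list."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Candidate 1-eigenvector, scaled so that its entries sum to 4 like the initial scores.
v = [F(48, 31), F(16, 31), F(36, 31), F(24, 31)]

print(mat_vec(A, v) == v, sum(v) == 4)  # True True
print([round(float(entry), 5) for entry in v])  # [1.54839, 0.51613, 1.16129, 0.77419]

# Rank the pages by their entry in v, largest first.
ranking = sorted(range(1, 5), key=lambda page: v[page - 1], reverse=True)
print(ranking)  # [1, 3, 4, 2]
```

The ordering 1, 3, 4, 2 is the same one already visible in $x^{[6]}$, which illustrates the remark that the leading positions tend to show up after only a few iterations.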

Assignment 7 Q. Assignment 7 Due date: Oct 0 (Monday) before :00pm. Please put your assignment into the assignment box of this course. Please write your tutorial group number on the right hand corner of your assignment. Question Your Google Twin is the person you find as a search result for your own full name on Google. Consider the 4-page web example we have been using in the lecture. The owner of page finds that page is her Google Twin. Upset, she creates a new page 5 that links to page and page also links to page 5. Will this help boost her PageRank score?

Assignment 7 Q2
Question 2
Let $A$ be a $2 \times 2$ matrix
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$
An eigenvalue of $A$ is defined to be a number $\lambda$ such that there exists a nonzero (column) vector $v$ with $Av = \lambda v$. For a $2 \times 2$ matrix, the eigenvalues of $A$ are the roots of the polynomial equation
$$\lambda^2 - (a + d)\lambda + (ad - bc) = 0.$$
Now suppose $A$ is a column stochastic matrix. Show that there exists no eigenvalue $\lambda$ of $A$ such that $\lambda > 1$.

References
- The $25,000,000,000 eigenvector: the linear algebra behind Google, K. Bryan & T. Leise, SIAM Review, 48 (2006), 569-581.
- Deeper inside PageRank, Amy N. Langville & Carl D. Meyer, Internet Mathematics, vol. 1, no. 3, 2003, pp. 335-380.
- The Google Story, David A. Vise, Pan Books, 2006.
- How Google Finds Your Needle in the Web's Haystack (www.ams.org/samplings/feature-column/fcarc-pagerank).