How to optimize the personalization vector to combat link spamming


Delft University of Technology
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft Institute of Applied Mathematics

How to optimize the personalization vector to combat link spamming

Report for the Delft Institute of Applied Mathematics as part of BACHELOR OF SCIENCE in APPLIED MATHEMATICS

by Jenny Tjan

Delft, the Netherlands
June 2016

Copyright © 2016 by Jenny Tjan. All rights reserved.


BSc report APPLIED MATHEMATICS

Jenny Tjan
Delft University of Technology

Thesis advisor: Dr.ir. M.B. van Gijzen

Other members of the graduation committee: Dr. J.L.A. Dubbeldam, Dr.ir. M. Keijzer

June 2016, Delft


Abstract

Google uses the PageRank algorithm to rank the web. The algorithm models the behavior of a random surfer: the surfer either follows an outlink or goes to an arbitrary page by entering a URL into the address bar, which is called teleportation. The probability that the surfer teleports to a given page is specified in the personalization vector. The PageRank algorithm returns a PageRank score for each page, and this score determines the position of the page in the search results: the higher the score, the higher the page appears on the list. However, some people want to increase their PageRank score artificially. Link spamming is the practice of adding and removing links between pages with the sole purpose of increasing a PageRank score. We want to find a method to lower the effect of link spamming. One way is to change the personalization vector: if we restrict the pages the random surfer can teleport to, we can prevent the surfer from teleporting to a page that is suspected of link spamming. So one way to suppress the effect of link spamming is to optimize the personalization vector. In order to combat link spamming, we have examined the role and influence of the personalization vector, and we describe two methods to optimize it. The first method generates a number of personalization vectors and, for each of them, calculates the sum of the PageRank scores of the pages suspected of link spamming; the lowest sum identifies an optimal personalization vector. The second method uses linear programming: we minimize the PageRank scores of the suspected pages and recover the optimal personalization vector. The results were not always useful. For that reason, we added two extra requirements: setting an upper limit, for every page, on the probability that the surfer can teleport to it, and suppressing the pages in the irreducible subsets. If a surfer enters an irreducible subset, it can never leave the subset by following outlinks.

Preface

Finally, this thing in your hands is the last thing I had to do to get my bachelor degree. These three years went by faster than I thought. One of the many reasons why I chose this study was wondering why you can only study mathematics at a university. The other question was why there were two kinds, mathematics and applied mathematics: what was the difference? Instead of asking around, I enrolled at the university just to see for myself. I can say for sure that changing studies was one of the "yolo" moments of my life. The same goes for walking into the numerical department looking for an interesting bachelor project. Although I find probability theory very interesting, somehow I ended up here. I do not regret my decision at all and had a pretty fun time sitting on the third floor every day.

If I had to thank everyone who has helped me through this bachelor journey, I would have to kill a few trees to write down all the names, and the previous versions of this report have already been printed out at least 10 times (by the time you read this, around 15 times). So I am going to be environment-friendly and just thank my supervisor Martin van Gijzen. He has helped me with my bachelor project by giving me advice, sharing his wisdom and checking my spelling. I especially want to thank these people for proofreading my thesis: Dyan Konijnenberg and Pim Otte. Small thanks to K.P. Hart, Rowan Kerstens and Vivian van der Heul for reading and giving me feedback on the abstract, introduction and conclusion, so I know that everybody, whatever their level of mathematics, can understand what this report is about. Thanks as well to Dylan Huizing for putting on the finishing touches and to Tim Hegeman for the light bulb moment. Do not forget Joanne Tjan, for moral support and doing absolutely nothing.

I hope you will enjoy reading this as much as I enjoyed writing it. Let me end with one of my favorite cheesy quotes: it is not about the destination, it is about the journey. Oh, and the journey has not ended, because I will return to Delft as a master student next semester. That means more joy for at least two years.

Jenny Tjan

Contents

1 Introduction
2 Preliminary mathematical definitions
   2.1 Notations
   2.2 Definitions
   2.3 Theorems
3 Google Matrix
   3.1 Model for the random surfer
   3.2 Computing the PageRank vector
      3.2.1 Power Method
      3.2.2 Linear system approach
   3.3 Example
4 Effect of link spamming
   4.1 Link spamming
   4.2 Suspected pages
   4.3 Research question
5 Modifying the PageRank Model with v
   5.1 Personalization vector
   5.2 Influence of the personalization vector
      5.2.1 Validation
6 Finding an optimal v
   6.1 Method 1
      6.1.1 Example with 7 nodes
   6.2 Method 2: Linear Programming
      6.2.1 Example
   6.3 Irreducible subset
   6.4 Method 2.1 and Method 2.2
   6.5 Example
   6.6 Summary of the results of the four methods
7 Numerical results
   7.1 G 500
   7.2 Computation Time
8 Conclusion
A End results
   A.1 G 500
   A.2 G 9914
B Matlab codes
   B.0.1 WebH.m
   B.1 Methods to calculate the PageRank vector
      B.1.1 pagerankpow.m
      B.1.2 IT.m
   B.2 Methods to optimize the personalization vector
      B.2.1 Random.m
      B.2.2 OPTx.m
      B.2.3 OPTx2.m

1 Introduction

Millions of people use the internet to search for information every day. They usually go to a search engine to get their results. A search engine is software designed to search for information on the web, and Google is one of the most used search engines at this moment. But why do people prefer Google over Yahoo and other search engines? The internet nowadays consists of billions of web pages and even more links connecting them. How does Google sort and return these search results? The answer to these questions is PageRank. This algorithm is what Google uses to return its search results and to stay at the top of its field.

The PageRank algorithm models the behavior of a random web surfer that follows an outlink from the current page with probability $\alpha$ or goes to another page with probability $(1-\alpha)$ by writing the URL in the address bar. This is also called teleportation. The surfer teleports to any page with a probability given in the personalization vector. The mathematical way to interpret this is as a random walk on a directed graph, which can also be described as a Markov chain. The algorithm returns a PageRank vector containing, for each page, the probability (also called the PageRank score) that the web surfer will be on that page after many steps. The search result follows from the PageRank vector by ranking the page scores from high to low.

Some people may want to increase the PageRank score of their page. One method to do so is link spamming. Link spamming aims to fool the model by adding links to pages to increase their PageRank scores. What we want to achieve in this project is to return a more honest search result than PageRank would. This will be done by modifying the model such that pages that are suspected of link spamming get a lower PageRank score. It is well known how to detect link spamming; we refer to the work of Sangers and van Gijzen [3]. The personalization vector is a probability vector that indicates the pages the surfer can teleport to. This research aims to investigate and optimize the personalization vector to combat link spamming.

The structure of this report is as follows. In Section 2 we give the preliminary mathematical definitions that will be used later in the report. In Section 3 we discuss the mathematical interpretation of the PageRank algorithm and how Google models the random surfer. In Section 4 we further describe link spamming and its harmful effects. The remaining sections are about the effect of the personalization vector and how to optimize it to combat link spamming. Finally, we discuss the methods and conclude our findings.

2 Preliminary mathematical definitions

This section defines the notations, definitions and theorems that will be used later in the report.

2.1 Notations

A bold symbol denotes a vector; the same symbol, non-bold and with an index, denotes one of its coefficients. Let $\boldsymbol{\pi}$ be a $1 \times n$ vector; then we write $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$.

2.2 Definitions

Definition 1. Let $A$ be an $n \times n$ matrix. $A$ is said to be a reducible matrix when there exists a permutation matrix $P$ such that

$$P^T A P = \begin{bmatrix} X & Y \\ 0 & Z \end{bmatrix}$$

where $X$ and $Z$ are both square. Otherwise $A$ is said to be an irreducible matrix.

Definition 2. Let $l$ be the number of irreducible subsets of $S$. Then we can rewrite $S$ in canonical form by renumbering the coefficients:

$$\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,r} & A_{1,r+1} & A_{1,r+2} & \cdots & A_{1,m} \\
0       & A_{2,2} & \cdots & A_{2,r} & A_{2,r+1} & A_{2,r+2} & \cdots & A_{2,m} \\
\vdots  &         & \ddots & \vdots  & \vdots    & \vdots    &        & \vdots  \\
0       & 0       & \cdots & A_{r,r} & A_{r,r+1} & A_{r,r+2} & \cdots & A_{r,m} \\
0       & 0       & \cdots & 0       & A_{r+1,r+1} & 0       & \cdots & 0       \\
\vdots  &         &        & \vdots  & 0         & A_{r+2,r+2} &      & \vdots  \\
0       & 0       & \cdots & 0       & 0         & 0         & \cdots & A_{m,m}
\end{bmatrix}$$

where $l = m - r$, each $A_{1,1}, \ldots, A_{m,m}$ is either irreducible or $[0]_{1 \times 1}$, and each of $A_{r+1,r+1}, \ldots, A_{m,m}$ corresponds to an irreducible subset. See [5].

Definition 3. A matrix $A$ is positive if $a_{ij} > 0$ for every element $a_{ij}$ of $A$, i.e. all the elements of $A$ are greater than zero.

Definition 4. Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_i$. The spectral radius of $A$ is given by $\rho(A) = \max_i |\lambda_i|$.

Definition 5. Given a Markov chain on a state space $V$ with transition matrix $S$, we call a subset $C \subseteq V$ a closed subset or irreducible subset if and only if $\sum_{j \notin C} S_{ij} = 0$ for each $i \in C$.

Definition 6. Let $\mathbf{x}$ be a $1 \times n$ vector and $A$ an $n \times n$ matrix. The 1-norm and $\infty$-norm are defined as follows:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^n |x_i|, \qquad \|\mathbf{x}\|_\infty = \max_{1 \le i \le n} |x_i|,$$
$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^n |a_{ij}|, \qquad \|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |a_{ij}|.$$

Definition 7. The condition number of the matrix $A$ is given by $\kappa(A) = \|A\| \, \|A^{-1}\|$.

2.3 Theorems

Theorem 1 (Perron-Frobenius). Let $G$ be an $n \times n$ irreducible nonnegative matrix. Then $G$ has a unique positive real eigenvalue $\lambda_1$ equal to its spectral radius. If $G$ is positive, then $\lambda_1$ is dominant. To $\lambda_1$ corresponds a positive eigenvector.

Theorem 2 (Gershgorin circle theorem). The eigenvalues of a general $n \times n$ matrix $A$ are located in the complex plane in the union of the circles

$$|\lambda - a_{ii}| \le \sum_{j=1,\, j \ne i}^n |a_{ij}|, \qquad \lambda \in \mathbb{C}.$$

3 Google Matrix

3.1 Model for the random surfer

Let $W = (V, E)$ denote the web graph, with $V$ the set of $n$ web pages and $E$ the set of directed edges between the pages. Let $H$ be the matrix representation of $W$, which means $H_{ij} = 1$ if there is an outlink from page $i$ to page $j$, with $1 \le i, j \le n$. Let the row sums be denoted by $r_i = \sum_{j=1}^n H_{ij}$. If $r_i = 0$, page $i$ has no outlinks to other pages; such a page is called a dangling node, for example an image file or a word file [2]. The usual way of treating a dangling node is to link it to all other pages with equal probability. Now we can define the matrix $S$ as follows:

$$S_{ij} = \begin{cases} H_{ij}/r_i & \text{if } r_i \ne 0 \\ 1/n & \text{if } r_i = 0 \end{cases}$$

$S$ is the so-called web hyperlink matrix. Note that $S$ is a row-stochastic matrix and $S^T$ is a column-stochastic matrix: the rows of $S$ sum up to 1 and the columns of $S^T$ sum up to 1. In other words, $S$ is the transition matrix if the surfer only follows outlinks.

To give a more realistic model of the random surfer, we have to take into account that a surfer can also go to other pages by entering a URL in the address bar. This is called teleportation. To model this, we introduce the teleportation matrix $E$, defined by $E = \mathbf{v}\,\mathbf{e}^T$, where $\mathbf{v}$ is an $n \times 1$ probability vector called the personalization vector and $\mathbf{e}$ is the $n \times 1$ vector consisting of only ones. The vector $\mathbf{v}$ gives the probability that a random surfer jumps to a certain page. We introduce the Google matrix to model the random surfer:

$$G = \alpha S^T + (1 - \alpha) E$$

where $\alpha$ is an amplification factor with $0 < \alpha < 1$: the surfer follows an outlink of the current page with probability $\alpha$ or teleports with probability $(1 - \alpha)$. Google started out using $\alpha = 0.85$ and the uniform vector for $\mathbf{v}$ [2, 6]. That means that if the random surfer teleports, it teleports to any page with equal probability; thus $\mathbf{v} = \frac{1}{n}(1, \ldots, 1)^T$.

Remark that $G$ is a column-stochastic matrix, since it is a convex combination of two transition matrices [4]: every element of $G$ is between 0 and 1 (i.e. $0 \le G_{ij} \le 1$) and each column sums up to 1. Furthermore, $V$ is finite and $G$ is irreducible, since it is possible to go from any page to any other page by entering a URL.

Theorem 3. Let $G$ be an $n \times n$ Google matrix. Then $\lambda_1 = 1$ is the dominant eigenvalue of $G$. To $\lambda_1 = 1$ corresponds the PageRank vector.

Proof. Let $\mathbf{e}$ be the $n \times 1$ vector consisting of only ones. We know that the columns of $G$ sum up to 1, so $G^T \mathbf{e} = \mathbf{e}$. Hence 1 is an eigenvalue of $G$. We also apply the Gershgorin circle theorem to bound the moduli of the eigenvalues:

$$\rho(G) \le \max_j \sum_i G_{ij} = 1.$$

Thus $\lambda_1 = 1$ is an eigenvalue of maximal modulus of $G$. Since $G$ is positive and irreducible, it follows from Perron-Frobenius that $\lambda_1 = 1$ is the dominant eigenvalue of $G$.

3.2 Computing the PageRank vector

3.2.1 Power Method

The random surfer can take infinitely many steps without getting tired. The surfer starts with a certain initial distribution $\pi_0$, and to know where the surfer is after $i$ steps we do the following

calculation: $\pi_i = G^i \pi_0$. We assume $G$ is diagonalizable, so $G = P \Lambda P^{-1}$. Remark that the dominant eigenvalue of $G$ equals 1 and that the other eigenvalues are strictly smaller than 1 in modulus. So after infinitely many steps the distribution converges to the eigenvector corresponding to the dominant eigenvalue: the PageRank vector. This method is also called the power method [1, 2, 4]. The method goes as follows. We begin with a starting vector $\pi_0$, for example $\pi_0 = \mathbf{v}$, and compute the steps $\pi_{i+1} = G\pi_i$ for $i = 0, 1, 2, \ldots$ until some convergence condition is satisfied. This is the most common way to solve large problems [3]. Remark that for every $i$, $\pi_i$ is a probability vector. If we rewrite the iteration we get:

$$\pi_{i+1} = G\pi_i \qquad (1)$$
$$= \alpha S^T \pi_i + (1-\alpha)\mathbf{v}\,\mathbf{e}^T \pi_i \qquad (2)$$
$$= \alpha S^T \pi_i + (1-\alpha)\mathbf{v} \qquad (3)$$

So to perform a power iteration, only a matrix-vector multiplication with the matrix $S^T$ is needed, followed by the addition of the vector $(1-\alpha)\mathbf{v}$. The average number of outlinks of a page is 52 [2], which means that $S$ is very sparse, i.e. $S$ has a lot of zero elements. So although $S$ is very large, the power method computes easily and fast.

3.2.2 Linear system approach

For smaller test problems we compute the PageRank vector by rewriting the problem as a linear system. We know that the PageRank vector corresponds to the dominant eigenvalue $\lambda_1 = 1$, so it has to satisfy the condition $G\pi = \pi$. Remark that $\mathbf{e}^T \pi = 1$, so this can be rewritten as:

$$0 = \pi - G\pi \qquad (4)$$
$$= \pi - \alpha S^T \pi - (1-\alpha)\mathbf{v}\,\mathbf{e}^T \pi \qquad (5)$$
$$= \pi - \alpha S^T \pi - (1-\alpha)\mathbf{v} \qquad (6)$$
$$= (I - \alpha S^T)\pi - (1-\alpha)\mathbf{v} \qquad (7)$$

To find the PageRank vector, we just have to solve the equation

$$(I - \alpha S^T)\pi = (1-\alpha)\mathbf{v} \qquad (8)$$

This is a system of $n$ linear equations. Note that calculating matrix-vector multiplications is much faster than solving a linear system, which is why the power method is preferred for large problems.
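To make both computations concrete, here is a minimal MATLAB sketch in the spirit of the routines in Appendix B. It is a sketch under the following assumptions: St stands for the column-stochastic matrix $S^T$, v for the personalization vector, and the function name pagerank_power is ours, not part of the thesis code.

function [ x, it ] = pagerank_power( St, v, alpha, tol )
% Power method of equation (3): x <- alpha*St*x + (1-alpha)*v
x = v;                               % start from the personalization vector
xold = zeros(size(v));
it = 0;
while norm(x - xold, 1) > tol
    xold = x;
    x = alpha*(St*x) + (1-alpha)*v;  % one sparse matrix-vector product per step
    it = it + 1;
end
end

For the linear system approach of equation (8), the same quantities give the one-liner x = (speye(length(v)) - alpha*St) \ ((1-alpha)*v), which is only practical for small test problems.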

3.3 Example

Look at the example given by Figure 1: a small web that consists of only 7 pages. Let us calculate the PageRank vector for this test problem. First we set up the web hyperlink matrix $S$ before computing the Google matrix. Thus: $S = [\ ]$

Figure 1: Test problem with 7 nodes

Notice that pages 6 and 7 are dangling nodes; this can be seen in $S^T$, where columns 6 and 7 are uniformly distributed. To calculate the PageRank vector, let $\alpha = 0.85$ and let $\mathbf{v}$ be the uniform vector. It follows that $G$ is given by $G = \alpha S^T + (1-\alpha)\mathbf{v}\,\mathbf{e}^T = [\ ]$. The PageRank vector calculated by the power method, with convergence requirement $\|\pi_{i+1} - \pi_i\| \le 10^{-4}$, is $\pi = [\ ]$. This took 32 iterations. The vector is the same as the PageRank vector calculated with the linear system approach. The algorithm returns the following page order: Page 2, Page 1, Page 3, Page 4, Page 7, Page 5, Page 6.
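The link structure of Figure 1 is not recoverable from this transcription, so the following sketch only illustrates the mechanics on a made-up 7-page edge list. It builds $S^T$ with WebH.m (Appendix B.0.1), runs the power method with IT.m (Appendix B.1.2), and reads off the ranking. Both the edge list and our reading of WebH's input convention (first row the link targets, second row the link sources) are assumptions.

A = [2 3 1 4 6 7;                      % assumed link targets (not the real Figure 1)
     1 2 3 3 4 5];                     % assumed link sources; pages 6, 7 get no outlinks
St = WebH(A);                          % column-stochastic web hyperlink matrix S^T
n = size(St, 1);
alpha = 0.85;
v = ones(n, 1)/n;                      % uniform personalization vector
G = alpha*St + (1-alpha)*v*ones(1, n); % Google matrix of Section 3.1
[x, it] = IT(St, v, alpha);            % PageRank vector by the power method
[~, order] = sort(x, 'descend');       % page order, highest score first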

4 Effect of link spamming

The goal of this research is to suppress the pages that are suspected of link spamming. In this section we explain what link spamming is and how to find the pages suspected of it.

4.1 Link spamming

Link spamming, or a link farm, aims to fool the algorithm by adding and removing links between pages in order to increase the PageRank score of certain pages. For example, suppose a company has a page that rarely gets hits, so its PageRank score is very low. To increase the score, the company could hire people to add links from other pages. The spammer makes sure the links are hidden in multiple sites, fooling the algorithm into treating the page as important, so that it gets a higher score. For this reason, Google uses other parameters and algorithms to model the random surfer, which we do not know of [2].

4.2 Suspected pages

We can find the suspected pages via the eigenvector corresponding to the second eigenvalue; see [3, 8]. One of the most effective methods of link spamming is creating irreducible subsets. This is so effective because once a surfer gets there, it cannot leave by following outlinks, and the probability that it leaves the irreducible subset by entering a URL is small: if $\alpha = 0.85$, the probability that the surfer teleports is only $1 - \alpha = 0.15$.

One way to find the irreducible subsets is to rewrite $S$ in canonical form by renumbering the nodes. Recall the test problem given in Figure 1.

Figure 2: After renumbering

The web hyperlink matrix $S$ then has the following form:

$$S = \begin{bmatrix} S_{1,1} & S_{1,2} \\ S_{2,1} & S_{2,2} \end{bmatrix} = [\ ]$$

Figure 2 shows that there is one irreducible subset, $S_{2,2}$; this can also be seen in the matrix $S$. If the surfer goes to page 6 or 7, it stays there until it decides to teleport.
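Definition 5 gives a direct numerical test for such a closed subset: the rows of the row-stochastic matrix $S$ belonging to $C$ must have all their probability mass inside $C$. A minimal sketch, assuming St is the column-stochastic matrix produced by WebH.m and $C = \{6, 7\}$ is the candidate subset:

S = St';                                     % row-stochastic S from the column-stochastic S^T
C = [6 7];                                   % candidate closed subset from Figure 2
outside = setdiff(1:size(S,1), C);           % all pages not in C
isClosed = all(sum(S(C, outside), 2) == 0)   % true exactly when Definition 5 holds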

4.3 Research question

This report aims to analyze the personalization vector. Can we use it to combat link spamming? If so, how can we optimize the personalization vector? In the following sections, we look at the role and impact of the personalization vector. Then we discuss how to find and verify the optimal personalization vector.

5 Modifying the PageRank Model with v

5.1 Personalization vector

One of the first modifications made to the model is changing the teleportation matrix $E = \mathbf{v}\,\mathbf{e}^T$: one can use another probability vector instead of the uniform vector for $\mathbf{v}$. Each time the surfer teleports, it teleports to a page with the probability distribution given in $\mathbf{v}$. There are a few reasons why Google named it the personalization vector. One of the main reasons is that Google wanted different vectors to model different types of surfers more accurately [6, 3]: the vector brings a surfer to the pages he is likely to teleport to. For example, a sporty person gets more sports-related pages than someone who has no interest in sports.

5.2 Influence of the personalization vector

We try out different probability distributions for $\mathbf{v}$ and calculate the corresponding PageRank vectors. We compare these to the PageRank vector calculated with the uniform vector, to see how much the PageRank vector depends on the difference in personalization vector. To bound the perturbation, we look at the condition number $\kappa_1(I - \alpha S^T) = \|I - \alpha S^T\|_1 \|(I - \alpha S^T)^{-1}\|_1$.

Theorem 4. Let $S$ be an $n \times n$ row-stochastic matrix whose diagonal elements are $S_{ii} = 0$. Let $\alpha$ be a real number with $0 \le \alpha < 1$. Let $E = \mathbf{v}\,\mathbf{e}^T$ be the rank-one teleportation matrix, where $\mathbf{e}$ is the $n \times 1$ vector whose elements are all $e_i = 1$ and $\mathbf{v}$ is an $n$-vector that represents a probability distribution. Define the matrix $G = \alpha S^T + (1-\alpha)E$. The problem $G\pi = \pi$ has condition number

$$\kappa_1(I - \alpha S^T) = \|I - \alpha S^T\|_1 \|(I - \alpha S^T)^{-1}\|_1 = \frac{1+\alpha}{1-\alpha}.$$

Proof. We determine the norm of the matrix and of its inverse separately.

1. $\|I - \alpha S^T\|_1 = 1 + \alpha$. Since the diagonal elements of $\alpha S^T$ are zero, the absolute column sums of $I$ and $\alpha S^T$ simply add:

$$\|I - \alpha S^T\|_1 = \|I\|_1 + \alpha\|S^T\|_1 = 1 + \alpha. \qquad (9)$$

2. $\|(I - \alpha S^T)^{-1}\|_1 = \frac{1}{1-\alpha}$. Let $\mathbf{e}_i$ be the unit vector with a 1 in position $i$, and let $\pi(\mathbf{e}_i)$ be the PageRank vector belonging to the personalization vector $\mathbf{v} = \mathbf{e}_i$; by equation 8, $\pi(\mathbf{e}_i) = (1-\alpha)(I - \alpha S^T)^{-1}\mathbf{e}_i$ and $\|\pi(\mathbf{e}_i)\|_1 = 1$. Taking norms we get:

$$\|\pi(\mathbf{e}_i)\|_1 = (1-\alpha)\,\|(I - \alpha S^T)^{-1}\mathbf{e}_i\|_1 \qquad (10)$$
$$1 = (1-\alpha)\,\|(I - \alpha S^T)^{-1}\mathbf{e}_i\|_1 \qquad (11)$$
$$\|(I - \alpha S^T)^{-1}\mathbf{e}_i\|_1 = \frac{1}{1-\alpha} \qquad (12)$$

This holds for each column $i$, and since $(I - \alpha S^T)^{-1}$ is nonnegative, $\|(I - \alpha S^T)^{-1}\|_1 = \frac{1}{1-\alpha}$.

The condition number is therefore

$$\kappa_1(I - \alpha S^T) = \|I - \alpha S^T\|_1 \|(I - \alpha S^T)^{-1}\|_1 = \frac{1+\alpha}{1-\alpha}.$$

For the complete proof we refer to Kamvar and Haveliwala [11]. For $\alpha = 0.85$ the condition number is $\kappa_1 = 1.85/0.15 \approx 12.3$.

Since $\mathbf{v}$ is a probability vector, we get

$$\|\Delta\mathbf{v}\|_1 = \|(\mathbf{v} + \Delta\mathbf{v}) - \mathbf{v}\|_1 \le \|\mathbf{v} + \Delta\mathbf{v}\|_1 + \|\mathbf{v}\|_1 = 2,$$

and likewise for the difference in the PageRank vector, since $\pi$ is also a probability vector:

$$\|\Delta\pi\|_1 = \|(\pi + \Delta\pi) - \pi\|_1 \le \|\pi + \Delta\pi\|_1 + \|\pi\|_1 = 2.$$

We look at the linear equation to get a better upper bound. Consider equation 8. Let $\pi$ be the PageRank vector corresponding to the personalization vector $\mathbf{v}$, so $\pi = (1-\alpha)(I - \alpha S^T)^{-1}\mathbf{v}$, and let $\pi + \Delta\pi$ be the perturbed PageRank vector corresponding to the perturbed personalization vector $\mathbf{v} + \Delta\mathbf{v}$:

$$\pi + \Delta\pi = (1-\alpha)(I - \alpha S^T)^{-1}(\mathbf{v} + \Delta\mathbf{v})$$

Note that $\|\pi + \Delta\pi\|_1 = \|\mathbf{v} + \Delta\mathbf{v}\|_1 = 1$ and $\|\pi\|_1 = \|\mathbf{v}\|_1 = 1$. From equation 8 we get:

$$\|\Delta\pi\|_1 \le (1-\alpha)\,\|(I - \alpha S^T)^{-1}\|_1\,\|\Delta\mathbf{v}\|_1 \qquad (13)$$
$$\frac{\|\Delta\pi\|_1}{\|\pi\|_1} \le (1-\alpha)\,\|(I - \alpha S^T)^{-1}\|_1\,\frac{\|\Delta\mathbf{v}\|_1}{\|\mathbf{v}\|_1} \qquad (14)$$
$$\frac{\|\Delta\pi\|_1}{\|\pi\|_1} \le \frac{\|\Delta\mathbf{v}\|_1}{\|\mathbf{v}\|_1} \qquad (15)$$
$$\|\Delta\pi\|_1 \le \|\Delta\mathbf{v}\|_1 \qquad (16)$$

Notice that step (15) follows from part 2 of the proof of Theorem 4, since $(1-\alpha)\|(I - \alpha S^T)^{-1}\|_1 = 1$. We have found a better bound for the overall disturbance in $\pi$; it will be tested in Section 5.2.1. We also want to know how much the difference in personalization vector affects the largest value of the PageRank vector. For this we use the inequality $\|\Delta\pi\|_\infty \le \|\Delta\pi\|_1$, so $\|\Delta\pi\|_\infty$ is bounded by the same bound:

$$\|\Delta\pi\|_\infty \le \|\Delta\pi\|_1 \le \|\Delta\mathbf{v}\|_1$$

5.2.1 Validation

To test the claims stated above, we use the test problem defined in Figure 1. We test the two upper bounds on the 1-norm of the PageRank difference $\|\Delta\pi\|_1$. Let $\mathbf{v}_u$ be the uniform vector with PageRank vector $\pi_u$. We generate 1000 random personalization vectors $\{\mathbf{v}_i\}$, $1 \le i \le 1000$, and define the differences $\Delta\mathbf{v}_i = \mathbf{v}_u - \mathbf{v}_i$. For each personalization vector we compute the PageRank vector $\pi_i$ and define the difference $\Delta\pi_i = \pi_u - \pi_i$. Then we plot $\|\Delta\pi_i\|_1$ against $\|\Delta\mathbf{v}_i\|_1$ and against $\kappa_1\|\Delta\mathbf{v}_i\|_1$; this can be seen in Figure 3.

Figure 3: Upper bounds for $\|\Delta\pi\|_1$: (a) $\|\Delta\pi\|_1 \le \kappa_1\|\Delta\mathbf{v}\|_1$, (b) $\|\Delta\pi\|_1 \le \|\Delta\mathbf{v}\|_1$

As can be seen, both bounds hold. The second upper bound is sharper than the bound involving the condition number: Figure 3a shows that the gap between $\|\Delta\pi\|_1$ and $\kappa_1\|\Delta\mathbf{v}\|_1$ is quite large. Now we test the bound on the maximum value of the PageRank difference, i.e. $\|\Delta\pi\|_\infty \le \|\Delta\mathbf{v}\|_1$. We plot $\|\Delta\pi_i\|_\infty$ against $\|\Delta\mathbf{v}_i\|_1$; the result can be seen in Figure 4.

Figure 4: $\|\Delta\pi\|_\infty \le \|\Delta\mathbf{v}\|_1$

Looking at these results, we can conclude the following. If we want a small difference in the PageRank vector, the obvious choice is a probability vector that does not differ much from the uniform personalization vector. However, a personalization vector that differs a lot from the uniform vector does not necessarily lead to a large difference in the PageRank vector.
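The validation experiment is easy to reproduce. A sketch, under the assumption that the random personalization vectors are drawn by normalizing uniform random numbers (the thesis does not state how its 1000 vectors were generated) and that St is the column-stochastic $S^T$ of the test problem:

n = size(St, 1);
alpha = 0.85;
vu = ones(n, 1)/n;                              % uniform personalization vector
piu = (speye(n) - alpha*St) \ ((1-alpha)*vu);   % reference PageRank vector, eq. (8)
m = 1000;
dv1 = zeros(m,1); dpi1 = zeros(m,1); dpiinf = zeros(m,1);
for k = 1:m
    v = rand(n, 1); v = v/sum(v);               % random personalization vector
    p = (speye(n) - alpha*St) \ ((1-alpha)*v);
    dv1(k)    = norm(vu - v, 1);
    dpi1(k)   = norm(piu - p, 1);
    dpiinf(k) = norm(piu - p, Inf);
end
kappa1 = (1 + alpha)/(1 - alpha);               % condition number of Theorem 4
assert(all(dpi1 <= kappa1*dv1 + 1e-12))         % bound of Figure 3a
assert(all(dpi1 <= dv1 + 1e-12))                % bound of Figure 3b
assert(all(dpiinf <= dv1 + 1e-12))              % bound of Figure 4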

6 Finding an optimal v

Let $C$ be the set of pages that should be suppressed. Let $\mathbf{v}$ be a personalization vector with PageRank vector $\pi$. Let $\beta$ denote the sum of the PageRank scores of the suppressed pages, i.e. $\beta = \sum_{i \in C} \pi_i$. We call the vector $\mathbf{v}$ the optimal personalization vector if it solves

$$\beta = \min_{\mathbf{v}} \sum_{i \in C} \pi_i \quad \text{subject to } \mathbf{v} \text{ being a personalization vector.}$$

In this chapter, we discuss two main methods, and some variations of the last method, to find the optimal personalization vector $\mathbf{v}$.

6.1 Method 1

We do not want the random surfer to teleport to one of the suspected pages, so we generate a set of $m$ personalization vectors such that $v_i = 0$ for every $i \in C$. Let $\mathbf{e}_j$ be the unit vector of length $n$ with a 1 in the $j$-th position. For every $1 \le k \le m$ we define the personalization vector $\mathbf{v}_k = \sum_{j \notin C} \gamma_{k,j}\mathbf{e}_j$ with $\sum_{j \notin C} \gamma_{k,j} = 1$ and $0 \le \gamma_{k,j} \le 1$. We calculate the corresponding PageRank vector and compute $\beta_k = \sum_{i \in C} \pi_{k,i}$. The method returns $\mathbf{v}_t$ for a fixed $t$ such that $\beta_t \le \beta_k$ for every $1 \le k \le m$.

6.1.1 Example with 7 nodes

Let us reformulate the test problem. We alter the web structure by adding in- and outlinks to increase the PageRank of page 4. Also, we put the matrix $S$ in canonical form. This gives the directed graph shown in Figure 5.

Figure 5: After adding and removing links and renumbering

After renumbering, the web hyperlink matrix looks as follows: $S = [\ ]$

As can be seen in the figure, the test problem has two irreducible closed subsets. If the surfer gets into a closed subset, it can never leave the subset by following an outlink. Let $\alpha = 0.85$ and let $\mathbf{v}$ be the uniform vector. The calculated PageRank vector is $\pi_u = [\ ]$.

Suppose, for example, that we want to suppress page 4, since the PageRank score of page 4 is the highest. For the first method we take $m = 1000$. We call $\pi_{m1}$ the PageRank vector and $\mathbf{v}_{m1}$ the personalization vector found with this method. As result we find:

$$\mathbf{v}_{m1} = [\ ], \qquad \pi_{m1} = [\ ]$$

For this method, we have to generate $m$ personalization vectors and calculate $m$ PageRank vectors. Google takes days to calculate one PageRank vector, so finding a personalization vector with this method is space and time consuming. We introduce another method in the next paragraph.

6.2 Method 2: Linear Programming

Instead of looking at different personalization vectors, one can impose requirements on the PageRank scores directly. Recall that we defined $\beta = \min_{\mathbf{v}} \sum_{i \in C} \pi_i$ over personalization vectors $\mathbf{v}$. If we could simply set $\pi_i = 0$ for $i \in C$, we would get the lowest value for $\beta$, but this does not always meet the condition of $\mathbf{v}$ being a probability vector, i.e. $\mathbf{v} \ge 0$ and $\|\mathbf{v}\|_1 = 1$. We can formulate the problem as a linear optimization problem, starting from the linear system $(I - \alpha S^T)\pi = (1-\alpha)\mathbf{v}$ of equation 8:

$$\min_\pi\; \mathbf{c}^T\pi \quad \text{subject to} \quad \frac{(I - \alpha S^T)}{1-\alpha}\,\pi \ge 0, \quad \sum_i \pi_i = 1, \quad 0 \le \pi_i \le 1 \text{ for } 1 \le i \le n,$$

where $\mathbf{c}$ is the $n \times 1$ vector with $c_i = 1$ if $i \in C$ and $c_i = 0$ otherwise. The first constraint expresses $\mathbf{v} \ge 0$; the other constraints ensure that $\pi$ is a probability vector: its coefficients sum to 1 and each lies between 0 and 1.

6.2.1 Example

Given the problem defined in Example 6.1.1, we use method 2 to calculate the PageRank vector and the corresponding personalization vector. We call $\pi_{m2}$ the PageRank vector and $\mathbf{v}_{m2}$ the personalization vector found with this method. We want to suppress page 4, so we have to solve the problem:

$$\min_\pi\; \pi_4 \quad \text{subject to} \quad \frac{(I - \alpha S^T)}{1-\alpha}\,\pi \ge 0, \quad \sum_i \pi_i = 1, \quad 0 \le \pi_i \le 1 \text{ for } 1 \le i \le 7.$$

We get the result:

$$\pi_{m2} = [\ ], \qquad \mathbf{v}_{m2} = [\ ]$$

Note that the solution is not unique: if we choose $\mathbf{v} = [\ ]$, the PageRank score of page 4 is still 0. In other words, the optimal personalization vector is not unique.
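Both methods correspond to routines in Appendix B. A usage sketch for the 7-node problem, assuming St is the column-stochastic matrix $S^T$ of the renumbered test problem; the argument values are ours:

alpha = 0.85;
n = size(St, 1);
C = 4;                                       % page(s) to suppress

% Method 1 (Appendix B.2.1): 1000 random personalization vectors with v(C) = 0
[x_m1, v_m1] = Random(St, C, 1000);

% Method 2 (Appendix B.2.2): linear programming; c marks the pages to minimize
c = zeros(1, n);
c(C) = 1;
[x_m2, v_m2] = OPTx(St, alpha, c, zeros(n,1), ones(n,1));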

6.3 Irreducible subset

The result obtained with method 2 does lower the PageRank score of page 4, but it is not ideal. It is quite logical for the random surfer to jump to an irreducible closed subset: a surfer cannot reach page 4 if he gets stuck and stays stuck. If we decrease the upper bound of the linear problem, we force the surfer to teleport to other pages as well.

6.4 Method 2.1 and Method 2.2

We change method 2 in two different ways to get a better result. We call the first modification method 2.1. It prevents the random surfer from getting stuck in an irreducible subset by lowering the upper bounds on the $\pi_i$. A downside is that we do not know by how much we should lower the bounds: if we lower them too much, there is a chance we will not find a solution for $\mathbf{v}$. The other modification, method 2.2, is to find all the pages that are part of irreducible subsets and to minimize the PageRank scores of those pages together with the pages we originally wanted to suppress. To find the pages of the irreducible subsets we apply Tarjan's SCC algorithm; MATLAB code for it can be found in [13].

6.5 Example

To illustrate the methods described above, we use the example defined in Example 6.1.1. Define the Google matrix of this problem as $G_7$. For illustration we lower the upper bound to 0.2 instead of 1. We call $\pi_{m21}$ the PageRank vector and $\mathbf{v}_{m21}$ the personalization vector found with method 2.1. The result is:

$$\pi_{m21} = [\ ], \qquad \mathbf{v}_{m21} = [\ ]$$

Notice that the values of the PageRank scores are more realistic: the surfer can teleport to more pages than before, and the PageRank score of page 4 is less than the value of $\pi_{u,4}$. So in the end, we did lower the PageRank score. For the other method we first look for the pages of the irreducible subsets. As can be seen in the matrix $S$, those pages are $I = \{4, 5, 6, 7\}$. We call $\pi_{m22}$ the PageRank vector and $\mathbf{v}_{m22}$ the personalization vector found with method 2.2. By minimizing those PageRank scores we get:

$$\pi_{m22} = [\ ], \qquad \mathbf{v}_{m22} = [\ ]$$

Notice that the personalization vectors from every method meet the requirement that $v_{ml,4} = 0$ for $l \in \{1, 2, 21, 22\}$.
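A corresponding usage sketch: method 2.1 is OPTx.m called with a lowered upper bound, and method 2.2 is OPTx2.m (Appendix B.2.3), which augments c with the irreducible-subset pages found by scomponents from the gaimc toolbox [13]. The bound 0.2 is the value used above; everything else is as in the previous sketch.

% Method 2.1: the same LP, but with upper bound 0.2 on every PageRank score
[x_m21, v_m21] = OPTx(St, alpha, c, zeros(n,1), 0.2*ones(n,1));

% Method 2.2: also minimize all pages inside irreducible subsets
[x_m22, v_m22] = OPTx2(St, alpha, c, zeros(n,1), ones(n,1));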

6.6 Summary of the results of the four methods

To get a better visualization of the results, we order the pages based on their PageRank score. Recall the following:

$\mathbf{v}_u$ is the uniform personalization vector, with PageRank vector $\pi_u$.
$\mathbf{v}_{m1}$ is the personalization vector, with PageRank vector $\pi_{m1}$, found with method 1; see Section 6.1.
$\mathbf{v}_{m2}$ is the personalization vector, with PageRank vector $\pi_{m2}$, found with method 2; see Section 6.2.
$\mathbf{v}_{m21}$ is the personalization vector, with PageRank vector $\pi_{m21}$, found with method 2.1; see Section 6.4.
$\mathbf{v}_{m22}$ is the personalization vector, with PageRank vector $\pi_{m22}$, found with method 2.2; see Section 6.4.

The result can be seen in Table 1, where the best search result is shown at the top of each column.

π_u      π_m1     π_m2     π_m21    π_m22
page 4   page 6   page 7   page 7   page 4
page 5   page 7   page 6   page 6   page 5
page 7   page 5   page 4   page 4   page 2
page 6   page 4   page 5   page 5   page 7
page 1   page 3   page 2   page 1   page 6
page 2   page 2   page 3   page 2   page 3
page 3   page 1   page 1   page 3   page 1

Table 1: End result

If we take a closer look at the table, we notice that the position of page 4 has dropped for the first three methods. Judging by these results alone, method 2 combined with minimizing the irreducible subsets (method 2.2) does not work so well: it did not decrease the position of page 4, and its order also differs from that of the first column.

7 Numerical results

In this section we examine the numerical results of the algorithms used in the previous sections. We check whether the methods work and whether they lower the PageRank score of the page(s). Since most methods work for small test problems, we want to know how well they work for test problems of other sizes. In Section 7.1 we test the methods on two matrices of different size, $G_{500}$ and $G_{9914}$: $G_{500}$ is a 500 by 500 matrix proposed by Moler [1] and $G_{9914}$ is a 9914 by 9914 matrix proposed by Gleich [12]. After that, we show the computation time of the methods. We show the link structure of the webs using the MATLAB command spy on the two test problems; see Figure 6.

Figure 6: Spy plots: (a) $G_{500}$, (b) $G_{9914}$

7.1 G 500

We only show the test results of $G_{500}$; the computation time of every method can be found in Section 7.2. First, we look at the order of the pages corresponding to the uniform vector; see Table 2.

Table 2: PageRank scores of $G_{500}$ under the uniform vector (columns: page, PageRank score, URL)

We tried method 2 to optimize the result; the result is given in Table 3. Remark that page 1 is suppressed, but the top 7 pages differ from the order returned by the uniform vector. Moreover, the PageRank scores of the pages are not very realistic: the score of page 132 went up by a factor 517.

Table 3: PageRank scores obtained with method 2 (columns: page, PageRank score, URL)

To get a more realistic PageRank vector, we also minimize the irreducible subsets. Applying Tarjan's SCC algorithm gives $I = \{132, 161\}$. Note that these are exactly the pages that come out on top after applying method 2. The result after suppressing the pages $\{1, 132, 161\}$ can be seen in Table 4.

Table 4: PageRank scores obtained with method 2.2 (columns: page, PageRank score, URL)

As a result, the top pages are now pages closer to page 500 than to page 1. The reason is that the pages further away from page 1 have fewer connections than the pages with a small index, which is what we would expect. For a better overview we plot the PageRank scores; see Figure 7.

Figure 7: PageRanks of $G_{500}$: (a) original, (b) PageRank after using method 2, (c) PageRank after using method 2.2

Notice that the PageRank scores are more realistic when the pages in irreducible subsets are suppressed as well. With both methods, the PageRank score of page 1 has dropped, which is the result we aimed for. The results for $\beta$ can be seen in Table 5, where the factor is calculated as $\beta_{mj}/\beta_u$ for $j \in \{1, 2, 21, 22\}$.

We also denote by $C_i$ the collection of pages whose PageRank score we want to lower:

$$C_1 = \{4\}, \quad C_2 = \{1\}, \quad C_3 = \{252\}, \quad C_4 = \{252, 351, 523, 123, 2345, 1553, 9862\}$$

The result can be seen in Table 5.

Set    Matrix    β_u    β_m1   factor   β_m2   factor   β_m21   factor   β_m22   factor
C_1    G_7       [ ]    [ ]    [ ]      [ ]    [ ]      [ ]     [ ]      [ ]     [ ]
C_2    G_500     [ ]    [ ]    [ ]      [ ]    [ ]      [ ]     [ ]      [ ]     [ ]
C_3    G_9914    [ ]    [ ]    [ ]      [ ]    [ ]      [ ]     [ ]      [ ]     [ ]
C_4    G_9914    [ ]    [ ]    [ ]      [ ]    [ ]      [ ]     [ ]      [ ]     [ ]

Table 5: β, the sum of the PageRank scores of the suppressed pages, for each method

Remark that for small test problems the methods do not always lower the PageRank score; for larger test problems, however, they do. The rest of the numerical results for $G_{500}$ and $G_{9914}$ can be found in Appendix A.

7.2 Computation Time

To time the results we make use of the MATLAB commands tic and toc. The program used was MATLAB and the computer used to obtain the results was a MacBook Pro with a 2.5 GHz Intel Core i5 and 16 GB of 1600 MHz DDR3 memory. The MATLAB codes for the methods are:

Method 1: Random.m
Method 2: OPTx.m
Method 2.2: OPTx2.m

Remark that method 2 and method 2.1 are the same algorithm. For method 1 we generate 1000 personalization vectors, and for method 2.1 we choose 0.2 as the upper bound. We also explain the abbreviations:

Req 1: requirement that the calculated personalization vector is a probability vector, i.e. $\mathbf{v}$ sums up to 1 and $\mathbf{v} \ge 0$.
Req 2: requirement that the calculated PageRank vector is a probability vector, i.e. $\pi$ sums up to 1.

The results can be seen in Table 6. Remark that the first method does not always return a probability vector; this can be seen for $G_{9914}$. That is because the method used to calculate the PageRank vector is not very accurate: the coefficients of the vector are small, which makes them sensitive to rounding errors. This can be solved by rescaling the PageRank vector such that $\|\pi\|_1 = 1$.
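Such a rescaling is a one-line fix after the power method returns; a sketch, with x the vector returned by pagerankpow.m:

x = x / norm(x, 1);   % rescale so that the PageRank vector sums to 1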

Algorithm    G_7: Time / Req 1 / Req 2    G_500: Time / Req 1 / Req 2    G_9914: Time / Req 1 / Req 2
Random.m     [ ] / YES / YES              [ ] / YES / YES                1677 / NO / YES
OPTx.m       [ ] / YES / YES              [ ] / YES / YES                [ ] / YES / YES
OPTx2.m      [ ] / YES / YES              [ ] / YES / YES                [ ] / YES / YES

Table 6: Numerical results

The question was whether we can optimize the personalization vector to combat link spamming. It is possible to suppress the suspicious pages. The disadvantage of the discussed methods is that nothing is left of the original order: the results of the methods are different and unpredictable. Using personalization vectors to combat link spamming with these methods is therefore not the optimal way.

8 Conclusion

In the introduction, we introduced the PageRank algorithm Google uses to return its search results. People who want their page to have a higher PageRank score can fool the algorithm by link spamming. In this report, we discussed link spamming and investigated whether we can combat it by modifying the model, more specifically by changing the personalization vector.

First, we wanted to know how large the impact of the personalization vector on the PageRank vector is. We have seen that the difference in the PageRank vector is bounded by $\|\Delta\pi\|_\infty \le \|\Delta\pi\|_1 \le \|\Delta\mathbf{v}\|_1$. Looking at the plots in Figure 3 and Figure 4, we noticed that a large difference in personalization vectors does not necessarily lead to a large difference in the PageRank vector.

An optimal personalization vector should meet the requirement that $v_i = 0$ if page $i$ is suspected of link spamming. There were two methods to find an optimal $\mathbf{v}$. One is generating a number of personalization vectors and choosing the one that is optimal; the downside of this method is that it takes too much time and space for large matrices. The other method uses linear programming to find the PageRank vector of the optimal personalization vector. The result was not always realistic, so we modified the method to get better results. It is desirable to suppress the effect of link spamming with the personalization vector, but the results are not predictable. It would be preferable to suppress the pages and still maintain the original order of the remaining search results.

In this report, we focused on the personalization vector to combat link spamming. The PageRank model has other elements we can modify. One of them is the dangling node fix, which links a dangling page to all other pages. The dangling node fix could be adjusted to link only to certain pages, i.e. those that should not be suppressed. That way the PageRank scores of the normal pages are increased in comparison to the pages that we want to suppress. This approach might be better at combating link spamming, and it is a recommendation for future research. There are also other ways to combat link spamming with the personalization vector, such as those of Jeh and Widom [9] and Kamvar et al. [10]; these methods have not been reviewed or compared to the methods in this report.

A End results

A.1 G 500

The page we wanted to suppress was page 1. The first plot shows the PageRank vector calculated with the uniform personalization vector.

Figure 8: Original

The next plots show the PageRank vectors calculated with the personalization vectors found with the different methods, including the modifications of the second method.

Figure 9: PageRanks of $G_{500}$: (a) after using method 2, (b) after using method 2.2, (c) after using method 2.1, (d) after using method 2.2

A.2 G 9914

The page we wanted to suppress was page 252. The first plot shows the PageRank vector calculated with the uniform personalization vector.

Figure 10: Original

The next plots show the PageRank vectors calculated with the personalization vectors found with the different methods.

Figure 11: PageRanks of $G_{9914}$: (a) after using method 2, (b) after using method 2.2, (c) after using method 2.1, (d) after using method 2.2

The pages we wanted to suppress were $\{252, 351, 523, 123, 2345, 1553, 9862\}$. The next plots show the PageRank vectors calculated with the personalization vectors found with the different methods.

Figure 12: PageRanks of $G_{9914}$: (a) after using method 2, (b) after using method 2.2, (c) after using method 2.1, (d) after using method 2.2

B Matlab codes

B.0.1 WebH.m

function [ P ] = WebH( A )
% Generates the web hyperlink matrix.
% If A is a 2xn matrix with the first row the outputs
% and the second row the inputs, then it is first made into a
% connectivity matrix.

% Check to see if it is a connectivity matrix, or should be made
% into a connectivity matrix
if max(max(A)) == 1
    n = size(A, 1);
    S = A;
else
    n = max(max(A));
    S = sparse(A(1,:), A(2,:), 1, n, n);   % web hyperlink matrix
end
c = sum(S);                                % column sums, for rescaling
for i = 1:n
    if c(i) > 0
        P(:,i) = S(:,i)/c(i);
    else
        P(:,i) = S(:,i) + 1/n;             % dangling nodes
    end
end
P = sparse(P);
end

B.1 Methods to calculate the PageRank vector

B.1.1 pagerankpow.m

function [ x, cnt ] = pagerankpow( G )
% PAGERANKPOW PageRank by power method with no matrix operations.
% x = pagerankpow(G) is the PageRank of the graph G.
% [x, cnt] = pagerankpow(G) also counts the number of iterations.
% There are no matrix operations. Only the link structure
% of G is used with the power method.

% Link structure
[n, n] = size(G);
for j = 1:n
    L{j} = find(G(:,j));
    c(j) = length(L{j});
end

% Power method
p = 0.85;
delta = (1-p)/n;
x = ones(n,1)/n;
z = zeros(n,1);
cnt = 0;
while max(abs(x - z)) > 0.0001
    z = x;
    x = zeros(n,1);
    for j = 1:n
        if c(j) == 0
            x = x + z(j)/n;                % dangling node: spread mass evenly
        else
            x(L{j}) = x(L{j}) + z(j)/c(j);
        end
    end
    x = p*x + delta;
    cnt = cnt + 1;
end
end

B.1.2 IT.m

function [ x, counter ] = IT( S, v, p )
% Power method to calculate the PageRank vector
n = size(S, 1);
counter = 0;
x = 1/n * ones(n,1);
y = zeros(n,1);
while max(abs(x - y)) > 0.0001   % tolerance assumed 0.0001, as in Section 3.3 (value lost in transcription)
    y = x;
    x = p*S*x + (1-p)*v;
    counter = counter + 1;
end
end

B.2 Methods to optimize the personalization vector

B.2.1 Random.m

function [ xm1, vm1 ] = Random( K, set, l )
% Randomly generate l personalization vectors and pick the best

% combination for the personalization vector suppressing the set
p = 0.85;
set = sort(set);
n = size(K, 2);
m = size(set, 2);
xm1 = pagerankpow(K);
xm = sum(xm1(set));
vm1 = 1/n * ones(n,1);
for j = 1:l
    ran = 1/(n-m-1) * rand(n-m-1, 1);
    ran(n-m) = 1 - sum(ran);          % make the entries sum to 1
    vc = ones(n,1);
    vc(set) = 0;
    c = 1;
    for i = 1:n
        if vc(i) ~= 0
            vc(i) = ran(c);
            c = c + 1;
        end
    end
    y = IT(K, vc, p);
    if sum(y(set)) < xm               % keep the best vector found so far
        xm = sum(y(set));
        vm1 = vc;
        xm1 = y;
    end
end
end

B.2.2 OPTx.m

function [ x, v ] = OPTx( K, alpha, c, lower, upper )
% Optimize the PageRank vector.
% K            : web hyperlink matrix (column-stochastic S^T)
% c            : 1xn vector with c(i) = 1 if page i should be minimized
% alpha        : amplification factor
% lower, upper : nx1 vectors of bounds for x(i)
n = size(K, 1);
b = zeros(n,1);

p = 0.85;
I = speye(n, n);
A = -(I - alpha*sparse(K))/(1-alpha);   % sign restored: A*x <= 0 expresses v >= 0
Aeq = ones(1, n);
beq = 1;
x = linprog(c, A, b, Aeq, beq, lower, upper);
v = (I - p*K)*x/(1-p);                  % recover the personalization vector
end

B.2.3 OPTx2.m

function [ x, v ] = OPTx2( K, alpha, c, lower, upper )
% Optimize the PageRank vector, also minimizing the irreducible subsets.
% K            : web hyperlink matrix (column-stochastic S^T)
% c            : 1xn vector with c(i) = 1 if page i should be minimized
% alpha        : amplification factor
% lower, upper : nx1 vectors of bounds for x(i)
n = size(K, 1);
b = zeros(n,1);
p = 0.85;
[node, ignore] = scomponents(K);        % find irreducible subsets [13]
l = 1;
for i = 1:n
    if node(i) ~= 1
        set(l) = i;
        l = l + 1;
    end
end
c(set) = 1;                             % also minimize pages in irreducible subsets
I = speye(n, n);
A = -(I - alpha*sparse(K))/(1-alpha);   % sign restored: A*x <= 0 expresses v >= 0
Aeq = ones(1, n);
beq = 1;
x = linprog(c, A, b, Aeq, beq, lower, upper);
v = (I - p*K)*x/(1-p);
end

References

[1] Moler, Cleve B. Numerical Computing with MATLAB: Revised Reprint, Chapter 7: Google PageRank. SIAM.
[2] Wills, Rebecca S. Google's PageRank. The Mathematical Intelligencer 28.4 (2006).
[3] Sangers, Alex, and Martin B. van Gijzen. The eigenvectors corresponding to the second eigenvalue of the Google matrix and their relation to link spamming. Journal of Computational and Applied Mathematics 277 (2015).
[4] Langville, Amy N., and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings, Chapter 6. Princeton University Press.
[5] Meyer, Carl D. Matrix Analysis and Applied Linear Algebra, Chapter 8. Vol. 2. SIAM.
[6] Langville, Amy N., and Carl D. Meyer. Deeper inside PageRank, Chapter 6: Tinkering with the Basic PageRank Model. Internet Mathematics 1.3 (2004).
[7] Ipsen, Ilse C.F., and Rebecca S. Wills. Mathematical properties and analysis of Google's PageRank. Bol. Soc. Esp. Mat. Apl. 34 (2006).
[8] Haveliwala, Taher, and Sepandar Kamvar. The second eigenvalue of the Google matrix. Stanford University Technical Report (2003).
[9] Jeh, Glen, and Jennifer Widom. Scaling personalized web search. Proceedings of the 12th International Conference on World Wide Web. ACM, 2003.
[10] Kamvar, Sepandar, et al. Exploiting the block structure of the web for computing PageRank. Stanford University Technical Report (2003).
[11] Kamvar, Sepandar, and Taher Haveliwala. The condition number of the PageRank problem. Technical Report, Stanford University.
[12] Gleich, David F. Stanford CS web. matrices/gleich/wb-cs-stanford.html, 2001.
[13] Gleich, David F. gaimc: Graph Algorithms In Matlab Code. matlabcentral/fileexchange/24134-gaimc---graph-algorithms-in-matlab-code/content/gaimc/scomponents.m, 2009.


More information

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 55, NO. 9, SEPTEMBER 2010 1987 Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE Abstract

More information

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University How works or How linear algebra powers the search engine M. Ram Murty, FRSC Queen s Research Chair Queen s University From: gomath.com/geometry/ellipse.php Metric mishap causes loss of Mars orbiter

More information

1 Searching the World Wide Web

1 Searching the World Wide Web Hubs and Authorities in a Hyperlinked Environment 1 Searching the World Wide Web Because diverse users each modify the link structure of the WWW within a relatively small scope by creating web-pages on

More information

PAGERANK COMPUTATION, WITH SPECIAL ATTENTION TO DANGLING NODES

PAGERANK COMPUTATION, WITH SPECIAL ATTENTION TO DANGLING NODES PAGERANK COMPUTATION, WITH SPECIAL ATTENTION TO DANGLING NODES ILSE CF IPSEN AND TERESA M SELEE Abstract We present a simple algorithm for computing the PageRank (stationary distribution) of the stochastic

More information

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson) Link Analysis Web Ranking Documents on the web are first ranked according to their relevance vrs the query Additional ranking methods are needed to cope with huge amount of information Additional ranking

More information

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds MAE 298, Lecture 8 Feb 4, 2008 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in a file-sharing

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

Complex Social System, Elections. Introduction to Network Analysis 1

Complex Social System, Elections. Introduction to Network Analysis 1 Complex Social System, Elections Introduction to Network Analysis 1 Complex Social System, Network I person A voted for B A is more central than B if more people voted for A In-degree centrality index

More information

Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports

Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports Jevin West and Carl T. Bergstrom November 25, 2008 1 Overview There

More information

Lecture 12: Link Analysis for Web Retrieval

Lecture 12: Link Analysis for Web Retrieval Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities

More information

Chapter 10. Finite-State Markov Chains. Introductory Example: Googling Markov Chains

Chapter 10. Finite-State Markov Chains. Introductory Example: Googling Markov Chains Chapter 0 Finite-State Markov Chains Introductory Example: Googling Markov Chains Google means many things: it is an Internet search engine, the company that produces the search engine, and a verb meaning

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer UpdatingtheStationary VectorofaMarkovChain Amy Langville Carl Meyer Department of Mathematics North Carolina State University Raleigh, NC NSMC 9/4/2003 Outline Updating and Pagerank Aggregation Partitioning

More information

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank David Glickenstein November 3, 4 Representing graphs as matrices It will sometimes be useful to represent graphs

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize/navigate it? First try: Human curated Web directories Yahoo, DMOZ, LookSmart

More information

c 2005 Society for Industrial and Applied Mathematics

c 2005 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 27, No. 2, pp. 305 32 c 2005 Society for Industrial and Applied Mathematics JORDAN CANONICAL FORM OF THE GOOGLE MATRIX: A POTENTIAL CONTRIBUTION TO THE PAGERANK COMPUTATION

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Graph and Network Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Methods Learnt Classification Clustering Vector Data Text Data Recommender System Decision Tree; Naïve

More information

A Fast Two-Stage Algorithm for Computing PageRank

A Fast Two-Stage Algorithm for Computing PageRank A Fast Two-Stage Algorithm for Computing PageRank Chris Pan-Chi Lee Stanford University cpclee@stanford.edu Gene H. Golub Stanford University golub@stanford.edu Stefanos A. Zenios Stanford University stefzen@stanford.edu

More information

Markov Chains, Random Walks on Graphs, and the Laplacian

Markov Chains, Random Walks on Graphs, and the Laplacian Markov Chains, Random Walks on Graphs, and the Laplacian CMPSCI 791BB: Advanced ML Sridhar Mahadevan Random Walks! There is significant interest in the problem of random walks! Markov chain analysis! Computer

More information

Updating Markov Chains Carl Meyer Amy Langville

Updating Markov Chains Carl Meyer Amy Langville Updating Markov Chains Carl Meyer Amy Langville Department of Mathematics North Carolina State University Raleigh, NC A. A. Markov Anniversary Meeting June 13, 2006 Intro Assumptions Very large irreducible

More information

Extrapolation Methods for Accelerating PageRank Computations

Extrapolation Methods for Accelerating PageRank Computations Extrapolation Methods for Accelerating PageRank Computations Sepandar D. Kamvar Stanford University sdkamvar@stanford.edu Taher H. Haveliwala Stanford University taherh@cs.stanford.edu Christopher D. Manning

More information

The Push Algorithm for Spectral Ranking

The Push Algorithm for Spectral Ranking The Push Algorithm for Spectral Ranking Paolo Boldi Sebastiano Vigna March 8, 204 Abstract The push algorithm was proposed first by Jeh and Widom [6] in the context of personalized PageRank computations

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #9: Link Analysis Seoul National University 1 In This Lecture Motivation for link analysis Pagerank: an important graph ranking algorithm Flow and random walk formulation

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

Link Analysis Ranking

Link Analysis Ranking Link Analysis Ranking How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it? Naïve ranking of query results Given query

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

MAA704, Perron-Frobenius theory and Markov chains.

MAA704, Perron-Frobenius theory and Markov chains. November 19, 2013 Lecture overview Today we will look at: Permutation and graphs. Perron frobenius for non-negative. Stochastic, and their relation to theory. Hitting and hitting probabilities of chain.

More information

Application. Stochastic Matrices and PageRank

Application. Stochastic Matrices and PageRank Application Stochastic Matrices and PageRank Stochastic Matrices Definition A square matrix A is stochastic if all of its entries are nonnegative, and the sum of the entries of each column is. We say A

More information

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 14: Random Walks, Local Graph Clustering, Linear Programming Lecturer: Shayan Oveis Gharan 3/01/17 Scribe: Laura Vonessen Disclaimer: These

More information

Notes on Linear Algebra and Matrix Theory

Notes on Linear Algebra and Matrix Theory Massimo Franceschet featuring Enrico Bozzo Scalar product The scalar product (a.k.a. dot product or inner product) of two real vectors x = (x 1,..., x n ) and y = (y 1,..., y n ) is not a vector but a

More information

Class President: A Network Approach to Popularity. Due July 18, 2014

Class President: A Network Approach to Popularity. Due July 18, 2014 Class President: A Network Approach to Popularity Due July 8, 24 Instructions. Due Fri, July 8 at :59 PM 2. Work in groups of up to 3 3. Type up the report, and submit as a pdf on D2L 4. Attach the code

More information

Markov Chains. As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006.

Markov Chains. As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006. Markov Chains As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006 1 Introduction A (finite) Markov chain is a process with a finite number of states (or outcomes, or

More information

Page rank computation HPC course project a.y

Page rank computation HPC course project a.y Page rank computation HPC course project a.y. 2015-16 Compute efficient and scalable Pagerank MPI, Multithreading, SSE 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and

More information

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search 6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search Daron Acemoglu and Asu Ozdaglar MIT September 30, 2009 1 Networks: Lecture 7 Outline Navigation (or decentralized search)

More information

Markov Chains and Spectral Clustering

Markov Chains and Spectral Clustering Markov Chains and Spectral Clustering Ning Liu 1,2 and William J. Stewart 1,3 1 Department of Computer Science North Carolina State University, Raleigh, NC 27695-8206, USA. 2 nliu@ncsu.edu, 3 billy@ncsu.edu

More information

Applications to network analysis: Eigenvector centrality indices Lecture notes

Applications to network analysis: Eigenvector centrality indices Lecture notes Applications to network analysis: Eigenvector centrality indices Lecture notes Dario Fasino, University of Udine (Italy) Lecture notes for the second part of the course Nonnegative and spectral matrix

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University We ve had our first HC cases. Please, please, please, before you do anything that might violate the HC, talk to me or a TA to make sure it is legitimate. It is much

More information

MITOCW ocw f99-lec30_300k

MITOCW ocw f99-lec30_300k MITOCW ocw-18.06-f99-lec30_300k OK, this is the lecture on linear transformations. Actually, linear algebra courses used to begin with this lecture, so you could say I'm beginning this course again by

More information

Combating Web Spam with TrustRank

Combating Web Spam with TrustRank Combating Web Spam with rustrank Authors: Gyöngyi, Garcia-Molina, and Pederson Published in: Proceedings of the 0th VLDB Conference Year: 00 Presentation by: Rebecca Wills Date: April, 00 Questions we

More information

Stochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property

Stochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property Chapter 1: and Markov chains Stochastic processes We study stochastic processes, which are families of random variables describing the evolution of a quantity with time. In some situations, we can treat

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 18: Latent Semantic Indexing Hinrich Schütze Center for Information and Language Processing, University of Munich 2013-07-10 1/43

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 2 Web pages are important if people visit them a lot. But we can t watch everybody using the Web. A good surrogate for visiting pages is to assume people follow links

More information

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211

PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Cutting Graphs, Personal PageRank and Spilling Paint

Cutting Graphs, Personal PageRank and Spilling Paint Graphs and Networks Lecture 11 Cutting Graphs, Personal PageRank and Spilling Paint Daniel A. Spielman October 3, 2013 11.1 Disclaimer These notes are not necessarily an accurate representation of what

More information

Probability & Computing

Probability & Computing Probability & Computing Stochastic Process time t {X t t 2 T } state space Ω X t 2 state x 2 discrete time: T is countable T = {0,, 2,...} discrete space: Ω is finite or countably infinite X 0,X,X 2,...

More information

Lecture: Local Spectral Methods (2 of 4) 19 Computing spectral ranking with the push procedure

Lecture: Local Spectral Methods (2 of 4) 19 Computing spectral ranking with the push procedure Stat260/CS294: Spectral Graph Methods Lecture 19-04/02/2015 Lecture: Local Spectral Methods (2 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper) PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper) In class, we saw this graph, with each node representing people who are following each other on Twitter: Our

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013 Link analysis The web as a directed graph Set of web pages with associated textual content Hyperlinks between webpages

More information

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors Michael K. Ng Centre for Mathematical Imaging and Vision and Department of Mathematics

More information