Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1
What we'll learn in this lecture: the web as a graph; the PageRank method for deriving the importance of pages; the hubs and authorities method
Up until now... Documents were assumed to be equal: usefulness for ranking was affected only by the matching of query terms (and document length). E.g., P(R|d) was assumed uniform in the probabilistic methods. Can we do better than this? Some documents are authoritative and should be ranked higher than others.
The web as a graph Pages on the Web do not stand alone. Treating them as independent documents is an over-simplification: there is considerable information in the hyperlink structure. [Figure: example link graph over pages Paraguay, jaguar, Mexico, lion, predator, tiger]
What information does a hyperlink convey? Directs the user's attention to other pages. Conferral of authority (not always!). Anchor text explains why the linked page is of interest: "IBM computers" links to www.ibm.com; "search portal" links to www.yahoo.com; "click here" links to Adobe Acrobat; "evil empire" links to... Additional source of terms for indexing. Perhaps the most important pages have more incoming links?
The web as a directed graph Formally, consider: in-links, the number of incoming edges; out-links, the number of outgoing edges; connected component, a set of nodes in which a path connects every pair of nodes. [Figure: example link graph over pages Paraguay, jaguar, Mexico, lion, predator, tiger]
Not all links are equal Who and what to trust? Outgoing links from reputable sites should carry more weight than user-generated content and links from unknown websites. The web has a bow-tie structure, comprising "in" pages that only have outgoing edges, leading to a strongly connected component whose pages are highly interlinked, which in turn links to "out" pages that only have incoming edges. Typically we don't consider internal links within a web-site (why?)
PageRank Assumptions: links convey the authority of the source page; pages with more in-links from authoritative sources are more important. How do we formalise this in a model? Random web surfer: consider a surfer who visits a web page, then follows a random out-link, chosen uniformly; occasionally teleports to a new random page (types a new URL). Inference problem: what happens over time? Where does the surfer end up visiting most often?
Example graph
Nodes 1, 2, 3, with edges 1→2, 2→1, 2→3, 3→2.
Transition probabilities (no teleport for now):
P(1→1) = 0    P(1→2) = 1    P(1→3) = 0
P(2→1) = 1/2  P(2→2) = 0    P(2→3) = 1/2
P(3→1) = 0    P(3→2) = 1    P(3→3) = 0
Example from MRS, Chapter 21.
Example graph
Represent as a matrix, P_ij = P(i→j). I.e.,
P = [ 0    1    0
      1/2  0    1/2
      0    1    0  ]
Note that P is the adjacency matrix (A_ij = 1 iff the edge i→j exists), normalised such that each row sums to 1.
Adding teleportation
If at each time step we randomly jump to another node in the graph with probability α:
scale our original P matrix by 1 − α
add α/N to every entry of the resulting matrix
Overall our transition matrix is
P_ij = (1 − α) A_ij / Σ_j A_ij + α/N
For the example with α = 0.5:
P = [ 1/6   2/3   1/6
      5/12  1/6   5/12
      1/6   2/3   1/6  ]
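The teleport-adjusted matrix can be built in a few lines of code. This is an illustrative pure-Python sketch (not from the lecture); the names A, alpha and N follow the slide's notation, and the graph is the running 3-node example.

```python
def transition_matrix(A, alpha):
    """Row-normalise adjacency matrix A, then mix in teleportation.
    Assumes every node has at least one out-link (no dangling pages)."""
    N = len(A)
    P = []
    for row in A:
        deg = sum(row)  # out-degree of this node
        P.append([(1 - alpha) * a / deg + alpha / N for a in row])
    return P

# Running example: edges 1->2, 2->1, 2->3, 3->2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

P = transition_matrix(A, alpha=0.5)
# Rows: [1/6, 2/3, 1/6], [5/12, 1/6, 5/12], [1/6, 2/3, 1/6]
```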
Markov chain
Formally we have defined a Markov chain:
a discrete-time stochastic process
consists of N states, one per web page; the state distribution is denoted x (a row vector)
starting at page i, use a 1-hot representation, i.e., x_i = 1 and x_j = 0 for all j ≠ i
Assumptions:
the probability of reaching a state depends only on the previous state,
p(x_t | x_1, x_2, ..., x_{t−1}) = p(x_t | x_{t−1})
this is characterised by the transition matrix
Transitions as matrix multiplication
The probability chain rule can be expressed using matrix multiplication:
xP = [P(1) P(2) P(3)] [ P(1→1) P(1→2) P(1→3)
                        P(2→1) P(2→2) P(2→3)
                        P(3→1) P(3→2) P(3→3) ]
   = [ P(1)P(1→1) + P(2)P(2→1) + P(3)P(3→1),
       P(1)P(1→2) + P(2)P(2→2) + P(3)P(3→2),
       P(1)P(1→3) + P(2)P(2→3) + P(3)P(3→3) ]
   = [P(X_{t+1} = 1)  P(X_{t+1} = 2)  P(X_{t+1} = 3)]
where P(1) = P(X_t = 1) and P(2→3) = P(X_{t+1} = 3 | X_t = 2).
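A single step of this update can be computed directly as a row-vector-times-matrix product. Pure Python for illustration; P here is the teleport-adjusted matrix from the running example (α = 0.5).

```python
def step(x, P):
    """One Markov-chain step: the row vector x multiplied by the matrix P."""
    N = len(P)
    return [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]

P = [[1/6, 2/3, 1/6],
     [5/12, 1/6, 5/12],
     [1/6, 2/3, 1/6]]

x = [0, 1, 0]    # one-hot: start at page 2
x1 = step(x, P)  # [5/12, 1/6, 5/12], i.e. [0.4167, 0.1667, 0.4167]
```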
Example
start at state x
after one time-step, the probability of the next state is xP
after two time-steps, (xP)P = xP^2
...
Example:
x      = [0      1      0     ]
xP     = [0.4167 0.1667 0.4167]
xP^2   = [0.2083 0.5833 0.2083]
xP^3   = [0.3125 0.3750 0.3125]
...
xP^99  = [0.2778 0.4444 0.2778]
xP^100 = π = [0.2778 0.4444 0.2778]
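The sequence of powers can be reproduced by iterating the one-step update to a fixed point. A minimal pure-Python sketch; the tolerance and iteration cap are arbitrary choices, not from the slides.

```python
def power_iterate(x, P, tol=1e-10, max_iter=1000):
    """Repeatedly apply x <- xP until the distribution stops changing."""
    N = len(P)
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

P = [[1/6, 2/3, 1/6],
     [5/12, 1/6, 5/12],
     [1/6, 2/3, 1/6]]

pi = power_iterate([0, 1, 0], P)
# pi is approximately [0.2778, 0.4444, 0.2778]
```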
Markov chain convergence Run sufficiently long, the state distribution converges: it reaches a steady state, denoted π; transitions from this state leave it unmodified; the state frequencies encode how often the random surfer visits each page in the limit as t → ∞. When will the Markov chain converge? It must have the property of ergodicity: for any start state i, every state j must be reachable with non-zero probability for all t > T₀, for some constant T₀. Ergodicity in turn requires irreducibility (reachability between any i and j) and aperiodicity (no partitioning into sets with internal cycles).
Computing PageRank The definition of the steady state is πP = π, i.e., once in the steady state, we remain in it after a transition. This is a classic linear algebra problem (finding the left eigenvectors), of the form πP = λπ. We can recover several solution vectors for different values of λ; we want the principal eigenvector, for which λ = 1. In practice, the power iteration method may be used to handle large graphs.
Iteration method in Matlab
>> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
>> pi0 = [1/3 1/3 1/3];
>> pi1 = pi0 * P
pi1 = 0.2500 0.5000 0.2500
>> pi2 = pi1 * P
pi2 = 0.2917 0.4167 0.2917
>> pi3 = pi2 * P
pi3 = 0.2708 0.4583 0.2708
>> pi4 = pi3 * P
pi4 = 0.2812 0.4375 0.2812
Eigenvalue method in Matlab
>> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
>> [V,D,W] = eig(P);
>> pi = W(:,1)';
>> pi = pi / sum(pi)
pi = 0.2778 0.4444 0.2778
>> pi * P
ans = 0.2778 0.4444 0.2778
Hubs and authorities (HITS) Assumes there are two kinds of pages on the web: authorities, providing authoritative and detailed information (Australian Tax Office, Bureau of Meteorology, the Wikipedia page on the Australian cricket team); hubs, containing mainly links to lots of pages about a topic (DMOZ, web directories, someone's Pinterest page, Wikipedia disambiguation pages). Depending on our query, we might want one or the other: a broad topic query, e.g., information about biking in Melbourne; a specific query, e.g., is the Yarra trail sealed?
Hubs and authorities
Circular definition:
a good hub links to many authorities
a good authority is linked to from many hubs
Define:
h, hub scores for each web page
a, authority scores for each web page
Mutually recursive definition:
h_i ∝ Σ_{i→j} a_j
a_i ∝ Σ_{j→i} h_j
for all pages i. This gives rise to an iterative algorithm for finding h and a.
Computing hubs and authorities
Define A, the adjacency matrix, as before: A_ij = 1 denotes the edge i→j.
This leads to the relations
h ∝ A a
a ∝ A^T h
Combining the definition for h into that for a,
a ∝ A^T A a
which is another eigenvalue problem. The principal eigenvector of A^T A can be used to solve for a, provided there is a steady-state solution (h is found in a similar way, from A A^T).
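The iterative scheme implied by these relations can be sketched as alternating updates with a normalisation step. The example graph and the L1 normalisation below are illustrative choices, not from the lecture.

```python
def hits(A, iters=50):
    """Alternate a <- A^T h and h <- A a, normalising each round."""
    N = len(A)
    h = [1.0] * N
    for _ in range(iters):
        # authority score of i: sum of hub scores of pages linking to i
        a = [sum(A[j][i] * h[j] for j in range(N)) for i in range(N)]
        # hub score of i: sum of authority scores of pages i links to
        h = [sum(A[i][j] * a[j] for j in range(N)) for i in range(N)]
        sa, sh = sum(a), sum(h)
        a = [x / sa for x in a]
        h = [x / sh for x in h]
    return h, a

# Toy graph: page 0 links to 1 and 2; page 3 links to 1.
A = [[0, 1, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 1, 0, 0]]

h, a = hits(A)
# Page 0 gets the top hub score; page 1 the top authority score.
```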
Summary: PageRank and hubs and authorities Both are static, query-independent measures of web page quality: they can be run offline to score each web page; both are based on a latent (unobserved) quality metric for each page (a single importance score, or separate hub and authority scores), plus a transition mechanism. This gives rise to document rankings.
PageRank and HITS in a retrieval system How can we use these scores in a retrieval system? Alongside our other features, e.g., TF*IDF, BM25 factors, LM. Express the model as a combination of factors, e.g.,
RSV_d = Σ_{i=1}^{I} α_i h_i(f_{q,:}, f_{d,:}, ...)
PR/H&A become additional features, each with their own weight α_i. Learn the weighting using a machine-learned scoring function to match binary relevance judgements, based on click-throughs or query reformulations. These methods can be exploited, e.g., link spam, Google bombs, Google bowling, etc.
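The combination of factors can be sketched as a simple weighted sum. The feature names and weights below are invented for illustration; in practice the weights α_i would be learned from relevance judgements.

```python
def rsv(features, weights):
    """Linear scoring function: sum_i alpha_i * f_i(q, d)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical per-document feature values and hand-set weights
weights = {"bm25": 1.0, "pagerank": 0.5, "authority": 0.3}
doc = {"bm25": 7.2, "pagerank": 0.4444, "authority": 0.1}

score = rsv(doc, weights)  # 7.2 + 0.2222 + 0.03 = 7.4522
```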
Looking back and forward Back Link structure of the web gives rise to a graph Structure of the graph conveys information about importance of pages PageRank models random surfer using Markov Chain HITS models hubs versus authority nodes Solve for steady-state using power iteration or eigenvalue solver
Looking back and forward Forward Natural language processing, looking in more detail into the structure of text Starting with text classification
Further reading (Review of eigen decompositions) Section 18.1, "Linear algebra review", of Manning, Raghavan, and Schütze, Introduction to Information Retrieval. Chapter 21, "Link Analysis", of Manning, Raghavan, and Schütze, Introduction to Information Retrieval.