CS54701 Information Retrieval Link Analysis Luo Si Department of Computer Science Purdue University Borrowed Slides from Prof. Rong Jin (MSU)
Citation Analysis
Web Structure Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge What does it look likes? Web is small world The diameter of the web is 19 e.g. the average number of clicks from one web site to another is 19
Bowtie Structure Strongly Connected Component Broder et al., 2001
Bowtie Structure Sites that link towards the center of the web Broder et al., 2001
Bowtie Structure Sites that link from the center of the web Broder et al., 2001
Inlinks and Outlinks Both degrees of incoming and outgoing links follow power law Broder et al., 2001
Early Approaches Basic Assumptions Hyperlinks contain information about the human judgment of a site The more incoming links to a site, the more it is judged important Bray 1996 The visibility of a site is measured by the number of other sites pointing to it The luminosity of a site is measured by the number of other sites to which it points Limitation: failure to capture the relative importance of different parents (children) sites
HITS - Kleinberg s Algorithm HITS Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: a(v) - the authority of v h(v) - the hubness of v A site is very authoritative if it receives many citations. Citation from important sites weight more than citations from less-important sites Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites
Authority and Hubness 2 5 3 1 1 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)
Authority and Hubness: Version 1 R e c u rs iv e d e p e n d e n c y a ( v ) h ( w ) w p a [ v ] h ( v ) a ( w ) w c h [ v ] HubsAuthorities(G) 1 1 [1,,1] Є R 2 a h 1 0 0 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) t 7 h (v) Σ a (w) t w Є pa[v] 8 t t + 1 t -1 9 until a a + h h < ε t t -1 t t -1 10 return (a, h ) t t V w Є pa[v] t -1
Authority and Hubness: Version 1 R e c u rs iv e d e p e n d e n c y a ( v ) h ( w ) w p a [ v ] h ( v ) a ( w ) w c h [ v ] HubsAuthorities(G) 1 1 [1,,1] Є R 2 a h 1 0 0 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) t 7 h (v) Σ a (w) t w Є pa[v] 8 t t + 1 t -1 9 until a a + h h < ε t t -1 t t -1 10 return (a, h ) t t V w Є pa[v] t -1 Problems?
Authority and Hubness: Version 2 R e c u rs iv e d e p e n d e n c y a ( v ) h ( w ) w p a [ v ] h ( v ) a ( w ) a( v) h( v) w c h [ v ] + Normalization a( v) w a( w) h( v) w h( w) HubsAuthorities(G) 1 1 [1,,1] Є R 2 a 0 h 0 1 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) t 7 h t (v) Σ w Є pa[v] a (w) t -1 8 a t a t / a 9 h t h t / h 10 t t + 1 11 until a t a t -1 + h t h t -1 < ε 12 return (a, h ) t t V w Є pa[v] t -1
HITS Example Results Authority Hubness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Authority and hubness weights
Authority and Hubness Authority score Not only depends on the number of incoming links But also the quality (e.g., hubness) of the incoming links Hubness score Not only depends on the number of outgoing links But also the quality (e.g., hubness) of the outgoing links
Authority and Hub Column vector a: a i is the authority score for the i-th site Column vector h: h i is the hub score for the i-th site Matrix M: M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise M =
Authority and Hub Vector a: a i is the authority score for the i-th site Vector h: h i is the hub score for the i-th site Matrix M: M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise Recursive dependency: a(v) Σ h(w) w Є pa[v] h(v) Σ a(w) w Є ch[v]
Authority and Hub Column vector a: a i is the authority score for the i-th site Column vector h: h i is the hub score for the i-th site Matrix M: M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise Recursive dependency: a(v) Σ h(w) w Є pa[v] h(v) Σ a(w) w Є ch[v] a M T h h M a
Authority and Hub Column vector a: a i is the authority score for the i-th site Column vector h: h i is the hub score for the i-th site M i, j Recursive dependency: Matrix M: 1 th e ith site p o in ts to th e jth site 0 o th e rw ise Normalization Procedure a(v) Σ h(w) w Є pa[v] h(v) Σ a(w) w Є ch[v] T t t t a M h h M a t t t
Authority and Hub T t t t Apply SVD to matrix M T t t t t a M h a M M a h M a h M M h T t t t t t t t T i T i i i M UΣ V u v a u 1, h v 1
PageRank Introduced by Page et al (1998) The weight is assigned by the rank of parents Difference with HITS HITS takes Hubness & Authority weights The page rank is proportional to its parents rank, but inversely proportional to its parents outdegree
Matrix Notation M i, j 1 th e ith site p o in ts to th e jth site 0 o th e rw ise B i, j j 1 M i, j i, j 0 0 o th e rw is e j M M = B =
Matrix Notation : rep resen ts th e ran k sco re fo r th e i-th w eb p ag e r r i Finding Pagerank to find principle eigenvector of B r = α B T r α : eigenvalue r : eigenvector of B
Matrix Notation
Random Walk Model Consider a random walk through the Web graph?? B =???
Random Walk Model Consider a random walk through the Web graph B =
Random Walk Model Consider a random walk through the Web graph B =
Random Walk Model Consider a random walk through the Web graph B = T, what is portion of time that the surfer will spend time on each site?
Random Walk Model Consider a random walk through the Web graph B = p( k) : p e rc e n ta g e o f tim e th a t th e su rfe r w ill sta y a t th e i-th site p k ( ) p ( i ) B i, k i T p B p
Adding Self Loop Allow surfer to decide to stay on the same place B = B ' B (1 ) I
Problem Rank Sink Problem In general, many Web pages have no inlinks/outlinks It results in dangling edges in the graph B = 0 0 0 0 0 0 0 0 0 0 0 1 0 0 r(new page) = 0
Problem Rank Sink Problem In general, many Web pages have no inlinks/outlinks It results in dangling edges in the graph B = 0 0 0 0 0 1 0 0 0 0 0 0 0 0 r(new page) = 1
Distribution of the Mixture Model H i, j 1/ n B ' H (1 ) B r B r ' T Prevent the page ranks to be either zeros or one.
Stability Whether the link analysis algorithms based on eigenvectors are stable in the sense that results don t change significantly? The connectivity of a portion of the graph is changed arbitrary How will it affect the results of algorithms?
Stability of HITS Ng et al (2001) A bound on the number of hyperlinks k that can be added or deleted from one page without affecting the authority or hubness weights It is possible to perturb a symmetric matrix by a quantity that grows as δ that produces a constant perturbation of the dominant eigenvector δ: eigengap λ1 λ2 d: maximum outdegree of G
Stability of PageRank r r 2 r ( j ) jv Ng et al (2001) V: the set of vertices touched by the perturbation The parameter ε of the mixture model has a stabilization role If the set of pages affected by the perturbation have a small rank, the overall change will also be small