Lecture 13: Link Analysis

Outline
- 13.1 Link Analysis
- 13.2 Google's PageRank Algorithm
  - The Top Ten Algorithms in Data Mining
  - J. MacCormick, Nine Algorithms That Changed the Future, Princeton University Press, 2nd chapter.
- 13.3 Efficient Computation of PageRank by MapReduce
Slides provided by J. Leskovec, A. Rajaraman, J. Ullman, available at http://www.mmds.org

13.1 Search Engine Indexing
Matching
- Query "cat": Pages 1, 3
- Query "cat dog": Page 3
- Query "cat sat": word-location trick
  - cat: 1-2, 3-2
  - sat: 1-3, 3-7
  - Answer: Page 1

Web as a Directed Graph
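The word-location trick above can be sketched as a small inverted index that stores a (page, position) pair for every word occurrence. This is only an illustrative sketch: the three page texts are hypothetical, chosen so that the recorded locations match the slide (cat at 1-2 and 3-2, sat at 1-3 and 3-7), and the function names are not from the lecture.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to a list of (page, position) pairs; positions start at 1."""
    index = defaultdict(list)
    for page, text in pages.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((page, position))
    return index

def pages_with_all(index, words):
    """Pages that contain every query word (a plain AND query)."""
    page_sets = [{page for page, _ in index.get(w.lower(), [])} for w in words]
    return sorted(set.intersection(*page_sets)) if page_sets else []

def phrase_pages(index, first, second):
    """Pages where `second` occurs directly after `first` (word-location trick)."""
    second_hits = set(index.get(second.lower(), []))
    return sorted({page for page, pos in index.get(first.lower(), [])
                   if (page, pos + 1) in second_hits})

# Hypothetical page contents (an assumption, not given on the slide).
pages = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}
index = build_index(pages)

print(pages_with_all(index, ["cat"]))         # [1, 3]
print(pages_with_all(index, ["cat", "dog"]))  # [3]
print(phrase_pages(index, "cat", "sat"))      # [1]
```

Storing positions costs extra space, but it lets a phrase query be answered from the index alone: page 3 contains both words, yet "sat" is at position 7 rather than 3, so only page 1 matches.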

How to organize the Web? (1)
- First try: human-curated Web directories (Yahoo, DMOZ, LookSmart).
- Term spam: add the term "movie" 1000 times at the bottom of the page, change the color of the text to the background color of the page, and make it really small.

How to organize the Web? (2)
- Second try: Web search.
- Information retrieval investigates finding relevant documents in a small and trusted set: newspaper articles, patents, etc.
- But: the Web is huge and full of untrusted documents, random things, web spam, etc.
- Idea: links as votes. A page is more important if it has more links.
  - In-coming links? Out-going links?
  - www.stanford.edu has 23,400 in-links; www.joe-schmoe.com has 1 in-link.

13.2 Google's PageRank Algorithm (1/2)
- References:
  - http://pr.efactory.de/e-pagerank-algorithm.shtml
  - L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford University, 1999.
- PageRank: a quality measure of Web pages.
- Assumption: the number of links to a page gives information about the importance of that page.
- Example: web page i has 1 in-link and 3 out-links.

Google's PageRank Algorithm (2/2)
- PR(X) is the PageRank of page X.
- Random surfing (Markov chain, 1906).
- For the three pages A, B, and C:
  \[ PR(A) = PR(C), \qquad PR(B) = \tfrac{1}{2} PR(A), \qquad PR(C) = \tfrac{1}{2} PR(A) + PR(B) \]
- In matrix form:
  \[ \begin{pmatrix} PR(A) \\ PR(B) \\ PR(C) \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix} \begin{pmatrix} PR(A) \\ PR(B) \\ PR(C) \end{pmatrix} = M \begin{pmatrix} PR(A) \\ PR(B) \\ PR(C) \end{pmatrix} \]
- The PageRank vector is an eigenvector of M with eigenvalue 1.
- Application: Twitter (who-to-follow).
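As a rough check of the eigenvector statement, the sketch below builds the matrix M from a link list and verifies that a candidate vector is left unchanged by M. The link structure (A links to B and C, B links to C, C links to A) is inferred from the three equations rather than stated on the slide, and the helper names are illustrative.

```python
from fractions import Fraction

def transition_matrix(links, nodes):
    """Column-stochastic M: M[i][j] = 1/outdeg(j) if page j links to page i."""
    n = len(nodes)
    pos = {name: k for k, name in enumerate(nodes)}
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            M[pos[dst]][pos[src]] = Fraction(1, len(targets))
    return M

def matvec(M, v):
    """Matrix-vector product M v using exact fractions."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

nodes = ["A", "B", "C"]
# Assumed link structure, inferred from the PageRank equations.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
M = transition_matrix(links, nodes)

# Candidate PageRank vector, normalized so that the entries sum to 1.
pr = [Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)]
print(matvec(M, pr) == pr)  # True: pr is an eigenvector with eigenvalue 1
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.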

Solving PageRank Algorithm (1/2)
- Solving the above system of equations directly:
  \[ PR = (2r, \; r, \; 2r)^T, \qquad PR(A) + PR(B) + PR(C) = 1 \;\Rightarrow\; r = 1/5 \]
- Web pages A and C are more important than page B.
- Randomize the order of A and C (wiki).
- Arithmetic complexity of Gaussian elimination: \( O(n^3) \).
- Problem: the web consists of a trillion (\( 10^{12} \)) documents (Google, 7/25/2008).

Solving PageRank Algorithm (2/2)
- Instead, the PageRank vector is computed by the following iterative process:
  \[ PR^{(k+1)} = M \, PR^{(k)} \]
- If the matrix M satisfies certain conditions, the process converges to a unique distribution, and it converges very fast.
- Excel: select a 3 × 1 range, enter =MMULT($A$2:$C$4,E2:E4), and press Ctrl+Shift+Enter.
- Stop if \( \| PR^{(k+1)} - PR^{(k)} \|_1 < \epsilon \), where \( \| x \|_1 = \sum_{1 \le i \le N} |x_i| \) is the L1 norm.
- Any other vector norm can be used, e.g., Euclidean.

PageRank: Problems
(1) Some pages are dead ends (they have no out-links).
  - The random walk has nowhere to go.
  - Such pages cause importance to leak out.
(2) Spider traps (all out-links are within the group).
  - The random walk gets stuck in the trap.
  - Eventually spider traps absorb all importance.

Problem: Dead Ends
Pages y, a, and m; page m is a dead end.
  \[ M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} \]
  \[ r_y = r_y/2 + r_a/2, \qquad r_a = r_y/2, \qquad r_m = r_a/2 \]

  Iteration   0     1     2      3      ...   limit
  r_y         1/3   2/6   3/12   5/24   ...   0
  r_a         1/3   1/6   2/12   3/24   ...   0
  r_m         1/3   1/6   1/12   2/24   ...   0

The PageRank leaks out since the matrix is not stochastic.
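The iterative process and the dead-end leak can both be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the tolerance `eps`, the iteration cap `max_iter`, and the function names are assumptions, and the first matrix is the three-page A, B, C example while the second is the dead-end matrix with rows (1/2, 1/2, 0), (1/2, 0, 0), (0, 1/2, 0).

```python
from fractions import Fraction as F

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def l1_distance(u, v):
    """L1 norm of u - v."""
    return sum(abs(a - b) for a, b in zip(u, v))

def power_iterate(M, v, eps=1e-10, max_iter=1000):
    """Repeat v <- M v until the L1 change between iterates drops below eps."""
    for _ in range(max_iter):
        nxt = matvec(M, v)
        if l1_distance(nxt, v) < eps:
            return nxt
        v = nxt
    return v

# Three-page example (A, B, C) in floating point.
M_abc = [[0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]
pr = power_iterate(M_abc, [1 / 3, 1 / 3, 1 / 3])
print([round(x, 4) for x in pr])  # [0.4, 0.2, 0.4]

# Dead-end example with exact fractions: the third column is all zeros.
M_dead = [[F(1, 2), F(1, 2), F(0)], [F(1, 2), F(0), F(0)], [F(0), F(1, 2), F(0)]]
v = [F(1, 3)] * 3
history = [v]
for _ in range(3):
    v = matvec(M_dead, v)
    history.append(v)
print([sum(step) for step in history])  # total importance shrinks at each step
```

In the dead-end run the entries no longer sum to 1 after the first step, which is the leak described above: a zero column means the matrix is not column-stochastic.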

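Solution: Teleport! (1)
- Teleports: follow random teleport links with probability 1.0 from dead ends.
- The all-zero column of the dead end m is replaced by a column of 1/3:
  \[ \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{pmatrix} \]

Solution: Teleport! (2)
  \[ r_y = r_y/2 + r_a/2 + r_m/3, \qquad r_a = r_y/2 + r_m/3, \qquad r_m = r_a/2 + r_m/3 \]

  Iteration   0     1      2        ...   limit
  r_y         1/3   8/18   49/108   ...   6/13
  r_a         1/3   5/18   34/108   ...   4/13
  r_m         1/3   5/18   25/108   ...   3/13

Problem: Spider Traps
Page m is a spider trap.
  \[ M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} \]
  \[ r_y = r_y/2 + r_a/2, \qquad r_a = r_y/2, \qquad r_m = r_a/2 + r_m \]

  Iteration   0     1     2      3       ...   limit
  r_y         1/3   2/6   3/12   5/24    ...   0
  r_a         1/3   1/6   2/12   3/24    ...   0
  r_m         1/3   3/6   7/12   16/24   ...   1

All the PageRank score gets trapped in node m.

Solution: Teleports!
The Google solution for spider traps: at each time step, the random surfer has two options.
- With probability \( \beta \), follow a link at random.
- With probability \( 1 - \beta \), jump to some random page.
- Common values for \( \beta \) are in the range 0.8 to 0.9.
- The surfer will teleport out of a spider trap within a few steps.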
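The two iteration tables can be reproduced with exact fractions. The sketch below is illustrative only: `fix_dead_ends` and `iterate` are hypothetical helper names, and the page order in every vector follows the matrix rows.

```python
from fractions import Fraction as F

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def fix_dead_ends(M):
    """Replace every all-zero column (a dead end) with a uniform 1/N column."""
    n = len(M)
    fixed = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):
            for i in range(n):
                fixed[i][j] = F(1, n)
    return fixed

def iterate(M, v, steps):
    """Return the list of iterates v, Mv, M^2 v, ... with `steps` multiplications."""
    out = [v]
    for _ in range(steps):
        v = matvec(M, v)
        out.append(v)
    return out

half = F(1, 2)
M_dead = [[half, half, 0], [half, 0, 0], [0, half, 0]]  # third page is a dead end
M_trap = [[half, half, 0], [half, 0, 0], [0, half, 1]]  # third page is a spider trap
start = [F(1, 3)] * 3

fixed_history = iterate(fix_dead_ends(M_dead), start, 2)
trap_history = iterate(M_trap, start, 3)

print(fixed_history[2])  # equals 49/108, 34/108, 25/108 (shown in lowest terms)
print(trap_history[3])   # equals 5/24, 3/24, 16/24 (shown in lowest terms)
```

After the dead-end fix every column sums to 1, so the total stays at 1 and the iteration settles at (6/13, 4/13, 3/13). The spider-trap matrix is already column-stochastic, so nothing leaks; the score simply drifts into the trap.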

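The Google Matrix
- PageRank equation [Brin-Page, 98], where \( d_i \) is the number of out-links of page i:
  \[ r_j = \sum_{i \to j} \beta \, \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N} \]
- The Google Matrix A:
  \[ A = \beta M + (1 - \beta) \left[ \tfrac{1}{N} \right]_{N \times N} \]
  Here \( [1/N]_{N \times N} \) is the N by N matrix where all entries are 1/N.
- We have a recursive problem: \( r = A \cdot r \). And the Power method still works!
- What is \( \beta \)? In practice \( \beta = 0.8 \) (make 5 steps on avg., jump).

Random Teleports (\( \beta = 0.8 \))
  \[ A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix} \]

  Iteration   0     1      2      3      ...   limit
  r_y         1/3   0.33   0.28   0.26   ...   7/33
  r_a         1/3   0.20   0.20   0.18   ...   5/33
  r_m         1/3   0.47   0.52   0.56   ...   21/33

13.3 Efficient Computation of PageRank by MapReduce
- One iteration of the PageRank algorithm involves taking an estimated PageRank vector v and computing the next estimate v' by
  \[ v' = \beta M v + (1 - \beta) \frac{e}{n} \]
  where \( \beta = 0.8 \), e is a vector of all 1's, and n is the number of nodes in the graph.
- Problem: the web consists of a trillion (\( 10^{12} \)) documents (Google, 7/25/2008), so v is much too large to fit in main memory.

PageRank by MapReduce
- Partition the matrix into square blocks.
- Size of the matrix
- One approach
  - Mapper: compute
  - Reducer:
  - on one processor, 9 in total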
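One way to picture the blocked computation is a single-process simulation of one map phase and one reduce phase per PageRank iteration. The sketch below is an assumption about how the square-block partitioning could be organized, not the scheme from the slide: each map task holds one block of M and the matching stripe of v, and the reducer sums the partial results for each stripe of v' and adds the teleport term. The names `partition` and `pagerank_step_blocks` and the choice k = 3 are hypothetical.

```python
from collections import defaultdict

def partition(M, k):
    """Split an n x n matrix into k*k square blocks keyed by (i, j); assumes k divides n."""
    n = len(M)
    size = n // k
    return {(i, j): [row[j * size:(j + 1) * size] for row in M[i * size:(i + 1) * size]]
            for i in range(k) for j in range(k)}

def pagerank_step_blocks(blocks, v, beta, n, k):
    """One iteration v' = beta*M*v + (1-beta)*e/n computed block by block."""
    size = n // k
    # Map phase: block (i, j) multiplies stripe j of v and emits (i, partial vector).
    emitted = []
    for (i, j), block in blocks.items():
        stripe = v[j * size:(j + 1) * size]
        partial = [sum(m * x for m, x in zip(row, stripe)) for row in block]
        emitted.append((i, partial))
    # Reduce phase: sum the partial vectors for each stripe i of the result.
    sums = defaultdict(lambda: [0.0] * size)
    for i, partial in emitted:
        sums[i] = [a + b for a, b in zip(sums[i], partial)]
    # Scale by beta and add the teleport term (1 - beta)/n to every entry.
    result = []
    for i in range(k):
        result.extend(beta * s + (1 - beta) / n for s in sums[i])
    return result

# Spider-trap example with beta = 0.8; with n = 3 and k = 3 there are 9 blocks.
M = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.5, 1.0]]
beta, n, k = 0.8, 3, 3
blocks = partition(M, k)

first = pagerank_step_blocks(blocks, [1 / 3] * 3, beta, n, k)
v = [1 / 3] * 3
for _ in range(100):
    v = pagerank_step_blocks(blocks, v, beta, n, k)
print([round(x, 4) for x in v])  # close to 7/33, 5/33, 21/33
```

In a real cluster each block and stripe would live on a different machine and only the small partial vectors would cross the network; here everything runs in one process purely to show the data flow.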

Cluster Architecture
- 1 Gbps between any pair of nodes in a rack.
- Each rack contains 16-64 nodes.
- 2-10 Gbps backbone between racks.
- In 2011 it was estimated that Google had 1M machines, http://bit.ly/shh0ro

Some Information
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 2008, pp. 107-113. The paper on Google's map-reduce.
- Chapter 2, Map-Reduce and the New Software Stack, in the 2nd edition, http://www.mmds.org
- Cheng T. Chu, et al., Map-Reduce for Machine Learning on Multicore, NIPS, 2006, pp. 281-288. Implementations of 10 algorithms.