How Does Google? A journey into the wondrous mathematics behind your favorite websites
David F. Gleich, Computer Science, Purdue University
Mathematics underlies an enormous number of the websites we use every day!
1. PageRank
2. Multi-armed bandits and internet experiments
Larry Page and Sergey Brin created a web-search algorithm called BackRub and spun off a company, Googol, based on the paper:
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, 1999.
The importance of a page is determined by the importance of pages that link to it.
A web search primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit over 200 measures to human evaluations
5. Produce rankings
6. Continuously update
Pages, nodes, incoming links, outgoing links, and importance
[Figure: pages a, b, and c. Labels: "Important pages that link to me!" and "Important pages that link to Purdue!"]
Tim Davis and Yifan Hu Sparse Matrix Gallery
The web
1000 vertices fit on 8.5-by-11 paper.
1,000,000,000,000 vertices (one trillion)? Paper the size of Manhattan island (23 sq miles)!
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
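The Manhattan comparison can be checked with a few lines of arithmetic. This is a rough sketch, assuming the drawing density stays at 1000 vertices per letter-size sheet; the constant names are made up for this example.

```python
# Back-of-the-envelope check of the "paper the size of Manhattan" claim.
VERTICES_PER_SHEET = 1000             # from the slide: 1000 vertices per sheet
SHEET_AREA_SQ_IN = 8.5 * 11           # letter paper, in square inches
WEB_VERTICES = 1_000_000_000_000      # one trillion pages
INCHES_PER_MILE = 5280 * 12           # feet per mile times inches per foot

sheets = WEB_VERTICES / VERTICES_PER_SHEET
area_sq_miles = sheets * SHEET_AREA_SQ_IN / INCHES_PER_MILE ** 2

print(f"{sheets:.0f} sheets cover about {area_sq_miles:.1f} square miles")
```

The result lands close to the 23 square miles quoted on the slide.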
We need something better!
A wee web-graph: link counting is too easy to game!
[Figure: a web graph on six nodes, 1 through 6, with links weighted 1/3 out of node 1 and 1/2 out of node 2.]
A wee web-graph: link counting is too easy to game!
The importance of a page is determined by the importance of pages that link to it.
[Figure: the six-node web graph.]
$x_1 = 0$
$x_2 = \frac{1}{3} x_1$
$x_3 = \frac{1}{3} x_1 + \frac{1}{2} x_2$
$x_4 = \frac{1}{3} x_1 + x_3 + x_5$
$x_5 = x_4$
$x_6 = \frac{1}{2} x_2$
The importance of a page is determined by the importance of pages that link to it
Example: $x_3 = \frac{1}{3} x_1 + \frac{1}{2} x_2$
In general:
$$x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j$$
$x_i$: importance of page i
$x_j$: importance of page j
$B_i$: back-links of page i, that is, the pages that link to i (why it was called BackRub!)
$d_j$: number of links page j uses (out-degree in graph theory)
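A minimal sketch of the back-link sum, assuming the six-node graph is stored as a dictionary of out-links read off the equations for the wee web-graph. The names `OUT_LINKS` and `importance` and the trial values in `x` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Out-links for the six-node example, read off the slide's equations.
# The dictionary encoding is an assumption made for this sketch.
OUT_LINKS = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}


def importance(i, x):
    """Sum x[j] / d_j over every page j that links to page i."""
    total = Fraction(0)
    for j, targets in OUT_LINKS.items():
        if i in targets:
            total += Fraction(x[j]) / len(targets)  # d_j is the out-degree of j
    return total


# Hypothetical trial values, picked so the arithmetic is easy to follow.
x = {1: 3, 2: 2, 3: 0, 4: 0, 5: 0, 6: 0}
print(importance(3, x))  # evaluates (1/3)*3 + (1/2)*2
```

Exact fractions keep the weights 1/3 and 1/2 free of rounding noise, which makes the result easy to compare with the hand calculation.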
We can rewrite this equation in a more mathematically convenient way
$x_1 = 0 x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6$
$x_2 = \frac{1}{3} x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6$
$x_3 = \frac{1}{3} x_1 + \frac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6$
$x_4 = \frac{1}{3} x_1 + 0 x_2 + 1 x_3 + 0 x_4 + 1 x_5 + 0 x_6$
$x_5 = 0 x_1 + 0 x_2 + 0 x_3 + 1 x_4 + 0 x_5 + 0 x_6$
$x_6 = 0 x_1 + \frac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6$
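And even more conveniently!
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
$$
Element k in column m = "probability" of going from node m to node k
or $x = Px$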
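The matrix can be assembled mechanically from the link structure. The following is a sketch under the assumption that the graph is given as a dictionary of out-links; `OUT_LINKS` and `build_matrix` are illustrative names, not anything from the talk.

```python
from fractions import Fraction

# Out-links of the six-node example graph (assumed encoding).
OUT_LINKS = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}
N = 6


def build_matrix(out_links, n):
    """Entry (k, m) is 1/d_m when node m links to node k, and 0 otherwise."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for m, targets in out_links.items():
        for k in targets:
            P[k - 1][m - 1] = Fraction(1, len(targets))  # nodes are 1-based
    return P


P = build_matrix(OUT_LINKS, N)
column_sums = [sum(P[k][m] for k in range(N)) for m in range(N)]

for row in P:
    print("  ".join(f"{str(v):>3}" for v in row))
print("column sums:", [str(s) for s in column_sums])
```

Columns 1 through 5 each sum to 1, which is what lets the entries be read as probabilities. Column 6 sums to 0 because node 6 has no outgoing links.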
The matrix P for websites shows a lot of structure
Every dot is a non-zero element indicating a link.
The matrices are sparse and generally have block structure.
The block structure can be exploited to speed up the ranking algorithm.
But this idea doesn't work for the wee web-graph
Nodes 1, 4 and 5 determine everything!
[Figure: the six-node web graph.]
$x_1 = 0$
$x_2 = \frac{1}{3} x_1 = 0$
$x_3 = \frac{1}{3} x_1 + \frac{1}{2} x_2 = 0$
$x_4 = \frac{1}{3} x_1 + x_3 + x_5 = x_5$
$x_5 = x_4$
$x_6 = \frac{1}{2} x_2 = 0$
But this idea doesn't work for the wee web-graph
Node 1: lonely
Nodes 4 and 5: mutual admiration society
Node 6: anti-social
[Figure: the six-node web graph.]
These nodes need to be fixed to get a reliable and useful ranking!
The gang of four to the rescue
Andrei Markov, Oskar Perron, Georg Frobenius, Richard von Mises
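Let's fix it up and force node 6 to choose, or link to everyone
Before:
$$
P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
$$
After:
$$
P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
$$
[Figure: the six-node web graph.]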
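The fix for node 6 amounts to replacing any all-zero column with a uniform column. A small sketch, assuming the matrix is held as a list of rows of exact fractions; `fix_dangling` is a hypothetical helper name.

```python
from fractions import Fraction

# The matrix P of the six-node example before the fix, one string per row.
ROWS = [
    "0   0   0 0 0 0",
    "1/3 0   0 0 0 0",
    "1/3 1/2 0 0 0 0",
    "1/3 0   1 0 1 0",
    "0   0   0 1 0 0",
    "0   1/2 0 0 0 0",
]
P = [[Fraction(entry) for entry in row.split()] for row in ROWS]


def fix_dangling(P):
    """Return a copy of P where every all-zero column links to everyone."""
    n = len(P)
    fixed = [row[:] for row in P]
    for m in range(n):
        if all(P[k][m] == 0 for k in range(n)):  # node m+1 has no out-links
            for k in range(n):
                fixed[k][m] = Fraction(1, n)
    return fixed


P_fixed = fix_dangling(P)
print([str(row[5]) for row in P_fixed])  # the repaired column for node 6
```

After the fix every column sums to 1, so each node passes along all of its importance.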
Taxation is the way to representation!
[Figure: pages a, b, and c link to a page.]
If this is a good page, then it'll still be a good page if we tax the importance from a, b, and c.
We can redistribute the taxed amounts to all, including lonely nodes!
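The importance of a page is determined by the importance of pages that link to it*
$$x_i = \alpha \sum_{j \in B_i} \frac{x_j}{d_j} + (1 - \alpha) b_i$$
$\alpha$: sets the taxation rate for all pages (each page keeps a fraction $\alpha$ of what it receives and pays $1 - \alpha$ in tax)
$x_j / d_j$: the total importance that page j contributes to page i
$b_i$: benefits to page i
* After tax and any benefits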
Perron and Frobenius showed the new equation always has a unique solution!
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
= \alpha
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
+ (1 - \alpha)
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}
$$
$x = \alpha P x + (1 - \alpha) b$
What von Mises and Richardson showed is that guess, check, and correct works!
$$x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b$$
$x^{(\text{start})} = (0.17, 0.17, 0.17, 0.17, 0.17, 0.17)$
$x^{(1)} = (0.05, 0.10, 0.17, 0.38, 0.19, 0.12)$
$x^{(2)} = (0.04, 0.06, 0.10, 0.36, 0.36, 0.08)$
$x^{(\infty)} = (0.03, 0.04, 0.06, 0.43, 0.39, 0.05)$
[Figure: the six-node web graph.]
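The guess, check, and correct loop is short enough to write out. This sketch assumes $\alpha = 0.85$ and a uniform benefit vector $b_i = 1/6$; neither value is stated on the slide, but together they reproduce the printed iterates to two decimals. The names `ALPHA`, `step`, and `history` are illustrative.

```python
ALPHA = 0.85   # assumed tax parameter; not stated on the slide
N = 6

# Fixed-up matrix P for the six-node example (column 6 links to everyone).
P = [
    [0,   0,   0, 0, 0, 1/6],
    [1/3, 0,   0, 0, 0, 1/6],
    [1/3, 1/2, 0, 0, 0, 1/6],
    [1/3, 0,   1, 0, 1, 1/6],
    [0,   0,   0, 1, 0, 1/6],
    [0,   1/2, 0, 0, 0, 1/6],
]
b = [1 / N] * N  # assumed uniform benefits


def step(x):
    """One update: x_new = ALPHA * P x_old + (1 - ALPHA) * b."""
    return [
        ALPHA * sum(P[k][m] * x[m] for m in range(N)) + (1 - ALPHA) * b[k]
        for k in range(N)
    ]


x = [1 / N] * N          # start from equal importance everywhere
history = [x]
for _ in range(200):     # far more steps than two-decimal accuracy needs
    x = step(x)
    history.append(x)

for label, vec in [("start", history[0]), ("1", history[1]),
                   ("2", history[2]), ("200", history[-1])]:
    print(label, [round(v, 2) for v in vec])
```

Each step shrinks the remaining error by roughly a factor of `ALPHA`, so the loop settles quickly on this small graph.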
There's still a lot of work left to do to make a search engine
Make it fast!
Watch out for spam
Watch out for manipulation
Personalize
Experiment!
1. PageRank
2. Multi-armed bandits and internet experiments
Not this! http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/
This!
Pays out $0.99/dollar
Pays out $0.95/dollar
Pays out $0.92/dollar
Pays out $0.98/dollar
http://upload.wikimedia.org/wikipedia/en/8/82/las_vegas_slot_machines.jpg
What in the heck does a multi-armed bandit have to do with Google?
What in the heck does a multi-armed bandit have to do with Google?
Pays out $0.91/view (show ads)
Pays out -$0.02/view (hide ads)
Pays out $0.92/view
Pays out $0.66/view
How to optimize your website without exploiting the bandits
Try condition A 100 times, find 45 wins
Try condition B 100 times, find 85 wins
Try condition C 100 times, find 10 wins
Choose the best!
This field has some of the best terminology
Explore
Exploit
Regret
This field has some of the best terminology
Explore: visiting Las Vegas!
Exploit: your new winning strategy!
Regret: that you didn't quit after winning the first round
This field has some of the best terminology
Explore: testing slot machines/experiments for their reward
Exploit: playing the best reward you've found so far
Regret: how much you lost due to exploration
How to optimize your website without exploiting the bandits
Try condition A 100 times, find 45 wins
Try condition B 100 times, find 85 wins
Try condition C 100 times, find 10 wins
Choose the best!
We only exploit our findings at the end! Pure exploration!
How to optimize your website exploiting the bandits
Pure exploration:
Try condition A 5 times, find 4 wins
Try condition B 5 times, find 4 wins
Try condition C 5 times, find 2 wins
Exploit our knowledge:
Try condition A 7 times, find 3 wins
Try condition B 7 times, find 5 wins
Try condition C 1 time, find 0 wins

Condition     A     B     C
Est. Return   0.58  0.75  0.33
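The estimated returns are just pooled win rates over both rounds. A minimal sketch, with the trial counts copied from the slide; the data layout and the names `ROUNDS` and `estimates` are assumptions for illustration.

```python
# (trials, wins) per condition for each round, copied from the slide.
ROUNDS = [
    {"A": (5, 4), "B": (5, 4), "C": (5, 2)},   # equal effort on every condition
    {"A": (7, 3), "B": (7, 5), "C": (1, 0)},   # effort shifted toward A and B
]

trials = {}
wins = {}
for round_data in ROUNDS:
    for condition, (n, w) in round_data.items():
        trials[condition] = trials.get(condition, 0) + n
        wins[condition] = wins.get(condition, 0) + w

# Estimated return = total wins / total trials for each condition.
estimates = {c: wins[c] / trials[c] for c in trials}
best = max(estimates, key=estimates.get)

for c in sorted(estimates):
    print(c, trials[c], wins[c], round(estimates[c], 2))
print("best so far:", best)
```

Condition C ends up with only 6 trials, so its estimate is the least certain of the three. That is the price of shifting effort toward the conditions that look better.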
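The goal of these problems is to construct optimal strategies to minimize regret
Regret: how much you left on the table by exploring
$\text{regret} = E[\text{play best always} - \text{plays made based on data}]$
$\text{regret}_{\text{100-each}} = 255/300 - 140/300 = 0.38$ (always playing B wins $0.85 \cdot 300 = 255$ times; the experiment won $45 + 85 + 10 = 140$ times)
$\text{regret}_{\text{30-mixed}} = (25.5 - (0.45 \cdot 12 + 0.85 \cdot 12 + 0.1 \cdot 6))/30 = 0.31$
A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays $T \to \infty$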
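Both regret figures can be reproduced directly. This sketch treats the win rates observed in the 100-trial experiment (0.45, 0.85, 0.10) as the true payout rates, which is what the slide's arithmetic appears to do; `RATES` and `average_regret` are hypothetical names.

```python
# Win rates per condition, taken from the 100-trial experiment and
# assumed here to be the true payout rates.
RATES = {"A": 0.45, "B": 0.85, "C": 0.10}


def average_regret(plays):
    """Expected wins lost per play, compared with always playing the best arm."""
    total_plays = sum(plays.values())
    best_rate = max(RATES.values())
    expected_wins = sum(RATES[c] * n for c, n in plays.items())
    return (best_rate * total_plays - expected_wins) / total_plays


regret_100_each = average_regret({"A": 100, "B": 100, "C": 100})
regret_30_mixed = average_regret({"A": 12, "B": 12, "C": 6})

print(round(regret_100_each, 2), round(regret_30_mixed, 2))
```

The mixed strategy has lower regret per play because it spends fewer plays on condition C once C starts to look bad.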
"[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage."
Peter Whittle (Whittle, 1979), discussion of "Bandit processes and dynamic allocation indices"
Their importance to website optimization, advertising, and recommendation has rejuvenated research on these problems with fascinating new questions.
Math is everywhere, and especially in your favorite websites!
Matrices and probability are key ingredients.
Note: top 10 articles on Wikipedia with the highest PageRank

α = 0.50            α = 0.85                 α = 0.99
United States       United States            C:Contents
C:Living people     C:Main topic classif.    C:Main topic classif.
France              C:Contents               C:Fundamental
Germany             C:Living people          United States
England             C:Ctgs. by country       C:Wikipedia admin.
United Kingdom      United Kingdom           P:List of portals
Canada              C:Fundamental            P:Contents/Portals
Japan               C:Ctgs. by topic         C:Portals
Poland              C:Wikipedia admin.       C:Society
Australia           France                   C:Ctgs. by topic