How Does Google?! A journey into the wondrous mathematics behind your favorite websites. David F. Gleich, Computer Science, Purdue University



Mathematics underlies an enormous number of the websites we use every day!

1. Google's PageRank
2. Multi-armed bandits and internet experiments


Larry Page and Sergey Brin
Created a web-search algorithm called BackRub
Spun off a company, Googol, based on the paper:
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, 1999.
The importance of a page is determined by the importance of pages that link to it.

A websearch primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit over 200 measures to human evaluations
5. Produce rankings
6. Continuously update

Pages, nodes, incoming links, outgoing links, and importance
[Figure: pages a, b, and c link to a page. Labels: "Important pages that link to me!" and "Important pages that link to Purdue!"]


Tim Davis and Yifan Hu Sparse Matrix Gallery

The web
1000 vertices on 8.5-by-11 paper
1,000,000,000,000 vertices (one trillion)
Paper the size of Manhattan island (23 sq miles)?
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

We need something better!

A wee web-graph: link counting is too easy to game!
[Figure: a six-node web graph. Each link out of node 1 carries weight 1/3 and each link out of node 2 carries weight 1/2.]

A wee web-graph: link counting is too easy to game!
The importance of a page is determined by the importance of pages that link to it.
$$
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2
\end{aligned}
$$

The importance of a page is determined by the importance of pages that link to it
Example: $x_3 = \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2$
$$x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j$$
$x_i$: importance of page i
$x_j$: importance of page j
$B_i$: back-links of page i, that is, the pages that link to it. Why it was called BackRub!
$d_j$: number of links page j uses (out-degree in graph theory)

We can rewrite this equation in a more mathematically convenient way
$$
\begin{aligned}
x_1 &= 0 x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_2 &= \tfrac{1}{3} x_1 + 0 x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6 \\
x_4 &= \tfrac{1}{3} x_1 + 0 x_2 + 1 x_3 + 0 x_4 + 1 x_5 + 0 x_6 \\
x_5 &= 0 x_1 + 0 x_2 + 0 x_3 + 1 x_4 + 0 x_5 + 0 x_6 \\
x_6 &= 0 x_1 + \tfrac{1}{2} x_2 + 0 x_3 + 0 x_4 + 0 x_5 + 0 x_6
\end{aligned}
$$

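And even more conveniently!
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
$$
or $x = Px$
Element k in column m = "probability" of going from node m to node k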
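The step from link lists to the matrix P can be sketched in a few lines of Python. This is a minimal sketch, not code from the talk: `OUT_LINKS` and `build_link_matrix` are names chosen here for illustration, and the link lists are read off the six equations for the wee web-graph. Exact fractions are used so the printed entries match the matrix entries.

```python
from fractions import Fraction

# Out-links of the six-page wee web-graph (assumed from the equations:
# page 1 links to 2, 3, 4; page 2 links to 3, 6; and so on).
OUT_LINKS = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}


def build_link_matrix(out_links):
    """Return P as a list of rows, where P[k][m] = 1/d_m if page m links to page k."""
    pages = sorted(out_links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    P = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in out_links.items():
        for target in targets:
            # Page `source` splits its importance evenly over its out-links.
            P[index[target]][index[source]] = Fraction(1, len(targets))
    return P


P = build_link_matrix(OUT_LINKS)
for row in P:
    print("  ".join(f"{str(entry):>3}" for entry in row))
```

Column m of the result sums to 1 whenever page m has at least one out-link, and to 0 for a page with none, such as page 6.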

The matrix P for websites shows a lot of structure
Every dot is a non-zero element indicating a link
Matrices are sparse, and generally have block structure
Block structure can be exploited to speed up the ranking algorithm

But this idea doesn't work for the wee web-graph
Nodes 1, 4 and 5 determine everything!
$$
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 = 0 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 = 0 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 = x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2 = 0
\end{aligned}
$$
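One way to see the failure numerically is to apply x ← Px repeatedly from a uniform start. The sketch below is an illustration added here, not part of the talk; the uniform starting vector and the helper name `apply_matrix` are assumptions. Importance drains away through node 6, nodes 1, 2, 3, and 6 drop to zero, and nodes 4 and 5 keep swapping what is left.

```python
from fractions import Fraction as F

# Link matrix of the wee web-graph: P[k][m] is the share page m+1 gives page k+1.
P = [
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 2), 0, 0, 0, 0],
    [F(1, 3), 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, F(1, 2), 0, 0, 0, 0],
]


def apply_matrix(P, x):
    """Compute the matrix-vector product Px."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]


x = [F(1, 6)] * 6          # assumed starting guess: equal importance
history = [x]
for step in range(1, 6):
    x = apply_matrix(P, x)
    history.append(x)
    print(step, [str(v) for v in x], "total =", sum(x))
```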

But this idea doesn't work for the wee web-graph
Node 1: lonely
Nodes 4 and 5: mutual admiration society
Node 6: anti-social
These nodes need to be fixed to get a reliable and useful ranking!

The gang of four to the rescue
Andrei Markov, Oskar Perron, Georg Frobenius, Richard von Mises

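Let's fix it up and force node 6 to choose, or link to everyone
$$
P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\quad\longrightarrow\quad
P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
$$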
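The fix for node 6 generalizes to any page without out-links: replace its all-zero column with a uniform column. A minimal sketch follows; the helper name `fix_dangling_columns` is hypothetical and chosen here for illustration.

```python
from fractions import Fraction as F

# Original link matrix of the wee web-graph (column 6 is all zeros).
P = [
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 2), 0, 0, 0, 0],
    [F(1, 3), 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, F(1, 2), 0, 0, 0, 0],
]


def fix_dangling_columns(P):
    """Return a copy of P in which every all-zero column becomes the uniform column 1/n."""
    n = len(P)
    fixed = [list(row) for row in P]
    for m in range(n):
        if all(P[k][m] == 0 for k in range(n)):
            for k in range(n):
                fixed[k][m] = F(1, n)
    return fixed


P_fixed = fix_dangling_columns(P)
column_sums = [sum(row[m] for row in P_fixed) for m in range(len(P_fixed))]
print([str(s) for s in column_sums])
```

After the fix every column sums to 1, so no importance leaks out of the graph when P is applied.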

Taxation is the way to representation!
[Figure: pages a, b, and c link to a page.]
If a page is a good page, then it'll still be a good page if we tax the importance it receives from a, b, and c
We can redistribute the taxed amounts to all pages, including lonely nodes!

The importance of a page is determined by the importance of pages that link to it*
$$x_i = \alpha \sum_{j \in B_i} \frac{x_j}{d_j} + (1 - \alpha) b_i$$
$\alpha$: the taxation rate of all
$b_i$: benefits to page i
$x_j / d_j$: the total importance that page j contributes to page i
* After tax and any benefits

Perron and Frobenius showed the new equation always has a unique solution!
$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
= \alpha
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
+ (1 - \alpha)
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}
$$
$$x = \alpha P x + (1 - \alpha) b$$
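One way to see why a unique solution exists is to rearrange the equation into a linear system. The short derivation below is added here as a sketch, not taken from the talk. It assumes 0 ≤ α < 1 and that every column of P sums to 1, which holds once node 6 has been fixed. Under those assumptions the 1-norm of αP equals α < 1, so the series converges and I − αP is invertible.

```latex
\begin{aligned}
x = \alpha P x + (1-\alpha)\, b
  &\iff (I - \alpha P)\, x = (1-\alpha)\, b \\
  &\implies x = (1-\alpha)\,(I - \alpha P)^{-1} b
            = (1-\alpha) \sum_{k=0}^{\infty} \alpha^{k} P^{k} b .
\end{aligned}
```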

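What von Mises and Richardson showed is that guess, check, and correct works!
$$x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b$$
$$
x^{(\text{start})} = \begin{bmatrix} 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \end{bmatrix}
\quad
x^{(1)} = \begin{bmatrix} 0.05 \\ 0.10 \\ 0.17 \\ 0.38 \\ 0.19 \\ 0.12 \end{bmatrix}
\quad
x^{(2)} = \begin{bmatrix} 0.04 \\ 0.06 \\ 0.10 \\ 0.36 \\ 0.36 \\ 0.08 \end{bmatrix}
\quad
x^{(\infty)} = \begin{bmatrix} 0.03 \\ 0.04 \\ 0.06 \\ 0.43 \\ 0.39 \\ 0.05 \end{bmatrix}
$$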
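The guess, check, and correct loop can be sketched as follows. The slide does not state α or b; this sketch assumes α = 0.85 and a uniform b (each entry 1/6), which reproduces the rounded iterates shown for the six-node graph. The function names and the stopping tolerance are choices made here for illustration.

```python
ALPHA = 0.85            # assumed value of alpha
N = 6

# Fixed link matrix: node 6 links to everyone with weight 1/6.
P = [
    [0, 0, 0, 0, 0, 1 / 6],
    [1 / 3, 0, 0, 0, 0, 1 / 6],
    [1 / 3, 1 / 2, 0, 0, 0, 1 / 6],
    [1 / 3, 0, 1, 0, 1, 1 / 6],
    [0, 0, 0, 1, 0, 1 / 6],
    [0, 1 / 2, 0, 0, 0, 1 / 6],
]
b = [1 / N] * N         # assumed: benefits spread evenly over all pages


def pagerank_step(P, x, b, alpha):
    """One correction: x_new = alpha * P x_old + (1 - alpha) * b."""
    return [
        alpha * sum(p * xj for p, xj in zip(row, x)) + (1 - alpha) * bi
        for row, bi in zip(P, b)
    ]


def pagerank(P, b, alpha, tol=1e-10, max_iter=1000):
    """Repeat the correction until the vector stops changing."""
    x = list(b)
    for _ in range(max_iter):
        x_new = pagerank_step(P, x, b, alpha)
        if sum(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x


x1 = pagerank_step(P, b, b, ALPHA)
x2 = pagerank_step(P, x1, b, ALPHA)
x_limit = pagerank(P, b, ALPHA)
for name, vec in [("x(1)", x1), ("x(2)", x2), ("limit", x_limit)]:
    print(name, [round(v, 2) for v in vec])
```

Each step shrinks the error by roughly a factor of α, so the loop above needs on the order of a hundred and fifty steps to reach the chosen tolerance.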


There's still a lot of work left to do to make a search engine
Make it fast!
Watch out for spam
Watch out for manipulation
Personalize
Experiment!

1. Google's PageRank
2. Multi-armed bandits and internet experiments

Not this!
http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/

This!
Pays out $0.99/dollar
Pays out $0.95/dollar
Pays out $0.92/dollar
Pays out $0.98/dollar
http://upload.wikimedia.org/wikipedia/en/8/82/las_vegas_slot_machines.jpg

What in the heck does a multi-armed bandit have to do with Google?

What in the heck does a multi-armed bandit have to do with Google?
Pays out $0.91/view (show ads)
Pays out -$0.02/view (hide ads)
Pays out $0.92/view
Pays out $0.66/view

How to optimize your website without exploiting the bandits
Try condition A 100 times, find 45 wins
Try condition B 100 times, find 85 wins
Try condition C 100 times, find 10 wins
Choose the best!

This field has some of the best terminology
Explore
Exploit
Regret

This field has some of the best terminology
Explore: Visiting Las Vegas
Exploit: Your new winning strategy
Regret: That you didn't quit after winning the first round

This field has some of the best terminology
Explore: Testing slot machines/experiments for their reward
Exploit: Playing the best reward you've found so far
Regret: How much you lost due to exploration

How to optimize your website without exploiting the bandits
Try condition A 100 times, find 45 wins
Try condition B 100 times, find 85 wins
Try condition C 100 times, find 10 wins
Choose the best!
We only exploit our findings at the end! Pure exploration!

How to optimize your website exploiting the bandits
Pure exploration:
Try condition A 5 times, find 4 wins
Try condition B 5 times, find 4 wins
Try condition C 5 times, find 2 wins
Exploit our knowledge:
Try condition A 7 times, find 3 wins
Try condition B 7 times, find 5 wins
Try condition C 1 time, find 0 wins

Condition     A     B     C
Est. return   0.58  0.75  0.33
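The estimated returns come from pooling wins and trials over both rounds, for example (4 + 3) / (5 + 7) ≈ 0.58 for condition A. A minimal sketch of that bookkeeping follows; the data layout and the name `pooled_estimates` are assumptions made here, and the rule the slide used to pick the 7/7/1 split in the second round is not stated, so it is not modeled.

```python
# Each round maps a condition to (trials, wins), as listed on the slide.
ROUNDS = [
    {"A": (5, 4), "B": (5, 4), "C": (5, 2)},   # pure exploration
    {"A": (7, 3), "B": (7, 5), "C": (1, 0)},   # exploit our knowledge
]


def pooled_estimates(rounds):
    """Estimate each condition's return as total wins / total trials."""
    trials, wins = {}, {}
    for round_results in rounds:
        for condition, (n, w) in round_results.items():
            trials[condition] = trials.get(condition, 0) + n
            wins[condition] = wins.get(condition, 0) + w
    return {c: wins[c] / trials[c] for c in trials}


estimates = pooled_estimates(ROUNDS)
best = max(estimates, key=estimates.get)
for condition, value in estimates.items():
    print(condition, round(value, 2))
print("best so far:", best)
```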

The goal of these problems is to construct optimal strategies to minimize regret
Regret: how much you left on the table by exploring
$$\text{regret} = E[\text{play best always} - \text{plays made based on data}]$$
$$\text{regret}_{\text{100-each}} = \frac{255}{300} - \frac{140}{300} = 0.38$$
$$\text{regret}_{\text{30-mixed}} = \frac{25.5}{30} - \frac{0.45 \cdot 12 + 0.85 \cdot 12 + 0.1 \cdot 6}{30} = 0.31$$
A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays $T \to \infty$
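Both regret figures can be checked with a few lines of arithmetic. The sketch below treats the win rates observed in the 100-each experiment (0.45, 0.85, 0.10) as the true rates, as the slide's calculation appears to do; the function name `per_play_regret` is chosen here for illustration.

```python
# Assumed true win rates, taken from the 100-each experiment.
RATES = {"A": 0.45, "B": 0.85, "C": 0.10}


def per_play_regret(rates, plays):
    """Expected wins lost per play compared with always playing the best condition."""
    total_plays = sum(plays.values())
    best_rate = max(rates.values())
    expected_wins = sum(rates[c] * n for c, n in plays.items())
    return (best_rate * total_plays - expected_wins) / total_plays


regret_100_each = per_play_regret(RATES, {"A": 100, "B": 100, "C": 100})
regret_30_mixed = per_play_regret(RATES, {"A": 12, "B": 12, "C": 6})
print(round(regret_100_each, 2))
print(round(regret_30_mixed, 2))
```

Shifting plays toward the better conditions in the second round is what lowers the per-play regret in the 30-play schedule.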

"[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage."
Peter Whittle (Whittle, 1979), discussion of "Bandit processes and dynamic allocation indices"
The importance of bandit problems to website optimization, advertising, and recommendation has rejuvenated research on them, with fascinating new questions.

Math is everywhere, and especially in your favorite websites!
Matrices and probability are key ingredients.

Note: Top 10 articles on Wikipedia with highest PageRank
α = 0.50: United States, C:Living people, France, Germany, England, United Kingdom, Canada, Japan, Poland, Australia
α = 0.85: United States, C:Main topic classif., C:Contents, C:Living people, C:Ctgs. by country, United Kingdom, C:Fundamental, C:Ctgs. by topic, C:Wikipedia admin., France
α = 0.99: C:Contents, C:Main topic classif., C:Fundamental, United States, C:Wikipedia admin., P:List of portals, P:Contents/Portals, C:Portals, C:Society, C:Ctgs. by topic