Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org

#1: C4.5 Decision Tree - Classification (61 votes)
#2: K-Means - Clustering (60 votes)
#3: SVM - Classification (58 votes)
#4: Apriori - Frequent Itemsets (52 votes)
#5: EM - Clustering (48 votes)
#6: PageRank - Link mining (46 votes)
#7: AdaBoost - Boosting (45 votes)
#7: kNN - Classification (45 votes)
#7: Naive Bayes - Classification (45 votes)
#10: CART - Classification (34 votes)
Data Mining: Concepts and Techniques

How to organize the Web?
First try: human-curated Web directories
  Yahoo, DMOZ, LookSmart
Second try: Web search
  Content-based: find relevant documents
  Works well in a small and trusted set, e.g., newspaper articles, patents, etc.
  But: the Web is huge and full of untrusted documents, random things, web spam, etc.

Link-based ranking algorithms
  PageRank
  HITS

Not all web pages are equally important
  www.joe-schmoe.com vs. www.stanford.edu
There is large diversity in the web-graph node connectivity.
Let's rank the pages by the link structure!

Idea: links as votes
A page is more important if it has more links
  In-coming links? Out-going links?
Think of in-links as votes:
  www.stanford.edu has 23,400 in-links
  www.joe-schmoe.com has 1 in-link
Are all in-links equal?
  Links from important pages count more
  Recursive question!

[Figure: example graph with PageRank scores A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9, and five further nodes with score 1.6 each]

Each link's vote is proportional to the importance of its source page
If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
Page $j$'s own importance is the sum of the votes on its in-links
[Figure: page $j$ receives $r_i/3$ from page $i$ and $r_k/4$ from page $k$, and sends $r_j/3$ along each of its three out-links]
$$r_j = \frac{r_i}{3} + \frac{r_k}{4}$$

A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank $r_j$ for page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the out-degree of node $i$
[Figure: "The web in 1839", a graph on pages y, a, m. Page y links to itself and to a (y/2 each), a links to y and to m (a/2 each), m links to a (m)]
Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$

3 equations, 3 unknowns, no constants
  No unique solution
  All solutions are equivalent up to a scale factor
Additional constraint forces uniqueness: $r_y + r_a + r_m = 1$
Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$
Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
Gaussian elimination works for small examples, but we need a better method for large web-size graphs
We need a new formulation!
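The small system above can be checked by hand or with a few lines of code. The following is a minimal sketch, assuming we drop the redundant third flow equation and put the normalization constraint in its place; the helper `solve_linear_system` is a hypothetical name for a plain Gauss-Jordan elimination over exact fractions, not something taken from the slides.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact Fraction arithmetic."""
    n = len(a)
    # Build the augmented matrix [a | b] so the inputs are left untouched.
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot and swap it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


half = Fraction(1, 2)
# Unknowns are ordered (r_y, r_a, r_m).
# r_y = r_y/2 + r_a/2   ->   (1/2) r_y - (1/2) r_a         = 0
# r_a = r_y/2 + r_m     ->  -(1/2) r_y +       r_a - r_m   = 0
# r_y + r_a + r_m = 1   (used in place of the redundant equation r_m = r_a/2)
coefficients = [
    [half, -half, 0],
    [-half, 1, -1],
    [1, 1, 1],
]
right_hand_side = [0, 0, 1]

r_y, r_a, r_m = solve_linear_system(coefficients, right_hand_side)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination costs on the order of $N^3$ operations, which is why it is only practical for toy graphs like this one.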

Stochastic adjacency matrix $M$
  Let page $i$ have $d_i$ out-links
  If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
  $M$ is a column stochastic matrix: its columns sum to 1
Rank vector $r$: a vector with one entry per page
  $r_i$ is the importance score of page $i$
  $\sum_i r_i = 1$
The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written as
$$r = M \cdot r$$

Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equation in matrix form: $M \cdot r = r$
Suppose page $i$ links to 3 pages, including $j$
[Figure: in the product $M \cdot r = r$, the entry $M_{ji} = 1/3$ in row $j$, column $i$ multiplies $r_i$ and contributes $r_i/3$ to $r_j$]

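Example: the graph on pages y, a, m (rows and columns of $M$ ordered y, a, m)
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
$r = M \cdot r$:
$$\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$$
which is the same as the flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$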
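As an illustration of the definition $M_{ji} = 1/d_i$, here is a minimal sketch that builds the matrix for the y, a, m graph from an out-link list. The function name `column_stochastic_matrix` and the dictionary layout are assumptions made for this example only.

```python
from fractions import Fraction


def column_stochastic_matrix(out_links, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i if page i links to page j."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in out_links.items():
        d = len(targets)  # out-degree d_i of the source page
        for target in targets:
            # Column = source page i, row = destination page j.
            matrix[index[target]][index[source]] = Fraction(1, d)
    return matrix


pages = ["y", "a", "m"]
# Out-links read off the flow equations: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = column_stochastic_matrix(out_links, pages)
column_sums = [sum(M[j][i] for j in range(len(pages))) for i in range(len(pages))]

for row in M:
    print([str(entry) for entry in row])
print("column sums:", [str(s) for s in column_sums])
```

Note that the source page indexes the column, not the row, which is what makes the columns (rather than the rows) sum to 1.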

The flow equations can be written $r = M \cdot r$
So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1
  The largest eigenvalue of $M$ is 1 since $M$ is column stochastic (with non-negative entries)
We can now efficiently solve for $r$!
  The method is called power iteration
NOTE: $x$ is an eigenvector with corresponding eigenvalue $\lambda$ if $Ax = \lambda x$

Given a web graph with $N$ nodes, where the nodes are pages and the edges are hyperlinks
Power iteration: a simple iterative scheme
  Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
  Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is,
  $$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
  where $d_i$ is the out-degree of node $i$
  Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Any other vector norm can be used instead, e.g., Euclidean

Example: power iteration on the y, a, m graph (rows and columns of $M$ ordered y, a, m)
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
Iteration   0     1     2      3       ...   limit
r_y         1/3   1/3   5/12   9/24    ...   6/15
r_a         1/3   3/6   1/3    11/24   ...   6/15
r_m         1/3   1/6   3/12   1/6     ...   3/15
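The iterates in this example can be reproduced with a short script. This is a sketch under simple assumptions: a dense list-of-rows matrix, a hypothetical helper `multiply`, and an arbitrary tolerance of `1e-10`; a web-scale implementation would use a sparse representation instead.

```python
from fractions import Fraction


def multiply(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def power_iteration(matrix, epsilon=1e-10, max_iterations=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below epsilon."""
    n = len(matrix)
    r = [1.0 / n] * n
    for _ in range(max_iterations):
        r_next = multiply(matrix, r)
        l1_change = sum(abs(x - y) for x, y in zip(r_next, r))
        r = r_next
        if l1_change < epsilon:
            break
    return r


half = Fraction(1, 2)
M = [[half, half, 0], [half, 0, 1], [0, half, 0]]  # rows/columns ordered y, a, m

# First few iterates, computed exactly so they can be compared with the table.
r = [Fraction(1, 3)] * 3
exact_iterates = [r]
for _ in range(3):
    r = multiply(M, r)
    exact_iterates.append(r)
for t, iterate in enumerate(exact_iterates):
    print(t, [str(x) for x in iterate])

# Floating-point run to convergence; the limit should be close to (6/15, 6/15, 3/15).
ranks = power_iteration([[float(x) for x in row] for row in M])
print([round(x, 6) for x in ranks])
```

The exact iterates print in lowest terms, so 9/24 appears as 3/8 and 3/12 as 1/4.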

Imagine a random web surfer:
  At any time $t$, the surfer is on some page $i$
  At time $t + 1$, the surfer follows an out-link from $i$ uniformly at random
  The surfer ends up on some page $j$ linked from $i$
  The process repeats indefinitely
Let $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$
  So, $p(t)$ is a probability distribution over pages
[Figure: pages $i_1$, $i_2$, $i_3$ linking to page $j$]
$$r_j = \sum_{i \to j} \frac{r_i}{d_{\text{out}}(i)}$$

Where is the surfer at time $t+1$?
  The surfer follows a link uniformly at random:
  $$p(t+1) = M \cdot p(t)$$
Suppose the random walk reaches a state where
$$p(t+1) = M \cdot p(t) = p(t)$$
  Then $p(t)$ is a stationary distribution of the random walk
Our original rank vector $r$ satisfies $r = M \cdot r$
  So, $r$ is a stationary distribution for the random walk
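The random-surfer reading can be illustrated by simulation. The sketch below, with the hypothetical function `simulate_random_surfer` and an arbitrary fixed seed, walks the y, a, m graph and records the fraction of time spent on each page; on this graph those fractions should come out near the flow-equation solution (2/5, 2/5, 1/5), up to sampling noise.

```python
import random


def simulate_random_surfer(out_links, start, steps, seed=0):
    """Follow uniformly random out-links and return the fraction of visits per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


# Same graph as in the flow equations: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

visit_fraction = simulate_random_surfer(out_links, start="y", steps=200_000)
for page, fraction in visit_fraction.items():
    print(page, round(fraction, 3))
```

This estimate is far less efficient than power iteration; it is only meant to make the stationary distribution tangible.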

$$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$$
or equivalently $r = M r$
Does this converge?
Does it converge to what we want?

Example: two pages a and b that link only to each other, iterating $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
Iteration   0   1   2   3   ...
r_a         1   0   1   0   ...
r_b         0   1   0   1   ...
The scores oscillate and never converge.

Example: page a links to page b, and b has no out-links, iterating $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
Iteration   0   1   2   3   ...
r_a         1   0   0   0   ...
r_b         0   1   0   0   ...
The scores converge, but to all zeros, which is not what we want.
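Both two-page examples are easy to replay in code. The following sketch uses hypothetical helpers `step` and `iterate` working directly on an out-link dictionary, with all of the initial score placed on page a as in the tables.

```python
def step(out_links, r):
    """One update r_j <- sum over in-links i -> j of r_i / d_i."""
    r_next = {page: 0.0 for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            r_next[target] += r[source] / len(targets)
    return r_next


def iterate(out_links, r, steps):
    """Return the list of score vectors for iterations 0..steps."""
    history = [r]
    for _ in range(steps):
        r = step(out_links, r)
        history.append(r)
    return history


two_cycle = {"a": ["b"], "b": ["a"]}  # a and b link to each other
dead_end = {"a": ["b"], "b": []}      # b has no out-links

start = {"a": 1.0, "b": 0.0}
cycle_history = iterate(two_cycle, start, 3)
dead_end_history = iterate(dead_end, start, 3)

print("cycle:   ", [(h["a"], h["b"]) for h in cycle_history])
print("dead end:", [(h["a"], h["b"]) for h in dead_end_history])
```

In the first graph the score keeps swapping between a and b; in the second it reaches b and then disappears because b passes nothing on.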

A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions (strongly connected, no dead ends, and aperiodic), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time $t = 0$ is

2 problems:
(1) Some pages are dead ends (have no out-links)
  The random walk has nowhere to go
  Such pages cause importance to leak out
(2) Spider traps (all out-links are within the group)
  The random walker gets stuck in a trap
  Eventually spider traps absorb all importance

Example: m is a spider trap (rows and columns of $M$ ordered y, a, m)
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}$$
Iteration   0     1     2      3       ...   limit
r_y         1/3   2/6   3/12   5/24    ...   0
r_a         1/3   1/6   2/12   3/24    ...   0
r_m         1/3   3/6   7/12   16/24   ...   1
All the PageRank score gets trapped in node m.

The Google solution for spider traps: at each time step, the random surfer has two options
  With probability $\beta$, follow a link at random
  With probability $1-\beta$, jump to some random page
  Common values for $\beta$ are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a few time steps
[Figure: the y, a, m graph before and after adding teleport links]

Example: m is a dead end (rows and columns of $M$ ordered y, a, m)
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}$$
Iteration   0     1     2      3      ...   limit
r_y         1/3   2/6   3/12   5/24   ...   0
r_a         1/3   1/6   2/12   3/24   ...   0
r_m         1/3   1/6   1/12   2/24   ...   0
Here the PageRank leaks out since the matrix is not stochastic.

Teleports: follow random teleport links with probability 1.0 from dead ends
Adjust the matrix accordingly (rows and columns ordered y, a, m):
$$\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} \longrightarrow \begin{pmatrix} 1/2 & 1/2 & 1/3 \\ 1/2 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{pmatrix}$$
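The matrix adjustment for dead ends amounts to replacing every all-zero column with a uniform column. A minimal sketch, assuming a dense list-of-rows matrix and a hypothetical helper named `patch_dead_ends`:

```python
from fractions import Fraction


def patch_dead_ends(matrix):
    """Replace each all-zero column (a dead end) with a uniform column of 1/N."""
    n = len(matrix)
    patched = [list(row) for row in matrix]  # copy so the input is unchanged
    for i in range(n):
        if all(matrix[j][i] == 0 for j in range(n)):
            for j in range(n):
                patched[j][i] = Fraction(1, n)
    return patched


half = Fraction(1, 2)
# Rows/columns ordered y, a, m; column m is all zeros because m is a dead end.
M = [[half, half, 0], [half, 0, 0], [0, half, 0]]

M_patched = patch_dead_ends(M)
for row in M_patched:
    print([str(entry) for entry in row])
```

After patching, every column sums to 1 again, so the matrix is column stochastic.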

Why are dead ends and spider traps a problem, and why do teleports solve the problem?
Spider traps are not a problem for the computation, but with traps the PageRank scores are not what we want
  Solution: never get stuck in a spider trap, by teleporting out of it in a finite number of steps
Dead ends are a problem
  The matrix is not column stochastic, so our initial assumptions are not met
  Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go

Google's solution that does it all: at each step, the random surfer has two options
  With probability $\beta$, follow a link at random
  With probability $1-\beta$, jump to some random page
PageRank equation [Brin-Page, 98]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$$
where $d_i$ is the out-degree of node $i$
This formulation assumes that $M$ has no dead ends. We can either preprocess the matrix $M$ to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.

PageRank equation [Brin-Page, 98]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$$
The Google matrix $A$:
$$A = \beta M + (1-\beta) \left[\frac{1}{N}\right]_{N \times N}$$
where $[1/N]_{N \times N}$ is the $N$ by $N$ matrix in which all entries are $1/N$
We have a recursive problem: $r = A \cdot r$
And the power method still works!
What is $\beta$?
  In practice $\beta$ is 0.8 to 0.9 (the surfer makes about 5 steps on average, then jumps)

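Example: $\beta = 0.8$ on the graph where m is a spider trap (rows and columns ordered y, a, m)
$$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$
Iteration   0     1      2      3      ...   limit
r_y         1/3   0.33   0.28   0.26   ...   7/33
r_a         1/3   0.20   0.20   0.18   ...   5/33
r_m         1/3   0.47   0.52   0.56   ...   21/33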
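The Google matrix for this example can be built and iterated exactly. This is a sketch with hypothetical helper names (`google_matrix`, `multiply`) and an arbitrary choice of 60 iterations; it forms the dense $N \times N$ matrix explicitly, which is fine for three pages but not something one would do at web scale.

```python
from fractions import Fraction


def google_matrix(matrix, beta):
    """Return A = beta * M + (1 - beta) * [1/N], as a list of rows."""
    n = len(matrix)
    teleport = (1 - beta) * Fraction(1, n)
    return [[beta * entry + teleport for entry in row] for row in matrix]


def multiply(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


half = Fraction(1, 2)
# Rows/columns ordered y, a, m; m links only to itself (spider trap).
M = [[half, half, 0], [half, 0, 0], [0, half, 1]]
beta = Fraction(4, 5)

A = google_matrix(M, beta)
for row in A:
    print([str(entry) for entry in row])

# Power iteration r <- A r, starting from the uniform vector.
r = [Fraction(1, 3)] * 3
for _ in range(60):
    r = multiply(A, r)
approx = [float(x) for x in r]
print([round(x, 4) for x in approx])

# The claimed limit is an exact fixed point of A.
claimed = [Fraction(7, 33), Fraction(5, 33), Fraction(21, 33)]
is_fixed_point = multiply(A, claimed) == claimed
print("fixed point:", is_fixed_point)
```

With teleportation the trap page m still ranks highest, but y and a keep a non-zero share of the score.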

Input: graph $G$ and parameter $\beta$
  Directed graph $G$ (can have spider traps and dead ends)
  Parameter $\beta$
Output: PageRank vector $r^{new}$
Set: $r_j^{old} = \frac{1}{N}$
Repeat until convergence, i.e., until $\sum_j |r_j^{new} - r_j^{old}| < \varepsilon$:
  $\forall j$: $r'^{new}_j = \sum_{i \to j} \beta \frac{r_i^{old}}{d_i}$
    ($r'^{new}_j = 0$ if the in-degree of $j$ is 0)
  Now re-insert the leaked PageRank:
  $\forall j$: $r_j^{new} = r'^{new}_j + \frac{1-S}{N}$, where $S = \sum_j r'^{new}_j$
  $r^{old} = r^{new}$
If the graph has no dead ends, then the amount of leaked PageRank is $1-\beta$. But since we have dead ends, the amount of leaked PageRank may be larger. We have to explicitly account for it by computing $S$.
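The complete algorithm can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not a reference implementation: the graph is a dictionary mapping every page to its list of out-links (every link target must also be a key), and the function name `pagerank`, the tolerance, and the iteration cap are choices made for this example.

```python
def pagerank(out_links, beta=0.8, epsilon=1e-12, max_iterations=1000):
    """PageRank with teleports; leaked score (teleports and dead ends) is re-inserted."""
    pages = list(out_links)
    n = len(pages)
    r_old = {page: 1.0 / n for page in pages}
    for _ in range(max_iterations):
        # r'_j = sum over in-links i -> j of beta * r_i / d_i (stays 0 with no in-links).
        r_new = {page: 0.0 for page in pages}
        for source, targets in out_links.items():
            for target in targets:
                r_new[target] += beta * r_old[source] / len(targets)
        # S is the score that survived; 1 - S leaked and is spread evenly over all pages.
        leaked = 1.0 - sum(r_new.values())
        for page in pages:
            r_new[page] += leaked / n
        change = sum(abs(r_new[p] - r_old[p]) for p in pages)
        r_old = r_new
        if change < epsilon:
            break
    return r_old


spider_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}  # m links only to itself
dead_end = {"y": ["y", "a"], "a": ["y", "m"], "m": []}        # m has no out-links

trap_ranks = pagerank(spider_trap)
dead_end_ranks = pagerank(dead_end)
print({p: round(v, 4) for p, v in trap_ranks.items()})
print({p: round(v, 4) for p, v in dead_end_ranks.items()})
```

Because the leaked amount is computed from $S$ rather than assumed to be $1-\beta$, the same loop handles graphs with and without dead ends, and the scores sum to 1 after every iteration.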

Measures generic popularity of a page
  Biased against topic-specific authorities
  Solution: Topic-Specific PageRank
Uses a single measure of importance
  There are other models of importance
  Solution: Hubs-and-Authorities
Susceptible to link spam
  Artificial link topographies created in order to boost PageRank
  Solution: TrustRank