CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets

Web pages are not equally important: www.joe-schmoe.com vs. www.stanford.edu. We already know there is large diversity in the connectivity of the web graph, so we can rank pages by their link structure.

We will cover the following link analysis approaches to computing the importance of nodes in a graph: PageRank; Hubs and Authorities (HITS); Topic-Specific (Personalized) PageRank; and web spam detection algorithms.

Idea: links as votes. A page is more important if it has more links. In-coming or out-going links? Think of in-links as votes: www.stanford.edu has 23,400 in-links, while www.joe-schmoe.com has 1 in-link. But are all in-links equal? Links from important pages should count more, which makes this a recursive question!

Each link's vote is proportional to the importance of its source page. If page p with importance x has n out-links, each out-link carries x/n votes. Page p's own importance is the sum of the votes on its in-links.

A vote from an important page is worth more: a page is important if it is pointed to by other important pages. Define a rank r_j for node j: r_j = sum over i->j of r_i / d_out(i). Example, "the web in 1839": three pages y, a, m, where y links to itself and to a, a links to y and to m, and m links to a. Flow equations: r_y = r_y/2 + r_a/2; r_a = r_y/2 + r_m; r_m = r_a/2.

3 equations, 3 unknowns, and no constants: there is no unique solution, and all solutions are equivalent up to a scale factor. An additional constraint forces uniqueness: r_y + r_a + r_m = 1, which gives r_y = 2/5, r_a = 2/5, r_m = 1/5. Gaussian elimination works for small examples like this, but we need a better method for large, web-scale graphs.
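As a quick sanity check (a plain-Python sketch, not part of the original slides), the claimed solution can be substituted back into the flow equations:

```python
# Verify that (r_y, r_a, r_m) = (2/5, 2/5, 1/5) solves the flow
# equations together with the normalization constraint.
r_y, r_a, r_m = 2/5, 2/5, 1/5

assert abs(r_y - (r_y/2 + r_a/2)) < 1e-12  # r_y = r_y/2 + r_a/2
assert abs(r_a - (r_y/2 + r_m)) < 1e-12    # r_a = r_y/2 + r_m
assert abs(r_m - r_a/2) < 1e-12            # r_m = r_a/2
assert abs(r_y + r_a + r_m - 1) < 1e-12    # r_y + r_a + r_m = 1
print("solution verified")
```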

Stochastic adjacency matrix M: let page j have d_j out-links. If j links to i, then M_ij = 1/d_j, else M_ij = 0. M is a column-stochastic matrix: its columns sum to 1. Rank vector r: a vector with one entry per page, where r_i is the importance score of page i and sum over i of r_i = 1. The flow equations can then be written r = M r.

Suppose page j links to 3 pages, one of which is i. Then column j of M has three entries equal to 1/3, and in the product r = M r, page i receives one third of r_j.

Since the flow equations can be written r = M r, the rank vector r is an eigenvector of the stochastic web matrix M. In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1.

Example for the y, a, m graph: with columns ordered y, a, m, the matrix M has rows y: (1/2, 1/2, 0), a: (1/2, 0, 1), m: (0, 1/2, 0). Writing out r = M r row by row recovers exactly the flow equations r_y = r_y/2 + r_a/2; r_a = r_y/2 + r_m; r_m = r_a/2.

Given a web graph with N nodes (pages) and edges (hyperlinks), power iteration is a simple iterative scheme. Initialize r(0) = [1/N, ..., 1/N]^T. Iterate r(t+1) = M r(t), i.e., r_j(t+1) = sum over i->j of r_i(t) / d_i, where d_i is the out-degree of node i. Stop when |r(t+1) - r(t)|_1 < epsilon, where |x|_1 = sum over i of |x_i| is the L1 norm (any other vector norm, e.g. Euclidean, also works).

Power iteration: set r_j = 1/N and iterate r_j = sum over i->j of r_i / d_i. Example on the y, a, m graph (iterations 0, 1, 2, ...): r_y: 1/3, 1/3, 5/12, 9/24, ..., 6/15; r_a: 1/3, 3/6, 1/3, 11/24, ..., 6/15; r_m: 1/3, 1/6, 3/12, 1/6, ..., 3/15.
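The iteration above can be run directly; here is a minimal plain-Python sketch on the same 3-node example (the names M, r, etc. are ours, not the slides'):

```python
# Power iteration on the y, a, m example, using the column-stochastic
# matrix M from the flow equations (rows/columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
     [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
     [0.0, 0.5, 0.0]]   # r_m = r_a/2

N = 3
r = [1.0 / N] * N                             # r(0) = uniform
for _ in range(1000):
    r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
    if sum(abs(r_next[i] - r[i]) for i in range(N)) < 1e-12:  # L1 norm
        r = r_next
        break
    r = r_next

print([round(x, 6) for x in r])  # [0.4, 0.4, 0.2], i.e., (2/5, 2/5, 1/5)
```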

Imagine a random web surfer. At any time t the surfer is on some page u; at time t+1 the surfer follows an out-link from u uniformly at random and ends up on some page v linked from u. The process repeats indefinitely. Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t; then p(t) is a probability distribution over pages.

Where is the surfer at time t+1? Since the surfer follows a link uniformly at random, p(t+1) = M p(t). Suppose the random walk reaches a state where p(t+1) = M p(t) = p(t); then p(t) is a stationary distribution of the random walk. Our rank vector r satisfies r = M r, so it is a stationary distribution for this random walk.
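The random-surfer view can be checked empirically: simulate the walk and count visits. A sketch (seeded so the run is reproducible; the out_links structure is our own encoding of the y, a, m graph):

```python
import random

# Monte Carlo check of the random-surfer interpretation: visit
# frequencies of a long random walk approach the rank vector.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

random.seed(0)
counts = {"y": 0, "a": 0, "m": 0}
page = "y"
steps = 200_000
for _ in range(steps):
    page = random.choice(out_links[page])  # follow an out-link uniformly
    counts[page] += 1

freq = {p: c / steps for p, c in counts.items()}
# freq is close to the stationary distribution (0.4, 0.4, 0.2).
```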

r_j(t+1) = sum over i->j of r_i(t) / d_i, or equivalently r = M r. Does this converge? Does it converge to what we want? Are the results reasonable?

Does it converge? Example: two pages a and b, where a links only to b and b links only to a. Starting from r = (1, 0), the iterations give r_a: 1, 0, 1, 0, ... and r_b: 0, 1, 0, 1, ...: the iteration oscillates forever.

Does it converge to what we want? Example: a links to b, and b has no out-links. Starting from r = (1, 0), the iterations give r_a: 1, 0, 0, 0, ... and r_b: 0, 1, 0, 0, ...: all the importance leaks out of the system.

Two problems: (1) Some pages are dead ends: they have no out-links, and cause importance to leak out. (2) Spider traps: all out-links stay within a group of pages, and eventually the trap absorbs all the importance.

Power iteration on a spider trap: take the y, a, m graph, but now m links only to itself (y still links to y and a; a links to y and m). The flow equations become r_y = r_y/2 + r_a/2; r_a = r_y/2; r_m = r_a/2 + r_m. Setting r_j = 1/N and iterating gives (iterations 0, 1, 2, ...): r_y: 1/3, 2/6, 3/12, 5/24, ..., 0; r_a: 1/3, 1/6, 2/12, 3/24, ..., 0; r_m: 1/3, 3/6, 7/12, 16/24, ..., 1. The spider trap m absorbs all the importance.

The Google solution for spider traps: at each time step, the random surfer has two options. With probability β, follow a link at random; with probability 1-β, jump to some page uniformly at random. Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

Power iteration on a dead end: take the y, a, m graph, but remove all of m's out-links (y links to y and a; a links to y and m). The flow equations become r_y = r_y/2 + r_a/2; r_a = r_y/2; r_m = r_a/2. Setting r_j = 1/N and iterating gives (iterations 0, 1, 2, ...): r_y: 1/3, 2/6, 3/12, 5/24, ..., 0; r_a: 1/3, 1/6, 2/12, 3/24, ..., 0; r_m: 1/3, 1/6, 1/12, 2/24, ..., 0. The matrix is no longer column stochastic (column m sums to 0), so everything leaks out and r goes to 0.

Teleports for dead ends: from a dead end, follow a random teleport link with probability 1.0, and adjust the matrix accordingly. For the dead-end example, column m of M changes from (0, 0, 0) to (1/3, 1/3, 1/3): the rows of the adjusted matrix become y: (1/2, 1/2, 1/3), a: (1/2, 0, 1/3), m: (0, 1/2, 1/3).
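This dead-end fix is mechanical enough to sketch in code: find the all-zero columns of M and replace them with uniform columns. (A plain-Python sketch; the variable names are ours.)

```python
# Preprocessing dead ends: replace each all-zero column of M (a page
# with no out-links) with a uniform column, yielding a stochastic matrix S.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]   # column m sums to 0: m is a dead end
N = 3

S = [row[:] for row in M]
for j in range(N):
    col_sum = sum(M[i][j] for i in range(N))
    if col_sum == 0:                   # dead end: teleport everywhere
        for i in range(N):
            S[i][j] = 1.0 / N

# Every column of S now sums to 1.
for j in range(N):
    assert abs(sum(S[i][j] for i in range(N)) - 1.0) < 1e-12
```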

The iteration r(t+1) = M r(t) is an instance of a Markov chain: a set of states X, a transition matrix P where P_ij = P(X_t = i | X_t-1 = j), and a distribution π specifying the probability of being at each state x in X. The goal is to find π such that π = P π.

Theory of Markov chains. Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.

Stochastic: every column sums to 1. A possible solution: add "green" teleport links out of dead ends, i.e., S = M + (1/n) 1 a^T, where a_i = 1 if node i has out-degree 0 and a_i = 0 otherwise, and 1 is the vector of all 1s. For the dead-end example, S has rows y: (1/2, 1/2, 1/3), a: (1/2, 0, 1/3), m: (0, 1/2, 1/3), i.e., r_y = r_y/2 + r_a/2 + r_m/3; r_a = r_y/2 + r_m/3; r_m = r_a/2 + r_m/3.

Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k. A possible solution: add green links, which break the periodicity.

Irreducible: from any state, there is a non-zero probability of going to any other state. A possible solution: add green links.

Google's solution does it all at once: it makes M stochastic, aperiodic, and irreducible. At each step, the random surfer has two options: with probability β, follow a link at random; with probability 1-β, jump to some random page. PageRank equation [Brin-Page, '98]: r_j = sum over i->j of β r_i / d_i + (1-β) (1/n), where d_i is the out-degree of node i (assuming we follow random teleport links with probability 1.0 from dead ends).

PageRank equation [Brin-Page, '98]: r_j = sum over i->j of β r_i / d_i + (1-β) (1/n). The Google matrix A: A = β S + (1-β) (1/n) 1 1^T. A is stochastic, aperiodic, and irreducible, so r(t+1) = A r(t) converges. What is β? In practice 1-β = 0.15, i.e., β = 0.85: on average the surfer follows about 5 links and then jumps.

Example with β = 0.8 on the spider-trap graph: A = 0.8 S + 0.2 (1/3) 1 1^T, where S has rows y: (1/2, 1/2, 0), a: (1/2, 0, 0), m: (0, 1/2, 1). Entry by entry (e.g., 0.8 · 1/2 + 0.2 · 1/3 = 7/15), A has rows y: (7/15, 7/15, 1/15), a: (7/15, 1/15, 1/15), m: (1/15, 7/15, 13/15). Power iteration from (1/3, 1/3, 1/3) gives approximately (0.33, 0.20, 0.47), (0.28, 0.20, 0.52), (0.26, 0.18, 0.56), ..., converging to (7/33, 5/33, 21/33).
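The same computation can be reproduced in a few lines of plain Python (a sketch; the names A, S, r are ours):

```python
# Build the Google matrix A = beta*S + (1-beta)/N for the spider-trap
# example and power-iterate to the PageRank vector.
beta = 0.8
S = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]   # m is a spider trap (self-loop)
N = 3

A = [[beta * S[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]

# r is now (7/33, 5/33, 21/33): the teleport keeps the spider trap
# from absorbing all the importance.
```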

Suppose there are N pages. Consider a page j with set of out-links O(j): M_ij = 1/|O(j)| when j links to i, and M_ij = 0 otherwise. The random teleport is equivalent to: adding a teleport link from j to every other page with probability (1-β)/N, and reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|. Equivalently: tax each page a fraction (1-β) of its score and redistribute the tax evenly over all pages.

Construct the N x N matrix A as follows: A_ij = β M_ij + (1-β)/N. Verify that A is a column-stochastic matrix. The PageRank vector r is the principal eigenvector of A, satisfying r = A r; equivalently, r is the stationary distribution of the random walk with teleports.

The key step is the matrix-vector multiplication r_new = A r_old, which is easy if we have enough main memory to hold A, r_old, and r_new. But say N = 1 billion pages, with 4 bytes per entry: the two rank vectors alone take 2 billion entries, approx 8 GB, and A has N^2 = 10^18 entries, which is a very large number! (As before, A = β M + (1-β) [1/N]_{NxN}; with β = 0.8 the spider-trap M yields the dense A with rows (7/15, 7/15, 1/15), (7/15, 1/15, 1/15), (1/15, 7/15, 13/15).)

r = A r, where A_ij = β M_ij + (1-β)/N. Then
r_i = sum_{j=1..N} A_ij r_j
    = sum_{j=1..N} [β M_ij + (1-β)/N] r_j
    = β sum_{j=1..N} M_ij r_j + (1-β)/N sum_{j=1..N} r_j
    = β sum_{j=1..N} M_ij r_j + (1-β)/N, since sum_j r_j = 1.
So r = β M r + [(1-β)/N]_N, where [x]_N denotes a vector of length N with all entries x.

We can therefore rearrange the PageRank equation as r = β M r + [(1-β)/N]_N, where [(1-β)/N]_N is an N-vector with all entries (1-β)/N. M is a sparse matrix! With roughly 10 links per node, it has approx 10N entries. So in each iteration we need only: compute r_new = β M r_old, then add the constant value (1-β)/N to each entry of r_new.
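A sketch of this sparse formulation in plain Python, never materializing the dense A (the graph here is the spider-trap example with no dead ends, as the rearranged equation assumes; the out_links encoding is ours):

```python
# Sparse PageRank step: compute r_new = beta*M*r_old edge by edge,
# then add the constant teleport term (1-beta)/N to every entry.
beta = 0.8
out_links = {0: [0, 1],   # y -> y, a
             1: [0, 2],   # a -> y, m
             2: [2]}      # m -> m (spider trap)
N = 3

r = [1.0 / N] * N
for _ in range(200):
    r_new = [(1 - beta) / N] * N              # constant teleport term
    for src, dests in out_links.items():
        share = beta * r[src] / len(dests)    # beta * M * r_old, per edge
        for d in dests:
            r_new[d] += share
    r = r_new

# Same fixed point as with the dense Google matrix: (7/33, 5/33, 21/33).
```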

Encode the sparse matrix using only its nonzero entries: space is roughly proportional to the number of links, say 10N, or 4 * 10 * 1 billion = 40 GB. That still won't fit in memory, but it will fit on disk. Store one record per source node: source node, out-degree, destination nodes. For example: 0, 3, (1, 5, 7); 1, 5, (17, 64, 113, 117, 245); 2, 2, (13, 23).

Assume there is enough RAM to fit r_new in memory, while r_old and matrix M are stored on disk. One step of power iteration is then: initialize all entries of r_new to (1-β)/N; then, for each page p of out-degree n, read into memory the record p, n, dest_1, ..., dest_n together with r_old(p), and for j = 1..n set r_new(dest_j) += β r_old(p) / n.
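This disk-friendly pass can be sketched by treating the on-disk matrix as a stream of "source, degree, destinations" lines (here simulated with a list of strings; nodes y, a, m are numbered 0, 1, 2, and the format is our illustration of the record layout):

```python
# One power-iteration step in the streaming layout: only r_new (and,
# for this toy example, r_old) lives in memory; M is read line by line.
lines = ["0 2 0 1",    # y -> y, a
         "1 2 0 2",    # a -> y, m
         "2 1 2"]      # m -> m
beta, N = 0.8, 3
r_old = [1.0 / N] * N                 # on disk in the real setting

r_new = [(1 - beta) / N] * N          # initialize to the teleport share
for line in lines:
    fields = [int(x) for x in line.split()]
    p, n, dests = fields[0], fields[1], fields[2:]
    for d in dests:
        r_new[d] += beta * r_old[p] / n

# First iterate from uniform: r_new = (1/3, 1/5, 7/15) ~ (0.33, 0.20, 0.47)
```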

With enough RAM to fit r_new in memory and r_old and M on disk, each iteration has to read r_old and M and write r_new back to disk, so the I/O cost per iteration is 2|r| + |M|. Question: what if we could not even fit r_new in memory?

Block update example: r_new over nodes 0-5 is split into blocks, while M is stored as full records: source 0, degree 4, destinations (0, 1, 3, 5); source 1, degree 2, destinations (0, 5); source 2, degree 2, destinations (3, 4).

This is similar to a nested-loop join in databases: break r_new into k blocks that fit in memory, and scan M and r_old once for each block. The cost of k scans of M and r_old is k(|M| + |r|) + |r| = k|M| + (k+1)|r|. Can we do better? Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.

Block-stripe example (same graph, blocks of 2 nodes): the stripe for block (0, 1) holds the records 0, 4, (0, 1); 1, 3, (0); 2, 2, (1). The stripe for block (2, 3) holds 0, 4, (3); 2, 2, (3). The stripe for block (4, 5) holds 0, 4, (5); 1, 3, (5); 2, 2, (4). Each record still stores the node's total out-degree, not just the number of destinations in the stripe.

Break M into stripes, where each stripe contains only the destination nodes in the corresponding block of r_new. There is some additional overhead per stripe (the source node and its degree are repeated in every stripe where that source has destinations), but it is usually worth it. Cost per iteration: |M|(1+ε) + (k+1)|r|.

Problems with PageRank: (1) It measures the generic popularity of a page and is biased against topic-specific authorities. Solution: Topic-Specific PageRank (next). (2) It uses a single measure of importance, while other models exist, e.g., hubs and authorities. Solution: Hubs-and-Authorities (next). (3) It is susceptible to link spam: artificial link topologies created in order to boost PageRank. Solution: TrustRank (next).