
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

What is the structure of the Web? How is it organized? We view the Web as a directed graph.

Two types of directed graphs:
- DAG (Directed Acyclic Graph): has no cycles; if u can reach v, then v cannot reach u.
- Strongly connected: any node can reach any node via a directed path.
Any directed graph can be expressed in terms of these two types of graphs.

A strongly connected component (SCC) is a set of nodes S such that:
- every pair of nodes in S can reach each other, and
- there is no larger set containing S with this property.
Any directed graph is a DAG on its SCCs:
- each SCC is a super-node, and
- super-node A links to super-node B if a node in A links to a node in B.

Take a large snapshot of the Web and try to understand how its SCCs fit together as a DAG.
Computational issue: say we want to find the SCC containing a specific node v.
Observation:
- Out(v) = nodes reachable from v (via out-edges)
- In(v) = nodes that can reach v (via in-edges)
Then the SCC containing v is Out(v, G) ∩ In(v, G) = Out(v, G) ∩ Out(v, G'), where G' is G with the direction of every edge flipped.
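A minimal Python sketch of this two-BFS idea (not from the lecture; the dict-of-lists graph encoding and example are illustrative):

```python
from collections import deque

def reachable(adj, v):
    """All nodes reachable from v by BFS over the given adjacency lists."""
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def scc_containing(adj, v):
    """SCC(v) = Out(v, G) ∩ Out(v, G'), where G' reverses every edge of G."""
    radj = {}
    for u, outs in adj.items():
        for w in outs:
            radj.setdefault(w, []).append(u)
    return reachable(adj, v) & reachable(radj, v)

# Tiny example: 0 and 1 reach each other; 2 is reachable from 0 but not back.
g = {0: [1, 2], 1: [0], 2: []}
print(scc_containing(g, 0))  # {0, 1}
```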

[Broder et al., 00] A snapshot of 250 million webpages with 1.5 billion links [AltaVista].

Out-/in-degree distribution: p_k = fraction of nodes with k out-/in-links. (Figure: histogram of the normalized count p_k vs. k.)

Plot the same data on log-log axes. (Figure: normalized count p_k vs. k, log-log.) The distribution follows a power law:

p_k = β·k^(−α), i.e., log p_k = log β − α·log k

which is a straight line on log-log axes.
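The exponent α can then be estimated by a straight-line fit in log-log space. A NumPy sketch (the degree counts below are synthetic, illustrative values; real data would use the binned histogram):

```python
import numpy as np

# Synthetic degree distribution p_k = beta * k^(-alpha) (illustrative values).
k = np.arange(1, 1000)
p_k = 2.0 * k ** -2.1

# log p_k = log(beta) - alpha * log(k): a line in log-log space whose
# slope is -alpha and whose intercept is log(beta).
slope, intercept = np.polyfit(np.log(k), np.log(p_k), 1)
print(f"alpha = {-slope:.2f}, beta = {np.exp(intercept):.2f}")  # 2.10, 2.00
```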

[Broder et al., 00] (Figure: the bow-tie structure of the Web graph.)

Random network vs. power-law network:
- Random network: the degree distribution is Binomial, i.e., all nodes have similar degree.
- Power-law network: degrees follow a power law, i.e., are heavily skewed.

Web pages are not equally important: www.joe-schmoe.com vs. www.stanford.edu. Since there is large diversity in the connectivity of the web graph, we can rank pages by their link structure.

We will cover the following link analysis approaches to computing the importance of nodes in a graph:
- PageRank
- Hubs and Authorities (HITS)
- Topic-Specific (Personalized) PageRank
- Spam detection algorithms

First try: a page is more important if it has more links. In-coming links? Out-going links? Think of in-links as votes: www.stanford.edu has 23,400 in-links; www.joe-schmoe.com has 1 in-link. Are all in-links equal? Links from important pages should count more. Recursive question!

Each link's vote is proportional to the importance of its source page. If page P with importance x has n out-links, each link gets x/n votes. Page P's own importance is the sum of the votes on its in-links.

The web in 1839: three pages, Yahoo (y), Amazon (a), and M'soft (m). (Figure: Yahoo links to itself and Amazon; Amazon links to Yahoo and M'soft; M'soft links to Amazon.) The flow equations:

y = y/2 + a/2
a = y/2 + m
m = a/2

Three equations, three unknowns, no constant terms: there is no unique solution, and all solutions are equivalent up to a scale factor. An additional constraint forces uniqueness: y + a + m = 1 gives y = 2/5, a = 2/5, m = 1/5. Gaussian elimination works for small examples like this, but we need a better method for large, web-scale graphs.
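For this small example the constrained system can be solved directly. A NumPy sketch, replacing one redundant flow equation with the constraint:

```python
import numpy as np

# Flow equations rewritten as homogeneous rows:
#   y/2 + a/2 - y = 0  and  y/2 + m - a = 0.
# The third flow equation is redundant, so replace it with y + a + m = 1.
coeffs = np.array([[-0.5,  0.5, 0.0],
                   [ 0.5, -1.0, 1.0],
                   [ 1.0,  1.0, 1.0]])
rhs = np.array([0.0, 0.0, 1.0])
y, a, m = np.linalg.solve(coeffs, rhs)
print(y, a, m)  # 0.4 0.4 0.2, i.e. y = a = 2/5, m = 1/5
```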

Matrix M has one row and one column for each web page. Suppose page j has n out-links: if j → i, then M_ij = 1/n, else M_ij = 0. M is a column-stochastic matrix: its columns sum to 1. Suppose r is a vector with one entry per web page, where r_i is the importance score of page i. Call it the rank vector, normalized so that Σ_i r_i = 1.
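A sketch of building M from out-link lists (the dict encoding is illustrative, not from the lecture):

```python
import numpy as np

def stochastic_matrix(out_links, n):
    """M[i, j] = 1/deg(j) if j links to i, else 0; columns sum to 1."""
    M = np.zeros((n, n))
    for j, dests in out_links.items():
        for i in dests:
            M[i, j] = 1.0 / len(dests)
    return M

# y = 0, a = 1, m = 2: Yahoo links to itself and Amazon,
# Amazon to Yahoo and M'soft, M'soft to Amazon.
M = stochastic_matrix({0: [0, 1], 1: [0, 2], 2: [1]}, 3)
print(M)              # the ½-valued matrix of the Yahoo/Amazon/M'soft example
print(M.sum(axis=0))  # [1. 1. 1.]
```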

Suppose page j links to 3 pages, including page i. Then column j of M has three entries equal to 1/3, and in the product r = M·r the entry M_ij = 1/3 contributes r_j/3 to r_i. (Figure: the matrix-vector product M·r with row i and column j highlighted.)

The flow equations can be written r = M·r. So the rank vector r is an eigenvector of the stochastic web matrix M; in fact, it is the first or principal eigenvector, with corresponding eigenvalue 1.

For the Yahoo/Amazon/M'soft example:

         y  a  m
M =   y  ½  ½  0
      a  ½  0  1
      m  0  ½  0

The flow equations y = y/2 + a/2, a = y/2 + m, m = a/2 become r = M·r:

  y     ½ ½ 0   y
  a  =  ½ 0 1 · a
  m     0 ½ 0   m

Simple iterative scheme. Suppose there are N web pages.
Initialize: r^(0) = [1/N, ..., 1/N]^T
Iterate: r^(k+1) = M·r^(k)
Stop when |r^(k+1) − r^(k)|_1 < ε, where |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm. (Any other vector norm, e.g., Euclidean, can be used instead.)
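A direct NumPy rendering of the scheme:

```python
import numpy as np

def power_iterate(M, eps=1e-8):
    """Iterate r <- M r from the uniform vector until the L1 change < eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # L1 norm of the change
            return r_next
        r = r_next

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))  # ≈ [0.4 0.4 0.2]: y = a = 2/5, m = 1/5
```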

Power iteration: set r_i = 1/n (here 1/3) and iterate r_i = Σ_j M_ij·r_j with the matrix M above:

y: 1/3  1/3  5/12  3/8    ...  2/5
a: 1/3  1/2  1/3   11/24  ...  2/5
m: 1/3  1/6  1/4   1/6    ...  1/5

Imagine a random web surfer. At any time t, the surfer is on some page P. At time t+1, the surfer follows an out-link from P uniformly at random and ends up on some page Q linked from P. The process repeats indefinitely. Let p(t) be a vector whose i-th component is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.

Where is the surfer at time t+1? Following a link uniformly at random, p(t+1) = M·p(t). Suppose the random walk reaches a state such that p(t+1) = M·p(t) = p(t). Then p(t) is called a stationary distribution for the random walk. Our rank vector r satisfies r = M·r, so it is a stationary distribution for the random surfer.
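A quick Monte Carlo check (not from the lecture) that the surfer's empirical visit frequencies approach r on the Yahoo/Amazon/M'soft graph:

```python
import random
from collections import Counter

out_links = {0: [0, 1], 1: [0, 2], 2: [1]}   # y = 0, a = 1, m = 2

page, visits = 0, Counter()
for _ in range(200_000):
    page = random.choice(out_links[page])    # follow a uniform random out-link
    visits[page] += 1

total = sum(visits.values())
print({p: round(c / total, 3) for p, c in sorted(visits.items())})
# ≈ {0: 0.4, 1: 0.4, 2: 0.2}: the stationary distribution r
```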

A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions (irreducibility and aperiodicity), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is.

Two problems:
- Some pages are dead ends (have no out-links): such pages cause importance to leak out.
- Spider traps (all out-links stay within a group): a group of pages is a spider trap if there are no links from within the group to the outside. The random surfer gets trapped, and eventually spider traps absorb all importance.

Power iteration on a spider-trap example (M'soft now links only to itself):

         y  a  m
M =   y  ½  ½  0
      a  ½  0  0
      m  0  ½  1

Set r_i = 1 and iterate r_i = Σ_j M_ij·r_j:

y: 1  1    ¾    5/8  ...  0
a: 1  ½    ½    3/8  ...  0
m: 1  3/2  7/4  2    ...  3

The trap {m} ends up with all the importance.

The Google solution for spider traps: at each time step, the random surfer has two options:
- With probability β, follow a link at random.
- With probability 1−β, jump to some page uniformly at random.
Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

(Figure: the Yahoo/Amazon/M'soft graph where each out-link is followed with probability 0.8·½ and every page is teleported to with probability 0.2·⅓.)

With β = 0.8, the transition matrix becomes A = 0.8·M + 0.2·[1/3 in every entry]:

      ½ ½ 0          ⅓ ⅓ ⅓      7/15 7/15  1/15
0.8 · ½ 0 0  + 0.2 · ⅓ ⅓ ⅓  =   7/15 1/15  1/15
      0 ½ 1          ⅓ ⅓ ⅓      1/15 7/15 13/15

with rows and columns ordered y, a, m.

Power iteration with this matrix A:

y: 1  1.00  0.84  0.776  ...  7/11
a: 1  0.60  0.60  0.536  ...  5/11
m: 1  1.40  1.56  1.688  ...  21/11

The spider trap no longer absorbs all the importance.

Dead ends again: some pages have no out-links, and such pages cause importance to leak out. Example (M'soft now has no out-links, so column m is all zeros and M is no longer stochastic):

         y  a  m
M =   y  ½  ½  0
      a  ½  0  0
      m  0  ½  0

Power iteration with r_i = 1 and r_i = Σ_j M_ij·r_j:

y: 1  1  ¾  5/8  ...  0
a: 1  ½  ½  3/8  ...  0
m: 1  ½  ¼  ¼    ...  0

All importance leaks out.

Teleports also solve dead ends: from a dead end, follow a random teleport link with probability 1.0, and adjust the matrix accordingly.

Suppose there are N pages. Consider a page j with set of out-links O(j). We have M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise. The random teleport is equivalent to:
- adding a teleport link from j to every other page with probability (1−β)/N, and
- reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|.
Equivalently: tax each page a fraction (1−β) of its score and redistribute it evenly.

Construct the N×N matrix A as follows: A_ij = β·M_ij + (1−β)/N. One can verify that A is a stochastic matrix. The PageRank vector r is the principal eigenvector of this matrix, satisfying r = A·r. Equivalently, r is the stationary distribution of the random walk with teleports.
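A NumPy sketch for the spider-trap example, checking both the construction and the eigenvector claim:

```python
import numpy as np

beta, N = 0.8, 3
M = np.array([[0.5, 0.5, 0.0],       # spider-trap example:
              [0.5, 0.0, 0.0],       # column m links only to itself
              [0.0, 0.5, 1.0]])
A = beta * M + (1 - beta) / N        # A_ij = beta*M_ij + (1-beta)/N
print(A.sum(axis=0))                 # [1. 1. 1.]: A is column stochastic

# r is the principal eigenvector of A (eigenvalue 1).
vals, vecs = np.linalg.eig(A)
r = np.real(vecs[:, np.argmax(np.real(vals))])
print(r / r.sum())                   # ≈ [7/33, 5/33, 21/33]
```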

The key step is a matrix-vector multiplication: r_new = A·r_old. Easy if we have enough main memory to hold A, r_old, and r_new. But say N = 1 billion pages and 4 bytes per entry (say): the two vectors alone have 2 billion entries, approximately 8 GB, while matrix A has N² = 10^18 entries, a very large number!

r = A·r, where A_ij = β·M_ij + (1−β)/N:

r_i = Σ_{1≤j≤N} A_ij·r_j
    = Σ_{1≤j≤N} [β·M_ij + (1−β)/N]·r_j
    = β·Σ_{1≤j≤N} M_ij·r_j + (1−β)/N·Σ_{1≤j≤N} r_j
    = β·Σ_{1≤j≤N} M_ij·r_j + (1−β)/N, since Σ_j r_j = 1

So r = β·M·r + [(1−β)/N]_N, where [x]_N is an N-vector with all entries x.

We can rearrange the PageRank equation: r = β·M·r + [(1−β)/N]_N, where [(1−β)/N]_N is an N-vector with all entries (1−β)/N. M is a sparse matrix! With ~10 links per node, it has approximately 10N entries. So in each iteration we need to:
- compute r_new = β·M·r_old, and
- add a constant value (1−β)/N to each entry in r_new.
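A sketch of this sparse iteration (assuming SciPy is available; the matrix entries are the spider-trap example from above):

```python
import numpy as np
from scipy.sparse import csr_matrix

beta, N = 0.8, 3
# Only the 5 nonzero entries of M are stored: (values, (rows, cols)).
M = csr_matrix(([0.5, 0.5, 0.5, 0.5, 1.0],
                ([0, 0, 1, 2, 2], [0, 1, 0, 1, 2])), shape=(N, N))

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = beta * (M @ r) + (1 - beta) / N   # sparse multiply, then add constant
print(r)                                  # ≈ [7/33, 5/33, 21/33]
```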

Encode the sparse matrix using only its nonzero entries. Space is roughly proportional to the number of links: say 10N, or 4·10·1 billion = 40 GB. This still won't fit in memory, but it will fit on disk.

source node | degree | destination nodes
0           | 3      | 1, 5, 7
1           | 5      | 17, 64, 113, 117, 245
2           | 2      | 13, 23

Assume there is enough RAM to fit r_new in memory; store r_old and matrix M on disk.

Initialize all entries of r_new to (1−β)/N.
For each page p (of out-degree n):
  Read into memory: p, n, dest_1, ..., dest_n, r_old(p)
  for j = 1 ... n: r_new(dest_j) += β · r_old(p) / n

src | degree | destination
0   | 3      | 1, 5, 6
1   | 4      | 17, 64, 113, 117
2   | 2      | 13, 23

(Figure: entries of r_old are read one source page at a time and scattered into the in-memory r_new.)
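An in-memory emulation of one pass of this update (illustrative 4-page graph; in practice the link table and r_old would be streamed from disk):

```python
beta, N = 0.8, 4
r_old = [0.25, 0.25, 0.25, 0.25]

# (source page p, out-degree n, destination pages), as in the encoding above.
links = [(0, 2, [1, 3]),
         (1, 1, [2]),
         (2, 2, [0, 3]),
         (3, 1, [0])]

r_new = [(1 - beta) / N] * N          # initialize every entry to (1-beta)/N
for p, n, dests in links:             # stream the matrix one page at a time
    for d in dests:
        r_new[d] += beta * r_old[p] / n
print(r_new, sum(r_new))              # one full pass = one power-iteration step
```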

In each iteration, we have to read r_old and M and write r_new back to disk: IO cost = 2|r| + |M|. Questions: What if we had enough memory to fit both r_new and r_old? What if we could not even fit r_new in memory? See the reading: http://i.stanford.edu/~ullman/mmds/ch5.pdf

Problems with plain PageRank:
- It measures the generic popularity of a page, so it is biased against topic-specific authorities. Solution: Topic-Specific PageRank (next lecture).
- It uses a single measure of importance; other models exist, e.g., hubs and authorities. Solution: Hubs-and-Authorities (next).
- It is susceptible to link spam: artificial link topologies created in order to boost PageRank. Solution: TrustRank (next lecture).

Interesting pages fall into two classes:
1. Authorities are pages containing useful information: newspaper home pages, course home pages, home pages of auto manufacturers.
2. Hubs are pages that link to authorities: lists of newspapers, course bulletins, lists of US auto manufacturers.
(Figure: a hub page pointing to authorities, with example scores NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9.)


A good hub links to many good authorities; a good authority is linked from many good hubs. Model this with two scores per node, a hub score and an authority score, represented as vectors h and a.

HITS uses the adjacency matrix A: A[i, j] = 1 if page i links to page j, 0 otherwise. A^T, the transpose of A, is similar to the PageRank matrix M, but where M has fractions, A^T has 1s. For the Yahoo/Amazon/M'soft graph:

         y  a  m
A =   y  1  1  1
      a  1  0  1
      m  0  1  0


Notation: vectors a = (a_1, ..., a_n) and h = (h_1, ..., h_n); adjacency matrix (n×n) with A_ij = 1 if i → j, else A_ij = 0.
Then h_i = Σ_{i→j} a_j = Σ_j A_ij·a_j, so h = A·a.
Likewise a_i = Σ_{j→i} h_j = Σ_j A_ji·h_j, so a = A^T·h.

The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ·A·a, where the constant λ = 1/Σ_i h_i is a scaling factor. The authority score of page i is proportional to the sum of the hub scores of the pages that link to it: a = μ·A^T·h, where the constant μ = 1/Σ_i a_i is a scaling factor.

The HITS algorithm:
Initialize h and a to all 1s.
Repeat:
  h = A·a; scale h so that it sums to 1.0
  a = A^T·h; scale a so that it sums to 1.0
until h and a converge (i.e., change very little).
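A compact NumPy version of the loop (fixed iteration count instead of a convergence test, for brevity):

```python
import numpy as np

def hits(A, iters=50):
    """Alternate h = A a and a = A^T h, rescaling each to sum to 1."""
    h = a = np.ones(A.shape[0])
    for _ in range(iters):
        h = A @ a
        h /= h.sum()
        a = A.T @ h
        a /= a.sum()
    return h, a

A = np.array([[1., 1., 1.],
              [1., 0., 1.],
              [0., 1., 0.]])
h, a = hits(A)
print(h)  # ≈ [0.500 0.366 0.134]: Yahoo is the best hub
print(a)  # ≈ [0.366 0.268 0.366]: Yahoo and M'soft are the top authorities
```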

For the Yahoo/Amazon/M'soft example:

    1 1 1          1 1 0
A = 1 0 1    A^T = 1 0 1
    0 1 0          1 1 0

Iterations (here scaled so that the largest entry is 1):

a(Yahoo)  = 1  1    1     1     ...  1
a(Amazon) = 1  1    4/5   0.75  ...  0.732
a(M'soft) = 1  1    1     1     ...  1

h(Yahoo)  = 1  1    1     1     ...  1.000
h(Amazon) = 1  2/3  0.71  0.73  ...  0.732
h(M'soft) = 1  1/3  0.29  0.27  ...  0.268

Algorithm:
Set a = h = 1_n (the all-ones vector).
Repeat: h = A·a, a = A^T·h, normalize.
Two steps of the loop give the new a = A^T·(A·a) and the new h = A·(A^T·h). Thus:
a is updated (in 2 steps): A^T·(A·a) = (A^T·A)·a
h is updated (in 2 steps): A·(A^T·h) = (A·A^T)·h
This is repeated matrix powering.
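The fixed points can be checked directly against the principal eigenvectors (a NumPy sketch; eigh applies because both products are symmetric):

```python
import numpy as np

A = np.array([[1., 1., 1.],
              [1., 0., 1.],
              [0., 1., 0.]])

def principal_eigvec(S):
    vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    v = vecs[:, np.argmax(vals)]     # eigenvector of the largest eigenvalue
    return v / v.sum()               # normalize to sum 1 (also fixes the sign)

print(principal_eigvec(A @ A.T))     # h* ≈ [0.500 0.366 0.134]
print(principal_eigvec(A.T @ A))     # a* ≈ [0.366 0.268 0.366]
```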

Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*: h* is the principal eigenvector of the matrix A·A^T, and a* is the principal eigenvector of the matrix A^T·A.

PageRank and HITS are two solutions to the same problem: what is the value of an in-link from u to v? In the PageRank model, the value of the link depends on the links into u; in the HITS model, it depends on the value of the other links out of u. The destinies of PageRank and HITS post-1998 were very different.