CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University


TheFind.com: large set of products (~6GB compressed); for each product: attributes and related products. Craigslist: about 3 weeks of data (~7.5GB compressed); text of posts plus category metadata, e.g., to match buyers and sellers.

How big is the Web? Technically, infinite. Much duplication (30-40%). The best estimate of unique static HTML pages comes from search engine claims: Google = 8 billion(?), Yahoo = 20 billion. What is the structure of the Web? How is it organized?


The Web is a directed graph. In the early days of the Web, links were navigational; today many links are transactional.

Two types of directed graphs: a DAG (directed acyclic graph) has no cycles: if u can reach v, then v cannot reach u. A strongly connected graph: any node can reach any other node via a directed path. Any directed graph can be expressed in terms of these two types.

A strongly connected component (SCC) is a set of nodes S such that: every pair of nodes in S can reach each other, and there is no larger set containing S with this property.

Take a large snapshot of the Web and try to understand how its SCCs fit together as a DAG. Computational issue: say we want to find the SCC containing a specific node v. Observation: Out(v) = the set of nodes reachable from v (BFS along out-links). The SCC containing v is Out(v, G) ∩ In(v, G) = Out(v, G) ∩ Out(v, G'), where G' is G with the directions of all edges flipped.
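As an illustration of this observation, a minimal Python sketch (my own, not the course's code) that finds the SCC of v by intersecting a forward BFS on G with a forward BFS on the edge-reversed graph G'; the graph representation (a dict from node to out-neighbor list) is an assumption.

from collections import deque

def reachable(graph, start):
    # BFS over out-links; `graph` maps each node to its list of out-neighbors.
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in graph.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def scc_containing(graph, v):
    # Build G' by flipping the direction of every edge of G.
    reverse = {}
    for u, outs in graph.items():
        for w in outs:
            reverse.setdefault(w, []).append(u)
    # SCC(v) = Out(v, G) intersected with Out(v, G')
    return reachable(graph, v) & reachable(reverse, v)

# Tiny example: nodes 0, 1, 2 form a cycle; node 3 hangs off it.
G = {0: [1], 1: [2], 2: [0, 3], 3: []}
print(scc_containing(G, 0))   # -> {0, 1, 2}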

[Broder et al., 00] There is a giant SCC. Broder et al., 2000: the giant weakly connected component contains 90% of the nodes.

[Broder et al., 00] 250 million webpages, 1.5 billion links [AltaVista].

[Albert et al., 99] Diameter (average directed shortest path length) is 19 (in 1999).

[Broder et al., 00] Average distance: 75% of the time there is no directed path from the start page to the finish page. Following in-links (directed): 16.12. Following out-links (directed): 16.18. Undirected: 6.83. Diameter of the SCC (directed): at least 28.


Take a real network and plot a histogram of p_k vs. k, where p_k is the fraction of nodes with degree k.

Plot the same data on log-log axes:

(Figure: log-log plot comparing an exponential, Y ~ e^{-X}, with a power law, Y ~ X^{-2}; on log-log axes the power law is a straight line while the exponential falls off much faster.)

The power-law degree exponent is typically 2 < α < 3. Web graph [Broder et al. 00]: α_in = 2.1, α_out = 2.4. Autonomous systems [Faloutsos et al. 99]: α = 2.4. Actor collaborations [Barabasi-Albert 00]: α = 2.3. Citations to papers [Redner 98]: α ≈ 3. Online social networks [Leskovec et al. 07]: α ≈ 2.

Random network (Erdos-Renyi random graph): the degree distribution is binomial. Scale-free (power-law) network: the degree distribution is a power law. A function is scale-free if f(ax) = c f(x).

Web pages are not equally important: compare www.joe-schmoe.com vs. www.stanford.edu. Since there is great diversity in the connectivity of the web graph, we can rank pages by their link structure.

First try: a page is more important if it has more links. Incoming links? Outgoing links? Think of in-links as votes: www.stanford.edu has 23,400 in-links; www.joe-schmoe.com has 1 in-link. Are all in-links equal? Links from important pages should count more. A recursive question!

Each link's vote is proportional to the importance of its source page. If page P with importance x has n out-links, each link gets x/n votes. Page P's own importance is the sum of the votes on its in-links.
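In symbols (notation mine, not the slide's): writing r_i for the importance of page i and d_i for its number of out-links, the flow model is

    r_j = \sum_{i \to j} r_i / d_i

i.e., page j collects r_i / d_i from every page i that links to it.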

The web in 1839: a toy graph with three pages, Yahoo (y), Amazon (a), and M'soft (m). The flow equations are: y = y/2 + a/2, a = y/2 + m, m = a/2.

3 equations, 3 unknowns, no constant terms: there is no unique solution; all solutions are equivalent up to a scale factor. An additional constraint forces uniqueness: y + a + m = 1, which gives y = 2/5, a = 2/5, m = 1/5. Gaussian elimination works for small examples, but we need a better method for large, web-scale graphs.
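As a quick check of this small example, a Python sketch using numpy as a stand-in for Gaussian elimination (the particular row arrangement below is my own: two of the flow equations plus the normalization constraint):

import numpy as np

A = np.array([[ 0.5, -0.5,  0.0],   # y - (y/2 + a/2) = 0
              [-0.5,  1.0, -1.0],   # a - (y/2 + m)   = 0
              [ 1.0,  1.0,  1.0]])  # y + a + m       = 1
b = np.array([0.0, 0.0, 1.0])
print(np.linalg.solve(A, b))        # -> [0.4, 0.4, 0.2], i.e., (2/5, 2/5, 1/5)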

Matrix M has one row and one column for each web page. Suppose page j has n out-links: if j → i, then M_ij = 1/n, else M_ij = 0. M is a column-stochastic matrix: its columns sum to 1. Suppose r is a vector with one entry per web page, where r_i is the importance score of page i. Call it the rank vector; Σ_i r_i = 1.

Suppose page j links to 3 pages, including i. (Figure: column j of M has entries 1/3, so in the product M r, entry i picks up a contribution of r_j / 3.)

The flow equations can be written r = Mr. So the rank vector is an eigenvector of the stochastic web matrix; in fact, it is its first (principal) eigenvector, with corresponding eigenvalue 1.

Example (pages y = Yahoo, a = Amazon, m = M'soft): the flow equations

    y = y/2 + a/2
    a = y/2 + m
    m = a/2

are r = Mr with

    [ y ]   [ 1/2  1/2   0 ] [ y ]
    [ a ] = [ 1/2   0    1 ] [ a ]
    [ m ]   [  0   1/2   0 ] [ m ]
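A short numpy sketch (mine, not the course's) checking this claim on the toy matrix: the eigenvector for eigenvalue 1, rescaled to sum to 1, is exactly the rank vector.

import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
vals, vecs = np.linalg.eig(M)
v = vecs[:, np.argmax(vals.real)].real   # eigenvector of the largest eigenvalue (= 1)
print(v / v.sum())                       # -> [0.4, 0.4, 0.2]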

Simple iterative scheme (aka relaxation). Suppose there are N web pages. Initialize: r^0 = [1/N, ..., 1/N]^T. Iterate: r^{k+1} = M r^k. Stop when |r^{k+1} - r^k|_1 < ε. (|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm, e.g., Euclidean, also works.)

Power iteration: set r_i = 1/N, then iterate r_i = Σ_j M_ij r_j. Example (with the matrix M above):

    (y, a, m): (1/3, 1/3, 1/3) → (1/3, 1/2, 1/6) → (5/12, 1/3, 1/4) → (3/8, 11/24, 1/6) → ... → (2/5, 2/5, 1/5)
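The same computation as a minimal Python sketch (assuming the toy matrix above; the stopping threshold and iteration cap are arbitrary choices of mine):

import numpy as np

def power_iterate(M, eps=1e-10, max_iter=1000):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                 # r^0 = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r                       # r^{k+1} = M r^k
        if np.abs(r_next - r).sum() < eps:   # stop when the L1 change is below eps
            return r_next
        r = r_next
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))   # -> approximately [0.4, 0.4, 0.2]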

Imagine a random web surfer. At any time t, the surfer is on some page P. At time t+1, the surfer follows an out-link from P uniformly at random and ends up on some page Q linked from P. The process repeats indefinitely. Let p(t) be a vector whose i-th component is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.

Where is the surfer at time t+1? They follow a link uniformly at random: p(t+1) = M p(t). Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t). Then p(t) is called a stationary distribution of the random walk. Our rank vector r satisfies r = Mr, so it is a stationary distribution for the random surfer.

A central result from the theory of random walks (aka Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is.

Some pages are dead ends (they have no out-links); such pages cause importance to leak out. Spider traps (groups whose out-links all stay within the group): eventually spider traps absorb all importance.

A group of pages is a spider trap if there are no links from within the group to outside the group: the random surfer gets trapped. Spider traps violate the conditions needed for the random walk theorem.

Power iteration on a graph with a spider trap: set r_i = 1, iterate r_i = Σ_j M_ij r_j. Example (M'soft now links only to itself):

    M = [ 1/2  1/2  0 ]
        [ 1/2   0   0 ]
        [  0   1/2  1 ]

    (y, a, m): (1, 1, 1) → (1, 1/2, 3/2) → (3/4, 1/2, 7/4) → (5/8, 3/8, 2) → ... → (0, 0, 3)

The Google solution to spider traps: at each time step, the random surfer has two options. With probability β, follow a link at random; with probability 1-β, jump to some page uniformly at random. Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

With β = 0.8 on the spider-trap example, the modified transition matrix is

    A = 0.8 [ 1/2  1/2  0 ]   +   0.2 [ 1/3  1/3  1/3 ]   =   [ 7/15  7/15   1/15 ]
            [ 1/2   0   0 ]           [ 1/3  1/3  1/3 ]       [ 7/15  1/15   1/15 ]
            [  0   1/2  1 ]           [ 1/3  1/3  1/3 ]       [ 1/15  7/15  13/15 ]

Each real out-link is followed with probability 0.8 times its weight in M, and every page additionally receives teleport probability 0.2 x 1/3.

Iterating r = Ar with this matrix, starting (as the slide does) from r = (1, 1, 1):

    (y, a, m): (1, 1, 1) → (1.00, 0.60, 1.40) → (0.84, 0.60, 1.56) → (0.776, 0.536, 1.688) → ... → (7/11, 5/11, 21/11)
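A Python sketch of this computation (assuming β = 0.8 and the spider-trap matrix from the slide; starting from all ones as the slide does, so the entries sum to 3 rather than 1):

import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])            # M'soft is a spider trap (links only to itself)
N = M.shape[0]
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.ones(N)
for _ in range(100):
    r = A @ r                              # converges despite the spider trap
print(r)                                   # -> approximately [7/11, 5/11, 21/11]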

Power iteration on a graph with a dead end: set r_i = 1, iterate r_i = Σ_j M_ij r_j. Example (M'soft now has no out-links, so its column of M is all zeros):

    M = [ 1/2  1/2  0 ]
        [ 1/2   0   0 ]
        [  0   1/2  0 ]

    (y, a, m): (1, 1, 1) → (1, 1/2, 1/2) → (3/4, 1/2, 1/4) → (5/8, 3/8, 1/4) → ... → (0, 0, 0)

Two fixes for dead ends. Teleport: follow random teleport links with probability 1.0 from dead ends, adjusting the matrix accordingly. Prune and propagate: preprocess the graph to eliminate dead ends (this might require multiple passes), compute PageRank on the reduced graph, then approximate values for dead ends by propagating values from the reduced graph.
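A sketch of the first option (my code, using a dense numpy array for clarity): replace every all-zero column of M with 1/N, i.e., from a dead end the surfer teleports with probability 1.0.

import numpy as np

def fix_dead_ends(M):
    N = M.shape[0]
    M = M.copy()
    dead = (M.sum(axis=0) == 0)    # a dead end shows up as an all-zero column
    M[:, dead] = 1.0 / N           # from there, jump to any page uniformly at random
    return M

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])    # the dead-end example: M'soft has no out-links
print(fix_dead_ends(M))            # its column becomes [1/3, 1/3, 1/3]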

Suppose there are N pages. Consider a page j with set of out-links O(j). We have M_ij = 1/|O(j)| when j → i and M_ij = 0 otherwise. The random teleport is equivalent to adding a teleport link from j to every other page with probability (1-β)/N, and reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|. Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly.

Construct the N x N matrix A as follows: A_ij = βM_ij + (1-β)/N. Verify that A is a stochastic matrix. The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar. Equivalently, r is the stationary distribution of the random walk with teleports.

The key step is the matrix-vector multiplication r_new = A r_old. This is easy if we have enough main memory to hold A, r_old, and r_new. But say N = 1 billion pages and we use 4 bytes per entry: the two vectors alone have 2 billion entries, approximately 8GB. The matrix A has N^2 = 10^18 entries, which is a very large number!

r = Ar, where A_ij = βM_ij + (1-β)/N, so

    r_i = Σ_{1≤j≤N} A_ij r_j
        = Σ_{1≤j≤N} [ βM_ij + (1-β)/N ] r_j
        = β Σ_{1≤j≤N} M_ij r_j + (1-β)/N Σ_{1≤j≤N} r_j
        = β Σ_{1≤j≤N} M_ij r_j + (1-β)/N,   since Σ_j r_j = 1.

Hence r = βMr + [(1-β)/N]_N, where [x]_N denotes an N-vector with all entries equal to x.

We can rearrange the PageRank equation: r = βMr + [(1-β)/N]_N, where [(1-β)/N]_N is an N-vector with all entries (1-β)/N. M is a sparse matrix! With roughly 10 links per node, it has approximately 10N entries. So in each iteration, we need to: compute r_new = βM r_old, then add the constant value (1-β)/N to each entry of r_new.
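One iteration in this form, as a Python sketch (assuming M is a scipy.sparse column-stochastic matrix with dead ends already handled; β and the toy data are illustrative):

import numpy as np
from scipy.sparse import csr_matrix

def pagerank_step(M, r_old, beta=0.8):
    N = r_old.size
    r_new = beta * (M @ r_old)       # sparse matrix-vector product: beta * M * r_old
    r_new += (1.0 - beta) / N        # add the constant teleport term to every entry
    return r_new

M = csr_matrix(np.array([[0.5, 0.5, 0.0],
                         [0.5, 0.0, 1.0],
                         [0.0, 0.5, 0.0]]))
r = np.full(3, 1.0 / 3)
for _ in range(50):
    r = pagerank_step(M, r)
print(r)                             # the PageRank vector with teleports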

Encode the sparse matrix using only its nonzero entries. Space is roughly proportional to the number of links: say 10N, or 4 x 10 x 1 billion = 40GB; this still won't fit in memory, but it will fit on disk. Example encoding: source node 0, degree 3, destination nodes 1, 5, 7; source node 1, degree 5, destinations 17, 64, 113, 117, 245; source node 2, degree 2, destinations 13, 23.

Assume we have enough RAM to fit r_new, plus some working memory; store r_old and the matrix M on disk. Basic algorithm: Initialize r_old = [1/N]_N. Iterate: perform a sequential scan of M and r_old to update r_new; write out r_new to disk as r_old for the next iteration. Every few iterations, compute |r_new - r_old| and stop if it is below a threshold (this requires reading both vectors into memory).

Update step: initialize all entries of r_new to (1-β)/N. Then for each page p (out-degree n): read into memory p, n, dest_1, ..., dest_n, and r_old(p); for j = 1..n: r_new(dest_j) += β * r_old(p) / n. (Figure: r_new entries 0-6 in memory; the on-disk link table, e.g., src 0, degree 3, destinations 1, 5, 6; src 1, degree 4, destinations 17, 64, 113, 117; src 2, degree 2, destinations 13, 23; and r_old entries 0-6 on disk.)
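A runnable version of this update step (a sketch: the dict `links` stands in for the on-disk table of source, degree, and destinations, and r_old / r_new are in-memory numpy vectors; the toy graph is hypothetical):

import numpy as np

def update(links, r_old, beta=0.8):
    N = r_old.size
    r_new = np.full(N, (1.0 - beta) / N)      # initialize every entry to (1-beta)/N
    for p, dests in links.items():            # sequential scan over the link table
        n = len(dests)                        # out-degree of page p
        for dest in dests:
            r_new[dest] += beta * r_old[p] / n
    return r_new

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # hypothetical 4-page graph, no dead ends
r = np.full(4, 1.0 / 4)
for _ in range(30):
    r = update(links, r)
print(r)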

In each iteration, we have to read r_old and M and write r_new back to disk: I/O cost = 2|r| + |M|. What if we had enough memory to fit both r_new and r_old? What if we could not even fit r_new in memory? (Think 10 billion pages.)

(Figure: r_new split into blocks of entries 0-1, 2-3, 4-5; the link table: src 0, degree 4, destinations 0, 1, 3, 5; src 1, degree 2, destinations 0, 5; src 2, degree 2, destinations 3, 4; and r_old entries 0-5.)

This is similar to a nested-loop join in databases: break r_new into k blocks that fit in memory, and scan M and r_old once for each block. With k scans of M and r_old, the cost is k(|M| + |r|) + |r| = k|M| + (k+1)|r|. Can we do better? Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.

(Figure: stripes of M. Stripe for destinations 0-1: src 0, degree 4, destinations 0, 1; src 1, degree 3, destination 0; src 2, degree 2, destination 1. Stripe for destinations 2-3: src 0, degree 4, destination 3; src 2, degree 2, destination 3. Stripe for destinations 4-5: src 0, degree 4, destination 5; src 1, degree 3, destination 5; src 2, degree 2, destination 4. r_old spans entries 0-5.)

Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new. There is some additional overhead per stripe (the source and its degree are repeated in every stripe), but it is usually worth it. Cost per iteration: |M|(1+ε) + (k+1)|r|.
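A small sketch of how such stripes could be built (my helper, not the course's code), using the link table from the earlier slide and k = 3 blocks of r_new:

def make_stripes(links, num_nodes, k):
    # Stripe b keeps, for each source, only the destinations in block b of r_new.
    width = (num_nodes + k - 1) // k
    stripes = [dict() for _ in range(k)]
    for src, dests in links.items():
        for d in dests:
            stripes[d // width].setdefault(src, []).append(d)
    return stripes

links = {0: [0, 1, 3, 5], 1: [0, 5], 2: [3, 4]}
for b, stripe in enumerate(make_stripes(links, num_nodes=6, k=3)):
    print("stripe", b, "->", stripe)
# stripe 0 -> {0: [0, 1], 1: [0]}
# stripe 1 -> {0: [3], 2: [3]}
# stripe 2 -> {0: [5], 1: [5], 2: [4]}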