Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.


1 Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

2 #1: C4.5 Decision Tree - Classification (61 votes)
#2: K-Means - Clustering (60 votes)
#3: SVM - Classification (58 votes)
#4: Apriori - Frequent Itemsets (52 votes)
#5: EM - Clustering (48 votes)
#6: PageRank - Link mining (46 votes)
#7: AdaBoost - Boosting (45 votes)
#7: kNN - Classification (45 votes)
#7: Naive Bayes - Classification (45 votes)
#10: CART - Classification (34 votes)
Data Mining: Concepts and Techniques

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

3 How to organize the Web?
First try: human-curated Web directories (Yahoo, DMOZ, LookSmart).
Second try: Web search. Content-based: find relevant documents. This works well on a small, trusted collection, e.g., newspaper articles or patents.
But: the Web is huge and full of untrusted documents, random things, web spam, etc.

4 Link-based ranking algorithms: PageRank, HITS


6 Not all web pages are equally important: ... vs. ...
There is large diversity in web-graph node connectivity.
Let's rank the pages by the link structure!

7 Idea: links as votes.
A page is more important if it has more links. In-coming links? Out-going links?
Think of in-links as votes: one page has 23,400 in-links while another has 1 in-link.
Are all in-links equal? Links from important pages count more.
Recursive question!

8 [Figure: example PageRank scores on a small graph. Recoverable labels: A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F]

9 Each link's vote is proportional to the importance of its source page.
If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes.
Page j's own importance is the sum of the votes on its in-links.
Example (figure): page i has 3 out-links and page k has 4 out-links, and both link to j, so $r_j = r_i/3 + r_k/4$. Page j itself has 3 out-links, each carrying $r_j/3$.

10 A vote from an important page is worth more. A page is important if it is pointed to by other important pages.
Define a rank $r_j$ for page j: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of node i.
Example, "The web in 1839", with three pages y, a, m: y links to itself and to a (each link carries y/2), a links to y and to m (each link carries a/2), and m links to a (carrying m).
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

11 Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
3 equations, 3 unknowns, no constants: there is no unique solution, and all solutions are equivalent modulo the scale factor.
An additional constraint forces uniqueness: $r_y + r_a + r_m = 1$.
Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$.
Gaussian elimination works for small examples, but we need a better method for large web-size graphs. We need a new formulation!
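For a graph this small the flow equations can be solved exactly. The following is a minimal sketch, assuming Python with exact fractions; `solve_linear` is an illustrative helper, not something from the slides. Because the three flow equations are linearly dependent, the third one is swapped for the constraint $r_y + r_a + r_m = 1$.

```python
from fractions import Fraction as F

def solve_linear(a, b):
    """Gauss-Jordan elimination on a small square system a x = b (exact arithmetic)."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Unknowns are ordered (r_y, r_a, r_m).
# r_y = r_y/2 + r_a/2   becomes   r_y/2 - r_a/2       = 0
# r_a = r_y/2 + r_m     becomes  -r_y/2 + r_a - r_m   = 0
# r_y + r_a + r_m = 1 replaces the redundant third flow equation.
coeffs = [
    [F(1, 2), F(-1, 2), F(0)],
    [F(-1, 2), F(1), F(-1)],
    [F(1), F(1), F(1)],
]
rhs = [F(0), F(0), F(1)]
r_y, r_a, r_m = solve_linear(coeffs, rhs)
print(r_y, r_a, r_m)
```

Elimination costs on the order of N^3 operations on a dense N by N system, which is why it does not scale to web-size graphs.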

12 Stochastic adjacency matrix M:
Let page i have $d_i$ out-links. If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$.
M is a column-stochastic matrix: its columns sum to 1.
Rank vector r: a vector with one entry per page. $r_i$ is the importance score of page i, and $\sum_i r_i = 1$.
The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written $r = M \cdot r$.


13 Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equation in matrix form: $M \cdot r = r$
Suppose page i links to 3 pages, including j. Then column i of M has the entry 1/3 in row j, so in the product $M \cdot r$ page i contributes $r_i/3$ to $r_j$.

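14 Example: the matrix M for the three-page graph y, a, m (columns are source pages, rows are destination pages):
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    1
 m    0   1/2   0
$r = M \cdot r$ is the same as the flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
In matrix form: $\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$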
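The rule "$M_{ji} = 1/d_i$ if i links to j" can be sketched in a few lines. This is an illustrative example only: the function name `column_stochastic` and the adjacency-list input format are assumptions, and the sketch assumes no page lists the same target twice.

```python
from fractions import Fraction

def column_stochastic(out_links, nodes):
    """Return M as a list of rows with M[j][i] = 1/d_i if i links to j, else 0."""
    index = {name: k for k, name in enumerate(nodes)}
    n = len(nodes)
    m = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in out_links.items():
        for dst in targets:
            # Column = source page, row = destination page.
            m[index[dst]][index[src]] = Fraction(1, len(targets))
    return m

# The three-page example: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = column_stochastic(out_links, ["y", "a", "m"])
for row in M:
    print([str(x) for x in row])
```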

15 The flow equations can be written $r = M \cdot r$.
So the rank vector r is an eigenvector of the stochastic web matrix M. In fact, it is its first or principal eigenvector, with corresponding eigenvalue 1.
The largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries).
We can now efficiently solve for r! The method is called power iteration.
NOTE: x is an eigenvector with the corresponding eigenvalue $\lambda$ if $Ax = \lambda x$.

16 Given a web graph with n nodes, where the nodes are pages and the edges are hyperlinks.
Power iteration: a simple iterative scheme. Suppose there are N web pages.
Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node i.
Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$.
Here $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm. Any other vector norm can be used, e.g., Euclidean.
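The scheme above translates almost line for line into code. A minimal sketch, assuming a dense matrix stored as a list of rows; the name `power_iteration`, the tolerance, and the iteration cap are illustrative choices, not values from the slides.

```python
def power_iteration(m, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(m)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(m[j][i] * r[i] for i in range(n)) for j in range(n)]
        # L1 norm of the difference between successive iterates.
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Matrix of the three-page example, rows and columns ordered (y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print(r)  # close to [0.4, 0.4, 0.2]
```

A dense matrix is only practical for toy graphs; at web scale one would store the out-links of each page and never materialize M.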

17 Example: power iteration on the three-page graph (rows and columns of M ordered y, a, m):
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    1
 m    0   1/2   0
Iteration:   0     1     2     3     ...   limit
 r_y        1/3   1/3   5/12  9/24   ...   6/15
 r_a        1/3   3/6   1/3   11/24  ...   6/15
 r_m        1/3   1/6   3/12  1/6    ...   3/15

18 Imagine a random web surfer:
At any time t, the surfer is on some page i.
At time t + 1, the surfer follows an out-link from i uniformly at random and ends up on some page j linked from i.
The process repeats indefinitely.
Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t. So p(t) is a probability distribution over pages.
[Figure: pages $i_1$, $i_2$, $i_3$ link to page j] $r_j = \sum_{i \to j} \frac{r_i}{d_{out}(i)}$

19 Where is the surfer at time t+1?
The surfer follows a link uniformly at random: $p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state where $p(t+1) = M \cdot p(t) = p(t)$. Then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies $r = M \cdot r$. So r is a stationary distribution for the random walk.
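The random-surfer view can be checked empirically. The sketch below is an assumption-laden illustration, not part of the slides: it simulates one long walk on the three-page example graph (y, a, m) and reports the fraction of time spent on each page, which for this strongly connected graph with a self-loop should approach the stationary distribution (2/5, 2/5, 1/5). The function name, step count, and seed are arbitrary.

```python
import random

def simulate_surfer(out_links, steps, seed=0):
    """Follow uniformly random out-links and return the fraction of time spent on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(out_links))  # arbitrary starting page
    visits = {name: 0 for name in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {name: count / steps for name, count in visits.items()}

# y -> {y, a}, a -> {y, m}, m -> {a}
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, steps=200_000)
print(freq)
```

The estimate is noisy, so it only agrees with the exact values to a couple of decimal places; power iteration remains the practical way to compute r.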

20 $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r = M \cdot r$
Does this converge?
Does it converge to what we want?

21 Example: [Figure: two-node graph with nodes a and b; table of $r_a$, $r_b$ over iterations 0, 1, 2, ...] using $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$

22 Example: [Figure: second two-node graph with nodes a and b; table of $r_a$, $r_b$ over iterations 0, 1, 2, ...] using $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$

23 A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.


25 Problems:
(1) Some pages are dead ends (have no out-links). The random walk has nowhere to go, and such pages cause importance to leak out.
(2) Spider traps (all out-links are within the group). The random walker gets stuck in a trap, and eventually spider traps absorb all importance.
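Both problems show up after only a few iterations on three-page graphs. The sketch below uses the two matrices from the deck's spider-trap and dead-end examples (pages y, a, m) with exact fractions; the helper name `iterate` is an illustrative assumption.

```python
from fractions import Fraction as F

def iterate(m, r, steps):
    """Apply r <- M r a fixed number of times (exact arithmetic)."""
    n = len(m)
    for _ in range(steps):
        r = [sum(m[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

half = F(1, 2)
# Rows and columns are ordered (y, a, m); column i holds the out-links of page i.
M_trap = [[half, half, 0], [half, 0, 0], [0, half, 1]]  # m links only to itself
M_dead = [[half, half, 0], [half, 0, 0], [0, half, 0]]  # m has no out-links
start = [F(1, 3)] * 3

trap3 = iterate(M_trap, start, 3)  # rank drifts toward m, total stays 1
dead3 = iterate(M_dead, start, 3)  # total rank shrinks below 1
print([str(x) for x in trap3], sum(trap3))
print([str(x) for x in dead3], sum(dead3))
```

In the spider-trap case the total stays at 1 while m's share keeps growing; in the dead-end case the total itself decays, because the column for m sums to 0 instead of 1.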

26 Example: m is a spider trap (rows and columns of M ordered y, a, m):
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    0
 m    0   1/2   1
Iteration:   0     1     2     3     ...   limit
 r_y        1/3   2/6   3/12  5/24   ...   0
 r_a        1/3   1/6   2/12  3/24   ...   0
 r_m        1/3   3/6   7/12  16/24  ...   1
All the PageRank score gets trapped in node m.

27 The Google solution for spider traps: at each time step, the random surfer has two options:
With probability $\beta$, follow a link at random.
With probability $1 - \beta$, jump to some random page.
Common values for $\beta$ are in the range 0.8 to 0.9.
The surfer will teleport out of a spider trap within a few time steps.

28 Example: m is a dead end (rows and columns of M ordered y, a, m):
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    0
 m    0   1/2   0
Iteration:   0     1     2     3     ...   limit
 r_y        1/3   2/6   3/12  5/24   ...   0
 r_a        1/3   1/6   2/12  3/24   ...   0
 r_m        1/3   1/6   1/12  2/24   ...   0
Here the PageRank leaks out since the matrix is not stochastic.

29 Teleports: follow random teleport links with probability 1.0 from dead ends, and adjust the matrix accordingly.
Before (m is a dead end):
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    0
 m    0   1/2   0
After (the column for m is replaced by 1/3 in every row):
      y    a    m
 y   1/2  1/2  1/3
 a   1/2   0   1/3
 m    0   1/2  1/3
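The matrix adjustment for dead ends amounts to replacing each all-zero column by a uniform column. A minimal sketch, assuming a dense matrix stored as a list of rows; `patch_dead_ends` is a hypothetical helper name.

```python
from fractions import Fraction as F

def patch_dead_ends(m):
    """Replace every all-zero column (a dead end) with the uniform column 1/N."""
    n = len(m)
    patched = [row[:] for row in m]  # leave the input untouched
    for i in range(n):
        if all(m[j][i] == 0 for j in range(n)):
            for j in range(n):
                patched[j][i] = F(1, n)
    return patched

half = F(1, 2)
# Rows and columns ordered (y, a, m); m has no out-links.
M_dead = [[half, half, 0], [half, 0, 0], [0, half, 0]]
M_fixed = patch_dead_ends(M_dead)
for row in M_fixed:
    print([str(x) for x in row])
```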

30 Why are dead ends and spider traps a problem, and why do teleports solve the problem?
Spider traps are not a problem as such (the matrix is still column stochastic), but with traps the PageRank scores are not what we want. Solution: never get stuck in a spider trap, by teleporting out of it in a finite number of steps.
Dead ends are a problem: the matrix is not column stochastic, so our initial assumptions are not met. Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go.

31 Google's solution that does it all: at each step, the random surfer has two options:
With probability $\beta$, follow a link at random.
With probability $1 - \beta$, jump to some random page.
PageRank equation [Brin-Page, 98]: $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$, where $d_i$ is the out-degree of node i.
This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.

32 PageRank equation [Brin-Page, 98]: $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$
The Google matrix A: $A = \beta M + (1 - \beta) \left[\frac{1}{N}\right]_{N \times N}$, where $[1/N]_{N \times N}$ is the N by N matrix in which all entries are 1/N.
We have a recursive problem: $r = A \cdot r$. And the power method still works!
What is $\beta$? In practice $\beta$ = 0.8 or 0.9 (make 5 steps on average, then jump).

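33 Example with $\beta = 0.8$ on the spider-trap graph (rows and columns ordered y, a, m): $A = 0.8 \cdot M + 0.2 \cdot [1/N]_{N \times N}$
M:
 1/2  1/2   0
 1/2   0    0
  0   1/2   1
[1/N]_{N x N}:
 1/3  1/3  1/3
 1/3  1/3  1/3
 1/3  1/3  1/3
A:
 7/15  7/15   1/15
 7/15  1/15   1/15
 1/15  7/15  13/15
Starting from $r_y = r_a = r_m = 1/3$, power iteration with A converges to $r_y = 7/33$, $r_a = 5/33$, $r_m = 21/33$.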
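The entries of A and the limiting vector can be verified exactly. A small sketch, assuming $\beta = 4/5$ (the value implied by the 7/15 and 13/15 entries) and a hypothetical helper `google_matrix`:

```python
from fractions import Fraction as F

def google_matrix(m, beta):
    """Return A = beta * M + (1 - beta) * [1/N], entry by entry."""
    n = len(m)
    return [[beta * m[j][i] + (1 - beta) * F(1, n) for i in range(n)]
            for j in range(n)]

half = F(1, 2)
# Spider-trap graph, rows and columns ordered (y, a, m).
M = [[half, half, 0], [half, 0, 0], [0, half, 1]]
A = google_matrix(M, F(4, 5))

# Check that (7/33, 5/33, 21/33) is a fixed point of r = A r.
r = [F(7, 33), F(5, 33), F(21, 33)]
A_r = [sum(A[j][i] * r[i] for i in range(3)) for j in range(3)]
print([str(x) for x in A_r])
```

Since A is dense (every entry is non-zero), real implementations avoid building it and apply the teleport term separately.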

34 Input: directed graph G (can have spider traps and dead ends) and parameter $\beta$
Output: PageRank vector $r^{new}$
Set: $r_j^{old} = \frac{1}{N}$
Repeat until convergence, i.e. while $\sum_j |r_j^{new} - r_j^{old}| > \varepsilon$:
  $\forall j$: $r'^{new}_j = \sum_{i \to j} \beta \frac{r_i^{old}}{d_i}$, with $r'^{new}_j = 0$ if the in-degree of j is 0
  Now re-insert the leaked PageRank: $\forall j$: $r_j^{new} = r'^{new}_j + \frac{1 - S}{N}$, where $S = \sum_j r'^{new}_j$
  $r^{old} = r^{new}$
If the graph has no dead ends, then the amount of leaked PageRank is $1 - \beta$. But since we have dead ends, the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
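The complete algorithm can be sketched over an adjacency list, so that the dense matrix A is never built. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `pagerank`, the default tolerance, and the iteration cap are made up, and every link target is assumed to appear as a key of `out_links`.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """PageRank with teleports; out_links maps each node to the list of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    r_old = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # r'_j = sum over in-links i -> j of beta * r_i / d_i (stays 0 with no in-links).
        r_new = {v: 0.0 for v in nodes}
        for i, targets in out_links.items():
            for j in targets:
                r_new[j] += beta * r_old[i] / len(targets)
        # Re-insert the leaked mass 1 - S evenly; this covers teleports and dead ends.
        leaked = 1.0 - sum(r_new.values())
        for v in nodes:
            r_new[v] += leaked / n
        if sum(abs(r_new[v] - r_old[v]) for v in nodes) < eps:
            return r_new
        r_old = r_new
    return r_old

trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}  # m is a spider trap
dead = {"y": ["y", "a"], "a": ["y", "m"], "m": []}     # m is a dead end
r_trap = pagerank(trap)
r_dead = pagerank(dead)
print(r_trap)
print(r_dead)
```

On the spider-trap graph the result should agree with the 7/33, 5/33, 21/33 example, and on the dead-end graph the scores still sum to 1 because the leaked mass is redistributed in every iteration.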

35 Measures generic popularity of a page: biased against topic-specific authorities. Solution: Topic-Specific PageRank.
Uses a single measure of importance, while there are other models of importance. Solution: Hubs-and-Authorities.
Susceptible to link spam: artificial link topographies created in order to boost page rank. Solution: TrustRank.
