
Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org

The big picture of data mining topics:
  High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
  Graph data: PageRank, SimRank; Community Detection; Spam Detection
  Infinite data: Filtering data streams; Web advertising; Queries on streams
  Machine learning: SVM; Decision Trees; Perceptron, kNN
  Apps: Recommender systems; Association Rules; Duplicate document detection

Facebook social graph: 4 degrees of separation [Backstrom Boldi Rosa Ugander Vigna, 2011]

Connections between political blogs: polarization of the network [Adamic Glance, 2005]

Citation networks and maps of science [Börner et al., 2012]

The Internet is itself a graph: routers and domains connected by physical links.

Seven Bridges of Königsberg [Euler, 1735]: return to the starting point by traveling each link of the graph once and only once.

Web as a directed graph: nodes are webpages, edges are hyperlinks. (Example in the figure: a personal page mentioning CS224W links to the Computer Science Department page, which links to Stanford University.)


How to organize the Web?
First try: human-curated Web directories: Yahoo, DMOZ, LookSmart.
Second try: Web Search. Information Retrieval investigates: find relevant docs in a small and trusted set (newspaper articles, patents, etc.).
But: the Web is huge, full of untrusted documents, random things, web spam, etc.

2 challenges of web search:
(1) The Web contains many sources of information. Who to trust? Trick: trustworthy pages may point to each other!
(2) What is the best answer to the query "newspaper"? No single right answer. Trick: pages that actually know about newspapers might all be pointing to many newspapers.

All web pages are not equally important: www.joe-schmoe.com vs. www.stanford.edu. There is large diversity in the web graph's node connectivity. Let's rank the pages by the link structure!

We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
- PageRank
- Topic-Specific (Personalized) PageRank
- Web Spam Detection Algorithms

Idea: links as votes. A page is more important if it has more links. Incoming links? Outgoing links? Think of in-links as votes: www.stanford.edu has 23,400 in-links, www.joe-schmoe.com has 1 in-link. Are all in-links equal? Links from important pages count more. Recursive question!

Example PageRank scores from the figure: A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9; each remaining page 1.6.

Each link's vote is proportional to the importance of its source page. If page j with importance r_j has n out-links, each link gets r_j / n votes. Page j's own importance is the sum of the votes on its in-links. Example: if pages i and k point to j, with i having 3 out-links and k having 4, then r_j = r_i/3 + r_k/4; j in turn passes r_j/3 along each of its own 3 out-links.

A vote from an important page is worth more. A page is important if it is pointed to by other important pages. Define a rank r_j for page j:

  r_j = Σ_{i→j} r_i / d_i,   where d_i is the out-degree of node i

Example ("the web in 1839"): three pages y, a, m, where y links to itself and to a, a links to y and to m, and m links to a. Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2

Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
3 equations, 3 unknowns, no constants: no unique solution, and all solutions are equivalent modulo a scale factor. An additional constraint forces uniqueness: r_y + r_a + r_m = 1, giving r_y = 2/5, r_a = 2/5, r_m = 1/5. The Gaussian elimination method works for small examples, but we need a better method for large, web-size graphs. We need a new formulation!
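As a quick check, the constrained system above can be solved directly for this small example. A minimal sketch using NumPy; stacking the normalization constraint onto (M − I)r = 0 and calling least squares is just one convenient way to impose uniqueness:

```python
import numpy as np

# Flow equations for the 3-page example (y, a, m):
#   r_y = r_y/2 + r_a/2
#   r_a = r_y/2 + r_m
#   r_m = r_a/2
# i.e. (M - I) r = 0, plus the constraint r_y + r_a + r_m = 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# Stack the normalization constraint under (M - I) r = 0 and solve
# the over-determined (but consistent) system by least squares.
A = np.vstack([M - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # approximately [0.4, 0.4, 0.2]
```

This reproduces the unique normalized solution r_y = r_a = 2/5, r_m = 1/5.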

Stochastic adjacency matrix M: let page i have d_i out-links. If i → j, then M_ji = 1/d_i, else M_ji = 0. M is a column-stochastic matrix: its columns sum to 1. Rank vector r: a vector with an entry per page; r_j is the importance score of page j, and Σ_j r_j = 1. The flow equations r_j = Σ_{i→j} r_i / d_i can then be written r = M · r.

Remember the flow equation: r_j = Σ_{i→j} r_i / d_i. Flow equation in matrix form: r = M · r. Suppose page i links to 3 pages, including j; then M_ji = 1/3, and the product M · r contributes r_i/3 to r_j.

The flow equations can be written r = M · r. So the rank vector r is an eigenvector of the stochastic web matrix M; in fact, its first or principal eigenvector, with corresponding eigenvalue 1. (NOTE: x is an eigenvector of M with corresponding eigenvalue λ if M x = λ x.) The largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries). We can now efficiently solve for r! The method is called Power iteration.

Example graph (y, a, m):

  M =    y    a    m
    y   1/2  1/2   0
    a   1/2   0    1
    m    0   1/2   0

r = M · r, i.e.:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2

Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks, Power iteration is a simple iterative scheme:
  Initialize: r^(0) = [1/N, ..., 1/N]^T
  Iterate: r^(t+1) = M · r^(t), i.e. r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i, where d_i is the out-degree of node i
  Stop when |r^(t+1) − r^(t)|_1 < ε
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used.
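The scheme above can be sketched in a few lines. A minimal NumPy version run on the y/a/m example; the function name `power_iteration` is illustrative, not from the slides:

```python
import numpy as np

def power_iteration(M, eps=1e-10):
    """Iterate r <- M r until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)          # r(0) = [1/N, ..., 1/N]^T
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# Column-stochastic matrix of the y/a/m example
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iteration(M)
print(r)  # approximately [0.4, 0.4, 0.2], i.e. [6/15, 6/15, 3/15]
```

This matches the fixed point of the flow equations from the previous slide.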

Power Iteration: set r_j^(0) = 1/N; (1) r_j^(t+1) = Σ_{i→j} r_i^(t)/d_i; (2) go to 1. Example on the y/a/m graph (r_y = r_y/2 + r_a/2, r_a = r_y/2 + r_m, r_m = r_a/2):

  r_y:  1/3  1/3  5/12   9/24  ...  6/15
  r_a:  1/3  3/6  1/3   11/24  ...  6/15
  r_m:  1/3  1/6  3/12   1/6   ...  3/15
  Iteration: 0, 1, 2, 3, ...

Details! Power iteration: a method for finding the dominant eigenvector (the eigenvector corresponding to the largest eigenvalue). Claim: the sequence M r^(0), M² r^(0), ..., M^k r^(0), ... approaches the dominant eigenvector of M.

Details! Claim: the sequence M r^(0), M² r^(0), ..., M^k r^(0), ... approaches the dominant eigenvector of M.
Proof: Assume M has n linearly independent eigenvectors x_1, x_2, ..., x_n, with corresponding eigenvalues λ_1 > λ_2 ≥ ... ≥ λ_n. The vectors x_1, ..., x_n form a basis, and thus we can write:
  r^(0) = c_1 x_1 + c_2 x_2 + ... + c_n x_n
Repeated multiplication on both sides produces:
  M^k r^(0) = c_1 (λ_1)^k x_1 + c_2 (λ_2)^k x_2 + ... + c_n (λ_n)^k x_n

Details! Proof (continued): Repeated multiplication on both sides produces:
  M^k r^(0) = (λ_1)^k [c_1 x_1 + c_2 (λ_2/λ_1)^k x_2 + ... + c_n (λ_n/λ_1)^k x_n]
Since λ_1 > λ_2, the fractions λ_2/λ_1, λ_3/λ_1, ... are less than 1 in magnitude, and so (λ_i/λ_1)^k → 0 as k → ∞ (for all i = 2, ..., n). Thus M^k r^(0) ≈ c_1 (λ_1)^k x_1. Note: if c_1 = 0 then the method won't converge.
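The ratio |λ_2|/|λ_1| also governs the convergence speed, and this can be checked numerically. An illustrative sketch (not from the slides) on a small symmetric matrix whose eigenvalues are 3 and 1, so the error should shrink by roughly 1/3 per step:

```python
import numpy as np

# Symmetric matrix chosen so the eigenvectors are orthogonal and the
# eigenvalues (3 and 1) are known exactly.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(A)
x1 = eigvecs[:, -1]               # dominant eigenvector (eigenvalue 3)

x = np.array([1.0, 0.0])          # starting vector with c_1 != 0
errors = []
for _ in range(10):
    x = A @ x
    x = x / np.linalg.norm(x)     # normalize to avoid overflow
    # distance to the dominant eigenvector, up to sign
    errors.append(min(np.linalg.norm(x - x1), np.linalg.norm(x + x1)))

ratios = [errors[i + 1] / errors[i] for i in range(5)]
print(ratios)  # each ratio is close to 1/3 = |lambda_2| / |lambda_1|
```

The per-step error ratio settles near λ_2/λ_1 = 1/3, as the proof predicts.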

Pages i_1, i_2, i_3 each link to page j; j collects a share of each source's rank:
  r_j = Σ_{i→j} r_i / d_out(i)

Where is the surfer at time t+1? It follows a link uniformly at random from its current page. Let p(t) be the probability distribution of the surfer's position at time t; then p(t+1) = M · p(t). Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk. Our original rank vector r satisfies r = M · r, so r is a stationary distribution for the random walk.
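The random-surfer reading can be checked by simulation: the fraction of time a long random walk spends on each page of the y/a/m example should approach the stationary distribution (2/5, 2/5, 1/5). A sketch; the step count and seed are arbitrary choices:

```python
import random
from collections import Counter

# Out-links of the y/a/m example: y -> {y, a}, a -> {y, m}, m -> {a}
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}

random.seed(0)
page = 'y'
visits = Counter()
steps = 200_000
for _ in range(steps):
    page = random.choice(out_links[page])  # follow a link uniformly at random
    visits[page] += 1

freq = {p: visits[p] / steps for p in 'yam'}
print(freq)  # roughly {'y': 0.4, 'a': 0.4, 'm': 0.2}
```

The empirical visit frequencies agree with the power-iteration result on the same graph.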

A central result from the theory of random walks (a.k.a. Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0 33

  r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i,  or equivalently  r = M · r
Does this converge? Does it converge to what we want? Are the results reasonable?

Example (two nodes, a → b and b → a), iterating r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i:
  r_a:  1  0  1  0  ...
  r_b:  0  1  0  1  ...
  Iteration: 0, 1, 2, ...
The sequence oscillates: it does not converge.

Example (a → b, where b has no out-links), iterating r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i:
  r_a:  1  0  0  0  ...
  r_b:  0  1  0  0  ...
  Iteration: 0, 1, 2, ...
All the importance leaks out: the sequence converges to the zero vector.

2 problems:
(1) Some pages are dead ends (have no out-links). The random walk has nowhere to go, and such pages cause importance to leak out.
(2) Spider traps (all out-links are within the group). The random walker gets stuck in the trap, and eventually spider traps absorb all the importance.

Power Iteration (set r_j = 1/N and iterate) on a graph where m is a spider trap (y links to y and a; a links to y and m; m links only to itself):

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   1

  r_y = r_y/2 + r_a/2
  r_a = r_y/2
  r_m = r_a/2 + r_m

Example:
  r_y:  1/3  2/6  3/12   5/24  ...  0
  r_a:  1/3  1/6  2/12   3/24  ...  0
  r_m:  1/3  3/6  7/12  16/24  ...  1
  Iteration: 0, 1, 2, 3, ...
All the PageRank score gets trapped in node m.

The Google solution for spider traps: at each time step, the random surfer has two options:
  With probability β, follow an out-link at random.
  With probability 1 − β, teleport (i.e., jump) to some random page.
Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

Power Iteration (set r_j = 1/N and iterate) on a graph where m is a dead end (y links to y and a; a links to y and m; m has no out-links):

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   0

  r_y = r_y/2 + r_a/2
  r_a = r_y/2
  r_m = r_a/2

Example:
  r_y:  1/3  2/6  3/12  5/24  ...  0
  r_a:  1/3  1/6  2/12  3/24  ...  0
  r_m:  1/3  1/6  1/12  2/24  ...  0
  Iteration: 0, 1, 2, 3, ...
Here the PageRank leaks out since the matrix is not column stochastic.

Teleports: follow random teleport links with probability 1.0 from dead ends, and adjust the matrix accordingly. For the dead end m:

  Before:                After:
       y    a    m            y    a    m
  y   1/2  1/2   0       y   1/2  1/2  1/3
  a   1/2   0    0       a   1/2   0   1/3
  m    0   1/2   0       m    0   1/2  1/3

Why are dead ends and spider traps a problem and why do teleports solve the problem? Spider traps are not a problem, but with traps PageRank scores are not what we want Solution: Never get stuck in a spider trap, by teleporting out of it in a finite number of steps Dead ends are a problem The matrix is not column stochastic so our initial assumptions are not met Solution: Make matrix column stochastic, by always teleporting from a dead end (where there is nowhere else to go) 43

Google's solution that does it all: at each step, the random surfer has two options:
  With probability β, follow a link at random.
  With probability 1 − β, jump to some random page.
PageRank equation [Brin-Page, '98]:
  r_j = Σ_{i→j} β r_i / d_i + (1 − β) 1/N
where d_i is the out-degree of node i. This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends, or explicitly follow random teleport links with probability 1.0 from dead ends.

PageRank equation [Brin-Page, '98]:
  r_j = Σ_{i→j} β r_i / d_i + (1 − β) 1/N
The Google Matrix A:
  A = β M + (1 − β) [1/N]_{N×N}
where [1/N]_{N×N} is the N-by-N matrix with all entries 1/N. We have a recursive problem: r = A · r, and the Power method still works! What is β? In practice β = 0.8 or 0.9 (make 5 steps on average, then jump).
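The Google matrix can be formed and iterated directly for small graphs. A sketch with NumPy, using the spider-trap example from earlier (y → {y, a}, a → {y, m}, m → {m}) and β = 0.8; `google_matrix` is an illustrative name:

```python
import numpy as np

def google_matrix(M, beta=0.8):
    """A = beta * M + (1 - beta) * [1/N]_{NxN}; M column stochastic, no dead ends."""
    N = M.shape[0]
    return beta * M + (1.0 - beta) / N * np.ones((N, N))

# Spider-trap example: m links only to itself
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
A = google_matrix(M, beta=0.8)

r = np.full(3, 1.0 / 3.0)
for _ in range(100):
    r = A @ r                      # plain power iteration on A
print(r)  # approximately [7/33, 5/33, 21/33]
```

Teleportation prevents m from absorbing all the score: the trap node ends up important (21/33) but not dominant.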

Example with β = 0.8 (m is a spider trap):

  A = 0.8 · M + 0.2 · [1/3]_{3×3}

       y    a    m             y     a     m
  M:  1/2  1/2   0      A:   7/15  7/15  1/15
      1/2   0    0           7/15  1/15  1/15
       0   1/2   1           1/15  7/15  13/15

  r_y:  1/3  0.33  0.24  0.26  ...  7/33
  r_a:  1/3  0.20  0.20  0.18  ...  5/33
  r_m:  1/3  0.46  0.52  0.56  ...  21/33
  Iteration: 0, 1, 2, 3, ...

Key step is the matrix-vector multiplication r_new = A · r_old. Easy if we have enough main memory to hold A, r_old, r_new. But say N = 1 billion pages and we need 4 bytes for each entry: that is 2 billion entries for the two vectors, approx 8 GB, while the matrix A has N² = 10^18 entries — a very large number!

  A = β · M + (1 − β) [1/N]_{N×N}, e.g. with β = 0.8:

        1/2  1/2   0          1/3  1/3  1/3       7/15  7/15  1/15
  0.8 · 1/2   0    0  + 0.2 · 1/3  1/3  1/3   =   7/15  1/15  1/15
         0   1/2   1          1/3  1/3  1/3       1/15  7/15  13/15

Suppose there are N pages. Consider page i, with d_i out-links. We have M_ji = 1/d_i when i → j, and M_ji = 0 otherwise. The random teleport is equivalent to:
  Adding a teleport link from i to every other page, with transition probability (1 − β)/N.
  Reducing the probability of following each out-link from 1/d_i to β/d_i.
Equivalently: tax each page a fraction (1 − β) of its score and redistribute it evenly.
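The "tax and redistribute" view can be verified numerically: one step with the dense Google matrix equals one β-weighted step with M plus an even redistribution of the taxed mass. A small sketch (the test vector is arbitrary but sums to 1, which the equivalence relies on):

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]
A = beta * M + (1.0 - beta) / N * np.ones((N, N))

r = np.array([0.5, 0.3, 0.2])                  # any distribution summing to 1
step_dense = A @ r                             # full Google-matrix step
step_taxed = beta * (M @ r) + (1.0 - beta) / N # tax + even redistribution
print(np.allclose(step_dense, step_taxed))     # True
```

This is exactly why the dense [1/N] block never needs to be stored: since Σ_j r_j = 1, its contribution collapses to the constant (1 − β)/N.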

A = β M + (1 − β) [1/N]_{N×N}. So we get:
  r = A · r = β M · r + [(1 − β)/N]_N
since Σ_j r_j = 1. ([x]_N denotes a vector of length N with all entries equal to x.) Note: here we assumed M has no dead ends.

We just rearranged the PageRank equation:
  r = β M · r + [(1 − β)/N]_N
where [(1 − β)/N]_N is a vector with all N entries equal to (1 − β)/N. M is a sparse matrix (with no dead ends): with roughly 10 links per node, it has approx 10N entries. So in each iteration we need to:
  Compute r_new = β M · r_old
  Add the constant value (1 − β)/N to each entry in r_new
Note: if M contains dead ends then Σ_j r_new_j < 1, and we also have to renormalize r_new so that it sums to 1.
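The sparse form above never materializes the dense Google matrix. A sketch using adjacency lists (assuming no dead ends; `pagerank_sparse` is an illustrative name), run on the spider-trap example so the result can be compared with the dense computation:

```python
import numpy as np

def pagerank_sparse(out_links, beta=0.8, eps=1e-10):
    """Iterate r_new = beta * M @ r_old + (1-beta)/N using only the edge lists."""
    N = len(out_links)
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, dests in out_links.items():
            for j in dests:                    # each out-link of i carries
                r_new[j] += beta * r[i] / len(dests)  # beta * r_i / d_i
        r_new += (1.0 - beta) / N              # teleport mass, spread evenly
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# y=0, a=1, m=2; m is a spider trap (links only to itself, no dead ends)
out_links = {0: [0, 1], 1: [0, 2], 2: [2]}
r = pagerank_sparse(out_links)
print(r)  # approximately [7/33, 5/33, 21/33]
```

The answer matches the dense Google-matrix iteration, at ~10N work per pass instead of N².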

Complete algorithm.
Input: directed graph G (can have spider traps and dead ends) and parameter β.
Output: PageRank vector r.
  Set r_j^(0) = 1/N, t = 1.
  Repeat until convergence (Σ_j |r_j^(t) − r_j^(t−1)| < ε):
    ∀j: r'_j^(t) = Σ_{i→j} β r_i^(t−1) / d_i   (r'_j^(t) = 0 if the in-degree of j is 0)
    Now re-insert the leaked PageRank:
    ∀j: r_j^(t) = r'_j^(t) + (1 − S)/N,  where S = Σ_j r'_j^(t)
    t = t + 1
NOTE: If the graph has no dead ends then the amount of leaked PageRank is 1 − β. Since we have dead ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
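The complete algorithm can be sketched directly from the pseudocode above. Dead ends simply have no entry in the adjacency lists, and the leaked mass 1 − S is re-inserted evenly each pass (`pagerank` is an illustrative name; the 3-node test graph is a made-up example with one dead end):

```python
import numpy as np

def pagerank(out_links, N, beta=0.8, eps=1e-10):
    """PageRank tolerating dead ends: re-insert the leaked mass (1 - S)/N."""
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, dests in out_links.items():
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        S = r_new.sum()               # may be < 1: teleport tax + dead-end leak
        r_new += (1.0 - S) / N        # re-insert the leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# Node 2 is a dead end: it has no out-links, so it is absent from out_links.
out_links = {0: [1, 2], 1: [0, 2]}
r = pagerank(out_links, N=3)
print(r, r.sum())  # scores sum to 1 despite the dead end
```

Because the correction computes S explicitly, no preprocessing to remove dead ends is needed.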

Encode the sparse matrix using only its nonzero entries. Space is proportional roughly to the number of links: say 10N, or 4 * 10 * 1 billion = 40 GB. Still won't fit in memory, but will fit on disk.

  source node | degree | destination nodes
       0      |   3    | 1, 5, 7
       1      |   5    | 17, 64, 113, 117, 245
       2      |   2    | 13, 23

Assume enough RAM to fit r_new into memory; store r_old and matrix M on disk. One step of power iteration is then:
  Initialize all entries of r_new = (1 − β)/N
  For each page j (of out-degree d_j):
    Read into memory: j, d_j, dest_1, ..., dest_{d_j}, r_old(j)
    For i = 1...d_j: r_new(dest_i) += β · r_old(j) / d_j

  source | degree | destination
    0    |   3    | 1, 5, 6
    1    |   4    | 17, 64, 113, 117
    2    |   2    | 13, 23
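The disk-based step above can be sketched by streaming the edge file line by line, holding only r_new in memory. A toy sketch: the "src degree dest1 dest2 ..." line layout and the in-memory r_old dict are stand-ins for the on-disk structures (in the real setting r_old would be streamed too):

```python
import io

# One out-of-core power-iteration step. disk_M simulates the sequential
# file of M; each line is "src degree dest1 dest2 ..." (hypothetical layout).
BETA = 0.8
N = 3
disk_M = io.StringIO("0 2 1 2\n"
                     "1 2 0 2\n"
                     "2 1 2\n")
r_old = {0: 1/3, 1: 1/3, 2: 1/3}      # stand-in for the on-disk r_old

r_new = [(1.0 - BETA) / N] * N        # initialize with the teleport share
for line in disk_M:                   # single sequential scan of M
    fields = line.split()
    j, d_j = int(fields[0]), int(fields[1])
    for dest in fields[2:]:
        r_new[int(dest)] += BETA * r_old[j] / d_j
print(r_new)
```

Each step reads M and r_old once, sequentially, which is exactly the 2|r| + |M| per-iteration cost discussed next.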

Assume enough RAM to fit r_new into memory; store r_old and matrix M on disk. In each iteration, we have to read r_old and M, and write r_new back to disk. Cost per iteration of the Power method: 2|r| + |M|. Question: what if we could not even fit r_new in memory?

Break r_new into k blocks that fit in memory; scan M and r_old once for each block.

  src | degree | destination
   0  |   4    | 0, 1, 3, 5
   1  |   2    | 0, 5
   2  |   2    | 3, 4

(In the figure, r_new with entries 0..5 is split into blocks such as [0, 1], [2, 3], [4, 5]; M and r_old span all entries 0..5.)

Similar to a nested-loop join in databases: break r_new into k blocks that fit in memory, and scan M and r_old once for each block. Total cost: k scans of M and r_old. Cost per iteration of the Power method: k(|M| + |r|) + |r| = k|M| + (k+1)|r|. Can we do better? Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.

Break M into stripes! Each stripe contains only the destination nodes in the corresponding block of r_new:

  Stripe for block [0, 1]:     Stripe for block [2, 3]:     Stripe for block [4, 5]:
  src | degree | destination   src | degree | destination   src | degree | destination
   0  |   4    | 0, 1           0  |   4    | 3              0  |   4    | 5
   1  |   3    | 0              2  |   2    | 3              1  |   3    | 5
   2  |   2    | 1                                           2  |   2    | 4

Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new. There is some additional overhead per stripe (we must read the out-degrees in each block), but it is usually worth it. Cost per iteration of the Power method: |M|(1 + ε) + (k+1)|r|.

Problems with PageRank:
  Measures the generic popularity of a page, so it is biased against topic-specific authorities. Solution: Topic-Specific PageRank.
  Uses a single measure of importance, while there are other models of importance. Solution: HITS (hubs and authorities).
  Susceptible to link spam: artificial link topologies created in order to boost page rank. Solution: TrustRank.