Link Analysis. Stony Brook University CSE545, Fall 2016


The Web, circa 1998

Two ways to find pages at the time:
- Match keywords and language (information retrieval): easy to game with term spam.
- Explore a directory: time-consuming, and not open-ended.

Enter PageRank...

Key Idea: Consider the citations of a website in addition to its keywords. Who links to it? And what are their citations? This treats the Web as a directed graph.

Innovation 1, the Flow Model: treat in-links (citations) as votes. But citations from important pages should count more, so use recursion to figure out whether each page is important.

Innovation 2: Judge a page not just by its own terms but by the terms used by the pages that cite it.

How to compute? Each page j has an importance (i.e. rank) r_j. If n_j is the number of out-links of page j, each out-link carries r_j / n_j votes, so

    r_j = Σ_{i → j} r_i / n_i

Example with pages A, B, C, D: suppose A, B, and C all link to D, where A has 1 out-link, B has 4, and C has 2. Then

    r_D = r_A/1 + r_B/4 + r_C/2

Writing one such equation per page gives a system of equations. That provides intuition, but is impractical to solve directly at scale.
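To make the "system of equations" view concrete, here is a minimal numpy sketch (an illustration, not from the slides) that solves the flow equations exactly for the four-node graph whose transition matrix M is defined on the next slide. A direct solve like this costs O(N³), which is why it does not scale to the web.

    import numpy as np

    # Transition matrix for the four-node example (rows = "to", columns = "from").
    M = np.array([
        [0,   1/2, 1, 0  ],
        [1/3, 0,   0, 1/2],
        [1/3, 0,   0, 1/2],
        [1/3, 1/2, 0, 0  ],
    ])

    N = M.shape[0]
    # r = M r  <=>  (M - I) r = 0, plus the normalization sum(r) = 1.
    A = M - np.eye(N)
    A[-1, :] = 1.0        # replace one redundant equation with sum(r) = 1
    b = np.zeros(N)
    b[-1] = 1.0

    r = np.linalg.solve(A, b)
    print(r)              # [1/3, 2/9, 2/9, 2/9]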

For the four-node graph, build the column-stochastic Transition Matrix, M: column j describes where page j's links go, giving probability 1/n_j to each of j's n_j out-links.

    to \ from    A      B      C      D
    A            0      1/2    1      0
    B            1/3    0      0      1/2
    C            1/3    0      0      1/2
    D            1/3    1/2    0      0

To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
After the first iteration:  M r = [3/8, 5/24, 5/24, 5/24]
After the second iteration: M(M r) = M² r = [15/48, 11/48, 11/48, 11/48]
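The same two steps, as a quick numpy check (a sketch under the matrix above):

    import numpy as np

    M = np.array([
        [0,   1/2, 1, 0  ],
        [1/3, 0,   0, 1/2],
        [1/3, 0,   0, 1/2],
        [1/3, 1/2, 0, 0  ],
    ])

    r = np.full(4, 1/4)   # start uniform: [1/4, 1/4, 1/4, 1/4]
    r = M @ r             # [3/8, 5/24, 5/24, 5/24]
    r = M @ r             # [15/48, 11/48, 11/48, 11/48]
    print(r)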

Power iteration algorithm:

    Initialize: r[0] = [1/N, ..., 1/N], r[-1] = [0, ..., 0]
    t = 0
    while err_norm(r[t], r[t-1]) > min_err:
        r[t+1] = M r[t]
        t += 1
    solution = r[t]

where err_norm(v1, v2) = ||v1 - v2||_1  (the L1 norm).
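A runnable version of the pseudocode above, as a minimal numpy sketch (the function name and tolerance defaults are illustrative):

    import numpy as np

    def power_iteration(M, min_err=1e-10, max_iter=1000):
        """Iterate r <- M r until the L1 change falls below min_err."""
        n = M.shape[0]
        r = np.full(n, 1.0 / n)       # r[0] = [1/N, ..., 1/N]
        for _ in range(max_iter):
            r_next = M @ r
            if np.abs(r_next - r).sum() < min_err:   # L1 norm of the change
                return r_next
            r = r_next
        return r

    M = np.array([
        [0,   1/2, 1, 0  ],
        [1/3, 0,   0, 1/2],
        [1/3, 0,   0, 1/2],
        [1/3, 1/2, 0, 0  ],
    ])
    print(power_iteration(M))   # ~[1/3, 2/9, 2/9, 2/9]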

As err_norm gets smaller we are moving toward r = M r: the power iteration algorithm finds the principal eigenvector of M. Recall that x is an eigenvector of A, with eigenvalue λ, if A x = λ x. Here λ = 1, since the columns of M sum to 1; thus 1·r = M r, and r is exactly such an eigenvector.
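One way to check the eigenvector claim directly, as a sketch with numpy (fine at this toy scale, though a full eigendecomposition is exactly what power iteration avoids at web scale):

    import numpy as np

    M = np.array([
        [0,   1/2, 1, 0  ],
        [1/3, 0,   0, 1/2],
        [1/3, 0,   0, 1/2],
        [1/3, 1/2, 0, 0  ],
    ])

    vals, vecs = np.linalg.eig(M)
    k = np.argmax(vals.real)     # the leading eigenvalue, which should be 1
    r = vecs[:, k].real
    r = r / r.sum()              # normalize so the entries sum to 1
    print(vals[k].real, r)       # 1.0, ~[1/3, 2/9, 2/9, 2/9]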

Random-surfer view: where is the surfer at time t+1? p(t+1) = M p(t), where p(t) gives the probability of being at each node at time t. Suppose p(t+1) = p(t); then p(t) is a stationary distribution of the random walk. Thus, r is a stationary distribution.

This is a 1st-order Markov process, which comes with a rich probabilistic theory. One finding: the random walk has a unique stationary distribution (and power iteration converges to it) if the chain is:

- stochastic: columns sum to 1,
- aperiodic: the same node doesn't repeat at a regular interval, and
- irreducible: there is a non-zero chance of eventually reaching any other node.

In graph terms, two structures break these conditions:

- Dead ends: a node with no out-links can't propagate its rank.
- Spider traps: a set of nodes with no way out, which absorbs all the rank.

Dead-end example (see the sketch after this table):

    to \ from    A      B      C      D
    A            0      0      1      0
    B            1/3    0      0      1
    C            1/3    0      0      0
    D            1/3    0      0      0

B has no out-links (its column is all zeros), so rank leaks out of the system at every step and r eventually converges to [0, 0, 0, 0].
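A quick numpy sketch of the leak: iterating with the dead-end matrix drives the total rank mass, and hence every entry, toward zero.

    import numpy as np

    # Dead-end example: column B is all zeros, so M is not column-stochastic.
    M = np.array([
        [0,   0, 1, 0],
        [1/3, 0, 0, 1],
        [1/3, 0, 0, 0],
        [1/3, 0, 0, 0],
    ])

    r = np.full(4, 1/4)
    for _ in range(50):
        r = M @ r
    print(r, r.sum())   # all entries (and the total rank mass) shrink toward 0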

The Google PageRank Formulation: handle dead ends and spider traps by adding teleportation at every node. At each step the surfer makes one of two choices:

1. Follow a random out-link (with probability β ≈ 0.85).
2. Teleport to a random node (with probability 1 - β).

From a dead end, teleport with probability 1. In the dead-end example, B's all-zero column is replaced by ¼ in every row:

    to \ from    A      B      C      D
    A            0      ¼      1      0
    B            1/3    ¼      0      1
    C            1/3    ¼      0      0
    D            1/3    ¼      0      0

Flow Model (Brin and Page, 1998):

    r_j = Σ_{i → j} β · r_i / n_i + (1 - β) / N

Matrix Model: M' = β M + (1 - β) · [1/N]_{N×N}, where [1/N]_{N×N} is the N×N matrix with every entry 1/N. For the dead-end example with β = ⅘, the first two columns of M' are (columns C and D are computed the same way):

    to \ from    A                 B
    A            ⅘·0 + ⅕·¼         ¼
    B            ⅘·⅓ + ⅕·¼         ¼
    C            ⅘·⅓ + ⅕·¼         ¼
    D            ⅘·⅓ + ⅕·¼         ¼

Column B is ¼ in every row because a dead end teleports with probability 1.

With teleportation, M' has no dead ends and no spider traps (it is stochastic, irreducible, and aperiodic), so the stationary distribution is unique. To apply: run power iteration over M' instead of M.
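Putting it together, a minimal end-to-end sketch with numpy (the function name and defaults are illustrative, not from the slides):

    import numpy as np

    def pagerank(M, beta=0.85, min_err=1e-10, max_iter=1000):
        """Power iteration over M' = beta*M + (1-beta)/N, with dead-end fix."""
        n = M.shape[0]
        M = M.copy()
        dead_ends = M.sum(axis=0) == 0
        M[:, dead_ends] = 1.0 / n             # teleport from dead ends w.p. 1
        M_prime = beta * M + (1 - beta) / n   # teleportation from every node
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_next = M_prime @ r
            if np.abs(r_next - r).sum() < min_err:
                return r_next
            r = r_next
        return r

    # Dead-end example graph; B's column is all zeros before the fix.
    M = np.array([
        [0,   0, 1, 0],
        [1/3, 0, 0, 1],
        [1/3, 0, 0, 0],
        [1/3, 0, 0, 0],
    ])
    print(pagerank(M, beta=4/5))

At web scale one would keep M sparse and add the (1 - β)/N teleport mass implicitly to each iterate, rather than materializing the dense M' as this toy version does.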