Introduction to Data Mining

Similar documents
Slides based on those in:

Link Mining PageRank. From Stanford C246

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Online Social Networks and Media. Link Analysis and Web Search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Data and Algorithms of the Web

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Online Social Networks and Media. Link Analysis and Web Search

CS6220: DATA MINING TECHNIQUES

Link Analysis. Stony Brook University CSE545, Fall 2016

CS249: ADVANCED DATA MINING

CS6220: DATA MINING TECHNIQUES

0.1 Naive formulation of PageRank

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Statistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

A Note on Google s PageRank

Link Analysis Ranking

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Data Mining Recitation Notes Week 3

1998: enter Link Analysis

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

IR: Information Retrieval

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Link Analysis. Leonid E. Zhukov

Lecture 12: Link Analysis for Web Retrieval

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

Today. Next lecture. (Ch 14) Markov chains and hidden Markov models

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Data Mining and Matrices

Lecture: Local Spectral Methods (2 of 4) 19 Computing spectral ranking with the push procedure

Web Structure Mining Nodes, Links and Influence

Lecture: Local Spectral Methods (1 of 4)

Jeffrey D. Ullman Stanford University

Google Page Rank Project Linear Algebra Summer 2012

Lecture 7 Mathematics behind Internet Search

Jeffrey D. Ullman Stanford University

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

Complex Social System, Elections. Introduction to Network Analysis 1

ECEN 689 Special Topics in Data Science for Communications Networks

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

Mathematical Properties & Analysis of Google s PageRank

1 Searching the World Wide Web

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

Faloutsos, Tong ICDE, 2009

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

STA141C: Big Data & High Performance Statistical Computing

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Intelligent Data Analysis. PageRank. School of Computer Science University of Birmingham

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Link Analysis & Ranking CS 224W

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

Analysis of Google s PageRank

Class President: A Network Approach to Popularity. Due July 18, 2014

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

Page rank computation HPC course project a.y

Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016

Computing PageRank using Power Extrapolation

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Degree Distribution: The case of Citation Networks

Uncertainty and Randomization

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

COMPSCI 514: Algorithms for Data Science

Graph Models The PageRank Algorithm

CS47300: Web Information Search and Management

Finding central nodes in large networks

Models and Algorithms for Complex Networks. Link Analysis Ranking

Machine Learning CPSC 340. Tutorial 12

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

7.3 The Jacobi and Gauss-Seidel Iterative Methods

Node and Link Analysis

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

The Push Algorithm for Spectral Ranking

Eigenvalue Problems Computation and Applications

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Definition: A sequence is a function from a subset of the integers (usually either the set

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

Node Centrality and Ranking on Networks

A New Method to Find the Eigenvalues of Convex. Matrices with Application in Web Page Rating

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee

eigenvalues, markov matrices, and the power method

Local properties of PageRank and graph limits. Nelly Litvak University of Twente Eindhoven University of Technology, The Netherlands MIPT 2018

Transcription:

Introduction to Data Mining Lecture #9: Link Analysis Seoul National University 1

In This Lecture Motivation for link analysis Pagerank: an important graph ranking algorithm Flow and random walk formulation 2

Outline Overview PageRank: Flow Formulation 3

Graph Data: Social Networks Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011] 4

Graph Data: Media Networks Connections between political blogs Polarization of the network [Adamic-Glance, 2005] 5

Graph Data: Information Nets Citation networks and Maps of science [Börner et al., 2012] 6

Graph Data: Communication Nets domain2 domain1 router domain3 Internet 7

Graph Data: Classic Example Seven Bridges of Königsberg [Euler, 1735] Return to the starting point by traveling each link of the graph once and only once. 8

Web as a Graph Web as a directed graph: Nodes: Webpages Edges: Hyperlinks This is a class on Data Mining. Classes are in the 302 building 302 building is located near SNU SNU homepage 9

Web as a Graph Web as a directed graph: Nodes: Webpages Edges: Hyperlinks This is a class on Data Mining. Classes are in the 302 building 302 building is located near SNU SNU homepage 10

Web as a Directed Graph 11

Broad Question How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval investigates: Find relevant docs in a small and trusted set Newspaper articles, Patents, etc. But: Web is huge, full of untrusted documents, random things, web spam, etc. 12

Web Search: 2 Challenges 2 challenges of web search: (1) Web contains many sources of information Who to trust? Idea: Trustworthy pages may point to each other! (2) What is the best answer to the query newspaper? No single right answer Idea: Pages that actually know about newspapers might all be pointing to many newspapers 13

Ranking Nodes on the Graph All web pages are not equally important www.joe-schmoe.com vs. www.snu.ac.kr There is large diversity in the web-graph node connectivity. Let s rank the pages by the link structure! 14

Link Analysis Algorithms We will cover the following Link Analysis approaches for computing importance of nodes in a graph: Page Rank Topic-Specific (Personalized) Page Rank Web Spam Detection Algorithms 15

Outline Overview PageRank: Flow Formulation 16

Links as Votes Idea: Links as votes A page is more important if it has more links In-coming links? Out-going links? Think of in-links as votes: www.snu.ac.kr has 100,000 in-links www.joe-schmoe.com has 1 in-link Are all in-links equal? Links from important pages count more Recursive question! 17

Example: PageRank Scores A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 18

Simple Recursive Formulation Each link s vote is proportional to the importance of its source page If page j with importance r j has n out-links, each link gets r j / n votes Page j s own importance is the sum of the votes on its in-links r j = r i /3+r k /4 i k r i /3 r k /4 j r j /3 r j /3 r j /3 19

PageRank: The Flow Model A vote from an important page is worth more A page is important if it is pointed to by other important pages Define a rank r j for page j r j = i j r i d i dd ii out-degree of node ii i -> j : all i that point to j a/2 a a/2 y/2 y y/2 m Flow equations: r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 20 m

3 equations, 3 unknowns, no constants No unique solution All solutions equivalent modulo the scale factor I.e., Multiplying c to given a solution r y, r a, r m will give you another solution Additional constraint forces uniqueness: Solving the Flow Equations rr yy + rr aa + rr mm = 11 Solution: rr yy = 22 55, rr aa = 22 55, rr mm = 11 55 Flow equations: r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 Gaussian elimination method works for small examples, but we need a better method for large web-size graphs We need a new formulation! 21

PageRank: Matrix Formulation Stochastic adjacency matrix MM Let page ii has dd ii out-links If ii jj, then MM jjii = 1 dd ii else MM jjii = 0 MM is a column stochastic matrix Each column sums to 1 NOTE: A matrix M is called `column stochastic if the sum of each column is 1 a y m Destination Source y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 MM 22

PageRank: Matrix Formulation Rank vector rr: vector with an entry per page rr ii is the importance score of page ii ii rr ii = 1 ri r = The flow equations Why? j i j d i rr = MM rr can be written r 2 r j = x 1 1 1 r 5 d 2 d 5 d 9 r 9 23

Eigenvector Formulation The flow equations can be written rr = MM rr NOTE: x is an eigenvector with the corresponding eigenvalue λ if: AAAA = λλλλ So the rank vector r is an eigenvector of the web matrix M, with the corresponding eigenvalue 1 Fact: The largest eigenvalue of a column stochastic matrix is 1 We can now efficiently solve for r! The method is called Power iteration 24

Example: Flow Equations & M a y m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 r = M r r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m 25

Power Iteration Method Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks Power iteration: a simple iterative scheme Suppose there are N web pages Initialize: r (0) = [1/N,.,1/N] T Iterate: r (t+1) = M r (t) Stop when r (t+1) r (t) 1 < ε x 1 = 1 i N x i (called the L1 norm) Can use any other vector norm, e.g., Euclidean r ( t+ 1) j = i j ( t) i r d d i. out-degree of node i i 26

Power Iteration Method Power iteration: A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue) rr (11) = MM rr (00) rr (22) = MM rr 11 = MM MMrr 11 = MM 22 rr 00 rr (33) = MM rr 22 = MM MM 22 rr 00 = MM 33 rr 00 Fact: Sequence MM rr 00, MM 22 rr 00, MM kk rr 00, approaches the dominant eigenvector of MM Dominant eigenvector = the one corresponding to the largest eigenvalue 27

PageRank: How to solve? Power Iteration: Set rr jj = 1/N rr 1: rrr jj = ii ii jj dd ii 2: rr = rrr Goto 1 Example: r y 1/3 1/3 5/12 9/24 6/15 r a = 1/3 3/6 1/3 11/24 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, a y m y a m y ½ ½ 0 a ½ 0 1 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 28

Random Walk Interpretation Imagine a random web surfer: Let: At any time tt, surfer is on some page ii At time tt + 11, the surfer follows an out-link from ii uniformly at random Ends up on some page jj linked from ii Process repeats indefinitely pp(tt) vector whose ii th coordinate is the prob. that the surfer is at page ii at time tt So, pp(tt) is a probability distribution over pages r j i 1 i 2 i 3 = j i j d out ri (i) 29

The Stationary Distribution Where is the surfer at time t+1? Follows a link uniformly at random pp tt + 11 = MM pp(tt) Suppose the random walk reaches a state pp tt + 11 = MM pp(tt) = pp(tt) then pp(tt) is called stationary distribution of a random walk Our original rank vector rr satisfies rr = MM rr So, rr is a stationary distribution for the random walk i 1 i 2 i 3 j p( t + 1) = M p( t) 30

Existence and Uniqueness A central result from the theory of random walks (a.k.a. Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0 Certain conditions: a walk starting from a random page can reach any other page, and the graph is not bipartite 31

What You Need to Know Motivation for link analysis Graphs are everywhere Web as graphs Pagerank: an important graph ranking algorithm A page is important if it is pointed to by other important pages Pagerank vector gives the stationary distribution for the random walk on a graph 32

Questions? 33