CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University


CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu

How to organize/navigate the Web? First try: human-curated Web directories (Yahoo, DMOZ, LookSmart).
11/8/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Search! Find relevant documents in a small and trusted set: newspaper articles, patents, etc.
Two traditional problems:
  Synonymy: buy vs. purchase, sick vs. ill
  Polysemy: jaguar

Do more documents mean better results?

What is the best answer to the query "Stanford"? Anchor text: "I go to Stanford where I study".
What about the query "newspaper"? There is no single right answer.
Scarcity (IR) vs. abundance (Web) of information. The Web has many sources of information: who to trust?
Trick: pages that actually know about newspapers might all be pointing to many newspapers.
Ranking!

[Figure: the golden triangle]

Web pages are not equally important: www.joe-schmoe.com vs. www.stanford.edu
We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure.

We will cover the following link analysis approaches to computing the importance of nodes in a graph:
  Hubs and Authorities (HITS)
  PageRank
  Topic-Specific (Personalized) PageRank
Sidenote: various notions of centrality of a node u:
  Degree centrality = degree of u
  Betweenness centrality = number of shortest paths passing through u
  Closeness centrality = average length of shortest paths from u to all other nodes
  Eigenvector centrality = like PageRank

Goal (back to the newspaper example): don't just find newspapers. Find "experts": people who link in a coordinated way to good newspapers.
Idea: links as votes. A page is more important if it has more links. In-coming links? Out-going links?
Hubs and Authorities. Each page has 2 scores:
  Quality as an expert (hub): total sum of votes of pages pointed to
  Quality as content (authority): total sum of votes of experts
Principle of repeated improvement.
Example votes (figure): NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9

Interesting pages fall into two classes:
1. Authorities are pages containing useful information: newspaper home pages, course home pages, home pages of auto manufacturers.
2. Hubs are pages that link to authorities: list of newspapers, course bulletin, list of US auto manufacturers.


A good hub links to many good authorities. A good authority is linked from many good hubs.
Model using two scores for each node: hub score and authority score, represented as vectors h and a.

[Kleinberg 98] Each page i has 2 scores: authority score $a_i$ and hub score $h_i$.
HITS algorithm:
  Initialize: $a_i = h_i = 1$ for every page i (any equal positive start works, since the scores are normalized)
  Then keep iterating:
    Authority: $a_i = \sum_{j \to i} h_j$ (sum over the pages $j_1, \dots, j_4$ that point to i)
    Hub: $h_i = \sum_{i \to j} a_j$ (sum over the pages $j_1, \dots, j_4$ that i points to)
    Normalize a and h
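The update rules above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's reference code: the function name `hits`, the toy graph, the fixed iteration count, and the choice of unit Euclidean length for normalization are all assumptions.

```python
import math

def hits(edges, nodes, iterations=50):
    """Run the HITS updates on a directed graph given as (source, target) pairs."""
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing to each page.
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the (freshly updated) authority scores of the pages pointed to.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize both vectors to unit Euclidean length (assumed choice of norm).
        a_norm = math.sqrt(sum(x * x for x in new_auth.values()))
        h_norm = math.sqrt(sum(x * x for x in new_hub.values()))
        auth = {v: x / a_norm for v, x in new_auth.items()}
        hub = {v: x / h_norm for v, x in new_hub.items()}
    return auth, hub

# Hypothetical toy graph: two list pages (hubs) linking to three newspapers.
nodes = ["list1", "list2", "nyt", "cnn", "wsj"]
edges = [("list1", "nyt"), ("list1", "cnn"), ("list1", "wsj"),
         ("list2", "nyt"), ("list2", "cnn")]
auth, hub = hits(edges, nodes)
```

In this toy graph the two list pages end up with all the hub score and the newspapers with all the authority score; the newspaper linked from only one list gets the lowest authority.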

[Kleinberg 98] HITS converges to a single stable point.
Slightly change the notation:
  Vectors $a = (a_1, \dots, a_n)$, $h = (h_1, \dots, h_n)$
  Adjacency matrix (n x n): $M_{ij} = 1$ if $i \to j$, else 0
Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
So: $h = M a$
And likewise: $a = M^T h$

HITS algorithm in the new notation:
  Set: $a = h = \mathbf{1}^n$ (all entries equal)
  Repeat: $h = M a$, $a = M^T h$, normalize
Then: $a = M^T (M a)$, where $M a$ is the new h and the left-hand side is the new a.
  a is updated (in 2 steps): $M^T (M a) = (M^T M) a$
  h is updated (in 2 steps): $M (M^T h) = (M M^T) h$
Thus, in 2k steps: $a = (M^T M)^k a$ and $h = (M M^T)^k h$
Repeated matrix powering.
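A small check of the "2 steps equal one multiplication" claim, as a sketch in plain Python. The 3-page graph and the helper names (`matvec`, `transpose`, `matmul`) are assumptions made for illustration.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, with M[i][j] = 1 if i links to j.
M = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
Mt = transpose(M)

a = [1, 1, 1]
h = matvec(M, a)                        # first half-step:  h = M a
a_two_steps = matvec(Mt, h)             # second half-step: a = M^T h
a_one_step = matvec(matmul(Mt, M), a)   # single step:      a = (M^T M) a
```

Both routes give the same authority vector, which is why HITS can be read as repeated powering of $M^T M$ (and of $M M^T$ for hubs).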

Definition: let $A x = \lambda x$ for some scalar $\lambda$, vector $x$, matrix $A$. Then $x$ is an eigenvector and $\lambda$ is its eigenvalue.
Fact: if $A$ is symmetric ($A_{ij} = A_{ji}$; in our case $M^T M$ and $M M^T$ are symmetric), then $A$ has n orthogonal unit eigenvectors $w_1, \dots, w_n$ that form a basis (coordinate system), with eigenvalues $\lambda_1, \dots, \lambda_n$ ordered so that $|\lambda_i| \ge |\lambda_{i+1}|$.

Let's write x in the coordinate system $w_1, \dots, w_n$: $x = \sum_i \alpha_i w_i$, so x has coordinates $(\alpha_1, \dots, \alpha_n)$.
Suppose the eigenvalues are ordered $|\lambda_1| \ge \dots \ge |\lambda_n|$.
Since $A^k w_i = \lambda_i^k w_i$: $A^k x = \sum_i \alpha_i \lambda_i^k w_i$
As $k \to \infty$, if we normalize, $A^k x \to \alpha_1 \lambda_1^k w_1$ (the contribution of all other coordinates $\to 0$, provided $|\lambda_1| > |\lambda_2|$ and $\alpha_1 \ne 0$).
So authority a is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$.
Similarly: hub h is the eigenvector of $M M^T$ associated with its largest eigenvalue.
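Why the other coordinates vanish can be made explicit by dividing by $\lambda_1^k$. A short worked step, assuming a strict gap $|\lambda_1| > |\lambda_2|$ and $\alpha_1 \ne 0$:

```latex
\frac{A^k x}{\lambda_1^k}
  = \alpha_1 w_1 + \sum_{i=2}^{n} \alpha_i \left(\frac{\lambda_i}{\lambda_1}\right)^{k} w_i
  \;\xrightarrow[k \to \infty]{}\; \alpha_1 w_1 ,
\qquad \text{since } \left|\frac{\lambda_i}{\lambda_1}\right| < 1 \text{ for } i \ge 2 .
```

After normalization only the direction matters, so the iterates line up with $w_1$.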

A vote from an important page is worth more. A page is important if it is pointed to by other important pages.
Define a rank $r_j$ for node j: $r_j = \sum_{i \to j} \frac{r_i}{d_{out}(i)}$
Example ("The web in 1839"): pages y, a, m. Page y links to itself and to a, a links to y and m, m links to a.
Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$
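The three flow equations fix r only up to scale, so a worked solution needs one extra constraint. Assuming the normalization $r_y + r_a + r_m = 1$, the system can be solved by hand:

```latex
\begin{aligned}
r_y &= \tfrac{1}{2} r_y + \tfrac{1}{2} r_a &&\Rightarrow\; r_y = r_a \\
r_m &= \tfrac{1}{2} r_a \\
r_y + r_a + r_m &= 1 &&\Rightarrow\; r_a + r_a + \tfrac{1}{2} r_a = 1 \;\Rightarrow\; r_a = \tfrac{2}{5} \\
r_y &= \tfrac{2}{5}, \quad r_m = \tfrac{1}{5} \\
\text{check: } \tfrac{1}{2} r_y + r_m &= \tfrac{1}{5} + \tfrac{1}{5} = \tfrac{2}{5} = r_a
\end{aligned}
```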

Stochastic adjacency matrix M:
  Let page j have $d_j$ out-links. If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$.
  M is a column-stochastic matrix: columns sum to 1.
Rank vector r: a vector with one entry per page.
  $r_i$ is the importance score of page i, and $\sum_i r_i = 1$.
The flow equations can be written $r = M r$.
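A sketch of building M from an edge list, using exact fractions. The function name `stochastic_matrix` is an assumption, and the edge list is the y/a/m graph read off the flow equations; the sketch assumes no duplicate edges and no dead ends.

```python
from fractions import Fraction

def stochastic_matrix(edges, nodes):
    """Return M with M[i][j] = 1/d_j if j links to i, else 0."""
    index = {v: k for k, v in enumerate(nodes)}
    out_degree = {v: 0 for v in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, dst in edges:
        # Column = source page j, row = destination page i.
        M[index[dst]][index[src]] = Fraction(1, out_degree[src])
    return M

nodes = ["y", "a", "m"]
edges = [("y", "y"), ("y", "a"), ("a", "y"), ("a", "m"), ("m", "a")]
M = stochastic_matrix(edges, nodes)
column_sums = [sum(M[i][j] for i in range(len(nodes))) for j in range(len(nodes))]
```

Every entry of `column_sums` equals 1 here, which is the column-stochastic property.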

Imagine a random web surfer:
  At any time t, the surfer is on some page u.
  At time t+1, the surfer follows an out-link from u uniformly at random and ends up on some page v linked from u.
  The process repeats indefinitely.
Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t. Then p(t) is a probability distribution over pages.

Where is the surfer at time t+1? The surfer follows a link uniformly at random: $p(t+1) = M p(t)$
Suppose the random walk reaches a state where $p(t+1) = M p(t) = p(t)$. Then p(t) is a stationary distribution of the random walk.
Our rank vector r satisfies $r = M r$, so it is a stationary distribution for the random walk.
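The random-surfer view can be checked empirically: simulate the walk and count visits. A minimal sketch on the y/a/m graph from the flow-equation example; the function name, step count, and seed are arbitrary assumptions.

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow out-links uniformly at random and return visit frequencies."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# y links to itself and a, a links to y and m, m links to a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, "y", 200_000)
```

For this graph the visit frequencies should approach the stationary distribution, roughly 0.4 for y, 0.4 for a and 0.2 for m, up to sampling noise.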

Given a web graph with n nodes, where the nodes are pages and the edges are hyperlinks:
  Assign each node an initial page rank.
  Repeat until convergence: calculate the page rank of each node,
    $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
  where $d_i$ is the out-degree of node i.

Power iteration:
  Set $r_j = 1/n$
  And iterate: $r_j \leftarrow \sum_{i \to j} r_i / d_i$
Example (pages y, a, m), matrix M:
        y     a     m
  y    1/2   1/2    0
  a    1/2    0     1
  m     0    1/2    0
Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$
  Iteration:   0     1     2      3      ...   limit
  r_y         1/3   1/3   5/12   9/24    ...   6/15
  r_a         1/3   3/6   1/3    11/24   ...   6/15
  r_m         1/3   1/6   3/12   1/6     ...   3/15
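The iteration table can be reproduced exactly with rational arithmetic. A sketch using Python's `fractions` module; the helper name `step` is an assumption, and M is the example matrix from this slide.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-stochastic matrix of the y/a/m example (rows and columns ordered y, a, m).
M = [[half, half, Fraction(0)],
     [half, Fraction(0), Fraction(1)],
     [Fraction(0), half, Fraction(0)]]

def step(M, r):
    """One power-iteration step: r_j <- sum_i M[j][i] * r[i]."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

r = [Fraction(1, 3)] * 3
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)

# Many more steps in floating point to approximate the limit.
r_float = [1 / 3] * 3
for _ in range(100):
    r_float = step(M, r_float)
```

`history` holds iterations 0 to 3 as exact fractions (in lowest terms, so 9/24 appears as 3/8), and `r_float` approaches (6/15, 6/15, 3/15).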

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r = M r$.
Does this converge? Does it converge to what we want? Are the results reasonable?

Example: a and b link to each other ($a \to b$, $b \to a$). The scores swap at every step and never converge.
  Iteration:   0   1   2   3   ...
  r_a          1   0   1   0
  r_b          0   1   0   1

Example: a links to b ($a \to b$) and b has no out-links. The score leaks out and everything drops to 0.
  Iteration:   0   1   2   3   ...
  r_a          1   0   0   0
  r_b          0   1   0   0
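Both two-node examples can be replayed with the plain update $r \leftarrow M r$. A small sketch; the helper name `iterate` and the matrix layouts (column = source page, row = destination page, order a, b) are assumptions consistent with the column-stochastic convention used so far.

```python
def iterate(M, r, steps):
    """Apply r <- M r repeatedly and record every iterate."""
    history = [r]
    for _ in range(steps):
        r = [sum(M[j][i] * r[i] for i in range(len(r))) for j in range(len(r))]
        history.append(r)
    return history

# a -> b and b -> a: the score swaps back and forth.
cycle = [[0, 1],
         [1, 0]]
# a -> b only: b is a dead end, so its column is all zeros.
dead_end = [[0, 0],
            [1, 0]]

cycle_history = iterate(cycle, [1, 0], 3)
dead_end_history = iterate(dead_end, [1, 0], 3)
```

`cycle_history` alternates between [1, 0] and [0, 1], while `dead_end_history` reaches [0, 0] after two steps.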

2 problems:
  Some pages are dead ends (have no out-links). Such pages cause importance to leak out.
  Spider traps (all out-links are within the group). Eventually spider traps absorb all importance.

Power iteration with a dead end (m has no out-links):
  Set $r_j = 1/n$
  And iterate: $r_j \leftarrow \sum_{i \to j} r_i / d_i$
Matrix M:
        y     a     m
  y    1/2   1/2    0
  a    1/2    0     0
  m     0    1/2    0
Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2$
  Iteration:   0     1     2      3      ...   limit
  r_y         1/3   2/6   3/12   5/24    ...   0
  r_a         1/3   1/6   2/12   3/24    ...   0
  r_m         1/3   1/6   1/12   2/24    ...   0

Power iteration with a spider trap (m links only to itself):
  Set $r_j = 1/n$
  And iterate: $r_j \leftarrow \sum_{i \to j} r_i / d_i$
Matrix M:
        y     a     m
  y    1/2   1/2    0
  a    1/2    0     0
  m     0    1/2    1
Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2 + r_m$
  Iteration:   0     1     2      3       ...   limit
  r_y         1/3   2/6   3/12   5/24     ...   0
  r_a         1/3   1/6   2/12   3/24     ...   0
  r_m         1/3   3/6   7/12   16/24    ...   1
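The dead-end and spider-trap behaviour can be confirmed with exact fractions. A sketch under the same conventions as the slides (uniform start, rows and columns ordered y, a, m); the names `power_iterate`, `dead_end` and `spider_trap` are assumptions.

```python
from fractions import Fraction as F

def power_iterate(M, steps):
    """Start from the uniform vector and apply r <- M r `steps` times."""
    n = len(M)
    r = [F(1, n)] * n
    for _ in range(steps):
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

half = F(1, 2)
# m is a dead end: its column is all zeros, so total score shrinks every step.
dead_end = [[half, half, 0],
            [half, 0, 0],
            [0, half, 0]]
# m is a spider trap: it links only to itself and keeps whatever score it receives.
spider_trap = [[half, half, 0],
               [half, 0, 0],
               [0, half, 1]]

leaked = power_iterate(dead_end, 100)
trapped = power_iterate(spider_trap, 100)
```

After 100 steps the total of `leaked` is close to 0, while `trapped` still sums to exactly 1 with almost all of it on m.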

$r^{(t+1)} = M r^{(t)}$
Markov chains:
  Set of states X
  Transition matrix P where $P_{ij} = P(X_t = i \mid X_{t-1} = j)$
  $\pi$ specifying the probability of being at each state $x \in X$
  Goal is to find $\pi$ such that $\pi = P \pi$

Markov chain theory. Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.

Stochastic: every column sums to 1.
  $S = M + \frac{1}{n} e\, a^T$, where $a_i = 1$ if node i is a dead end (0 otherwise) and e is the vector of all 1s
Example (m is a dead end), matrix S:
        y     a     m
  y    1/2   1/2   1/3
  a    1/2    0    1/3
  m     0    1/2   1/3
Flow equations: $r_y = r_y/2 + r_a/2 + r_m/3$, $r_a = r_y/2 + r_m/3$, $r_m = r_a/2 + r_m/3$
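The fix amounts to replacing every all-zero column by the uniform column 1/n. A minimal sketch; the function name `fix_dead_ends` is an assumption, and M is the dead-end example with rows and columns ordered y, a, m.

```python
from fractions import Fraction

def fix_dead_ends(M):
    """Return S: a copy of M where each all-zero column becomes the uniform column 1/n."""
    n = len(M)
    S = [row[:] for row in M]
    for j in range(n):
        # An all-zero column j means page j has no out-links (a_j = 1).
        if all(M[i][j] == 0 for i in range(n)):
            for i in range(n):
                S[i][j] = Fraction(1, n)
    return S

half = Fraction(1, 2)
M = [[half, half, 0],
     [half, 0, 0],
     [0, half, 0]]
S = fix_dead_ends(M)
column_sums = [sum(S[i][j] for i in range(3)) for j in range(3)]
```

Only the column of the dead end m changes; all columns of S now sum to 1.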

Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k; otherwise it is aperiodic.

Irreducible: from any state, there is a non-zero probability of reaching any other state.

Google's solution: at each step, the random surfer has two options:
  With probability $1-\beta$, follow a link at random
  With probability $\beta$, jump to some page uniformly at random
PageRank equation [Brin-Page, 98]: $r_j = \sum_{i \to j} (1-\beta) \frac{r_i}{d_i} + \beta \frac{1}{n}$
  where $d_i$ is the out-degree of node i
This assumes we follow random teleport links with probability 1.0 from dead ends.
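The PageRank equation can be iterated directly on an adjacency list, without ever forming a dense matrix. A sketch, assuming `beta` denotes the teleport probability and that dead ends teleport with probability 1.0; the function name `pagerank` and the fixed iteration count are assumptions.

```python
def pagerank(out_links, beta=0.15, iterations=100):
    """Power iteration for PageRank on an adjacency list; beta is the teleport probability."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page teleports with probability beta, spreading beta/n to each page.
        new_rank = {p: beta / n for p in pages}
        for page, targets in out_links.items():
            if targets:
                # With probability 1 - beta, follow one of the out-links at random.
                share = (1 - beta) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dead end: the remaining 1 - beta of its score also teleports uniformly.
                for p in pages:
                    new_rank[p] += (1 - beta) * rank[page] / n
        rank = new_rank
    return rank

# Spider-trap example: m links only to itself.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
rank = pagerank(out_links)
```

With teleportation the trap page m still gets the largest score, but it no longer absorbs everything, and the scores keep summing to 1.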

The Google Matrix: $G = (1-\beta) S + \beta \frac{1}{n} e e^T$
G is stochastic, aperiodic and irreducible.
G is dense but computable using the sparse matrix H.

PageRank as a principal eigenvector:
  $r = M r \iff r_j = \sum_{i \to j} r_i / d_i$
  But we really want: $r_j = (1-\beta) \sum_{i \to j} r_i / d_i + \beta \frac{1}{n}$
  Define: $M'_{ij} = (1-\beta) M_{ij} + \beta \frac{1}{n}$
  Then: $r = M' r$
What is $\beta$? In practice $\beta = 0.15$ (about 5 links, then a jump).
$d_i$ is the out-degree of node i.
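The dense formulation can be sketched as follows, assuming M is already column-stochastic (no dead ends) and writing `beta` for the teleport probability. The function names and the spider-trap matrix used as input are assumptions for illustration.

```python
def teleport_matrix(M, beta=0.15):
    """Dense M' with M'[i][j] = (1 - beta) * M[i][j] + beta / n."""
    n = len(M)
    return [[(1 - beta) * M[i][j] + beta / n for j in range(n)] for i in range(n)]

def power_iteration(A, iterations=200):
    """Start from the uniform vector and repeatedly apply r <- A r."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

# Spider-trap example (rows and columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
M_prime = teleport_matrix(M)
r = power_iteration(M_prime)
```

Every column of `M_prime` sums to 1 and every entry is positive, so the iteration converges to a vector satisfying r = M' r.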


PageRank and HITS are two solutions to the same problem: what is the value of an in-link from u to v?
  In the PageRank model, the value of the link depends on the links into u.
  In the HITS model, it depends on the value of the other links out of u.
The destinies of PageRank and HITS post-1998 were very different.

Goal: evaluate pages not just by popularity but by how close they are to the topic.
Teleporting can go to:
  Any page with equal probability (we used this so far)
  A topic-specific set of "relevant" pages
Topic-specific (personalized) PageRank, with teleport set S:
  $M'_{ij} = (1-\beta) M_{ij} + \beta / |S|$ if $i \in S$
  $M'_{ij} = (1-\beta) M_{ij}$ otherwise
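A sketch of the topic-specific matrix, assuming `beta` is the teleport probability and pages are identified by row index. The names `topic_matrix` and `power_iteration`, and the choice of the y/a/m graph with S = {m}, are assumptions for illustration.

```python
def topic_matrix(M, teleport_set, beta=0.15):
    """M'[i][j] = (1 - beta) * M[i][j] + beta / |S| if i is in S, else (1 - beta) * M[i][j]."""
    n = len(M)
    bonus = beta / len(teleport_set)
    return [[(1 - beta) * M[i][j] + (bonus if i in teleport_set else 0.0)
             for j in range(n)] for i in range(n)]

def power_iteration(A, iterations=200):
    """Start from the uniform vector and repeatedly apply r <- A r."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

# y/a/m example graph (rows and columns ordered y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r_uniform = power_iteration(topic_matrix(M, {0, 1, 2}))  # S = all pages: ordinary PageRank
r_topic = power_iteration(topic_matrix(M, {2}))          # S = {m}
```

Teleporting only to m raises m's score relative to ordinary PageRank, which is the intended bias toward the teleport set.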

Graphs and web search: PageRank ranks nodes by "importance". Personalized PageRank ranks proximity of nodes to the teleport nodes S.
Proximity on graphs: Q: what is the most related conference to ICDM?
Random Walks with Restarts: teleport back to the starting node, S = {single node}.
(Figure: bipartite graph linking conferences IJCAI, KDD, ICDM, SDM, AAAI, NIPS to authors Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan.)
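Random walk with restart can be sketched on a small bipartite graph. The graph below is a made-up toy with placeholder names, not the data from the figure; `beta` is assumed to be the restart probability and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(neighbors, start, beta=0.15, iterations=200):
    """Proximity of every node to `start`: personalized PageRank with S = {start}."""
    score = {v: 0.0 for v in neighbors}
    score[start] = 1.0
    for _ in range(iterations):
        new_score = {v: 0.0 for v in neighbors}
        new_score[start] = beta  # restart: jump back to the start node
        for v, nbrs in neighbors.items():
            # Otherwise move to a uniformly random neighbor.
            share = (1 - beta) * score[v] / len(nbrs)
            for u in nbrs:
                new_score[u] += share
        score = new_score
    return score

# Hypothetical toy graph: conf_A - author_1 - conf_B - author_2 - conf_C.
neighbors = {
    "conf_A": ["author_1"],
    "conf_B": ["author_1", "author_2"],
    "conf_C": ["author_2"],
    "author_1": ["conf_A", "conf_B"],
    "author_2": ["conf_B", "conf_C"],
}
proximity = random_walk_with_restart(neighbors, "conf_A")
```

In this toy graph conf_B shares an author with conf_A and so scores higher than conf_C, which is reachable only through conf_B.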

Link farms: networks of millions of pages designed to focus PageRank on a few undeserving webpages.
To minimize their influence, use a teleport set of trusted webpages, e.g., homepages of universities.

Rich get richer [Cho et al., WWW 04], http://oak.cs.ucla.edu/~cho/papers/cho-bias.pdf
Two snapshots of the web graph at two different time points. Measure the change in:
  the number of in-links
  PageRank
