Lecture 12: Link Analysis for Web Retrieval

Trevor Cohn, COMP90042, 2015, Semester 1

What we'll learn in this lecture: the web as a graph; the PageRank method for deriving the importance of pages; the hubs and authorities method.

Up until now, documents were assumed to be equal: usefulness for ranking was affected only by the matching of query terms (and document length). E.g., P(R|d) was assumed uniform in probabilistic methods. Can we do better than this? Some documents are authoritative and should be ranked higher than others.

The web as a graph. Pages on the Web do not stand alone; treating them as independent documents is an over-simplification. There is considerable information in the hyperlink structure. [Figure: example web graph with pages Paraguay, jaguar, Mexico, lion, predator, tiger]

What information does a hyperlink convey? It directs the user's attention to other pages. It confers authority (though not always!). Its anchor text explains why the linked page is of interest: "IBM computers" links to www.ibm.com, "search portal" links to www.yahoo.com, "click here" links to Adobe Acrobat, "evil empire" links to... Anchor text is also an additional source of terms for indexing. Perhaps the most important pages have more incoming links?

The web as a directed graph. Formally, consider: in-links, the number of incoming edges of a node; out-links, the number of outgoing edges; and connected components, sets of nodes in which a path connects every pair of nodes. [Figure: example web graph with pages Paraguay, jaguar, Mexico, lion, predator, tiger]
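This degree bookkeeping is easy to sketch in code. Below is a minimal illustration in Python/NumPy (the 4-page adjacency matrix is a made-up example, not the lecture's graph): row sums give out-links, column sums give in-links.

```python
import numpy as np

# Hypothetical 4-page web graph: A[i][j] = 1 means page i links to page j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

out_links = A.sum(axis=1)  # row sums = number of outgoing edges per page
in_links = A.sum(axis=0)   # column sums = number of incoming edges per page

print(out_links)  # [2 1 2 1]
print(in_links)   # [1 1 3 1]  -- page 2 has the most incoming links
```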

Not all links are equal: who and what should we trust? Outgoing links from reputable sites should carry more weight than user-generated content and links from unknown websites. The web has a bow-tie structure, comprising "in" pages that have only outgoing edges into a strongly connected component, whose pages are highly interlinked and also link to "out" pages that have only incoming edges. Typically we don't consider internal links within a web-site (why?).

PageRank. Assumptions: links convey the authority of the source page; pages with more in-links from authoritative sources are more important. How do we formalise this in a model? The random web surfer: consider a surfer who visits a web page, then follows a random out-link chosen uniformly, and occasionally teleports to a new random page (types a new URL). Inference problem: what happens over time? Where does the surfer end up visiting most often?

Example graph with nodes 1, 2, 3. Transition probabilities (no teleport for now):

P(1→1) = 0    P(1→2) = 1    P(1→3) = 0
P(2→1) = 1/2  P(2→2) = 0    P(2→3) = 1/2
P(3→1) = 0    P(3→2) = 1    P(3→3) = 0

Example from MRS, Chapter 21.

Represent as a matrix, P_ij = P(i→j). I.e.,

P = [  0    1    0
      1/2   0   1/2
       0    1    0  ]

Note that P is the adjacency matrix A (A_ij = 1 if the edge i→j exists), normalised such that each row sums to 1.
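A quick sketch of this row normalisation for the 3-node example graph, in Python/NumPy (used here in place of the lecture's Matlab):

```python
import numpy as np

# Adjacency matrix of the 3-node example: edges 1->2, 2->1, 2->3, 3->2.
A = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)

# Normalise each row to sum to 1, giving the transition matrix P.
P = A / A.sum(axis=1, keepdims=True)
print(P)
# [[0.  1.  0. ]
#  [0.5 0.  0.5]
#  [0.  1.  0. ]]
```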

Adding teleportation. If at each time step we randomly jump to another node in the graph with probability α: scale the original P matrix by 1 − α, then add α/N to each entry of the resulting matrix. Overall, the transition matrix is

P_ij = (1 − α) · A_ij / Σ_j' A_ij' + α · 1/N

For the example with α = 0.5:

P = [ 1/6   2/3   1/6
      5/12  1/6   5/12
      1/6   2/3   1/6 ]
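The teleportation adjustment for α = 0.5 can be checked numerically; this NumPy sketch reproduces the matrix above from the raw adjacency matrix:

```python
import numpy as np

alpha = 0.5  # teleport probability
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
N = A.shape[0]

# (1 - alpha) times the row-normalised adjacency, plus alpha / N everywhere.
P = (1 - alpha) * A / A.sum(axis=1, keepdims=True) + alpha / N

expected = np.array([[1/6,  2/3, 1/6],
                     [5/12, 1/6, 5/12],
                     [1/6,  2/3, 1/6]])
print(np.allclose(P, expected))  # True
```

Each row of P still sums to 1, so it remains a valid transition matrix.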

Markov chain. Formally, we have defined a Markov chain: a discrete-time stochastic process consisting of N states, one per web page. The state distribution is denoted x (a row vector); starting at page i we use a 1-hot representation, i.e., x_i = 1 and x_j = 0 for all j ≠ i. Assumption: the probability of reaching a state depends only on the previous state, p(x_t | x_1, x_2, ..., x_{t−1}) = p(x_t | x_{t−1}), and this is characterised by the transition matrix.

Transitions as matrix multiplication. The probability chain rule can be expressed using matrix multiplication:

xP = [P(1) P(2) P(3)] [ P(1→1) P(1→2) P(1→3)
                        P(2→1) P(2→2) P(2→3)
                        P(3→1) P(3→2) P(3→3) ]
   = [ P(1)P(1→1) + P(2)P(2→1) + P(3)P(3→1),
       P(1)P(1→2) + P(2)P(2→2) + P(3)P(3→2),
       P(1)P(1→3) + P(2)P(2→3) + P(3)P(3→3) ]
   = [P(X_{t+1} = 1)  P(X_{t+1} = 2)  P(X_{t+1} = 3)]

where P(1) = P(X_t = 1) and P(2→3) = P(X_{t+1} = 3 | X_t = 2).

Example. Start at state x; after one time-step, the probability of the next state is xP; after two time-steps, (xP)P = xP^2; and so on.

x      = [0      1      0     ]
xP     = [0.4167 0.1667 0.4167]
xP^2   = [0.2083 0.5833 0.2083]
xP^3   = [0.3125 0.3750 0.3125]
...
xP^99  = [0.2778 0.4444 0.2778]
xP^100 = π = [0.2778 0.4444 0.2778]
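These iterates can be reproduced with a few lines of NumPy (a sketch, with the teleport-adjusted P from above; numbers rounded to 4 decimals as on the slide):

```python
import numpy as np

P = np.array([[1/6,  2/3, 1/6],
              [5/12, 1/6, 5/12],
              [1/6,  2/3, 1/6]])
x = np.array([0.0, 1.0, 0.0])  # start in state 2 (1-hot row vector)

for t in range(1, 101):
    x = x @ P                  # one step of the chain
    if t in (1, 2, 3, 100):
        print(t, np.round(x, 4))
# 1   [0.4167 0.1667 0.4167]
# 2   [0.2083 0.5833 0.2083]
# 3   [0.3125 0.375  0.3125]
# 100 [0.2778 0.4444 0.2778]  -- the steady state pi
```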

Markov chain convergence. Run sufficiently long, the state distribution converges: it reaches a steady state, denoted π; transitions from this state leave it unmodified. The state frequencies encode the frequency with which the random surfer visits each page in the limit as t → ∞. When will the Markov chain converge? It must have the property of ergodicity: for any start state i, every state j must be reachable with non-zero probability for all t > T_0, for some constant T_0. Ergodicity in turn requires irreducibility (reachability between any i and j) and aperiodicity (relating to partitioning into sets with internal cycles).

Computing PageRank. The definition of the steady state is πP = π, i.e., once in the steady state, we remain in it after a transition. This is a classic linear algebra problem (finding the left eigenvectors), of the form πP = λπ. We can recover several solution vectors for different values of λ; we want the principal eigenvector, for which λ = 1. In practice, the power iteration method may be used to handle large graphs.

Iteration method in Matlab >> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6]; >> pi0= [1/3 1/3 1/3]; >> pi0 * P ans = 0.2500 0.5000 0.2500 >> pi1 = pi0 * P pi1 = 0.2500 0.5000 0.2500 >> pi2 = pi1 * P pi2 = 0.2917 0.4167 0.2917 >> pi3 = pi2 * P pi3 = 0.2708 0.4583 0.2708 >> pi4 = pi3 * P pi4 = 0.2812 0.4375 0.2812

Eigenvalue method in Matlab >> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6]; >> [V,D,W] = eig(p); >> pi = W(:,1); >> pi = pi/sum(pi); >> pi ans = 0.2778 0.4444 0.2778 >> pi * P ans = 0.2778 0.4444 0.2778

Hubs and authorities (HITS). Assumes there are two kinds of pages on the web: authorities, providing authoritative and detailed information (e.g., the Australian Tax Office, the Bureau of Meteorology, the Wikipedia page on the Australian cricket team); and hubs, containing mainly links to lots of pages about a topic (e.g., DMOZ web directories, someone's Pinterest page, Wikipedia disambiguation pages). Depending on our query, we might want one or the other: a broad topic query, e.g., information about biking in Melbourne; or a specific query, e.g., is the Yarra trail sealed?

Hubs and authorities: a circular definition. A good hub links to many authorities; a good authority is linked to from many hubs. Define h, the hub score for each web page, and a, the authority score for each web page. The mutually recursive definition is

h_i ∝ Σ_{j : i→j} a_j        a_i ∝ Σ_{j : j→i} h_j

for all pages i. This gives rise to an iterative algorithm for finding h and a.

Computing hubs and authorities. Define A, the adjacency matrix, as before: A_ij = 1 denotes the edge i→j. This leads to the relations

h ∝ A a        a ∝ Aᵀ h

Substituting the definition of h into that of a,

a ∝ Aᵀ A a

which is another eigenvalue problem. The principal eigenvector of AᵀA can be used to solve for a, provided there is a steady-state solution (h is found in a similar way).
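A minimal power-iteration sketch of this algorithm in NumPy. The 4-page graph is a made-up example (not from the lecture): pages 0 and 1 act as hubs linking into pages 2 and 3, so page 2 (two in-links) should emerge as the top authority and page 0 (links to both) as the top hub.

```python
import numpy as np

# Hypothetical adjacency: page 0 links to 2 and 3; page 1 links to 2.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

h = np.ones(A.shape[0])         # initial hub scores
for _ in range(50):
    a = A.T @ h                 # a proportional to A^T h
    a = a / np.linalg.norm(a)   # normalise to keep scores bounded
    h = A @ a                   # h proportional to A a
    h = h / np.linalg.norm(h)

print(np.argmax(a))  # 2: strongest authority (two in-links)
print(np.argmax(h))  # 0: strongest hub (links to both authorities)
```

At convergence a is the principal eigenvector of AᵀA, as in the derivation above.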

Summary: PageRank and hubs and authorities. Both are static, query-independent measures of web-page quality, and can be run offline to score each web page. Both are based on a latent (unobserved) quality metric for each page (a single importance score for PageRank; hub and authority scores for HITS) plus a transition mechanism, and give rise to document rankings.

PageRank and HITS in a retrieval system. How can we use these scores in a retrieval system? Alongside our other features, e.g., TF*IDF, BM25 factors, LM: express the model as a combination of factors, e.g.,

RSV_d = Σ_{i=1}^{I} α_i h_i(f_:, f_{d,:}, ...)

where the PR/H&A scores become additional features, each with its own weight α_i. The weighting is learned using a machine-learned scoring function to match binary relevance judgements based on click-throughs or query reformulations. These methods can be exploited, e.g., link spam, Google bombs, Google bowling, etc.

Looking back: the link structure of the web gives rise to a graph, and the structure of that graph conveys information about the importance of pages. PageRank models a random surfer using a Markov chain; HITS models hubs versus authority nodes. We solve for the steady state using power iteration or an eigenvalue solver.

Looking forward: natural language processing, looking in more detail into the structure of text, starting with text classification.

Further reading: Section 18.1 (linear algebra review, covering eigen-decompositions) and Chapter 21 (Link Analysis) of Manning, Raghavan, and Schütze, Introduction to Information Retrieval.