Introduction to Search Engine Technology: Introduction to Link Structure Analysis. Ronny Lempel, Yahoo Labs, Haifa

Outline
- Anchor-text indexing
- Mathematical background
- Motivation for link structure analysis
- HITS - Hypertext Induced Topic Search
- PageRank

Navigational Queries & Anchor Text
Anchor text is the highlighted, clickable text of a link. It physically resides in page A, but actually describes the content of page B, and often defines page B concisely. Anchor text is the dominant factor in ranking navigational queries: queries where the user has a specific home page in mind and uses the search engine to navigate to that page. While in principle there is a single correct result for a navigational query, the result may also depend on factors beyond the query text, such as the locale of the user. Anchor text is sometimes (ab)used by a community to mislead search engines, a technique referred to as Googlebombing; a famous example is the query "miserable failure".

Retrieval Issues when Using Anchor Text
- Anchor stop words: "home", "back", "click here".
- Should intra-site anchor text be considered as important as inter-site anchor text?
- In conjunctive queries, users may be dissatisfied when results that do not physically contain the query terms are returned (and counted). Perhaps anchor text should only be used to enhance precision rather than determine recall?
- Anchor-only pages are pages for which the engine has encountered anchor text but hasn't crawled the page itself:
  1. Are anchor-only pages eligible to be returned?
  2. Can effective summaries be constructed for such pages?

Link Analysis - Motivation
Traditional text-based IR ranking methods are not sufficient on the Web, even when considering anchor text:
- The Web is huge, with great variation in the quality of pages.
- Queries are usually very short, making differentiation between pages difficult.
- The text on many web pages does not sufficiently describe the page.
- Keyword spamming.
- The emphasis of web search is on precision, not recall!

Link Analysis - Motivation (continued)
Beyond anchor text, the connectivity patterns between Web pages contain a gold mine of information. A link from page a to page b can often be interpreted as:
1. A recommendation, by a's author, of the contents of b.
2. Evidence that pages a and b share some topic of interest.
A co-citation of a and b (by a third page c) may also constitute evidence that a and b share some topic of interest. However, not all links should be interpreted this way.

(not) All Links are Created Equal
The Good: informative links, in the sense described previously.
The Bad: non-informative links, such as:
1. Intra-domain, navigational links
2. Commercials and sponsorship links
And the Ugly:
1. Link spamming
2. Non-semantic link exchanges

Non-informative Link Types
- Navigational intra-domain links
- Commercial links, links to sponsors
- Non-semantic link-exchange agreements

Mathematical Background - Irreducibility
- A directed graph G = (V, E) is called irreducible if for every i, k in V there is a path in G originating at i and ending at k.
- A non-negative N x N matrix W is called irreducible if for all i, k in {1, ..., N} there exists a non-negative integer m such that [W^m]_{i,k} > 0.
- The support graph G_W = (V_W, E_W) of a non-negative N x N matrix W is a directed graph with N vertices, such that the edge i -> k is in E_W iff W_{i,k} > 0.
- Lemma: a non-negative square matrix W is irreducible if and only if G_W is irreducible.
- Lemma: a directed graph G is irreducible if and only if the adjacency matrix of G is irreducible.
(A code sketch for checking irreducibility through the support graph appears below, after the Perron-Frobenius theorem.)

Mathematical Background: Perron-Frobenius Theorem (simplified)
Let W be a non-negative N x N matrix, and denote by λ_1(W), λ_2(W), ..., λ_N(W) the N eigenvalues of W, ordered by non-increasing absolute value (i.e. |λ_1(W)| >= |λ_2(W)| >= ... >= |λ_N(W)|). λ_1(W) is the spectral radius of W and will simply be denoted by λ(W).
Perron-Frobenius Theorem for irreducible matrices: let W be an irreducible matrix. Then:
- λ(W) > 0.
- λ(W) is a simple eigenvalue of W (i.e. it is a simple root of the characteristic polynomial of W); in particular, λ(W) is a real eigenvalue of W.
- W has positive left and right eigenvectors corresponding to λ(W) (i.e. all the components of those eigenvectors are greater than zero).
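As a small aside (not from the slides), irreducibility can be checked directly through the support graph: a non-negative N x N matrix W is irreducible iff every vertex of G_W can reach every other vertex, which holds iff (I + W)^(N-1) has no zero entries. A minimal sketch, assuming NumPy and a modest N; the is_irreducible name is illustrative:

```python
import numpy as np

def is_irreducible(W):
    """Check irreducibility of a non-negative square matrix via its support graph.

    W is irreducible iff every vertex can reach every other vertex, i.e.
    (I + W)^(N-1) has no zero entries (it counts walks of length <= N-1).
    """
    n = W.shape[0]
    reach = np.linalg.matrix_power(np.eye(n) + (W > 0), n - 1)
    return bool(np.all(reach > 0))
```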

Mathematical Background: Implications of Perron-Frobenius Theory
- Lemma: let W be an irreducible N x N matrix. A sufficient condition guaranteeing that |λ_1(W)| > |λ_2(W)| is that for some index j, [W]_{j,j} > 0 (i.e. W has some non-zero element on its main diagonal).
- Corollary: let W be an irreducible N x N matrix for which |λ_1(W)| > |λ_2(W)|, and let v be a real eigenvector of W that does not correspond to λ_1(W). Then v has both positive and negative entries.
What we have learned: let W be an irreducible N x N matrix with some non-zero element on its main diagonal. Then (see the small numeric check after the next definitions):
- λ(W) = λ_1(W) > |λ_2(W)|.
- There is a unique positive unit eigenvector of W corresponding to λ(W); denote it by v_{λ(W)}.
- Let e be any non-negative (but non-zero) vector. Then <e, v_{λ(W)}> > 0.
- Any real eigenvector of W that does not correspond to λ(W) has both positive and negative entries.

Mathematical Background: Stochastic and Primitive Matrices
- Definition: a non-negative N x N matrix P is stochastic if the sum of every row of P is 1.
- Definition: the period of a directed graph G is the greatest common divisor of the lengths of all the cycles in G; G is called aperiodic if it has a period of 1.
- Definition: a non-negative N x N matrix M is called primitive if its support graph G_M is aperiodic.
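The "what we have learned" facts can be verified numerically on a toy matrix. The sketch below is an illustration, not from the slides; it assumes NumPy, and checks that an irreducible non-negative matrix with a non-zero diagonal element has a strictly dominant eigenvalue and a positive principal eigenvector:

```python
import numpy as np

# An irreducible non-negative matrix with a non-zero element on its diagonal
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [4.0, 0.0, 0.0]])

vals, vecs = np.linalg.eig(W)
order = np.argsort(-np.abs(vals))          # sort eigenvalues by decreasing absolute value
vals, vecs = vals[order], vecs[:, order]

print(np.abs(vals[0]) > np.abs(vals[1]))   # lambda_1 strictly dominates: True
v1 = np.real(vecs[:, 0])
v1 *= np.sign(v1[np.argmax(np.abs(v1))])   # fix the arbitrary sign returned by eig
print(np.all(v1 > 0))                      # the principal eigenvector is positive: True
```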

Mathematical Background: Ergodic Theorem
Ergodic Theorem: let P be an irreducible and primitive stochastic matrix. Then:
- λ(P) = λ_1(P) = 1, and any other eigenvalue λ* of P satisfies |λ*| < 1.
- There is a unique distribution row-vector π which satisfies πP = π. π is the principal left eigenvector of P, and is the stationary distribution of the Markov chain defined by the transition matrix P.
- For any distribution row-vector q, lim_{k→∞} q P^k = π.

Mathematical Background: The Power Method
The Power Method is an iterative algorithm that computes (under certain conditions) the eigenvalue of largest absolute value, and the corresponding eigenvector, of a general N x N matrix M.
The conditions:
- M has N linearly independent eigenvectors.
- |λ_1(M)| > |λ_2(M)|.
- One can choose an initial (unit) vector v_0 that is not orthogonal to the eigenvector corresponding to λ_1(M).
The iteration:
1. v_{k+1} = M v_k
2. Scale v_{k+1} to be a unit vector.
As k → ∞, v_k approaches the eigenvector corresponding to λ_1(M). Note that this iteration, without the need for scaling, also computes the principal eigenvector of a primitive and irreducible stochastic matrix: in row-vector form, q_{k+1} = q_k P remains a probability distribution at every step and converges to π.
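A minimal Python sketch of the power method, assuming NumPy and a real, simple dominant eigenvalue; the power_method helper and its defaults are illustrative, not part of the slides. Applied to the transpose of an irreducible, primitive stochastic matrix P (equivalently, run in row-vector form q ← qP), the same iteration recovers the stationary distribution π of the Ergodic Theorem:

```python
import numpy as np

def power_method(M, num_iters=100, tol=1e-10):
    """Approximate the dominant eigenvalue and eigenvector of M by power iteration."""
    n = M.shape[0]
    v = np.ones(n) / np.sqrt(n)            # initial unit vector; not orthogonal to a positive eigenvector
    for _ in range(num_iters):
        w = M @ v                           # v_{k+1} = M v_k
        w_norm = np.linalg.norm(w)
        if w_norm == 0:
            raise ValueError("M maps the current vector to zero")
        w /= w_norm                         # rescale to a unit vector
        if np.linalg.norm(w - v) < tol:     # stop once the direction stabilises
            v = w
            break
        v = w
    eigenvalue = v @ (M @ v)                # Rayleigh quotient estimate of lambda_1
    return eigenvalue, v
```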

Hubs and Authorities
Notions proposed by J. Kleinberg in his 1998 paper "Authoritative Sources in a Hyperlinked Environment". There are two types of quality Web pages that pertain to a given topic:
- Authorities: pages containing authoritative content on the topic of interest.
- Hubs: resource lists, containing links to many authorities.
Hubs and authorities exhibit a mutually reinforcing relationship.

Authorities - Authoritative Pages
[Screenshot: an example authority page for the query "movies".]

Hubs - Pages Which Link to Many Authorities
[Screenshot: an example hub page for the query "movies".]

Hypertext Induced Topic Search
HITS is an algorithm proposed by Kleinberg to identify hubs and authorities, given a topic of interest t. It:
1. Identifies a subgraph of the Web where many hubs and authorities pertaining to t should exist.
2. Assigns each page s an authority weight a(s) and a hub weight h(s) such that the following Mutual Reinforcement properties hold:
   - a(s) is proportional to Σ_{x: x→s} h(x)
   - h(s) is proportional to Σ_{x: s→x} a(x)
Such weights exist, and can be efficiently computed based on the Perron-Frobenius theory of non-negative matrices.

HITS: Assembling the Collection
- Given a query, use a term-based search engine to compile the root set: an initial set of (somewhat on-topic) pages.
- Expand the root set to its bidirectional neighborhood of radius 1.
- The sub-graph induced by this process will be denoted by G.

HITS: Analyzing the Graph G
Assign each page s (a node in G) an authority weight a(s) and a hub weight h(s) according to the following iterative algorithm:
- Initialize a(s) ← 1, h(s) ← 1 for all pages s.
- Repeat the following three operations until an equilibrium is reached (i.e. until convergence):
  - The I operation: for all pages s, a(s) ← Σ_{x: x→s} h(x)
  - The O operation: for all pages s, h(s) ← Σ_{x: s→x} a(x)
  - Normalize the resulting weight vectors.
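A minimal sketch of this iteration in Python, assuming dictionary-based adjacency data for G; the hits_scores helper and its defaults are illustrative, not Kleinberg's original code:

```python
def hits_scores(in_links, out_links, num_iters=50):
    """I/O/normalize iteration over the pages of G, as described above.

    in_links[s]  : pages x with a link x -> s
    out_links[s] : pages x with a link s -> x
    Returns (authority, hub) weight dictionaries.
    """
    pages = list(out_links)
    a = {s: 1.0 for s in pages}
    h = {s: 1.0 for s in pages}
    for _ in range(num_iters):
        a = {s: sum(h[x] for x in in_links.get(s, [])) for s in pages}    # I operation
        h = {s: sum(a[x] for x in out_links.get(s, [])) for s in pages}   # O operation
        a_norm = sum(v * v for v in a.values()) ** 0.5 or 1.0             # normalize both vectors
        h_norm = sum(v * v for v in h.values()) ** 0.5 or 1.0
        a = {s: v / a_norm for s, v in a.items()}
        h = {s: v / h_norm for s, v in h.items()}
    return a, h
```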

HITS: Matrix Notation
Let W denote the adjacency matrix of G, a = (a(s_1), ..., a(s_n))^T, h = (h(s_1), ..., h(s_n))^T.
- The I operation: a ← W^T h.
- The O operation: h ← W a.
Hence (without normalizing):
- a_{k+1} ← W^T (W a_k) = (W^T W) a_k
- h_{k+1} ← W (W^T h_k) = (W W^T) h_k

HITS - Algebraic Interpretation
Repeating the three-step iteration (I, O, Normalize) is equivalent to applying the power method, which computes the principal eigenvector of a matrix:
- lim_{k→∞} a_k = principal eigenvector of W^T W
- lim_{k→∞} h_k = principal eigenvector of W W^T
The principal community of authorities is defined as the m pages with the highest authority weights in a; the principal community of hubs is similarly defined in terms of h.
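This algebraic view can be checked directly on a toy graph: computing the principal eigenvectors of W^T W and W W^T with an eigensolver gives the limits of the a and h iterates. A small sketch, assuming NumPy; the example graph and the principal_eigvec helper are illustrative:

```python
import numpy as np

# Adjacency matrix of a tiny G: pages 0 and 1 are hubs linking to authorities 2 and 3
W = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

def principal_eigvec(S):
    """Principal eigenvector of a symmetric matrix S, with a non-negative sign convention."""
    vals, vecs = np.linalg.eigh(S)                 # S is symmetric, so eigh applies
    v = vecs[:, np.argmax(vals)]
    return v * np.sign(v[np.argmax(np.abs(v))])    # fix the arbitrary sign

a = principal_eigvec(W.T @ W)   # authority weights: concentrated on pages 2 and 3
h = principal_eigvec(W @ W.T)   # hub weights: concentrated on pages 0 and 1
print(np.round(a, 3), np.round(h, 3))
```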

HITS: the Role of the Matrices
- W^T W is the co-citation matrix of the collection: [W^T W]_{i,j} is the number of pages that link to both i and j.
- W W^T is the bibliographic coupling matrix of the collection: [W W^T]_{i,j} is the number of pages to which both i and j link.
- Both matrices are non-negative and symmetric. Thus their eigenvalues are all real, and they have N linearly independent eigenvectors that span R^N.
- If W is non-zero, both matrices have at least one non-zero element on their main diagonal.
- If the matrices are irreducible, each has a unique positive unit principal eigenvector, and all non-principal eigenvectors contain both positive and negative entries.

HITS: Non-Principal Communities
- Denote by λ_1 > λ_2 >= ... >= λ_N the eigenvalues of W^T W (respectively W W^T), and by a_i (respectively h_i) the corresponding eigenvectors.
- Weights derived from the pairs (a_i, h_i) also exhibit the Mutual Reinforcement property.
- When i > 1, the eigenvectors contain both positive and negative entries.
- The non-principal communities are determined by a_2, ..., a_m and h_2, ..., h_m (for some small m). Each non-principal eigenvector yields two communities, corresponding to its positive [negative] entries of highest absolute value.
- Experiments show that in ambiguous/polarized topics, the polar communities of a non-principal eigenvector correspond to opposing/orthogonal aspects of the topic.
- a_2, ..., a_m and h_2, ..., h_m are computed by applying the power method with Gram-Schmidt steps.
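A sketch of how a_2 might be obtained with the power method plus a Gram-Schmidt step, assuming NumPy, a symmetric matrix S = W^T W, and an already-computed unit-norm principal eigenvector a1; the second_eigenvector helper is illustrative, not the slides' implementation:

```python
import numpy as np

def second_eigenvector(S, a1, num_iters=200):
    """Power method with a Gram-Schmidt step: the eigenvector of the symmetric
    matrix S with the second-largest eigenvalue, given its unit-norm principal
    eigenvector a1."""
    v = np.random.rand(S.shape[0])
    for _ in range(num_iters):
        v = S @ v
        v -= (v @ a1) * a1           # project out the principal direction
        v /= np.linalg.norm(v)       # rescale to a unit vector
    return v
```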

PageRank (Brin & Page, 1998)
- Named after Google's co-founder, Larry Page.
- A global, query-independent importance measure of Web pages.
- A page is considered important if it receives many links from important pages.
- Based on Markov chains and random walks.

PageRank: Random Surfer Model
A random surfer moves from page to page. Upon leaving page p, the surfer chooses one of two actions:
1. Follows an outgoing link of p (chosen uniformly at random), with probability d. (See the next slide for a discussion of pages that have no outlinks.)
2. Jumps to an arbitrary Web page (chosen uniformly at random), with probability 1-d.
The vector of PageRanks is the stationary distribution of this (ergodic) random walk.

PageRank - Handling Dangling Nodes
PageRank as stated on the previous slide is not well defined with respect to exiting pages that have no outgoing links (dangling nodes). There are three accepted approaches for treating such pages:
1. Eliminate them from the graph (iteratively prune the graph until reaching a steady state).
2. Consider such pages to link back to the pages that link to them.
3. Consider such pages to link to all Web pages (effectively making an exit from them equivalent to a random jump).

PageRank: Steady-State Equations
The PageRanks obey the following equations:
R(p) = (1-d)/N + d · Σ_{j ∈ I(p)} R(j)/D(j)
where:
- R(p) - the PageRank of page p
- d - a damping factor, 0 < d < 1
- N - the number of Web pages
- I(p) - the set of pages that point to p
- D(j) - the number of out-links (out-degree) of page j
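A minimal sketch of iterating these equations to a fixed point, assuming a dictionary of out-links; the pagerank helper, its defaults, and the choice of approach 3 for dangling nodes are illustrative assumptions, not the slides' implementation:

```python
import numpy as np

def pagerank(out_links, d=0.85, num_iters=100, tol=1e-10):
    """Iterate R(p) = (1-d)/N + d * sum_{j in I(p)} R(j)/D(j).

    out_links: dict mapping each page to the list of pages it links to.
    Dangling nodes are treated as linking to all pages (approach 3 above).
    """
    pages = list(out_links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    r = np.full(n, 1.0 / n)                       # start from the uniform distribution
    for _ in range(num_iters):
        r_new = np.full(n, (1.0 - d) / n)
        for j, targets in out_links.items():
            if targets:                           # distribute R(j)/D(j) to each target of j
                share = d * r[idx[j]] / len(targets)
                for p in targets:
                    r_new[idx[p]] += share
            else:                                 # dangling node: spread its mass uniformly
                r_new += d * r[idx[j]] / n
        if np.abs(r_new - r).sum() < tol:
            r = r_new
            break
        r = r_new
    return dict(zip(pages, r))
```

For example, pagerank({"a": ["b"], "b": ["a", "c"], "c": []}) returns a dictionary of PageRank values that sum to 1.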

PageRank - Algebraic Notation
- Let W denote the N x N adjacency matrix of the Web's link structure, after some form of handling of dangling nodes.
- Let W_norm denote the matrix that results from dividing each row j of W by j's out-degree (row j's sum).
- Let T be an N x N matrix whose entries are all equal to 1/N.
- Define M = (1-d)T + d·W_norm.
- M is the transition matrix corresponding to PageRank's random walk, and M's principal left eigenvector is the vector of PageRanks.

PageRank - Alternative Interpretation
Repeat the following experiment:
- Sample a value L from a Geometric distribution with parameter 1-d, i.e. P(L) = (1-d)·d^{L-1}.
- Starting from a random Web page, walk a random path of length L-1, where each page is exited through one of its outlinks chosen uniformly at random.
Essentially, PageRank can be viewed as a weighted average of the number of paths leading to each node, where the average is weighted by:
- The length of the path (the probability decreases geometrically).
- Each edge is traversed with probability inversely proportional to the out-degree of the node from which it originates.
R(p) = (1-d)/N · Σ_{L=1}^{∞} d^{L-1} · Σ_{v ∈ G} Prob(path from v to p of length L-1)
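The path-based view suggests a simple Monte Carlo estimate: repeatedly sample a walk of geometric length and record where it ends. A sketch, assuming dictionary-based out-links and the simplification that a walk reaching a dangling node simply stops; the pagerank_monte_carlo helper is illustrative. With many walks, the endpoint frequencies approximate the stationary vector of the matrix M defined above:

```python
import random
from collections import Counter

def pagerank_monte_carlo(out_links, d=0.85, num_walks=100_000):
    """Estimate PageRank by simulating geometric-length random paths.

    out_links: dict mapping each page to the list of pages it links to.
    A walk that reaches a dangling node ends there (a simplifying assumption).
    """
    pages = list(out_links)
    visits = Counter()
    for _ in range(num_walks):
        page = random.choice(pages)        # start at a uniformly random page
        while random.random() < d:         # continue with probability d (geometric length)
            targets = out_links[page]
            if not targets:                # dangling node: stop the walk
                break
            page = random.choice(targets)  # follow a uniformly random outlink
        visits[page] += 1                  # record the endpoint of the walk
    return {p: visits[p] / num_walks for p in pages}
```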