CS47300: Web Information Search and Management


Using Graph Structure for Retrieval
Prof. Chris Clifton, 24 September 2018
Material adapted from slides created by Dr. Rong Jin (formerly Michigan State, now at Alibaba)

Ad-Hoc Retrieval: Beyond the Words
The Web is a graph.
Each web site corresponds to a node.
A link from one site to another forms a directed edge.

Jan-18 2 Christopher W. Clifton

Citation Analysis

Ad-Hoc Retrieval: Beyond the Words
The Web is a graph: each web site corresponds to a node, and a link from one site to another forms a directed edge.
What does it look like?
The Web is a small world.
The diameter of the Web is 19, i.e., the average number of clicks from one web site to another is 19.
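The "average number of clicks" above can be made concrete with a breadth-first search from every node. The following is a minimal sketch on a made-up four-site graph; the function name `average_clicks` and the graph `toy_web` are illustrative assumptions, not part of the original slides.

```python
from collections import deque


def average_clicks(graph):
    """Mean shortest-path length over all ordered pairs (u, v), u != v,
    where v is reachable from u by following directed links."""
    total, pairs = 0, 0
    for source in graph:
        # Breadth-first search gives the minimum number of clicks from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: site -> list of sites it links to.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
print(average_clicks(toy_web))
```

On a graph the size of the Web one would sample source nodes rather than search from every node; the exhaustive loop here is only meant to show the definition.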

Bowtie Structure
[Figure: bowtie structure of the Web. Labels: sites that link towards the center of the web; Strongly Connected Component; sites that link from the center of the web.]
Broder et al., 2000

Inlinks and Outlinks
The degrees of both incoming and outgoing links follow a power law.
Broder et al., 2000

So what good is Link Structure?
When you can't find something, do you:
Keep looking in the same place?
Look somewhere else?
Give up?
Ask for help?
Other people may already know the answer!
Links reflect human judgement.

Early Approaches
Basic assumptions:
Hyperlinks contain information about the human judgement of a site.
The more incoming links a site has, the more important it is judged to be.
Bray 1996:
The visibility of a site is measured by the number of other sites pointing to it.
The luminosity of a site is measured by the number of other sites to which it points.
Limitation: failure to capture the relative importance of different parent (child) sites.
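Visibility and luminosity as described above are simply in-degree and out-degree counts. A minimal sketch, assuming the links are given as (source, target) pairs; the function name and the sample data are hypothetical.

```python
from collections import Counter


def visibility_luminosity(links):
    """Count, per site, how many other sites point to it (visibility)
    and how many other sites it points to (luminosity)."""
    visibility = Counter()
    luminosity = Counter()
    # set() drops repeated links; self-links are skipped because the
    # definitions speak of *other* sites.
    for source, target in set(links):
        if source == target:
            continue
        luminosity[source] += 1
        visibility[target] += 1
    return visibility, luminosity


# Hypothetical link list; the duplicate ("a", "c") is counted once.
links = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "c")]
vis, lum = visibility_luminosity(links)
print(vis["c"], lum["c"])
```

Both counts treat every linking site alike, which is exactly the limitation noted above: a link from an obscure site counts as much as a link from an important one.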

HITS - Kleinberg's Algorithm
HITS: Hypertext Induced Topic Selection
For each vertex $v \in V$ in a subgraph of interest:
$a(v)$: the authority of $v$
$h(v)$: the hubness of $v$
A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.
Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites.

Authority and Hubness
[Figure: node 1 receives links from nodes 2, 3, and 4, and links to nodes 5, 6, and 7.]
$a(1) = h(2) + h(3) + h(4)$
$h(1) = a(5) + a(6) + a(7)$

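Authority and Hubness: Version 1
Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

HubsAuthorities(G)
  $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  $t \leftarrow 0$
  repeat
    $t \leftarrow t + 1$
    for each $v$ in $V$ do
      $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
      $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
  until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
  return $(a_t, h_t)$

Problems?

Authority and Hubness: Version 2
Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
+ Normalization:
$a(v) \leftarrow \frac{a(v)}{\sum_w a(w)}$
$h(v) \leftarrow \frac{h(v)}{\sum_w h(w)}$

HubsAuthorities(G)
  $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  $t \leftarrow 0$
  repeat
    $t \leftarrow t + 1$
    for each $v$ in $V$ do
      $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
      $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
    $a_t \leftarrow a_t / \|a_t\|$
    $h_t \leftarrow h_t / \|h_t\|$
  until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
  return $(a_t, h_t)$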
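Version 2 can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the graph is passed as a dictionary of child lists, scores are normalized to sum to 1, and the names `hits`, `normalize`, `eps`, and `max_iter` as well as the sample graph are assumptions made here.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1 (no-op if all are zero)."""
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()} if total else dict(scores)


def hits(children, eps=1e-10, max_iter=1000):
    """Iterate a(v) <- sum of h over parents, h(v) <- sum of a over children,
    normalizing after every round, until the scores stop changing."""
    nodes = set(children)
    for targets in children.values():
        nodes.update(targets)
    parents = {v: [] for v in nodes}
    for w in nodes:
        for v in children.get(w, ()):
            parents[v].append(w)

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates read the scores of the previous round.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children.get(v, ())) for v in nodes})
        change = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: sites 1 and 2 act as hubs for sites 3, 4 and 5.
a, h = hits({1: [3, 4], 2: [3, 4, 5]})
print(sorted(a.items()), sorted(h.items()))
```

Without the two `normalize` calls (Version 1) the scores grow without bound on most graphs, so the stopping test is never met; that is one answer to the "Problems?" question above.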

HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15.]

Authority and Hubness
Authority score:
Depends not only on the number of incoming links, but also on the quality (e.g., hubness) of the incoming links.
Hubness score:
Depends not only on the number of outgoing links, but also on the quality (e.g., authority) of the outgoing links.

Authority and Hub
Column vector $a$: $a_i$ is the authority score for the $i$-th site.
Column vector $h$: $h_i$ is the hub score for the $i$-th site.
Matrix $M$: $M_{i,j} = 1$ if the $i$-th site points to the $j$-th site, and $0$ otherwise.
[Example matrix $M$ not preserved.]

Authority and Hub
Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

Authority and Hub
Column vectors $a$ and $h$ hold the authority and hub scores; $M_{i,j} = 1$ if the $i$-th site points to the $j$-th site, and $0$ otherwise.
Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
In matrix form:
$a = M^T h$
$h = M a$

Authority and Hub
Normalization procedure: each iterate is rescaled after the update.
$a_t \propto M^T h_{t-1}$
$h_t \propto M a_{t-1}$
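The matrix form $a = M^T h$, $h = M a$ can be checked against the per-node sums on the seven-node example from the earlier slide, where $a(1) = h(2) + h(3) + h(4)$ and $h(1) = a(5) + a(6) + a(7)$. The edge list below is inferred from those two equations, and the helper names are illustrative.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


n = 7
# Sites 2, 3, 4 point to site 1; site 1 points to sites 5, 6, 7 (1-indexed).
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
M = [[0] * n for _ in range(n)]
for i, j in edges:
    M[i - 1][j - 1] = 1  # M[i][j] = 1 when site i points to site j

ones = [1] * n
a = matvec(transpose(M), ones)  # one authority update from h = (1, ..., 1)
h = matvec(M, ones)             # one hub update from a = (1, ..., 1)
print(a)  # [3, 0, 0, 0, 1, 1, 1]
print(h)  # [3, 1, 1, 1, 0, 0, 0]
```

The first entry of `a` is 3 because three hubs with score 1 point to site 1, which matches $a(1) = h(2) + h(3) + h(4)$.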

Authority and Hub
Substituting one update into the other:
$a_t \propto M^T h_{t-1} \Rightarrow a_t \propto M^T M \, a_{t-2}$
$h_t \propto M a_{t-1} \Rightarrow h_t \propto M M^T h_{t-2}$
Apply SVD to matrix $M$:
$M = U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$
$a \propto v_1$, $h \propto u_1$

PageRank
Introduced by Page et al. (1998).
The weight of a page is assigned by the rank of its parents.
Difference from HITS:
HITS separates hubness and authority weights.
A page's rank is proportional to its parents' rank, but inversely proportional to its parents' outdegree.

Matrix Notation
$M_{i,j} = 1$ if the $i$-th site points to the $j$-th site, and $0$ otherwise.
$B_{i,j} = \frac{1}{\sum_k M_{i,k}}$ if $M_{i,j} = 1$, and $0$ otherwise.
[Example matrices $M$ and $B$ not preserved.]

Matrix Notation
$r_i$ represents the rank score for the $i$-th web page.
$r(v) = \alpha \sum_{w \in pa[v]} \frac{r(w)}{|ch[w]|}$
$r = \alpha B^T r$
$1/\alpha$: eigenvalue; $r$: eigenvector of $B^T$
Finding PageRank: find the principal eigenvector of $B^T$.
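The construction of $B$ from $M$ and the search for the principal eigenvector can be sketched with power iteration. This sketch assumes every page has at least one outlink and the graph is strongly connected and aperiodic, so that $\alpha = 1$ and the iteration converges; the function names and the three-page example are assumptions made for illustration.

```python
def row_normalize(M):
    """B[i][j] = M[i][j] / (number of outlinks of page i); zero rows stay zero."""
    B = []
    for row in M:
        out = sum(row)
        B.append([x / out if out else 0.0 for x in row])
    return B


def pagerank_naive(M, iterations=100):
    """Power iteration for r = B^T r, starting from the uniform vector."""
    B = row_normalize(M)
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(iterations):
        # r_j <- sum over parents i of r_i / outdegree(i)
        r = [sum(B[i][j] * r[i] for i in range(n)) for j in range(n)]
    return r


# Hypothetical three-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
M = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
r = pagerank_naive(M)
print([round(x, 4) for x in r])  # [0.4, 0.2, 0.4]
```

Page 1 has a single parent that splits its rank over two outlinks, so it ends up with half the rank of pages 0 and 2.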

Random Walk Model
Consider a random walk through the Web graph.
[Figure: example Web graph; the entries of its transition matrix $B$ are shown as question marks.]

Random Walk Model
Consider a random walk through the Web graph with transition matrix $B$.
[Example matrix $B$ not preserved.]
As $T \to \infty$, what portion of time will the web surfer spend at each page?

Random Walk Model
Consider a random walk through the Web graph with transition matrix $B$.
$p(k)$: fraction of time that the surfer spends at the $k$-th site.
$p(k) = \sum_i p(i) B_{i,k}$
$p = B^T p$

Adding Self Loop
Allow the surfer to decide to stay at the same place.
$B' = \alpha B + (1 - \alpha) I$, with $0 < \alpha < 1$
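One reason to let the surfer stay in place is periodicity: on a two-page cycle the walk alternates forever and the update $p \leftarrow B^T p$ never settles. The sketch below shows this, writing the self-loop matrix as $B' = \alpha B + (1 - \alpha) I$; the weight name `alpha` and the helper names are assumptions, since the extracted slide text does not show the symbol.

```python
def step(p, B):
    """One step of the walk: new p(k) = sum_i p(i) * B[i][k]."""
    n = len(p)
    return [sum(p[i] * B[i][k] for i in range(n)) for k in range(n)]


def add_self_loops(B, alpha):
    """Mix the transition matrix with the identity: alpha*B + (1-alpha)*I."""
    n = len(B)
    return [[alpha * B[i][j] + (1 - alpha) * (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def walk(p, B, steps):
    """Apply `steps` transitions to the starting distribution p."""
    for _ in range(steps):
        p = step(p, B)
    return p


# Two pages that only link to each other: the walk has period 2.
B = [[0.0, 1.0],
     [1.0, 0.0]]
print(walk([1.0, 0.0], B, 100))  # back on page 0
print(walk([1.0, 0.0], B, 101))  # on page 1
print(walk([1.0, 0.0], add_self_loops(B, 0.5), 50))  # settles at [0.5, 0.5]
```

Any $p$ with $p = B^T p$ also satisfies $p = B'^T p$, so the self loops change how the iteration behaves but not the distribution it is looking for.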

Problem: Rank Sink
Many Web pages have no inlinks.
This results in dangling edges in the graph.
[Example matrix $B$ not preserved.]
r(new page) = 0

Problem: Rank Sink
Many Web pages have no outlinks.
This results in dangling edges in the graph.
[Example matrix $B$ not preserved.]
r(new page) = 1

Distribution of the Mixture Model
$H_{i,j} = 1/n$
$B' = \varepsilon H + (1 - \varepsilon) B$
$r = B'^T r$
Prevents the page ranks from being 0 or 1.

Stability
Are link analysis algorithms based on eigenvectors stable?
Will small changes in the graph result in major changes in outcomes?
What if the connectivity of a portion of the graph is changed arbitrarily? How will this affect the results of the algorithms?
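The effect of the mixture $B' = \varepsilon H + (1 - \varepsilon) B$, with $H_{i,j} = 1/n$, can be seen on a small chain that contains both a page without inlinks and a page without outlinks. This is a sketch under assumptions: the value `eps = 0.15`, the renormalization after each step (needed here because the page without outlinks leaves a zero row in $B$), and the function name are choices made for illustration.

```python
def mixture_pagerank(B, eps=0.15, iterations=200):
    """Power iteration on B' = eps*H + (1-eps)*B with H[i][j] = 1/n.
    The vector is renormalized to sum to 1 after every step."""
    n = len(B)
    r = [1.0 / n] * n
    for _ in range(iterations):
        total = sum(r)
        # (H^T r)_j = total / n for every j; (B^T r)_j sums over parents of j.
        r = [eps * total / n
             + (1 - eps) * sum(B[i][j] * r[i] for i in range(n))
             for j in range(n)]
        norm = sum(r)
        r = [x / norm for x in r]
    return r


# Hypothetical chain 0 -> 1 -> 2: page 0 has no inlinks, page 2 has no outlinks.
B = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
r = mixture_pagerank(B)
print([round(x, 3) for x in r])
```

Every page receives a share of the uniform component, so no rank collapses to 0, and the page at the end of the chain cannot absorb all of the rank.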

Stability of HITS
Ng et al. (2001)
A bound on the number of hyperlinks $k$ that can be added to or deleted from one page without affecting the authority or hubness weights.
It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ that produces a constant perturbation of the dominant eigenvector.
$k < \left( \sqrt{d + \frac{\alpha \delta}{4 + \sqrt{2}\,\alpha}} - \sqrt{d} \right)^2 \Rightarrow \|\tilde{a} - a\|_2 \le \alpha$
$\delta$: eigengap $\lambda_1 - \lambda_2$
$d$: maximum outdegree of $G$

Stability of PageRank
Ng et al. (2001)
$\|\tilde{r} - r\| \le \frac{2 \sum_{j \in V} r(j)}{\varepsilon}$
$V$: the set of vertices touched by the perturbation.
The parameter $\varepsilon$ of the mixture model has a stabilization role.
If the set of pages affected by the perturbation has a small rank, the overall change will also be small.