CS 277: Data Mining. Mining Web Link Structure. Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure

Class Presentations
In-class, Tuesday and Thursday next week.
- 2-person teams: 6 minutes, up to 6 slides (3 minutes / 3 slides per person)
- 1-person teams: 4 minutes, up to 4 slides
- PowerPoint or PDF is fine
- Must be emailed by 12 noon on the day of presentation
- Order of presentations will be announced later in the week

Web Mining
The Web is a potentially enormous data set for data mining. There are three primary aspects of Web mining:
1. Web page content, e.g., clustering Web pages based on their text content
2. Web connectivity, e.g., characterizing distributions of path lengths between pages, or determining the importance of pages from graph structure
3. Web usage, e.g., understanding user behavior from Web logs
All three are interconnected/interdependent; e.g., Google (and most search engines) use both content and connectivity.
These slides: Web connectivity.

The Web Graph
G = (V, E), where V = the set of all Web pages and E = the set of all hyperlinks.
Number of nodes? Difficult to estimate; crawling the Web is highly non-trivial; more than 10 billion.
Number of edges? |E| = O(|V|), i.e., the mean number of outlinks per page is a small constant.

The Web Graph
The Web graph is inherently dynamic: nodes and edges are continually appearing and disappearing.
We are interested in general properties of the Web graph:
- What is the distribution of the number of in-links and out-links?
- What is the distribution of the number of pages per site? (Typically power laws for many of these distributions.)
- How far apart are two randomly selected pages on the Web? What is the average distance between two random pages?
- And so on.

Social Networks
Social networks = graphs
- V = set of actors (e.g., students in a class)
- E = set of interactions (e.g., collaborations)
Typically small graphs, e.g., |V| = 10 or 50.
Long history of social network analysis (e.g., at UCI): quantitative data analysis techniques that can automatically extract structure or information from graphs.
- E.g., who is the most important actor in a network?
- E.g., are there clusters in the network?
Comprehensive reference: S. Wasserman and K. Faust, Social Network Analysis, Cambridge University Press, 1994.

Node Importance in Social Networks
The general idea is that some nodes are more important than others in terms of the structure of the graph.
In a directed graph, in-degree may be a useful indicator of importance:
- e.g., for a citation network among authors (or papers), in-degree is the number of citations => importance.
However, in-degree is only a first-order measure, in that it implicitly assumes that all edges are of equal importance.

Recursive Notions of Node Importance
w_ij = weight of the link from node i to node j
- assume Σ_j w_ij = 1 and weights are non-negative
- e.g., default choice: w_ij = 1/outdegree(i); more outlinks => less importance attached to each
Define r_j = importance of node j in a directed graph:
    r_j = Σ_i w_ij r_i,   i, j = 1, ..., n
The importance of a node is a weighted sum of the importance of the nodes that point to it.
- Makes intuitive sense
- Leads to a set of recursive linear equations

Simple Example
[Figure: a directed graph on 4 nodes. Node 1 links to node 2 with weight 1; nodes 2, 3, and 4 each have two outlinks of weight 0.5: 2 -> {1, 4}, 3 -> {2, 4}, 4 -> {2, 3}.]

Weight matrix W (row = source node, column = destination node):
       1     2     3     4
  1    0     1     0     0
  2    0.5   0     0     0.5
  3    0     0.5   0     0.5
  4    0     0.5   0.5   0

Matrix-Vector Form
Recall r_j = importance of node j:
    r_j = Σ_i w_ij r_i,   i, j = 1, ..., n
e.g., r_2 = 1·r_1 + 0·r_2 + 0.5·r_3 + 0.5·r_4 = dot product of the r vector with column 2 of W.
Let r = n x 1 vector of importance values for the n nodes, and let W = n x n matrix of link weights.
=> We can rewrite the importance equations as r = W^T r.
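
As a concrete check of the matrix-vector form (a small sketch, not part of the original slides), the following Python/NumPy snippet builds W for the 4-node example using w_ij = 1/outdegree(i), verifies that its rows sum to 1, and evaluates W^T r for a candidate importance vector:

```python
import numpy as np

# Outlinks of the 4-node example (node i -> list of nodes j), 1-indexed as in the slides
outlinks = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}

n = 4
W = np.zeros((n, n))
for i, targets in outlinks.items():
    for j in targets:
        W[i - 1, j - 1] = 1.0 / len(targets)   # w_ij = 1/outdegree(i)

print(W)
print(W.sum(axis=1))        # each row sums to 1, i.e., W is row-stochastic

r = np.array([0.2, 0.4, 0.133, 0.267])         # candidate importance vector
print(W.T @ r)              # approximately reproduces r, since r = W^T r
```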

Eigenvector Formulation
We need to solve the importance equations for unknown r, with known W:
    r = W^T r
We recognize this as a standard eigenvalue problem, i.e.,
    A r = λ r,   where A = W^T,
with λ = an eigenvalue = 1 and r = the eigenvector corresponding to λ = 1.

Eigenvector Formulation (continued)
We need to solve for r in
    (W^T - λI) r = 0
Note: W is a stochastic matrix, i.e., its rows are non-negative and sum to 1. Results from linear algebra tell us that:
(a) W and W^T have the same eigenvalues (though not, in general, the same eigenvectors)
(b) for a stochastic matrix, the largest of these eigenvalues is always λ = 1
(c) r is the eigenvector of W^T corresponding to this largest eigenvalue λ = 1

Solution for the Simple Example
Solving for the leading eigenvector of W^T (eigenvalue λ = 1) gives
    r = [0.2, 0.4, 0.133, 0.267]
The results are quite intuitive; e.g., node 2, the only node pointed to by all the other nodes, is the most important.
[Figure: the same 4-node graph and weight matrix W as on the previous slides.]
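
The same solution can be obtained numerically; a minimal sketch (assuming NumPy and the W defined above) that extracts the eigenvector of W^T for the eigenvalue closest to 1:

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])

# Eigen-decomposition of W^T; the eigenvalue closest to 1 gives the importance vector r
vals, vecs = np.linalg.eig(W.T)
k = np.argmin(np.abs(vals - 1.0))
r = np.real(vecs[:, k])
r = r / r.sum()             # normalize so the importances sum to 1
print(np.round(r, 3))       # -> [0.2, 0.4, 0.133, 0.267]
```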

PageRank Algorithm: Applying this Idea to the Web
1. Crawl the Web to get the nodes (pages) and links (hyperlinks) [a highly non-trivial problem!]
2. Set the weights from each page = 1/(# of outlinks)
3. Solve for the eigenvector r (for λ = 1) of the weight matrix
Computational problem:
- Solving an eigenvector equation directly scales as O(n^3)
- For the entire Web graph, n > 10 billion (!!)
- So a direct solution is not feasible
- We can instead use the power method (iterative):
    r^(k+1) = W^T r^(k),   k = 1, 2, ...

Power Method for Solving for r
    r^(k+1) = W^T r^(k)
Define a suitable starting vector r^(1), e.g., all entries 1/n, or all entries = indegree(node)/|E|, etc.
Each iteration is a matrix-vector multiplication => O(n^2). Problematic? No: since W is highly sparse (Web pages have limited outdegree), each iteration is effectively O(n).
For sparse W, the iterations typically converge quite quickly:
- the rate of convergence depends on the spectral gap: how quickly does error(k) = (λ_2 / λ_1)^k go to 0 as a function of k?
- if λ_2 is close to 1 (= λ_1), then convergence is slow
- empirically: for a Web graph with 300 million pages, roughly 50 iterations to convergence (Brin and Page, 1998)
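
A minimal sketch of the power method on the 4-node example (the function and tolerance below are illustrative choices, not from the slides); for a Web-scale graph, W would be stored in a sparse format so that each product W^T r costs time proportional to the number of edges:

```python
import numpy as np

def power_method(W, tol=1e-10, max_iter=1000):
    """Iterate r <- W^T r until convergence; W is assumed row-stochastic."""
    n = W.shape[0]
    r = np.full(n, 1.0 / n)              # starting vector: all entries 1/n
    for _ in range(max_iter):
        r_new = W.T @ r
        r_new /= r_new.sum()             # keep r normalized to sum to 1
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])
print(np.round(power_method(W), 3))      # -> [0.2, 0.4, 0.133, 0.267]
```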

Basic Principles of Markov Chains
A discrete-time, finite-state, first-order Markov chain with K states:
- Transition matrix A = K x K matrix
- Entry a_ij = P(state_t = j | state_{t-1} = i), for i, j = 1, ..., K
- Rows sum to 1 (since Σ_j P(state_t = j | state_{t-1} = i) = 1)
- Note that P(state_t = ...) depends only on state_{t-1} (the first-order Markov property)
- P_0 = initial state distribution: P_0(i) = P(state_0 = i), i = 1, ..., K

Simple Example of a Markov Chain
K = 3 states
    A = [ 0.8  0.2  0.0
          0.0  0.9  0.1
          0.2  0.2  0.6 ]
    P_0 = [1/3  1/3  1/3]
[Figure: the corresponding 3-state transition diagram, with self-loop probabilities 0.8, 0.9, and 0.6 on states 1, 2, and 3.]

Steady-State (Equilibrium) Distribution for a Markov Chain
Irreducibility: a Markov chain is irreducible if there is a directed path from any node to any other node.
Steady-state distribution π for an irreducible Markov chain*:
- π_i = probability that, in the long run, the chain is in state i
- The π's are solutions to π = A^T π
Note that this is exactly the same form as our earlier recursive equations for node importance in a graph!
*Note: technically, for a meaningful solution π to exist, A must be both irreducible and aperiodic.
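
For the 3-state example above, the steady-state distribution can be found by simply iterating π <- A^T π (a small sketch, not part of the slides):

```python
import numpy as np

A = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.9, 0.1],
              [0.2, 0.2, 0.6]])

pi = np.array([1/3, 1/3, 1/3])     # P_0: initial state distribution
for _ in range(2000):
    pi = A.T @ pi                  # pi <- A^T pi
print(np.round(pi, 3))             # -> approximately [0.167, 0.667, 0.167]
```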

Markov Chain Interpretation of PageRank
W is a stochastic matrix (rows sum to 1) by definition, so we can interpret W as defining the transition probabilities of a Markov chain:
- w_ij = probability of transitioning from node i to node j
Markov chain interpretation: the solutions of r = W^T r are the steady-state probabilities of this Markov chain.
    page importance = steady-state Markov probabilities = leading eigenvector

The Random Surfer Interpretation
Recall that for the Web model we set w_ij = 1/outdegree(i).
Thus, using W to compute the importance of Web pages is equivalent to the following model:
- A random surfer surfs the Web for an infinitely long time.
- At each page, the surfer randomly selects an outlink to follow to the next page.
- The importance of a page = the fraction of visits the surfer makes to that page.
This is intuitive: pages with better connectivity will be visited more often.
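
The random-surfer interpretation can be checked by direct simulation on the 4-node example (an illustrative sketch; the step count and seed are arbitrary): the empirical visit fractions approach the eigenvector solution r ≈ [0.2, 0.4, 0.133, 0.267].

```python
import random
from collections import Counter

# Outlinks of the 4-node example from the earlier slides (1-indexed)
outlinks = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}

random.seed(0)
page = 1
visits = Counter()
n_steps = 1_000_000
for _ in range(n_steps):
    page = random.choice(outlinks[page])   # follow a randomly chosen outlink
    visits[page] += 1

for node in sorted(outlinks):
    print(node, round(visits[node] / n_steps, 3))   # ~0.2, 0.4, 0.133, 0.267
```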

Potential Problems
[Figure: a 4-node graph in which page 1 has no outlinks and pages 3 and 4 have no outlinks leaving {3, 4}.]
- Page 1 is a sink (no outlink).
- Pages 3 and 4 are also sinks (no outlink out of that subsystem).
- Markov chain theory tells us that no unique steady-state solution exists: depending on where you start, you end up at 1 or in {3, 4}.
- The Markov chain is reducible.

Making the Web Graph Irreducible
One simple solution is to modify the Markov chain:
- With probability α, the random surfer jumps to a page chosen uniformly at random (i.e., each page with probability 1/n, conditioned on such a jump).
- With probability 1 - α, the surfer selects an outlink (uniformly at random from the set of available outlinks).
The resulting transition graph is fully connected => the Markov chain is irreducible => a steady-state solution exists.
Typically α is chosen to be between 0.1 and 0.2 in practice.
But now the transition matrix is dense! However, the power iterations can be rewritten as
    r^(k+1) = (1 - α) W^T r^(k) + (α/n) 1,
where 1 is the n x 1 vector of ones, so the complexity is still O(n) per iteration for sparse W.
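
Below is a sketch of this damped power iteration using a sparse matrix, so each iteration remains proportional to the number of edges. The dangling-page handling (spreading the rank of pages with no outlinks uniformly) and the example graph are my own assumptions, not details given in the slides.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(outlinks, n, alpha=0.15, tol=1e-10, max_iter=100):
    """Damped PageRank: r <- (1 - alpha) W^T r + alpha/n.
    The rank of dangling pages (no outlinks) is spread uniformly, a common
    convention that the slides do not spell out."""
    rows, cols, vals = [], [], []
    for i, targets in outlinks.items():
        for j in targets:
            rows.append(i)
            cols.append(j)
            vals.append(1.0 / len(targets))      # w_ij = 1/outdegree(i)
    W = csr_matrix((vals, (rows, cols)), shape=(n, n))

    dangling = np.array([len(outlinks.get(i, [])) == 0 for i in range(n)], dtype=float)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - alpha) * (W.T @ r + (dangling @ r) / n) + alpha / n
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# A small hypothetical graph (0-indexed); page 0 has no outlinks
print(np.round(pagerank({1: [0, 2], 2: [0, 3], 3: [2]}, n=4), 3))
```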

The PageRank Algorithm
S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, in Proceedings of the 7th WWW Conference, 1998.
PageRank = the method on the previous slide, applied to the entire Web graph:
- Crawl the Web, storing both connectivity and content
- Calculate (off-line) the PageRank r for each Web page using the power iteration method
How can this be used to answer Web queries?
- Terms in the search query are used to limit the set of pages of possible interest
- Pages are then ordered for the user via the precomputed PageRanks
- The Google search engine combines r with text-based measures
This was the first demonstration that link information could be used for content-based search on the Web.

Link Structure Helps in Web Search
Singhal and Kaszkiel, WWW Conference, 2001.
SE1, etc., indicate different (anonymized) commercial search engines, all using link structure (e.g., PageRank) in their rankings.
TFIDF is a state-of-the-art (at the time) search method that does not use any link structure.

PageRank Architecture at Google
- The ranking of pages matters more than the exact values of p_i.
- Pre-compute and store the PageRank of each page; PageRank is independent of any query or textual content.
- The ranking scheme combines PageRank with textual match.
- Unpublished: many empirical parameters and much human effort.
- Criticism: ad-hoc coupling of query relevance and graph importance.
- Massive engineering effort: continually crawling the Web and updating page ranks.

Link Manipulation

Conclusions
- The PageRank algorithm was the first algorithm for link-based search; many extensions and improvements have followed (see papers on the class Web page).
- The same idea is used in social networks for determining importance.
- Real-world search involves many other aspects besides PageRank, e.g., the use of logistic regression for ranking: learning to predict the relevance of a page (represented by a bag of words) relative to a query, using historical click data (see the paper by Joachims on the class Web page).
- Additional slides (optional): the HITS algorithm, Kleinberg, 1998.

ADDITIONAL OPTIONAL SLIDES

PageRank: Limitations
- "Rich get richer" syndrome: not as democratic as originally (nobly) claimed; certainly not one vote per WWW citizen. Also, crawling frequency tends to be based on PageRank. For detailed grumblings, see www.google-watch.org, etc.
- Not query-sensitive: the random walk is the same regardless of query topic, whereas a real random surfer has topic interests. A non-uniform jumping vector is needed (see the sketch below); this would enable personalization, but requires faster eigenvector convergence. A topic of ongoing research.
- Ad hoc mix of PageRank and keyword match score: done in two steps for efficiency reasons, not quality reasons.
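
To make the "non-uniform jumping vector" point concrete, here is a hedged sketch of personalized PageRank: the uniform teleportation term α/n is replaced by α·v, where v is a probability vector concentrated on pages of interest. The biased vector below is a hypothetical illustration, not something specified in the slides.

```python
import numpy as np

def personalized_pagerank(W, v, alpha=0.15, n_iter=100):
    """Power iteration with a non-uniform jump (personalization) vector v:
    r <- (1 - alpha) W^T r + alpha v, where v is a probability vector."""
    r = v.copy()
    for _ in range(n_iter):
        r = (1 - alpha) * (W.T @ r) + alpha * v
    return r

# The 4-node example from the earlier slides
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])

uniform = np.full(4, 0.25)
biased = np.array([0.0, 0.0, 1.0, 0.0])   # a surfer who always teleports to page 3

print(np.round(personalized_pagerank(W, uniform), 3))
print(np.round(personalized_pagerank(W, biased), 3))   # page 3's rank increases
```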

HITS: Hub and Authority Rankings
J. Kleinberg, Authoritative sources in a hyperlinked environment, Proceedings of the ACM-SIAM SODA Conference, 1998.
HITS = Hypertext Induced Topic Selection
- Every page u has two distinct measures of merit: its hub score h[u] and its authority score a[u].
- Recursive quantitative definitions of hub and authority scores.
- Relies on query-time processing:
  - to select a base set V_q of links for query q, constructed by selecting a sub-graph R of the Web relevant to the query (the root set), and then adding any node u that neighbors any r in R via an inbound or outbound edge (the expanded set);
  - to deduce the hubs and authorities that exist in this sub-graph of the Web.

Authority and Hubness
[Figure: node 1 is pointed to by nodes 2, 3, and 4, and points to nodes 5, 6, and 7.]
    a(1) = h(2) + h(3) + h(4)
    h(1) = a(5) + a(6) + a(7)

Authority and Hubness Convergence
Recursive dependency:
    a(v) <- Σ_{w ∈ pa[v]} h(w)
    h(v) <- Σ_{w ∈ ch[v]} a(w)
Using linear algebra, we can prove that a(v) and h(v) converge.
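
A minimal sketch of the HITS iterations (illustrative; the normalization and iteration count are my assumptions) on the small example above, where node 1 is pointed to by nodes 2, 3, 4 and points to nodes 5, 6, 7:

```python
import numpy as np

def hits(adj, n_iter=50):
    """HITS: a <- L^T h, h <- L a, renormalizing each iteration
    (L is the unweighted link/adjacency matrix)."""
    L = np.asarray(adj, dtype=float)
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(n_iter):
        a = L.T @ h          # authority score = sum of hub scores of pages linking in
        h = L @ a            # hub score = sum of authority scores of pages linked to
        a /= a.sum()         # renormalize to avoid overflow
        h /= h.sum()
    return a, h

# Example from the slide: nodes 2, 3, 4 point to node 1; node 1 points to 5, 6, 7
n = 7
L = np.zeros((n, n))
for src, dst in [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]:
    L[src - 1, dst - 1] = 1.0
a, h = hits(L)
print(np.round(a, 3))   # node 1 gets the highest authority score
print(np.round(h, 3))   # hub scores (nodes 1-4 tie in this tiny example)
```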

HITS Example
Find a base subgraph:
- Start with a root set R = {1, 2, 3, 4}, the nodes relevant to the topic.
- Expand the root set R to include all the children and a fixed number of parents of the nodes in R.
- This gives a new set S, the base subgraph.

HITS Example Results
[Figure: authority and hubness weights for nodes 1-15 in the base subgraph.]

Stability of HITS vs. PageRank
[Figure: rankings from 5 trials in which 30% of the papers were randomly deleted, comparing HITS and PageRank.]

HITS vs. PageRank: Stability
e.g., [Ng, Zheng & Jordan, IJCAI-01 & SIGIR-01]
- HITS can be very sensitive to changes in a small fraction of the nodes/edges of the link structure.
- PageRank is much more stable, due to the random jumps.
- They propose a variant of HITS based on a bidirectional random walk (sketched below):
  - with probability d, jump to a random node (p = 1/n);
  - with probability 1 - d: on odd timesteps, take a random outlink from the current node; on even timesteps, go backward along a random inlink of the node.
- This HITS variant appears much more stable as d is increased.
- Issue: tuning d (d = 1 is most stable but useless for ranking).
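
A rough sketch of this bidirectional-random-walk idea, following my reading of the description above rather than the cited paper: on odd steps the surfer follows a random outlink, on even steps a random inlink backward, and with probability d it teleports uniformly; the alternating stationary distributions play the roles of authority and hub scores.

```python
import numpy as np

def randomized_hits(L, d=0.2, n_iter=200):
    """Alternating damped walk: forward along outlinks on odd steps, backward
    along inlinks on even steps, teleporting uniformly with probability d.
    A sketch of the stabilized-HITS idea, not the exact published algorithm."""
    L = np.asarray(L, dtype=float)
    n = L.shape[0]
    out_deg = L.sum(axis=1, keepdims=True)
    in_deg = L.sum(axis=0, keepdims=True).T
    F = np.divide(L, out_deg, out=np.zeros_like(L), where=out_deg > 0)    # forward step
    B = np.divide(L.T, in_deg, out=np.zeros_like(L), where=in_deg > 0)    # backward step
    a = np.full(n, 1.0 / n)    # distribution at "authority" (odd) timesteps
    h = np.full(n, 1.0 / n)    # distribution at "hub" (even) timesteps
    for _ in range(n_iter):
        a = d / n + (1 - d) * (F.T @ h)   # hubs push mass forward to authorities
        h = d / n + (1 - d) * (B.T @ a)   # authorities push mass back to hubs
    return a, h
```

As with the damped PageRank iteration, a real implementation would also have to decide how to treat pages with no outlinks or no inlinks (zeroed rows above), which this sketch glosses over.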

Recommended Books and Resources
- http://www.cs.berkeley.edu/~soumen/mining-the-web/
- http://www.oreilly.com/catalog/googlehks/
- http://www.google.com/apis/