What is this Page Known for? Computing Web Page Reputations. Outline

Similar documents
Link Analysis Ranking

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Online Social Networks and Media. Link Analysis and Web Search

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Slides based on those in:

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Online Social Networks and Media. Link Analysis and Web Search

1 Searching the World Wide Web

Data Mining and Matrices

Jeffrey D. Ullman Stanford University

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Information Retrieval and Search. Web Linkage Mining. Miłosz Kadziński

CS6220: DATA MINING TECHNIQUES

1998: enter Link Analysis

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Degree Distribution: The case of Citation Networks

Lecture 12: Link Analysis for Web Retrieval

Models and Algorithms for Complex Networks. Link Analysis Ranking

How does Google rank webpages?

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

Faloutsos, Tong ICDE, 2009

Data and Algorithms of the Web

Finding central nodes in large networks

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Math 304 Handout: Linear algebra, graphs, and networks.

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

STA141C: Big Data & High Performance Statistical Computing

CS6220: DATA MINING TECHNIQUES

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

Jeffrey D. Ullman Stanford University

A Note on Google s PageRank

0.1 Naive formulation of PageRank

Data Mining Recitation Notes Week 3

Link Analysis. Leonid E. Zhukov

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Finding Authorities and Hubs From Link Structures on the World Wide Web

Introduction to Data Mining

CS 188: Artificial Intelligence

Perturbation of the hyper-linked environment

CS 343: Artificial Intelligence

Computing PageRank using Power Extrapolation

Approximate Inference

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Fall Recap: Inference Example

Node Centrality and Ranking on Networks

Inf 2B: Ranking Queries on the WWW

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Updating PageRank. Amy Langville Carl Meyer

Link Mining PageRank. From Stanford C246

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Link Analysis. Stony Brook University CSE545, Fall 2016

arxiv:cond-mat/ v1 3 Sep 2004

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

IR: Information Retrieval

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Statistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t

Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data

Cross-lingual and temporal Wikipedia analysis

Analysis (SALSA) and the TKC Eect. R. Lempel S. Moran. Department of Computer Science. The Technion, Haifa 32000, Israel

Announcements. CS 188: Artificial Intelligence Fall Markov Models. Example: Markov Chain. Mini-Forward Algorithm. Example

Calculating Web Page Authority Using the PageRank Algorithm

Lecture 7 Mathematics behind Internet Search

Analysis (SALSA) and the TKC Eect. R. Lempel S. Moran. Department of Computer Science. The Technion, Haifa 32000, Israel

ECEN 689 Special Topics in Data Science for Communications Networks

Introduction to Information Retrieval

CS 188: Artificial Intelligence Spring 2009

CS249: ADVANCED DATA MINING

Complex Social System, Elections. Introduction to Network Analysis 1

CS47300: Web Information Search and Management

Lecture VI Introduction to complex networks. Santo Fortunato

How Does Google?! A journey into the wondrous mathematics behind your favorite websites. David F. Gleich! Computer Science! Purdue University!

Markov Chains and Hidden Markov Models

Web Structure Mining Nodes, Links and Influence

Applications. Nonnegative Matrices: Ranking

Introduction to Information Retrieval

Machine Learning CPSC 340. Tutorial 12

1 Equilibrium Comparisons

COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017

PageRank: Ranking of nodes in graphs

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

CS 5522: Artificial Intelligence II

Information Retrieval and Organisation

Are You Maximizing The Value Of All Your Data?

Eigenvalue Problems Computation and Applications

MATH36061 Convex Optimization

Local properties of PageRank and graph limits. Nelly Litvak University of Twente Eindhoven University of Technology, The Netherlands MIPT 2018

Uncertainty and Randomization

Transcription:

What is this Page Known for? Computing Web Page Reputations Davood Rafiei University of Alberta http://www.cs.ualberta.ca/~drafiei Joint work with Alberto Mendelzon (U. of Toronto) 1 Outline Scenarios and Motivation Three definitions of rank Page Rank Reputation Measure Hubs and Authorities The TOPIC prototype Future work 2

Scenarios Search engine Search-U-Matic just returned 80,000 pages on query liver disease. Where should I start? We are spending $200K/year maintaining our Web pages. Are we projecting the right image? Prof. X, an expert on Canadian literature is up for tenure. How well-known is her research on the Web? How does our site rank in popularity among all Linux sites? 3 Web Users Make poor queries Short: 1 or 2 terms Imprecise: terms often have different meanings Behaviour: low effort 78% of queries are not modified 85% browse only the first screen Follow links 4

Our Ranking Problem: Given a page p and topic t, compute the rank of p on t. Typical Usage: Given topic t (or page p), find a ranked list pages (or topics). Approach: Analyze links to find pages that are better / well-known / more authoritative than others on some topics. 5 Defining Rank: Simple Importance Ranking u x v Citation Analysis: rank(p) = number of papers that cite paper p. Web: citation=link, use in-degree of a node in the Web graph. y 6

Problems All pages are treated equally. This is not true on the Web. Yahoo is much better maintained than my personal home page in geocity.com. It is topic independent. A high rank on travel Alberta doesn t imply a high rank on brain surgery. 7 Def. 1: Page Rank (Brin and Page 1998; Geller 1978 in bibliometrics) Problem: given page p, compute its rank. Idea: a page is good if lots of good pages point to it. The rank of a page depends on not only the number of its incoming links, but also the ranks of those links. Adopted by Google search engine. high-ranked pages are returned first. 8

! % & ( ), # $ Page Rank Formulation Perform a one-level random walk: " At each step: With prob. d>0 jump to a random page, and With prob. (1-d) follow a random link from the current page. Rank(p) = probability, in the limit, of hitting page p. R ( p ) = (1 d ) q p ) Limitation: ' query and topic-independent. R ( q ) + O ( q d N 9 Our One-Level Random Walk Imagine a user searching for pages on topic t. The user at each step * With prob. d>0 jumps into a random page on topic t, + With prob. (1-d) follows an outgoing link of the current page. Rank(p, = probability, in the limit, of hitting page p when searching for topic t. 10

-. Probability of Visiting a Page q p R n n 1 d R ( q, ( p, = (1 d) + Nt q p O( q) 0 if pagep is on topict otherwise 11 Def. 2: Reputation Measurement Problem: Given page p and topic t, compute the rank of p on t. Idea: consider search engines compared my favorite search engines a review of search engines p What can we say about page p? 12

/ 2 3 6 7 : 1 5 Reputation Measurement Formulation Let 0 I(p, = number of pages on topic t that point to p, Nt = number of pages on topic t. Define RM ( p, = I ( p, / N t Compute using search engine queries 4 +link:p +t and +t. 13 Def. 3: Hubs and Authorities (Kleinberg 1998) Problem: Given topic t, find pages p with high rank on t. Idea: 8 A page is a good hub for t if it points to good authorities on t. 9 A page is a good authority on t if good hubs for t point to it. Algorithm: finding authorities on t ; Issue the query t to a search engine, < Take the first 200 answers and add pages at distance 1. = Compute hubs and authorities for t within this set. 14

> A D F H @ B C A Two-level Random Walk Imagine the user at each step? With probability d>0 jumps into a page on topic t chosen uniformly at random, With probability (1-d) follows a random link forward/backward from the current page alternating directions. Pages accumulate Forward visits, Backward visits. 15 Probability of Visiting a Page Authority Rank of page p on term t: E A(p, = probability of a forward visit to page p when searching for term t. Hub Rank of page p on term t G H(p, = probability of a backward visit to page p when searching for term t. Theorem: If d>0, the two-level random walk has a unique stationary distribution denoted by A(p, and H(p,. 16

I L K N O Inverting H&A Computation Topic H&A Pages Page? Topics 17 Two Solutions Search engine solution: J A large crawl of the Web is available, Find authorities on t for each term t Real-time solution: M Approximate the search engine solution by: Starting from some set of pages and the terms that appear in them, Iteratively expanding this set. 18

Search Engine Solution For every page p and term t A ( p, = H ( p, = 1 2 N t A( p, = H ( p, = 0 if t appears otherwise in p While changes occur A ( p, t ) = (1 d ) q p H ( q, t ) O ( q ) d + 2 N 0 t if t appears in page p otherwise H ( p, = (1 d ) p q A( q, + I ( q) d 2 N 0 t if page p is on topic t otherwise 19 Real-time Solution Set of pages: p q i Set of terms: all terms t that appear in p or some of the q i s. 20

P Q Real-time Algorithm (using the one-level model for simplicity) R ( p, t ) = d N t For i =1,2,,k For each path q1 q2... qi p, For each term t in q 1 i (1 d) R( p, = R( p, + i O q ( i ) j= 1 d N t 21 Simplification k=1, O(q)=constant R( 1 p, = C (q contains q p N t That is, R(p, I(p,/Nt 22

R S U V W TOPIC:Current Prototype URL:www.cs.toronto.edu/db/topic A crude approximation Given page p T Find 1,000 pages q that link to p (using Alta Vista or Lycos) From each q snippet, extract all terms t Remove internal links and duplicate snippets. Apply the real-time algorithm with d=0.10, k=1, O(q)=7.2. 23 24

What is page www.macleans.ca known for? Example 1 - Maclean s Magazine 2 - macleans 3 - Canadian Universities 25 Example sunsite.unc.edu/javafaq/javafaq.html Java FAQ comp.lang.java FAQ Java Tutorials Java Stuff 26

X Z \ ^ ` b d Example: Authorities on (+censorship +ne www.eff.org Y Anti-Censorship, Join the Blue Ribbon, Blue Ribbon Campaign, Electronic Frontier Foundation www.cdt.org [ Center for Democracy and Technology, Communications Decency Act, Censorship, Free Speech, Blue Ribbon www.aclu.org ] ACLU, American Civil Liberties Union, Communications Decency Act 27 Example: Personal Home Pages www.w3.org/people/berners-lee _ History Of The Internet, Tim Berners-Lee, Internet History, W3C www-db.stanford.edu/~ullman a Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages www.cs.ualberta.ca/~jonathan c Jonathan Shaeffer, Chess, Alberta, Games, Computer, University www.cs.ualberta.ca/~greiner e Machine Learning 28

f i k m o q g h j Institutional Home Pages www.almaden.ibm.com (7424) IBM Almaden Research Center, Data Mining, Physics, Physical, ACM, Computer Science www.research.microsoft.com (7686) Computer Vision, Microsoft Research, Knowledge Discovery, Data Mining, Computer Graphics, Computer Science 29 Canadian CS Departments www.cs.toronto.edu (8400) l Russian History, Neural, Travel, Hockey www.cs.ualberta.ca (10557) n University of Alberta, Virtual Reality, Language, Chess, Artificial www.cs.ubc.ca (17598) p Confocal, Periodic Table, Anime, Computer Science, Manga www.cs.sfu.ca (2055) r Whales, Simon Fraser University, Data Mining, Reasoning 30

s t u v w Comparing Reputations CNN BBC ABC wired.com Int l News 0.0237 0.0097 0.0003 0.0044 Weather 0.0121 0.0052 0.0008 0.0006 Sports 0.0070 0.0004 0 0.0028 Entertainment 0.0040 0.0015 0.0013 0.0012 Travel 0.0030 0.0008 0.0012 0.0005 Technology 0.0017 0.0006 0.0006 0.0079 Business 0.0017 0.0006 0.0004 0.0031 31 TOPIC:Limitations Approximate solution Often a few links (out of a much larger se examined. Use of snippets All links are equal Simplistic notion of topic 32

x { } ~ y z Current/Future Work Ranking within the Web-in-a-box server 170 million pages 1 billion links Implementing the search engine solution Heavy computation! Approximate solutions 33 Current/Future Work Applications: Reputation server Search engine ranking clustering 34