What is this Page Known for? Computing Web Page Reputations Davood Rafiei University of Alberta http://www.cs.ualberta.ca/~drafiei Joint work with Alberto Mendelzon (U. of Toronto) 1 Outline Scenarios and Motivation Three definitions of rank Page Rank Reputation Measure Hubs and Authorities The TOPIC prototype Future work 2
Scenarios Search engine Search-U-Matic just returned 80,000 pages on query liver disease. Where should I start? We are spending $200K/year maintaining our Web pages. Are we projecting the right image? Prof. X, an expert on Canadian literature is up for tenure. How well-known is her research on the Web? How does our site rank in popularity among all Linux sites? 3 Web Users Make poor queries Short: 1 or 2 terms Imprecise: terms often have different meanings Behaviour: low effort 78% of queries are not modified 85% browse only the first screen Follow links 4
Our Ranking Problem: Given a page p and topic t, compute the rank of p on t. Typical Usage: Given topic t (or page p), find a ranked list pages (or topics). Approach: Analyze links to find pages that are better / well-known / more authoritative than others on some topics. 5 Defining Rank: Simple Importance Ranking u x v Citation Analysis: rank(p) = number of papers that cite paper p. Web: citation=link, use in-degree of a node in the Web graph. y 6
Problems All pages are treated equally. This is not true on the Web. Yahoo is much better maintained than my personal home page in geocity.com. It is topic independent. A high rank on travel Alberta doesn t imply a high rank on brain surgery. 7 Def. 1: Page Rank (Brin and Page 1998; Geller 1978 in bibliometrics) Problem: given page p, compute its rank. Idea: a page is good if lots of good pages point to it. The rank of a page depends on not only the number of its incoming links, but also the ranks of those links. Adopted by Google search engine. high-ranked pages are returned first. 8
! % & ( ), # $ Page Rank Formulation Perform a one-level random walk: " At each step: With prob. d>0 jump to a random page, and With prob. (1-d) follow a random link from the current page. Rank(p) = probability, in the limit, of hitting page p. R ( p ) = (1 d ) q p ) Limitation: ' query and topic-independent. R ( q ) + O ( q d N 9 Our One-Level Random Walk Imagine a user searching for pages on topic t. The user at each step * With prob. d>0 jumps into a random page on topic t, + With prob. (1-d) follows an outgoing link of the current page. Rank(p, = probability, in the limit, of hitting page p when searching for topic t. 10
-. Probability of Visiting a Page q p R n n 1 d R ( q, ( p, = (1 d) + Nt q p O( q) 0 if pagep is on topict otherwise 11 Def. 2: Reputation Measurement Problem: Given page p and topic t, compute the rank of p on t. Idea: consider search engines compared my favorite search engines a review of search engines p What can we say about page p? 12
/ 2 3 6 7 : 1 5 Reputation Measurement Formulation Let 0 I(p, = number of pages on topic t that point to p, Nt = number of pages on topic t. Define RM ( p, = I ( p, / N t Compute using search engine queries 4 +link:p +t and +t. 13 Def. 3: Hubs and Authorities (Kleinberg 1998) Problem: Given topic t, find pages p with high rank on t. Idea: 8 A page is a good hub for t if it points to good authorities on t. 9 A page is a good authority on t if good hubs for t point to it. Algorithm: finding authorities on t ; Issue the query t to a search engine, < Take the first 200 answers and add pages at distance 1. = Compute hubs and authorities for t within this set. 14
> A D F H @ B C A Two-level Random Walk Imagine the user at each step? With probability d>0 jumps into a page on topic t chosen uniformly at random, With probability (1-d) follows a random link forward/backward from the current page alternating directions. Pages accumulate Forward visits, Backward visits. 15 Probability of Visiting a Page Authority Rank of page p on term t: E A(p, = probability of a forward visit to page p when searching for term t. Hub Rank of page p on term t G H(p, = probability of a backward visit to page p when searching for term t. Theorem: If d>0, the two-level random walk has a unique stationary distribution denoted by A(p, and H(p,. 16
I L K N O Inverting H&A Computation Topic H&A Pages Page? Topics 17 Two Solutions Search engine solution: J A large crawl of the Web is available, Find authorities on t for each term t Real-time solution: M Approximate the search engine solution by: Starting from some set of pages and the terms that appear in them, Iteratively expanding this set. 18
Search Engine Solution For every page p and term t A ( p, = H ( p, = 1 2 N t A( p, = H ( p, = 0 if t appears otherwise in p While changes occur A ( p, t ) = (1 d ) q p H ( q, t ) O ( q ) d + 2 N 0 t if t appears in page p otherwise H ( p, = (1 d ) p q A( q, + I ( q) d 2 N 0 t if page p is on topic t otherwise 19 Real-time Solution Set of pages: p q i Set of terms: all terms t that appear in p or some of the q i s. 20
P Q Real-time Algorithm (using the one-level model for simplicity) R ( p, t ) = d N t For i =1,2,,k For each path q1 q2... qi p, For each term t in q 1 i (1 d) R( p, = R( p, + i O q ( i ) j= 1 d N t 21 Simplification k=1, O(q)=constant R( 1 p, = C (q contains q p N t That is, R(p, I(p,/Nt 22
R S U V W TOPIC:Current Prototype URL:www.cs.toronto.edu/db/topic A crude approximation Given page p T Find 1,000 pages q that link to p (using Alta Vista or Lycos) From each q snippet, extract all terms t Remove internal links and duplicate snippets. Apply the real-time algorithm with d=0.10, k=1, O(q)=7.2. 23 24
What is page www.macleans.ca known for? Example 1 - Maclean s Magazine 2 - macleans 3 - Canadian Universities 25 Example sunsite.unc.edu/javafaq/javafaq.html Java FAQ comp.lang.java FAQ Java Tutorials Java Stuff 26
X Z \ ^ ` b d Example: Authorities on (+censorship +ne www.eff.org Y Anti-Censorship, Join the Blue Ribbon, Blue Ribbon Campaign, Electronic Frontier Foundation www.cdt.org [ Center for Democracy and Technology, Communications Decency Act, Censorship, Free Speech, Blue Ribbon www.aclu.org ] ACLU, American Civil Liberties Union, Communications Decency Act 27 Example: Personal Home Pages www.w3.org/people/berners-lee _ History Of The Internet, Tim Berners-Lee, Internet History, W3C www-db.stanford.edu/~ullman a Jeffrey D Ullman, Database Systems, Data Mining, Programming Languages www.cs.ualberta.ca/~jonathan c Jonathan Shaeffer, Chess, Alberta, Games, Computer, University www.cs.ualberta.ca/~greiner e Machine Learning 28
f i k m o q g h j Institutional Home Pages www.almaden.ibm.com (7424) IBM Almaden Research Center, Data Mining, Physics, Physical, ACM, Computer Science www.research.microsoft.com (7686) Computer Vision, Microsoft Research, Knowledge Discovery, Data Mining, Computer Graphics, Computer Science 29 Canadian CS Departments www.cs.toronto.edu (8400) l Russian History, Neural, Travel, Hockey www.cs.ualberta.ca (10557) n University of Alberta, Virtual Reality, Language, Chess, Artificial www.cs.ubc.ca (17598) p Confocal, Periodic Table, Anime, Computer Science, Manga www.cs.sfu.ca (2055) r Whales, Simon Fraser University, Data Mining, Reasoning 30
s t u v w Comparing Reputations CNN BBC ABC wired.com Int l News 0.0237 0.0097 0.0003 0.0044 Weather 0.0121 0.0052 0.0008 0.0006 Sports 0.0070 0.0004 0 0.0028 Entertainment 0.0040 0.0015 0.0013 0.0012 Travel 0.0030 0.0008 0.0012 0.0005 Technology 0.0017 0.0006 0.0006 0.0079 Business 0.0017 0.0006 0.0004 0.0031 31 TOPIC:Limitations Approximate solution Often a few links (out of a much larger se examined. Use of snippets All links are equal Simplistic notion of topic 32
x { } ~ y z Current/Future Work Ranking within the Web-in-a-box server 170 million pages 1 billion links Implementing the search engine solution Heavy computation! Approximate solutions 33 Current/Future Work Applications: Reputation server Search engine ranking clustering 34