Jeffrey D. Ullman Stanford University

We've had our first HC cases. Please, please, please, before you do anything that might violate the HC, talk to me or a TA to make sure it is legitimate. It is much easier to get caught than you might think.

There were a number of people who failed to upload code or HW answers properly and received no credit. Also, some people followed general SCPD directions, which you must not do in the future. We made some exceptions, e.g., allowing late code uploads. But in the future, please do not expect these sorts of exceptions to be made.

Web pages are important if people visit them a lot. But we can't watch everybody using the Web. A good surrogate for visiting pages is to assume people follow links randomly. This leads to the random-surfer model: start at a random page and repeatedly follow a random out-link from whatever page you are at. PageRank = limiting probability of being at a page.
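The random-surfer model can be tried directly by simulation. The sketch below is only an illustration: the three-page `links` dictionary, the function name `simulate_surfer`, and the step count are assumptions made here, not part of the slides. It counts how often a single surfer lands on each page.

```python
import random

# Hypothetical three-page Web: each page maps to the pages it links to.
links = {
    "yahoo": ["yahoo", "amazon"],
    "amazon": ["yahoo", "msoft"],
    "msoft": ["amazon"],
}

def simulate_surfer(links, steps, seed=0):
    """Follow random out-links for `steps` clicks; return visit frequencies."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))      # start at a random page
    visits = {p: 0 for p in links}
    for _ in range(steps):
        page = rng.choice(links[page])    # follow a random out-link
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate_surfer(links, steps=200_000)
```

For this small graph the visit frequencies should settle near 2/5, 2/5, and 1/5 for the three pages, which is the limiting probability the definition refers to.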

Solve the recursive equation: a page is important to the extent that important pages link to it. This is equivalent to the random-surfer definition of PageRank. Technically, importance = the principal eigenvector of the transition matrix of the Web. A few fixups are needed.

Number the pages 1, 2, .... Page i corresponds to row i and column i of the transition matrix M. M[i, j] = 1/n if page j links to n pages, including page i; M[i, j] = 0 if j does not link to i. M[i, j] is the probability we'll next be at page i if we are now at page j.
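A minimal sketch of building M from a link list, following the definition above. The function name `transition_matrix`, the page labels, and the example `links` dictionary are assumptions chosen for illustration; the sketch also assumes no page lists the same link twice.

```python
from fractions import Fraction

def transition_matrix(links, pages):
    """Return M as a list of rows: M[i][j] = 1/n if page j has n out-links,
    one of which goes to page i, and 0 otherwise."""
    index = {p: k for k, p in enumerate(pages)}
    M = [[Fraction(0)] * len(pages) for _ in pages]
    for j, src in enumerate(pages):
        outs = links.get(src, [])
        for dst in outs:
            M[index[dst]][j] = Fraction(1, len(outs))
    return M

pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = transition_matrix(links, pages)
```

Each column of `M` describes one source page, so a page with at least one out-link gets a column that sums to 1.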

Suppose page j links to 3 pages, including i but not x. Then in column j, the entry in row i is M[i, j] = 1/3 and the entry in row x is M[x, j] = 0.

Suppose v is a vector whose i-th component is the probability that a random walker is at page i at a certain time. If each walker then follows a random link out of its current page, the new probability distribution of walkers is given by the vector Mv.
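One step of the walk is just a matrix-vector product. The sketch below assumes a small 3-page matrix and a helper named `step`, both chosen here for illustration.

```python
from fractions import Fraction as F

def step(M, v):
    """Return Mv: the distribution of walkers after one more random click."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Hypothetical column-stochastic matrix for three pages.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

v0 = [F(1), F(0), F(0)]   # the walker is certainly at the first page
v1 = step(M, v0)          # equals the first column of M
v2 = step(M, v1)
```

Starting with all probability on one page, `v1` is simply that page's column of M, and every later vector still sums to 1.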

Starting from any vector u, the limit M(M(...M(Mu)...)) is the long-term distribution of walkers. Intuition: pages are important in proportion to how likely a walker is to be there. The math: limiting distribution = principal eigenvector of M = PageRank. Note: because M has each column summing to 1, the principal eigenvalue is 1. Why? If v is the limit of MM...Mu, then v satisfies the equations v = Mv.
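A short worked derivation may help with the claim about column sums. It is a sketch using only the definition of M; the symbol $\mathbf{1}$ for the all-ones vector is notation introduced here.

```latex
% Applying M preserves the total amount of importance:
\sum_i (Mv)_i \;=\; \sum_i \sum_j M[i,j]\, v_j
            \;=\; \sum_j v_j \sum_i M[i,j]
            \;=\; \sum_j v_j .
% Equivalently \mathbf{1}^{T} M = \mathbf{1}^{T}, so 1 is an eigenvalue of M^{T}
% and therefore of M. No eigenvalue can be larger in magnitude, because
% \|Mv\|_1 \le \|v\|_1 for every vector v.
```

So the iteration neither grows nor shrinks the total, which is consistent with a limit v satisfying v = Mv.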

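Example: a three-page Web with Yahoo (y), Amazon (a), and M'soft (m). Yahoo links to itself and to Amazon, Amazon links to Yahoo and to M'soft, and M'soft links to Amazon.

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0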

Because there are no constant terms, the equations v = Mv do not have a unique solution. In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution). Works if you start with any nonzero u.

Start with the vector u = [1, 1, ..., 1], representing the idea that each Web page is given one unit of importance. Note: it is more common to start with each vector element = 1/n, where n is the number of Web pages. Repeatedly apply the matrix M to u, allowing the importance to flow like a random walk. About 50 iterations is sufficient to estimate the limiting solution.

Equations v = Mv:

    y = y/2 + a/2
    a = y/2 + m
    m = a/2

Note: "=" is really assignment.

Successive values, starting from y = a = m = 1:

    y    1    1      5/4    9/8     ...   6/5
    a    1    3/2    1      11/8    ...   6/5
    m    1    1/2    3/4    1/2     ...   3/5
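The iteration for this example can be sketched in a few lines. The helper name `power_iterate` is an assumption made here; exact fractions are used so the first few steps can be compared directly with the values listed for y, a, and m.

```python
from fractions import Fraction as F

# Transition matrix of the Yahoo / Amazon / M'soft example (rows and columns y, a, m).
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

def power_iterate(M, u, iterations):
    """Apply M to u repeatedly and return the resulting vector."""
    v = list(u)
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    return v

start = [F(1), F(1), F(1)]                  # one unit of importance per page
after_three = power_iterate(M, start, 3)    # [9/8, 11/8, 1/2]
estimate = [float(x) for x in power_iterate(M, start, 50)]
```

After 50 iterations `estimate` is close to the limiting values 6/5, 6/5, 3/5.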

Some pages are dead ends (have no links out). Such a page causes importance to leak out. Other groups of pages are spider traps (all out-links are within the group). Eventually spider traps absorb all importance.

Example with a dead end: M'soft now has no out-links, so its column is all zeros.

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    0

Equations v = Mv:

    y = y/2 + a/2
    a = y/2
    m = a/2

Successive values, starting from y = a = m = 1:

    y    1    1      3/4    5/8    ...   0
    a    1    1/2    1/2    3/8    ...   0
    m    1    1/2    1/4    1/4    ...   0
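A small sketch of the leak, assuming the same iteration as before with the dead-end matrix; the names `M_dead` and `power_iterate` are chosen here for illustration.

```python
from fractions import Fraction as F

# Dead-end example: the M'soft column is all zeros.
M_dead = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(0)]]

def power_iterate(M, u, iterations):
    """Apply M to u repeatedly and return the resulting vector."""
    v = list(u)
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    return v

start = [F(1), F(1), F(1)]
after_three = power_iterate(M_dead, start, 3)   # [5/8, 3/8, 1/4]
total_after_three = sum(after_three)            # 5/4, down from 3
total_after_fifty = float(sum(power_iterate(M_dead, start, 50)))
```

The total importance shrinks at every step because whatever reaches M'soft is multiplied by a zero column and disappears.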

Example with a spider trap: M'soft now links only to itself.

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    1

Equations v = Mv:

    y = y/2 + a/2
    a = y/2
    m = a/2 + m

Successive values, starting from y = a = m = 1:

    y    1    1      3/4    5/8    ...   0
    a    1    1/2    1/2    3/8    ...   0
    m    1    3/2    7/4    2      ...   3
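The same iteration, sketched for the spider-trap matrix. The names `M_trap` and `power_iterate` are assumptions made here for illustration.

```python
from fractions import Fraction as F

# Spider-trap example: M'soft links only to itself.
M_trap = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0),    F(0)],
          [F(0),    F(1, 2), F(1)]]

def power_iterate(M, u, iterations):
    """Apply M to u repeatedly and return the resulting vector."""
    v = list(u)
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    return v

start = [F(1), F(1), F(1)]
after_three = power_iterate(M_trap, start, 3)          # [5/8, 3/8, 2]
after_fifty = power_iterate(M_trap, start, 50)
estimate = [float(x) for x in after_fifty]
```

Unlike the dead-end case, the total stays at 3 here; it simply all ends up at M'soft.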

Tax each page a fixed percentage at each iteration. Add a fixed constant to all pages. Optional but useful: add exactly enough to balance the loss (tax + PageRank of dead ends). This models a random walk with a fixed probability of leaving the system, and a fixed number of new walkers injected into the system at each step, divided equally among all pages.

Equations v = 0.8(Mv) + 0.2:

    y = 0.8(y/2 + a/2) + 0.2
    a = 0.8(y/2) + 0.2
    m = 0.8(a/2 + m) + 0.2

Successive values, starting from y = a = m = 1:

    y    1    1.00    0.84    0.776    ...   7/11
    a    1    0.60    0.60    0.536    ...   5/11
    m    1    1.40    1.56    1.688    ...   21/11
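The taxed iteration can be sketched as follows. The function name `taxed_pagerank` and the parameter name `beta` (the fraction kept after the tax) are assumptions introduced here; the constant added to each page is 1 - beta, matching the 0.2 in the equations when every page starts with one unit.

```python
def taxed_pagerank(M, beta=0.8, iterations=100):
    """Iterate v = beta * (M v) + (1 - beta), starting from one unit per page."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iterations):
        v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta)
             for i in range(n)]
    return v

# Spider-trap matrix (rows and columns y, a, m); M'soft links only to itself.
M_trap = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]

after_three = taxed_pagerank(M_trap, iterations=3)   # about [0.776, 0.536, 1.688]
limit = taxed_pagerank(M_trap)                       # about [7/11, 5/11, 21/11]
```

M'soft still ends up with the largest share, but the tax keeps it from absorbing everything.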

Goal: Evaluate Web pages not just according to their popularity, but also by how relevant they are to a particular topic, e.g., sports or history. Allows search queries to be answered based on interests of the user. Example: Search query [jaguar] wants different pages depending on whether you are interested in automobiles, nature, or sports.

Assume each walker has a small probability of teleporting at any tick. Teleport can go to:

1. Any page with equal probability, as in the taxation scheme.
2. A set of relevant pages (the teleport set), for topic-specific PageRank.

Only Microsoft is in the teleport set. Assume a 20% tax, i.e., the probability of a teleport is 20%.
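A hedged sketch of this setup follows. The slide text does not say which link graph the example uses, so the code assumes the first three-page graph (Yahoo links to itself and Amazon, Amazon to Yahoo and M'soft, M'soft to Amazon). The function name `topic_pagerank`, the parameter `beta`, and the choice to keep one unit of importance per page in total are also assumptions made here.

```python
def topic_pagerank(M, teleport, beta=0.8, iterations=100):
    """Iterate v = beta * (M v) + bonus, where the taxed importance is
    returned only to the pages whose indices are in `teleport`."""
    n = len(M)
    total = float(n)                              # one unit of importance per page
    bonus = (1 - beta) * total / len(teleport)    # share for each teleport page
    v = [1.0] * n
    for _ in range(iterations):
        v = [beta * sum(M[i][j] * v[j] for j in range(n))
             + (bonus if i in teleport else 0.0)
             for i in range(n)]
    return v

# Assumed graph (rows and columns y, a, m); index 2 is M'soft.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

v = topic_pagerank(M, teleport={2})
```

Under these assumptions the iteration settles near 24/31, 36/31, 33/31 for y, a, m. With uniform teleporting, symmetry between y and a in the untaxed solution would favor neither; here the pages close to the teleport set gain at Yahoo's expense.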

[Figure: the Yahoo, Amazon, M'soft graph, with the teleport labeled "Dr. Who's phone booth".]

Two ways to pick the teleport set for a topic:

1. Choose the pages belonging to the topic in Open Directory.
2. Learn, from a training set, the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.

Spam farmers create networks of millions of pages designed to focus PageRank on a few undeserving pages. We'll discuss this technology shortly. To minimize their influence, use a teleport set consisting of trusted pages only. Example: home pages of universities.