Math 304 Handout: Linear algebra, graphs, and networks.

December 2006.

1. GRAPHS AND ADJACENCY MATRICES

Definition. A graph is a collection of vertices connected by edges. A directed graph is a graph all of whose edges have directions, usually indicated by arrows.

FIGURE 1. A graph.
FIGURE 2. A directed graph.

Example. A DC electrical network is a directed graph. An AC electrical network is a graph. The Internet is a graph, with computers and servers as vertices and cables as edges. The World Wide Web is a directed graph, with web pages as vertices and links as edges. Towns connected by freeways form a graph. Squares connected by one-way streets form a directed graph.

Instead of drawing a graph, one can summarize it in a matrix. Label the vertices of a graph 1, 2, 3, ..., n. The adjacency matrix of a graph is the n x n matrix A such that a_ij is the number of edges connecting the j-th and i-th vertices. Similarly, for a directed graph the adjacency matrix is the n x n matrix A such that a_ij is the number of edges going from the j-th to the i-th vertex.

Example. The adjacency matrices for the graphs in Figures 1 and 2 are

    A =  0 1 1 1 0 0
         1 0 1 0 0 0
         1 1 0 0 1 0
         1 0 0 0 1 0
         0 0 1 1 0 2
         0 0 0 0 2 0

and

    B =  0 0 1 1 0 0
         1 0 0 0 0 0
         0 1 0 0 0 0
         0 0 0 0 1 0
         0 0 1 0 0 0
         0 0 0 0 1 0

For example, there is 1 edge connecting the 4th and the 1st vertex, 1 edge going from the 4th to the 1st vertex, and no edges going from the 1st to the 4th vertex. Note that the adjacency matrix of a graph is always symmetric.

Besides describing a graph as a bunch of numbers rather than as a picture, adjacency matrices have other uses. To find out which vertices can be reached from the 1st vertex in one step, we look at the entries of the 1st column of A. The non-zero entries are the 2nd, 3rd, and 4th, so those are the vertices that can be reached in one step. What about vertices that can be reached in two steps? It is not hard to see that these correspond to the non-zero entries in the 1st column of A^2:

    A^2 =  3 1 1 0 2 0
           1 2 1 1 1 0
           1 1 3 2 0 2
           0 1 2 2 0 2
           2 1 0 0 6 0
           0 0 2 2 0 4

Thus in two steps, one can go from the 1st vertex back to itself in three ways (by visiting the 2nd, 3rd, or 4th vertex first), one can get to the 2nd vertex in one way (by passing through the 3rd vertex), and one still cannot get to the 6th vertex. In a large network, one can use the powers of the adjacency matrix to determine the distance between two vertices.

Exercise. Describe how to interpret the entries of the powers of the adjacency matrix of a directed graph.

Exercise. Explain the 6 entry in the A^2 matrix above.
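These walk-counting rules are easy to check numerically. The sketch below uses a small hypothetical 4-vertex graph (not the graph in the figures above) and assumes NumPy is available.

```python
import numpy as np

# A hypothetical 4-vertex undirected graph: edges 1-2, 1-3, 2-3, 3-4.
# Entry A[i, j] is the number of edges joining vertex j+1 to vertex i+1
# (Python indices start at 0).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# The (i, j) entry of A @ A counts the two-step walks from vertex j+1
# to vertex i+1.
A2 = A @ A

assert (A2 == A2.T).all()  # still symmetric for an undirected graph
assert A2[0, 0] == 2       # 1 -> 2 -> 1 and 1 -> 3 -> 1
assert A2[3, 0] == 1       # 1 -> 3 -> 4 is the only two-step route 1 to 4
assert A2[3, 3] == 1       # vertex 4 has degree 1
```

The diagonal of A^2 recovers the vertex degrees, exactly as in the handout's example.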

2. WORD SEARCH

How do search engines work? The first issue is to find all the web pages; as far as I remember, the first efficient web crawler was AltaVista in 1995. The second issue is to store, for all pages on the web, all the words that they contain. Again, an efficient way to do this is using a (very, very large) matrix. Each column corresponds to a web page, each row to a word, and each entry is 1 if a particular word occurs on a particular web page, and 0 otherwise.

Example. Suppose that we have 5 web pages, the search words are linear, algebra, matrix, determinant, transformation, and the occurrences of each word on each page are given by the following table:

                    P1  P2  P3  P4  P5
    linear           1   0   1   0   1
    algebra          1   1   1   1   0
    matrix           1   0   1   0   0
    determinant      0   1   1   1   0
    transformation   0   0   1   1   1

The corresponding matrix is

    C =  1 0 1 0 1
         1 1 1 1 0
         1 0 1 0 0
         0 1 1 1 0
         0 0 1 1 1

If we want to find out which web pages contain the terms linear algebra, we form the corresponding vector x = (1, 1, 0, 0, 0)^T and calculate

    C^T x = (2, 1, 2, 1, 1)^T.

Exercise. Why do we take C^T?

We see that both words occur on the 1st and 3rd pages. Similarly, to find the pages containing the terms determinant of a linear transformation, we would calculate

    C^T (1, 0, 0, 1, 1)^T = (1, 1, 3, 2, 2)^T,

so that only the 3rd page contains all the terms.
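The same query computation can be sketched in code. The term-document matrix below is a small hypothetical one (three words, three pages), not the table from the example; NumPy is assumed.

```python
import numpy as np

# Hypothetical term-document matrix: rows are search words, columns are
# web pages; C[w, p] = 1 when word w occurs on page p.
words = ["linear", "algebra", "matrix"]
C = np.array([
    [1, 0, 1],   # "linear"  occurs on pages 1 and 3
    [1, 1, 0],   # "algebra" occurs on pages 1 and 2
    [1, 1, 1],   # "matrix"  occurs on every page
])

# Query "linear algebra": an indicator vector over the word list.
x = np.array([1, 1, 0])

# scores[p] = number of query words that page p+1 contains; a page
# contains every query word exactly when its score equals x.sum().
scores = C.T @ x
hits = np.flatnonzero(scores == x.sum())

assert scores.tolist() == [2, 1, 1]
assert hits.tolist() == [0]   # only page 1 contains both query words
```

Transposing C turns "words per page" into "pages per word", which is why the product C^T x scores pages rather than words.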

3. RANKING THE WEB PAGES

Once we know which web pages contain the desired words, how do we choose between these pages? An early approach was to return the pages that contain many occurrences of the search words. This idea is easily abused, since it is easy to fill a web page with repetitions of popular search words that have nothing to do with the page's content.

A better approach is to use the graph structure of the Web. A page is important if many people think it is, and so there are many links to it. This idea is also easily defeated: a company can create lots of dummy sites that point to its main site, thereby giving the appearance that the site is popular. Instead, say that a site is important if there are many important sites that point to it. This seems like a circular definition, but one can make sense of it, and in 1998, Sergey Brin and Larry Page (following some ideas of Jon Kleinberg) based the Google search engine on it.

Think of the Web as a giant Markov chain: at every site a user randomly picks one of the links, goes on to the next site, again randomly picks a link, and continues in this manner. The matrix describing this Markov chain is almost the adjacency matrix of the directed graph, except that we want the matrix to be stochastic, and so we normalize each column to have sum one. See the example below. The stationary vector of this matrix tells us what proportion of time, on average, a user will spend at each site, which shows how popular or important that site is.
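The column normalization just described can be sketched as follows, for a hypothetical 3-page network; NumPy is assumed, and every page is assumed to have at least one outgoing link, so no column sum is zero.

```python
import numpy as np

# Hypothetical directed link matrix: A[i, j] = 1 when page j+1 links to
# page i+1.  Page 1 links to 2 and 3, page 2 links to 1, page 3 links
# to 1 and 2.
A = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)

# Normalize each column to sum to 1: from page j, the random surfer
# follows each of its outgoing links with equal probability.
S = A / A.sum(axis=0)

assert np.allclose(S.sum(axis=0), 1.0)   # S is column-stochastic
```

A real crawl would also have to handle "dangling" pages with no outgoing links (a zero column), which this sketch deliberately avoids.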

Example. Here is a network of 15 web pages and their links. One site has many naturally occurring links pointing to it. Another site also has many links pointing to it, but they are dummy links, since they all come from a single site.

FIGURE 3. A large network.

The stochastic matrix D corresponding to this network is the 15 x 15 matrix obtained from the adjacency matrix by normalizing each column to have sum one.

Its eigenvalues are 1 together with 14 further eigenvalues of modulus less than 1, several of them complex and seven of them equal to 0. The eigenvector with eigenvalue 1 and positive entries that add up to 1 has its largest entries at the 5th, 10th, and 12th positions, followed by the 8th and 11th, with the remaining entries close to 0. Thus the ordering of the web pages is: 5, 10, 12, then 8 and 11, and the rest are unimportant.

Remark. This matrix does not have all positive entries (it contains many zeros), so we cannot apply the theorem discussed in class to conclude that any initial distribution will lead to the stationary distribution. We can use Theorem 6.3 (since 1 is the dominant eigenvalue), or use the fact that the network is strongly connected (one can get from any vertex to any other vertex) to conclude that some power of D has positive entries, which is enough. An alternative, actually implemented in Google, is to modify our model a little: we assume that at each page the user may follow any of the links (with equal probabilities), but may also, with some small probability, jump to any other page. This is also a reasonable model, and mathematically it makes every entry of the matrix positive (though very small). The resulting eigenvalues and eigenvectors are not significantly different from the ones we calculated.
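The modified model in the Remark (follow a link most of the time, occasionally jump to a uniformly random page) can be sketched with the power method. The 3-page network and the jump probability d = 0.15 below are illustrative assumptions, not values from this handout; NumPy is assumed.

```python
import numpy as np

def pagerank(S, d=0.15, tol=1e-12):
    """Stationary vector of the damped chain G = (1-d) S + (d/n) ones."""
    n = S.shape[0]
    G = (1 - d) * S + d / n      # broadcasting adds d/n to every entry
    p = np.full(n, 1.0 / n)      # start from the uniform distribution
    while True:                  # power iteration: p, Gp, G^2 p, ...
        p_next = G @ p
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Hypothetical 3-page network, columns normalized to sum to 1.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 0, 0]], dtype=float)
S = A / A.sum(axis=0)

p = pagerank(S)
assert np.isclose(p.sum(), 1.0)  # p remains a probability distribution
assert (p > 0).all()             # every page receives some weight
assert p.argmax() == 0           # page 1 is ranked highest here
```

Because every entry of G is positive, the theorem for positive stochastic matrices applies directly, and the iteration converges from any starting distribution.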