Class President: A Network Approach to Popularity. Due July 18, 2014


Instructions

1. Due Fri, July 18 at :59 PM
2. Work in groups of up to 3
3. Type up the report, and submit it as a PDF on D2L
4. Attach the code used as a separate .m file
5. Have fun!

Introduction

Last year, your high school was shocked when the election for class president was rigged by an unscrupulous student. While this may still bode well for the student's chances in American politics, the school wants to take steps to prevent such a thing from happening again. The principal has decided to move from an election to an objective ranking system. However, the president will still be chosen by popularity alone. How can we rank the students' popularity without voting? By utilizing the power of Facebook, as well as the Page Rank algorithm, we will be able to observe just how popular each candidate is.

In this lab, you will learn about the eigenvector centrality method, a tool from network theory used to rank the importance of "nodes" in a network. The method has uses in many different fields, most notably the Page Rank algorithm used to rank Google search results. In our case, it will allow us to impartially rank the popularity of the candidates for class president.

Which friends are most significant?

The first idea for determining popularity may be simply to count the number of Facebook friends each candidate has. However, this method has a few weaknesses. It treats each friend as equal. That is, whether the friend is your little brother or the captain of the cheer squad, the impact is the same. It could also be manipulated by making fake accounts and adding them as friends. Thus, we want to be able to take into account the impact of each friend. Fortunately, the field of graph theory provides a tool to do this, known as eigenvector centrality. This is the same tool used in the original Google "Page Rank" algorithm.

Centrality

In graph or network theory, there are several types of centrality. Centrality is essentially a measure of the relative importance of a vertex within a graph. An example of a graph is below:

[Figure: a directed graph with four nodes A, B, C, and D]

A, B, C, and D are the vertices, or nodes, of the graph, which represents a network. As pictured, the graph is directed. This means that there is a connection from A to B, but not from B to A. This describes networks such as power grids, where energy may only be flowing in one direction. If, instead, we ignore the arrows and think of the connections as lines, we have an undirected graph. This describes the election scenario, since if I am Facebook friends with you, you must also be Facebook friends with me.

For the election, each student is a node, and the connections are Facebook friendships. We wish to find which node is most significant. We will use a method known as eigenvector centrality. We can measure the importance of a student in our network by creating an adjacency matrix.

Adjacency matrix

In network theory, an adjacency matrix represents which nodes in a network are connected. Assume that the four nodes above are four students, and the connections are not directed. Our adjacency matrix will be a 4x4 matrix. Let's call the matrix $M$. Node A is student 1, node B is student 2, node C is student 3, and node D is student 4. The entry $M_{i,j}$ in the $i,j$-th position will be either a 1 or a 0. If there is a friendship (i.e. a connection) between the $i$-th and $j$-th students, then the entry in the $i,j$-th position will be a 1. If there is no friendship, it will be a 0. The adjacency matrix of the network above is given below.

[Matrix: the 4x4 adjacency matrix of the network above]

Notice that the matrix has zeros down the diagonal. This should be quite natural, since a student is not friends with themselves, which is what the $i,i$-th entry of $M$ represents. However, there is another important pattern you should note. If the $i,j$-th entry has a connection, then the $j,i$-th entry does as well.
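The construction described above can be sketched in Matlab. This is only an illustration under stated assumptions: the edge list `edges`, the size `N`, and the variable names are made up for the example and are not the network pictured in the handout.

```matlab
% Hypothetical undirected friendship network on 4 students.
% The edge list is an assumption for illustration only; it is
% NOT the network pictured in the handout.
edges = [1 2; 1 3; 2 3; 3 4];   % each row is one friendship (s, t)

N = 4;                 % number of students
M = zeros(N);          % start with no friendships

for e = 1:size(edges, 1)
    s = edges(e, 1);
    t = edges(e, 2);
    M(s, t) = 1;       % s is friends with t ...
    M(t, s) = 1;       % ... so t is also friends with s
end

disp(M)

% Checks matching the two observations in the text:
hasZeroDiagonal = all(diag(M) == 0);   % nobody is friends with themselves
isMirrored = isequal(M, M');           % entry (s,t) equals entry (t,s)
```

Filling both `M(s, t)` and `M(t, s)` for every friendship is what makes the graph undirected. For a directed network you would fill only one of the two entries.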

Eigenvector Centrality

Now that we can construct an adjacency matrix, we can finally use it to find the eigenvector centrality. For the $i$-th student, let the centrality score be proportional to the sum of the scores of all students connected to that student. Therefore, we get

$$x_i = \alpha \sum_{j \in X_i} x_j.$$

Here, $x_i$ is the score for the $i$-th student, and $X_i$ is the set of all students connected to the $i$-th student. Note that this set is also represented by the connections on the $i$-th row of the adjacency matrix. Thus, if we multiply each student's score by the corresponding entry on the $i$-th row and sum, we will only add the scores of the students the $i$-th student is connected to. All other students will have their scores multiplied by zero and added for no effect. It looks like this:

$$x_i = \alpha \sum_{j=1}^{N} M_{i,j} x_j.$$

Here, $N$ is the number of students in your network. In vector notation, we can rewrite this as $x = \alpha M x$. We can move $\alpha$ to the left-hand side, and we get

$$\frac{1}{\alpha} x = M x.$$

Since $\alpha$ is just a proportionality constant, we can replace $1/\alpha$ with $\lambda$, which gives $M x = \lambda x$. Thus, we see that the vector of the scores of each student is an eigenvector of $M$. In fact, it is the eigenvector with the largest eigenvalue, although this fact is not apparent. The larger the magnitude of a student's entry, the more popular the student! Note that the overall sign of the vector does not matter, as it can be scaled however we like.

Finding Eigenvectors in Matlab

The eigs command in Matlab will calculate the eigenvectors and eigenvalues of a matrix. If you enter

    [A, B] = eigs(M, 1)

where M is your matrix, A will be the eigenvector with the largest eigenvalue and B will be that eigenvalue. You can experiment with increasing the second argument to get more eigenvalues, but we won't use them in this lab.

Now we can do it ourselves!

You didn't think you would be able to get off so easily, just using a Matlab command, did you? We will now use the Power Method to find the eigenvector with the largest eigenvalue ourselves! This method is pretty magical.
If we have a sufficiently "nice" matrix, we can find this eigenvector just by repeatedly multiplying a random vector by the matrix. This works because the eigenvectors and eigenvalues are a way of "breaking down" what the matrix does to each part of a vector. For example, the matrix

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}$$

has eigenvalues 1, 2, 3, with corresponding eigenvectors

$$\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$

We can write any vector $x$ as

$$x = \begin{pmatrix} a \\ b \\ c \end{pmatrix} = a \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + b \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + c \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$

Thus,

$$Ax = a A \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + b A \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + c A \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = a \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + 2b \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + 3c \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},$$

$$A^n x = a (1)^n \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + b (2)^n \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + c (3)^n \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$

Notice how we used the definition of an eigenvector. As we increase $n$, the vector with the $3^n$ scaling is "taking over" the other two eigenvectors, because it has the largest eigenvalue.

We don't want our vectors getting so big, so, after each multiplication, we rescale so that the vector has length 1. We repeat this process until the vector stops changing when multiplied by $A$. In math symbols, we start with a random $b_0$, then repeat

$$b_{k+1} = \frac{A b_k}{\|A b_k\|}.$$

If $b(i)$ is the $i$-th element of the vector $b$ of length $n$,

$$\|b\| = \sqrt{\sum_{i=1}^{n} b(i)^2}.$$

We can calculate this using the Matlab norm command.

Rate of Convergence

When working with a numerical method, we are interested in the rate of convergence. The rate of convergence determines the number of iterations needed to find the solution. To find the rate, we first need to define a metric, which measures how far the current iteration is from the true solution. If $b$ is the true eigenvector corresponding to the leading eigenvalue, the error at iteration $k$ is given by

$$e_k = \|b_k - b\|.$$

The rate of convergence is the ratio

$$r = \frac{e_{k+1}}{e_k}.$$

Surprisingly, for this method, and many others, this $r$ will stay about the same for every iteration. This property is known as linear convergence.

Warming up

1) What kind of matrix has the following property? $M_{i,j} = M_{j,i}$
2) Does the friendship adjacency matrix have the property described in 1)? Why or why not?
3) When Google uses the Page Rank algorithm, it creates a graph of the network of websites, with web links serving as the connections. Is this graph directed or undirected? Why?

4) Construct an adjacency matrix from the following network. Does it have the property described in problem #1? There are 6 books. They are simply 1, 2, 3, 4, 5, 6. They are connected as follows. Book 1 cites 4, 6, and 5. Book 2 cites 3, 5, and 6. Book 3 cites 4 and 6. Book 4 cites 5. Book 5 cites 4, 1, and 2. Book 6 cites 2 and 3.
5) Give the centrality scores of these books. What does this tell you about the books?

Ranking the Students

The class president will be the student with the highest centrality score. The candidates are:

Mary: Friends with P1, P2, P3, P4, P5, P6, P7, P8, P9.
Fred: Friends with P, P, P2.
Veronica: Friends with P3, P4, P5.

Facebook data

There are 3 other students in the school. The adjacency matrix for those 3 is posted as student_adjacency.txt (use load). Note that, before computing the eigenvectors, you must add the three candidates into the matrix. Use the power method to find the eigenvector corresponding to the largest eigenvalue.

6) What is the rate of convergence for this problem?
7) Plot the error, $e_k$, against $k$. At least how many iterations were necessary to achieve $e_k < 6$? (You may use eigs to determine the error.)
8) Rank the students P-P in the provided matrix according to their centrality scores. Repeat after adding in the three additional candidates. Where do the 3 candidates fit into the rankings? Are P-P re-arranged at all with the candidates added? Why or why not?
9) How could someone cheat to make their score higher with this system?
10) What are some different flaws in this system of ranking the students' popularity? How could it be changed to correct those flaws?
11) Would you have picked the same student to be class president? Why or why not? Are there other factors that you would take into consideration to make the system more balanced?
12) What are some examples of other networks that could be modeled using an adjacency matrix?
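As a starting point for the computations above, here is a minimal Matlab sketch of the power method and the error ratio described in this lab. It runs on a small made-up friendship matrix, not the posted data. The matrix `M`, the iteration count `numIter`, the fixed starting vector, and all variable names are assumptions for illustration.

```matlab
% Hypothetical 4-student friendship matrix, assumed for illustration only
% (it is not the handout's network or the posted data file).
M = [0 1 1 0;
     1 0 1 0;
     1 1 0 1;
     0 0 1 0];

% Reference eigenvector from eigs, used only to measure the error.
[v, lambda] = eigs(M, 1);
v = v / norm(v);
if sum(v) < 0
    v = -v;                    % the sign is arbitrary; use the positive one
end

numIter = 30;                  % assumed iteration count for this sketch
b = ones(size(M, 1), 1);       % fixed start so the run is repeatable;
b = b / norm(b);               % the lab asks for a random start instead
err = zeros(numIter, 1);

for k = 1:numIter
    b = M * b;                 % multiply by the matrix ...
    b = b / norm(b);           % ... then rescale to length 1
    err(k) = norm(b - v);      % e_k = ||b_k - b||
end

ratios = err(2:end) ./ err(1:end-1);   % r = e_{k+1} / e_k
```

After the first few iterations, the entries of `ratios` should settle near a constant, which is the linear convergence described earlier. A call such as `semilogy(1:numIter, err)` is one way to plot the error against the iteration number.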