Lecture 12: Graph Laplacians and Cheeger's Inequality


CPS290: Algorithmic Foundations of Data Science                                March 7, 2017
Lecturer: Kamesh Munagala                                        Scribe: Kamesh Munagala

Graph Laplacian

Perhaps the most beautiful connection between a discrete object (a graph) and a continuous object (an eigenvalue) comes from a concept called the Laplacian. It gives a surprisingly simple, algebraic method for splitting a graph into two pieces without cutting too many edges. In a sense, we take the NP-Complete problem of partitioning a graph and relax it to the easy-to-solve problem of finding an eigenvector of a certain type of matrix. The resulting partition is not optimal, but it is close to optimal in a provable sense. The algorithm is very practical, and modifications of it are widely used for clustering objects into similar groups.

Suppose we are given $n$ items and some measure of similarity between them. Let $w_{ij}$ denote the similarity score between items $i$ and $j$. We assume this function is symmetric, so that $w_{ij} = w_{ji}$. We will treat the general version later (without proofs), but will consider a simpler version for presenting proofs. Assume $w_{ij} \in \{0, 1\}$, where $w_{ij} = 1$ means $i$ and $j$ are similar, and $w_{ij} = 0$ means they are dissimilar. This defines an undirected, unweighted graph $G(V, E)$, where $V$ is the set of items, and $E$ is the set of edges corresponding to pairs of items that are similar. We further assume all vertices in this graph have degree exactly $d$. We will relax all these assumptions later.

Let $A$ denote the adjacency matrix of $G$, and let $D$ denote the diagonal matrix whose diagonal entries are all $d$. Then the unnormalized Laplacian of $G$ is given by
$$L = D - A.$$

Basic Properties

It is easy to check that for any vector $v$, we have
$$v^T L v = \sum_{(i,j) \in E,\, i<j} (v_i - v_j)^2.$$
Since the above quantity is always non-negative, the matrix $L$ is PSD, and has eigenvalues $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Further, note that if $v = \vec{1}$, the all-ones vector, then $v^T L v = 0$. This means $\lambda_1 = 0$ and a corresponding eigenvector is $\vec{1}$. We now show something stronger.

Claim 1. Suppose $0 = \lambda_1 = \cdots = \lambda_k < \lambda_{k+1}$. Then the number of connected components is exactly $k$.

Proof. We will show that there are exactly $k$ orthonormal eigenvectors for the eigenvalue $0$ when the graph has $k$ connected components. First note that if $v$ is a unit-length eigenvector corresponding to any eigenvalue $\lambda$, then
$$v^T L v = \sum_{(i,j) \in E,\, i<j} (v_i - v_j)^2 = \lambda.$$
Suppose $\lambda = 0$. Then it is easy to check that in the corresponding eigenvector $v$, for any connected component $S_i$, all vertices $j \in S_i$ have the same value $v_j$. If there are $k$ connected components, we can construct $k$ orthogonal eigenvectors as follows: for each $i = 1, 2, \ldots, k$, construct the vector $v_i$ by setting $v_{ij} = 1$ for $j \in S_i$ and zero everywhere else. These $k$ vectors are clearly orthogonal, which shows that the multiplicity of the zero eigenvalue is at least $k$. On the other hand, there is no further eigenvector for the zero eigenvalue that is orthogonal to these $k$: any such eigenvector must take the same value on all vertices within a connected component, so if it is non-zero on some component, its dot product with the vector we constructed for that component is non-zero. This shows that the multiplicity of the zero eigenvalue is exactly the number of components $k$, that is, the corresponding eigenvectors span a subspace of dimension $k$.
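To make the construction concrete, here is a minimal sketch (not from the lecture) that builds $L = D - A$ for a small graph with numpy and checks that the multiplicity of the zero eigenvalue equals the number of connected components, as in Claim 1. The example graph and the numerical tolerance are illustrative choices.

```python
import numpy as np

# Two disjoint triangles on vertices {0,1,2} and {3,4,5}: 2 connected components.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
n = 6

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))       # diagonal degree matrix (here every degree is d = 2)
L = D - A                        # unnormalized Laplacian

eigvals = np.linalg.eigvalsh(L)  # L is symmetric PSD, so eigenvalues are real and >= 0
num_components = int(np.sum(eigvals < 1e-9))
print(np.round(eigvals, 6))                      # smallest two eigenvalues are 0
print("connected components:", num_components)   # prints 2, matching Claim 1
```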

Graph Partitioning

If we only wish to find the number of connected components, the Laplacian seems like a convoluted way to go about doing it. In practice, the graph is typically connected, and we need to split it into components such that items within a component are much more similar to each other than to items in a different component. Such a problem is usually termed clustering or unsupervised learning.

We will now consider the simplest problem of splitting the graph into two pieces. We want the partition to have two properties: the number of edges going from one side of the partition to the other is small, and the number of edges within each side is large. One way to do this is to find a minimum cut in the graph, a partition that minimizes the number of edges going across. This can be done by network flows in polynomial time (more later), but simply minimizing the number of edges crossing the partition can lead to silly solutions; for instance, it may be optimal to split off one vertex from everything else. A more reasonable approach would attempt to minimize the number of crossing edges while simultaneously making sure each side of the partition has a lot of edges.

Let $(S, V \setminus S)$ be a partition. We define the cut $\delta(S)$ as the set of edges with one end-point in $S$ and the other in $V \setminus S$. Similarly, we define the volume of a side as $\mathrm{vol}(S) = \sum_{i \in S} d_i$, where $d_i$ is the degree of $i$. Since the graph is $d$-regular, this is the same as $d|S|$. Given these, we can define a good partition as one that minimizes the conductance, defined as
$$\phi(S) = \frac{|\delta(S)|}{\min(\mathrm{vol}(S), \mathrm{vol}(V \setminus S))} = \frac{1}{d} \cdot \frac{|\delta(S)|}{\min(|S|, |V \setminus S|)}.$$
This attempts to simultaneously decrease the cut size while increasing the volume of the two sides. Without loss of generality, we can take $S$ to be the smaller side of the partition, and we can ignore the factor of $d$. Therefore, define
$$\Theta_G = \min_{S \subseteq V,\ |S| \le |V|/2} \frac{|\delta(S)|}{|S|}.$$
Our goal is to find an $S$ that minimizes the right-hand side. This is the minimum normalized cut problem. Note that $\Theta_G / d$ is the conductance of the graph defined above. The quantity $|\delta(S)|/|S|$ is termed the normalized cut size of the cut $S$; for a regular graph, conductance is proportional to the normalized cut size, so minimizing either of them is the same.
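As a quick illustration (not part of the lecture notes), the sketch below computes the cut size $|\delta(S)|$, the normalized cut size $|\delta(S)|/|S|$, and the conductance of a candidate set $S$; the example graph and the choice of $S$ are arbitrary.

```python
import numpy as np

def cut_size(A, S):
    """Number of edges with exactly one endpoint in S."""
    S = set(S)
    n = A.shape[0]
    return sum(A[i, j] for i in S for j in range(n) if j not in S)

def normalized_cut_size(A, S):
    """|delta(S)| / |S|, assuming |S| <= |V|/2 as in the definition of Theta_G."""
    return cut_size(A, S) / len(S)

def conductance(A, S):
    """|delta(S)| / min(vol(S), vol(V \\ S)), where vol(S) = sum of degrees in S."""
    deg = A.sum(axis=1)
    volS = deg[list(S)].sum()
    return cut_size(A, S) / min(volS, deg.sum() - volS)

# Example: a 6-cycle (2-regular); S = {0, 1, 2} cuts exactly 2 edges.
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
S = [0, 1, 2]
print(normalized_cut_size(A, S))   # 2/3
print(conductance(A, S))           # 2/6 = (1/d) * normalized cut size, with d = 2
```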

Connection to Laplacian

Finding a cut of minimum conductance is NP-Complete. However, conductance is closely related to the graph Laplacian. Take any set $S \subseteq V$ of size at most $|V|/2$. Consider the vector $v_S$ defined as follows: for $j \in S$, $v_{Sj} = 1 - |S|/|V|$, and for $j \notin S$, we set $v_{Sj} = -|S|/|V|$. It is an easy exercise to see that $v_S$ is perpendicular to $\vec{1}$. Further,
$$v_S \cdot v_S = |S|\left(1 - \frac{|S|}{|V|}\right)^2 + (|V| - |S|)\left(\frac{|S|}{|V|}\right)^2 = \frac{|S|\,|V \setminus S|}{|V|}$$
and
$$v_S^T L v_S = |\delta(S)|\left(1 - \frac{|S|}{|V|} + \frac{|S|}{|V|}\right)^2 = |\delta(S)|.$$
Therefore,
$$\frac{v_S^T L v_S}{v_S^T v_S} = \frac{|\delta(S)|\,|V|}{|S|\,|V \setminus S|} = \frac{1}{1 - |S|/|V|} \cdot \frac{|\delta(S)|}{|S|} = c \cdot \frac{|\delta(S)|}{|S|},$$
where $c \in [1, 2]$ since $|S| \le |V|/2$. The left-hand side of the above relation is at least $\lambda_2$, since
$$\lambda_2 = \min_{v \perp \vec{1}} \frac{v^T L v}{v^T v}$$
and $v_S \perp \vec{1}$. Since all this holds for any $S$, it holds for the minimum, so that
$$\lambda_2 \le 2 \min_{S :\ |S| \le |V|/2} \frac{|\delta(S)|}{|S|} = 2\Theta_G.$$
Therefore, the minimum normalized cut size of a regular graph is at least $\lambda_2/2$.

In fact, if we restrict attention to unit vectors perpendicular to $\vec{1}$ that take only two distinct values, one positive and one negative, then the quantity $v^T L v$ is (within a factor of 2) the normalized cut size of the cut that separates the vertices where $v$ takes the positive value from the vertices where $v$ takes the negative value. The minimum conductance cut therefore (roughly speaking) minimizes $v^T L v$ over unit vectors perpendicular to $\vec{1}$ whose coordinates take only two values. However, optimizing over this set of vectors is NP-Complete. What we can do instead is optimize over the space of all unit vectors $v$ that are perpendicular to $\vec{1}$. This yields the second eigenvector of $L$ as the optimal solution. Clearly, the resulting objective value, the second eigenvalue, is at most twice the minimum normalized cut size. This is called a relaxation of the original problem, since we replaced a hard discrete optimization problem (conductance) with an easier continuous optimization problem (an eigenvalue).

The Spectral Method

The question then becomes: suppose we solve for the second eigenvector. Can we convert it into a good cut? If so, how good is this cut? This is the spectral algorithm. Let $v$ denote the eigenvector corresponding to the second smallest eigenvalue of $L$. Sort the vertices $i$ of $G$ in increasing order of $v_i$; let the resulting ordering be $v_1 \le v_2 \le \cdots \le v_n$. For each $i$, consider the cut $\{1, 2, \ldots, i\}$ and its complement, and calculate its conductance. Among these $n - 1$ cuts, choose the one with minimum conductance. This algorithm need not find the optimum conductance cut, but it has the advantage of being efficient: all it involves is an eigenvector computation, followed by sorting, followed by computing the conductance of $n - 1$ cuts.
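The sweep step can be written compactly. Here is an illustrative sketch (not the lecture's code): compute the eigenvector for the second smallest eigenvalue of $L = D - A$, sort vertices by their coordinates, and keep the best of the $n - 1$ prefix cuts by conductance. The incremental cut update and the example graph are my own choices.

```python
import numpy as np

def spectral_sweep_cut(A):
    """Sweep cut from the second eigenvector of L = D - A (illustrative sketch).

    Assumes A is the adjacency matrix of a connected undirected graph.
    Returns (best_set, best_conductance).
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    # eigh returns eigenvalues in ascending order; column 1 is the second eigenvector.
    _, vecs = np.linalg.eigh(L)
    order = np.argsort(vecs[:, 1])     # sort vertices by their eigenvector coordinate

    total_vol = deg.sum()
    best_set, best_phi = None, np.inf
    in_S = np.zeros(n, dtype=bool)
    cut, volS = 0.0, 0.0
    for idx in order[:-1]:             # the n - 1 prefix cuts
        # moving idx into S: edges to S stop crossing, edges to the rest start crossing
        cut += deg[idx] - 2 * A[idx, in_S].sum()
        volS += deg[idx]
        in_S[idx] = True
        phi = cut / min(volS, total_vol - volS)
        if phi < best_phi:
            best_phi, best_set = phi, np.flatnonzero(in_S).tolist()
    return best_set, best_phi

# Example: two triangles joined by a single bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(spectral_sweep_cut(A))   # expect one triangle as S, with conductance 1/7
```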

How good is this algorithm? This is shown by the following theorem.

Theorem 1 (Cheeger's Inequality). Let $G$ be a $d$-regular connected graph with conductance $\Theta_G/d$. Let $\Gamma$ denote the normalized cut size of the cut found by the spectral algorithm, and let $\lambda_2$ denote the second smallest eigenvalue of the Laplacian. Then
$$\frac{\lambda_2}{2} \le \Theta_G \le \Gamma \le \sqrt{2 d \lambda_2}.$$

This shows that the spectral method has pretty good performance: the conductance of the cut it finds is roughly at most the square root of the optimal conductance. So if the optimum is very small, the spectral method will output a decent cut. This gives an extremely practical way to partition a graph into two pieces so that the vertices within a piece are on average more similar to each other than to vertices on the opposite side. The proof of this theorem is not hard, but it is beyond the scope of this course.

General Graph

In the above discussion, we assumed the graph is regular. Even if the graph is arbitrary, the Laplacian can be constructed in the same way, and the spectral algorithm can be run as before. However, Cheeger's inequality only holds for a slightly different definition of the Laplacian, termed the normalized Laplacian. Define the volume of a set $S$ as $\mathrm{vol}(S) = \sum_{i \in S} d_i$. Define the conductance of the graph as
$$\phi_G = \min_{S :\ \mathrm{vol}(S) \le \mathrm{vol}(V)/2} \frac{|\delta(S)|}{\mathrm{vol}(S)}.$$
Note that for a regular graph, $d\phi_G = \Theta_G$. Next define the normalized Laplacian as $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$, where $A$ is the adjacency matrix and $D$ is the diagonal matrix of degrees. As before, compute the second eigenvector of $\mathcal{L}$, apply the spectral method to construct $n - 1$ cuts, and choose the minimum conductance cut among them. Then we have the following result.

Theorem 2 (Cheeger's Inequality). Let $G$ be a connected graph with conductance $\phi_G$. Let $\gamma$ denote the conductance of the cut found by the spectral algorithm applied to the normalized Laplacian, and let $\lambda_2$ denote the second smallest eigenvalue of the normalized Laplacian. Then
$$\frac{\lambda_2}{2} \le \phi_G \le \gamma \le \sqrt{2 \lambda_2}.$$

Note that for regular graphs, the two theorems are identical: the normalized Laplacian is $1/d$ times the Laplacian, so its second eigenvalue is $1/d$ times that of the unnormalized Laplacian, and the normalized cut size is $d$ times the conductance. Putting these together, you can check that both theorems say the same thing. The main point is that for a non-regular graph, the right Laplacian is the normalized one, and the right measure of the quality of a cut is its conductance. That said, we can still use the Laplacian $D - A$ for running the algorithm; in practice, it does reasonably well.

We can further generalize the algorithm to handle weights on edges. The generalization is simple: we replace degree with weighted degree and volume with weighted volume. The definitions above generalize naturally, and so do the algorithm and its performance guarantee.
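To illustrate the relationship noted above, here is a small sketch (not from the lecture) that builds the normalized Laplacian $I - D^{-1/2} A D^{-1/2}$ for a (possibly weighted) adjacency matrix and checks that for a $d$-regular graph it equals $\frac{1}{d}(D - A)$, so its eigenvalues are $1/d$ times those of the unnormalized Laplacian.

```python
import numpy as np

def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2} for a (possibly weighted) adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt

# A 4-cycle is 2-regular: the normalized Laplacian equals (1/d)(D - A) with d = 2.
A = np.zeros((4, 4))
for i in range(4):
    A[i, (i + 1) % 4] = A[(i + 1) % 4, i] = 1.0
d = 2
L = np.diag(A.sum(axis=1)) - A
N = normalized_laplacian(A)
print(np.allclose(N, L / d))                              # True
print(np.linalg.eigvalsh(N), np.linalg.eigvalsh(L) / d)   # identical spectra
```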

Generalization to Many Partitions

To partition a graph into more pieces, one approach is to consider higher eigenvectors. For instance, we can take the eigenvectors corresponding to the $k$ smallest eigenvalues and project the vertices onto this space. This gives $n$ points in $k$-dimensional Euclidean space. These points are usually simpler to partition into $k$ groups, since the eigenvectors have already done most of the hard work!

Suppose we want to partition the resulting points into $k$ groups. One simple heuristic is the following, called $k$-center: Map each vertex to its projection onto the first $k$ eigenvectors of the Laplacian (normalized or unnormalized). Start with any vertex and place it in partition 1. Take the vertex furthest away from it and place it in partition 2. Then take the vertex whose minimum distance to the first two chosen vertices is largest and place it in partition 3, and so on until we have populated $k$ partitions with one vertex each. Call these vertices, one per partition, the centers of the partitions. For each remaining vertex, place it in the partition whose center is closest to it. This gives a partition of the $n$ vertices of the graph into $k$ groups. In practice, such schemes work extremely well; this is called the spectral clustering algorithm. For $k = 2$, this essentially takes the second eigenvector, makes the two extreme vertices the centers of the partitions, and places each vertex in the partition whose center is closer to it. This heuristic is worse than the algorithm that achieves Cheeger's inequality, but it is simpler than computing the conductance of $n - 1$ cuts.
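A minimal sketch (illustrative, not from the lecture) of this heuristic: embed each vertex using the first $k$ eigenvectors of the Laplacian, pick centers by the farthest-first rule described above, and assign every vertex to its nearest center. The use of the unnormalized Laplacian, the Euclidean distance, and the example graph are assumptions made for the sketch.

```python
import numpy as np

def spectral_kcenter(A, k, seed=0):
    """Spectral embedding + farthest-first (k-center) assignment; illustrative sketch."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A           # unnormalized Laplacian (normalized also works)
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, :k]                           # embed each vertex via the first k eigenvectors

    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(n))]          # start from an arbitrary vertex
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    while len(centers) < k:
        c = int(np.argmax(dist))              # vertex farthest from all chosen centers
        centers.append(c)
        dist = np.minimum(dist, np.linalg.norm(X - X[c], axis=1))

    # assign every vertex to the partition of its nearest center
    D = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2)
    return D.argmin(axis=1)                   # cluster label in {0, ..., k-1} per vertex

# Example: three triangles connected in a path by two bridge edges.
A = np.zeros((9, 9))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
         (6, 7), (7, 8), (6, 8), (2, 3), (5, 6)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
print(spectral_kcenter(A, k=3))   # expect vertices in the same triangle to share a label
```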