Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering
Shuyang Ling, Courant Institute of Mathematical Sciences, NYU
Joint work with Prof. Thomas Strohmer at UC Davis
Data Science Seminar, HKUST, Aug 13, 2018

Outline
- Motivation: data clustering
- K-means and spectral clustering
- A graph cut perspective of spectral clustering
- Convex relaxation of ratio cuts and normalized cuts
- Theory and applications

Data clustering and unsupervised learning
Question: Given a set of $N$ data points in $\mathbb{R}^d$, how do we partition them into $k$ clusters based on their similarity?

K-means clustering
Cluster the data by minimizing the k-means objective function (the within-cluster sum of squares):
$$\min_{\{\Gamma_l\}_{l=1}^k} \sum_{l=1}^{k} \sum_{i \in \Gamma_l} \Big\| x_i - \underbrace{\tfrac{1}{|\Gamma_l|}\sum_{j \in \Gamma_l} x_j}_{\text{centroid}} \Big\|^2,$$
where $\{\Gamma_l\}_{l=1}^k$ is a partition of $\{1, \dots, N\}$.
- Widely used in vector quantization, unsupervised learning, Voronoi tessellation, etc.
- An NP-hard problem, even if $d = 2$ [Mahajan et al., 2009].
- Heuristic method: Lloyd's algorithm [Lloyd, 1982].
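
To make the heuristic concrete, here is a minimal sketch of Lloyd's algorithm in Python/NumPy; the function name lloyd and its defaults are illustrative, not from the talk.

```python
import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for l in range(k):
            if np.any(labels == l):
                centroids[l] = X[labels == l].mean(axis=0)
    return labels, centroids
```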

Limitation of k-means
K-means only works well when the individual clusters are:
- isotropic and within convex boundaries,
- well-separated.

Kernel k-means and nonlinear embedding
Goal: map the data into a feature space in which the clusters are well-separated, so that k-means works; let $\varphi$ denote this nonlinear map.
How: locally linear embedding, Isomap, multidimensional scaling, Laplacian eigenmaps, diffusion maps, etc.
Focus: We will focus on Laplacian eigenmaps. Spectral clustering consists of Laplacian eigenmaps followed by k-means clustering.

Graph Laplacian
Suppose $\{x_i\}_{i=1}^N \subset \mathbb{R}^d$, and construct a similarity (weight) matrix $W \in \mathbb{R}^{N \times N}$ via
$$w_{ij} := \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\Big),$$
where $\sigma$ controls the size of the neighborhood. In fact, $W$ represents a weighted undirected graph.
Definition of graph Laplacian: The (unnormalized) graph Laplacian associated to $W$ is $L = D - W$, where $D = \operatorname{diag}(W 1_N)$ is the degree matrix.
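
As an illustration, a minimal NumPy sketch of building $W$ and $L = D - W$ with the Gaussian kernel above; the helper name graph_laplacian is mine, and zeroing the diagonal of $W$ is a convention (it does not change $L$).

```python
import numpy as np

def graph_laplacian(X, sigma):
    """Gaussian similarity matrix W and unnormalized Laplacian L = D - W."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # ||x_i - x_j||^2
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)              # drop self-loops (does not change L)
    D = np.diag(W.sum(axis=1))            # degree matrix D = diag(W 1_N)
    return W, D, D - W
```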

Properties of graph Laplacian
The (unnormalized) graph Laplacian associated to $W$ is $L = D - W$, with $D = \operatorname{diag}(W 1_N)$.
Properties:
- $L$ is positive semidefinite: $z^\top L z = \sum_{i<j} w_{ij} (z_i - z_j)^2 \geq 0$.
- $1_N$ is in the null space of $L$, i.e., $\lambda_1(L) = 0$.
- $\lambda_2(L) > 0$ if and only if the graph is connected.
- The dimension of the null space equals the number of connected components.

Laplacian eigenmaps and k-means
Laplacian eigenmap: for the graph Laplacian $L$, define the Laplacian eigenmap via
$$U := \begin{bmatrix} \varphi(x_1)^\top \\ \vdots \\ \varphi(x_N)^\top \end{bmatrix} = [u_1, \dots, u_k] \in \mathbb{R}^{N \times k},$$
where $\{u_l\}_{l=1}^k$ are the eigenvectors w.r.t. the $k$ smallest eigenvalues. In other words, $\varphi$ maps data from $\mathbb{R}^d$ to $\mathbb{R}^k$, giving each point a coordinate in terms of the eigenvectors:
$$\varphi : \underbrace{x_i}_{\in\, \mathbb{R}^d} \mapsto \underbrace{\varphi(x_i)}_{\in\, \mathbb{R}^k}.$$
Then we apply k-means to $\{\varphi(x_i)\}_{i=1}^N$ to perform the clustering.

Algorithm of spectral clustering based on graph Laplacian
Unnormalized spectral clustering (for more details, see the excellent review by [von Luxburg, 2007]):
- Input: the number of clusters $k$ and a dataset $\{x_i\}_{i=1}^N$; construct the similarity matrix $W$ from $\{x_i\}_{i=1}^N$.
- Compute the unnormalized graph Laplacian $L = D - W$.
- Compute the eigenvectors $\{u_l\}_{l=1}^k$ of $L$ w.r.t. the $k$ smallest eigenvalues; let $U = [u_1, u_2, \dots, u_k] \in \mathbb{R}^{N \times k}$.
- Perform k-means clustering on the rows of $U$ using Lloyd's algorithm.
- Obtain the partition from the outcome of k-means.
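
A minimal sketch of this pipeline in Python, assuming a precomputed weight matrix $W$; the k-means step uses scikit-learn's KMeans as a stand-in for Lloyd's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_unnormalized(W, k):
    """Laplacian eigenmap (k smallest eigenvectors of L) followed by k-means."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # eigh returns eigenvalues in ascending order, so the first k columns
    # are the eigenvectors of the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                    # N x k Laplacian eigenmap
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```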

Another variant of spectral clustering
Normalized spectral clustering:
- Input: the number of clusters $k$ and a dataset $\{x_i\}_{i=1}^N$; construct the similarity matrix $W$ from $\{x_i\}_{i=1}^N$.
- Compute the normalized graph Laplacian $L_{\mathrm{sym}} = I_N - D^{-1/2} W D^{-1/2} = D^{-1/2} L D^{-1/2}$.
- Compute the eigenvectors $\{u_l\}_{l=1}^k$ of $L_{\mathrm{sym}}$ w.r.t. the $k$ smallest eigenvalues; let $U = [u_1, u_2, \dots, u_k] \in \mathbb{R}^{N \times k}$.
- Perform k-means clustering on the rows of $D^{-1/2} U$ using Lloyd's algorithm.
- Obtain the partition from the outcome of k-means.
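
The normalized variant differs only in the Laplacian and in the rows fed to k-means; a hedged sketch under the same assumptions as above:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_normalized(W, k):
    """Eigenvectors of L_sym = I - D^{-1/2} W D^{-1/2}, then k-means on rows of D^{-1/2} U."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]
    return KMeans(n_clusters=k, n_init=10).fit_predict(D_inv_sqrt @ U)
```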

Comments on spectral clustering
Pros:
- Spectral clustering enjoys high popularity and applies conveniently to various settings.
- Rich connections to random walks on graphs, spectral graph theory, and differential geometry.
Cons:
- Rigorous justifications of spectral clustering are still lacking.
- The two-step procedure complicates the analysis: e.g., how do we analyze the performance of Laplacian eigenmaps and the convergence of k-means?
Our goal: we take a different route, looking at a convex relaxation of spectral clustering to better understand its performance.

A graph cut perspective
Key observation: the matrix $W$ is viewed as the weight matrix of a graph with $N$ vertices. Partitioning the dataset into $k$ clusters is equivalent to finding a $k$-way graph cut such that no pair of induced subgraphs is strongly connected to each other.
Graph cut: the cut is defined as the sum of the weights of edges whose two endpoints lie in different subsets,
$$\operatorname{cut}(\Gamma, \Gamma^c) := \sum_{i \in \Gamma,\, j \in \Gamma^c} w_{ij},$$
where $\Gamma$ is a subset of vertices and $\Gamma^c$ is its complement.

A graph cut perspective (cont'd)
However, minimizing $\operatorname{cut}(\Gamma, \Gamma^c)$ directly may not lead to satisfactory results, since it tends to produce imbalanced cuts.

RatioCut
The ratio cut of $\{\Gamma_a\}_{a=1}^k$ is given by
$$\operatorname{RatioCut}(\{\Gamma_a\}_{a=1}^k) = \sum_{a=1}^{k} \frac{\operatorname{cut}(\Gamma_a, \Gamma_a^c)}{|\Gamma_a|}.$$
In particular, if $k = 2$,
$$\operatorname{RatioCut}(\Gamma, \Gamma^c) = \frac{\operatorname{cut}(\Gamma, \Gamma^c)}{|\Gamma|} + \frac{\operatorname{cut}(\Gamma, \Gamma^c)}{|\Gamma^c|}.$$
But it is worth noting that minimizing RatioCut is NP-hard. A possible solution is to relax!

RatioCut and graph Laplacian
Let $1_{\Gamma_a} \in \mathbb{R}^N$ be the indicator vector of $\Gamma_a$,
$$1_{\Gamma_a}(l) = \begin{cases} 1, & l \in \Gamma_a, \\ 0, & l \notin \Gamma_a. \end{cases}$$
Relating RatioCut to the graph Laplacian: there holds
$$\operatorname{cut}(\Gamma_a, \Gamma_a^c) = \langle L, 1_{\Gamma_a} 1_{\Gamma_a}^\top \rangle = 1_{\Gamma_a}^\top L\, 1_{\Gamma_a},$$
so that
$$\operatorname{RatioCut}(\{\Gamma_a\}_{a=1}^k) = \sum_{a=1}^{k} \frac{1}{|\Gamma_a|} \langle L, 1_{\Gamma_a} 1_{\Gamma_a}^\top \rangle = \langle L, X_{\mathrm{rcut}} \rangle, \qquad X_{\mathrm{rcut}} := \sum_{a=1}^{k} \frac{1_{\Gamma_a} 1_{\Gamma_a}^\top}{|\Gamma_a|},$$
where $X_{\mathrm{rcut}}$ is a block-diagonal matrix (after a suitable permutation of the vertices).
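
A quick numeric sanity check, on hypothetical random weights, of the identity $\operatorname{cut}(\Gamma, \Gamma^c) = 1_\Gamma^\top L\, 1_\Gamma$ used above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
W = rng.random((N, N))
W = (W + W.T) / 2                            # symmetric weight matrix
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

Gamma = np.arange(N) < 4                     # an arbitrary subset of vertices
ind = Gamma.astype(float)                    # the indicator vector 1_Gamma
cut_value = W[np.ix_(Gamma, ~Gamma)].sum()   # sum of weights crossing the cut
assert np.isclose(ind @ L @ ind, cut_value)
```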

Spectral relaxation of RatioCut
Spectral clustering is a relaxation via these two properties: $X_{\mathrm{rcut}} = U U^\top$ with $U^\top U = I_k$, $U \in \mathbb{R}^{N \times k}$.
Spectral relaxation of RatioCut: substituting $X_{\mathrm{rcut}} = U U^\top$ results in
$$\min_U \ \langle L, U U^\top \rangle \quad \text{s.t.} \quad U^\top U = I_k, \ U \in \mathbb{R}^{N \times k},$$
whose global minimizer is easily found by computing the eigenvectors w.r.t. the $k$ smallest eigenvalues of the graph Laplacian $L$.
The spectral relaxation gives exactly the first step of unnormalized spectral clustering.

Normalized cut
For normalized spectral clustering, we consider
$$\operatorname{NCut}(\{\Gamma_a\}_{a=1}^k) := \sum_{a=1}^{k} \frac{\operatorname{cut}(\Gamma_a, \Gamma_a^c)}{\operatorname{vol}(\Gamma_a)}, \qquad \operatorname{vol}(\Gamma_a) = 1_{\Gamma_a}^\top D\, 1_{\Gamma_a}.$$
Therefore,
$$\operatorname{NCut}(\{\Gamma_a\}_{a=1}^k) = \langle L_{\mathrm{sym}}, X_{\mathrm{ncut}} \rangle,$$
where $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$ is the normalized Laplacian and
$$X_{\mathrm{ncut}} := \sum_{a=1}^{k} \frac{1}{1_{\Gamma_a}^\top D\, 1_{\Gamma_a}} D^{1/2} 1_{\Gamma_a} 1_{\Gamma_a}^\top D^{1/2}.$$
Relaxing $X_{\mathrm{ncut}} = U U^\top$ gives the spectral relaxation based on the normalized graph Laplacian.

Performance bound via a matrix perturbation argument
Let $W^{(a,a)}$ be the weight matrix of the $a$-th cluster and define
$$W_{\mathrm{iso}} := \begin{bmatrix} W^{(1,1)} & & 0 \\ & \ddots & \\ 0 & & W^{(k,k)} \end{bmatrix}, \qquad W_\delta = W - W_{\mathrm{iso}}.$$
Decomposition of $W$: we decompose the original graph into two subgraphs,
$$W = \underbrace{W_{\mathrm{iso}}}_{\text{diagonal blocks: } k \text{ connected components}} + \underbrace{W_\delta}_{\text{off-diagonal blocks: } k\text{-partite graph}}.$$
The corresponding degree matrices are
$$D_{\mathrm{iso}} := \operatorname{diag}(W_{\mathrm{iso}} 1_N), \qquad D_\delta := \operatorname{diag}(W_\delta 1_N),$$
where $D_{\mathrm{iso}}$ is the inner-cluster degree matrix and $D_\delta$ is the outer-cluster degree matrix.
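
A minimal sketch of this decomposition in NumPy; the helper name split_weight_matrix and the representation of the partition as a list of index arrays are my own conventions.

```python
import numpy as np

def split_weight_matrix(W, clusters):
    """Split W into within-cluster (W_iso) and between-cluster (W_delta) parts."""
    N = W.shape[0]
    mask = np.zeros((N, N), dtype=bool)
    for idx in clusters:                      # clusters: list of index arrays
        mask[np.ix_(idx, idx)] = True
    W_iso = np.where(mask, W, 0.0)            # block-diagonal part
    W_delta = W - W_iso                       # off-diagonal part
    D_iso = np.diag(W_iso.sum(axis=1))        # inner-cluster degree matrix
    D_delta = np.diag(W_delta.sum(axis=1))    # outer-cluster degree matrix
    return W_iso, W_delta, D_iso, D_delta
```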

Spectral clustering with k connected components
The unnormalized graph Laplacian associated to $W_{\mathrm{iso}}$ is $L_{\mathrm{iso}} := D_{\mathrm{iso}} - W_{\mathrm{iso}}$, a block-diagonal matrix. There holds
$$\lambda_l(L_{\mathrm{iso}}) = 0, \ 1 \leq l \leq k, \qquad \lambda_{k+1}(L_{\mathrm{iso}}) = \min_a \lambda_2(L_{\mathrm{iso}}^{(a,a)}) > 0.$$
What happens if the graph has $k$ connected components? The null space of $L_{\mathrm{iso}}$ is spanned by the $k$ normalized indicator vectors in $\mathbb{R}^N$, i.e., the columns of
$$U_{\mathrm{iso}} := \begin{bmatrix} \frac{1}{\sqrt{n_1}} 1_{n_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{n_2}} 1_{n_2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sqrt{n_k}} 1_{n_k} \end{bmatrix} \in \mathbb{R}^{N \times k}, \qquad U_{\mathrm{iso}}^\top U_{\mathrm{iso}} = I_k.$$
However, this is not the case if the graph is fully connected.

Matrix perturbation argument
Hope: if $L$ is close to $L_{\mathrm{iso}}$, then $U$ is close to $U_{\mathrm{iso}}$.
Davis-Kahan $\sin\theta$ theorem:
$$\min_{R \in O(k)} \|U - U_{\mathrm{iso}} R\| \leq \frac{2\|L - L_{\mathrm{iso}}\|}{\lambda_{k+1}(L_{\mathrm{iso}})}, \qquad \|L - L_{\mathrm{iso}}\| \leq 2\|D_\delta\|,$$
where $\lambda_{k+1}(L_{\mathrm{iso}}) = \min_a \lambda_2(L_{\mathrm{iso}}^{(a,a)}) > 0$.
In other words, if $\|D_\delta\|$ is small, i.e., the difference between $L$ and $L_{\mathrm{iso}}$ is small, and $\lambda_{k+1}(L_{\mathrm{iso}}) > 0$, then the eigenspace $U$ is close to $U_{\mathrm{iso}}$. But this alone does not imply recovery of the underlying membership.

Convex relaxation of ratio cuts
Let's review the RatioCut:
$$\operatorname{RatioCut}(\{\Gamma_a\}_{a=1}^k) = \sum_{a=1}^{k} \frac{1}{|\Gamma_a|} \langle L, 1_{\Gamma_a} 1_{\Gamma_a}^\top \rangle = \langle L, X_{\mathrm{rcut}} \rangle, \qquad X_{\mathrm{rcut}} := \sum_{a=1}^{k} \frac{1_{\Gamma_a} 1_{\Gamma_a}^\top}{|\Gamma_a|}.$$
Question: What constraints does $X_{\mathrm{rcut}}$ satisfy for any given partition $\{\Gamma_a\}_{a=1}^k$?
Convex constraints: given a partition $\{\Gamma_a\}_{a=1}^k$, the corresponding $X_{\mathrm{rcut}}$ satisfies
- $X_{\mathrm{rcut}}$ is positive semidefinite, $X_{\mathrm{rcut}} \succeq 0$;
- $X_{\mathrm{rcut}}$ is entrywise nonnegative, $X_{\mathrm{rcut}} \geq 0$;
- the constant vector is an eigenvector of $X_{\mathrm{rcut}}$: $X_{\mathrm{rcut}} 1_N = 1_N$;
- the trace of $X_{\mathrm{rcut}}$ equals $k$, i.e., $\operatorname{Tr}(X_{\mathrm{rcut}}) = k$.

Convex relaxation of ratio cuts and normalized cuts
RatioCut-SDP (SDP relaxation of RatioCut): we relax the original combinatorial optimization to
$$\min_Z \ \langle L, Z \rangle \quad \text{s.t.} \quad Z \succeq 0, \ Z \geq 0, \ \operatorname{Tr}(Z) = k, \ Z 1_N = 1_N.$$
The major difference from the spectral relaxation is the nonnegativity constraint.
NCut-SDP (SDP relaxation of normalized cut): the corresponding convex relaxation of the normalized cut (a similar relaxation was proposed by [Xing, Jordan, 2003]) is
$$\min_Z \ \langle L_{\mathrm{sym}}, Z \rangle \quad \text{s.t.} \quad Z \succeq 0, \ Z \geq 0, \ \operatorname{Tr}(Z) = k, \ Z D^{1/2} 1_N = D^{1/2} 1_N.$$
Question: How well do these two convex relaxations work?
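
As an illustration, a minimal sketch of RatioCut-SDP using cvxpy (the talk does not prescribe a solver; cvxpy with its default conic backend is my choice). The NCut-SDP is obtained by swapping in $L_{\mathrm{sym}}$ and the constraint $Z D^{1/2} 1_N = D^{1/2} 1_N$.

```python
import numpy as np
import cvxpy as cp

def ratiocut_sdp(L, k):
    """Solve min <L, Z> s.t. Z PSD, Z >= 0 entrywise, Tr(Z) = k, Z 1_N = 1_N."""
    N = L.shape[0]
    Z = cp.Variable((N, N), PSD=True)
    constraints = [
        Z >= 0,                            # entrywise nonnegativity
        cp.trace(Z) == k,
        Z @ np.ones(N) == np.ones(N),
    ]
    prob = cp.Problem(cp.Minimize(cp.trace(L @ Z)), constraints)
    prob.solve()
    return Z.value
```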

Intuitions
From the plain perturbation argument,
$$\min_{R \in O(k)} \|U - U_{\mathrm{iso}} R\| \leq \frac{4\|D_\delta\|}{\lambda_{k+1}(L_{\mathrm{iso}})},$$
where $\lambda_{k+1}(L_{\mathrm{iso}}) = \min_a \lambda_2(L_{\mathrm{iso}}^{(a,a)}) > 0$ if each cluster is connected.
What the perturbation argument tells us: the success of spectral clustering depends on two ingredients.
- Within-cluster connectivity: $\lambda_2(L_{\mathrm{iso}}^{(a,a)})$, a.k.a. the algebraic connectivity or Fiedler eigenvalue, or equivalently $\lambda_{k+1}(L_{\mathrm{iso}})$.
- Inter-cluster connectivity: the operator norm of $D_\delta$ quantifies the maximal outer-cluster degree.

Main theorem for RatioCut-SDP
Theorem (Ling, Strohmer, 2018). The SDP relaxation gives $X_{\mathrm{rcut}}$ as the unique global minimizer if
$$\|D_\delta\| < \frac{\lambda_{k+1}(L_{\mathrm{iso}})}{4},$$
where $\lambda_{k+1}(L_{\mathrm{iso}})$ is the $(k+1)$-th smallest eigenvalue of the graph Laplacian $L_{\mathrm{iso}}$. Here $\lambda_{k+1}(L_{\mathrm{iso}})$ satisfies
$$\lambda_{k+1}(L_{\mathrm{iso}}) = \min_{1 \leq a \leq k} \lambda_2(L_{\mathrm{iso}}^{(a,a)}),$$
where $\lambda_2(L_{\mathrm{iso}}^{(a,a)})$ is the second smallest eigenvalue of the graph Laplacian of the $a$-th cluster.
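
A minimal sketch of checking this sufficient condition numerically, reusing split_weight_matrix from the earlier sketch; given $W$ and a candidate partition, every quantity is directly computable.

```python
import numpy as np

def ratiocut_condition_holds(W, clusters):
    """Check ||D_delta|| < lambda_{k+1}(L_iso) / 4 for a candidate partition."""
    W_iso, W_delta, D_iso, D_delta = split_weight_matrix(W, clusters)
    L_iso = D_iso - W_iso
    k = len(clusters)
    lam = np.sort(np.linalg.eigvalsh(L_iso))
    # D_delta is diagonal and PSD, so its operator norm is its largest entry.
    return np.max(np.diag(D_delta)) < lam[k] / 4
```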

A random walk perspective of normalized cut
For the normalized cut, we first consider the Markov transition matrices on the whole dataset and on each individual cluster:
$$P := D^{-1} W, \qquad P_{\mathrm{iso}}^{(a,a)} = (D^{(a,a)})^{-1} W^{(a,a)}, \quad D^{(a,a)} = \operatorname{diag}(W^{(a,a)} 1_{n_a}).$$
Two factors:
- Within-cluster connectivity: the smaller the second largest eigenvalue of $P_{\mathrm{iso}}^{(a,a)}$ is (this eigenvalue governs the mixing time of the Markov chain), the stronger the connectivity of the $a$-th cluster.
- Inter-cluster connectivity: let $P_\delta$ be the off-diagonal part of $P$. Then $\|P_\delta\|_\infty$ is the largest probability that a random walker moves out of its own cluster after one step.

Main theorem for NCut-SDP
Theorem (Ling, Strohmer, 2018). The SDP gives $X_{\mathrm{ncut}}$ as the unique global minimizer if
$$\frac{\|P_\delta\|_\infty}{1 - \|P_\delta\|_\infty} < \frac{1}{4} \min_a \lambda_2\big(I_{n_a} - P_{\mathrm{iso}}^{(a,a)}\big).$$
Note that
- $\frac{\|P_\delta\|_\infty}{1 - \|P_\delta\|_\infty}$ is small if a random walker starting from any node is more likely to stay in its own cluster than to leave it after one step, and vice versa;
- $\min_a \lambda_2(I_{n_a} - P_{\mathrm{iso}}^{(a,a)})$ is the smallest spectral gap of the within-cluster Markov transition matrices, taken across all clusters.
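
A corresponding sketch for checking the NCut-SDP condition, again reusing split_weight_matrix; taking $D^{(a,a)}$ to be the within-cluster degree matrix is my reading of the definition above.

```python
import numpy as np

def ncut_condition_holds(W, clusters):
    """Check ||P_delta||_inf / (1 - ||P_delta||_inf) < min_a lambda_2(I - P_iso^(a,a)) / 4."""
    _, W_delta, _, _ = split_weight_matrix(W, clusters)
    deg = W.sum(axis=1)
    p_delta_inf = np.max(W_delta.sum(axis=1) / deg)   # largest one-step escape probability
    gaps = []
    for idx in clusters:
        W_a = W[np.ix_(idx, idx)]
        P_a = W_a / W_a.sum(axis=1, keepdims=True)    # within-cluster random walk
        # P_a is similar to a symmetric matrix, so its eigenvalues are real.
        lam = np.sort(np.linalg.eigvals(P_a).real)[::-1]
        gaps.append(1.0 - lam[1])                     # lambda_2(I - P_a)
    return p_delta_inf / (1 - p_delta_inf) < min(gaps) / 4
```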

Main contribution of our result
Our contribution:
- Purely deterministic: exact recovery under natural and mild conditions, with no assumptions on a random generative model for the data.
- General and applicable to various settings.
- Easily verifiable criteria for the global optimality of a graph cut under ratio cut and normalized cut.
- Applies not only to spectral clustering but also to community detection under the stochastic block model.

Example 1: two concentric circles
We consider
$$x_{1,i} = \begin{bmatrix} \cos(\frac{2\pi i}{n}) \\ \sin(\frac{2\pi i}{n}) \end{bmatrix}, \ 1 \leq i \leq n; \qquad x_{2,j} = \kappa \begin{bmatrix} \cos(\frac{2\pi j}{m}) \\ \sin(\frac{2\pi j}{m}) \end{bmatrix}, \ 1 \leq j \leq m,$$
where $m \approx \kappa n$ and $\kappa > 1$.
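
A minimal sketch of generating this dataset (rounding $m$ up to an integer is an implementation detail not specified above):

```python
import numpy as np

def two_circles(n, kappa):
    """n points on the unit circle and about kappa*n points on the circle of radius kappa."""
    m = int(np.ceil(kappa * n))
    t1 = 2 * np.pi * np.arange(1, n + 1) / n
    t2 = 2 * np.pi * np.arange(1, m + 1) / m
    X1 = np.column_stack([np.cos(t1), np.sin(t1)])
    X2 = kappa * np.column_stack([np.cos(t2), np.sin(t2)])
    labels = np.r_[np.zeros(n, dtype=int), np.ones(m, dtype=int)]
    return np.vstack([X1, X2]), labels
```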

Example 1: two concentric circles
Theorem. Let the data be given by $\{x_{1,i}\}_{i=1}^n \cup \{x_{2,j}\}_{j=1}^m$ and choose the bandwidth as
$$\sigma = \frac{4\gamma}{n \log(\frac{m}{2\pi})}.$$
Then the SDP recovers the underlying two clusters exactly if
$$\underbrace{\kappa - 1}_{\text{minimal separation}} \gtrsim \frac{\gamma}{n}.$$
The bound is near-optimal, since two adjacent points on each circle are $O(n^{-1})$ apart: one cannot hope to recover the two clusters if the minimal separation is smaller than $O(n^{-1})$.

Example 2: two parallel lines
Suppose the data points lie on two parallel line segments with separation $\Delta$,
$$x_{1,i} = \begin{bmatrix} -\Delta/2 \\ \frac{i-1}{n-1} \end{bmatrix}, \qquad x_{2,i} = \begin{bmatrix} \Delta/2 \\ \frac{i-1}{n-1} \end{bmatrix}, \qquad 1 \leq i \leq n.$$
Which partition is given by the k-means objective function?
[Figure: two candidate 2-way partitions of the two-line data.]
Each cluster is highly anisotropic.
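
A minimal sketch of generating this data, assuming the two unit-length segments are placed at $x = \pm\Delta/2$ with $n$ evenly spaced points each (my reading of the construction above):

```python
import numpy as np

def two_lines(n, delta):
    """Two unit-length vertical segments at x = -delta/2 and x = +delta/2."""
    t = np.arange(n) / (n - 1)               # (i-1)/(n-1), i = 1, ..., n
    X1 = np.column_stack([np.full(n, -delta / 2), t])
    X2 = np.column_stack([np.full(n,  delta / 2), t])
    return np.vstack([X1, X2])
```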

Example 2: two parallel lines
K-means does not work if $\Delta < 0.5$.
Theorem. Let the data be given by $\{x_{1,i}\}_{i=1}^n \cup \{x_{2,i}\}_{i=1}^n$ and use the heat kernel with bandwidth
$$\sigma = \frac{\gamma}{(n-1)\log(\frac{n}{\pi})}, \qquad \gamma > 0.$$
Assume the separation satisfies $\Delta \gtrsim \frac{\gamma}{n}$. Then the SDP recovers the underlying two clusters exactly.

Example 3: community detection under the stochastic block model
The adjacency matrix is a binary random matrix:
1. if members $i$ and $j$ are in the same community, $\Pr(w_{ij} = 1) = p$ and $\Pr(w_{ij} = 0) = 1 - p$;
2. if members $i$ and $j$ are in different communities, $\Pr(w_{ij} = 1) = q$ and $\Pr(w_{ij} = 0) = 1 - q$.
Let $p > q$; we are interested in when the SDP relaxation is able to recover the underlying membership.
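
A minimal sketch of sampling such an adjacency matrix for two equal-sized communities; the function name and the equal-size assumption are illustrative.

```python
import numpy as np

def sbm_two_communities(n, p, q, seed=0):
    """Adjacency matrix of a two-community SBM with n nodes per community."""
    rng = np.random.default_rng(seed)
    labels = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    draws = rng.random((2 * n, 2 * n)) < probs        # independent Bernoulli draws
    A = np.triu(draws, k=1)                           # keep strict upper triangle
    A = (A | A.T).astype(float)                       # symmetrize, no self-loops
    return A, labels
```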

Graph cuts and the stochastic block model
Theorem. Let $p = \frac{\alpha \log n}{n}$ and $q = \frac{\beta \log n}{n}$. The RatioCut-SDP recovers the underlying communities exactly with high probability if
$$\alpha > 26\Big(\frac{2}{3} + \frac{\beta}{2} + \sqrt{\beta}\Big).$$
Compared with the state-of-the-art results [Abbe, Bandeira, Hall, 2016], where $\alpha > 2 + \beta + 2\sqrt{2\beta}$ is needed for exact recovery, our performance bound is slightly looser, by a constant factor.

Numerics: random instances
So far, the two toy examples are purely deterministic; it is of course more interesting to look at random data. However, this task is nontrivial and leads to a few open problems. Therefore we turn to simulation to see how many random instances satisfy the conditions in our theorems, and how this depends on the choice of $\sigma$ and the minimal separation.
RatioCut-SDP: $\|D_\delta\| < \frac{\lambda_{k+1}(L_{\mathrm{iso}})}{4}$.
NCut-SDP: $\frac{\|P_\delta\|_\infty}{1 - \|P_\delta\|_\infty} < \frac{1}{4}\min_a \lambda_2(I_{n_a} - P_{\mathrm{iso}}^{(a,a)})$.

Numerics for two concentric circles
Exact recovery is guaranteed empirically once $\Delta \geq 0.2$, for a suitable range of the bandwidth parameter $p$ (the bandwidth $\sigma$ grows proportionally with $p$).
[Figure: two panels plotting results versus $p$ for $\Delta = 0.1, 0.15, 0.2, 0.25, 0.3$. Caption: Two concentric circles with radii $r_1 = 1$ and $r_2 = 1 + \Delta$. The smaller circle has $n = 250$ uniformly distributed points and the larger one has $m = 250(1 + \Delta)$, where $\Delta = r_2 - r_1$.]

Numerics for two lines
Exact recovery is guaranteed empirically once $\Delta \geq 0.05$, for a suitable range of the bandwidth parameter $p$.
[Figure: two panels plotting results versus $p$ for $\Delta = 0.025, 0.05, 0.075, 0.1$. Caption: Two parallel line segments of unit length with separation $\Delta$; 250 points are sampled uniformly on each line.]

Numerics: stochastic ball model
[Figure: stochastic ball model with two clusters; the separation between the centers is $2 + \Delta$. Each cluster has 1000 uniformly distributed points.]

Numerics: stochastic ball model
Exact recovery is guaranteed empirically, with the bandwidth $\sigma$ proportional to the parameter $p$:
1. if $n = 250$, we require $\Delta \geq 0.2$ and a suitable range of $p$;
2. if $n = 1000$, we require $\Delta \geq 0.1$ and a suitable range of $p$.
[Figure: performance of NCut-SDP for the 2D stochastic ball model. Left: each ball has $n = 250$ points ($\Delta = 0.1$ to $0.3$); right: each ball contains $n = 1000$ points ($\Delta = 0.1$ to $0.2$).]

Comparison with k-means
For the stochastic ball model, k-means should be the best choice, since each cluster is isotropic and well-separated. However, it has been shown (arXiv:1710.06008, Li et al.) that exact recovery via the k-means SDP is impossible if $\Delta \leq \sqrt{3/2} - 1 \approx 0.2247$. On the other hand, the SDP relaxations of RatioCut and NCut still work:
1. if $n = 250$, we require $\Delta \geq 0.2$ and a suitable range of the bandwidth parameter $p$;
2. if $n = 1000$, we require $\Delta \geq 0.1$ and a suitable range of $p$.

Outlook for random data models
Open problem: Suppose there are $n$ data points drawn from a probability density function $p(x)$ supported on a manifold $\mathcal{M}$. How can we estimate the second smallest eigenvalue of the graph Laplacian (either normalized or unnormalized), given the kernel function $\Phi$ and the bandwidth $\sigma$?

A possible solution to the open problem
It is well known that the graph Laplacian converges to the Laplace-Beltrami operator on the manifold (both pointwise and in the spectral sense). In other words, if we know the second smallest eigenvalue of the Laplace-Beltrami operator on the manifold and the convergence rate of the graph Laplacian to its continuum limit, we can obtain a lower bound on the second smallest eigenvalue of the graph Laplacian. This requires tools from empirical processes, differential geometry, ...

Conclusion and future works
- We establish a systematic framework to certify the global optimality of a graph cut under ratio cut and normalized cut.
- The performance guarantee is purely algebraic and deterministic, depending only on algebraic properties of the graph Laplacian. It may lead to a novel way of looking at spectral graph theory.
- How can we estimate the Fiedler eigenvalue of the graph Laplacian associated to a random dataset?
- Can we derive an analogous theory for directed graphs?
- Can we find a faster, scalable SDP solver, and how can we analyze the classical spectral clustering algorithm?
Preprint: Certifying global optimality of graph cuts via semidefinite relaxation: a performance guarantee for spectral clustering, arXiv:1806.11429.
Thank you!