Clustering Algorithms for Random and Pseudo-random Structures
Dissertation Defense
Pradipta Mitra
Department of Computer Science, Yale University
April 23, 2008

Committee
Ravi Kannan (Advisor)
Dana Angluin
Dan Spielman
Mike Mahoney (Yahoo!)

Outline
1 Introduction: Clustering and Spectral algorithms
2 Four results:
  1 Clustering using Bi-partitioning
  2 Clustering in sparse graphs
  3 A Robust clustering algorithm
  4 An Entrywise notion for spectral stability
3 Future work

Clustering
What is clustering?
Given a set of objects S, partition it into disjoint sets, or clusters, S_1, ..., S_k.
The partitioning is done according to some notion of closeness, i.e. objects in a cluster S_r are close to each other, and far from objects in other clusters.
Issues
What is the right definition of closeness?
Algorithms to find clusters given the right definition.

Clustering: Examples
Figure: From Yale face dataset

Clustering: Matrices
Term-document matrices

                 M  V  C  A  H  K  F  P
CS Doc 1         1  1  1  1  0  0  0  0
CS Doc 2         1  1  1  1  0  0  0  1
CS Doc 3         1  1  0  1  0  0  0  0
Medicine Doc 1   0  0  0  0  1  1  1  1
Medicine Doc 2   1  0  0  0  1  1  0  1
Medicine Doc 3   0  1  0  0  1  1  1  0

M - Microprocessor, V - Virtual Memory, C - L2 Cache, A - Algorithm,
H - Hemoglobin, K - Kidney, F - Fracture, P - Painkiller

Moral
Clustering problems can be modelled as object-feature matrices; objects can be seen as vectors in a high-dimensional space.

Mixture models
Each cluster is defined by a simple (high-dimensional) probability distribution. Objects are samples from these distributions.
Hope
Can successfully cluster if the centers (means) are far apart.
How large does the separation between the centers need to be?
Figure: Two circles whose centers µ_1 and µ_2 are separated

Random graphs
A G_{n,p} random graph is generated by selecting each possible edge independently with probability p.
Example: G_{5,0.5}

E[A] =
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5

A =
0 1 1 0 0
1 0 0 1 1
1 0 1 0 1
0 1 0 0 1
0 1 1 1 1
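For concreteness, a minimal sketch (not from the slides; the helper name is illustrative) of sampling such an adjacency matrix:

```python
import numpy as np

def gnp_adjacency(n, p, seed=None):
    """Sample the adjacency matrix of a G_{n,p} random graph: each edge {i, j}
    with i < j is included independently with probability p; the matrix is
    symmetric with a zero diagonal."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(int)
    return upper + upper.T

A = gnp_adjacency(5, 0.5, seed=0)   # one sample; every off-diagonal entry of E[A] is 0.5
```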

Planted partition model for Graphs
Total of n vertices, divided into k clusters T_1, T_2, ..., T_k of sizes n_1, ..., n_k.
There are k(k+1)/2 probabilities P_rs (= P_sr) such that if v ∈ T_r and u ∈ T_s, the edge e(u, v) is present with probability P_rs.

Planted partition model for Graphs

P =
0.5 0.1
0.1 0.5

E[A] =
0.5 0.5 0.5 0.1 0.1 0.1
0.5 0.5 0.5 0.1 0.1 0.1
0.5 0.5 0.5 0.1 0.1 0.1
0.1 0.1 0.1 0.5 0.5 0.5
0.1 0.1 0.1 0.5 0.5 0.5
0.1 0.1 0.1 0.5 0.5 0.5

A =
1 1 0 0 0 1
1 1 1 0 0 1
0 1 1 1 0 0
0 0 1 1 0 1
0 0 0 0 1 1
1 1 0 1 1 1

µ_1 = {0.5, 0.5, 0.5, 0.1, 0.1, 0.1}
µ_2 = {0.1, 0.1, 0.1, 0.5, 0.5, 0.5}
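A minimal sketch of sampling from this model (illustrative code, not part of the defense):

```python
import numpy as np

def planted_partition(sizes, P, seed=None):
    """Sample an adjacency matrix from the planted partition model: sizes[r] is
    |T_r| and P[r, s] is the edge probability between T_r and T_s."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # cluster label of each vertex
    EA = P[np.ix_(labels, labels)]                     # entrywise edge probabilities
    upper = np.triu(rng.random(EA.shape) < EA, k=1).astype(int)
    return upper + upper.T, labels

P = np.array([[0.5, 0.1],
              [0.1, 0.5]])
A, labels = planted_partition([3, 3], P, seed=0)       # a 6-vertex instance as above
```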

Algorithmic Issues
Heuristic analysis: analyze an algorithm known to work in practice.
Spectral Algorithms
Use information about the spectrum (eigenvalues, eigenvectors, singular vectors, etc.) of the data matrix to do the clustering.
Quite popular, seems to work in practice.
Singular values and vectors can be computed efficiently.
For a matrix A, the span of the top k singular vectors gives A_k, the rank-k matrix such that for all rank-k matrices M
  ‖A − A_k‖ ≤ ‖A − M‖
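As a quick illustration of the last point, a sketch of computing A_k from the top k singular vectors (standard numpy, not thesis code):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation A_k of A, obtained by keeping only the top k
    singular values and vectors; optimal in both spectral and Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).random((8, 8))
A2 = best_rank_k(A, 2)
# the spectral-norm error ||A - A_2|| equals the third-largest singular value of A
print(np.linalg.norm(A - A2, 2), np.linalg.svd(A, compute_uv=False)[2])
```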

Why might this work?
Intuition:
Eliminates noise
Avoids the curse of dimensionality
Cheeger's inequality (relation to sparsest cut)
Convention
Eigen/singular values are often sorted from largest to smallest (in absolute value): λ_1 ≥ λ_2 ≥ λ_3 ≥ ...
The eigen/singular vector corresponding to λ_i is the i-th eigen/singular vector.

Why might this work?

A =
0 1 1 1 0 0 0 1
1 0 1 1 0 0 1 0
1 1 0 1 0 1 0 0
1 1 1 0 1 0 0 0
0 0 0 1 0 1 1 1
0 0 1 0 1 0 1 1
0 1 0 0 1 1 0 1
1 0 0 0 1 1 1 0

Quick definition: if A is square and symmetric, v is an eigenvector of A if ‖v‖ = 1 and Av = λv for some λ (an eigenvalue).

1 = {1, ..., 1}^T and A1 = 4·1, so 1 is the first eigenvector.
v = {1, 1, 1, 1, −1, −1, −1, −1}^T and Av = 2v, so v is the second eigenvector, and it reveals the clusters.
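A quick numerical check of this example (a sketch; the use of eigh and the sign convention are incidental choices):

```python
import numpy as np

A = np.array([[0,1,1,1,0,0,0,1],
              [1,0,1,1,0,0,1,0],
              [1,1,0,1,0,1,0,0],
              [1,1,1,0,1,0,0,0],
              [0,0,0,1,0,1,1,1],
              [0,0,1,0,1,0,1,1],
              [0,1,0,0,1,1,0,1],
              [1,0,0,0,1,1,1,0]])

vals, vecs = np.linalg.eigh(A)      # eigenvalues in ascending order
print(vals[-1], vals[-2])           # 4.0 and 2.0 for this graph
second = vecs[:, -2]                # eigenvector of the second-largest eigenvalue
print(np.sign(second))              # the sign pattern separates {1,2,3,4} from {5,6,7,8}
```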

Previous Work
Lots of work: [B 87], [DF 89], [AK 97], [AKS 98], [VW 2002], ...
McSherry 2001
An instance of the planted partition model A with k clusters can be clustered with probability 1 − o(1) if the following separation condition holds (centers are far apart): for all r ≠ s
  ‖µ_r − µ_s‖^2 ≥ c σ^2 log n
where σ^2 = max_rs P_rs and n = number of vertices.
Assumption: σ^2 ≥ log^6 n / n, i.e. at least polylogarithmic degree.
Spectral method: take the best rank-k approximation A_k and do a greedy clustering on that matrix. This gives an approximate clustering.
Clean-up: use combinatorial projections, i.e. counting edges to the approximate partitions.

Our Contribution
Clustering by Recursive Bi-partitioning: use the second singular vector to bi-partition the data; repeat.
Pseudo-random models of Clustering: used to model clustering problems for sparse (constant-degree) graphs.
Rotationally invariant algorithms: remove combinatorial/ad-hoc techniques for discrete distributions.
Entrywise bounds for Eigenvectors: a different notion of spectral stability for random graphs.

Spectral Clustering by Recursive Bi-partitioning

Spectral Clustering by Recursive Bi-partitioning
Joint work with Dasgupta, Hopcroft and Kannan (ESA 2006)
Goal
Instead of a rank-k-approximation-based method, use an incremental algorithm that bi-partitions the data at each step.
Result
Clustering is possible if for all r ≠ s
  ‖µ_r − µ_s‖^2 ≥ c (σ_r + σ_s)^2 log n
where σ_r^2 = max_s P_rs ≥ log^6 n / n.

Basic Step
Given A, find the unit vector v_1 that maximizes ‖AJv_1‖, where J = I − (1/n)·1·1^T.
Sort the entries of v_1: v_1(1) ≥ v_1(2) ≥ ... ≥ v_1(n).
Find i, i+1 such that v_1(i) − v_1(i+1) is largest.
Return {1, ..., i} and {i+1, ..., n} as the bi-partition.
Definition refresher
v_1 is the first right singular vector of AJ, and close to the second right singular vector of A.
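A minimal sketch of this Basic Step (assuming the columns of A index the objects to be bi-partitioned; dense SVD for simplicity):

```python
import numpy as np

def basic_step(A):
    """Bi-partition the n columns of A using the top right singular vector of A·J,
    where J = I - (1/n)·1·1^T."""
    m, n = A.shape
    J = np.eye(n) - np.ones((n, n)) / n
    _, _, Vt = np.linalg.svd(A @ J, full_matrices=False)
    v1 = Vt[0]                                # first right singular vector of AJ
    order = np.argsort(-v1)                   # sort the entries of v1, largest first
    gaps = v1[order[:-1]] - v1[order[1:]]
    i = int(np.argmax(gaps))                  # largest gap between consecutive entries
    return order[:i + 1], order[i + 1:]       # the two sides of the bi-partition
```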

Main algorithm
Given A, randomly partition the rows into t = 4 log n parts B_i (i = 1 to t) of equal size.
Bi-partition the (same) columns t times using the Basic Step (previous slide).
Combine these (approximate) bi-partitions to find an accurate bi-partition.

Analysis
Let's focus on one B_i. Call it B, and let B̄ = E(B).

Furedi-Komlos 81: if σ^2 ≥ log^6 n / n, then ‖B − B̄‖ ≤ 3σ√n.

v_1(BJ) is almost structured
Let v_1 = v_1(BJ). Then
  v_1 = Σ_r α_r g^(r) + v⊥
where g^(r) is the characteristic vector of T_r, v⊥ is orthogonal to each g^(r), and ‖v⊥‖ ≤ 1/c_2.

Why: since B̄J v⊥ = 0,
  ‖B̄J‖ ≤ ‖BJ‖ + ‖B − B̄‖ = ‖BJv_1‖ + ‖B − B̄‖ ≤ ‖B̄Jv_1‖ + 2‖B − B̄‖
       = ‖B̄Jv‖ + 2‖B − B̄‖ ≤ ‖B̄J‖ √(1 − ‖v⊥‖^2) + 2‖B − B̄‖.
Using √(1 − x) ≤ 1 − x/2,
  (1/2)‖v⊥‖^2 ≤ 2‖B − B̄‖ / ‖B̄J‖,
which, with the Furedi-Komlos bound and the separation condition, gives ‖v⊥‖ ≤ 1/c_2.

Analysis
v_1 = v + v⊥, with v = Σ_r α_r g^(r).

Claim: when sorted, there is an Ω(1) gap among the α's.

v is orthogonal to 1. This implies that the α_r sum to 0 (weighted by cluster sizes), so they cannot all have the same sign.
On the other hand,
  1 = ‖v_1‖^2 = ‖v‖^2 + ‖v⊥‖^2 ≤ Σ_r α_r^2 + 1/c_2^2,
so Σ_r α_r^2 ≥ 1/2.
(Figure: the structured part v.)
Together these prove the existence of an Ω(1) gap.

Analysis
v_1 = v + v⊥, with v = Σ_r α_r g^(r).

Claim: no more than n_min/c_3 vertices cross the gap (n_min = min_r n_r).

Implied by the fact that ‖v⊥‖ is small:
An Ω(1) gap in the α's gives a gap of at least 1/(4√n_min) in the entries of v.
Suppose m vertices cross the gap. Then
  m · (1/(16 n_min)) ≤ ‖v⊥‖^2 ≤ 1/c_2^2,
so m ≤ 16 n_min / c_2^2.
(Figure: the sorted entries of v_1.)

Combining the 4 log n bi-partitions
We showed: no more than n_min/c_3 vertices cross the gap. Equivalently, a vertex has probability ε = 1/c_3 of being misclassified.

Given the 4 log n bi-partitions, construct a graph on the vertices as follows (a code sketch of this step follows below):
For each u, v ∈ [n], set e(u, v) = 1 if u and v are on the same side of the bi-partition in at least a 1 − 2ε fraction of the cases.
Find the connected components of this graph and return them as a (bi-)partition.

Need to show:
Clean clusters: no two vertices from the same cluster are put in different components. Indeed, let u, v ∈ T_r. Vertex v is on the correct side of the bi-partition in at least a 1 − ε fraction of the cases, and the same holds for u; so u and v are on the same side in at least a 1 − 2ε fraction of the cases.
Nontrivial partition: we find at least two components (a counting argument).
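A minimal sketch of the combining step (the array layout and eps handling are illustrative assumptions, not the thesis code):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def combine_bipartitions(sides, eps):
    """sides is a t x n 0/1 array: sides[i, u] is the side of vertex u in the i-th
    bi-partition. Join u and v if they fall on the same side in at least a
    (1 - 2*eps) fraction of the bi-partitions, then return connected components."""
    agree = (sides[:, :, None] == sides[:, None, :]).mean(axis=0)  # n x n agreement fractions
    adj = csr_matrix(agree >= 1 - 2 * eps)
    _, labels = connected_components(adj, directed=False)
    return labels
```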

Pseudo-randomness and Clustering

Sparse graphs?
Goal
Design a model that allows constant-degree graphs.
Problems
Standard condition: ‖µ_r − µ_s‖^2 ≥ c σ^2 log n. A planted partition model with σ^2 = Θ(d/n) for constant d will have vertices of logarithmic degree.
Our Result
We introduce a model where clustering is possible if, for constant α,
  ‖µ_r − µ_s‖^2 ≥ c (α^2/n) log^2 α

Solution: Use pseudo-randomness
A graph G(V, E) is (p, α) pseudo-random if for all A, B ⊆ V
  |e(A, B) − p|A||B|| ≤ α √(|A||B|)
Theorem
A G_{n,p} random graph is (p, 2√(np)) pseudo-random (for p ≥ log^6 n / n).
Proof: E(e(A, B)) = p|A||B|. Using the Chernoff bound,
  P( |e(A, B) − E(e(A, B))| > 2√(np|A||B|) ) ≤ exp(−2n).
But there are only 2^n · 2^n = 2^{2n} pairs of sets A, B. The claim follows.
Intuition
Pseudo-random graphs are deterministic versions of random graphs.
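An empirical spot check of this condition (a sketch; it samples random subset pairs S, T rather than checking all 2^{2n} of them, and counts ordered pairs):

```python
import numpy as np

def max_discrepancy(A, p, trials=2000, seed=None):
    """Largest observed |e(S,T) - p|S||T|| / sqrt(|S||T|) over random subset pairs;
    compare the result against alpha = 2*sqrt(n*p)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    worst = 0.0
    for _ in range(trials):
        S = rng.random(n) < rng.uniform(0.1, 0.9)
        T = rng.random(n) < rng.uniform(0.1, 0.9)
        if S.sum() == 0 or T.sum() == 0:
            continue
        e_ST = A[np.ix_(S, T)].sum()           # pairs with one endpoint in S, one in T
        dev = abs(e_ST - p * S.sum() * T.sum()) / np.sqrt(S.sum() * T.sum())
        worst = max(worst, dev)
    return worst
```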

The model
Graph G, k clusters T_r, r ∈ [k].
For some α and for each r, s ∈ [k] there is p_rs such that:
G(T_r, T_s) is (p_rs, α) pseudo-random.
Also, |e(x, T_s) − p_rs |T_s|| ≤ 2α if x ∈ T_r.
Algorithmic issue: Furedi-Komlos doesn't apply, and there is no independence!

Ā =
0.5 0.5 0.5 0.1 0.1 0.1
0.5 0.5 0.5 0.1 0.1 0.1
0.5 0.5 0.5 0.1 0.1 0.1
0.1 0.1 0.1 0.5 0.5 0.5
0.1 0.1 0.1 0.5 0.5 0.5
0.1 0.1 0.1 0.5 0.5 0.5

A =
1 1 0 0 0 1
1 1 1 0 0 1
0 1 0 1 0 0
0 0 1 1 0 1
0 0 0 0 0 1
1 1 0 1 1 0

Rotationally Invariant Algorithm for Discrete Distributions

Discrete vs. Continuous
Similar results can be proved for discrete and continuous models:
  ‖µ_r − µ_s‖^2 ≥ Ω(σ^2 log n)
The algorithms:
1 Share the spectral part that gives an approximation
2 Differ in the clean-up phase; continuous models seem to have more natural algorithms
Mixture of Gaussians: k high-dimensional Gaussians with centers µ_r, r = 1 to k. The pdf of the r-th cluster/Gaussian:
  f_r(x) ∝ exp( −(1/2) (x − µ_r)^T Σ_r^{-1} (x − µ_r) )
where Σ_r is the covariance matrix.

Discrete vs. Continuous
We would like an algorithm that is:
1 Simple, with a natural clean-up phase: a one-shot distance-based or projection-based algorithm, instead of combinatorial, incremental or sampling techniques.
2 Rotationally invariant: a natural assumption, since if the vectors are rotated, the clustering remains the same.
3 Easily extensible to more complex models: simpler algorithms are easier to adapt, e.g. to models without complete independence or without block structure.
McSherry 2001
Conjecture: such an algorithm exists.
Our result
The conjecture is true.
Theorem
Consider a matrix generated from a discrete mixture model with k clusters, m objects and n features. Clustering is possible if:
  ‖µ_r − µ_s‖^2 ≥ c σ^2 (1 + n/m) log m

Our algorithm
Cluster(A, k)
  Divide A into A_1 and A_2
  {µ̂_r} = Centers(A_1, k)
  Project(A_2, µ̂_1, ..., µ̂_k)

Project(A_2, µ̂_1, ..., µ̂_k)
  Group v ∈ A_2 with the µ̂_r that minimizes ‖v − µ̂_r‖

Centers(A_1, k)
  Uses a spectral algorithm to find approximate clusters P_r, r ∈ [k].
  Returns the empirical centers µ̂_r = (1/|P_r|) Σ_{v ∈ P_r} v
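A minimal sketch of this scheme (rows of A as samples; find_approx_clusters stands in for the spectral Centers step and is an assumed callback, not part of the slides):

```python
import numpy as np

def project_to_centers(A2, centers):
    """The Project step: assign each row v of A2 to the nearest estimated center."""
    d = np.linalg.norm(A2[:, None, :] - centers[None, :, :], axis=2)   # |A2| x k distances
    return np.argmin(d, axis=1)

def cluster(A, k, find_approx_clusters, seed=0):
    """Cluster(A, k): split the samples, estimate centers on one half with a
    spectral routine, then assign the other half to the nearest center."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(A))
    A1, A2 = A[idx[: len(A) // 2]], A[idx[len(A) // 2:]]
    labels1 = find_approx_clusters(A1, k)                                   # approximate clusters P_r
    centers = np.vstack([A1[labels1 == r].mean(axis=0) for r in range(k)])  # empirical centers
    return project_to_centers(A2, centers)
```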

Analysis
Lemma
  ‖µ̂_r − µ_r‖^2 ≤ c_1 σ^2 (1 + n/m) ≤ (1/20) ‖µ_r − µ_s‖^2    (1)
Proof idea:
The spectral method returns an approximately correct partition. Let P'_r be the correctly classified part of P_r, with p_r = |P_r| and p'_r = |P'_r|; let Q_rs be the points that should be in P_s but were placed in P_r, with q_rs = |Q_rs|.
  µ̂_r = (1/p_r) Σ_{v ∈ P_r} v
  p_r µ̂_r = Σ_{v ∈ P_r} v = Σ_{v ∈ P'_r} v + Σ_s Σ_{v ∈ Q_rs} v
  p_r (µ̂_r − µ_r) = Σ_{v ∈ P'_r} (v − µ_r) + Σ_s Σ_{v ∈ Q_rs} (v − µ_r)

Analysis
  p_r (µ̂_r − µ_r) = Σ_{v ∈ P'_r} (v − µ_r) + Σ_s Σ_{v ∈ Q_rs} (v − µ_r)
Need to bound ‖Σ_{v ∈ P'_r} (v − µ_r)‖ and, for each s,
  ‖Σ_{v ∈ Q_rs} (v − µ_r)‖ ≤ ‖Σ_{v ∈ Q_rs} (v − µ_s)‖ + ‖q_rs µ_s − q_rs µ_r‖,  where ‖q_rs µ_s − q_rs µ_r‖ = q_rs ‖µ_s − µ_r‖.
It turns out that q_rs decreases as ‖µ_s − µ_r‖ increases, so the two cancel each other out.
The bound on ‖Σ_{v ∈ P'_r} (v − µ_r)‖ follows from an argument based on a spectral norm bound (à la Furedi-Komlos).

Analysis
Lemma
For each sample u, if u ∈ T_r, then for all s ≠ r
  (u − µ̂_r)·(µ̂_s − µ̂_r) ≤ (2/5) ‖µ_r − µ_s‖^2
Write µ̂_r = µ_r + δ_r for every r. Then
  (u − µ̂_r)·(µ̂_s − µ̂_r) = (u − µ_r − δ_r)·(µ_s − µ_r + δ_s − δ_r)
    = (u − µ_r)·(µ_s − µ_r) − δ_r·(µ_s − µ_r) − δ_r·(δ_s − δ_r) + (u − µ_r)·(δ_s − δ_r)
(u − µ_r)·(µ_s − µ_r) is small by the separation assumption.
δ_r·(µ_s − µ_r) ≤ ‖δ_r‖ ‖µ_s − µ_r‖ by Cauchy-Schwarz, and is small since ‖δ_r‖ is small.
δ_r·(δ_s − δ_r) is similarly small.
Main challenge: bounding (u − µ_r)·(δ_s − δ_r).

Completing the proof
Claim
  (u − µ_r)·(δ_s − δ_r) < c_3 σ^2 (1 + n/m) log m
Proof idea:
  (u − µ_r)·δ_r = Σ_{i ∈ [n]} (u(i) − µ_r(i)) δ_r(i) = Σ_{i ∈ [n]} x(i)
This is a sum of zero-mean random variables x(i).
  E(x(i)^2) ≤ 2 δ_r(i)^2 σ^2, so Σ_i E(x(i)^2) ≤ 2 σ^2 ‖δ_r‖^2 ≤ c_3 k σ^4 (1 + n/m)
  |x(i)| ≤ |δ_r(i)| ≤ 2 c_4 σ^2, because the number of 1's in a column can be at most 1.1 m σ^2.

Completing the proof
So we have a sum of absolutely bounded, zero-mean, bounded-variance random variables. Can apply:
Bernstein's inequality
Let {X_i}_{i=1}^n be a collection of independent random variables with Pr{|X_i| ≤ M} = 1 for all i. Then, for any ε ≥ 0,
  Pr{ |Σ_{i=1}^n (X_i − E[X_i])| ≥ ε } ≤ exp( −ε^2 / (2(θ^2 + Mε/3)) )
where θ^2 = Σ_i E[X_i^2].
Plugging in our values,
  Pr{ |Σ_{i ∈ [n]} x(i)| ≥ c_3 σ^2 ((1 + n/m) + log m) } ≤ 1/m^3

Entrywise Bounds for Eigenvectors of Random Graphs

Well studied: l_2 norm bounds
Already saw: if A is the adjacency matrix of a G_{n,p} graph,
  ‖A − E(A)‖ ≤ 3√(np)
Lots of research on similar bounds.
  v = v_1(E(A)) = (1/√n)·1
Question
  u = v_1(A) = ?
Goal
Study ‖u − v‖_∞ = max_{i ∈ [n]} |u(i) − v(i)|
A potentially useful notion of spectral stability.

Can l_2 give l_∞? Not directly!
The spectral norm bound on ‖A − E(A)‖ can be converted to a bound on ‖u − v‖.
Best bound we can get:
  ‖u − v‖_∞ ≤ ‖u − v‖ ≤ 3/√(np)
Too weak! 1/√(np) is much larger than 1/√n.

Eigenvector of a Random Graph
Figure: G_{400,0.2}

Our result
Let A be the adjacency matrix of a G_{n,p} graph, and u = v_1(A). Then with probability 1 − o(1), for all i
  u(i) = (1/√n)(1 ± ε)
where ε = c_2 (log n / log np) √(log n / np), p ≥ log^6 n / n, and c_2 is a constant.
Essentially optimal.
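An empirical illustration of the statement (a sketch with one sample, not the thesis experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 0.2
upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
A = upper + upper.T                      # one sample of G_{400, 0.2}

vals, vecs = np.linalg.eigh(A)
u = vecs[:, -1]                          # eigenvector of the largest eigenvalue
u = u * np.sign(u.sum())                 # fix the sign so the entries are positive
rel = np.abs(u * np.sqrt(n) - 1)         # entrywise relative deviation from 1/sqrt(n)
print(rel.max())                         # small, consistent with u(i) = (1 ± eps)/sqrt(n)
```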

Proof
Only a few elementary properties are needed. Let Δ = 2√(log n / np). With high probability:
  e(i) = np(1 ± Δ) for all i ∈ [n]
  |e(A, B) − p|A||B|| ≤ 2√(np|A||B|) for all A, B
  λ_1(A) ≈ np

Normalize u = v_1(A) so that max_i u(i) = u(1) = 1.
  Au ≈ (np)u, so (Au)(1) ≈ (np)u(1), i.e. Σ_i A(1, i)u(i) = Σ_{i ∈ N(1)} u(i) ≈ np.

Claim: at least np/2 vertices of N(1) have u(i) ≥ 1/2.
If not,
  Σ_{i ∈ N(1)} u(i) ≤ (np/2)·1 + (np(1 + Δ) − np/2)·(1/2) = np(3/4 + Δ/2) < np,
contradicting Σ_{i ∈ N(1)} u(i) ≈ np.

Proof (contd.)
Idea: extend the argument to successive neighborhood sets.
We define a sequence of sets {S_t} for t = 1, ...:
  S_1 = {1}
  S_{t+1} = { i : i ∈ N(S_t) and u(i) ≥ 1/(c(t + 1)) }
How quickly does S_{t+1} grow?
Lemma
Let t* be the last index such that |S_{t*}| ≤ 2n/3. For all t ≤ t*,
  |S_{t+1}| ≥ (np / (9t^2)) |S_t|
Exponential increase!

Connection to Clustering
Experiments show that for our models, no clean-up is necessary at all.
Needed
Subtle entrywise bounds for the second (and smaller) eigenvectors in the planted model.
Figure: Second eigenvector of a graph with two clusters.

Connection to Clustering
Can show for models with stronger separation conditions:
Theorem
Assume σ^2 = Ω(1/n). Then the second eigenvector provides a clean clustering if
  ‖µ_r − µ_s‖^2 ≥ σ^{2/3} log n
Stronger than the standard assumption σ^2 log n.

Future Work
1 Clustering without clean-up
2 Clustering below the variance bound Ω(σ^2)
3 A Chernoff-type bound for the entrywise error? Algorithmic applications?

Thanks!